feat(memory): async LLM context summary injection on trim

- Unified flush + context injection into a single async LLM call (flush_from_messages accepts context_summary_callback)
- Fixed response parsing bug: handle generator returns and Claude-format dicts from bot.call_with_tools, which previously caused all LLM summaries to silently fail (falling back to rule-based extraction)
- Removed standalone context summary prompts and methods; reuse the existing [DAILY]/[MEMORY] summarization pipeline
- Updated docs (zh/en/ja) to reflect the new injection behavior
2026-07-17 11:07:11 +08:00 · 2026-04-13 20:13:05 +08:00
parent da97e948ca
commit 33cf1bc4c3
9 changed files with 187 additions and 67 deletions
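Reviewer note: the central idea of this commit, one background summarization call that feeds both disk persistence and in-context injection, can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the repository's code: `fake_summarize`, `flush_async`, and the `persist` callable are hypothetical names, and the `[DAILY]`/`[MEMORY]` split shown here is an assumed parsing rule (the real parsing is not part of this diff).

```python
import logging
import threading
from typing import Callable, List, Optional

logger = logging.getLogger("memory_flush_sketch")


def fake_summarize(messages: List[str]) -> str:
    """Hypothetical stand-in for the LLM call; returns [DAILY]/[MEMORY] sections."""
    return f"[DAILY]\n{len(messages)} messages discussed\n[MEMORY]\nnothing durable"


def flush_async(
    messages: List[str],
    persist: Callable[[str], None],
    on_daily: Optional[Callable[[str], None]] = None,
) -> threading.Thread:
    """Summarize once in a background thread, then persist and optionally inject."""
    snapshot = list(messages)  # copy so later edits by the caller cannot race the worker

    def worker() -> None:
        summary = fake_summarize(snapshot)
        # Assumed parsing: everything before [MEMORY] is the daily part.
        daily = summary.split("[MEMORY]")[0].replace("[DAILY]", "").strip()
        persist(daily)
        if on_daily and daily:
            try:
                on_daily(daily)
            except Exception as exc:  # a failing callback must not undo the write
                logger.warning("context summary callback failed: %s", exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

A caller would pass something like `written.append` as `persist` and an injection function as `on_daily`. Because the worker runs on a daemon thread, the injected text only becomes visible once the summary returns, which matches the "takes effect from the next turn onward" wording in the updated docs.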
--- a/agent/memory/manager.py
+++ b/agent/memory/manager.py
@@ -401,24 +401,28 @@ class MemoryManager:
        user_id: Optional[str] = None,
        reason: str = "threshold",
        max_messages: int = 10,
+        context_summary_callback=None,
    ) -> bool:
        """
        Flush conversation summary to daily memory file.
-        
+
        Args:
            messages: Conversation message list
            user_id: Optional user ID
            reason: "threshold" | "overflow" | "daily_summary"
            max_messages: Max recent messages to include (0 = all)
-        
+            context_summary_callback: Optional callback(str) invoked with the
+                daily summary text for in-context injection
+
        Returns:
-            True if content was written
+            True if flush was dispatched
        """
        success = self.flush_manager.flush_from_messages(
            messages=messages,
            user_id=user_id,
            reason=reason,
            max_messages=max_messages,
+            context_summary_callback=context_summary_callback,
        )
        if success:
            self._dirty = True
--- a/agent/memory/summarizer.py
+++ b/agent/memory/summarizer.py
@@ -29,9 +29,9 @@ SUMMARIZE_SYSTEM_PROMPT = """你是一个记忆提取助手。你的任务是从
 ## 第二部分：长期记忆（[MEMORY]）

 提取值得**永久记住**的关键信息，这些信息在未来的对话中仍然有价值：
-- 用户的偏好、习惯、风格（如"用户偏好中文回复"、"用户喜欢简洁风格"）
-- 重要的决策或约定（如"项目决定使用 PostgreSQL"）
-- 关键人物信息（如"张总是用户的上级"）
+- 用户的偏好、习惯、风格
+- 重要的决策或约定
+- 关键人物关系
 - 用户明确要求记住的内容
 - 重要的教训或经验总结

@@ -56,6 +56,7 @@ SUMMARIZE_USER_PROMPT = """请从以下对话记录中提取记忆（按 [DAILY]
 {conversation}"""


+
 class MemoryFlushManager:
    """
    Manages memory flush operations.
@@ -124,21 +125,19 @@ class MemoryFlushManager:
        user_id: Optional[str] = None,
        reason: str = "trim",
        max_messages: int = 0,
+        context_summary_callback: Optional[Callable[[str], None]] = None,
    ) -> bool:
        """
        Asynchronously summarize and flush messages to daily memory.
-        
+
        Deduplication runs synchronously, then LLM summarization + file write
        run in a background thread so the main reply flow is never blocked.
-        
-        Args:
-            messages: Conversation message list (OpenAI/Claude format)
-            user_id: Optional user ID for user-scoped memory
-            reason: Why flush was triggered ("trim" | "overflow" | "daily_summary")
-            max_messages: Max recent messages to summarize (0 = all)
-        
-        Returns:
-            True if flush was dispatched
+
+        If *context_summary_callback* is provided, it is called with the
+        [DAILY] portion of the LLM summary once available. The caller can use
+        this to inject the summary into the live message list for context
+        continuity — one LLM call serves both disk persistence and in-context
+        injection.
        """
        try:
            import hashlib
@@ -153,18 +152,18 @@ class MemoryFlushManager:
                    deduped.append(m)
            if not deduped:
                return False
-            
+
            import copy
            snapshot = copy.deepcopy(deduped)
            thread = threading.Thread(
                target=self._flush_worker,
-                args=(snapshot, user_id, reason, max_messages),
+                args=(snapshot, user_id, reason, max_messages, context_summary_callback),
                daemon=True,
            )
            thread.start()
            logger.info(f"[MemoryFlush] Async flush dispatched (reason={reason}, msgs={len(snapshot)})")
            return True
-            
+
        except Exception as e:
            logger.warning(f"[MemoryFlush] Failed to dispatch flush (reason={reason}): {e}")
            return False
@@ -175,6 +174,7 @@ class MemoryFlushManager:
        user_id: Optional[str],
        reason: str,
        max_messages: int,
+        context_summary_callback: Optional[Callable[[str], None]] = None,
    ):
        """Background worker: summarize with LLM, write daily file + MEMORY.md (Light Dream)."""
        try:
@@ -213,6 +213,13 @@ class MemoryFlushManager:
            if memory_part:
                self._append_to_main_memory(memory_part, user_id)

+            # --- Inject context summary into live messages (if callback provided) ---
+            if context_summary_callback and daily_part:
+                try:
+                    context_summary_callback(daily_part)
+                except Exception as e:
+                    logger.warning(f"[MemoryFlush] Context summary callback failed: {e}")
+
            self.last_flush_timestamp = datetime.now()

        except Exception as e:
@@ -346,6 +353,52 @@ class MemoryFlushManager:
                lines.append(f"助手: {text[:500]}")
        return "\n".join(lines)

+    @staticmethod
+    def _extract_response_text(response) -> str:
+        """
+        Extract text from LLM response regardless of format.
+
+        Handles:
+        - Generator (MiniMax _handle_sync_response yields Claude-format dicts)
+        - Claude format: {"role":"assistant","content":[{"type":"text","text":"..."}]}
+        - OpenAI format: {"choices":[{"message":{"content":"..."}}]}
+        - OpenAI SDK response object with .choices attribute
+        """
+        import types
+
+        # Unwrap generator — consume first yielded item
+        if isinstance(response, types.GeneratorType):
+            try:
+                response = next(response)
+            except StopIteration:
+                return ""
+
+        if not response:
+            return ""
+
+        if isinstance(response, dict):
+            # Check for error
+            if response.get("error"):
+                raise RuntimeError(response.get("message", "LLM call failed"))
+
+            # Claude format: content is a list of blocks
+            content = response.get("content")
+            if isinstance(content, list):
+                for block in content:
+                    if isinstance(block, dict) and block.get("type") == "text":
+                        return block.get("text", "")
+
+            # OpenAI format
+            choices = response.get("choices", [])
+            if choices:
+                return choices[0].get("message", {}).get("content", "")
+
+        # OpenAI SDK response object
+        if hasattr(response, "choices") and response.choices:
+            return response.choices[0].message.content or ""
+
+        return ""
+
    def _call_llm_for_summary(self, conversation_text: str) -> str:
        """Call LLM to generate a concise summary of the conversation."""
        from agent.protocol.models import LLMRequest
@@ -359,27 +412,31 @@ class MemoryFlushManager:
        )
        
        response = self.llm_model.call(request)
-        
-        if isinstance(response, dict):
-            if response.get("error"):
-                raise RuntimeError(response.get("message", "LLM call failed"))
-            # OpenAI format
-            choices = response.get("choices", [])
-            if choices:
-                return choices[0].get("message", {}).get("content", "")
-        
-        # Handle response object with attribute access (e.g. OpenAI SDK response)
-        if hasattr(response, "choices") and response.choices:
-            return response.choices[0].message.content or ""
-        
-        return ""
+        return self._extract_response_text(response)
+
+    @staticmethod
+    def _extract_first_meaningful_line(text: str, max_len: int = 120) -> str:
+        """Extract the first meaningful line from assistant reply, skipping markdown noise."""
+        import re
+        for line in text.split("\n"):
+            line = line.strip()
+            if not line:
+                continue
+            # Skip markdown headings, horizontal rules, code fences, pure emoji/symbols
+            if re.match(r'^(#{1,4}\s|```|---|\*\*\*|[-*]\s*$|[^\w\u4e00-\u9fff]{1,5}$)', line):
+                continue
+            # Strip leading markdown bold/emoji decorations
+            cleaned = re.sub(r'^[\*#>\-\s]+', '', line).strip()
+            cleaned = re.sub(r'^[\U0001f300-\U0001f9ff\u2600-\u27bf\s]+', '', cleaned).strip()
+            if len(cleaned) >= 5:
+                return cleaned[:max_len]
+        return text.split("\n")[0].strip()[:max_len]

    @staticmethod
    def _extract_summary_fallback(messages: List[Dict], max_messages: int = 0) -> str:
        """
-        Rule-based fallback when LLM is unavailable.
-        Groups consecutive user+assistant messages into events instead of
-        listing each message individually.
+        Rule-based summary of discarded messages.
+        Format: "- 用户: X → 回复: Y" per event, compact and readable.
        """
        msgs = messages if max_messages == 0 else messages[-max_messages * 2:]

@@ -393,19 +450,19 @@ class MemoryFlushManager:
            text = text.strip()

            if role == "user":
-                if len(text) <= 5:
+                if len(text) <= 3:
                    continue
-                current_user_text = text[:150]
+                current_user_text = text[:120]
            elif role == "assistant" and current_user_text:
-                first_line = text.split("\n")[0].strip()
-                if len(first_line) > 10:
-                    events.append(f"- {current_user_text} → {first_line[:150]}")
+                reply_summary = MemoryFlushManager._extract_first_meaningful_line(text)
+                if reply_summary:
+                    events.append(f"- 用户: {current_user_text} → 回复: {reply_summary}")
                else:
-                    events.append(f"- {current_user_text}")
+                    events.append(f"- 用户: {current_user_text}")
                current_user_text = ""

        if current_user_text:
-            events.append(f"- {current_user_text}")
+            events.append(f"- 用户: {current_user_text}")

        return "\n".join(events[:10])
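Reviewer note: to make the parsing fix in `_extract_response_text` concrete, here is a condensed, standalone restatement that can double as a regression check. It is a sketch only: `extract_text` and the sample payloads are hypothetical, and it deliberately omits the error-dict and SDK-object branches shown in the diff. The removed code only looked for `choices`, so a generator or a Claude-format dict fell through to the final `return ""`.

```python
import types
from typing import Any


def extract_text(response: Any) -> str:
    """Condensed sketch of the normalization: generator, Claude dict, OpenAI dict."""
    # A generator is unwrapped by taking its first yielded item (None if empty).
    if isinstance(response, types.GeneratorType):
        response = next(response, None)
    if not response:
        return ""
    if isinstance(response, dict):
        # Claude format: content is a list of typed blocks.
        content = response.get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("type") == "text":
                    return block.get("text", "")
        # OpenAI format: choices[0].message.content (may be None for tool calls).
        choices = response.get("choices") or []
        if choices:
            return choices[0].get("message", {}).get("content") or ""
    return ""


def claude_style_generator():
    """Sample payload shaped like the generator return described in the commit."""
    yield {"role": "assistant", "content": [{"type": "text", "text": "[DAILY]\n- fixed bug"}]}


openai_style = {"choices": [{"message": {"content": "hello"}}]}
```

Feeding `claude_style_generator()` to the pre-commit logic would have produced an empty string, which is consistent with the silent fallback to rule-based extraction described in the commit message.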
    
--- a/agent/protocol/agent_stream.py
+++ b/agent/protocol/agent_stream.py
@@ -1207,6 +1207,56 @@ class AgentStreamExecutor:
        logger.warning("🔧 Aggressive trim: nothing to trim, will clear history")
        return False

+    def _build_context_summary_callback(self, discarded_turns: list, kept_turns: list):
+        """
+        Build a callback that injects an LLM summary into the first user
+        message of *kept_turns*. Returns None if no valid injection target.
+
+        The callback is passed to flush_from_messages so that the same LLM
+        call that writes daily memory also provides the in-context summary.
+        """
+        if not kept_turns:
+            return None
+
+        # Find the first user text block in kept_turns as injection target
+        target_block = None
+        for turn in kept_turns:
+            for msg in turn["messages"]:
+                if msg.get("role") == "user":
+                    content = msg.get("content", [])
+                    if isinstance(content, list):
+                        for block in content:
+                            if isinstance(block, dict) and block.get("type") == "text":
+                                target_block = block
+                                break
+                    if target_block:
+                        break
+            if target_block:
+                break
+
+        if not target_block:
+            return None
+
+        turn_count = len(discarded_turns)
+        original_text = target_block["text"]
+
+        def _on_summary_ready(summary: str):
+            if not summary or not summary.strip():
+                return
+            target_block["text"] = (
+                f"[System: Previous conversation summary — "
+                f"{turn_count} turns were compacted]\n\n"
+                f"{summary.strip()}\n\n"
+                f"The recent conversation continues below.\n\n---\n\n"
+                f"{original_text}"
+            )
+            logger.info(
+                f"📝 Context summary injected "
+                f"({len(summary)} chars, {turn_count} turns)"
+            )
+
+        return _on_summary_ready
+
    def _trim_messages(self):
        """
        智能清理消息历史，保持对话完整性
@@ -1233,25 +1283,28 @@ class AgentStreamExecutor:
            removed_count = len(turns) // 2
            keep_count = len(turns) - removed_count
            
-            # Flush discarded turns to daily memory
-            if self.agent.memory_manager:
-                discarded_messages = []
-                for turn in turns[:removed_count]:
-                    discarded_messages.extend(turn["messages"])
-                if discarded_messages:
-                    user_id = getattr(self.agent, '_current_user_id', None)
-                    self.agent.memory_manager.flush_memory(
-                        messages=discarded_messages, user_id=user_id,
-                        reason="trim", max_messages=0
-                    )
-            
+            discarded_turns = turns[:removed_count]
            turns = turns[-keep_count:]
-            
+
            logger.info(
                f"💾 上下文轮次超限: {keep_count + removed_count} > {self.max_context_turns}，"
                f"裁剪至 {keep_count} 轮（移除 {removed_count} 轮）"
            )

+            # Flush to daily memory + inject context summary (single async LLM call)
+            if self.agent.memory_manager:
+                discarded_messages = []
+                for turn in discarded_turns:
+                    discarded_messages.extend(turn["messages"])
+                if discarded_messages:
+                    user_id = getattr(self.agent, '_current_user_id', None)
+                    cb = self._build_context_summary_callback(discarded_turns, turns)
+                    self.agent.memory_manager.flush_memory(
+                        messages=discarded_messages, user_id=user_id,
+                        reason="trim", max_messages=0,
+                        context_summary_callback=cb,
+                    )
+
        # Step 3: Token 限制 - 保留完整轮次
        # Get context window from agent (based on model)
        context_window = self.agent._get_model_context_window()
@@ -1327,6 +1380,7 @@ class AgentStreamExecutor:
        # --- Many turns (>=5): discard the older half, keep the newer half ---
        removed_count = len(turns) // 2
        keep_count = len(turns) - removed_count
+        discarded_turns = turns[:removed_count]
        kept_turns = turns[-keep_count:]
        kept_tokens = sum(self._estimate_turn_tokens(t) for t in kept_turns)

@@ -1337,13 +1391,15 @@ class AgentStreamExecutor:

        if self.agent.memory_manager:
            discarded_messages = []
-            for turn in turns[:removed_count]:
+            for turn in discarded_turns:
                discarded_messages.extend(turn["messages"])
            if discarded_messages:
                user_id = getattr(self.agent, '_current_user_id', None)
+                cb = self._build_context_summary_callback(discarded_turns, kept_turns)
                self.agent.memory_manager.flush_memory(
                    messages=discarded_messages, user_id=user_id,
-                    reason="trim", max_messages=0
+                    reason="trim", max_messages=0,
+                    context_summary_callback=cb,
                )

        new_messages = []
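Reviewer note: the injection callback works by mutating a captured text block in place, so it only reaches the model if the live message list still holds the same dict objects as `kept_turns`. The sketch below illustrates that dependency with made-up data. `first_user_text_block` and `inject` are hypothetical helpers that mirror the lookup in `_build_context_summary_callback`, and the header text is shortened for illustration.

```python
import copy
from typing import Dict, List, Optional


def first_user_text_block(turns: List[Dict]) -> Optional[Dict]:
    """Return the first text block found in a user message with list content."""
    for turn in turns:
        for msg in turn["messages"]:
            if msg.get("role") != "user":
                continue
            content = msg.get("content")
            if isinstance(content, list):
                for block in content:
                    if isinstance(block, dict) and block.get("type") == "text":
                        return block
    return None


def inject(block: Dict, summary: str) -> None:
    """Prepend the summary to the block's text in place."""
    block["text"] = f"[summary]\n{summary.strip()}\n\n---\n\n{block['text']}"


kept_turns = [{"messages": [
    {"role": "user", "content": [{"type": "text", "text": "What about step two?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Step two is ..."}]},
]}]

shared = [m for t in kept_turns for m in t["messages"]]  # same dict objects
copied = copy.deepcopy(shared)                           # independent objects

inject(first_user_text_block(kept_turns), "- user asked about step one")
```

After the call, `shared` sees the summary while `copied` does not, so any later deep copy of the kept messages before sending would drop the injection. Note also that user messages whose `content` is a plain string are skipped by the lookup, in which case no callback is built at all.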
--- a/docs/en/memory/context.mdx
+++ b/docs/en/memory/context.mdx
@@ -39,14 +39,15 @@ When conversation turns exceed `agent_max_context_turns`:

 - The **oldest half** of complete turns is trimmed (preserving tool call chain integrity)
 - Trimmed messages are summarized by LLM and **written to the daily memory file**
-- Remaining turns stay intact
+- Once the LLM summary is ready, it is also **injected into the first user message** of the retained context, helping the model maintain conversational continuity
+- Summary injection runs asynchronously in the background and takes effect from the next turn onward

 ### 3. Token Budget Trimming

 After turn trimming, if tokens still exceed the budget:

 - **Fewer than 5 turns**: All turns undergo **text compression** — each turn keeps only the first user text and last Agent reply, removing intermediate tool call chains
-- **5 or more turns**: The **first half** of turns is trimmed again, with discarded content also written to memory
+- **5 or more turns**: The **first half** of turns is trimmed again, with discarded content written to memory and a context summary injected

 ### 4. Overflow Emergency Handling

--- a/docs/en/memory/index.mdx
+++ b/docs/en/memory/index.mdx
@@ -19,7 +19,7 @@ Stored in `~/cow/memory/` directory, named by date (e.g., `2026-03-08.md`), reco

 The Agent automatically persists conversation content to long-term memory through the following mechanisms:

-- **On context trimming** — When conversation turns or tokens exceed the configured limit, the oldest half of the context is trimmed, and the discarded content is summarized by LLM into key information and written to the daily memory file
+- **On context trimming** — When conversation turns or tokens exceed the configured limit, the oldest half of the context is trimmed, and the discarded content is summarized by LLM into key information and written to the daily memory file. The summary is also asynchronously injected into the retained context for conversational continuity
 - **Daily scheduled summary** — A full summary is automatically triggered at 23:55 every day, ensuring memory is preserved even on low-activity days (skipped if content hasn't changed)
 - **On API context overflow** — When the model API returns a context overflow error, the current conversation summary is saved as an emergency measure

--- a/docs/ja/memory/context.mdx
+++ b/docs/ja/memory/context.mdx
@@ -39,14 +39,15 @@ description: 会話コンテキスト — メッセージ管理、圧縮戦略

 - **最も古い半分** の完全なターンがトリミングされます（ツール呼び出しチェーンの完全性を保証）
 - トリミングされたメッセージは LLM によって要約され、**日次記憶ファイルに書き込まれます**
-- 残りのターンはそのまま保持されます
+- LLM 要約が完了すると、保持されたコンテキストの最初のユーザーメッセージの先頭に要約が**注入**され、モデルが会話の文脈を維持できるようにします
+- 要約注入はバックグラウンドで非同期に実行され、次のターンから有効になります

 ### 3. トークン予算のトリミング

 ターンのトリミング後、トークン数がまだ予算を超えている場合：

 - **5 ターン未満の場合**：すべてのターンで**テキスト圧縮**を実行 — 各ターンは最初のユーザーテキストと最後の Agent 返信のみを保持し、中間のツール呼び出しチェーンを削除
-- **5 ターン以上の場合**：**前半のターン**を再度トリミングし、破棄されたコンテンツも記憶に書き込まれます
+- **5 ターン以上の場合**：**前半のターン**を再度トリミングし、破棄されたコンテンツも記憶に書き込まれ、コンテキスト要約も注入されます

 ### 4. オーバーフロー緊急処理

--- a/docs/ja/memory/index.mdx
+++ b/docs/ja/memory/index.mdx
@@ -19,7 +19,7 @@ description: CowAgent の長期記憶システム — ファイル永続化、

 Agent は以下のメカニズムにより、会話内容を長期記憶に自動的に永続化します：

-- **コンテキストトリミング時** — 会話ターン数またはトークン数が設定上限を超えた場合、最も古い半分のコンテキストがトリミングされ、LLM によって要約されて日次記憶ファイルに書き込まれます
+- **コンテキストトリミング時** — 会話ターン数またはトークン数が設定上限を超えた場合、最も古い半分のコンテキストがトリミングされ、LLM によって要約されて日次記憶ファイルに書き込まれます。要約は保持されたコンテキストにも非同期で注入され、会話の連続性を維持します
 - **毎日のスケジュール要約** — 毎日 23:55 に自動的にフル要約がトリガーされ、アクティビティが少ない日でも記憶が保存されます（内容が変更されていない場合はスキップ）
 - **API コンテキストオーバーフロー時** — モデル API がコンテキストオーバーフローエラーを返した場合、緊急措置として現在の会話要約が保存されます

--- a/docs/memory/context.mdx
+++ b/docs/memory/context.mdx
@@ -39,14 +39,15 @@ description: 对话上下文 — 消息管理、压缩策略和上下文操作

 - 裁剪 **最早一半** 的完整轮次（保证工具调用链的完整性）
 - 被裁剪的消息会通过 LLM 总结后**写入当天的日级记忆文件**
-- 剩余轮次保持不变
+- LLM 摘要完成后，同时将摘要**注入到保留消息的第一条用户消息开头**，帮助模型在后续对话中保持上下文连贯性
+- 摘要注入在后台异步完成，不阻塞当前回复；注入的摘要在下一轮对话时生效

 ### 3. Token 预算裁剪

 裁剪轮次后，如果 token 数仍超出预算：

 - **轮次 < 5 时**：对所有轮次进行**文本压缩** — 每轮只保留第一条用户文本和最后一条 Agent 回复，去掉中间的工具调用链
-- **轮次 ≥ 5 时**：再次裁剪**前半轮次**，被丢弃内容同样写入记忆
+- **轮次 ≥ 5 时**：再次裁剪**前半轮次**，被丢弃内容同样写入记忆并注入上下文摘要

 ### 4. 溢出应急处理

--- a/docs/memory/index.mdx
+++ b/docs/memory/index.mdx
@@ -19,7 +19,7 @@ description: CowAgent 的长期记忆系统 — 文件持久化、自动写入

 Agent 通过以下机制自动将对话内容持久化为长期记忆：

-- **上下文裁剪时** — 当对话轮次或 token 超出配置上限时，裁剪最早一半的上下文，使用 LLM 将被裁剪的内容总结为关键信息写入当天记忆文件
+- **上下文裁剪时** — 当对话轮次或 token 超出配置上限时，裁剪最早一半的上下文，使用 LLM 将被裁剪的内容总结为关键信息写入当天记忆文件，并将摘要异步注入到保留的上下文中，帮助模型保持对话连贯性
 - **每日定时总结** — 每天 23:55 自动触发一次全量总结，防止低活跃日无记忆留存（内容无变化时自动跳过）
 - **API 上下文溢出时** — 当模型 API 返回上下文溢出错误时，紧急保存当前对话摘要