feat(memory): async LLM context summary injection on trim

- Unified flush + context injection into a single async LLM call (flush_from_messages accepts context_summary_callback) - Fixed response parsing bug: handle generator returns and Claude-format dicts from bot.call_with_tools, which previously caused all LLM summaries to silently fail (falling back to rule-based extraction) - Removed standalone context summary prompts and methods; reuse the existing [DAILY]/[MEMORY] summarization pipeline - Updated docs (zh/en/ja) to reflect the new injection behavior
2026-07-18 12:07:15 +08:00 · 2026-04-13 20:13:05 +08:00
parent da97e948ca
commit 33cf1bc4c3
9 changed files with 187 additions and 67 deletions
--- a/docs/en/memory/context.mdx
+++ b/docs/en/memory/context.mdx
@@ -39,14 +39,15 @@ When conversation turns exceed `agent_max_context_turns`:

 - The **oldest half** of complete turns is trimmed (preserving tool call chain integrity)
 - Trimmed messages are summarized by LLM and **written to the daily memory file**
- Remaining turns stay intact
+- Once the LLM summary is ready, it is also **injected into the first user message** of the retained context, helping the model maintain conversational continuity
+- Summary injection runs asynchronously in the background and takes effect from the next turn onward

 ### 3. Token Budget Trimming

 After turn trimming, if tokens still exceed the budget:

 - **Fewer than 5 turns**: All turns undergo **text compression** — each turn keeps only the first user text and last Agent reply, removing intermediate tool call chains
- **5 or more turns**: The **first half** of turns is trimmed again, with discarded content also written to memory
+- **5 or more turns**: The **first half** of turns is trimmed again, with discarded content written to memory and a context summary injected

 ### 4. Overflow Emergency Handling

--- a/docs/en/memory/index.mdx
+++ b/docs/en/memory/index.mdx
@@ -19,7 +19,7 @@ Stored in `~/cow/memory/` directory, named by date (e.g., `2026-03-08.md`), reco

 The Agent automatically persists conversation content to long-term memory through the following mechanisms:

- **On context trimming** — When conversation turns or tokens exceed the configured limit, the oldest half of the context is trimmed, and the discarded content is summarized by LLM into key information and written to the daily memory file
+- **On context trimming** — When conversation turns or tokens exceed the configured limit, the oldest half of the context is trimmed, and the discarded content is summarized by LLM into key information and written to the daily memory file. The summary is also asynchronously injected into the retained context for conversational continuity
 - **Daily scheduled summary** — A full summary is automatically triggered at 23:55 every day, ensuring memory is preserved even on low-activity days (skipped if content hasn't changed)
 - **On API context overflow** — When the model API returns a context overflow error, the current conversation summary is saved as an emergency measure

--- a/docs/ja/memory/context.mdx
+++ b/docs/ja/memory/context.mdx
@@ -39,14 +39,15 @@ description: 会話コンテキスト — メッセージ管理、圧縮戦略

 - **最も古い半分** の完全なターンがトリミングされます（ツール呼び出しチェーンの完全性を保証）
 - トリミングされたメッセージは LLM によって要約され、**日次記憶ファイルに書き込まれます**
- 残りのターンはそのまま保持されます
+- LLM 要約が完了すると、保持されたコンテキストの最初のユーザーメッセージの先頭に要約が**注入**され、モデルが会話の文脈を維持できるようにします
+- 要約注入はバックグラウンドで非同期に実行され、次のターンから有効になります

 ### 3. トークン予算のトリミング

 ターンのトリミング後、トークン数がまだ予算を超えている場合：

 - **5 ターン未満の場合**：すべてのターンで**テキスト圧縮**を実行 — 各ターンは最初のユーザーテキストと最後の Agent 返信のみを保持し、中間のツール呼び出しチェーンを削除
- **5 ターン以上の場合**：**前半のターン**を再度トリミングし、破棄されたコンテンツも記憶に書き込まれます
+- **5 ターン以上の場合**：**前半のターン**を再度トリミングし、破棄されたコンテンツも記憶に書き込まれ、コンテキスト要約も注入されます

 ### 4. オーバーフロー緊急処理

--- a/docs/ja/memory/index.mdx
+++ b/docs/ja/memory/index.mdx
@@ -19,7 +19,7 @@ description: CowAgent の長期記憶システム — ファイル永続化、

 Agent は以下のメカニズムにより、会話内容を長期記憶に自動的に永続化します：

- **コンテキストトリミング時** — 会話ターン数またはトークン数が設定上限を超えた場合、最も古い半分のコンテキストがトリミングされ、LLM によって要約されて日次記憶ファイルに書き込まれます
+- **コンテキストトリミング時** — 会話ターン数またはトークン数が設定上限を超えた場合、最も古い半分のコンテキストがトリミングされ、LLM によって要約されて日次記憶ファイルに書き込まれます。要約は保持されたコンテキストにも非同期で注入され、会話の連続性を維持します
 - **毎日のスケジュール要約** — 毎日 23:55 に自動的にフル要約がトリガーされ、アクティビティが少ない日でも記憶が保存されます（内容が変更されていない場合はスキップ）
 - **API コンテキストオーバーフロー時** — モデル API がコンテキストオーバーフローエラーを返した場合、緊急措置として現在の会話要約が保存されます

--- a/docs/memory/context.mdx
+++ b/docs/memory/context.mdx
@@ -39,14 +39,15 @@ description: 对话上下文 — 消息管理、压缩策略和上下文操作

 - 裁剪 **最早一半** 的完整轮次（保证工具调用链的完整性）
 - 被裁剪的消息会通过 LLM 总结后**写入当天的日级记忆文件**
- 剩余轮次保持不变
+- LLM 摘要完成后，同时将摘要**注入到保留消息的第一条用户消息开头**，帮助模型在后续对话中保持上下文连贯性
+- 摘要注入在后台异步完成，不阻塞当前回复；注入的摘要在下一轮对话时生效

 ### 3. Token 预算裁剪

 裁剪轮次后，如果 token 数仍超出预算：

 - **轮次 < 5 时**：对所有轮次进行**文本压缩** — 每轮只保留第一条用户文本和最后一条 Agent 回复，去掉中间的工具调用链
- **轮次 ≥ 5 时**：再次裁剪**前半轮次**，被丢弃内容同样写入记忆
+- **轮次 ≥ 5 时**：再次裁剪**前半轮次**，被丢弃内容同样写入记忆并注入上下文摘要

 ### 4. 溢出应急处理

--- a/docs/memory/index.mdx
+++ b/docs/memory/index.mdx
@@ -19,7 +19,7 @@ description: CowAgent 的长期记忆系统 — 文件持久化、自动写入

 Agent 通过以下机制自动将对话内容持久化为长期记忆：

- **上下文裁剪时** — 当对话轮次或 token 超出配置上限时，裁剪最早一半的上下文，使用 LLM 将被裁剪的内容总结为关键信息写入当天记忆文件
+- **上下文裁剪时** — 当对话轮次或 token 超出配置上限时，裁剪最早一半的上下文，使用 LLM 将被裁剪的内容总结为关键信息写入当天记忆文件，并将摘要异步注入到保留的上下文中，帮助模型保持对话连贯性
 - **每日定时总结** — 每天 23:55 自动触发一次全量总结，防止低活跃日无记忆留存（内容无变化时自动跳过）
 - **API 上下文溢出时** — 当模型 API 返回上下文溢出错误时，紧急保存当前对话摘要