
Feature suggestion: incremental session memory for zero-cost context compression #1691

@ascl1u

Description


What feature would you like to see?

Problem

Currently, kimi-cli's /compact requires a full LLM call to summarize the conversation at trigger time. In long sessions this call is expensive, and sometimes the context is so large that even the summarization call fails.

Proposal

During the session, periodically extract key information in the background into a structured markdown file (session title, current state, important files, workflow, errors, etc.). When context pressure triggers compaction, use this pre-built memory file as the summary, prune old messages, and keep a tail of recent ones.

LLM cost at compaction time: zero. The summary has already been built incrementally during the session.
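
For concreteness, the memory file could be a small structured document along these lines. The section names below are illustrative only, mirroring the fields listed above; they are not a proposed fixed schema:

```markdown
# <session title>

## Current state
- ...

## Important files
- ...

## Workflow
- ...

## Errors encountered
- ...
```

Because the file is plain markdown, it can be injected verbatim as the summary message at compaction time with no further processing.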

Implementation Approach

Register a post-sampling hook that triggers extraction when thresholds are met (token growth + tool call count)

Use a lightweight forked LLM call to update the memory file without blocking the main session
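
A minimal sketch of how the hook and background extraction might fit together. All names here are hypothetical (kimi-cli's actual hook API may differ), the thresholds are placeholder values, and I've assumed both thresholds must be exceeded before an extraction fires:

```python
import threading

# Illustrative thresholds; real values would be tuned or configurable.
TOKEN_GROWTH_THRESHOLD = 20_000
TOOL_CALL_THRESHOLD = 10


class MemoryExtractor:
    """Tracks session growth and spawns a background extraction
    when both thresholds are exceeded (hypothetical sketch)."""

    def __init__(self):
        self.tokens_since_extract = 0
        self.tool_calls_since_extract = 0
        self._lock = threading.Lock()

    def on_post_sampling(self, new_tokens: int, tool_calls: int, messages: list):
        """Post-sampling hook: accumulate counters, trigger extraction."""
        with self._lock:
            self.tokens_since_extract += new_tokens
            self.tool_calls_since_extract += tool_calls
            if (self.tokens_since_extract >= TOKEN_GROWTH_THRESHOLD
                    and self.tool_calls_since_extract >= TOOL_CALL_THRESHOLD):
                self.tokens_since_extract = 0
                self.tool_calls_since_extract = 0
                # Fire-and-forget so the main session is never blocked.
                threading.Thread(
                    target=self._extract, args=(list(messages),), daemon=True
                ).start()

    def _extract(self, messages: list):
        # Placeholder for the forked LLM call that rewrites the memory file.
        pass
```

The snapshot copy of `messages` keeps the forked call isolated from the live conversation while it runs.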

At compaction time, read the memory file, calculate which messages to keep, insert memory content as summary

The pruning algorithm must preserve tool_use/tool_result pairing invariants
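
The compaction step and the pairing invariant can be sketched as follows. The message shape (`{"role": ..., "type": ...}`) and function names are illustrative assumptions, not an existing kimi-cli API; the key point is walking the cut point backwards so a kept tool_result is never separated from its tool_use:

```python
def prune_messages(messages: list[dict], keep_tail: int) -> list[dict]:
    """Keep the last `keep_tail` messages, then move the cut point
    backwards so no tool_result survives without its tool_use."""
    if keep_tail >= len(messages):
        return messages
    cut = len(messages) - keep_tail
    # A tool_result at the boundary would orphan the tool_use that
    # precedes it, so pull the cut back to include the whole pair.
    while cut > 0 and messages[cut].get("type") == "tool_result":
        cut -= 1
    return messages[cut:]


def compact(messages: list[dict], memory_markdown: str, keep_tail: int = 10) -> list[dict]:
    """Zero-LLM-cost compaction: prepend the pre-built memory file
    as the summary, then keep only a pruned tail of recent messages."""
    summary_msg = {
        "role": "user",
        "type": "text",
        "content": f"[Session memory]\n{memory_markdown}",
    }
    return [summary_msg] + prune_messages(messages, keep_tail)
```

Note that `compact` makes no LLM call at all; the only cost paid at trigger time is reading the memory file and slicing the message list.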

Reference

I recently published a detailed analysis of Claude Code's context compaction system. This proposal is based on their Session Memory Compact design: [article link]

I'm willing to implement this

Per CONTRIBUTING.md, opening this issue for discussion first. Happy to submit a PR if the direction fits.

Additional information

No response


