Bug: extractMinMessages=2 + broken autoCaptureSeenTextCount accumulation logic → every single-turn conversation falls into regex fallback and pollutes the whole store with dirty data #417

@lintianyuan666

Plugin Version

1.1.0

OpenClaw Version

2026.3.28

Bug Description

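With smartExtraction: true and the usual extractMinMessages: 2, almost every auto-captured conversation falls into the regex fallback and is written as raw text, so every memory in the store is dirty data (l0_abstract == text, no LLM distillation).
The code has two accumulation paths that try to reach extractMinMessages, and both fail: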

───

Path A: autoCaptureSeenTextCount diffing (broken)

// index.ts line 2169-2176
const previousSeenCount = autoCaptureSeenTextCount.get(sessionKey) ?? 0;
let newTexts = eligibleTexts; // ← on each agent_end, eligibleTexts is the message count of the current event, not the historical cumulative total
if (pendingIngressTexts.length > 0) {
newTexts = pendingIngressTexts;
} else if (previousSeenCount > 0 && eligibleTexts.length > previousSeenCount) {
newTexts = eligibleTexts.slice(previousSeenCount); // ← never triggers, because eligibleTexts.length and previousSeenCount are both 1
}
autoCaptureSeenTextCount.set(sessionKey, eligibleTexts.length); // ← overwritten with 1 every time, so the diffing has no effect

In the single-turn DM scenario:

• Event 1: eligibleTexts=1, previousSeenCount=0 → newTexts=1 → smart extraction skipped (needs ≥2)
• Event 2: eligibleTexts=1, previousSeenCount=1 → 1 > 1 is false → newTexts=1 → skipped again

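Log evidence:

08:44:28 smart-extractor: extracted 3 candidates ← historical accumulation worked once (across sessions or in some specific pattern)
08:46:41 regex fallback found 1 capturable text(s) ← everything after that goes through regex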
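
The Path A behaviour can be replayed in isolation. The sketch below copies the diffing logic quoted from index.ts into a standalone function and feeds it three single-message agent_end events. The function name `selectNewTexts`, the session key, and the message strings are hypothetical and exist only for this illustration; the assumption is that each agent_end sees exactly one eligible text and that the pending queue is empty.

```typescript
// Standalone replay of the Path A diffing logic quoted above.
// `selectNewTexts` is a hypothetical wrapper, not a function in the plugin.
const autoCaptureSeenTextCount = new Map<string, number>();

function selectNewTexts(
  sessionKey: string,
  eligibleTexts: string[],
  pendingIngressTexts: string[] = [],
): string[] {
  const previousSeenCount = autoCaptureSeenTextCount.get(sessionKey) ?? 0;
  let newTexts = eligibleTexts;
  if (pendingIngressTexts.length > 0) {
    newTexts = pendingIngressTexts;
  } else if (previousSeenCount > 0 && eligibleTexts.length > previousSeenCount) {
    newTexts = eligibleTexts.slice(previousSeenCount);
  }
  // Stores the size of the current event, not a running total.
  autoCaptureSeenTextCount.set(sessionKey, eligibleTexts.length);
  return newTexts;
}

const extractMinMessages = 2;
const demoSessionKey = "agent:main:demo"; // hypothetical key

// Assumption: every agent_end carries exactly one eligible text.
const newTextCounts: number[] = [];
for (const text of ["first DM", "second DM", "third DM"]) {
  newTextCounts.push(selectNewTexts(demoSessionKey, [text]).length);
}

// The threshold is never reached, no matter how many events arrive.
const smartExtractionEverRuns = newTextCounts.some((n) => n >= extractMinMessages);
console.log(newTextCounts, smartExtractionEverRuns);
```

Under these assumptions every event yields a single new text, so the ≥2 check fails each time and the regex fallback is the only path taken.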

───

Path B: pendingIngressTexts cross-message accumulation (fails on cold start)

// message_received hook: accumulation entry point
const conversationKey = buildAutoCaptureConversationKeyFromIngress(channelId, conversationId);
queue.push(normalized); // ← ingress message sent by the user

// agent_end hook: consumption exit
const conversationKey = buildAutoCaptureConversationKeyFromSessionKey(sessionKey); // ← format: "agent:::"
const pendingIngressTexts = autoCapturePendingIngressTexts.get(conversationKey) ?? [];

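Problem: when pendingIngressTexts.length > 0, the pending queue replaces the current texts, but this can only help when previousSeenCount > 0 (otherwise the pending queue only ever contains the single ingress message that has just arrived).

Moreover, the pending queue is only "considered" when previousSeenCount > 0 && eligibleTexts.length > previousSeenCount. The first conversation never has a previousSeenCount, so it always uses eligibleTexts and never reaches 2.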
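
A minimal sketch of why the pending queue does not help in single-turn use. It assumes, without confirmation from the plugin source, that agent_end drains the queue after reading it and that both hooks resolve to the same conversation key. The handler names `onMessageReceived` and `onAgentEnd` and the key string are hypothetical.

```typescript
// Hypothetical model of the Path B queue. The drain in onAgentEnd is an assumption.
const autoCapturePendingIngressTexts = new Map<string, string[]>();

function onMessageReceived(conversationKey: string, normalized: string): void {
  const queue = autoCapturePendingIngressTexts.get(conversationKey) ?? [];
  queue.push(normalized);
  autoCapturePendingIngressTexts.set(conversationKey, queue);
}

function onAgentEnd(conversationKey: string, eligibleTexts: string[]): string[] {
  const pendingIngressTexts = autoCapturePendingIngressTexts.get(conversationKey) ?? [];
  autoCapturePendingIngressTexts.delete(conversationKey); // assumed drain
  return pendingIngressTexts.length > 0 ? pendingIngressTexts : eligibleTexts;
}

const demoConversationKey = "demo-conversation"; // hypothetical key
const pendingSizes: number[] = [];

// Each round: one user message arrives, then the agent finishes.
for (const text of ["first DM", "second DM", "third DM"]) {
  onMessageReceived(demoConversationKey, text);
  pendingSizes.push(onAgentEnd(demoConversationKey, [text]).length);
}

console.log(pendingSizes);
```

If the queue is consumed once per round like this, it holds exactly one message at every agent_end, so swapping it in for eligibleTexts still leaves the count below extractMinMessages.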

───

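Result

| Conversation pattern | eligibleTexts | smartExtraction | regex fallback | Result |
| --- | --- | --- | --- | --- |
| Single-turn DM (1 user msg) | 1 | ❌ skipped (<2) | ✅ triggered | ⚠️ dirty data |
| Multi-turn, history accumulated successfully | ≥2 | ✅ triggered | ❌ not triggered | ✅ normal |
| LLM extraction fails | ≥2 | ❌ failed | ✅ triggered | ⚠️ dirty data |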

───

Logs:
memory-pro: smart-extractor: extracted 3 candidate(s) ← smart extraction succeeded
memory-pro: smart-extractor: created [cases] Memory-lanceDB-pro dirty data issue
memory-pro: smart-extractor: created [preferences] Model preference: Yunwu GPT-4o
memory-pro: smart-extracted 2 created, 0 merged, 1 skipped ← normal
regex fallback found 1 capturable text(s) ← single-turn DM falls into fallback
memory-lancedb-pro: auto-captured 1 memories for agent main in scope agent:main ← dirty data

Expected Behavior

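Change the semantics of extractMinMessages
Change extractMinMessages from "number of eligible texts per turn" to "minimum number of conversation rounds to accumulate before smart extraction triggers", and do real cumulative counting at the session level instead of relying on the per-event diffing hack.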
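
One possible shape for the session-level accumulation, offered as a sketch and not as the plugin's implementation. The class `SessionRoundAccumulator` and its method `addRound` are hypothetical names; the assumption is that agent_end calls `addRound` once per conversation round and runs smart extraction only when a batch is returned.

```typescript
// Hypothetical per-session accumulator: buffers texts across rounds and
// releases them once enough rounds have been seen.
interface PendingRounds {
  rounds: number;
  texts: string[];
}

class SessionRoundAccumulator {
  private readonly pending = new Map<string, PendingRounds>();
  private readonly extractMinMessages: number;

  constructor(extractMinMessages: number) {
    this.extractMinMessages = extractMinMessages;
  }

  // Returns the accumulated texts once the round threshold is met, else null.
  addRound(sessionKey: string, newTexts: string[]): string[] | null {
    const entry = this.pending.get(sessionKey) ?? { rounds: 0, texts: [] };
    entry.rounds += 1;
    entry.texts.push(...newTexts);
    if (entry.rounds < this.extractMinMessages) {
      this.pending.set(sessionKey, entry);
      return null; // keep buffering; nothing is written yet
    }
    this.pending.delete(sessionKey); // reset so the next batch starts from zero
    return entry.texts;
  }
}

const accumulator = new SessionRoundAccumulator(2);
const afterRound1 = accumulator.addRound("agent:main:demo", ["first DM"]);
const afterRound2 = accumulator.addRound("agent:main:demo", ["second DM"]);
const afterRound3 = accumulator.addRound("agent:main:demo", ["third DM"]);
console.log(afterRound1, afterRound2, afterRound3);
```

With a threshold of 2, the first round is buffered, the second releases both texts for smart extraction, and the third starts a new buffer. A real fix would also need a policy for buffers that never fill (for example a timeout or end-of-session flush), which this sketch leaves out.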

Steps to Reproduce

See above.

Error Logs / Screenshots

Embedding Provider

None

OS / Platform

No response
