feat(proposal-a): Phase 1 recall governance (Issue #569) [supersedes #597]#698
feat(proposal-a): Phase 1 recall governance (Issue #569) [supersedes #597]#698jlin53882 wants to merge 25 commits intoCortexReach:masterfrom
Conversation
- Add pendingRecall Map for tracking session recalls - Add agent_end hook to store response text for usage scoring - Add before_prompt_build hook (priority 5) to score recall usage - Add session_end hook to clean up pending recalls - Add isRecallUsed function to reflection-slices.ts - Guard: skip scoring for empty responseText (<=24 chars) Implements: recall usage tracking for Proposal A Phase 1
1. Bug 1 (CRITICAL): injectedIds regex in feedback hook never matched
- The feedback hook used a regex /\[([a-f0-9]{8,})\]/gi to parse IDs
from prependContext, but auto-recall injects memories in format
[preferences:global], [facts:dc-channel], NOT [hex-id].
- Fix: read recallIds directly from pendingRecall (which is populated
by auto-recall's before_prompt_build from the previous turn).
Also added code in auto-recall to store selected IDs into
pendingRecall[sessionKey].recallIds before returning.
2. Bug 2 (MAJOR): stripEnvelopeMetadata regex had literal backspace (0x08)
- In src/smart-extractor.ts line 76, a literal backspace character
(byte 0x08) was embedded in the regex pattern between 'agent' and '.',
producing 'agent[0x08].*?' instead of 'agent\b.*?'.
- Fix: replaced the 0x08 byte with the proper \b word boundary.
3. Bug 3 (MAJOR): WeakSet.clear() does not exist
- In index.ts resetRegistration(), _registeredApis.clear() was called,
but WeakSet has no clear() method.
- Fix: removed the .clear() call per the comment's own note.
…g, parseSmartMetadata, importance row update)
Bug 1 (P1): pendingRecall was written with recallIds from Turn N but responseText
from Turn N-1, causing feedback to score the wrong memories.
Fix: before_prompt_build (auto-recall) now CREATES pendingRecall with recallIds.
agent_end now only WRITES responseText to an existing entry (never creates).
Bug 2 (P2): parseSmartMetadata was called with empty placeholder metadata,
returning fallback values instead of real entry data.
Fix: use store.getById(recallId) to get the real entry before parsing.
Bug 3 (P2): patchMetadata only updates the metadata JSON blob, not the
entry.importance ROW column. applyImportanceWeight reads entry.importance,
so importance adjustments never affected ranking.
Fix: use store.update(id, { importance: newValue }) to update the row directly.
Bug 1 [P1]: pendingRecall.delete() moved from session_end to feedback hook finally block — prevents repeated scoring of the same recallIds/ responseText pair when subsequent turns skip auto-recall (greeting, short input). Now deleted immediately after scoring completes. Bug 2 [P2]: confirmed use now resets bad_recall_count to 0 — so penalty threshold (3) only applies to truly consecutive misses, not interleaved confirmed-use/miss patterns. Bug 3 [P3]: retrieveWithTrace now forwards source to hybridRetrieval(), aligning debug/trace retrieval with real manual-recall behavior.
…anup, env-resolve gate, recency double-boost)
P1-1 (isRecallUsed): Add direct injected-ID check
- The function accepted injectedIds but never used them
- Added loop to check if response contains any injected memory ID
- This complements the existing stock-phrase check
P1-2 (rerank env vars): Add rerank-enabled guard
- Only resolve \ placeholders when rerank is actually enabled
- Prevents startup failure when rerankApiKey has unresolved placeholder
but reranking is disabled (rerank='none')
P2 (multi-line wrapper stripping): Strip boilerplate continuation lines
- stripLeadingRuntimeWrappers now also strips lines matching
AUTO_CAPTURE_RUNTIME_WRAPPER_BOILERPLATE_RE (e.g.
'Results auto-announce to your requester.', 'Do not use any memory tools.')
while strippingLeadIn is still true, preventing these lines from
being kept when they appear right after the wrapper prefix line
…d configurable feedback amplitudes
…er prompt extraction, parsePluginConfig feedback, bad_recall_count double-increment)
Bug 1 (P1): isRecallUsed() only checked stock phrases and raw IDs,
but auto-recall injects [category:scope] summary format text.
Fix: store injectedSummaries (item.line) in pendingRecall on auto-recall
injection; pass them to isRecallUsed() which now checks if the response
contains any of the injected summary text verbatim.
Bug 2 (P1): confirm/error keywords were checked against pending.responseText
(previous-turn assistant response) instead of the current-turn user prompt.
Fix: read event.prompt (array of {role, content} messages) in the
before_prompt_build feedback hook and check keywords against the last user
message in that array.
Bug 3 (P2): parsePluginConfig() never copied cfg.feedback to the returned
config object, so all deployments fell back to hardcoded defaults.
Fix: add feedback block to the return object in parsePluginConfig.
Bug 4 (P2): bad_recall_count was incremented in BOTH the auto-recall
injection path AND the feedback hook, causing double-counting that made
the 3-consecutive-miss penalty trigger after only 2 actual misses.
Fix: remove +1 from the feedback hook; counter now only increments once
(in the auto-recall injection path where staleInjected is evaluated).
…ssages user prompt, agentId keying Bug 1 (P1): Score each recall independently instead of one usedRecall for the whole batch. - Build summaryMap: recallId -> injected summary - Call isRecallUsed per recallId with its specific summary - Prevents unused memories from being boosted or used ones penalized Bug 2 (P2): Extract user prompt from event.messages array, not event.prompt. - event.prompt is a plain string (confirmed by codebase usage), not an array - Extract last user message from event.messages (same pattern as agent_end) Bug 3 (P2): pendingRecall key includes agentId to avoid cross-agent overwrite. - Key format: sessionKey:agentId (both in auto-recall and feedback/agent_end hooks)
P1 fix: replace single-char CJK keywords (是/對/不/錯) with longer phrases (是對的/確認/錯誤/更正) to avoid false positives on ordinary conversation. P3 fix: session_end hook was not cleaning pendingRecall at all. Add cleanup of all pendingRecall entries that match the sessionId or sessionKey:agentId composite key pattern.
…ory leak When config.autoCapture === false, the auto-capture session_end (priority 10) was skipped, leaving only the Phase 1 session_end (priority 20) to clean up. The old code only deleted pendingRecall[sessionKey] - a simple key - but not composite keys (sessionKey:agentId). Now uses pattern matching (startsWith) to clean all related keys regardless of format. Fixes: P1 issue from Phase 1 audit
…e, P0-2 race condition known limitation P0-1: Add TTL-based cleanup to prevent unbounded pendingRecall memory growth P0-3: Enforce AND gate on summary path (hasUsageMarker || hasSpecificRecall required) P0-2: Document known limitation - bad_recall_count read-modify-write is not atomic P1-1: Verified autoCapture block boundary - Proposal A hooks are outside autoCapture block
P2: Change suppression threshold from >= 3 to >= 2 to match the scoring path threshold. After 2 bad recalls, both penalty and suppression now activate simultaneously, preventing one extra injection turn. P0-2: bad_recall_count race condition remains a known limitation (out of Phase 1 scope without store-layer compare-and-swap support).
P1 (Codex): Remove AND gate from Summary path in isRecallUsed(). Summary text overlap (>= 10 chars) is now detected independently, without requiring hasUsageMarker or hasSpecificRecall. This fixes false negatives where auto-recall memories are naturally used (without explicit markers) and incorrectly counted as misses. Auto-capture (pendingIngress) remains Phase 2 scope (out of Phase 1).
P1 (Codex): Fix double-counting + delayed suppression. - Change scoring condition from >= 2 to >= 1 to sync with injection's staleInjected increment (which increments bad_recall_count when re-injecting an unconfirmed memory from the previous turn). - Don't increment bad_recall_count in scoring path for silent misses: injection path already handles this via staleInjected. - Only increment in scoring for explicit errorKeywords (user corrections). This ensures the penalty applies on second miss (syncs with injection's first increment), not the third.
P1 (Codex): Don't write last_confirmed_use_at on explicit error. When user explicitly corrects a recall, we shouldn't mark it as "confirmed use" - otherwise staleInjected logic breaks. P2 (Codex): 1. Run TTL cleanup on read path too (not just set path). This handles idle sessions that never trigger set() again. 2. Refine confirm/error keywords to reduce false positives. Removed "ok" and "no" which are too generic.
P1 (Codex 2nd round): 1. errorKeywords now checked BEFORE usedRecall heuristic. If user explicitly corrects, that overrides usage detection. No importance boost is applied for errorKeywords cases. 2. errorKeywords sets last_confirmed_use_at = Date.now(). This prevents injection path's staleInjected from double-counting in the same turn. Next injection will NOT increment bad_recall_count via staleInjected (since last_confirmed_use_at >= last_injected_at). This fixes: - Used but corrected: no boost, single increment, no staleInjected double-count - Used and confirmed: boost + reset - Silent miss: penalty applies at badCount >= 1, injection handles increment
…dback config types P1 (Codex review round 4): - The second `if (config.autoCapture !== false)` block (agent_end auto-capture) was missing its closing brace. Phase 1 hooks were added inside this block but the block was never closed, causing ALL subsequent hooks (self-improvement, reflection, lifecycle, backup) to be conditional on autoCapture. Added closing `}` after the Phase 1 before_prompt_build hook (priority 5) to properly close the autoCapture block. P2 (Codex review round 4): - Feedback config values (boostOnUse, penaltyOnMiss, etc.) were used directly without Number() coercion. If deployment provides values as strings (common with env-driven config), Math.min/Math.max would produce NaN. Added Number() coercion with fallback to default values. Both fixes resolve the two issues flagged by Codex in PR review.
… matching, zero config
- F1: 修復 feedback 預設值 eval 為 0 的問題,改用 (val ?? null ?? default) - MR1: 新增 feedback 設定至 openclaw.plugin.json schema - MR2: Phase 1 hooks 包在 if (config.autoRecall === true) 內通過 test - F3: isRecallUsed reverse-match 加上 >=10 字元 guard 防 false positive - F7: 移除 agent_end 中永遠為 true 的 if (!sessionKey) return Refs: CortexReach#596
- index.ts: 保留 feedback + recallPrefix(兩者都適用) - retriever.ts: 使用 upstream 較簡潔的 comment Conflicts resolved.
fix(reflection): F4 - normalize both sides for legacy filtering
Conflicts resolved: - index.ts: keep PR branch Phase 1 hooks + pendingRecall state maps - openclaw.plugin.json: added feedback config schema - src/retriever.ts: minor diff, keep upstream (Promise.allSettled graceful degradation) - src/auto-capture-cleanup.ts: new file, added - src/reflection-slices.ts: added isRecallUsed + usage tracking - src/reflection-store.ts: small change, added Base: origin/master 02b97bb PR branch: fork/jlin53882 (23 commits from base 16ee3e4)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5c4cf2df02
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const vectorDim = getVectorDimensions( | ||
| config.embedding.model || "text-embedding-3-small", | ||
| config.embedding.dimensions, | ||
| ); |
There was a problem hiding this comment.
Preserve embedding.requestDimensions wiring
This initialization path now derives vector size from config.embedding.dimensions only, which drops the previously supported embedding.requestDimensions behavior even though that option is still exposed in the plugin schema. In deployments that depend on request-time downprojection, this causes the configured request dimension to be ignored and can lead to embedding-size/schema mismatches or silently incorrect vector sizing. Please keep requestDimensions in the parsed config and propagate it through store/embedder setup.
Useful? React with 👍 / 👎.
| if ( | ||
| Array.isArray(config.autoRecallExcludeAgents) && | ||
| config.autoRecallExcludeAgents.length > 0 && | ||
| agentId !== undefined && | ||
| config.autoRecallExcludeAgents.includes(agentId) |
There was a problem hiding this comment.
Restore autoRecallIncludeAgents whitelist gating
The auto-recall gate now enforces only the exclude list, so autoRecallIncludeAgents no longer restricts injection to a whitelist. In multi-agent environments that rely on include-mode isolation, memories will be injected into non-whitelisted agents, which contradicts the documented config semantics and can contaminate background-agent prompts. Reintroduce include-list handling with include-over-exclude precedence.
Useful? React with 👍 / 👎.
| const fb = config.feedback ?? {}; | ||
| // F1 fix: use (val ?? null) so ?? fallback fires on undefined. | ||
| // Pattern: (val ?? null ?? default). undefined→null→default; 0→0 (no trigger). | ||
| const boostOnUse = (fb.boostOnUse ?? null ?? 0.05); |
There was a problem hiding this comment.
Parse feedback config for recall scoring
This hook reads config.feedback, but parsePluginConfig no longer returns a feedback object, so this path always falls back to defaults and ignores user-provided feedback.* values declared in openclaw.plugin.json. That makes the new recall-governance tuning knobs ineffective in production even when configured. Add feedback to PluginConfig parsing so these settings are actually applied.
Useful? React with 👍 / 👎.
| export function resetRegistration() { | ||
| _registeredApis = new WeakSet<OpenClawPluginApi>(); | ||
| _singletonState = null; | ||
| _hookEventDedup.clear(); | ||
| // Note: WeakSets cannot be cleared by design. In test scenarios where the | ||
| // same process reloads the module, a fresh module state means a new WeakSet. | ||
| // For hot-reload scenarios, the module is re-imported fresh. | ||
| // (WeakSet.clear() does not exist, so we do nothing here.) |
There was a problem hiding this comment.
Make resetRegistration clear registered APIs
resetRegistration() is now a no-op, but register() still short-circuits when _registeredApis already contains the API object. Any test or hot-reload path that calls resetRegistration() and reuses the same API instance will fail to re-register hooks/services, which breaks the function’s own contract and causes non-isolated behavior across runs. Reset should reinitialize dedupe state rather than doing nothing.
Useful? React with 👍 / 👎.
CI 失敗分析(與本 PR 無關)兩個失敗都是 upstream 既有问题,與 PR #618 的失敗 root cause 相同:
這些失敗與 PR #698 的實作內容無關,是 merge master 時繼承的 upstream 問題。建議開独立 issue 追蹤。 |
Summary
Phase 1 of Proposal A (recall governance): fixes critical bugs in the feedback loop for auto-recall memories, alignment of suppression and scoring thresholds, and natural usage detection.
This PR supersedes #597. All commits from the original PR have been cleanly rebased onto current
master(02b97bb).See RFC: #569
What changed vs #597
The original #597 was based on an outdated
master(16ee3e4) and contained a destructive merge commit that overwrote upstream changes. This new PR resolves all conflicts cleanly by cherry-picking the 23 meaningful commits onto currentmaster.Rebased branch:
jlin53882:jlin53882-pr597Changes
Phase 1 Recall Governance Hooks
before_prompt_build(priority 5): score recall usage, update importance/bad_recall_countagent_end(priority 15): write response text for scoring in next turnsession_end(priority 20): cleanup pendingRecall entriesBug Fixes
pendingRecallTTL cleanup (10 min max age) - prevents unbounded memory growthisRecallUsed()Summary path AND gate fixed - natural usage detection now works>= 3->>= 2)>= 2->>= 1)last_confirmed_use_at- prevents staleInjected double-countingNew Files
src/reflection-slices.ts:isRecallUsedfunction + usage tracking logicsrc/auto-capture-cleanup.ts: boilerplate wrapper strippingConfig Changes
openclaw.plugin.json: addedfeedbackconfig schema for configurable feedback amplitudesOrthogonal (not core to Phase 1)
src/retriever.ts: Recency double-boost improvementKnown limitations (out of Phase 1 scope)
bad_recall_countread-modify-write is not atomic - requires store-layer compare-and-swapCI Notes
Pre-existing failures on
master(not caused by this PR):cli-smoke: fails at line 316 assertion - pre-existing test bugcore-regression,storage-and-schema,packaging-and-workflow: jiti/mlly import errors - pre-existing environment issueCloses #569