The question
Kai has built a memory subsystem with richer write-side metadata, write-time dedup, two-stage extraction, anti-fabrication gates, and a replay-sandbox validation discipline. To date all of the validation has been internal: Kai-vs-Kai across prompt versions, threshold values, and pollution conditions. None of it has compared Kai's stack head-to-head against stock Mem0 with sensible defaults.
We genuinely do not know whether the additional machinery produces a measurable retrieval edge over a baseline Mem0 install.
This matters for three reasons:
Confidence in continued investment. The eval discipline has caught real bugs and improved within-system metrics, but those improvements could in principle be reproducing stock Mem0's behavior at higher cost.
Any future open-source extraction decision. If the discipline does not produce a measurable retrieval edge, the open-source story shifts from "better retrieval" to "methodology + reassurance." That changes the value proposition and the audience.
Honest characterization. Any public writeup of Kai's memory approach needs a credible baseline. "Kai improved from p@3=0.73 to 0.81 on probe set X" is meaningless without "stock Mem0 scores YY on the same set."
What Kai's stack adds on top of Mem0
The deltas worth ablating:
Quality-test extraction prompt (counterfactual durability test, controlled tag vocabulary, anti-fabrication rules) vs Mem0's default LLM prompt.
Write-time intent classification (new / update_of / skip_redundant decided by the extractor) vs Mem0's stock "always store, sort by similarity at retrieval."
Per-fact metadata (speaker, confidence, confirmation_quote, intent, prompt_version) vs Mem0's bare content + scope.
Per-speaker recall-weight calibration vs Mem0's flat scoring.
Stage-2 episode extraction (narrative records alongside atomic facts) vs Mem0's atomic-only.
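To make the metadata and intent deltas concrete, here is a minimal sketch of what a per-fact record could look like. The field names come from the list above; the types, the `FactRecord` and `Intent` names, the `updates_fact_id` field, and the `should_store` helper are all assumptions for illustration, not Kai's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Intent(str, Enum):
    """Write-time intent decided by the extractor (values from the list above)."""
    NEW = "new"
    UPDATE_OF = "update_of"
    SKIP_REDUNDANT = "skip_redundant"


@dataclass(frozen=True)
class FactRecord:
    # Roughly what stock Mem0 is described as keeping: content plus scope.
    content: str
    scope: str
    # Kai's additional write-side metadata (types are assumed).
    speaker: str
    confidence: float
    confirmation_quote: str
    intent: Intent
    prompt_version: str
    # Hypothetical: which existing fact an update_of record supersedes.
    updates_fact_id: Optional[str] = None


def should_store(record: FactRecord) -> bool:
    """Write-time gate: redundant facts are dropped before reaching the store."""
    return record.intent is not Intent.SKIP_REDUNDANT
```

In this sketch the stock behavior corresponds to storing every record regardless of `intent`, which is the difference the dedup-gate ablation is meant to isolate.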
Proposed eval design
Run two replays of the same chat-history window into two sandbox stores:
Sandbox A: full Kai stack at production thresholds.
Sandbox B: stock Mem0 (Mem0's default extractor prompt, no dedup gate, no intent classification, atomic-only, no per-speaker weights).
Score both on:
Retrieval quality: p@1, p@3, MRR against the existing 26-probe baseline set.
In-prompt rate: fraction of probes where the expected fact appears in the top-k retrieved facts that fit the context budget.
Store cleanliness: post-replay pairwise cosine cluster count at T_audit=0.85.
Fabrication audit: sample 20 facts from each store; the operator judges how many are factually wrong or unreasonable. This is the load-bearing check: the anti-fabrication gates are not free, so they need to show fewer fabricated facts than stock Mem0 to justify their cost.
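The retrieval and cleanliness scores above can be sketched in a few functions. Two assumptions are worth stating. First, each probe is assumed to have a single expected fact, so p@k is read here as "fraction of probes whose expected fact is in the top k" (a figure like p@3=0.73 is only reachable under that reading, since strict precision@3 with one relevant item cannot exceed 1/3). Second, a "cluster" is taken to mean a connected group of two or more facts whose pairwise cosine similarity is at or above T_audit. All function names are hypothetical, and embeddings are assumed to be precomputed.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def hit_at_k(ranked: Dict[str, List[str]], expected: Dict[str, str], k: int) -> float:
    """Fraction of probes whose expected fact id appears in the top-k results."""
    if not expected:
        raise ValueError("probe set is empty")
    hits = sum(1 for probe, fact in expected.items() if fact in ranked.get(probe, [])[:k])
    return hits / len(expected)


def mean_reciprocal_rank(ranked: Dict[str, List[str]], expected: Dict[str, str]) -> float:
    """Mean of 1/rank of the expected fact; a probe that misses contributes 0."""
    if not expected:
        raise ValueError("probe set is empty")
    total = 0.0
    for probe, fact in expected.items():
        results = ranked.get(probe, [])
        if fact in results:
            total += 1.0 / (results.index(fact) + 1)
    return total / len(expected)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def near_duplicate_clusters(vectors: Sequence[Sequence[float]], threshold: float = 0.85) -> int:
    """Count connected groups (size >= 2) of facts linked by cosine >= threshold."""
    n = len(vectors)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    sizes = Counter(find(i) for i in range(n))
    return sum(1 for size in sizes.values() if size >= 2)
```

Running the same functions over both sandboxes keeps the scoring code identical for A and B, so any difference comes from the stores rather than from the harness. The pairwise loop is quadratic in store size, which should be acceptable for an audit pass but is not meant for large stores.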
Optional ablations to isolate which Kai component is doing the work:
Kai stack with the dedup gate disabled.
Kai stack with the per-speaker weights disabled.
Kai stack with the quality-test prompt swapped for Mem0's default.
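The two sandboxes and the three ablations can be written down as a small variant matrix, which makes it easy to check that each ablation changes exactly one component. This is a sketch only: `ReplayVariant`, its field names, and the prompt labels are hypothetical and do not correspond to real Mem0 or Kai configuration keys.

```python
from dataclasses import dataclass, fields, replace
from typing import List


@dataclass(frozen=True)
class ReplayVariant:
    name: str
    extractor_prompt: str          # hypothetical labels: "kai_quality_test" or "mem0_default"
    dedup_gate: bool
    intent_classification: bool
    episode_extraction: bool       # stage-2 narrative records alongside atomic facts
    per_speaker_weights: bool


# Sandbox A: full Kai stack. Sandbox B: stock Mem0 behavior as described above.
KAI_FULL = ReplayVariant("A_kai_full", "kai_quality_test", True, True, True, True)
STOCK_MEM0 = ReplayVariant("B_stock_mem0", "mem0_default", False, False, False, False)

# Each ablation flips a single component relative to the full Kai stack.
ABLATIONS: List[ReplayVariant] = [
    replace(KAI_FULL, name="kai_no_dedup", dedup_gate=False),
    replace(KAI_FULL, name="kai_no_speaker_weights", per_speaker_weights=False),
    replace(KAI_FULL, name="kai_mem0_prompt", extractor_prompt="mem0_default"),
]


def changed_fields(variant: ReplayVariant, base: ReplayVariant = KAI_FULL) -> List[str]:
    """Names of the components on which a variant differs from the base stack."""
    return [
        f.name
        for f in fields(ReplayVariant)
        if f.name != "name" and getattr(variant, f.name) != getattr(base, f.name)
    ]
```

A quick sanity check before any replay is that `changed_fields` returns a single entry for every ablation; a variant that differs in two components cannot attribute a score change to either one.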
Confounds to flag up front
Tuning bias. Kai's stack has been tuned on Kai's own chat history. Stock Mem0 has not. Any comparison on Kai's data is biased toward Kai. A clean publication-grade comparison would require a held-out public benchmark (LongMemEval, LoCoMo, MemoryAgentBench). For an internal "should we keep investing" answer, the Kai-data version is sufficient.
Probe-set provenance. The 26-probe baseline was authored against Kai's facts. Stock Mem0 may produce facts in different phrasings that the probes do not expect; this can suppress B's score for reasons unrelated to retrieval quality.
One operator. Personal-assistant chat history is one operator's voice, vocabulary, and topics. Generalization to other deployments cannot be claimed from this eval alone.
Possible outcomes
1. Kai wins meaningfully. Discipline justifies itself on retrieval edge; open-source extraction has a clean story.
2. Kai ties on retrieval but wins on fabrication audit. Discipline justifies itself on reliability; open-source story shifts to "methodology + reliability."
3. Tie on both. Discipline produces reassurance without measurable value. Open-source extraction loses its primary justification. Within-Kai value reduces to the harness as infrastructure for future changes.
4. Kai loses on retrieval. Stack is producing worse retrieval than stock Mem0. Forces a hard reconsideration of which components to keep.
Outcomes 3 and 4 are real possibilities. The eval should be designed and run with that on the table; the value of the exercise is that it can produce an unflattering result.
Timing
Blocked on #465 (the paraphrase-dedup threshold sweep) completing first, since the production dedup threshold is one of the variables. The eval design can be drafted in parallel; the eval itself can run as soon as #465 merges.
Related: #436 (memory quality epic), #437, #464, #465, #466.