Compare Kai memory stack to stock Mem0 baseline #468

@dcellison

The question

Kai has built a memory subsystem with richer write-side metadata, write-time dedup, two-stage extraction, anti-fabrication gates, and a replay-sandbox validation discipline. To date all of the validation has been internal: Kai-vs-Kai across prompt versions, threshold values, and pollution conditions. None of it has compared Kai's stack head-to-head against stock Mem0 with sensible defaults.

We genuinely do not know whether the additional machinery produces a measurable retrieval edge over a baseline Mem0 install.

This matters for three reasons:

  1. Confidence in continued investment. The eval discipline has caught real bugs and improved within-system metrics, but those improvements could in principle be reproducing stock Mem0's behavior at higher cost.
  2. Any future open-source extraction decision. If the discipline does not produce a measurable retrieval edge, the open-source story shifts from "better retrieval" to "methodology + reassurance." That changes the value proposition and the audience.
  3. Honest characterization. Any public writeup of Kai's memory approach needs a credible baseline. "Kai improved from p@3=0.73 to 0.81 on probe set X" is meaningless without "stock Mem0 scores YY on the same set."

What Kai's stack adds on top of Mem0

The deltas worth ablating:

  • Quality-test extraction prompt (counterfactual durability test, controlled tag vocabulary, anti-fabrication rules) vs Mem0's default LLM prompt.
  • Write-time intent classification (new / update_of / skip_redundant decided by the extractor) vs Mem0's stock "always store, sort by similarity at retrieval."
  • Write-time semantic-similarity dedup gate (the configurable threshold tuned in #465, "Memory: pre-write similarity gate for paraphrase deduplication") vs Mem0's default no-write-gate posture.
  • Per-fact metadata (speaker, confidence, confirmation_quote, intent, prompt_version) vs Mem0's bare content + scope.
  • Per-speaker recall-weight calibration vs Mem0's flat scoring.
  • Stage-2 episode extraction (narrative records alongside atomic facts) vs Mem0's atomic-only.
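
To make the write-side deltas concrete, here is a minimal sketch of how intent classification and the similarity gate could combine at write time. It is not Kai's actual code: the `Fact` dataclass only mirrors the metadata fields listed above, `should_write` is a hypothetical helper, embeddings are assumed to be plain float vectors, and the threshold is a caller-supplied parameter rather than the value tuned in #465.

```python
import math
from dataclasses import dataclass


@dataclass
class Fact:
    """Hypothetical per-fact record using the metadata fields listed above."""
    content: str
    speaker: str
    confidence: float
    confirmation_quote: str
    intent: str  # assumed values: "new", "update_of", "skip_redundant"
    prompt_version: str


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def should_write(fact, embedding, stored_embeddings, threshold):
    """Return True only if the fact passes both write-time gates.

    Gate 1: the extractor did not mark the fact as redundant.
    Gate 2: no stored fact is at or above the similarity threshold.
    """
    if fact.intent == "skip_redundant":
        return False
    return all(cosine(embedding, e) < threshold for e in stored_embeddings)
```

In this framing, the stock-Mem0 sandbox corresponds to skipping both gates and storing every extracted fact.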

Proposed eval design

Run two replays of the same chat-history window into two sandbox stores:

  • Sandbox A: full Kai stack at production thresholds.
  • Sandbox B: stock Mem0 (Mem0's default extractor prompt, no dedup gate, no intent classification, atomic-only, no per-speaker weights).

Score both on:

  • Retrieval quality: p@1, p@3, MRR against the existing 26-probe baseline set.
  • In-prompt rate: fraction of probes where the expected fact appears in the top-k retrieved facts that fit the context budget.
  • Store cleanliness: post-replay pairwise cosine cluster count at T_audit=0.85.
  • Fabrication audit: sample 20 facts from each store; operator judges how many are factually wrong or unreasonable. The anti-fabrication gates are not free, so this audit is the load-bearing check on whether they pay for themselves.
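
A minimal sketch of the retrieval and cleanliness metrics, under stated assumptions. Each probe is assumed to have exactly one expected fact ID, so p@k is computed here as the fraction of probes whose expected fact appears in the top k (a hit rate). "Cluster count" is assumed to mean the number of connected groups of two or more facts linked by pairwise cosine similarity at or above the audit threshold. Function names are illustrative, not taken from the Kai harness.

```python
import math


def reciprocal_rank(ranked_ids, expected_id):
    """1/rank of the expected fact, or 0.0 if it was not retrieved."""
    for rank, fact_id in enumerate(ranked_ids, start=1):
        if fact_id == expected_id:
            return 1.0 / rank
    return 0.0


def score_probes(results):
    """results: list of (ranked_ids, expected_id) pairs, one per probe."""
    if not results:
        raise ValueError("no probe results to score")
    n = len(results)
    return {
        "p@1": sum(expected in ranked[:1] for ranked, expected in results) / n,
        "p@3": sum(expected in ranked[:3] for ranked, expected in results) / n,
        "MRR": sum(reciprocal_rank(r, e) for r, e in results) / n,
    }


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_count(embeddings, t_audit=0.85):
    """Count groups of 2+ facts connected by similarity >= t_audit (union-find)."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= t_audit:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(len(embeddings)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return sum(1 for size in sizes.values() if size >= 2)
```

Running the same two functions over both sandboxes keeps the scoring code identical across A and B, so any difference comes from the stores rather than the harness.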

Optional ablations to isolate which Kai component is doing the work:

  • Kai stack with the dedup gate disabled.
  • Kai stack with the per-speaker weights disabled.
  • Kai stack with the quality-test prompt swapped for Mem0's default.
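
One way to keep the sandboxes and ablations honest is to describe each run as a small config object and derive every ablation from the full Kai config by flipping exactly one switch. The sketch below is hypothetical: `SandboxConfig`, its field names, and the prompt labels are illustrative and do not correspond to Kai's or Mem0's real configuration surface.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class SandboxConfig:
    """Hypothetical description of one replay run."""
    name: str
    extractor_prompt: str  # illustrative labels: "kai_quality_test", "mem0_default"
    intent_classification: bool
    dedup_gate: bool
    per_speaker_weights: bool
    episode_extraction: bool


# Sandbox A: full Kai stack. Sandbox B: stock Mem0 with every Kai delta off.
KAI_FULL = SandboxConfig("A_kai_full", "kai_quality_test", True, True, True, True)
MEM0_STOCK = SandboxConfig("B_mem0_stock", "mem0_default", False, False, False, False)

# Each ablation changes exactly one component relative to the full Kai stack.
ABLATIONS = [
    replace(KAI_FULL, name="kai_no_dedup_gate", dedup_gate=False),
    replace(KAI_FULL, name="kai_no_speaker_weights", per_speaker_weights=False),
    replace(KAI_FULL, name="kai_mem0_prompt", extractor_prompt="mem0_default"),
]


def changed_fields(base, other):
    """Names of the fields (ignoring `name`) on which two configs differ."""
    a, b = asdict(base), asdict(other)
    return [key for key in a if key != "name" and a[key] != b[key]]
```

Checking that `changed_fields(KAI_FULL, cfg)` has length one for every ablation guards against accidentally confounding two components in a single run.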

Confounds to flag up front

  • Tuning bias. Kai's stack has been tuned on Kai's own chat history. Stock Mem0 has not. Any comparison on Kai's data is biased toward Kai. A clean publication-grade comparison would require a held-out public benchmark (LongMemEval, LoCoMo, MemoryAgentBench). For an internal "should we keep investing" answer, the Kai-data version is sufficient.
  • Probe-set provenance. The 26-probe baseline was authored against Kai's facts. Stock Mem0 may produce facts in different phrasings that the probes do not expect; this can suppress B's score for reasons unrelated to retrieval quality.
  • One operator. Personal-assistant chat history is one operator's voice, vocabulary, and topics. Generalization to other deployments cannot be claimed from this eval alone.

Possible outcomes

  1. Kai wins meaningfully. Discipline justifies itself on retrieval edge; open-source extraction has a clean story.
  2. Kai ties on retrieval but wins on fabrication audit. Discipline justifies itself on reliability; open-source story shifts to "methodology + reliability."
  3. Tie on both. Discipline produces reassurance without measurable value. Open-source extraction loses its primary justification. Within-Kai value reduces to the harness as infrastructure for future changes.
  4. Kai loses on retrieval. Stack is producing worse retrieval than stock Mem0. Forces a hard reconsideration of which components to keep.

Outcomes 3 and 4 are real possibilities. The eval should be designed and run with that on the table; the exercise is valuable precisely because it can produce an unflattering result.
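
Fixing the decision rule before the run makes it harder to rationalize an unflattering result afterwards. The sketch below maps scores to the four outcomes above. The margins are placeholders, not agreed values: what counts as "meaningfully" better on retrieval, or as a win on the 20-fact fabrication audit, is an assumption that would need to be settled in advance.

```python
def classify_outcome(kai_retrieval, mem0_retrieval,
                     kai_fab_errors, mem0_fab_errors,
                     retrieval_margin=0.05, fab_margin=2):
    """Return the outcome number (1-4) from the list above, or None.

    kai_retrieval / mem0_retrieval: one retrieval score per sandbox (e.g. p@3).
    kai_fab_errors / mem0_fab_errors: facts judged wrong in each 20-fact sample.
    retrieval_margin and fab_margin are hypothetical thresholds.
    """
    delta = kai_retrieval - mem0_retrieval
    if delta >= retrieval_margin:
        return 1  # Kai wins meaningfully on retrieval
    if delta <= -retrieval_margin:
        return 4  # Kai loses on retrieval

    # Retrieval is a tie within the margin; look at the fabrication audit.
    fab_delta = mem0_fab_errors - kai_fab_errors  # positive means Kai is cleaner
    if fab_delta >= fab_margin:
        return 2  # tie on retrieval, Kai wins on fabrication
    if fab_delta <= -fab_margin:
        return None  # tie on retrieval, Kai worse on fabrication: not listed above
    return 3  # tie on both
```

The `None` branch covers a case the list above does not name (a retrieval tie where Kai's store contains more fabricated facts), which may be worth adding as a fifth outcome.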

Timing

Blocked on #465 (the paraphrase-dedup threshold sweep), since the production dedup threshold is one of the variables. The eval design can be drafted in parallel; the eval itself can run as soon as #465 merges.

Related: #436 (memory quality epic), #437, #464, #465, #466.
