Memory quality: replace write-time string filters with structural controls #436

@dcellison

Why

The mitigations layered into the extractor and episode classifier over the recent sweep (#426, #427, #429) all share one shape: they add negative examples and exclusion phrases to the prompt to suppress noise classes we have already observed in production. This pattern is brittle by construction. It catches the specific phrasings already seen and misses the next variant of the same noise class. Each new exclusion bloats the prompt and dilutes the model's attention to the positive task, and the growth has no natural stopping point: a noise class has unboundedly many phrasings, while any exclusion list is finite.

There is also a deeper conflict. Aggressive write-time filtering can drop legitimate user input, which violates the standing invariant that the memory system must never lose what the user said. Every additional write-time gate is another chance to silently discard something we will later wish we had kept. The current trajectory keeps stacking that risk.

The structural answer is to change what kind of filtering happens at write-time and add machinery at retrieval-time that does not exist today. At write-time we keep a quality floor (refuse candidates that are obviously not useful to retain: workflow noise, tautologies, ephemeral state, procedural fragments that are not claims about the world) but we stop trying to predict downstream utility for candidates that pass that floor. Predicting whether a fact will be useful in three months is the part we cannot do reliably at write-time, and that is where the current pattern-matching defenses break down. At retrieval-time we add the selectivity machinery (provenance weighting, confidence scoring, decay) that uses signal genuinely unavailable upstream: the actual query, the fact's age, its retrieval history. Where write-time gating remains, it tests a property the model can actually reason about ("would this help a future conversation?") rather than matching candidates against a growing list of forbidden strings.

This is not "let everything in." The upfront floor is real and is part of the design (Sub 3). What changes is that the floor catches a bounded class (obvious-junk) and the gray zone (uncertain utility) gets handled by ordering rather than dropping. The user-input-preservation invariant applies to the substance of what the user said; it does not require keeping literal stenographic noise.

The expected price of this inversion is a slightly larger Qdrant index and slightly higher retrieval cost. Both are cheap relative to the cost of dropping useful input or shipping prompts that no longer pay attention to their core task.

Scope

Four sub-issues, prioritized as listed. Each is independently shippable, each delivers measurable improvement on its own, and the ordering reflects load-bearing-ness rather than dependency: any can technically ship first, but the prioritization reflects which moves the noise/quality numbers most.

The epic closes when all four ship and a re-run of the Layer 1 eval harness shows the joint effect is non-regressive on retrieval quality, with measurable drops in the current noise metrics.

Tracking

How the four subs cover the quality space

Each sub addresses a different region of the quality space at a different point in time. Together they form an upfront floor and a post-hoc floor with soft selectivity in between, so the index never collects garbage and the gray zone is handled by ordering rather than dropping.

  • Upfront quality floor (Sub 3). Candidates that fail the "would this help a future conversation?" test are refused at extraction. Workflow noise, tautologies, ephemeral state, and other obvious-junk classes never enter the index. The exclusion lists currently doing this work get replaced with a single positive-criterion test that generalizes to phrasings we have not yet seen.
  • Redundancy gate (Sub 2). Candidates that are geometrically close to something already stored are merged or dropped at write time. The index does not accumulate paraphrases of the same claim. This is a write-time floor too, just on a different axis (similarity, not quality).
  • Soft selectivity (Sub 1). Candidates that pass both upfront floors get tagged with provenance and confidence at write, then downweighted at retrieval. Nothing is dropped here; the gray zone (real facts that vary in source-quality) is handled by ordering.
  • Post-hoc floor (Sub 4). Facts that looked acceptable upfront but turn out never to retrieve over time are demoted or culled by a janitor pass using signal genuinely unavailable at write time (retrieval history, age).

The user-input-preservation invariant constrains the upfront-quality floor (Sub 3) but does not require absolute retention. The invariant protects the substance of what the user said. Literal stenographic noise, workflow events, and procedural fragments are not user input in the sense the invariant cares about, and refusing them at the door is consistent with the rule.

Sub 1: per-fact provenance plus retrieval-time downweight

Tag every Qdrant fact at write time with two metadata fields: provenance (one of user_stated, assistant_derived, episode_summary) and confidence (a float 0.0 to 1.0 set by the extractor based on its own assessment of how clearly the source utterance asserted the fact). Both fields ride along on the existing Mem0 metadata channel; we have used that channel before for source attribution so the path is known to work, though it should be re-verified at spec time given Mem0 internals can shift.
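A minimal sketch of the two fields as a validated payload, for discussion at spec time. The class name `FactMetadata` and the method `to_payload` are hypothetical; the sketch only builds the dict and makes no assumption about how Mem0 accepts or returns it.

```python
from dataclasses import dataclass, asdict

# The three provenance classes named in this issue.
PROVENANCE_CLASSES = ("user_stated", "assistant_derived", "episode_summary")


@dataclass(frozen=True)
class FactMetadata:
    """Hypothetical container for the two per-fact fields proposed in Sub 1."""

    provenance: str
    confidence: float

    def __post_init__(self):
        # Reject values outside the proposed vocabulary and range early,
        # so malformed metadata never reaches the index.
        if self.provenance not in PROVENANCE_CLASSES:
            raise ValueError(f"unknown provenance: {self.provenance!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")

    def to_payload(self) -> dict:
        """Plain dict suitable for a metadata channel that stores JSON-like values."""
        return asdict(self)
```

Validating at construction means a round-trip check (write the payload, read it back, rebuild `FactMetadata`) would fail loudly if a field were dropped or mangled in transit, which is the failure the re-verification step is meant to catch.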

At retrieval time, post-process the cosine score by a weight function over those two fields. Initial design: a simple multiplicative weight per provenance class (user_stated = 1.0, assistant_derived = ~0.7, episode_summary = ~0.85, all to be calibrated), composed with the confidence value. Final retrieval ordering uses the weighted score, but raw cosine remains stored for diagnostics. Calibration is done against the existing 26-probe Layer 1 eval harness: pick weights that maximize p@3 and MRR on that probe set without regressing p@1.
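A sketch of the weighting step, assuming "composed with the confidence value" means a plain product (the issue does not fix the composition, so treat that as an assumption). The weights are the uncalibrated starting values from the paragraph above; `weighted_score` and `rerank` are hypothetical names, and hits are modeled as plain dicts rather than any client library's result type.

```python
# Uncalibrated starting weights from this issue; to be tuned on the eval harness.
PROVENANCE_WEIGHTS = {
    "user_stated": 1.0,
    "assistant_derived": 0.7,
    "episode_summary": 0.85,
}


def weighted_score(cosine, provenance, confidence, weights=PROVENANCE_WEIGHTS):
    """Raw cosine scaled by the provenance weight and by confidence (assumed product)."""
    return cosine * weights[provenance] * confidence


def rerank(hits, weights=PROVENANCE_WEIGHTS):
    """Return hits ordered by weighted score, keeping raw cosine for diagnostics.

    Each hit is a dict with 'cosine', 'provenance', and 'confidence' keys.
    The input dicts are not mutated; a 'weighted' key is added to copies.
    """
    scored = [
        dict(hit, weighted=weighted_score(
            hit["cosine"], hit["provenance"], hit["confidence"], weights))
        for hit in hits
    ]
    return sorted(scored, key=lambda h: h["weighted"], reverse=True)
```

With these values an `assistant_derived` hit at cosine 0.80 scores about 0.56 and loses to a `user_stated` hit at cosine 0.75, both at confidence 1.0. Under a pure product, low confidence penalizes a fact even when every provenance weight is 1.0, so the spec may want a softer composition if full reversibility by weights alone is required.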

This is the load-bearing change in the epic for two reasons. First, it addresses the broadest class: assistant-derived facts dominate current index growth and most of the recently-observed noise sits in that bucket. Second, it does so without dropping anything at write-time, which keeps the user-input-preservation invariant intact and means the change is reversible just by setting the weights to 1.0. The 2026-04-27 ablation eval already established that wholesale removal of assistant-derived facts is too aggressive (three of four treatment probes lost their best answer when assistant facts were stripped); downweighting is the recommended middle path that preserves the answer pool while letting user-stated facts win on tie.

Open design question for the spec: where does confidence come from? Two options: the extractor returns it explicitly per candidate, or we derive it from a structural signal (presence of hedging language, first-person vs third-person, etc.). First option is more honest but requires prompt changes to elicit; second is cheaper but indirect. Spec should pick one and justify.

Sub 2: pre-write similarity gate

Before adding any candidate fact to Qdrant, embed it (using the same model the index uses) and search for nearest neighbors under the same user_id. If the cosine similarity to the nearest neighbor exceeds a threshold (initial proposal: 0.92, to be tuned), the candidate is either merged into the existing fact or dropped silently with a counter increment.
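The gate decision can be sketched in memory as follows. This is illustrative only: a real implementation would ask Qdrant for the nearest neighbor instead of scanning, and `gate`, the `stored` list shape, and the return values are all hypothetical. The 0.92 default is the untuned proposal from the paragraph above.

```python
import math

SIMILARITY_THRESHOLD = 0.92  # initial proposal from this issue, to be tuned


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gate(candidate_vec, stored, threshold=SIMILARITY_THRESHOLD):
    """Decide whether a candidate may be written.

    `stored` is a list of (fact_id, vector) pairs already filtered to one user_id.
    Returns ("write", None) when nothing stored is close enough, otherwise
    ("merge_or_drop", nearest_fact_id) so the caller can hand off to the merge path.
    """
    best_id, best_sim = None, -1.0
    for fact_id, vec in stored:
        sim = cosine_similarity(candidate_vec, vec)
        if sim > best_sim:
            best_id, best_sim = fact_id, sim
    if best_id is not None and best_sim > threshold:
        return ("merge_or_drop", best_id)
    return ("write", None)
```

The gate only identifies the colliding neighbor; choosing between merge and drop is left to the caller, which matches the intent of reusing the existing consolidator for the picking logic.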

Merge is the preferred outcome and is the harder part of the design. Two facts that are 0.94 cosine-similar are usually two phrasings of the same underlying claim, but one is typically better-phrased than the other (more specific, less hedged, more anchored to a concrete subject). The consolidation step that runs today already has this judgment encoded for its own purpose; the similarity-gate spec should reuse that consolidator path rather than reinvent the picking logic. Drop-with-counter is the fallback for cases where the consolidator declines to pick a winner.

This kills the most common failure mode visible in the production index today: many paraphrases of the same fact accumulating over weeks because each conversation re-states the same claim in slightly different words and the extractor has no way to know it has been heard before. The defense is geometric, not lexical, so it is robust to phrasing variation by construction; you cannot evade it by rewording.

Interaction with Sub 1: when merge happens, the surviving fact's provenance and confidence should be the higher-quality of the two (user_stated beats assistant_derived; higher confidence wins ties). Spec must lock the merge-metadata rule explicitly; otherwise quiet drift can degrade the provenance signal Sub 1 relies on.
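One possible reading of the merge-metadata rule, sketched so the spec has something concrete to accept or reject. Two assumptions are baked in: `episode_summary` ranks between the other two classes (the issue only states that `user_stated` beats `assistant_derived`), and the winner's provenance and confidence travel together rather than being maximized independently. `PROVENANCE_RANK` and `merge_metadata` are hypothetical names.

```python
# Assumed quality ordering; only user_stated > assistant_derived is stated in the issue.
PROVENANCE_RANK = {"user_stated": 2, "episode_summary": 1, "assistant_derived": 0}


def merge_metadata(a, b):
    """Pick the surviving metadata for two merged facts.

    Higher provenance rank wins; when provenance is equal, higher confidence wins.
    Each argument is a dict with 'provenance' and 'confidence' keys.
    """
    def quality(meta):
        return (PROVENANCE_RANK[meta["provenance"]], meta["confidence"])

    winner = max(a, b, key=quality)
    return {"provenance": winner["provenance"], "confidence": winner["confidence"]}
```

Under this reading, merging an `assistant_derived` fact at 0.9 with a `user_stated` fact at 0.6 yields `user_stated` at 0.6. If the spec instead wants the maximum confidence regardless of which fact supplied the provenance, that is a different rule and should be stated explicitly.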

Threshold tuning is the other risk. 0.92 is a guess; the right value depends on embedding model behavior and per-operator conversational style. Methodology and metrics (gate-trigger rate, false-merge rate sampled from a small audited window of merges) are to be locked at spec time. The tuning procedure must be documented in the wiki when the sub-issue ships, because it is a per-deployment concern, not a one-time decision.

Sub 3: replace negative-example stanzas in extraction prompt with a positive-criterion test

This sub is the upfront quality floor for the index. Everything else in the epic (Sub 1's downweighting, Sub 2's similarity gate, Sub 4's janitor) operates on candidates that have already passed Sub 3. The point of Sub 3 is to make sure obvious-junk does not enter the store at all, so the soft-selectivity machinery does not have to carry the load of suppressing it later.

The current extraction prompt has accumulated exclusion lists describing patterns and phrasings that should not be extracted. These lists have grown across #426, #427, and #429. Each entry exists because a specific noise class slipped through and someone added a stanza to catch it. The pattern continues for as long as new noise classes appear, which is forever.

Replace those stanzas with one question per candidate, applied positively: "Would this fact help a future conversation that does not include the current turn? If no, do not extract it." The exclusion lists are removed, not augmented; if the new prompt fails the eval gate, the old lists are restored as a fallback rather than left in alongside.

The argument for the swap is twofold. Operationally, the prompt becomes shorter and the model's attention is concentrated on a single test rather than scanning a growing list. In terms of generalization, a phrasing the prompt has never seen before is evaluated by the same criterion as the ones it has. The negative-list approach is bounded by the author's enumeration of failure modes; the positive criterion is bounded by the property we actually care about, which is the same property regardless of phrasing.

The risk is that Haiku is worse at the positive question than at the pattern match. The positive question requires counterfactual reasoning ("would a future conversation that lacks this turn benefit?"); the pattern match requires recognition. If Haiku's recall drops because the new question is interpreted conservatively, the change is a regression even if it is theoretically cleaner.

Mitigation: gate the change behind the Layer 1 eval harness with an explicit non-regression bar. Methodology is a paired A/B over the same input set: run both prompts on the same conversation transcripts, compare the resulting fact sets to a labeled baseline, and ship only if the positive-criterion prompt is at least as good on p@1, p@3, and MRR. If recall drops, Sub 3 is paused, not shipped, and the exclusion-list approach is left in place with the failure documented in the sub-issue close-out.
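The non-regression bar can be sketched as below. The metric definitions are the textbook ones and are an assumption about how the Layer 1 harness computes them; the probe format (a ranked list of fact ids plus a set of relevant ids) and the names `evaluate` and `passes_gate` are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked ids that are relevant."""
    return sum(1 for fact_id in ranked[:k] if fact_id in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant id, or 0.0 if none is retrieved."""
    for rank, fact_id in enumerate(ranked, start=1):
        if fact_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(probes):
    """Average p@1, p@3, and MRR over probes given as (ranked_ids, relevant_ids)."""
    n = len(probes)
    return {
        "p@1": sum(precision_at_k(r, rel, 1) for r, rel in probes) / n,
        "p@3": sum(precision_at_k(r, rel, 3) for r, rel in probes) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in probes) / n,
    }


def passes_gate(baseline, candidate, tolerance=0.0):
    """True only if the candidate is at least as good as baseline on every metric."""
    return all(
        candidate[metric] >= baseline[metric] - tolerance
        for metric in ("p@1", "p@3", "mrr")
    )
```

In the paired A/B, `baseline` would come from the exclusion-list prompt and `candidate` from the positive-criterion prompt, both evaluated on the same probes. A non-zero `tolerance` is one way to express "within prior CI" if the harness reports one.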

Sub 4: janitor pass for unretrieved-fact decay

Background sweep (cron-driven, weekly initial cadence) that walks the Qdrant index and applies a decay condition to each fact: (age > N months) AND (retrieval_count == 0) triggers either a demotion (provenance class downgraded one notch, confidence multiplied by a decay factor) or a cull (delete from index). Demotion is the lower-risk default for the first version; cull is added as a separate operation behind a flag once demotion behavior is observed in production.

This requires a prerequisite that is missing today: per-fact retrieval-event logging. Every retrieval needs to increment a counter on the facts it returned (or at least on the facts that ranked above the prompt-injection cutoff). The counter lives in the fact metadata alongside provenance and confidence from Sub 1. This logging change may warrant being carved into a small dedicated sub-issue (Sub 4a) that lands before Sub 4 proper, so the janitor has data to act on by the time it ships.
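A sketch of what the counting prerequisite might look like, with increments buffered in memory and flushed in batches so that retrieval does not write to the index on every call (batching is one of the options named under Risks). `RetrievalCounter`, `flush_fn`, and `batch_size` are hypothetical; the flush callback stands in for whatever actually updates fact metadata.

```python
from collections import Counter


class RetrievalCounter:
    """Buffers per-fact retrieval increments and flushes them in batches."""

    def __init__(self, flush_fn, batch_size=100):
        self._pending = Counter()      # fact_id -> increments not yet flushed
        self._events = 0               # retrieval calls since the last flush
        self._flush_fn = flush_fn      # called with a {fact_id: increment} dict
        self._batch_size = batch_size

    def record(self, ranked_fact_ids, cutoff):
        """Count only facts ranked above the injection cutoff for one retrieval."""
        for fact_id in ranked_fact_ids[:cutoff]:
            self._pending[fact_id] += 1
        self._events += 1
        if self._events >= self._batch_size:
            self.flush()

    def flush(self):
        """Push buffered increments downstream; a no-op when nothing is pending."""
        if self._pending:
            self._flush_fn(dict(self._pending))
        self._pending.clear()
        self._events = 0
```

Buffered counts are lost if the process dies before a flush. For a decay signal measured in months that is probably tolerable, but the spec should say so explicitly.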

Interaction with Sub 1: provenance must modulate decay. A user_stated fact that has gone unretrieved for six months is probably still important context the operator told us once and may need again; the same age threshold on assistant_derived is much closer to garbage. Spec should set per-provenance decay parameters rather than a single global N.
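A sketch of the demotion path with per-provenance thresholds. Every number here is a placeholder, not a proposal: the age limits, the decay factor, and the demotion order (which assumes the same ranking as the retrieval weights) all need to be set in the spec. Facts are modeled as plain dicts with `created_at`, `provenance`, `confidence`, and `retrieval_count` keys, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Placeholder per-provenance age limits; the issue only says they should differ.
MAX_UNRETRIEVED_AGE = {
    "user_stated": timedelta(days=365),
    "episode_summary": timedelta(days=180),
    "assistant_derived": timedelta(days=90),
}
# Assumed "one notch down" order, highest quality first.
DEMOTION_ORDER = ["user_stated", "episode_summary", "assistant_derived"]
CONFIDENCE_DECAY = 0.8  # placeholder decay factor


def should_decay(fact, now):
    """(age > N for this provenance) AND (retrieval_count == 0)."""
    age = now - fact["created_at"]
    return age > MAX_UNRETRIEVED_AGE[fact["provenance"]] and fact["retrieval_count"] == 0


def demote(fact):
    """Return a copy one provenance notch lower with decayed confidence.

    A fact already in the lowest class keeps its class and only loses confidence.
    """
    idx = DEMOTION_ORDER.index(fact["provenance"])
    lower = DEMOTION_ORDER[min(idx + 1, len(DEMOTION_ORDER) - 1)]
    return dict(fact, provenance=lower, confidence=fact["confidence"] * CONFIDENCE_DECAY)


def janitor_pass(facts, now):
    """Demote every fact that meets the decay condition; leave the rest untouched."""
    return [demote(f) if should_decay(f, now) else f for f in facts]
```

One design question this surfaces: downgrading `provenance` overwrites the record of where a fact came from. If Sub 1 diagnostics need the original class, the spec may prefer a separate demotion field over rewriting provenance in place.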

Lowest urgency of the four because the index today is small enough that decay is not yet load-bearing; even an unbounded index would not cause noticeable retrieval problems for many months. Highest long-term value because it is the only mechanism in the epic that uses signal genuinely unavailable at write-time. Defer past Subs 1 to 3 unless intermediate review surfaces a reason to prioritize earlier (for instance, if the index grows faster than predicted after Sub 2 ships and we need decay to recover headroom).

Out of scope

  • Replacing Mem0 with a direct Qdrant integration. Adjacent topic and increasingly tractable, but a separate epic if it ever happens.
  • Retrieval-side query rewriting, HyDE, or multi-query expansion. Different lever, different epic.
  • Episode pipeline structural changes beyond what is needed to attach provenance correctly to episode-derived summaries.
  • Changing the embedding model. Independent decision with its own re-baseline cost.
  • Operator-tunable retrieval weights via /memory UI. Possible follow-up after Sub 1 lands and we have evidence the right weights vary per operator.

Acceptance

The epic closes when:

  • All four sub-issues are merged.
  • A clean Layer 1 eval-harness run shows non-regression on the 26-probe baseline (p@1, p@3, MRR within prior CI).
  • Production noise metrics (redundant-fact additions per week, sub-bar facts retrieved per turn if measurable) show measurable improvement against the pre-epic baseline, captured in a short follow-up note posted to the epic with the numbers.

Risks

  • Provenance tagging requires Mem0 to round-trip metadata correctly. We have done this before for source attribution; verify the path is still intact at spec time for Sub 1, and treat metadata loss as a Sub 1 blocker rather than a Sub 1 follow-up.
  • Similarity-gate threshold tuning is a per-deployment concern, not a one-time decision. The default we ship may not generalize to operators with different conversational styles. Document the tuning procedure in the wiki when Sub 2 ships and surface gate-trigger metrics in /memory so operators can see the gate's behavior on their own data.
  • Positive-criterion prompt may reduce extraction recall if Haiku interprets the question conservatively. The eval-harness gate is the safeguard; if recall drops, Sub 3 is paused and the exclusion lists are restored, with failure mode documented in the close-out.
  • Retrieval-count logging for Sub 4 introduces a write on every retrieval, which is a hot-path mutation. Performance impact must be measured before that sub-issue ships; if non-trivial, batch the increment or move to a sampling approach.
  • Provenance-weighted retrieval can mask extractor bugs. If assistant_derived facts are reliably wrong, downweighting them hides the bug rather than fixing it. The eval harness must keep monitoring raw extraction quality independent of the retrieval weighting, so we notice if the upstream signal degrades.

Labels: enhancement
