agentmemory-inspired improvements: diversify, scrub, dedup, bench, vectors+RRF #21

Open
arreyder wants to merge 8 commits into main from feature/session-diversified-ranking

@arreyder commented Apr 18, 2026

Stacked improvements inspired by the agentmemory project. Each commit is independently reviewable.

Commits

1. Cap results per session (#13)

  • New `diversifyBySession` helper caps per-session hits in ranked lists
  • Broker packet builder enforces cap=2 after scoring
  • `search_memories` gains `session_cap` arg (default 3, 0 disables)
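
A minimal sketch of what a per-session cap could look like, assuming the ranked list is already sorted by score. The `Hit` type, the generic signature, and the handling of empty session IDs are illustrative assumptions, not the code in this PR.

```go
package main

import "fmt"

// Hit is a hypothetical ranked search result; the real type is not shown in this PR.
type Hit struct {
	ID        string
	SessionID string
	Score     float64
}

// diversifyBySession keeps at most perSession hits from any one session and
// preserves the incoming rank order. A cap <= 0 disables the filter.
// Assumption: hits with an empty session ID are treated as one shared session.
func diversifyBySession[T any](ranked []T, sessionOf func(T) string, perSession int) []T {
	if perSession <= 0 {
		return ranked
	}
	seen := make(map[string]int)
	out := make([]T, 0, len(ranked))
	for _, h := range ranked {
		s := sessionOf(h)
		if seen[s] >= perSession {
			continue // this session already filled its quota
		}
		seen[s]++
		out = append(out, h)
	}
	return out
}

func main() {
	ranked := []Hit{
		{"a", "s1", 9}, {"b", "s1", 8}, {"c", "s1", 7}, {"d", "s2", 6},
	}
	bySession := func(h Hit) string { return h.SessionID }
	for _, h := range diversifyBySession(ranked, bySession, 2) {
		fmt.Println(h.ID, h.SessionID)
	}
}
```

With a cap of 2, the third hit from `s1` is dropped and the `s2` hit moves up, which is the starvation case the cap is meant to prevent.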

2. Scrub secrets before storage (#15)

  • New `internal/privacy` package with patterns for AWS/GitHub/Anthropic/OpenAI/Slack tokens, bearer tokens, RSA/EC private keys, URLs with embedded credentials, and `<private>`/`<secret>` tag blocks
  • Matches replaced with `[REDACTED:<kind>]`; tally merged into doc metadata as `scrub_count`/`scrub_kinds`
  • Wired into all write paths; opt-out via `SOLR_MEM_PRIVACY_SCRUB=off`
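
As a rough illustration of the scrub-and-tally idea, the sketch below redacts two token shapes and returns a count plus the kinds that fired. The regular expressions, kind names, and function signature are assumptions made for the example; the patterns shipped in `internal/privacy` are not shown in this PR text.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// pattern pairs a redaction kind with a regex. Both expressions are
// illustrative approximations, not the shipped patterns.
type pattern struct {
	kind string
	re   *regexp.Regexp
}

var patterns = []pattern{
	{"aws_access_key", regexp.MustCompile(`\bAKIA[0-9A-Z]{16}\b`)},
	{"github_token", regexp.MustCompile(`\bghp_[A-Za-z0-9]{36}\b`)},
}

// scrub replaces every match with [REDACTED:<kind>] and returns the redacted
// text, the total number of replacements, and the sorted kinds that matched.
func scrub(text string) (string, int, []string) {
	count := 0
	var kinds []string
	for _, p := range patterns {
		n := len(p.re.FindAllStringIndex(text, -1))
		if n == 0 {
			continue
		}
		text = p.re.ReplaceAllLiteralString(text, "[REDACTED:"+p.kind+"]")
		count += n
		kinds = append(kinds, p.kind)
	}
	sort.Strings(kinds)
	return text, count, kinds
}

func main() {
	out, n, kinds := scrub("deploy with AKIAABCDEFGHIJKLMNOP today")
	fmt.Println(out)
	fmt.Println("scrub_count:", n, "scrub_kinds:", kinds)
}
```

The count and kinds returned here correspond to what the description stores as `scrub_count` and `scrub_kinds`.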

3. Dedup memories by content-hash (#14)

  • New `content_hash` schema field; SHA-256 over normalized title+content+tags
  • `dedup_window_seconds` (default 300) and `on_duplicate` (skip/merge/force) on `store_memory` and `bulk_store_memories`

4. Retrieval benchmark harness (#20)

  • `cmd/solr-mem-bench` seeds a namespaced `bench-*` corpus (safe against live collections) and reports R@K + MRR
  • 30-doc synthetic corpus + 25 queries, `make bench` for end-to-end runs

5. Hybrid retrieval foundation (#16 pt. 1 + #17 foundation)

  • `internal/retrieval`: Reciprocal Rank Fusion, configurable k/weights/top-K
  • `internal/embed`: pluggable Provider interface + OpenAI text-embedding-3-small
  • Schema: `knn_vector` field type + `embedding` field (1536 dim cosine)
  • Write-time embedding in `store_memory`/`bulk_store_memories` when provider configured

6. Wire KNN + RRF into search_memories (#17 closed, #16 query path)

  • BM25 and KNN run in parallel; RRF fuses them
  • New args: `semantic`, `knn_topk`
  • Filters shared across streams; session-diversification runs post-fusion
  • Embed failures soft-fall-through to BM25

7. Backfill existing memories (#16 completed)

  • `cmd/solr-mem-backfill` scans memories missing the embedding field, computes vectors, writes back via atomic update
  • Required post-merge so pre-existing memories appear in KNN hits
  • `make backfill` / `make backfill-dry`, bounded concurrency, resumable

Schema migrations

  • `content_hash` (commit 3)
  • `knn_vector` field type + `embedding` field (commit 5)

Re-upload the configset to Solr before running. Existing documents retain BM25 behavior; run `make backfill` (after setting OPENAI_API_KEY) to populate their embedding field so they become visible to KNN.

Runtime configuration

  • `SOLR_MEM_PRIVACY_SCRUB=off` — disable secret scrubbing (default on)
  • `OPENAI_API_KEY` — enable embedding-powered semantic search (default off / BM25-only)
  • `OPENAI_EMBEDDING_MODEL` — override embedding model (default text-embedding-3-small)

Deploy sequence

  1. Merge PR
  2. Re-upload `solr/managed-schema.xml` to the memories configset and reload
  3. Set `OPENAI_API_KEY` on the server (optional — BM25 works without it)
  4. Run `make backfill` to embed pre-existing memories
  5. Re-run `make bench` to measure the lift

Test plan

  • `go build ./...` clean
  • 70 unit tests across 6 packages, all pass (`go test ./...`)
  • Manual: `make bench` baseline (BM25 only), then again with `OPENAI_API_KEY` set after backfill — compare R@K
  • Manual: `store_memory` a duplicate within 300s → confirm skip
  • Manual: `store_memory` with a fake GitHub token → confirm metadata records scrub
  • Manual: `search_memories` with `semantic=false` → confirm BM25-only path still works
  • Manual: `make backfill-dry` to confirm counts match expectations before live run

arreyder and others added 2 commits April 18, 2026 10:12
Adds a generic session-diversify helper and wires it into both the
broker packet builder (hardcoded cap=2) and search_memories
(session_cap arg, default 3, 0 disables). Prevents one chatty session
from starving other relevant context from top-K results.

Closes #13.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds internal/privacy.Scrub with patterns for AWS keys, GitHub tokens,
Anthropic/OpenAI keys, Slack tokens, bearer tokens, RSA/EC private-key
blocks, URLs with embedded credentials, and <private>/<secret> tag
blocks. Matches are replaced with [REDACTED:<kind>] markers and the
tally is merged into the memory's metadata as scrub_count/scrub_kinds.

Wired into store_memory, bulk_store_memories, update_memory, and
observe_work. Can be disabled with SOLR_MEM_PRIVACY_SCRUB=off for
trusted corpora.

Closes #15.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arreyder changed the title from "Cap results per session in broker packets and search_memories" to "agentmemory-inspired improvements: session-diversified ranking + secret scrubbing" on Apr 18, 2026
Adds a stable SHA-256 hash over normalized title+content+tags
(whitespace-collapsed, tag-order-agnostic, case-normalized), stored in
a new content_hash field on the memories collection.
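
One way the normalization described above could be implemented is sketched here. The field separator and the exact normalization order are assumptions; only the properties named in the commit message (whitespace-collapsed, tag-order-agnostic, case-normalized, SHA-256) come from the source.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// normalize lowercases s and collapses every whitespace run to one space.
func normalize(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), " ")
}

// contentHash returns a hex SHA-256 over normalized title, content, and tags.
// Tags are normalized and sorted so their order does not affect the hash.
// The 0x1f separator is an assumption that keeps field boundaries unambiguous.
func contentHash(title, content string, tags []string) string {
	norm := make([]string, 0, len(tags))
	for _, t := range tags {
		norm = append(norm, normalize(t))
	}
	sort.Strings(norm)
	parts := []string{normalize(title), normalize(content), strings.Join(norm, ",")}
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x1f")))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := contentHash("Deploy  Notes", "Restart the\tbroker", []string{"ops", "Solr"})
	b := contentHash("deploy notes", "restart the broker", []string{"solr", "OPS"})
	fmt.Println(a == b) // same memory modulo case, spacing, and tag order
}
```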

On store_memory and bulk_store_memories, a Solr lookup in the last N
seconds (default 300) with the same hash causes the insert to be
skipped and the existing ID returned. on_duplicate=merge bumps
updated_at on the existing doc; on_duplicate=force bypasses the check.
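
The lookup and the three `on_duplicate` modes can be sketched as a query builder plus a small decision function. The `created_at` field name, the `action` type, and treating an empty mode as `skip` are assumptions for illustration; the commit message only states the behavior.

```go
package main

import "fmt"

// action is a hypothetical result type for the dedup decision.
type action string

const (
	actInsert action = "insert"
	actSkip   action = "skip"
	actMerge  action = "merge"
)

// dedupQuery builds a Solr query for docs with the same hash inside the
// window, using Solr date math. The created_at field name is an assumption.
func dedupQuery(hash string, windowSeconds int) string {
	return fmt.Sprintf(`content_hash:"%s" AND created_at:[NOW-%dSECONDS TO NOW]`, hash, windowSeconds)
}

// decide maps on_duplicate and the lookup result to an action.
// existingID is empty when the lookup found nothing in the window.
func decide(onDuplicate, existingID string) (action, error) {
	switch onDuplicate {
	case "force":
		return actInsert, nil // bypass the check entirely
	case "", "skip", "merge":
		if existingID == "" {
			return actInsert, nil
		}
		if onDuplicate == "merge" {
			return actMerge, nil // caller bumps updated_at on existingID
		}
		return actSkip, nil // caller returns existingID without writing
	default:
		return "", fmt.Errorf("unknown on_duplicate value %q", onDuplicate)
	}
}

func main() {
	fmt.Println(dedupQuery("abc123", 300))
	act, _ := decide("merge", "mem-42")
	fmt.Println(act)
}
```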

Existing memories without a content_hash are never matched, so no
backfill is required — they simply can't be deduped against until
touched again.

Closes #14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arreyder changed the title from "agentmemory-inspired improvements: session-diversified ranking + secret scrubbing" to "agentmemory-inspired improvements: diversify, scrub, dedup" on Apr 18, 2026
New binary at cmd/solr-mem-bench that seeds a namespaced bench-* corpus
into any memories collection (safe to run against a live one — only
touches bench-* IDs), runs a shipped query set with gold labels, and
reports R@1/R@3/R@5/R@10 and MRR plus a per-query breakdown as
Markdown.
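
For readers unfamiliar with the metrics, here is a sketch of R@K and MRR over ranked ID lists. It assumes R@K is the fraction of gold IDs found in the top K; the harness may instead count a query as a hit when any gold ID appears, so treat the definition as an assumption. Function names are illustrative.

```go
package main

import "fmt"

// recallAtK returns the fraction of gold IDs that appear in the top k results.
func recallAtK(ranked []string, gold map[string]bool, k int) float64 {
	if len(gold) == 0 || k <= 0 {
		return 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	hits := 0
	for _, id := range ranked[:k] {
		if gold[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(gold))
}

// reciprocalRank returns 1/rank of the first gold hit (1-based), or 0 if none.
func reciprocalRank(ranked []string, gold map[string]bool) float64 {
	for i, id := range ranked {
		if gold[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

// meanReciprocalRank averages per-query reciprocal ranks.
func meanReciprocalRank(rrs []float64) float64 {
	if len(rrs) == 0 {
		return 0
	}
	sum := 0.0
	for _, r := range rrs {
		sum += r
	}
	return sum / float64(len(rrs))
}

func main() {
	ranked := []string{"x", "g1", "y", "g2"}
	gold := map[string]bool{"g1": true, "g2": true}
	fmt.Println(recallAtK(ranked, gold, 1), recallAtK(ranked, gold, 3), recallAtK(ranked, gold, 5))
	fmt.Println(reciprocalRank(ranked, gold))
}
```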

Ships 30-doc synthetic corpus + 25 queries covering easy keyword
lookups and harder semantic paraphrases. The paraphrase queries are
where BM25 is expected to struggle, giving us a baseline to measure
hybrid retrieval (#16 + #17) against.

New Makefile target: make bench. 11 unit tests cover R@K, MRR, and
Aggregate; no live Solr required for the test suite.

Closes #20.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arreyder arreyder changed the title agentmemory-inspired improvements: diversify, scrub, dedup agentmemory-inspired improvements: diversify, scrub, dedup, bench Apr 18, 2026
arreyder and others added 2 commits April 18, 2026 11:38
Foundation for hybrid retrieval:

- internal/retrieval: Reciprocal Rank Fusion over ranked ID streams
  with configurable k (default 60), per-stream weights, and top-K cap.
  10 unit tests covering agreement boost, weighting, tie-break, and
  cross-stream behavior.

- internal/embed: pluggable embedding Provider interface with an
  OpenAI text-embedding-3-small implementation. FromEnv() returns nil
  when no API key is set; callers treat nil as "embeddings disabled"
  rather than erroring. 7 unit tests (httptest-based, no network).

- solr/managed-schema.xml: new knn_vector field type (1536 dim cosine,
  matching OpenAI default) and an embedding field on memories.
  Existing docs don't need reindexing — they simply won't show up in
  vector search. BM25 still works for all docs.

- store_memory / bulk_store_memories compute embeddings at write time
  when a provider is configured. Failures log and fall through to a
  vector-less write; they never fail the overall store.

Query-side KNN integration and RRF wiring in search_memories are
scoped to a follow-up commit on this branch. This commit ships only
the pieces that are safely inert without the query-side changes.

Addresses #16 (pt. 1) and #17.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

search_memories now runs BM25 and KNN in parallel when an embedding
provider is configured, then fuses with RRF (default k=60). Filters
(agent_id, tags, session_id, etc.) are shared across both streams so
semantic hits respect the same scoping as keyword hits. The KNN pool
defaults to 3× the user limit to give fusion more overlap.
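
Reciprocal Rank Fusion scores each document as the weighted sum of 1/(k + rank) over the streams it appears in, with 1-based ranks. The sketch below shows that formula over two ID streams; the function name, the ID tie-break, and the assumption that each stream lists an ID at most once are illustrative and may differ from `internal/retrieval`.

```go
package main

import (
	"fmt"
	"sort"
)

// fuseRRF merges ranked ID streams with Reciprocal Rank Fusion.
// weights[i] scales stream i (missing weights default to 1); topK <= 0 keeps all.
// Ties are broken by ID for determinism, which is an assumption of this sketch.
func fuseRRF(streams [][]string, weights []float64, k float64, topK int) []string {
	scores := make(map[string]float64)
	for i, stream := range streams {
		w := 1.0
		if i < len(weights) {
			w = weights[i]
		}
		for rank, id := range stream {
			scores[id] += w / (k + float64(rank+1)) // ranks are 1-based
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b]
	})
	if topK > 0 && len(ids) > topK {
		ids = ids[:topK]
	}
	return ids
}

func main() {
	bm25 := []string{"a", "b", "c"}
	knn := []string{"b", "d", "a"}
	fmt.Println(fuseRRF([][]string{bm25, knn}, nil, 60, 3))
}
```

Documents that both streams agree on (`b` and `a` here) outrank documents that only one stream returns, which is the agreement boost the unit tests are described as covering.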

New args on search_memories:
- semantic: bool, default true when a provider is configured; false
  forces BM25-only
- knn_topk: int, size of the KNN pool before fusion

Session-diversification (cap N per session_id) runs after fusion, so
the post-RRF output still respects the per-session cap. Existing
BM25-only behavior is unchanged when no provider is configured or
semantic=false.

Embed failures log and fall through to BM25, so a failing provider
degrades a query to keyword search rather than breaking it. Existing docs
without an embedding won't appear in KNN hits but still appear in BM25.
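
The parallel fan-out with a soft fallback might be structured as follows. This is a sketch under assumptions: `searchFn`, `gatherStreams`, and the choice to treat a BM25 error as fatal are hypothetical, and the real handler's signatures are not shown in this PR text.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync"
)

// errEmbed simulates an embedding or KNN failure in the demo below.
var errEmbed = errors.New("embed failed")

// searchFn is a hypothetical stand-in for one retrieval stream.
type searchFn func() ([]string, error)

// gatherStreams runs BM25 and, when knn is non-nil, KNN concurrently.
// A KNN error is logged and dropped so the caller fuses BM25 alone;
// a BM25 error is returned because there is nothing left to fall back to.
func gatherStreams(bm25, knn searchFn) ([][]string, error) {
	var (
		wg            sync.WaitGroup
		bmIDs, knnIDs []string
		bmErr, knnErr error
	)
	wg.Add(1)
	go func() { defer wg.Done(); bmIDs, bmErr = bm25() }()
	if knn != nil {
		wg.Add(1)
		go func() { defer wg.Done(); knnIDs, knnErr = knn() }()
	}
	wg.Wait()
	if bmErr != nil {
		return nil, bmErr
	}
	streams := [][]string{bmIDs}
	if knn != nil {
		if knnErr != nil {
			log.Printf("semantic stream failed, using BM25 only: %v", knnErr)
		} else {
			streams = append(streams, knnIDs)
		}
	}
	return streams, nil
}

func main() {
	bm25 := func() ([]string, error) { return []string{"a", "b"}, nil }
	broken := func() ([]string, error) { return nil, errEmbed }
	streams, err := gatherStreams(bm25, broken)
	fmt.Println(len(streams), err)
}
```

Passing `nil` for the KNN stream models both `semantic=false` and the no-provider case.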

Closes #17. Completes the query-side half of #16 (backfill tooling
for existing memories is a separate issue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arreyder arreyder changed the title agentmemory-inspired improvements: diversify, scrub, dedup, bench agentmemory-inspired improvements: diversify, scrub, dedup, bench, vectors+RRF Apr 18, 2026
New cmd/solr-mem-backfill binary that scans memories missing the
embedding field and writes vectors back via atomic update. Needed so
existing memories (written before this branch's vector support) show
up in KNN hits after the PR merges; without it, semantic search
silently ignores pre-existing docs.

Flow:
- Query "-_exists_:embedding" (paginated, default 50 per batch)
- Embed title+content via the configured provider (OPENAI_API_KEY)
- Atomic-update the doc with the new embedding field
- Track a per-run "seen" set so failed embeds can't cause infinite
  loops (failed docs never get the field written, so they'd otherwise
  keep re-appearing in the query)
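
The termination argument behind the seen set can be shown with an in-memory stand-in for Solr. Everything here is a simplified assumption: `memStore`, the `backfill` signature, and stopping as soon as a batch contains no unseen doc. In particular, this simple stop rule would end early if persistent failures filled an entire batch, so the real tool may page differently.

```go
package main

import (
	"errors"
	"fmt"
)

var errEmbed = errors.New("embed failed")

// memStore is an in-memory stand-in for Solr plus the embed provider.
type memStore struct {
	missing []string        // IDs still lacking an embedding, in query order
	bad     map[string]bool // IDs whose embed call always fails
}

// fetch mimics re-running the "missing embedding" query with a row limit.
func (m *memStore) fetch(limit int) []string {
	if limit > len(m.missing) {
		limit = len(m.missing)
	}
	return append([]string(nil), m.missing[:limit]...)
}

// embed mimics embed + atomic update: on success the doc leaves the query.
func (m *memStore) embed(id string) error {
	if m.bad[id] {
		return errEmbed
	}
	kept := m.missing[:0]
	for _, x := range m.missing {
		if x != id {
			kept = append(kept, x)
		}
	}
	m.missing = kept
	return nil
}

// backfill processes batches until one contains no unseen doc. Failed docs
// stay in the query results, so the seen set is what guarantees termination.
func backfill(fetch func(int) []string, embed func(string) error, batch int) (done, failed int) {
	seen := make(map[string]bool)
	for {
		fresh := 0
		for _, id := range fetch(batch) {
			if seen[id] {
				continue
			}
			seen[id] = true
			fresh++
			if err := embed(id); err != nil {
				failed++
				continue
			}
			done++
		}
		if fresh == 0 {
			return done, failed
		}
	}
}

func main() {
	m := &memStore{missing: []string{"a", "b", "c"}, bad: map[string]bool{"b": true}}
	done, failed := backfill(m.fetch, m.embed, 2)
	fmt.Println("embedded:", done, "failed:", failed)
}
```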

Options: -batch-size, -concurrency (default 4 parallel embed calls),
-dry-run, -force (re-embed existing), -max-docs, -pause-ms (rate
headroom for bursty providers).

Makefile: make backfill / make backfill-dry targets, BACKFILL_URL
override.

Tests: 7 unit tests using httptest as a fake Solr and a stub embed
provider. Cover happy path, dry-run, max-docs cap, error resilience
(proves the seen-set prevents infinite loops on persistent failures),
and early termination. No live Solr or API key required.

New provider hitting Ollama's /api/embeddings endpoint: local,
free, runs well on Apple Silicon. Discovers its output dimension
from the first successful call rather than hardcoding (different
models produce different dims: nomic-embed-text=768,
mxbai-embed-large=1024, all-minilm=384).
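
Inferring the dimension from the first response, and rejecting an empty vector, could look like the sketch below. It assumes the response body has the shape `{"embedding": [...]}`; the function and error names are hypothetical, and the HTTP call itself is omitted.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// errEmptyEmbedding flags a 200 response that carries no vector.
var errEmptyEmbedding = errors.New("empty embedding in response (is the model pulled?)")

// embedResponse mirrors the assumed response shape: {"embedding": [...]}.
type embedResponse struct {
	Embedding []float32 `json:"embedding"`
}

// parseEmbedding decodes a response body and rejects empty vectors, since a
// successful status with an empty embedding is not a usable result.
func parseEmbedding(body []byte) ([]float32, error) {
	var r embedResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, fmt.Errorf("decode embedding response: %w", err)
	}
	if len(r.Embedding) == 0 {
		return nil, errEmptyEmbedding
	}
	return r.Embedding, nil
}

func main() {
	vec, err := parseEmbedding([]byte(`{"embedding":[0.1,0.2,0.3]}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The dimension is whatever the first successful call returns.
	fmt.Println("inferred dimension:", len(vec))
}
```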

FromEnv now detects OLLAMA_EMBEDDING_URL first, then OPENAI_API_KEY.
Local/free wins over remote/paid when both are configured so an
accidentally-present OpenAI key doesn't silently incur charges.
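
The precedence rule reads roughly like this in code. The `Provider` interface here is reduced to a single `Name` method and the concrete types are placeholders; only the environment variable names, the default model, the nil-means-disabled convention, and the Ollama-first ordering come from the PR text.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Provider is a reduced, hypothetical version of the embed.Provider interface.
type Provider interface{ Name() string }

type ollamaProvider struct{ url string }
type openAIProvider struct{ key, model string }

func (ollamaProvider) Name() string { return "ollama" }
func (openAIProvider) Name() string { return "openai" }

// fromEnv picks a provider from environment variables. Ollama wins when both
// are set; nil means embeddings are disabled and search stays BM25-only.
// getenv is injected so the function can be exercised without touching os.Environ.
func fromEnv(getenv func(string) string) Provider {
	if url := getenv("OLLAMA_EMBEDDING_URL"); url != "" {
		return ollamaProvider{url: strings.TrimRight(url, "/")}
	}
	if key := getenv("OPENAI_API_KEY"); key != "" {
		model := getenv("OPENAI_EMBEDDING_MODEL")
		if model == "" {
			model = "text-embedding-3-small"
		}
		return openAIProvider{key: key, model: model}
	}
	return nil
}

func main() {
	if p := fromEnv(os.Getenv); p != nil {
		fmt.Println("embeddings via", p.Name())
	} else {
		fmt.Println("embeddings disabled, BM25 only")
	}
}
```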

Server startup log distinguishes known-dim (OpenAI) from
inferred-dim (Ollama) cases. Schema comment now lists the common
dims per provider so users know which vectorDimension to set before
deploying — this is the one static choice in the whole pipeline.

8 new unit tests (httptest-based): happy path, model override,
trailing-slash normalization, empty input, error status, the
quirky-200-with-empty-embedding case (Ollama returns this when a
model isn't pulled), and FromEnv precedence.