agentmemory-inspired improvements: diversify, scrub, dedup, bench, vectors+RRF by arreyder · Pull Request #21 · arreyder/solr-mem

arreyder · 2026-04-18T15:12:49Z

Stacked improvements inspired by the agentmemory project. Each commit is independently reviewable.

Commits

1. Cap results per session (#13)

New `diversifyBySession` helper caps per-session hits in ranked lists
Broker packet builder enforces cap=2 after scoring
`search_memories` gains `session_cap` arg (default 3, 0 disables)
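The capping behavior described above can be sketched as follows. This is a minimal illustration, not the helper from this PR: the `Hit` type, its fields, and the exact signature are assumptions made for the example.

```go
package main

import "fmt"

// Hit is a hypothetical ranked result; the real type in solr-mem may differ.
type Hit struct {
	ID        string
	SessionID string
}

// diversifyBySession keeps at most perSession hits for each session while
// preserving the incoming rank order. A value of 0 or less disables the cap,
// mirroring the session_cap=0 behavior described in the PR.
func diversifyBySession(hits []Hit, perSession int) []Hit {
	if perSession <= 0 {
		return hits
	}
	counts := make(map[string]int)
	out := make([]Hit, 0, len(hits))
	for _, h := range hits {
		if counts[h.SessionID] >= perSession {
			continue // this session already used its quota
		}
		counts[h.SessionID]++
		out = append(out, h)
	}
	return out
}

func main() {
	ranked := []Hit{
		{ID: "a1", SessionID: "s1"},
		{ID: "a2", SessionID: "s1"},
		{ID: "a3", SessionID: "s1"},
		{ID: "b1", SessionID: "s2"},
	}
	for _, h := range diversifyBySession(ranked, 2) {
		fmt.Println(h.ID, h.SessionID)
	}
}
```

Because the filter runs over an already ranked list, lower-ranked hits from other sessions move up to fill the slots freed by the cap.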

2. Scrub secrets before storage (#15)

New `internal/privacy` package with patterns for AWS/GitHub/Anthropic/OpenAI/Slack tokens, bearer tokens, RSA/EC private keys, URLs with embedded credentials, and `<private>`/`<secret>` tag blocks
Matches replaced with `[REDACTED:<kind>]`; tally merged into doc metadata as `scrub_count`/`scrub_kinds`
Wired into all write paths; opt-out via `SOLR_MEM_PRIVACY_SCRUB=off`

3. Dedup memories by content-hash (#14)

New `content_hash` schema field; SHA-256 over normalized title+content+tags
`dedup_window_seconds` (default 300) and `on_duplicate` (skip/merge/force) on `store_memory` and `bulk_store_memories`

4. Retrieval benchmark harness (#20)

`cmd/solr-mem-bench` seeds a namespaced `bench-*` corpus (safe against live collections) and reports R@K + MRR
30-doc synthetic corpus + 25 queries, `make bench` for end-to-end runs

5. Hybrid retrieval foundation (#16 pt. 1 + #17 foundation)

`internal/retrieval`: Reciprocal Rank Fusion, configurable k/weights/top-K
`internal/embed`: pluggable Provider interface + OpenAI text-embedding-3-small
Schema: `knn_vector` field type + `embedding` field (1536 dim cosine)
Write-time embedding in `store_memory`/`bulk_store_memories` when provider configured

6. Wire KNN + RRF into search_memories (#17 closed, #16 query path)

BM25 and KNN run in parallel; RRF fuses them
New args: `semantic`, `knn_topk`
Filters shared across streams; session-diversification runs post-fusion
Embed failures soft-fall-through to BM25

7. Backfill existing memories (#16 completed)

`cmd/solr-mem-backfill` scans memories missing the embedding field, computes vectors, writes back via atomic update
Required post-merge so pre-existing memories appear in KNN hits
`make backfill` / `make backfill-dry`, bounded concurrency, resumable

Schema migrations

`content_hash` (commit 3)
`knn_vector` field type + `embedding` field (commit 5)

Re-upload the configset to Solr before running. Existing documents retain BM25 behavior; run `make backfill` (after setting OPENAI_API_KEY) to populate their embedding field so they become visible to KNN.

Runtime configuration

`SOLR_MEM_PRIVACY_SCRUB=off` — disable secret scrubbing (default on)
`OPENAI_API_KEY` — enable embedding-powered semantic search (default off / BM25-only)
`OPENAI_EMBEDDING_MODEL` — override embedding model (default text-embedding-3-small)

Deploy sequence

Merge PR
Re-upload `solr/managed-schema.xml` to the memories configset and reload
Set `OPENAI_API_KEY` on the server (optional — BM25 works without it)
Run `make backfill` to embed pre-existing memories
Re-run `make bench` to measure the lift

Test plan

`go build ./...` clean
70 unit tests across 6 packages, all pass (`go test ./...`)
Manual: `make bench` baseline (BM25 only), then again with `OPENAI_API_KEY` set after backfill — compare R@K
Manual: `store_memory` a duplicate within 300s → confirm skip
Manual: `store_memory` with a fake GitHub token → confirm metadata records scrub
Manual: `search_memories` with `semantic=false` → confirm BM25-only path still works
Manual: `make backfill-dry` to confirm counts match expectations before live run

Adds a generic session-diversify helper and wires it into both the broker packet builder (hardcoded cap=2) and search_memories (session_cap arg, default 3, 0 disables). Prevents one chatty session from crowding other relevant context out of the top-K results. Closes #13. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds internal/privacy.Scrub with patterns for AWS keys, GitHub tokens, Anthropic/OpenAI keys, Slack tokens, bearer tokens, RSA/EC private-key blocks, URLs with embedded credentials, and <private>/<secret> tag blocks. Matches are replaced with [REDACTED:<kind>] markers and the tally is merged into the memory's metadata as scrub_count/scrub_kinds. Wired into store_memory, bulk_store_memories, update_memory, and observe_work. Can be disabled with SOLR_MEM_PRIVACY_SCRUB=off for trusted corpora. Closes #15. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
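The replace-and-tally flow above might look roughly like this. It is a sketch under stated assumptions: the two regexes are simplified illustrations of well-known token shapes rather than the patterns shipped in `internal/privacy`, and the kind labels and the `scrub` signature are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// pattern pairs a kind label with a regex. Both entries below are simplified
// stand-ins, not the rules from internal/privacy.
type pattern struct {
	kind string
	re   *regexp.Regexp
}

var patterns = []pattern{
	{kind: "aws_access_key", re: regexp.MustCompile(`\bAKIA[0-9A-Z]{16}\b`)},
	{kind: "github_token", re: regexp.MustCompile(`\bghp_[A-Za-z0-9]{36}\b`)},
}

// scrub replaces every match with a [REDACTED:<kind>] marker and returns the
// cleaned text together with a per-kind count of replacements.
func scrub(text string) (string, map[string]int) {
	tally := make(map[string]int)
	for _, p := range patterns {
		kind := p.kind
		text = p.re.ReplaceAllStringFunc(text, func(string) string {
			tally[kind]++
			return "[REDACTED:" + kind + "]"
		})
	}
	return text, tally
}

func main() {
	// AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
	out, tally := scrub("export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE")
	fmt.Println(out)
	fmt.Println(tally)
}
```

The tally map is the kind of data that could feed the `scrub_count`/`scrub_kinds` metadata fields mentioned above.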

Adds a stable SHA-256 hash over normalized title+content+tags (whitespace-collapsed, tag-order-agnostic, case-normalized), stored in a new content_hash field on the memories collection. On store_memory and bulk_store_memories, if a Solr lookup finds a doc with the same hash from the last N seconds (default 300), the insert is skipped and the existing ID is returned. on_duplicate=merge bumps updated_at on the existing doc; on_duplicate=force bypasses the check. Existing memories without a content_hash are never matched, so no backfill is required; they simply can't be deduped against until touched again. Closes #14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
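A hash with the three normalization properties listed above could be computed along these lines. The function names, the field order, and the NUL separator between fields are assumptions for illustration; the actual implementation may combine the fields differently and therefore produce different digests.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// normalize lowercases s and collapses every whitespace run to one space.
func normalize(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), " ")
}

// contentHash returns a hex SHA-256 over normalized title, content, and tags.
// Tags are sorted so their order does not affect the result. The "\x00"
// separator is an assumption that keeps field boundaries unambiguous.
func contentHash(title, content string, tags []string) string {
	norm := make([]string, len(tags))
	for i, t := range tags {
		norm[i] = normalize(t)
	}
	sort.Strings(norm)
	parts := append([]string{normalize(title), normalize(content)}, norm...)
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x00")))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := contentHash("Deploy  Notes", "Run make\tbench", []string{"Ops", "solr"})
	b := contentHash("deploy notes", " run make bench ", []string{"solr", "ops"})
	fmt.Println(a == b) // both inputs normalize to the same fields
}
```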

New binary at cmd/solr-mem-bench that seeds a namespaced bench-* corpus into any memories collection (safe to run against a live one — only touches bench-* IDs), runs a shipped query set with gold labels, and reports R@1/R@3/R@5/R@10 and MRR plus a per-query breakdown as Markdown. Ships 30-doc synthetic corpus + 25 queries covering easy keyword lookups and harder semantic paraphrases. The paraphrase queries are where BM25 is expected to struggle, giving us a baseline to measure hybrid retrieval (#16 + #17) against. New Makefile target: make bench. 11 unit tests cover R@K, MRR, and Aggregate; no live Solr required for the test suite. Closes #20. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
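For readers unfamiliar with the metrics, here is a small sketch of recall at K and reciprocal rank for a single query. The function names and the gold-label representation are assumptions, not the harness's API; MRR is the mean of the reciprocal rank across all queries.

```go
package main

import "fmt"

// recallAtK returns the fraction of gold IDs found in the top k results.
// It assumes ranked contains no duplicate IDs.
func recallAtK(ranked []string, gold map[string]bool, k int) float64 {
	if len(gold) == 0 || k <= 0 {
		return 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	hits := 0
	for _, id := range ranked[:k] {
		if gold[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(gold))
}

// reciprocalRank returns 1/rank of the first gold ID (rank is 1-based),
// or 0 when no gold ID appears in the ranking.
func reciprocalRank(ranked []string, gold map[string]bool) float64 {
	for i, id := range ranked {
		if gold[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

func main() {
	ranked := []string{"d3", "d1", "d7"}
	gold := map[string]bool{"d1": true}
	fmt.Println(recallAtK(ranked, gold, 1), recallAtK(ranked, gold, 3))
	fmt.Println(reciprocalRank(ranked, gold))
}
```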

Foundation for hybrid retrieval:

- internal/retrieval: Reciprocal Rank Fusion over ranked ID streams with configurable k (default 60), per-stream weights, and top-K cap. 10 unit tests covering agreement boost, weighting, tie-break, and cross-stream behavior.
- internal/embed: pluggable embedding Provider interface with an OpenAI text-embedding-3-small implementation. FromEnv() returns nil when no API key is set; callers treat nil as "embeddings disabled" rather than erroring. 7 unit tests (httptest-based, no network).
- solr/managed-schema.xml: new knn_vector field type (1536 dim cosine, matching OpenAI default) and an embedding field on memories. Existing docs don't need reindexing; they simply won't show up in vector search. BM25 still works for all docs.
- store_memory / bulk_store_memories compute embeddings at write time when a provider is configured. Failures log and fall through to a vector-less write; they never fail the overall store.

Query-side KNN integration and RRF wiring in search_memories are scoped to a follow-up commit on this branch. This commit ships only the pieces that are safely inert without the query-side changes. Addresses #16 (pt. 1) and #17. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
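The fusion step can be sketched with the standard RRF formula, where each stream contributes weight / (k + rank) for every ID it ranks. The `fuse` signature and the tie-break by ID are assumptions for this example; the PR mentions tie-break tests but does not state the rule.

```go
package main

import (
	"fmt"
	"sort"
)

// fuse merges ranked ID streams with Reciprocal Rank Fusion. Each stream adds
// weight/(k+rank) to an ID's score, with rank starting at 1. Streams without
// a matching weight default to 1.0. Ties are broken by ID for determinism,
// which is an assumption made for this sketch.
func fuse(streams [][]string, weights []float64, k float64, topK int) []string {
	scores := make(map[string]float64)
	for s, stream := range streams {
		w := 1.0
		if s < len(weights) {
			w = weights[s]
		}
		for i, id := range stream {
			scores[id] += w / (k + float64(i+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b]
	})
	if topK > 0 && len(ids) > topK {
		ids = ids[:topK]
	}
	return ids
}

func main() {
	bm25 := []string{"a", "b", "c"}
	knn := []string{"b", "d", "a"}
	fmt.Println(fuse([][]string{bm25, knn}, nil, 60, 3))
}
```

IDs that appear in both streams accumulate two contributions, which is the agreement boost: here `b` and `a` outrank `d` and `c`, each of which appears in only one stream.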

search_memories now runs BM25 and KNN in parallel when an embedding provider is configured, then fuses with RRF (default k=60). Filters (agent_id, tags, session_id, etc.) are shared across both streams so semantic hits respect the same scoping as keyword hits. The KNN pool defaults to 3× the user limit to give fusion more overlap.

New args on search_memories:

- semantic: bool, default true when a provider is configured; false forces BM25-only
- knn_topk: int, size of the KNN pool before fusion

Session-diversification (cap N per session_id) runs after fusion, so the post-RRF output still respects the per-session cap. Existing BM25-only behavior is unchanged when no provider is configured or semantic=false. Embed failures log and fall through to BM25, so semantic search is purely additive and never a regression risk. Existing docs without an embedding won't appear in KNN hits but still appear in BM25. Closes #17. Completes the query-side half of #16 (backfill tooling for existing memories is a separate issue). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
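As a rough illustration of the KNN side, the sketch below sizes the pool and renders a query for Solr's `{!knn}` query parser. How this PR actually builds its request is not shown in the description, so `knnPoolSize` and `knnQuery` are hypothetical helpers; only the `{!knn f=... topK=...}[...]` syntax comes from Solr itself.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// knnPoolSize returns the explicit knn_topk when set, otherwise 3x the user
// limit, matching the default described in the commit message.
func knnPoolSize(limit, knnTopK int) int {
	if knnTopK > 0 {
		return knnTopK
	}
	return 3 * limit
}

// knnQuery renders a query string for Solr's {!knn} parser against the given
// dense vector field.
func knnQuery(field string, vec []float32, topK int) string {
	parts := make([]string, len(vec))
	for i, v := range vec {
		parts[i] = strconv.FormatFloat(float64(v), 'g', -1, 32)
	}
	return fmt.Sprintf("{!knn f=%s topK=%d}[%s]", field, topK, strings.Join(parts, ","))
}

func main() {
	topK := knnPoolSize(10, 0)
	fmt.Println(knnQuery("embedding", []float32{0.5, -0.25, 1}, topK))
}
```

Filters such as agent_id or tags would travel as separate `fq` parameters on both the BM25 and KNN requests, which is one way to get the shared scoping the commit describes.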

New cmd/solr-mem-backfill binary that scans memories missing the embedding field and writes vectors back via atomic update. Needed so existing memories (written before this branch's vector support) show up in KNN hits after the PR merges; without it, semantic search silently ignores pre-existing docs.

Flow:

- Query "-_exists_:embedding" (paginated, default 50 per batch)
- Embed title+content via the configured provider (OPENAI_API_KEY)
- Atomic-update the doc with the new embedding field
- Track a per-run "seen" set so failed embeds can't cause infinite loops (failed docs never get the field written, so they'd otherwise keep re-appearing in the query)

Options: -batch-size, -concurrency (default 4 parallel embed calls), -dry-run, -force (re-embed existing), -max-docs, -pause-ms (rate headroom for bursty providers).

Makefile: make backfill / make backfill-dry targets, BACKFILL_URL override.

Tests: 7 unit tests using httptest as a fake Solr and a stub embed provider. Cover happy path, dry-run, max-docs cap, error resilience (proves the seen-set prevents infinite loops on persistent failures), and early termination. No live Solr or API key required.
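The role of the seen set can be shown with a stripped-down, single-threaded loop. Everything here is hypothetical: the `doc` type and the `fetch`, `embed`, and `write` callbacks stand in for the Solr query, the embedding provider, and the atomic update, and the real binary adds pagination, concurrency, and the flags listed above.

```go
package main

import (
	"errors"
	"fmt"
)

// doc is a hypothetical memory record with the text to embed.
type doc struct{ ID, Text string }

// errBoom simulates a persistent embedding failure in the demo below.
var errBoom = errors.New("embed failed")

// backfill repeatedly fetches docs that still lack an embedding, embeds each
// one, and writes the vector back. The seen set ensures a doc is attempted at
// most once per run; the loop stops when a fetch returns nothing new.
func backfill(
	fetch func() []doc,
	embed func(text string) ([]float32, error),
	write func(id string, vec []float32) error,
) (ok, failed int) {
	seen := make(map[string]bool)
	for {
		progressed := false
		for _, d := range fetch() {
			if seen[d.ID] {
				continue // failed earlier in this run; do not retry forever
			}
			seen[d.ID] = true
			progressed = true
			vec, err := embed(d.Text)
			if err == nil {
				err = write(d.ID, vec)
			}
			if err != nil {
				failed++
				continue
			}
			ok++
		}
		if !progressed {
			return ok, failed
		}
	}
}

func main() {
	embedded := map[string]bool{}
	ids := []string{"a", "bad", "b"}
	fetch := func() []doc {
		var out []doc
		for _, id := range ids {
			if !embedded[id] {
				out = append(out, doc{ID: id, Text: id})
			}
		}
		return out
	}
	embed := func(text string) ([]float32, error) {
		if text == "bad" {
			return nil, errBoom
		}
		return []float32{1}, nil
	}
	write := func(id string, _ []float32) error { embedded[id] = true; return nil }
	fmt.Println(backfill(fetch, embed, write))
}
```

One limitation of this simplified version: if a whole page consists of previously failed docs, it stops even though later pages may hold unprocessed docs, so a real implementation would also need to page past them.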

New provider hitting Ollama's /api/embeddings endpoint — local, free, runs well on Apple Silicon. Discovers its output dimension from the first successful call rather than hardcoding (different models produce different dims: nomic-embed-text=768, mxbai-embed-large=1024, all-minilm=384). FromEnv now detects OLLAMA_EMBEDDING_URL first, then OPENAI_API_KEY. Local/free wins over remote/paid when both are configured so an accidentally-present OpenAI key doesn't silently incur charges. Server startup log distinguishes known-dim (OpenAI) from inferred-dim (Ollama) cases. Schema comment now lists the common dims per provider so users know which vectorDimension to set before deploying — this is the one static choice in the whole pipeline. 8 new unit tests (httptest-based): happy path, model override, trailing-slash normalization, empty input, error status, the quirky-200-with-empty-embedding case (Ollama returns this when a model isn't pulled), and FromEnv precedence.
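The provider precedence described above reduces to a short check. This sketch returns a provider name instead of a real `Provider` value, and `pickProvider` is a hypothetical stand-in for `FromEnv`; only the two environment variable names and their ordering come from the commit message.

```go
package main

import (
	"fmt"
	"os"
)

// pickProvider applies the described precedence: a configured Ollama URL wins
// over an OpenAI key, and an empty result means embeddings are disabled.
// getenv is injected so the logic can be exercised without touching the
// process environment.
func pickProvider(getenv func(string) string) string {
	if getenv("OLLAMA_EMBEDDING_URL") != "" {
		return "ollama" // local and free takes priority
	}
	if getenv("OPENAI_API_KEY") != "" {
		return "openai"
	}
	return "" // BM25-only
}

func main() {
	name := pickProvider(os.Getenv)
	if name == "" {
		name = "none (BM25-only)"
	}
	fmt.Println("embedding provider:", name)
}
```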

arreyder and others added 2 commits April 18, 2026 10:12

arreyder changed the title ~~Cap results per session in broker packets and search_memories~~ agentmemory-inspired improvements: session-diversified ranking + secret scrubbing Apr 18, 2026

arreyder changed the title ~~agentmemory-inspired improvements: session-diversified ranking + secret scrubbing~~ agentmemory-inspired improvements: diversify, scrub, dedup Apr 18, 2026

arreyder changed the title ~~agentmemory-inspired improvements: diversify, scrub, dedup~~ agentmemory-inspired improvements: diversify, scrub, dedup, bench Apr 18, 2026

arreyder and others added 2 commits April 18, 2026 11:38

arreyder changed the title ~~agentmemory-inspired improvements: diversify, scrub, dedup, bench~~ agentmemory-inspired improvements: diversify, scrub, dedup, bench, vectors+RRF Apr 18, 2026

arreyder added 2 commits April 18, 2026 12:39

arreyder wants to merge 8 commits into main from feature/session-diversified-ranking
