agentmemory-inspired improvements: diversify, scrub, dedup, bench, vectors+RRF #21
Open
arreyder wants to merge 8 commits into
Adds a generic session-diversify helper and wires it into both the broker packet builder (hardcoded cap=2) and search_memories (session_cap arg, default 3, 0 disables). Prevents one chatty session from starving other relevant context from top-K results. Closes #13. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
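A minimal sketch of what a generic per-session cap could look like. The `Hit` type, the `Diversify` name and signature, and the choice to group items with an empty session ID under one key are all assumptions for illustration, not the helper shipped in this commit.

```go
package main

import "fmt"

// Hit is a hypothetical search result; only SessionID matters for capping.
type Hit struct {
	ID        string
	SessionID string
}

// Diversify keeps at most limit items per key while preserving input order.
// A limit of 0 (or less) disables capping and returns the input unchanged.
func Diversify[T any](items []T, key func(T) string, limit int) []T {
	if limit <= 0 {
		return items
	}
	perKey := make(map[string]int)
	out := make([]T, 0, len(items))
	for _, it := range items {
		k := key(it)
		if perKey[k] >= limit {
			continue // this session already filled its quota
		}
		perKey[k]++
		out = append(out, it)
	}
	return out
}

func main() {
	hits := []Hit{{"a", "s1"}, {"b", "s1"}, {"c", "s1"}, {"d", "s2"}}
	bySession := func(h Hit) string { return h.SessionID }
	// With a cap of 2, the third s1 hit is dropped and s2 moves up.
	for _, h := range Diversify(hits, bySession, 2) {
		fmt.Println(h.ID, h.SessionID)
	}
}
```

Because the input is already ranked, dropping over-quota items in a single pass keeps the relative order of everything that survives.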
Adds internal/privacy.Scrub with patterns for AWS keys, GitHub tokens, Anthropic/OpenAI keys, Slack tokens, bearer tokens, RSA/EC private-key blocks, URLs with embedded credentials, and <private>/<secret> tag blocks. Matches are replaced with [REDACTED:<kind>] markers and the tally is merged into the memory's metadata as scrub_count/scrub_kinds. Wired into store_memory, bulk_store_memories, update_memory, and observe_work. Can be disabled with SOLR_MEM_PRIVACY_SCRUB=off for trusted corpora. Closes #15. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
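A rough sketch of the redact-and-tally idea, using three illustrative patterns. The regular expressions, the kind names, and the `Scrub` signature here are assumptions; the actual expressions in internal/privacy are not shown in this PR text.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule pairs a redaction kind with the expression that detects it.
type rule struct {
	kind string
	re   *regexp.Regexp
}

// rules is an illustrative subset; these expressions are assumptions, not
// the patterns shipped in internal/privacy.
var rules = []rule{
	{"aws_access_key", regexp.MustCompile(`\bAKIA[0-9A-Z]{16}\b`)},
	{"url_credentials", regexp.MustCompile(`https?://[^/\s:@]+:[^/\s@]+@\S+`)},
	{"private_tag", regexp.MustCompile(`(?s)<private>.*?</private>`)},
}

// Scrub replaces every match with a [REDACTED:<kind>] marker and returns the
// cleaned text, the total number of redactions, and the kinds that fired.
func Scrub(text string) (string, int, []string) {
	count := 0
	var kinds []string
	for _, r := range rules {
		n := len(r.re.FindAllStringIndex(text, -1))
		if n == 0 {
			continue
		}
		count += n
		kinds = append(kinds, r.kind)
		text = r.re.ReplaceAllLiteralString(text, "[REDACTED:"+r.kind+"]")
	}
	return text, count, kinds
}

func main() {
	in := "key=AKIAIOSFODNN7EXAMPLE <private>home address</private>"
	out, n, kinds := Scrub(in)
	fmt.Println(out)      // key=[REDACTED:aws_access_key] [REDACTED:private_tag]
	fmt.Println(n, kinds) // 2 [aws_access_key private_tag]
}
```

The returned count and kinds correspond to what the commit describes merging into metadata as scrub_count and scrub_kinds.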
Adds a stable SHA-256 hash over normalized title+content+tags (whitespace-collapsed, tag-order-agnostic, case-normalized), stored in a new content_hash field on the memories collection. On store_memory and bulk_store_memories, if a Solr lookup finds a memory with the same hash from the last N seconds (default 300), the insert is skipped and the existing ID is returned. on_duplicate=merge bumps updated_at on the existing doc; on_duplicate=force bypasses the check. Existing memories without a content_hash are never matched, so no backfill is required; they simply can't be deduped against until touched again. Closes #14. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
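One way the normalization and hashing might be put together, as a sketch. The `ContentHash` name, lowercasing as the case normalization, and the 0x1f field separator are assumptions; the commit only states which properties the hash has.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// normalize collapses runs of whitespace to single spaces and lowercases.
func normalize(s string) string {
	return strings.ToLower(strings.Join(strings.Fields(s), " "))
}

// ContentHash returns a hex SHA-256 over normalized title, content, and tags.
// Tags are normalized and sorted so their order does not affect the hash.
func ContentHash(title, content string, tags []string) string {
	norm := make([]string, 0, len(tags))
	for _, t := range tags {
		if n := normalize(t); n != "" {
			norm = append(norm, n)
		}
	}
	sort.Strings(norm)

	h := sha256.New()
	parts := []string{normalize(title), normalize(content), strings.Join(norm, ",")}
	for _, p := range parts {
		h.Write([]byte(p))
		h.Write([]byte{0x1f}) // unit separator between fields (an assumption)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := ContentHash("Hello   World", "some\ncontent", []string{"go", "Solr"})
	b := ContentHash("hello world", "Some content", []string{"solr", "GO"})
	fmt.Println(a == b) // true: whitespace, case, and tag order are ignored
}
```

Separating the fields before hashing avoids two different title/content splits of the same text producing the same digest.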
New binary at cmd/solr-mem-bench that seeds a namespaced bench-* corpus into any memories collection (safe to run against a live one — only touches bench-* IDs), runs a shipped query set with gold labels, and reports R@1/R@3/R@5/R@10 and MRR plus a per-query breakdown as Markdown. Ships 30-doc synthetic corpus + 25 queries covering easy keyword lookups and harder semantic paraphrases. The paraphrase queries are where BM25 is expected to struggle, giving us a baseline to measure hybrid retrieval (#16 + #17) against. New Makefile target: make bench. 11 unit tests cover R@K, MRR, and Aggregate; no live Solr required for the test suite. Closes #20. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
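For reference, a sketch of how the reported metrics can be computed for one query and then averaged. The function names are hypothetical, and R@K is taken here to mean the fraction of gold documents found in the top K; the harness may define it differently (for example as a hit rate).

```go
package main

import "fmt"

// RecallAtK returns the fraction of gold IDs that appear in the top k of
// ranked. It assumes ranked contains no duplicate IDs.
func RecallAtK(ranked []string, gold map[string]bool, k int) float64 {
	if len(gold) == 0 || k <= 0 {
		return 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	found := 0
	for _, id := range ranked[:k] {
		if gold[id] {
			found++
		}
	}
	return float64(found) / float64(len(gold))
}

// ReciprocalRank returns 1/rank of the first gold ID (1-based), or 0 if no
// gold ID was retrieved.
func ReciprocalRank(ranked []string, gold map[string]bool) float64 {
	for i, id := range ranked {
		if gold[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

// MeanReciprocalRank averages per-query reciprocal ranks.
func MeanReciprocalRank(rr []float64) float64 {
	if len(rr) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range rr {
		sum += v
	}
	return sum / float64(len(rr))
}

func main() {
	gold := map[string]bool{"g1": true, "g2": true}
	ranked := []string{"x", "g1", "y", "g2"}
	fmt.Println(RecallAtK(ranked, gold, 1))  // 0
	fmt.Println(RecallAtK(ranked, gold, 3))  // 0.5
	fmt.Println(ReciprocalRank(ranked, gold)) // 0.5
}
```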
Foundation for hybrid retrieval:
- internal/retrieval: Reciprocal Rank Fusion over ranked ID streams with configurable k (default 60), per-stream weights, and top-K cap. 10 unit tests covering agreement boost, weighting, tie-break, and cross-stream behavior.
- internal/embed: pluggable embedding Provider interface with an OpenAI text-embedding-3-small implementation. FromEnv() returns nil when no API key is set; callers treat nil as "embeddings disabled" rather than erroring. 7 unit tests (httptest-based, no network).
- solr/managed-schema.xml: new knn_vector field type (1536 dim cosine, matching OpenAI default) and an embedding field on memories. Existing docs don't need reindexing; they simply won't show up in vector search. BM25 still works for all docs.
- store_memory / bulk_store_memories compute embeddings at write time when a provider is configured. Failures log and fall through to a vector-less write; they never fail the overall store.

Query-side KNN integration and RRF wiring in search_memories are scoped to a follow-up commit on this branch. This commit ships only the pieces that are safely inert without the query-side changes.

Addresses #16 (pt. 1) and #17.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
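A compact sketch of weighted Reciprocal Rank Fusion, where each document scores the sum of weight / (k + rank) over the streams it appears in (rank is 1-based). The `Stream` type, the `Fuse` signature, treating a zero weight as 1, and breaking ties by ID are assumptions, not the internal/retrieval API.

```go
package main

import (
	"fmt"
	"sort"
)

// Stream is one ranked list of document IDs, best first. IDs are assumed to
// be unique within a stream.
type Stream struct {
	Weight float64 // 0 is treated as 1 (an assumption of this sketch)
	IDs    []string
}

// Fuse merges streams with Reciprocal Rank Fusion and returns up to topK IDs
// ordered by fused score. topK <= 0 returns every ID.
func Fuse(streams []Stream, k float64, topK int) []string {
	scores := make(map[string]float64)
	for _, s := range streams {
		w := s.Weight
		if w == 0 {
			w = 1
		}
		for i, id := range s.IDs {
			scores[id] += w / (k + float64(i+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b] // deterministic tie-break by ID
	})
	if topK > 0 && len(ids) > topK {
		ids = ids[:topK]
	}
	return ids
}

func main() {
	bm25 := Stream{IDs: []string{"a", "b", "c"}}
	knn := Stream{IDs: []string{"b", "d", "a"}}
	// "b" and "a" appear in both streams, so they outrank "d" and "c".
	fmt.Println(Fuse([]Stream{bm25, knn}, 60, 3)) // [b a d]
}
```

Because only ranks are used, BM25 scores and cosine similarities never need to be put on a common scale.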
search_memories now runs BM25 and KNN in parallel when an embedding provider is configured, then fuses with RRF (default k=60). Filters (agent_id, tags, session_id, etc.) are shared across both streams so semantic hits respect the same scoping as keyword hits. By default the KNN pool is widened to 3× the user limit to give fusion more overlap.

New args on search_memories:
- semantic: bool, default true when a provider is configured; false forces BM25-only
- knn_topk: int, size of the KNN pool before fusion

Session diversification (cap N per session_id) runs after fusion, so the post-RRF output still respects the per-session cap.

Existing BM25-only behavior is unchanged when no provider is configured or semantic=false. Embed failures log and fall through to BM25, so semantic search only ever adds to keyword results and is never a regression risk. Existing docs without an embedding won't appear in KNN hits but still appear in BM25.

Closes #17. Completes the query-side half of #16 (backfill tooling for existing memories is a separate issue).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
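The parallel-query and fall-through behavior could be structured roughly as follows. `searchFn`, `HybridCandidates`, and `errEmbed` are hypothetical names; the two search functions stand in for the real Solr BM25 and KNN calls, and the returned ID lists would then go to RRF and session diversification.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync"
)

// searchFn is a hypothetical stand-in for one retrieval stream.
type searchFn func(query string, limit int) ([]string, error)

// errEmbed is a placeholder error used by the demo below.
var errEmbed = errors.New("embed failed")

// HybridCandidates runs bm25 and knn concurrently and returns both ranked ID
// lists. A nil knn, or a KNN-side error, degrades to BM25-only results.
func HybridCandidates(query string, limit, knnTopK int, bm25, knn searchFn) (bm25IDs, knnIDs []string, err error) {
	if knn == nil { // no provider configured, or semantic=false
		bm25IDs, err = bm25(query, limit)
		return bm25IDs, nil, err
	}
	if knnTopK <= 0 {
		knnTopK = 3 * limit // default pool: 3x the user limit
	}
	var wg sync.WaitGroup
	var knnErr error
	wg.Add(1)
	go func() {
		defer wg.Done()
		knnIDs, knnErr = knn(query, knnTopK)
	}()
	bm25IDs, err = bm25(query, limit)
	wg.Wait() // knnIDs and knnErr are safe to read after this point
	if knnErr != nil {
		log.Printf("knn stream failed, falling back to BM25 only: %v", knnErr)
		knnIDs = nil
	}
	return bm25IDs, knnIDs, err
}

func main() {
	bm25 := func(q string, n int) ([]string, error) { return []string{"a", "b"}, nil }
	broken := func(q string, n int) ([]string, error) { return nil, errEmbed }
	b, k, err := HybridCandidates("solr tuning", 5, 0, bm25, broken)
	fmt.Println(b, k, err)
}
```

Only a BM25 failure is surfaced to the caller in this sketch, which mirrors the stated goal that the semantic path cannot make results worse than the keyword path alone.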
New cmd/solr-mem-backfill binary that scans memories missing the embedding field and writes vectors back via atomic update. Needed so existing memories (written before this branch's vector support) show up in KNN hits after the PR merges; without it, semantic search silently ignores pre-existing docs.

Flow:
- Query "-_exists_:embedding" (paginated, default 50 per batch)
- Embed title+content via the configured provider (OPENAI_API_KEY)
- Atomic-update the doc with the new embedding field
- Track a per-run "seen" set so failed embeds can't cause infinite loops (failed docs never get the field written, so they'd otherwise keep re-appearing in the query)

Options: -batch-size, -concurrency (default 4 parallel embed calls), -dry-run, -force (re-embed existing), -max-docs, -pause-ms (rate headroom for bursty providers).

Makefile: make backfill / make backfill-dry targets, BACKFILL_URL override.

Tests: 7 unit tests using httptest as a fake Solr and a stub embed provider. Cover happy path, dry-run, max-docs cap, error resilience (proves the seen-set prevents infinite loops on persistent failures), and early termination. No live Solr or API key required.
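The role of the seen set can be illustrated with a stripped-down loop. This is a sequential sketch under assumptions: `Doc`, `Backfill`, and the three callbacks are hypothetical, `fetch` is assumed to return still-missing docs in a stable order, and the failure count is used as a paging offset; concurrency, -dry-run, -max-docs, and pauses are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Doc is a hypothetical minimal view of a memory that lacks an embedding.
type Doc struct{ ID, Text string }

// errEmbed is a placeholder error for the demo provider.
var errEmbed = errors.New("embed failed")

// Backfill embeds and writes every doc returned by fetch until a batch holds
// nothing new. fetch(offset, n) must return docs that still lack an
// embedding, in a stable order. Docs that fail stay in that result set, so
// they are skipped via the offset and never retried thanks to the seen set.
func Backfill(
	fetch func(offset, n int) []Doc,
	embed func(text string) ([]float32, error),
	write func(id string, vec []float32) error,
	batch int,
) (ok, failed int) {
	seen := make(map[string]bool)
	for {
		docs := fetch(failed, batch)
		fresh := 0
		for _, d := range docs {
			if seen[d.ID] {
				continue // already attempted in this run
			}
			seen[d.ID] = true
			fresh++
			vec, err := embed(d.Text)
			if err == nil {
				err = write(d.ID, vec)
			}
			if err != nil {
				failed++
				continue
			}
			ok++
		}
		if fresh == 0 {
			return ok, failed // nothing new: done, even if failures remain
		}
	}
}

func main() {
	missing := []string{"a", "b", "c"} // IDs still lacking an embedding
	fetch := func(offset, n int) []Doc {
		var out []Doc
		for i := offset; i < len(missing) && len(out) < n; i++ {
			out = append(out, Doc{ID: missing[i], Text: missing[i]})
		}
		return out
	}
	embed := func(text string) ([]float32, error) {
		if text == "b" {
			return nil, errEmbed // persistent failure
		}
		return []float32{1}, nil
	}
	write := func(id string, vec []float32) error {
		for i, m := range missing {
			if m == id {
				missing = append(missing[:i], missing[i+1:]...)
				break
			}
		}
		return nil
	}
	fmt.Println(Backfill(fetch, embed, write, 2)) // 2 1
}
```

Without the seen set (or an equivalent offset), doc "b" would be returned by every query and the loop would never reach an empty batch.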
New provider hitting Ollama's /api/embeddings endpoint — local, free, runs well on Apple Silicon. Discovers its output dimension from the first successful call rather than hardcoding (different models produce different dims: nomic-embed-text=768, mxbai-embed-large=1024, all-minilm=384). FromEnv now detects OLLAMA_EMBEDDING_URL first, then OPENAI_API_KEY. Local/free wins over remote/paid when both are configured so an accidentally-present OpenAI key doesn't silently incur charges. Server startup log distinguishes known-dim (OpenAI) from inferred-dim (Ollama) cases. Schema comment now lists the common dims per provider so users know which vectorDimension to set before deploying — this is the one static choice in the whole pipeline. 8 new unit tests (httptest-based): happy path, model override, trailing-slash normalization, empty input, error status, the quirky-200-with-empty-embedding case (Ollama returns this when a model isn't pulled), and FromEnv precedence.
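The precedence rule and first-call dimension discovery might look like this in outline. Everything here is a sketch: `providerFromEnv` takes a lookup function for testability (the real FromEnv() takes no arguments), and the `Provider` methods, struct fields, and `recordDim` helper are assumptions. No HTTP calls are made.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// Provider is a pared-down, hypothetical view of an embedding provider.
type Provider interface {
	Name() string
	Dim() int // 0 means "not known yet"
}

type openAI struct{ apiKey string }

func (*openAI) Name() string { return "openai" }
func (*openAI) Dim() int     { return 1536 } // text-embedding-3-small default

type ollama struct {
	baseURL string
	dim     int // learned from the first successful embedding
}

func (*ollama) Name() string  { return "ollama" }
func (o *ollama) Dim() int    { return o.dim }

// recordDim stores the vector length on first use and rejects the
// 200-with-empty-embedding response. Not safe for concurrent use as written.
func (o *ollama) recordDim(vec []float32) error {
	if len(vec) == 0 {
		return errors.New("ollama returned an empty embedding (is the model pulled?)")
	}
	if o.dim == 0 {
		o.dim = len(vec)
	}
	return nil
}

// providerFromEnv prefers the local Ollama endpoint over an OpenAI key and
// returns nil when neither is set, meaning embeddings are disabled.
func providerFromEnv(getenv func(string) string) Provider {
	if url := strings.TrimRight(getenv("OLLAMA_EMBEDDING_URL"), "/"); url != "" {
		return &ollama{baseURL: url}
	}
	if key := getenv("OPENAI_API_KEY"); key != "" {
		return &openAI{apiKey: key}
	}
	return nil
}

func main() {
	p := providerFromEnv(os.Getenv)
	if p == nil {
		fmt.Println("embeddings disabled")
		return
	}
	if p.Dim() == 0 {
		fmt.Println(p.Name(), "dimension inferred on first call")
		return
	}
	fmt.Println(p.Name(), "dimension", p.Dim())
}
```

Whatever dimension the provider ends up producing still has to match the vectorDimension configured in the Solr schema, which is why the commit calls it the one static choice in the pipeline.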
Stacked improvements inspired by the agentmemory project. Each commit is independently reviewable.
Commits
1. Cap results per session (#13)
2. Scrub secrets before storage (#15)
3. Dedup memories by content-hash (#14)
4. Retrieval benchmark harness (#20)
5. Hybrid retrieval foundation (#16 pt. 1 + #17 foundation)
6. Wire KNN + RRF into search_memories (#17 closed, #16 query path)
7. Backfill existing memories (#16 completed)
Schema migrations
Re-upload the configset to Solr before running. Existing documents retain BM25 behavior; run `make backfill` (after setting OPENAI_API_KEY) to populate their embedding field so they become visible to KNN.
Runtime configuration
Deploy sequence
Test plan