feat(recall): add min_score threshold and adaptive floor filtering (#73) #101
jack-arturo merged 6 commits into main
Conversation
- Add scripts/bench/health_check.py for recall health diagnostics (score distribution, entity quality, cross-query overlap, latency)
- Fix datetime tz-awareness bug in LoCoMo date matching
- Fix restore_and_eval.sh to produce .json result files
- Add bench-health Makefile target
- Record Voyage 4 baselines: locomo-mini 76.97%, locomo-full 80.06%
- Create benchmarks/EXPERIMENT_LOG.md for tracking experiments

Made-with: Cursor
- Add RECALL_MIN_SCORE config (default 0.0) and min_score query param
- Add RECALL_ADAPTIVE_FLOOR config (default true) for automatic steep-dropoff detection that prunes low-quality result tails
- Apply filtering at both per-query and final assembly stages
- Include score_filter metadata in response when active
- No benchmark regression: locomo-mini 76.97% (unchanged)

The adaptive floor detects gaps >15% of max score in the top half of results and cuts below the dropoff (a sketch of the heuristic follows below). Combined with the min_score param, clients can now control noise filtering. Full impact is expected after the decay fix (#78) improves score differentiation.

Made-with: Cursor
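For illustration, a minimal sketch of the dropoff heuristic described above, assuming results arrive sorted by score descending; the function name and result shape are hypothetical, and the actual implementation in automem/api/recall.py may differ:

```python
# Illustrative sketch only: names and result shape are assumed, not taken
# from automem/api/recall.py.
def apply_adaptive_floor(results: list[dict]) -> list[dict]:
    """Cut the tail below the steepest dropoff (>15% of the max score)
    found in the top half of a descending-score result list."""
    if len(results) < 2:
        return results
    scores = [float(r.get("final_score", 0.0)) for r in results]
    max_score = scores[0]
    if max_score <= 0:
        return results
    # Only scan the top half so a naturally sparse tail does not
    # trigger the floor on its own.
    for i in range(1, len(scores) // 2 + 1):
        if scores[i - 1] - scores[i] > 0.15 * max_score:
            return results[:i]  # keep everything above the dropoff
    return results

# e.g. scores [0.90, 0.88, 0.60, 0.58]: the 0.88 -> 0.60 gap (0.28)
# exceeds 0.15 * 0.90 = 0.135, so only the top two results survive.
```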
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 Walkthrough

Summary by CodeRabbit
Release Notes

Walkthrough
Implements minimum relevance score filtering and adaptive score floor detection for the recall API to eliminate low-quality results. Adds configuration constants, updates the recall pipeline to apply filtering during per-query and final result assembly, adds a GitHub Actions workflow for documentation updates, and updates benchmark infrastructure to support testing these changes.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ Passed checks (5 passed)
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
automem/api/recall.py (1)
1450-1515: ⚠️ Potential issue | 🟠 Major
`score_filter.filtered_count` underreports filtering when adaptive floor is active. The response currently reports `filtered_count=0` for adaptive-floor-only filtering, and undercounts when both filters apply. This makes API metadata inaccurate.

🛠️ Suggested fix
```diff
-        # Apply explicit min_score on final assembled results (catches expansions)
-        pre_filter_count = len(results)
+        # Apply explicit min_score on final assembled results (catches expansions)
+        adaptive_filtered_count = 0
+        min_score_filtered_count = 0
+        pre_min_score_count = len(results)
         if min_score is not None and min_score > 0:
             results = [r for r in results if float(r.get("final_score", 0.0)) >= min_score]
+            min_score_filtered_count = pre_min_score_count - len(results)
@@
-        if min_score or score_floor_applied:
+        # If adaptive floor was applied earlier, capture its impact too.
+        if score_floor_applied is not None:
+            # Recompute relative to the list before min_score filtering.
+            adaptive_filtered_count = max(0, pre_min_score_count - len(results) - min_score_filtered_count)
+
+        if (min_score is not None and min_score > 0) or score_floor_applied is not None:
             response["score_filter"] = {
                 "min_score": min_score,
                 "adaptive_floor": score_floor_applied,
-                "filtered_count": pre_filter_count - len(results) if min_score else 0,
+                "filtered_count": adaptive_filtered_count + min_score_filtered_count,
+                "adaptive_filtered_count": adaptive_filtered_count,
+                "min_score_filtered_count": min_score_filtered_count,
             }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@automem/api/recall.py` around lines 1450 - 1515, The response's score_filter.filtered_count currently sets filtered_count to 0 when only the adaptive floor (score_floor_applied) removed results because it uses "if min_score else 0"; update the logic so filtered_count always equals pre_filter_count - len(results) whenever either min_score or score_floor_applied is true. In other words, in recall.py where the response["score_filter"] is built (symbols: score_filter, min_score, score_floor_applied, pre_filter_count, results), replace the conditional expression for filtered_count with filtered_count = pre_filter_count - len(results) so adaptive-floor-only and combined filters are reported correctly.
🧹 Nitpick comments (3)
scripts/bench/health_check.py (3)
212-213: Silent exception swallowing hides failures. As flagged by static analysis, the `try-except-continue` pattern silently drops errors without any indication. For a health check tool, logging or at least counting these failures would improve observability.

📝 Proposed fix to track failed queries
```diff
-        except Exception:
-            continue
+        except Exception as e:
+            # Track failure but continue checking other queries
+            query_results[query[:40]] = []  # Empty result indicates failure
+            continue
```

Or add a counter for failed queries in the return value.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/bench/health_check.py` around lines 212 - 213, The except Exception: continue in health_check.py silently swallows errors — update the error handling in the try/except block (the block catching Exception) to record failures instead of just continuing: increment a failed_queries counter (or add it to the function's return dict), and log the exception details using the existing logger (e.g., logger.exception or logger.error with the exception) so failures are observable; ensure the function signature/return value is updated to include the failed count where callers can access it.
73-75: Consider logging the exception for debugging. Catching a blind `Exception` and only storing `str(e)` loses the traceback. For a diagnostic tool, having more context in case of failures would be helpful for troubleshooting.

💡 Optional: Add logging for better diagnostics
```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 # In the exception handler:
 except Exception as e:
+    logger.debug("Query failed: %s", query, exc_info=True)
     per_query.append({"query": query, "error": str(e)})
     continue
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/bench/health_check.py` around lines 73 - 75, The except block that appends per_query.append({"query": query, "error": str(e)}) discards traceback info; add structured logging there by ensuring a logger exists (e.g., logger = logging.getLogger(__name__) at module top) and call logger.exception(...) or logger.error(..., exc_info=True) inside the except to record the full traceback alongside keeping the existing per_query entry (update the except in the function that iterates queries to call logger.exception(f"Error executing query: {query}") or similar).
14-14: Unused import: `math`. The `math` module is imported but never used in this file.

🧹 Remove unused import
```diff
 import argparse
 import json
-import math
 import os
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/bench/health_check.py` at line 14, Remove the unused import of the math module by deleting the "import math" statement at the top of the file; verify there are no remaining references to math in functions or variables (e.g., any health check helpers) and run the linter/formatter to ensure the change is clean.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/docs-dispatch.yml:
- Around line 30-35: The MAP assignment uses command substitution and then
checks $? (MAP=$(curl -sf ...); if [ $? -ne 0 ] ...), which is unreachable under
GitHub Actions' default errexit; change the flow so the curl invocation is
executed inside a conditional (i.e., use an if/then that runs curl into MAP and
handles the failure branch directly) instead of relying on checking $?, and
remove the separate if [ $? -ne 0 ] check; update references to MAP and the
failure-handling branch to still echo the skip message and set affected=none
when curl fails.
In `@benchmarks/EXPERIMENT_LOG.md`:
- Around line 13-23: The LoCoMo-mini question count is inconsistent: update
either the tier entry "`locomo-mini` (2 convos, 198 Qs)" or the Results cell
"76.97% (234/304)" so both show the same question total; specifically, edit the
two occurrences— the tier line containing `locomo-mini` and the Results row cell
under "LoCoMo-mini" (currently "76.97% (234/304)")—to use the correct,
reconciled question count and adjust the numerator/percentage if needed so the
percentage and raw counts match the chosen total.
In `@scripts/bench/health_check.py`:
- Line 124: The p95_ms calculation can index past the end for small samples;
replace the direct index int(len(latencies)*0.95) with a clamped index, e.g.
compute idx = min(len(latencies) - 1, int(len(latencies) * 0.95)) (keeping the
existing guard for empty lists) and use sorted(latencies)[idx] for "p95_ms" so
the code using the latencies variable never raises IndexError; update the
expression where "p95_ms" is set to use this clamped index.
In `@tests/benchmarks/test_locomo.py`:
- Around line 578-584: The comparison currently strips timezone info via
q_date.replace(tzinfo=None) and m_date.replace(tzinfo=None) which removes
offsets without converting; instead, normalize both q_date and m_date to UTC
first (e.g., q_date.astimezone(datetime.timezone.utc) and
m_date.astimezone(datetime.timezone.utc)) and then drop tzinfo before computing
days_diff so equivalent instants in different zones compare correctly; update
the loop handling of question_dates, memory_dates, q_date and m_date accordingly
while keeping the tolerance_days comparison logic unchanged.
---
Outside diff comments:
In `@automem/api/recall.py`:
- Around line 1450-1515: The response's score_filter.filtered_count currently
sets filtered_count to 0 when only the adaptive floor (score_floor_applied)
removed results because it uses "if min_score else 0"; update the logic so
filtered_count always equals pre_filter_count - len(results) whenever either
min_score or score_floor_applied is true. In other words, in recall.py where the
response["score_filter"] is built (symbols: score_filter, min_score,
score_floor_applied, pre_filter_count, results), replace the conditional
expression for filtered_count with filtered_count = pre_filter_count -
len(results) so adaptive-floor-only and combined filters are reported correctly.
---
Nitpick comments:
In `@scripts/bench/health_check.py`:
- Around line 212-213: The except Exception: continue in health_check.py
silently swallows errors — update the error handling in the try/except block
(the block catching Exception) to record failures instead of just continuing:
increment a failed_queries counter (or add it to the function's return dict),
and log the exception details using the existing logger (e.g., logger.exception
or logger.error with the exception) so failures are observable; ensure the
function signature/return value is updated to include the failed count where
callers can access it.
- Around line 73-75: The except block that appends per_query.append({"query":
query, "error": str(e)}) discards traceback info; add structured logging there
by ensuring a logger exists (e.g., logger = logging.getLogger(__name__) at
module top) and call logger.exception(...) or logger.error(..., exc_info=True)
inside the except to record the full traceback alongside keeping the existing
per_query entry (update the except in the function that iterates queries to call
logger.exception(f"Error executing query: {query}") or similar).
- Line 14: Remove the unused import of the math module by deleting the "import
math" statement at the top of the file; verify there are no remaining references
to math in functions or variables (e.g., any health check helpers) and run the
linter/formatter to ensure the change is clean.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- .github/workflows/docs-dispatch.yml
- .gitignore
- Makefile
- automem/api/recall.py
- automem/config.py
- benchmarks/EXPERIMENT_LOG.md
- benchmarks/baselines/health_baseline.json
- benchmarks/baselines/locomo-mini_baseline.json
- scripts/bench/health_check.py
- scripts/bench/restore_and_eval.sh
- tests/benchmarks/test_locomo.py
🧹 Nitpick comments (2)
scripts/bench/health_check.py (2)
315-317: Consider passing the token as a parameter instead of mutating global state. Using `global API_TOKEN` works but is a code smell. For a standalone CLI script this is acceptable, but if the module grows or is imported elsewhere, passing the token through function parameters would be cleaner (a sketch follows after the agent prompt below).

This is a minor observation; the current approach is fine for this utility script's scope.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/bench/health_check.py` around lines 315 - 317, The current code sets a module-level API_TOKEN by mutating the global variable API_TOKEN when args.api_token is present; instead, refactor to pass the token as an explicit parameter to the functions that need it (e.g., the functions that currently read API_TOKEN such as health_check(), make_request(), or any helper that uses API_TOKEN), remove the global assignment (delete the global API_TOKEN = args.api_token block), and update the CLI entrypoint (where args are parsed) to call those functions with args.api_token (or None) so token state is passed explicitly rather than relying on global mutation.
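For illustration, a minimal sketch of the parameter-passing refactor suggested above; the signatures and the default URL are assumptions, not the script's actual ones:

```python
# Sketch of threading the token through explicitly instead of mutating a
# module-level API_TOKEN. Signatures and the default URL are assumed.
import argparse
from typing import Optional

def get_headers(api_token: Optional[str]) -> dict:
    # Token arrives as an argument rather than via global state.
    return {"Authorization": f"Bearer {api_token}"} if api_token else {}

def check_score_distribution(base_url: str, api_token: Optional[str]) -> dict:
    headers = get_headers(api_token)
    ...  # issue requests with `headers` exactly as before
    return {}

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--base-url", default="http://localhost:8001")  # assumed default
    parser.add_argument("--api-token", default=None)
    args = parser.parse_args()
    check_score_distribution(args.base_url, args.api_token)
```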
221-222: Silent exception swallowing reduces observability. Unlike `check_score_distribution`, which records errors in `per_query`, this function silently continues when a request fails. Consider logging the exception or tracking failed queries in the output for debugging.

🔧 Suggested improvement for error visibility
```diff
+    failed_queries: List[str] = []
     for query in DIVERSE_QUERIES[:3]:
         try:
             resp = requests.get(
                 f"{base_url}/recall",
                 params={"query": query, "limit": 5},
                 headers=get_headers(),
                 timeout=30,
             )
             resp.raise_for_status()
             data = resp.json()
-        except Exception:
+        except Exception as e:
+            failed_queries.append(f"{query[:40]}: {e}")
             continue
```

Then include `"failed_queries": failed_queries` in the returned dict.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/bench/health_check.py` around lines 221 - 222, The except Exception: block that currently just does "continue" should record and surface failures: catch the exception, log it (e.g., logging.exception or logger.error with traceback) and append the current query identifier to a failed_queries list (create it if missing) instead of silently skipping; also ensure the function returns the failed_queries list by adding "failed_queries": failed_queries to the output dict, mirroring how check_score_distribution records errors. Reference the existing except Exception: block and the failed_queries variable name to locate where to add logging and appending and update the returned dict accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@scripts/bench/health_check.py`:
- Around line 315-317: The current code sets a module-level API_TOKEN by
mutating the global variable API_TOKEN when args.api_token is present; instead,
refactor to pass the token as an explicit parameter to the functions that need
it (e.g., the functions that currently read API_TOKEN such as health_check(),
make_request(), or any helper that uses API_TOKEN), remove the global assignment
(delete the global API_TOKEN = args.api_token block), and update the CLI
entrypoint (where args are parsed) to call those functions with args.api_token
(or None) so token state is passed explicitly rather than relying on global
mutation.
- Around line 221-222: The except Exception: block that currently just does
"continue" should record and surface failures: catch the exception, log it
(e.g., logging.exception or logger.error with traceback) and append the current
query identifier to a failed_queries list (create it if missing) instead of
silently skipping; also ensure the function returns the failed_queries list by
adding "failed_queries": failed_queries to the output dict, mirroring how
check_score_distribution records errors. Reference the existing except
Exception: block and the failed_queries variable name to locate where to add
logging and appending and update the returned dict accordingly.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .github/workflows/docs-dispatch.yml
- automem/api/recall.py
- scripts/bench/health_check.py
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/docs-dispatch.yml
- automem/api/recall.py
Resolving CodeRabbit comment #2873293369
✅ Addressed and implemented
Note: Addressed in fcbe5c5: updated Tier 1 LoCoMo-mini benchmark row to 304 questions so it matches the 234/304 result denominator.
Resolved via CodeRabbit MCP
Resolving CodeRabbit comment #2873293381
✅ Addressed and implemented
Note: Addressed in fcbe5c5: normalized extracted dates to UTC before comparison in match_dates_fuzzy and computed tolerance using total_seconds/86400 (a sketch of the normalization follows below).
Resolved via CodeRabbit MCP
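For illustration, a minimal standalone sketch of that normalization; the helper below is hypothetical and only mirrors the logic described in the resolution note, not the actual match_dates_fuzzy code:

```python
# Hypothetical version of the fix: convert both sides to UTC before
# dropping tzinfo, and express tolerance via total_seconds().
from datetime import datetime, timezone

def dates_match(q_date: datetime, m_date: datetime, tolerance_days: float = 1.0) -> bool:
    def to_naive_utc(dt: datetime) -> datetime:
        if dt.tzinfo is not None:
            dt = dt.astimezone(timezone.utc)  # convert the instant, don't just strip the offset
        return dt.replace(tzinfo=None)

    delta = to_naive_utc(q_date) - to_naive_utc(m_date)
    days_diff = abs(delta.total_seconds()) / 86400  # fractional days
    return days_diff <= tolerance_days
```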
Conversation resolved: Resolved by fcbe5c5: locomo-mini tier question count aligned with benchmark denominator.
Resolved via CodeRabbit MCP
Conversation resolved: Resolved by fcbe5c5: date matching now normalizes parsed datetimes to UTC before tolerance comparison.
Resolved via CodeRabbit MCP
@coderabbitai review
✅ Actions performed: Review triggered.
🤖 I have created a release *beep* *boop*

---

## [0.13.0](v0.12.0...v0.13.0) (2026-03-02)

### Features

* **bench:** benchmark testing infrastructure for rapid iteration ([#97](#97)) ([80a6f93](80a6f93))
* **recall:** add min_score threshold and adaptive floor filtering ([#73](#73)) ([#101](#101)) ([8df3c08](8df3c08))
* **viewer:** add standalone graph-viewer runtime files ([5bcb6db](5bcb6db))
* **viewer:** consolidate stable core and split-ready compatibility ([#94](#94)) ([958da72](958da72))
* **viewer:** externalize visualizer with /viewer compatibility routes ([29bafcf](29bafcf))
* **viewer:** merge visualizer stable core branch ([96b27bf](96b27bf))

### Bug Fixes

* FalkorDB data not persisting across restarts ([3bbc834](3bbc834))
* FalkorDB data not persisting across restarts ([#99](#99)) ([8490d36](8490d36))
* **mcp-sse:** sync tool schemas for SSE/MCP parity ([#104](#104)) ([d99b86d](d99b86d))

### Documentation

* **bench:** add PR [#73](#73), [#80](#80), and [#87](#87) experiment results ([#103](#103)) ([8533fac](8533fac))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Summary
- Adds `min_score` query parameter to `/recall` for explicit noise filtering
- Adds `RECALL_MIN_SCORE` env var (default `0.0`) as server-wide default
- Adds `RECALL_ADAPTIVE_FLOOR` (default `true`), which detects steep score dropoffs and prunes low-quality result tails automatically
- Includes `score_filter` metadata in response when filtering is active

A usage sketch follows below.
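For illustration, a request with an explicit threshold; the endpoint, parameter, and metadata field names come from this PR, while the host, port, and exact response layout are assumptions:

```python
# Hypothetical client call; /recall, min_score, and score_filter come from
# this PR, but the host/port and response layout are assumed.
import requests

resp = requests.get(
    "http://localhost:8001/recall",  # assumed host/port
    params={"query": "project deadlines", "limit": 10, "min_score": 0.5},
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
data = resp.json()
# When filtering is active, the response carries score_filter metadata, e.g.:
#   {"min_score": 0.5, "adaptive_floor": null, "filtered_count": 3}
print(data.get("score_filter"))
```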
Benchmark Results

No regression. The min_score infrastructure is ready, but its impact depends on #78 (decay fix) to improve score differentiation. Current scores cluster tightly (0.35-0.39 range), so threshold filtering has limited effect until that's addressed.
Health Check
Score distribution remains DEGRADED (low variance). This is expected, as this PR adds filtering infrastructure, not scoring improvements.
Test Plan
- `min_score=0.5` correctly filters all results below the threshold (a test sketch follows below)

Closes #73
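For illustration, that check as a pytest-style sketch; the `client` fixture and response keys are assumptions, not the repo's actual test harness:

```python
# Hypothetical pytest sketch; the `client` fixture and response keys are
# assumed, not taken from the repo's test suite.
def test_min_score_filters_low_scores(client):
    resp = client.get("/recall", query_string={"query": "anything", "min_score": 0.5})
    data = resp.get_json()
    # Every surviving result sits at or above the threshold...
    assert all(float(r.get("final_score", 0.0)) >= 0.5 for r in data["results"])
    # ...and the response reports that filtering was active.
    assert data["score_filter"]["min_score"] == 0.5
```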
Made with Cursor