
feat(recall): add min_score threshold and adaptive floor filtering (#73)#101

Merged
jack-arturo merged 6 commits into main from exp/73-min-score-threshold
Mar 2, 2026

Conversation

@jack-arturo
Member

Summary

  • Adds min_score query parameter to /recall for explicit noise filtering
  • Adds RECALL_MIN_SCORE env var (default 0.0) as server-wide default
  • Adds RECALL_ADAPTIVE_FLOOR (default true) which detects steep score dropoffs and prunes low-quality result tails automatically
  • Includes score_filter metadata in response when filtering is active
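
The bullets above can be sketched as a small pure function. This is an illustrative reconstruction of the filtering semantics, not the code merged in this PR; the `apply_min_score` name and the exact `score_filter` fields are assumptions based on the description.

```python
# Hypothetical sketch of the min_score semantics described above.
# Result dicts are assumed to carry a "final_score" field.

def apply_min_score(results, min_score):
    """Drop results scoring below min_score; report how many were removed."""
    if not min_score or min_score <= 0:
        return results, None  # filtering inactive, no score_filter metadata
    kept = [r for r in results if float(r.get("final_score", 0.0)) >= min_score]
    meta = {"min_score": min_score, "filtered_count": len(results) - len(kept)}
    return kept, meta

results = [{"id": "a", "final_score": 0.62}, {"id": "b", "final_score": 0.41}]
kept, meta = apply_min_score(results, 0.5)
# kept holds only "a"; meta reports one filtered result
```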

Benchmark Results

  • LoCoMo-mini (2 convos): baseline 76.97%, this PR 76.97%, delta +0.0%

No regression. The min_score infrastructure is ready but its impact depends on #78 (decay fix) to improve score differentiation. Current scores cluster tightly (0.35-0.39 range) so threshold filtering has limited effect until that's addressed.

Health Check

Score distribution remains DEGRADED (low variance) — expected, as this PR adds filtering infrastructure, not scoring improvements.

Test Plan

  • Unit tests pass (150/150; the 2 pre-existing env-related failures are unrelated to this change)
  • LoCoMo-mini benchmark: no regression
  • Direct API testing: min_score=0.5 correctly filters all results below threshold
  • Adaptive floor: tested with diverse queries, correctly detects steep dropoffs

Closes #73

Made with Cursor

- Add scripts/bench/health_check.py for recall health diagnostics
  (score distribution, entity quality, cross-query overlap, latency)
- Fix datetime tz-awareness bug in LoCoMo date matching
- Fix restore_and_eval.sh to produce .json result files
- Add bench-health Makefile target
- Record Voyage 4 baselines: locomo-mini 76.97%, locomo-full 80.06%
- Create benchmarks/EXPERIMENT_LOG.md for tracking experiments

Made-with: Cursor
- Add RECALL_MIN_SCORE config (default 0.0) and min_score query param
- Add RECALL_ADAPTIVE_FLOOR config (default true) for automatic steep
  dropoff detection that prunes low-quality result tails
- Apply filtering at both per-query and final assembly stages
- Include score_filter metadata in response when active
- No benchmark regression: locomo-mini 76.97% (unchanged)

The adaptive floor detects gaps >15% of max score in the top half of
results and cuts below the dropoff. Combined with min_score param,
clients can now control noise filtering. Full impact expected after
decay fix (#78) improves score differentiation.
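
The dropoff heuristic described above can be sketched as follows. This is an assumption-laden reconstruction of the "gap >15% of max score in the top half" rule, not the actual merged implementation; the function name and exact tie-breaking behavior are illustrative.

```python
# Sketch of the adaptive-floor heuristic: scan the top half of a
# descending score list for a gap larger than 15% of the max score,
# and cut everything below the first such dropoff.

def adaptive_floor(scores, gap_ratio=0.15):
    """Return the index to cut at, or None if no steep dropoff is found."""
    if len(scores) < 2:
        return None
    threshold = scores[0] * gap_ratio  # gap size relative to the max score
    half = max(1, len(scores) // 2)    # only inspect the top half of results
    for i in range(half):
        if scores[i] - scores[i + 1] > threshold:
            return i + 1  # keep scores[:i + 1], prune the tail
    return None

scores = [0.92, 0.90, 0.55, 0.52, 0.50, 0.11]
cut = adaptive_floor(scores)  # the 0.90 -> 0.55 gap exceeds 0.92 * 0.15
```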

Made-with: Cursor
@coderabbitai
Contributor

coderabbitai Bot commented Mar 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b43f24 and fcbe5c5.

📒 Files selected for processing (3)
  • benchmarks/EXPERIMENT_LOG.md
  • scripts/bench/health_check.py
  • tests/benchmarks/test_locomo.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/benchmarks/test_locomo.py
  • benchmarks/EXPERIMENT_LOG.md
  • scripts/bench/health_check.py

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • New Features

    • Added minimum score filtering and adaptive floor scoring logic to recall results for improved result quality control.
  • Bug Fixes

    • Fixed date comparison logic in benchmark tests with UTC normalization.
    • Improved percentile calculation robustness in health checks.
  • Chores

    • Added GitHub Actions workflow for documentation synchronization.
    • Updated .gitignore to exclude benchmark artifacts.
  • Documentation

    • Updated benchmark experiment log with revised test configuration.

Walkthrough

Implements minimum relevance score filtering and adaptive score floor detection for the recall API to eliminate low-quality results. Adds configuration constants, updates the recall pipeline to apply filtering during per-query and final result assembly, adds a GitHub Actions workflow for documentation updates, and updates benchmark infrastructure to support testing these changes.

Changes

Cohort / File(s) Summary
Recall Filtering Implementation
automem/config.py, automem/api/recall.py
Added RECALL_MIN_SCORE and RECALL_ADAPTIVE_FLOOR configuration constants. Implemented min_score parameter parsing and adaptive floor logic in recall pipeline, applying filtering during per-query processing and final result assembly; exposed score_filter metadata in responses.
Benchmark Infrastructure
benchmarks/EXPERIMENT_LOG.md, scripts/bench/health_check.py, tests/benchmarks/test_locomo.py, .gitignore
Updated locomo-mini benchmark test data (2 convos, 304 Qs). Refined percentile calculation and conditionals in health check. Fixed date comparison logic in test to use UTC-normalized timestamps. Added /benchmarks/results/ and benchmarks/baselines/locomo_baseline.json to gitignore.
CI/CD & Documentation Dispatch
.github/workflows/docs-dispatch.yml
Added new GitHub Actions workflow to monitor changes on main, determine affected documentation via file-mapping, and dispatch repository_dispatch event to verygoodplugins/automem-website with payload containing source repo, SHA, changed files, and comparison URLs.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: ✅ Passed. The title clearly and specifically describes the main features added: min_score threshold and adaptive floor filtering for recall, matching the core changes in the PR.
  • Description check: ✅ Passed. The description is well-related to the changeset, covering the added min_score parameter, environment variables, adaptive floor logic, benchmark results, and test coverage.
  • Linked Issues check: ✅ Passed. The PR comprehensively addresses issue #73: implements the min_score parameter with an env var default, adaptive floor detection for steep dropoffs, filtering in recall assembly, and score_filter metadata in responses.
  • Out of Scope Changes check: ✅ Passed. Changes to .github/workflows/docs-dispatch.yml, .gitignore, benchmark logs, and health check utilities are tangential but reasonable supporting updates; core changes directly address #73 requirements.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 80.00%, which meets the required 80.00% threshold.


@coderabbitai coderabbitai Bot added the enhancement New feature or request label Mar 2, 2026
@jack-arturo
Member Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Mar 2, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
automem/api/recall.py (1)

1450-1515: ⚠️ Potential issue | 🟠 Major

score_filter.filtered_count underreports filtering when adaptive floor is active.

The response currently reports filtered_count=0 for adaptive-floor-only filtering, and undercounts when both filters apply. This makes API metadata inaccurate.

🛠️ Suggested fix
-    # Apply explicit min_score on final assembled results (catches expansions)
-    pre_filter_count = len(results)
+    # Apply explicit min_score on final assembled results (catches expansions)
+    adaptive_filtered_count = 0
+    min_score_filtered_count = 0
+    pre_min_score_count = len(results)
     if min_score is not None and min_score > 0:
         results = [r for r in results if float(r.get("final_score", 0.0)) >= min_score]
+        min_score_filtered_count = pre_min_score_count - len(results)
@@
-    if min_score or score_floor_applied:
+    # If adaptive floor was applied earlier, capture its impact too.
+    if score_floor_applied is not None:
+        # Recompute relative to the list before min_score filtering.
+        adaptive_filtered_count = max(0, pre_min_score_count - len(results) - min_score_filtered_count)
+
+    if (min_score is not None and min_score > 0) or score_floor_applied is not None:
         response["score_filter"] = {
             "min_score": min_score,
             "adaptive_floor": score_floor_applied,
-            "filtered_count": pre_filter_count - len(results) if min_score else 0,
+            "filtered_count": adaptive_filtered_count + min_score_filtered_count,
+            "adaptive_filtered_count": adaptive_filtered_count,
+            "min_score_filtered_count": min_score_filtered_count,
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@automem/api/recall.py` around lines 1450 - 1515, The response's
score_filter.filtered_count currently sets filtered_count to 0 when only the
adaptive floor (score_floor_applied) removed results because it uses "if
min_score else 0"; update the logic so filtered_count always equals
pre_filter_count - len(results) whenever either min_score or score_floor_applied
is true. In other words, in recall.py where the response["score_filter"] is
built (symbols: score_filter, min_score, score_floor_applied, pre_filter_count,
results), replace the conditional expression for filtered_count with
filtered_count = pre_filter_count - len(results) so adaptive-floor-only and
combined filters are reported correctly.
🧹 Nitpick comments (3)
scripts/bench/health_check.py (3)

212-213: Silent exception swallowing hides failures.

As flagged by static analysis, the try-except-continue pattern silently drops errors without any indication. For a health check tool, logging or at least counting these failures would improve observability.

📝 Proposed fix to track failed queries
-        except Exception:
-            continue
+        except Exception as e:
+            # Track failure but continue checking other queries
+            query_results[query[:40]] = []  # Empty result indicates failure
+            continue

Or add a counter for failed queries in the return value.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/bench/health_check.py` around lines 212 - 213, The except Exception:
continue in health_check.py silently swallows errors — update the error handling
in the try/except block (the block catching Exception) to record failures
instead of just continuing: increment a failed_queries counter (or add it to the
function's return dict), and log the exception details using the existing logger
(e.g., logger.exception or logger.error with the exception) so failures are
observable; ensure the function signature/return value is updated to include the
failed count where callers can access it.

73-75: Consider logging the exception for debugging.

Catching a blind Exception and only storing str(e) loses the traceback. For a diagnostic tool, having more context in case of failures would be helpful for troubleshooting.

💡 Optional: Add logging for better diagnostics
+import logging
+
+logger = logging.getLogger(__name__)
+
 # In the exception handler:
         except Exception as e:
+            logger.debug("Query failed: %s", query, exc_info=True)
             per_query.append({"query": query, "error": str(e)})
             continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/bench/health_check.py` around lines 73 - 75, The except block that
appends per_query.append({"query": query, "error": str(e)}) discards traceback
info; add structured logging there by ensuring a logger exists (e.g., logger =
logging.getLogger(__name__) at module top) and call logger.exception(...) or
logger.error(..., exc_info=True) inside the except to record the full traceback
alongside keeping the existing per_query entry (update the except in the
function that iterates queries to call logger.exception(f"Error executing query:
{query}") or similar).

14-14: Unused import: math

The math module is imported but never used in this file.

🧹 Remove unused import
 import argparse
 import json
-import math
 import os
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/bench/health_check.py` at line 14, Remove the unused import of the
math module by deleting the "import math" statement at the top of the file;
verify there are no remaining references to math in functions or variables
(e.g., any health check helpers) and run the linter/formatter to ensure the
change is clean.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/docs-dispatch.yml:
- Around line 30-35: The MAP assignment uses command substitution and then
checks $? (MAP=$(curl -sf ...); if [ $? -ne 0 ] ...), which is unreachable under
GitHub Actions' default errexit; change the flow so the curl invocation is
executed inside a conditional (i.e., use an if/then that runs curl into MAP and
handles the failure branch directly) instead of relying on checking $?, and
remove the separate if [ $? -ne 0 ] check; update references to MAP and the
failure-handling branch to still echo the skip message and set affected=none
when curl fails.

In `@benchmarks/EXPERIMENT_LOG.md`:
- Around line 13-23: The LoCoMo-mini question count is inconsistent: update
either the tier entry "`locomo-mini` (2 convos, 198 Qs)" or the Results cell
"76.97% (234/304)" so both show the same question total; specifically, edit the
two occurrences— the tier line containing `locomo-mini` and the Results row cell
under "LoCoMo-mini" (currently "76.97% (234/304)")—to use the correct,
reconciled question count and adjust the numerator/percentage if needed so the
percentage and raw counts match the chosen total.

In `@scripts/bench/health_check.py`:
- Line 124: The p95_ms calculation can index past the end for small samples;
replace the direct index int(len(latencies)*0.95) with a clamped index, e.g.
compute idx = min(len(latencies) - 1, int(len(latencies) * 0.95)) (keeping the
existing guard for empty lists) and use sorted(latencies)[idx] for "p95_ms" so
the code using the latencies variable never raises IndexError; update the
expression where "p95_ms" is set to use this clamped index.

In `@tests/benchmarks/test_locomo.py`:
- Around line 578-584: The comparison currently strips timezone info via
q_date.replace(tzinfo=None) and m_date.replace(tzinfo=None) which removes
offsets without converting; instead, normalize both q_date and m_date to UTC
first (e.g., q_date.astimezone(datetime.timezone.utc) and
m_date.astimezone(datetime.timezone.utc)) and then drop tzinfo before computing
days_diff so equivalent instants in different zones compare correctly; update
the loop handling of question_dates, memory_dates, q_date and m_date accordingly
while keeping the tolerance_days comparison logic unchanged.


ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 80a6f93 and 82b1c3d.

📒 Files selected for processing (11)
  • .github/workflows/docs-dispatch.yml
  • .gitignore
  • Makefile
  • automem/api/recall.py
  • automem/config.py
  • benchmarks/EXPERIMENT_LOG.md
  • benchmarks/baselines/health_baseline.json
  • benchmarks/baselines/locomo-mini_baseline.json
  • scripts/bench/health_check.py
  • scripts/bench/restore_and_eval.sh
  • tests/benchmarks/test_locomo.py

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
scripts/bench/health_check.py (2)

315-317: Consider passing token as parameter instead of mutating global state.

Using global API_TOKEN works but is a code smell. For a standalone CLI script this is acceptable, but if the module grows or is imported elsewhere, passing the token through function parameters would be cleaner.

This is a minor observation—the current approach is fine for this utility script's scope.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/bench/health_check.py` around lines 315 - 317, The current code sets
a module-level API_TOKEN by mutating the global variable API_TOKEN when
args.api_token is present; instead, refactor to pass the token as an explicit
parameter to the functions that need it (e.g., the functions that currently read
API_TOKEN such as health_check(), make_request(), or any helper that uses
API_TOKEN), remove the global assignment (delete the global API_TOKEN =
args.api_token block), and update the CLI entrypoint (where args are parsed) to
call those functions with args.api_token (or None) so token state is passed
explicitly rather than relying on global mutation.

221-222: Silent exception swallowing reduces observability.

Unlike check_score_distribution which records errors in per_query, this function silently continues when a request fails. Consider logging the exception or tracking failed queries in the output for debugging.

🔧 Suggested improvement for error visibility
+    failed_queries: List[str] = []
     for query in DIVERSE_QUERIES[:3]:
         try:
             resp = requests.get(
                 f"{base_url}/recall",
                 params={"query": query, "limit": 5},
                 headers=get_headers(),
                 timeout=30,
             )
             resp.raise_for_status()
             data = resp.json()
-        except Exception:
+        except Exception as e:
+            failed_queries.append(f"{query[:40]}: {e}")
             continue

Then include "failed_queries": failed_queries in the returned dict.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/bench/health_check.py` around lines 221 - 222, The except Exception:
block that currently just does "continue" should record and surface failures:
catch the exception, log it (e.g., logging.exception or logger.error with
traceback) and append the current query identifier to a failed_queries list
(create it if missing) instead of silently skipping; also ensure the function
returns the failed_queries list by adding "failed_queries": failed_queries to
the output dict, mirroring how check_score_distribution records errors.
Reference the existing except Exception: block and the failed_queries variable
name to locate where to add logging and appending and update the returned dict
accordingly.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82b1c3d and 7b43f24.

📒 Files selected for processing (3)
  • .github/workflows/docs-dispatch.yml
  • automem/api/recall.py
  • scripts/bench/health_check.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/docs-dispatch.yml
  • automem/api/recall.py

@jack-arturo
Member Author

Resolving CodeRabbit comment #2873293369

Addressed and implemented

Note: Addressed in fcbe5c5: updated Tier 1 LoCoMo-mini benchmark row to 304 questions so it matches the 234/304 result denominator.

Resolved via CodeRabbit MCP

@jack-arturo
Member Author

Resolving CodeRabbit comment #2873293381

Addressed and implemented

Note: Addressed in fcbe5c5: normalized extracted dates to UTC before comparison in match_dates_fuzzy and compute tolerance using total_seconds/86400.
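
The fix described in this note can be sketched as follows: convert both datetimes to UTC before comparing, and express the difference in days via total_seconds()/86400. The `within_tolerance` helper name is illustrative, not the actual test code.

```python
# Sketch of UTC-normalized fuzzy date matching (assumed helper, not the
# real match_dates_fuzzy): equivalent instants in different time zones
# must compare equal, so normalize to UTC before taking the difference.
from datetime import datetime, timedelta, timezone

def within_tolerance(q_date, m_date, tolerance_days=1):
    q = q_date.astimezone(timezone.utc)  # convert, don't just strip tzinfo
    m = m_date.astimezone(timezone.utc)
    days_diff = abs((q - m).total_seconds()) / 86400
    return days_diff <= tolerance_days

# 09:00 UTC-5 and 14:00 UTC on the same day are the same instant
a = datetime(2026, 3, 2, 9, 0, tzinfo=timezone(timedelta(hours=-5)))
b = datetime(2026, 3, 2, 14, 0, tzinfo=timezone.utc)
```

Stripping tzinfo with `.replace(tzinfo=None)` would have made `a` and `b` differ by five hours; converting first makes the difference zero.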

Resolved via CodeRabbit MCP

@jack-arturo
Member Author

Conversation resolved: Resolved by fcbe5c5: locomo-mini tier question count aligned with benchmark denominator.

Resolved via CodeRabbit MCP

@jack-arturo
Member Author

Conversation resolved: Resolved by fcbe5c5: date matching now normalizes parsed datetimes to UTC before tolerance comparison.

Resolved via CodeRabbit MCP

@jack-arturo
Member Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Mar 2, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@jack-arturo jack-arturo merged commit 8df3c08 into main Mar 2, 2026
7 checks passed
@jack-arturo jack-arturo deleted the exp/73-min-score-threshold branch March 2, 2026 16:34
jack-arturo added a commit that referenced this pull request Mar 2, 2026
🤖 I have created a release *beep* *boop*
---


## [0.13.0](v0.12.0...v0.13.0) (2026-03-02)


### Features

* **bench:** benchmark testing infrastructure for rapid iteration
([#97](#97))
([80a6f93](80a6f93))
* **recall:** add min_score threshold and adaptive floor filtering
([#73](#73))
([#101](#101))
([8df3c08](8df3c08))
* **viewer:** add standalone graph-viewer runtime files
([5bcb6db](5bcb6db))
* **viewer:** consolidate stable core and split-ready compatibility
([#94](#94))
([958da72](958da72))
* **viewer:** externalize visualizer with /viewer compatibility routes
([29bafcf](29bafcf))
* **viewer:** merge visualizer stable core branch
([96b27bf](96b27bf))


### Bug Fixes

* FalkorDB data not persisting across restarts
([3bbc834](3bbc834))
* FalkorDB data not persisting across restarts
([#99](#99))
([8490d36](8490d36))
* **mcp-sse:** sync tool schemas for SSE/MCP parity
([#104](#104))
([d99b86d](d99b86d))


### Documentation

* **bench:** add PR
[#73](#73),
[#80](#80), and
[#87](#87) experiment
results ([#103](#103))
([8533fac](8533fac))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

recall: Add minimum relevance score threshold to filter low-quality results
