
fix(benchmarks): correct evaluator bugs, add agent guidelines, establish honest baseline #117

Merged

jack-arturo merged 6 commits into main from docs/benchmark-agent-guidelines on Mar 10, 2026
Conversation

@jack-arturo
Member

Summary

  • AGENTS.md: Added benchmarking section so all agents (Cursor, Codex, Claude) know about the tiered bench system, key commands, and the A/B workflow for recall changes.
  • Temporal evaluator fix: `match_dates_fuzzy` was comparing question dates against memory dates, but temporal questions ("When did X happen?") have no dates in the question; the date is in the expected answer. Also added `session_datetime` decomposition into matchable word tokens for Strategy 2. Temporal: 22.2% → 92.1%.
  • Category 5 skip: Complex Reasoning was scoring 100% because the dataset has no `answer` field (only `adversarial_answer`), so `"" in <any string>` was always True. Now skipped and reported as N/A (requires LLM judge per LoCoMo paper).
  • README updated: Replaced inflated 90.53% with honest 89.27% (cats 1-4). Added note explaining evaluator bugs and pointing to EXPERIMENT_LOG.md.
  • Category Breakdown table added to EXPERIMENT_LOG.md so per-category regressions are visible at a glance.

Honest baseline (locomo-mini, Voyage 4, cats 1-4)

| Category    | Before        | After  |
|-------------|---------------|--------|
| Overall     | 76.97% (fake) | 89.27% |
| Single-hop  | 76.7%         | 79.1%  |
| Temporal    | 22.2% (bug)   | 92.1%  |
| Multi-hop   | 46.2%         | 46.2%  |
| Open Domain | 96.5%         | 96.5%  |
| Complex     | 100% (bug)    | N/A    |

Test plan

  • `make bench-eval BENCH=locomo-mini` passes with correct scores (89.27%)
  • Category 5 shows N/A (71 skipped)
  • No regressions in categories 1-4
  • Pre-commit hooks pass (black, isort, flake8, conventional commit)

Made with Cursor

The new snapshot-based benchmark system (PR #97) wasn't documented in
AGENTS.md, causing AI agents to miss the tiered eval pipeline and
reference stale Oct 2025 numbers instead of current Voyage 4 baselines.

Adds: tiered benchmark table, key commands, recall/retrieval change
workflow, and directory layout. Updates project structure to reference
benchmarks/ and scripts/bench/.

Made-with: Cursor
Re-ran locomo-mini on current main (795368a) after merging #73, #78,
#115, #116. Score unchanged at 76.97% — baseline is stable. This
serves as B₀ for the upcoming relation tier refactor.

Made-with: Cursor
Two bugs in check_answer_in_memories caused 22% temporal accuracy
(should be 92%):

1. match_dates_fuzzy compared question dates vs memory dates, but
   temporal questions ("When did X happen?") contain no dates — the
   date is in the expected answer. Now compares answer dates vs memory
   session_datetime with 1-day tolerance.

2. Strategy 2 (semantic search) only checked memory content text,
   ignoring session_datetime metadata. Added _session_datetime_to_words
   helper that decomposes ISO timestamps into matchable tokens (year,
   month name, day) and injects them into searchable text.
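A minimal sketch of both fixes, with names and token format taken from this description rather than from the actual tests/benchmarks/test_locomo.py code, which may differ:

```python
from datetime import datetime, timedelta


def _session_datetime_to_words(iso_str: str) -> str:
    """Decompose an ISO timestamp into word tokens (year, month name, day)
    that a word-overlap matcher can hit. Output format is assumed here."""
    try:
        # Tolerate a trailing "Z" for pre-3.11 fromisoformat.
        dt = datetime.fromisoformat(iso_str.replace("Z", "+00:00"))
    except ValueError:
        return ""
    return f"{dt.year} {dt.strftime('%B').lower()} {dt.day}"


def answer_date_matches_session(answer_dt: datetime, session_dt: datetime) -> bool:
    """Fix 1 in spirit: compare the expected answer's date (not the
    question's) against the memory's session_datetime, with the 1-day
    tolerance described above."""
    return abs(answer_dt - session_dt) <= timedelta(days=1)


# e.g. _session_datetime_to_words("2023-05-08T10:00:00") -> "2023 may 8",
# which overlaps with a normalized expected answer like "8 may 2023".
```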

Result: locomo-mini 76.97% → 91.78% (+14.81pp). Temporal category
22.2% → 92.1% (+69.8pp). Beats CORE SOTA by 3.54pp.

Also adds Category Breakdown table to EXPERIMENT_LOG.md so per-category
regressions are visible at a glance.

Made-with: Cursor
Category 5 (Complex Reasoning) questions either have no `answer` field
or only trivial yes/no answers. The evaluator was getting 100% because
`"" in <any string>` is always True, and `"no" in "know"` matches as
substring. These are designed for LLM-judge evaluation per the LoCoMo
paper.
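For reference, both false positives are plain Python substring semantics:

```python
# Why every category 5 item "scored" before the fix:
assert "" in "any retrieved memory text"  # empty string is a substring of everything
assert "no" in "I know about that trip"   # "no" matches inside "know"
```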

Now skips all category 5 uniformly and reports as N/A. Overall honest
score: 89.27% (208/233 on cats 1-4), still beats CORE by 1.03pp.

Adds footnotes to EXPERIMENT_LOG.md explaining the two evaluator bugs
(temporal false negatives + complex false positives) so future readers
understand why earlier rows show different numbers.

Made-with: Cursor
Replace inflated 90.53% with honest 89.27% (categories 1-4, Voyage 4).
Add note explaining the two evaluator bugs that affected prior numbers.
Point to benchmarks/EXPERIMENT_LOG.md as source of truth for baselines.
Update make commands to reference new snapshot-based bench system.

Made-with: Cursor
@coderabbitai
Contributor

coderabbitai Bot commented Mar 10, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Documentation

    • Added a comprehensive Benchmarking section, new benchmarks directory layout, experiment log guidance, and commands for running and comparing benchmarks.
    • README updated with pointers to current baselines, revised benchmark phrasing, and new benchmark run instructions.
  • Bug Fixes

    • Corrected benchmark reporting (score updated from 90.53% to 89.27%) and documented evaluator history/bugs.
    • Clarified Complex Reasoning is N/A without an external judge.
  • Tests

    • Improved evaluation logic with better temporal/date matching, per-category breakdowns, and explicit handling/reporting for skipped Complex Reasoning items.

Walkthrough

Updates benchmarking documentation and tooling (new benchmarks/ layout, commands, and EXPERIMENT_LOG.md entries), adjusts README benchmark scores, and modifies the LoCoMo test harness to add ISO datetime → words conversion and to skip Category 5 (Complex Reasoning) evaluations with updated reporting.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & Benchmarks (AGENTS.md, README.md, benchmarks/EXPERIMENT_LOG.md, benchmarks/..., tests/benchmarks/*) | Adds a comprehensive Benchmarking section and directory layout (benchmarks/), documents workflow and new commands (bench-eval, bench-compare, bench-ingest, bench-health), updates README LoCoMo scores and commands, and appends EXPERIMENT_LOG.md with 2026-03-10 results and a Category Breakdown for LoCoMo-mini. |
| Test Harness / Evaluator (tests/benchmarks/test_locomo.py) | Introduces `_session_datetime_to_words(iso_str: str)` static helper; updates temporal and multi-hop matching to include word-based dates and fuzzy-date comparisons; adds explicit skipping of Category 5 (Complex Reasoning) items across evaluation paths and adjusts reporting/aggregation to account for skipped items. |

Sequence Diagram(s)

(omitted — changes are primarily documentation and focused evaluator logic; no multi-component sequential flow meeting diagram criteria)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: fixing evaluator bugs, adding agent guidelines to AGENTS.md, and establishing a corrected benchmark baseline. |
| Description check | ✅ Passed | The description comprehensively relates to the changeset, explaining the evaluator fixes, documentation updates, and baseline corrections across all modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


@coderabbitai Bot added the enhancement (New feature or request) label on Mar 10, 2026
Contributor

@coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/benchmarks/test_locomo.py (2)

1037-1049: ⚠️ Potential issue | 🟡 Minor

Include decomposed session-date tokens in the joined multi-hop text.

This branch still appends raw ISO timestamps only, so normalize_answer() collapses them into opaque tokens like 20230508t... and the new _session_datetime_to_words() helper never helps the word-overlap fallback here. Temporal multi-hop questions can still miss unless match_dates_fuzzy() happens to parse the answer.

Proposed fix
```diff
                     for mem in evidence_memories:
                         content = mem.get("content", "")
                         metadata = mem.get("metadata", {})
                         session_dt = metadata.get("session_datetime", "")
                         joined_text_parts.append(str(content))
                         if session_dt:
                             joined_text_parts.append(str(session_dt))
+                            joined_text_parts.append(
+                                self._session_datetime_to_words(session_dt)
+                            )
                     joined_text = " \n ".join(joined_text_parts).lower()
```

Also applies to: 1051-1054

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/benchmarks/test_locomo.py` around lines 1037 - 1049, The joined
multi-hop text currently appends raw ISO session timestamps (session_dt) into
joined_text_parts so normalize_answer() collapses them; modify the block that
builds joined_text_parts (iterating evidence_memories) to call the existing
helper _session_datetime_to_words(session_dt) and append or extend its returned
human-readable tokens alongside the raw session_dt (or instead of it) before
joining, so joined_text (and joined_norm via normalize_answer) contains
decomposed date/time words useful for word-overlap fallback and
match_dates_fuzzy(); ensure this change is applied in the same multi-hop branch
that creates joined_text and in the similar block around lines 1051-1054.

1570-1618: ⚠️ Potential issue | 🟠 Major

Don't compare the cats 1–4 score to CORE's full LoCoMo score.

With Category 5 skipped, overall_accuracy is now a 233-question metric, but Lines 1608-1616 still compare it to CORE's published 304-question score and can print “BEATS CORE”. That comparison is no longer apples-to-apples and will misstate the result in both CLI output and the returned comparison payload.

Proposed fix
```diff
-        # Comparison with CORE
-        core_sota = 0.8824
-        improvement = overall_accuracy - core_sota
         print(f"\n🏆 Comparison with CORE (SOTA):")
-        print(f"  CORE: {core_sota:.2%}")
-        print(f"  AutoMem: {overall_accuracy:.2%}")
-        if improvement > 0:
-            print(f"  🎉 AutoMem BEATS CORE by {improvement:.2%}!")
-        elif improvement < 0:
-            print(f"  📉 AutoMem is {abs(improvement):.2%} behind CORE")
-        else:
-            print(f"  🤝 AutoMem matches CORE")
+        if cat5_skipped:
+            core_sota = None
+            improvement = None
+            print("  Not directly comparable: AutoMem score excludes Category 5 questions.")
+        else:
+            core_sota = 0.8824
+            improvement = overall_accuracy - core_sota
+            print(f"  CORE: {core_sota:.2%}")
+            print(f"  AutoMem: {overall_accuracy:.2%}")
+            if improvement > 0:
+                print(f"  🎉 AutoMem BEATS CORE by {improvement:.2%}!")
+            elif improvement < 0:
+                print(f"  📉 AutoMem is {abs(improvement):.2%} behind CORE")
+            else:
+                print(f"  🤝 AutoMem matches CORE")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/benchmarks/test_locomo.py` around lines 1570 - 1618, The code currently
compares overall_accuracy (which excludes Category 5 when cat5_skipped > 0) to
core_sota, producing an invalid apples-to-apples comparison; change the logic so
that when cat5_skipped is non-zero you do NOT compare overall_accuracy to
core_sota directly — instead either compute a Category-1–4-only accuracy (sum
correct and total from category_results for categories 1–4 and compare that to a
CORE 1–4 baseline if you have per-category CORE data) or skip the CORE
comparison and mark it N/A in both CLI output and the returned comparison
payload; update the block referencing overall_accuracy, core_sota, cat5_skipped,
and category_results to implement this conditional behavior and ensure the
comparison and printed message reflect the adjusted or omitted comparison
accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/benchmarks/test_locomo.py`:
- Around line 1427-1429: The print uses an unnecessary f-string causing F541;
change the call print(f"\n📊 Conversation Results:") to a regular string literal
(e.g., print("\n📊 Conversation Results:")) to remove the empty f-string and
re-run linting; update the nearby prints that use f-strings with interpolations
(the subsequent print using accuracy, correct_count, total_count, and skip_note)
should remain unchanged.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 79dea21a-fc66-427c-a413-5fd00d43101b

📥 Commits

Reviewing files that changed from the base of the PR and between 795368a and b527d76.

📒 Files selected for processing (4)
  • AGENTS.md
  • README.md
  • benchmarks/EXPERIMENT_LOG.md
  • tests/benchmarks/test_locomo.py

Comment thread tests/benchmarks/test_locomo.py
- Remove empty f-string (F541) in evaluate_conversation print
- Add _session_datetime_to_words to multi-hop joined text so
  word-overlap fallback can match date components (same fix already
  applied to Strategy 1 and Strategy 2)
- Add caveat to CORE comparison when cat5 is excluded, noting that
  CORE's 88.24% includes cat5 via GPT-4 judge

Made-with: Cursor
Contributor

@coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/benchmarks/test_locomo.py (1)

1319-1326: Keep total_questions as the raw QA count.

These return objects now use total_questions for scored-only items, while qa_results still includes skipped cat-5 rows. That silently changes the payload contract for any downstream consumer expecting total_questions == len(qa_results). Consider adding explicit scored_questions / skipped_questions fields instead of repurposing total_questions.

Also applies to: 1431-1438
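
A payload shape along the lines the comment suggests (field names hypothetical, not taken from the repo):

```python
def build_payload(qa_results: list, correct_count: int, scored_count: int) -> dict:
    """Keep total_questions == len(qa_results) and expose the split explicitly."""
    return {
        "total_questions": len(qa_results),                   # raw QA count, old contract intact
        "scored_questions": scored_count,                     # cats 1-4 actually evaluated
        "skipped_questions": len(qa_results) - scored_count,  # skipped cat-5 rows
        "accuracy": correct_count / scored_count if scored_count else 0.0,
        "qa_results": qa_results,                             # still includes skipped rows
    }
```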

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/benchmarks/test_locomo.py` around lines 1319 - 1326, The return payload
currently repurposes "total_questions" to mean only scored items which breaks
the contract with "qa_results" (which still contains skipped cat-5 rows); change
the return object to preserve "total_questions" as the raw QA count (i.e.,
len(qa_results)) and add explicit fields "scored_questions" and
"skipped_questions" (or "skipped_count") that reflect scored vs skipped counts,
updating the code that computes total_count and correct_count accordingly so
"scored_questions" == total_count and "skipped_questions" == total_questions -
scored_questions; apply the same change to the other similar return block (the
one around lines 1431-1438) and ensure callers that expect the old semantics are
updated to use the new "scored_questions" field.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/benchmarks/test_locomo.py`:
- Around line 1607-1622: The comparison computes improvement = overall_accuracy
- core_sota and prints a lead/lag even when cat5_skipped is True
(overall_accuracy excludes cat-5), creating an invalid apples-to-oranges claim;
update the logic around core_sota/improvement (symbols: overall_accuracy,
core_sota, improvement, cat5_skipped, and comparison.improvement) so that if
cat5_skipped is truthy you do not compute or return a numeric improvement —
instead set improvement (and comparison.improvement) to None or omit it and
change the print block to only report CORE vs AutoMem percentages plus a warning
about excluded cat-5 Qs (no lead/lag message); apply the same guard to the
second occurrence referenced in the comment.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4b2be523-9d80-4250-bce9-1292f4870152

📥 Commits

Reviewing files that changed from the base of the PR and between b527d76 and 9c759a1.

📒 Files selected for processing (1)
  • tests/benchmarks/test_locomo.py

jack-arturo merged commit d9cbecb into main on Mar 10, 2026
7 checks passed
jack-arturo deleted the docs/benchmark-agent-guidelines branch on March 10, 2026 at 10:31
jack-arturo added a commit that referenced this pull request Mar 25, 2026
🤖 I have created a release *beep* *boop*
---


## [0.15.0](v0.14.0...v0.15.0) (2026-03-12)


### Features

* **bench:** add opt-in LoCoMo cat5 judge ([#119](#119)) ([8b7915c](8b7915c))
* **memory:** optimize relationship taxonomy ([#121](#121)) ([605633e](605633e))


### Bug Fixes

* **benchmarks:** correct evaluator bugs, add agent guidelines, establish honest baseline ([#117](#117)) ([d9cbecb](d9cbecb))
* **recall:** priority_ids parameter only boosts relevance ([#79](#79)) ([#125](#125)) ([5d3708c](5d3708c))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).