fix(benchmarks): correct evaluator bugs, add agent guidelines, establish honest baseline #117
Conversation
The new snapshot-based benchmark system (PR #97) wasn't documented in AGENTS.md, causing AI agents to miss the tiered eval pipeline and reference stale Oct 2025 numbers instead of current Voyage 4 baselines. Adds: tiered benchmark table, key commands, recall/retrieval change workflow, and directory layout. Updates project structure to reference benchmarks/ and scripts/bench/. Made-with: Cursor
Two bugs in check_answer_in_memories caused 22% temporal accuracy
(should be 92%):
1. match_dates_fuzzy compared question dates vs memory dates, but
temporal questions ("When did X happen?") contain no dates — the
date is in the expected answer. Now compares answer dates vs memory
session_datetime with 1-day tolerance.
2. Strategy 2 (semantic search) only checked memory content text,
ignoring session_datetime metadata. Added _session_datetime_to_words
helper that decomposes ISO timestamps into matchable tokens (year,
month name, day) and injects them into searchable text.
Result: locomo-mini 76.97% → 91.78% (+14.81pp). Temporal category
22.2% → 92.1% (+69.8pp). Beats CORE SOTA by 3.54pp.
Also adds Category Breakdown table to EXPERIMENT_LOG.md so per-category
regressions are visible at a glance.
Made-with: Cursor
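For illustration, a minimal standalone sketch of the two fixes this commit message describes (helper names and signatures are simplified assumptions; the real implementations live in tests/benchmarks/test_locomo.py and may differ):

```python
from datetime import datetime, timedelta

def dates_match_fuzzy(answer_date: datetime, session_dt: datetime,
                      tolerance_days: int = 1) -> bool:
    """Fix 1: compare the expected answer's date against the memory's
    session_datetime, with a 1-day tolerance."""
    return abs(answer_date - session_dt) <= timedelta(days=tolerance_days)

def session_datetime_to_words(iso_ts: str) -> str:
    """Fix 2: decompose an ISO timestamp into word tokens (year, month
    name, day) so word-overlap matching can hit them."""
    try:
        dt = datetime.fromisoformat(iso_ts.replace("Z", "+00:00"))
    except ValueError:
        return ""
    return f"{dt.year} {dt.strftime('%B').lower()} {dt.day}"

# "8 May 2023" in the expected answer should match a memory from 7-9 May.
ans = datetime(2023, 5, 8)
mem = datetime(2023, 5, 7, 22, 15)
print(dates_match_fuzzy(ans, mem))                       # True
print(session_datetime_to_words("2023-05-08T14:30:00"))  # "2023 may 8"
```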
Category 5 (Complex Reasoning) questions either have no `answer` field or only trivial yes/no answers. The evaluator was getting 100% because `"" in <any string>` is always True, and `"no" in "know"` matches as a substring. These questions are designed for LLM-judge evaluation per the LoCoMo paper. Now skips all category 5 uniformly and reports it as N/A.
Overall honest score: 89.27% (208/233 on cats 1-4), still beats CORE by 1.03pp.
Adds footnotes to EXPERIMENT_LOG.md explaining the two evaluator bugs (temporal false negatives + complex false positives) so future readers understand why earlier rows show different numbers.
Made-with: Cursor
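For illustration, why the buggy substring check always passed on Category 5 (a standalone Python sketch, not the harness code):

```python
# Demo of the two false-positive modes described above.
memory_text = "i know they visited the museum last spring"

# Bug 1: an empty expected answer is a substring of every string,
# so questions with no `answer` field always "matched".
print("" in memory_text)    # True

# Bug 2: plain substring containment matches "no" inside "know",
# so trivial yes/no answers matched almost anything.
print("no" in memory_text)  # True ("k-no-w")
```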
Replace inflated 90.53% with honest 89.27% (categories 1-4, Voyage 4). Add note explaining the two evaluator bugs that affected prior numbers. Point to benchmarks/EXPERIMENT_LOG.md as source of truth for baselines. Update make commands to reference new snapshot-based bench system. Made-with: Cursor
Summary by CodeRabbit

📝 Walkthrough
Updates benchmarking documentation and tooling (new benchmarks/ layout, commands, and EXPERIMENT_LOG.md entries), adjusts README benchmark scores, and modifies the LoCoMo test harness to add ISO datetime → words conversion and to skip Category 5 (Complex Reasoning) evaluations with updated reporting.
Sequence Diagram(s): omitted — changes are primarily documentation and focused evaluator logic; no multi-component sequential flow meeting diagram criteria.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/benchmarks/test_locomo.py (2)
1037-1049: ⚠️ Potential issue | 🟡 Minor
Include decomposed session-date tokens in the joined multi-hop text.
This branch still appends raw ISO timestamps only, so `normalize_answer()` collapses them into opaque tokens like `20230508t...` and the new `_session_datetime_to_words()` helper never helps the word-overlap fallback here. Temporal multi-hop questions can still miss unless `match_dates_fuzzy()` happens to parse the answer.
Proposed fix:
```diff
 for mem in evidence_memories:
     content = mem.get("content", "")
     metadata = mem.get("metadata", {})
     session_dt = metadata.get("session_datetime", "")
     joined_text_parts.append(str(content))
     if session_dt:
         joined_text_parts.append(str(session_dt))
+        joined_text_parts.append(
+            self._session_datetime_to_words(session_dt)
+        )
 joined_text = " \n ".join(joined_text_parts).lower()
```

Also applies to: 1051-1054
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/benchmarks/test_locomo.py` around lines 1037 - 1049, The joined multi-hop text currently appends raw ISO session timestamps (session_dt) into joined_text_parts so normalize_answer() collapses them; modify the block that builds joined_text_parts (iterating evidence_memories) to call the existing helper _session_datetime_to_words(session_dt) and append or extend its returned human-readable tokens alongside the raw session_dt (or instead of it) before joining, so joined_text (and joined_norm via normalize_answer) contains decomposed date/time words useful for word-overlap fallback and match_dates_fuzzy(); ensure this change is applied in the same multi-hop branch that creates joined_text and in the similar block around lines 1051-1054.
1570-1618: ⚠️ Potential issue | 🟠 Major
Don't compare the cats 1–4 score to CORE's full LoCoMo score.
With Category 5 skipped, `overall_accuracy` is now a 233-question metric, but lines 1608-1616 still compare it to CORE's published 304-question score and can print "BEATS CORE". That comparison is no longer apples-to-apples and will misstate the result in both CLI output and the returned `comparison` payload.
Proposed fix:
```diff
-    # Comparison with CORE
-    core_sota = 0.8824
-    improvement = overall_accuracy - core_sota
     print(f"\n🏆 Comparison with CORE (SOTA):")
-    print(f"   CORE:    {core_sota:.2%}")
-    print(f"   AutoMem: {overall_accuracy:.2%}")
-    if improvement > 0:
-        print(f"   🎉 AutoMem BEATS CORE by {improvement:.2%}!")
-    elif improvement < 0:
-        print(f"   📉 AutoMem is {abs(improvement):.2%} behind CORE")
-    else:
-        print(f"   🤝 AutoMem matches CORE")
+    if cat5_skipped:
+        core_sota = None
+        improvement = None
+        print("   Not directly comparable: AutoMem score excludes Category 5 questions.")
+    else:
+        core_sota = 0.8824
+        improvement = overall_accuracy - core_sota
+        print(f"   CORE:    {core_sota:.2%}")
+        print(f"   AutoMem: {overall_accuracy:.2%}")
+        if improvement > 0:
+            print(f"   🎉 AutoMem BEATS CORE by {improvement:.2%}!")
+        elif improvement < 0:
+            print(f"   📉 AutoMem is {abs(improvement):.2%} behind CORE")
+        else:
+            print(f"   🤝 AutoMem matches CORE")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/benchmarks/test_locomo.py` around lines 1570 - 1618, The code currently compares overall_accuracy (which excludes Category 5 when cat5_skipped > 0) to core_sota, producing an invalid apples-to-apples comparison; change the logic so that when cat5_skipped is non-zero you do NOT compare overall_accuracy to core_sota directly — instead either compute a Category-1–4-only accuracy (sum correct and total from category_results for categories 1–4 and compare that to a CORE 1–4 baseline if you have per-category CORE data) or skip the CORE comparison and mark it N/A in both CLI output and the returned comparison payload; update the block referencing overall_accuracy, core_sota, cat5_skipped, and category_results to implement this conditional behavior and ensure the comparison and printed message reflect the adjusted or omitted comparison accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/benchmarks/test_locomo.py`:
- Around line 1427-1429: The print uses an unnecessary f-string causing F541;
change the call print(f"\n📊 Conversation Results:") to a regular string literal
(e.g., print("\n📊 Conversation Results:")) to remove the empty f-string and
re-run linting; update the nearby prints that use f-strings with interpolations
(the subsequent print using accuracy, correct_count, total_count, and skip_note)
should remain unchanged.
---
Outside diff comments:
In `@tests/benchmarks/test_locomo.py`:
- Around line 1037-1049: The joined multi-hop text currently appends raw ISO
session timestamps (session_dt) into joined_text_parts so normalize_answer()
collapses them; modify the block that builds joined_text_parts (iterating
evidence_memories) to call the existing helper
_session_datetime_to_words(session_dt) and append or extend its returned
human-readable tokens alongside the raw session_dt (or instead of it) before
joining, so joined_text (and joined_norm via normalize_answer) contains
decomposed date/time words useful for word-overlap fallback and
match_dates_fuzzy(); ensure this change is applied in the same multi-hop branch
that creates joined_text and in the similar block around lines 1051-1054.
- Around line 1570-1618: The code currently compares overall_accuracy (which
excludes Category 5 when cat5_skipped > 0) to core_sota, producing an invalid
apples-to-apples comparison; change the logic so that when cat5_skipped is
non-zero you do NOT compare overall_accuracy to core_sota directly — instead
either compute a Category-1–4-only accuracy (sum correct and total from
category_results for categories 1–4 and compare that to a CORE 1–4 baseline if
you have per-category CORE data) or skip the CORE comparison and mark it N/A in
both CLI output and the returned comparison payload; update the block
referencing overall_accuracy, core_sota, cat5_skipped, and category_results to
implement this conditional behavior and ensure the comparison and printed
message reflect the adjusted or omitted comparison accordingly.
📒 Files selected for processing (4)
AGENTS.md, README.md, benchmarks/EXPERIMENT_LOG.md, tests/benchmarks/test_locomo.py
- Remove empty f-string (F541) in evaluate_conversation print
- Add _session_datetime_to_words to multi-hop joined text so word-overlap fallback can match date components (same fix already applied to Strategy 1 and Strategy 2)
- Add caveat to CORE comparison when cat5 is excluded, noting that CORE's 88.24% includes cat5 via GPT-4 judge

Made-with: Cursor
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/benchmarks/test_locomo.py (1)
1319-1326: Keep `total_questions` as the raw QA count.
These return objects now use `total_questions` for scored-only items, while `qa_results` still includes skipped cat-5 rows. That silently changes the payload contract for any downstream consumer expecting `total_questions == len(qa_results)`. Consider adding explicit `scored_questions`/`skipped_questions` fields instead of repurposing `total_questions`.
Also applies to: 1431-1438
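A minimal sketch of the payload split the reviewer suggests; the field names `scored_questions` and `skipped_questions` are the reviewer's proposal, and the row shapes below are illustrative, not the harness's actual schema:

```python
# Illustrative QA rows: cat-5 rows stay in qa_results but are not scored.
qa_results = [
    {"category": 1, "correct": True},
    {"category": 2, "correct": False},
    {"category": 5, "skipped": True},
]
scored = [r for r in qa_results if not r.get("skipped")]
payload = {
    "total_questions": len(qa_results),                  # raw count: 3
    "scored_questions": len(scored),                     # cats 1-4: 2
    "skipped_questions": len(qa_results) - len(scored),  # cat 5: 1
    "accuracy": (sum(r["correct"] for r in scored) / len(scored)
                 if scored else None),
}
assert payload["total_questions"] == len(qa_results)  # contract preserved
```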
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/benchmarks/test_locomo.py` around lines 1319 - 1326, The return payload currently repurposes "total_questions" to mean only scored items which breaks the contract with "qa_results" (which still contains skipped cat-5 rows); change the return object to preserve "total_questions" as the raw QA count (i.e., len(qa_results)) and add explicit fields "scored_questions" and "skipped_questions" (or "skipped_count") that reflect scored vs skipped counts, updating the code that computes total_count and correct_count accordingly so "scored_questions" == total_count and "skipped_questions" == total_questions - scored_questions; apply the same change to the other similar return block (the one around lines 1431-1438) and ensure callers that expect the old semantics are updated to use the new "scored_questions" field.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/benchmarks/test_locomo.py`:
- Around line 1607-1622: The comparison computes improvement = overall_accuracy
- core_sota and prints a lead/lag even when cat5_skipped is True
(overall_accuracy excludes cat-5), creating an invalid apples-to-oranges claim;
update the logic around core_sota/improvement (symbols: overall_accuracy,
core_sota, improvement, cat5_skipped, and comparison.improvement) so that if
cat5_skipped is truthy you do not compute or return a numeric improvement —
instead set improvement (and comparison.improvement) to None or omit it and
change the print block to only report CORE vs AutoMem percentages plus a warning
about excluded cat-5 Qs (no lead/lag message); apply the same guard to the
second occurrence referenced in the comment.
---
Nitpick comments:
In `@tests/benchmarks/test_locomo.py`:
- Around line 1319-1326: The return payload currently repurposes
"total_questions" to mean only scored items which breaks the contract with
"qa_results" (which still contains skipped cat-5 rows); change the return object
to preserve "total_questions" as the raw QA count (i.e., len(qa_results)) and
add explicit fields "scored_questions" and "skipped_questions" (or
"skipped_count") that reflect scored vs skipped counts, updating the code that
computes total_count and correct_count accordingly so "scored_questions" ==
total_count and "skipped_questions" == total_questions - scored_questions; apply
the same change to the other similar return block (the one around lines
1431-1438) and ensure callers that expect the old semantics are updated to use
the new "scored_questions" field.
📒 Files selected for processing (1)
tests/benchmarks/test_locomo.py
🤖 I have created a release *beep* *boop*

---

## [0.15.0](v0.14.0...v0.15.0) (2026-03-12)

### Features

* **bench:** add opt-in LoCoMo cat5 judge ([#119](#119)) ([8b7915c](8b7915c))
* **memory:** optimize relationship taxonomy ([#121](#121)) ([605633e](605633e))

### Bug Fixes

* **benchmarks:** correct evaluator bugs, add agent guidelines, establish honest baseline ([#117](#117)) ([d9cbecb](d9cbecb))
* **recall:** priority_ids parameter only boosts relevance ([#79](#79)) ([#125](#125)) ([5d3708c](5d3708c))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Summary
- `match_dates_fuzzy` was comparing question dates vs memory dates, but temporal questions ("When did X happen?") have no dates in the question—the date is in the expected answer. Also added `session_datetime` decomposition into matchable word tokens for Strategy 2. Temporal: 22.2% → 92.1%.
- Category 5 (Complex Reasoning) questions have no `answer` field (only `adversarial_answer`), so `"" in <any string>` was always True. Now skipped and reported as N/A (requires LLM judge per LoCoMo paper).
- Footnotes explaining the two evaluator bugs added to `EXPERIMENT_LOG.md`.
- Category Breakdown table added to `EXPERIMENT_LOG.md` so per-category regressions are visible at a glance.

Honest baseline (locomo-mini, Voyage 4, cats 1-4): 89.27% (208/233), +1.03pp over CORE.
Test plan
- `make bench-eval BENCH=locomo-mini` passes with correct scores (89.27%)

Made with Cursor