
fix(benchmarks): correct evaluator bugs, add agent guidelines, establish honest baseline #117

Merged

jack-arturo merged 6 commits into main from docs/benchmark-agent-guidelines on Mar 10, 2026
Conversation

@jack-arturo
Member

Summary

  • AGENTS.md: Added benchmarking section so all agents (Cursor, Codex, Claude) know about the tiered bench system, key commands, and the A/B workflow for recall changes.
  • Temporal evaluator fix: `match_dates_fuzzy` was comparing question dates against memory dates, but temporal questions ("When did X happen?") have no dates in the question; the date is in the expected answer. Also added `session_datetime` decomposition into matchable word tokens for Strategy 2. Temporal: 22.2% → 92.1%.
  • Category 5 skip: Complex Reasoning was scoring 100% because the dataset has no `answer` field (only `adversarial_answer`), so `"" in <any string>` was always True. Now skipped and reported as N/A (requires LLM judge per LoCoMo paper).
  • README updated: Replaced inflated 90.53% with honest 89.27% (cats 1-4). Added note explaining evaluator bugs and pointing to EXPERIMENT_LOG.md.
  • Category Breakdown table added to EXPERIMENT_LOG.md so per-category regressions are visible at a glance.

Honest baseline (locomo-mini, Voyage 4, cats 1-4)

| Category    | Before        | After  |
|-------------|---------------|--------|
| Overall     | 76.97% (fake) | 89.27% |
| Single-hop  | 76.7%         | 79.1%  |
| Temporal    | 22.2% (bug)   | 92.1%  |
| Multi-hop   | 46.2%         | 46.2%  |
| Open Domain | 96.5%         | 96.5%  |
| Complex     | 100% (bug)    | N/A    |

Test plan

  • `make bench-eval BENCH=locomo-mini` passes with correct scores (89.27%)
  • Category 5 shows N/A (71 skipped)
  • No regressions in categories 1-4
  • Pre-commit hooks pass (black, isort, flake8, conventional commit)

Made with Cursor

The new snapshot-based benchmark system (PR #97) wasn't documented in
AGENTS.md, causing AI agents to miss the tiered eval pipeline and
reference stale Oct 2025 numbers instead of current Voyage 4 baselines.

Adds: tiered benchmark table, key commands, recall/retrieval change
workflow, and directory layout. Updates project structure to reference
benchmarks/ and scripts/bench/.

Made-with: Cursor
Re-ran locomo-mini on current main (795368a) after merging #73, #78,
#115, #116. Score unchanged at 76.97% — baseline is stable. This
serves as B₀ for the upcoming relation tier refactor.

Made-with: Cursor
Two bugs in check_answer_in_memories caused 22% temporal accuracy
(should be 92%):

1. match_dates_fuzzy compared question dates vs memory dates, but
   temporal questions ("When did X happen?") contain no dates — the
   date is in the expected answer. Now compares answer dates vs memory
   session_datetime with 1-day tolerance.

2. Strategy 2 (semantic search) only checked memory content text,
   ignoring session_datetime metadata. Added _session_datetime_to_words
   helper that decomposes ISO timestamps into matchable tokens (year,
   month name, day) and injects them into searchable text.
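A minimal sketch of both fixes, with names and token format taken from this description rather than from the actual tests/benchmarks/test_locomo.py code, which may differ:

```python
from datetime import datetime, timedelta


def _session_datetime_to_words(iso_str: str) -> str:
    """Decompose an ISO timestamp into word tokens (year, month name, day)
    that a word-overlap matcher can hit. Output format is assumed here."""
    try:
        # Tolerate a trailing "Z" for pre-3.11 fromisoformat.
        dt = datetime.fromisoformat(iso_str.replace("Z", "+00:00"))
    except ValueError:
        return ""
    return f"{dt.year} {dt.strftime('%B').lower()} {dt.day}"


def answer_date_matches_session(answer_dt: datetime, session_dt: datetime) -> bool:
    """Fix 1 in spirit: compare the expected answer's date (not the
    question's) against the memory's session_datetime, with the 1-day
    tolerance described above."""
    return abs(answer_dt - session_dt) <= timedelta(days=1)


# e.g. _session_datetime_to_words("2023-05-08T10:00:00") -> "2023 may 8",
# which overlaps with a normalized expected answer like "8 may 2023".
```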

Result: locomo-mini 76.97% → 91.78% (+14.81pp). Temporal category
22.2% → 92.1% (+69.8pp). Beats CORE SOTA by 3.54pp.

Also adds Category Breakdown table to EXPERIMENT_LOG.md so per-category
regressions are visible at a glance.

Made-with: Cursor
Category 5 (Complex Reasoning) questions either have no `answer` field
or only trivial yes/no answers. The evaluator was getting 100% because
`"" in <any string>` is always True, and `"no" in "know"` matches as
substring. These are designed for LLM-judge evaluation per the LoCoMo
paper.
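For reference, both false positives are plain Python substring semantics:

```python
# Why every category 5 item "scored" before the fix:
assert "" in "any retrieved memory text"  # empty string is a substring of everything
assert "no" in "I know about that trip"   # "no" matches inside "know"
```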

Now skips all category 5 uniformly and reports as N/A. Overall honest
score: 89.27% (208/233 on cats 1-4), still beats CORE by 1.03pp.

Adds footnotes to EXPERIMENT_LOG.md explaining the two evaluator bugs
(temporal false negatives + complex false positives) so future readers
understand why earlier rows show different numbers.

Made-with: Cursor
Replace inflated 90.53% with honest 89.27% (categories 1-4, Voyage 4).
Add note explaining the two evaluator bugs that affected prior numbers.
Point to benchmarks/EXPERIMENT_LOG.md as source of truth for baselines.
Update make commands to reference new snapshot-based bench system.

Made-with: Cursor
@coderabbitai
Contributor

coderabbitai Bot commented Mar 10, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Documentation

    • Added a comprehensive Benchmarking section, new benchmarks directory layout, experiment log guidance, and commands for running and comparing benchmarks.
    • README updated with pointers to current baselines, revised benchmark phrasing, and new benchmark run instructions.
  • Bug Fixes

    • Corrected benchmark reporting (score updated from 90.53% to 89.27%) and documented evaluator history/bugs.
    • Clarified Complex Reasoning is N/A without an external judge.
  • Tests

    • Improved evaluation logic with better temporal/date matching, per-category breakdowns, and explicit handling/reporting for skipped Complex Reasoning items.

Walkthrough

Updates benchmarking documentation and tooling (new benchmarks/ layout, commands, and EXPERIMENT_LOG.md entries), adjusts README benchmark scores, and modifies the LoCoMo test harness to add ISO datetime → words conversion and to skip Category 5 (Complex Reasoning) evaluations with updated reporting.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & Benchmarks (AGENTS.md, README.md, benchmarks/EXPERIMENT_LOG.md, benchmarks/..., tests/benchmarks/*) | Adds a comprehensive Benchmarking section and directory layout (benchmarks/), documents workflow and new commands (bench-eval, bench-compare, bench-ingest, bench-health), updates README LoCoMo scores and commands, and appends EXPERIMENT_LOG.md with 2026-03-10 results and a Category Breakdown for LoCoMo-mini. |
| Test Harness / Evaluator (tests/benchmarks/test_locomo.py) | Introduces `_session_datetime_to_words(iso_str: str)` static helper; updates temporal and multi-hop matching to include word-based dates and fuzzy-date comparisons; adds explicit skipping of Category 5 (Complex Reasoning) items across evaluation paths and adjusts reporting/aggregation to account for skipped items. |

Sequence Diagram(s)

(omitted — changes are primarily documentation and focused evaluator logic; no multi-component sequential flow meeting diagram criteria)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: fixing evaluator bugs, adding agent guidelines to AGENTS.md, and establishing a corrected benchmark baseline. |
| Description check | ✅ Passed | The description comprehensively relates to the changeset, explaining the evaluator fixes, documentation updates, and baseline corrections across all modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


@coderabbitai Bot added the enhancement (New feature or request) label on Mar 10, 2026
Contributor

@coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/benchmarks/test_locomo.py (2)

1037-1049: ⚠️ Potential issue | 🟡 Minor

Include decomposed session-date tokens in the joined multi-hop text.

This branch still appends raw ISO timestamps only, so normalize_answer() collapses them into opaque tokens like 20230508t... and the new _session_datetime_to_words() helper never helps the word-overlap fallback here. Temporal multi-hop questions can still miss unless match_dates_fuzzy() happens to parse the answer.

Proposed fix
```diff
                     for mem in evidence_memories:
                         content = mem.get("content", "")
                         metadata = mem.get("metadata", {})
                         session_dt = metadata.get("session_datetime", "")
                         joined_text_parts.append(str(content))
                         if session_dt:
                             joined_text_parts.append(str(session_dt))
+                            joined_text_parts.append(
+                                self._session_datetime_to_words(session_dt)
+                            )
                     joined_text = " \n ".join(joined_text_parts).lower()
```

Also applies to: 1051-1054

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/benchmarks/test_locomo.py` around lines 1037 - 1049, The joined
multi-hop text currently appends raw ISO session timestamps (session_dt) into
joined_text_parts so normalize_answer() collapses them; modify the block that
builds joined_text_parts (iterating evidence_memories) to call the existing
helper _session_datetime_to_words(session_dt) and append or extend its returned
human-readable tokens alongside the raw session_dt (or instead of it) before
joining, so joined_text (and joined_norm via normalize_answer) contains
decomposed date/time words useful for word-overlap fallback and
match_dates_fuzzy(); ensure this change is applied in the same multi-hop branch
that creates joined_text and in the similar block around lines 1051-1054.

1570-1618: ⚠️ Potential issue | 🟠 Major

Don't compare the cats 1–4 score to CORE's full LoCoMo score.

With Category 5 skipped, overall_accuracy is now a 233-question metric, but Lines 1608-1616 still compare it to CORE's published 304-question score and can print “BEATS CORE”. That comparison is no longer apples-to-apples and will misstate the result in both CLI output and the returned comparison payload.

Proposed fix
```diff
-        # Comparison with CORE
-        core_sota = 0.8824
-        improvement = overall_accuracy - core_sota
         print(f"\n🏆 Comparison with CORE (SOTA):")
-        print(f"  CORE: {core_sota:.2%}")
-        print(f"  AutoMem: {overall_accuracy:.2%}")
-        if improvement > 0:
-            print(f"  🎉 AutoMem BEATS CORE by {improvement:.2%}!")
-        elif improvement < 0:
-            print(f"  📉 AutoMem is {abs(improvement):.2%} behind CORE")
-        else:
-            print(f"  🤝 AutoMem matches CORE")
+        if cat5_skipped:
+            core_sota = None
+            improvement = None
+            print("  Not directly comparable: AutoMem score excludes Category 5 questions.")
+        else:
+            core_sota = 0.8824
+            improvement = overall_accuracy - core_sota
+            print(f"  CORE: {core_sota:.2%}")
+            print(f"  AutoMem: {overall_accuracy:.2%}")
+            if improvement > 0:
+                print(f"  🎉 AutoMem BEATS CORE by {improvement:.2%}!")
+            elif improvement < 0:
+                print(f"  📉 AutoMem is {abs(improvement):.2%} behind CORE")
+            else:
+                print(f"  🤝 AutoMem matches CORE")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/benchmarks/test_locomo.py` around lines 1570 - 1618, The code currently
compares overall_accuracy (which excludes Category 5 when cat5_skipped > 0) to
core_sota, producing an invalid apples-to-apples comparison; change the logic so
that when cat5_skipped is non-zero you do NOT compare overall_accuracy to
core_sota directly — instead either compute a Category-1–4-only accuracy (sum
correct and total from category_results for categories 1–4 and compare that to a
CORE 1–4 baseline if you have per-category CORE data) or skip the CORE
comparison and mark it N/A in both CLI output and the returned comparison
payload; update the block referencing overall_accuracy, core_sota, cat5_skipped,
and category_results to implement this conditional behavior and ensure the
comparison and printed message reflect the adjusted or omitted comparison
accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/benchmarks/test_locomo.py`:
- Around line 1427-1429: The print uses an unnecessary f-string causing F541;
change the call print(f"\n📊 Conversation Results:") to a regular string literal
(e.g., print("\n📊 Conversation Results:")) to remove the empty f-string and
re-run linting; update the nearby prints that use f-strings with interpolations
(the subsequent print using accuracy, correct_count, total_count, and skip_note)
should remain unchanged.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 79dea21a-fc66-427c-a413-5fd00d43101b

📥 Commits

Reviewing files that changed from the base of the PR and between 795368a and b527d76.

📒 Files selected for processing (4)
  • AGENTS.md
  • README.md
  • benchmarks/EXPERIMENT_LOG.md
  • tests/benchmarks/test_locomo.py

Comment thread tests/benchmarks/test_locomo.py
- Remove empty f-string (F541) in evaluate_conversation print
- Add _session_datetime_to_words to multi-hop joined text so
  word-overlap fallback can match date components (same fix already
  applied to Strategy 1 and Strategy 2)
- Add caveat to CORE comparison when cat5 is excluded, noting that
  CORE's 88.24% includes cat5 via GPT-4 judge

Made-with: Cursor
Contributor

@coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/benchmarks/test_locomo.py (1)

1319-1326: Keep total_questions as the raw QA count.

These return objects now use total_questions for scored-only items, while qa_results still includes skipped cat-5 rows. That silently changes the payload contract for any downstream consumer expecting total_questions == len(qa_results). Consider adding explicit scored_questions / skipped_questions fields instead of repurposing total_questions.

Also applies to: 1431-1438
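
A payload shape along the lines the comment suggests (field names hypothetical, not taken from the repo):

```python
def build_payload(qa_results: list, correct_count: int, scored_count: int) -> dict:
    """Keep total_questions == len(qa_results) and expose the split explicitly."""
    return {
        "total_questions": len(qa_results),                   # raw QA count, old contract intact
        "scored_questions": scored_count,                     # cats 1-4 actually evaluated
        "skipped_questions": len(qa_results) - scored_count,  # skipped cat-5 rows
        "accuracy": correct_count / scored_count if scored_count else 0.0,
        "qa_results": qa_results,                             # still includes skipped rows
    }
```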

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/benchmarks/test_locomo.py` around lines 1319 - 1326, The return payload
currently repurposes "total_questions" to mean only scored items which breaks
the contract with "qa_results" (which still contains skipped cat-5 rows); change
the return object to preserve "total_questions" as the raw QA count (i.e.,
len(qa_results)) and add explicit fields "scored_questions" and
"skipped_questions" (or "skipped_count") that reflect scored vs skipped counts,
updating the code that computes total_count and correct_count accordingly so
"scored_questions" == total_count and "skipped_questions" == total_questions -
scored_questions; apply the same change to the other similar return block (the
one around lines 1431-1438) and ensure callers that expect the old semantics are
updated to use the new "scored_questions" field.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/benchmarks/test_locomo.py`:
- Around line 1607-1622: The comparison computes improvement = overall_accuracy
- core_sota and prints a lead/lag even when cat5_skipped is True
(overall_accuracy excludes cat-5), creating an invalid apples-to-oranges claim;
update the logic around core_sota/improvement (symbols: overall_accuracy,
core_sota, improvement, cat5_skipped, and comparison.improvement) so that if
cat5_skipped is truthy you do not compute or return a numeric improvement —
instead set improvement (and comparison.improvement) to None or omit it and
change the print block to only report CORE vs AutoMem percentages plus a warning
about excluded cat-5 Qs (no lead/lag message); apply the same guard to the
second occurrence referenced in the comment.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4b2be523-9d80-4250-bce9-1292f4870152

📥 Commits

Reviewing files that changed from the base of the PR and between b527d76 and 9c759a1.

📒 Files selected for processing (1)
  • tests/benchmarks/test_locomo.py

jack-arturo merged commit d9cbecb into main on Mar 10, 2026
7 checks passed
jack-arturo deleted the docs/benchmark-agent-guidelines branch on March 10, 2026 at 10:31
jack-arturo added a commit that referenced this pull request Mar 25, 2026
🤖 I have created a release *beep* *boop*
---


## [0.15.0](v0.14.0...v0.15.0) (2026-03-12)


### Features

* **bench:** add opt-in LoCoMo cat5 judge ([#119](#119)) ([8b7915c](8b7915c))
* **memory:** optimize relationship taxonomy ([#121](#121)) ([605633e](605633e))


### Bug Fixes

* **benchmarks:** correct evaluator bugs, add agent guidelines, establish honest baseline ([#117](#117)) ([d9cbecb](d9cbecb))
* **recall:** priority_ids parameter only boosts relevance ([#79](#79)) ([#125](#125)) ([5d3708c](5d3708c))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).