feat(bench): add opt-in LoCoMo cat5 judge #119
Conversation
📝 Walkthrough (Summary by CodeRabbit)

Adds an optional LLM-based judge for LoCoMo Category‑5, CLI/config flags to enable it, local and LLM-level caching, OpenAI request timeout and JSON parsing helpers, extensive Cat‑5 tests, and documentation/benchmark updates distinguishing judge-on vs judge-off baselines and cost notes.
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant Runner as LoCoMo Runner
    participant Eval as LoCoMo Evaluator
    participant Cache as Local Memory Cache
    participant OpenAI as OpenAI API
    User->>Runner: run_benchmark(--judge / --judge-model)
    Runner->>Eval: init(config with judge_model)
    Runner->>Eval: evaluate_conversation()
    loop per Cat-5 question
        Eval->>Cache: lookup prepared memories (sample_id)
        alt cache hit
            Cache-->>Eval: return cached memories
        else cache miss
            Eval->>Eval: fetch_evidence_memories(evidence_ids, use_local_cache=false)
            Eval->>Cache: _cache_prepared_memories(sample_id, memories)
            Cache-->>Eval: cached
        end
        Eval->>Eval: _format_memories_for_llm(memories)
        Eval->>OpenAI: judge_complex_reasoning(prompt, timeout=OPENAI_REQUEST_TIMEOUT_SECONDS)
        OpenAI-->>Eval: JSON verdict (verdict, confidence, answer)
        Eval->>Eval: _parse_json_object_response()
        Eval->>Eval: score and record QA result
    end
    Eval-->>Runner: aggregated results (per-category, cat-5 judge state)
    Runner-->>User: benchmark report
```
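The diagram names `_parse_json_object_response` without showing it. A minimal sketch of what such a helper typically does when an LLM reply wraps its JSON verdict in prose or fences; the body is an assumption, not this PR's implementation:

```python
import json
import re

def _parse_json_object_response(text: str) -> dict:
    """Extract the first JSON object from an LLM reply (assumed behavior).

    Tolerates surrounding prose or markdown code fences by grabbing the
    outermost {...} span and parsing only that.
    """
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if not match:
        raise ValueError(f"no JSON object in response: {text[:80]!r}")
    return json.loads(match.group(0))
```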
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/benchmarks/test_locomo.py (1)
1795-1807: ⚠️ Potential issue | 🟡 Minor: Differentiate judge-off from judge-failed in the cat-5 summary.
When judge mode is enabled but every cat-5 row is skipped (for example, missing API key or judge errors), this branch still prints `needs LLM judge`. That makes a broken judge run look the same as judge-off mode.

🛠️ Proposed fix:

```diff
- print(f" {cat5_name:25s}: N/A ({cat5_skipped:3d} skipped, needs LLM judge)")
+ reason = (
+     "judge unavailable/errors"
+     if self.config.judge_model
+     else "needs LLM judge"
+ )
+ print(f" {cat5_name:25s}: N/A ({cat5_skipped:3d} skipped, {reason})")
```
🧹 Nitpick comments (1)
tests/test_locomo_cat5_judge.py (1)
9-21: Prefer importing a shared helper over executing another test file.
Lines 9-21 create a second copy of `tests/benchmarks/test_locomo.py` with its own module-level state. That gets brittle once the benchmark module grows caches, fixtures, or import-time setup, because this suite will no longer be exercising the same module instance pytest collects. Moving `LoCoMoConfig` and `LoCoMoEvaluator` into a normal helper module would make these tests much less fragile; if you keep this pattern, at least register the module in `sys.modules` before `exec_module`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_locomo_cat5_judge.py` around lines 9 - 21, The test currently loads and executes tests/benchmarks/test_locomo.py via _load_locomo_module, creating a distinct module instance and duplicating module-level state; fix by either (preferred) refactoring LoCoMoConfig and LoCoMoEvaluator into a shared helper module and importing that helper in both places, or (if keeping exec_module) register the spec.name in sys.modules before calling spec.loader.exec_module(module) (use the same spec name used in spec_from_file_location) so pytest and this test share the same module object; update _load_locomo_module accordingly to avoid divergent module state.
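A minimal sketch of the `sys.modules` registration the prompt describes, assuming the loader lives in `tests/test_locomo_cat5_judge.py` and keeps the spec name `test_locomo`:

```python
import importlib.util
import sys
from pathlib import Path

def _load_locomo_module():
    """Load tests/benchmarks/test_locomo.py once, sharing state with pytest."""
    path = Path(__file__).parent / "benchmarks" / "test_locomo.py"
    spec = importlib.util.spec_from_file_location("test_locomo", path)
    module = importlib.util.module_from_spec(spec)
    # Register under spec.name *before* exec_module so any later import of
    # the same name resolves to this instance instead of a fresh copy.
    sys.modules[spec.name] = module
    spec.loader.exec_module(module)
    return module
```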
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/TESTING.md`:
- Around line 232-239: Update the sentence describing how the judge is enabled
to state that both --judge and --judge-model enable category-5 judging;
specifically, change the line that currently says "`--judge` enables the judge
and defaults to `gpt-4o` unless `BENCH_JUDGE_MODEL` or `--judge-model` overrides
it" to something like: "`--judge` and `--judge-model` both enable the judge;
`--judge` defaults to `gpt-4o` unless overridden by `BENCH_JUDGE_MODEL` or
`--judge-model`." Reference the flags `--judge`, `--judge-model` and the
environment variable `BENCH_JUDGE_MODEL` (and the benchmark test
`tests/benchmarks/test_locomo.py` if needed) when making the edit.
In `@README.md`:
- Around line 560-562: There is an extra blank line between the two blockquotes
("Methodology note:" and "History note:") triggering MD028; remove the blank
line so the two lines become a single continuous blockquote (or alternatively
convert both to normal paragraphs) to satisfy the linter while keeping the exact
texts of "Methodology note:" and "History note:" intact.
- Around line 535-540: The README currently documents the full baseline as
"locomo" with the judge enabled using "gpt-4o" but instructs readers to run just
"make bench-eval BENCH=locomo"; update that invocation to explicitly enable the
judge so the run matches the published baseline (for example, change the example
command to include the judge flag or env var that enables gpt-4o judging), and
mention that enabling the judge is required to include category 5 results;
reference the strings "make bench-eval BENCH=locomo", "locomo", "gpt-4o", and
"category 5" when making the change.
In `@tests/benchmarks/test_locomo.py`:
- Around line 1486-1488: The current guard unconditionally skips all category==5
rows when self.config.judge_model is false, hiding deterministic rows with
canonical answers; change the condition in the category==5 block so it only
returns (sets base_result["explanation"]="Skipped: requires LLM judge") when
judge mode is off AND the example lacks a canonical/gold answer (e.g., check
row.get("answer") or row.get("gold") is None/empty) so deterministic category-5
rows are allowed to proceed and be scored; update the if using the existing
symbols category, self.config.judge_model, and base_result to implement this
selective skip.
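Restated as a standalone predicate for clarity; the function name is hypothetical, while the `row` keys and config field come from the prompt above:

```python
def should_skip_cat5(category: int, judge_model: str | None, row: dict) -> bool:
    """Skip a category-5 row only when the judge is off AND no gold answer exists."""
    if category != 5 or judge_model:
        return False  # non-cat-5 rows and judge-on runs are always scored
    # Deterministic cat-5 rows with a canonical/gold answer fall through.
    return not (row.get("answer") or row.get("gold"))

# Judge off: rows without gold answers skip, rows with gold answers still score.
assert should_skip_cat5(5, None, {"question": "q?"}) is True
assert should_skip_cat5(5, None, {"answer": "gold"}) is False
assert should_skip_cat5(5, "gpt-4o", {}) is False
```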
---
Nitpick comments:
In `@tests/test_locomo_cat5_judge.py`:
- Around line 9-21: The test currently loads and executes
tests/benchmarks/test_locomo.py via _load_locomo_module, creating a distinct
module instance and duplicating module-level state; fix by either (preferred)
refactoring LoCoMoConfig and LoCoMoEvaluator into a shared helper module and
importing that helper in both places, or (if keeping exec_module) register the
spec.name in sys.modules before calling spec.loader.exec_module(module) (use the
same spec name used in spec_from_file_location) so pytest and this test share
the same module object; update _load_locomo_module accordingly to avoid
divergent module state.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3330e165-dad3-4892-a1c1-b451c41237eb
📒 Files selected for processing (11)
AGENTS.md, README.md, benchmarks/EXPERIMENT_LOG.md, docs/TESTING.md, scripts/bench/restore_and_eval.sh, test-locomo-benchmark.sh, tests/benchmarks/BENCHMARK_2025-11-08.md, tests/benchmarks/BENCHMARK_2025-11-20.md, tests/benchmarks/BENCHMARK_2025-12-02.md, tests/benchmarks/test_locomo.py, tests/test_locomo_cat5_judge.py
- docs/TESTING.md: clarify that --judge-model also enables the judge
- README.md: update benchmark command to include judge for full baseline
- README.md: merge blockquotes to fix MD028 lint warning
- test_locomo.py: allow deterministic cat-5 rows when judge is off
- test_locomo.py: differentiate judge-off vs judge-failed in cat-5 summary
- test_locomo_cat5_judge.py: register module in sys.modules to avoid duplicates
Resolving CodeRabbit comment #2912430608: ✅ Addressed and implemented. Note: Fixed in 48438ce — updated line 238 to clarify both --judge and --judge-model enable the judge. Resolved via CodeRabbit MCP.

Resolving CodeRabbit comment #2912430626: ✅ Addressed and implemented. Note: Fixed in 48438ce — updated benchmark command to include BENCH_JUDGE_MODEL=gpt-4o and CONFIG=baseline for full run. Resolved via CodeRabbit MCP.

Resolving CodeRabbit comment #2912430632: ✅ Addressed and implemented. Note: Fixed in 48438ce — merged blockquotes into single continuous blockquote to fix MD028. Resolved via CodeRabbit MCP.

Resolving CodeRabbit comment #2912430638: ✅ Addressed and implemented. Note: Fixed in 48438ce — added. Resolved via CodeRabbit MCP.
Actionable comments posted: 4
♻️ Duplicate comments (1)
README.md (1)
560-562: ⚠️ Potential issue | 🟡 Minor: Remove the blank line inside this blockquote.
The standalone `>` on Line 561 still triggers MD028 (no-blanks-blockquote). Make these two notes one continuous blockquote or convert them to paragraphs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 560 - 562, The blockquote contains an empty blank line between two quote paragraphs which triggers MD028; remove the solitary ">" blank line so the two notes become one continuous blockquote (or alternatively convert the two quoted lines into normal paragraphs), editing the quoted lines starting with "**Methodology note:**" and "**History note:**" so there is no blank ">" line between them.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/benchmarks/test_locomo.py`:
- Around line 1857-1858: The JSON currently sets "judge_enabled" based solely on
self.config.judge_model which claims the judge ran even if it couldn't (e.g.,
missing OPENAI_API_KEY); change this to expose two flags instead:
"judge_requested": bool(self.config.judge_model) and "judge_available":
bool(self.config.judge_model) and bool(os.getenv("OPENAI_API_KEY")) (or the
runtime check your code uses for OpenAI availability), so callers can tell
whether the judge was requested vs actually available to run; update the dict
where "judge_enabled" and "judge_model" are set and import/use os.getenv or your
existing OpenAI-availability helper (see the sketch below).
- Around line 68-72: The default for judge_model is being locked at import time
by using os.getenv(...) in the field default; change the dataclass field
declaration (judge_model) to use field(default_factory=lambda:
os.getenv("BENCH_JUDGE_MODEL") or None) so the environment is read when a
LoCoMoConfig instance is created (and keep the existing __post_init__ logic that
strips or sets None). Update the import to include dataclasses.field if needed
and ensure the symbol names to edit are judge_model in the dataclass and the
__post_init__ method of LoCoMoConfig.
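Taken together, the two test_locomo.py comments amount to something like the sketch below. The dataclass fields follow the prompts; `summary_flags` is a hypothetical helper added here only to illustrate the requested/available split:

```python
import os
from dataclasses import dataclass, field

@dataclass
class LoCoMoConfig:
    # default_factory reads the env var per instance, not once at import time
    judge_model: str | None = field(
        default_factory=lambda: os.getenv("BENCH_JUDGE_MODEL") or None
    )

    def __post_init__(self) -> None:
        # keep the existing normalization: strip whitespace, empty -> None
        if self.judge_model is not None:
            self.judge_model = self.judge_model.strip() or None

    def summary_flags(self) -> dict:
        # hypothetical helper: judge requested vs. actually able to run
        requested = bool(self.judge_model)
        return {
            "judge_requested": requested,
            "judge_available": requested and bool(os.getenv("OPENAI_API_KEY")),
        }
```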
In `@tests/test_locomo_cat5_judge.py`:
- Around line 229-257: The test
test_cat5_judge_failures_skip_instead_of_marking_wrong fails to stub the recall
step; patch locomo_evaluator._recall_memories_for_qa (the method called by
_evaluate_question) to return the same fixed evidence_memories used elsewhere in
the file so the test does not contact the local AutoMem API; update the test to
monkeypatch or set locomo_evaluator._recall_memories_for_qa = lambda qa,
sample_id, use_local_cache=False: evidence_memories before calling
_evaluate_question, ensuring the judge path runs with the stubbed memories (see the sketch below).
- Around line 30-33: The locomo_evaluator fixture currently inherits
environment-driven judge/OpenAI settings; update the fixture so it explicitly
disables judge mode and clears any OpenAI key before constructing the evaluator:
obtain a LoCoMoConfig instance in the fixture, set the judge-related flag or
field (e.g., BENCH_JUDGE_MODEL or config.use_judge) to a falsy value and ensure
the OpenAI credential (e.g., OPENAI_API_KEY or config.openai_api_key) is
cleared/None, then pass that sanitized config into LoCoMoEvaluator;
alternatively explicitly unset os.environ['OPENAI_API_KEY'] in the fixture
before creating LoCoMoEvaluator so tests run with a deterministic, judge-off
baseline.
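Both test-file comments sketched together. The `locomo_helpers` import is hypothetical (it assumes the shared-helper refactor suggested in the earlier nitpick), and the memory dict shape is an assumption:

```python
import pytest

# Hypothetical shared helper module; see the earlier nitpick about not
# exec()-ing tests/benchmarks/test_locomo.py directly.
from locomo_helpers import LoCoMoConfig, LoCoMoEvaluator

@pytest.fixture
def locomo_evaluator():
    # Deterministic, judge-off baseline regardless of ambient env vars.
    config = LoCoMoConfig()
    config.judge_model = None
    config.use_embedding_similarity = False
    evaluator = LoCoMoEvaluator(config)
    evaluator.openai_client = None  # no OpenAI credential leaks into tests
    return evaluator

def test_cat5_judge_failures_skip_instead_of_marking_wrong(locomo_evaluator, monkeypatch):
    evidence_memories = [{"id": "D10:20", "content": "Jolene watches videos."}]
    # Stub recall so the test never contacts the local AutoMem API; the lambda
    # mirrors the real signature (qa, sample_id, use_local_cache=False).
    monkeypatch.setattr(
        locomo_evaluator,
        "_recall_memories_for_qa",
        lambda _qa, _sample_id, use_local_cache=False: evidence_memories,
    )
    # ...the judge failure is then simulated and _evaluate_question is
    # asserted to skip the row rather than mark it wrong.
```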
---
Duplicate comments:
In `@README.md`:
- Around line 560-562: The blockquote contains an empty blank line between two
quote paragraphs which triggers MD028; remove the solitary ">" blank line so the
two notes become one continuous blockquote (or alternatively convert the two
quoted lines into normal paragraphs), editing the quoted lines starting with
"**Methodology note:**" and "**History note:**" so there is no blank ">" line
between them.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 69fbbc58-4478-4a94-b765-eed6fc9e8e2e
📒 Files selected for processing (4)
README.md, docs/TESTING.md, tests/benchmarks/test_locomo.py, tests/test_locomo_cat5_judge.py
- test_locomo.py: use field(default_factory=...) for judge_model to read env at instance time
- test_locomo.py: split judge_enabled into judge_requested/judge_available
- test_locomo_cat5_judge.py: sanitize fixture to disable judge/OpenAI for deterministic tests
- test_locomo_cat5_judge.py: stub _recall_memories_for_qa in judge failure test
Resolving CodeRabbit comment #2912565229: ✅ Addressed and implemented. Note: Fixed in f10fbfe — changed to field(default_factory=lambda: os.getenv(...)) so env is read at instance creation. Resolved via CodeRabbit MCP.

Resolving CodeRabbit comment #2912565238: ✅ Addressed and implemented. Note: Fixed in f10fbfe — split into judge_requested and judge_available (checks self.openai_client). Resolved via CodeRabbit MCP.

Resolving CodeRabbit comment #2912565245: ✅ Addressed and implemented. Note: Fixed in f10fbfe — fixture now explicitly sets judge_model=None, openai_client=None, use_embedding_similarity=False. Resolved via CodeRabbit MCP.

Resolving CodeRabbit comment #2912565250: ✅ Addressed and implemented. Note: Fixed in f10fbfe — added monkeypatch for _recall_memories_for_qa to return stubbed memories. Resolved via CodeRabbit MCP.
🧹 Nitpick comments (1)
tests/test_locomo_cat5_judge.py (1)
74-89: Unused lambda arguments in stubs are intentional.
The static analysis warnings (ARG005) for unused lambda arguments are false positives in this context. Test stubs must match the real method signatures to work correctly with `monkeypatch.setattr`. Prefixing unused args with `_` is optional but would be more Pythonic.

Optional: prefix unused args

```diff
- lambda question, sample_id, evidence: [_fake_memory("D10:20", "Jolene watches videos.")],
+ lambda _question, _sample_id, _evidence: [_fake_memory("D10:20", "Jolene watches videos.")],
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_locomo_cat5_judge.py` around lines 74 - 89, The lambdas used in the test stubs trigger static analysis for unused arguments; update the monkeypatch.setattr calls that replace locomo_evaluator._recall_memories_for_qa and locomo_evaluator.judge_complex_reasoning so their lambda parameters that are unused are prefixed with an underscore (e.g., question -> _question, sample_id -> _sample_id, evidence -> _evidence, adversarial_answer -> _adversarial_answer, recalled_memories -> _recalled_memories, evidence_dialog_ids -> _evidence_dialog_ids) to silence ARG005 while keeping the signatures identical for monkeypatching.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/test_locomo_cat5_judge.py`:
- Around line 74-89: The lambdas used in the test stubs trigger static analysis
for unused arguments; update the monkeypatch.setattr calls that replace
locomo_evaluator._recall_memories_for_qa and
locomo_evaluator.judge_complex_reasoning so their lambda parameters that are
unused are prefixed with an underscore (e.g., question -> _question, sample_id
-> _sample_id, evidence -> _evidence, adversarial_answer -> _adversarial_answer,
recalled_memories -> _recalled_memories, evidence_dialog_ids ->
_evidence_dialog_ids) to silence ARG005 while keeping the signatures identical
for monkeypatching.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4d9db0a6-9b24-4760-b814-3b136c8fed09
📒 Files selected for processing (2)
tests/benchmarks/test_locomo.py, tests/test_locomo_cat5_judge.py
🤖 I have created a release *beep* *boop*

---

## [0.15.0](v0.14.0...v0.15.0) (2026-03-12)

### Features

* **bench:** add opt-in LoCoMo cat5 judge ([#119](#119)) ([8b7915c](8b7915c))
* **memory:** optimize relationship taxonomy ([#121](#121)) ([605633e](605633e))

### Bug Fixes

* **benchmarks:** correct evaluator bugs, add agent guidelines, establish honest baseline ([#117](#117)) ([d9cbecb](d9cbecb))
* **recall:** priority_ids parameter only boosts relevance ([#79](#79)) ([#125](#125)) ([5d3708c](5d3708c))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Summary
- Opt-in LLM judge for LoCoMo category 5, enabled via `BENCH_JUDGE_MODEL`, `--judge`, and `--judge-model`

Implementation

- `LoCoMoEvaluator` extended with operation-scoped LLM caching

Verification
| Command | Result | Category 5 |
| --- | --- | --- |
| `make test` | 185 passed, 1 skipped, 12 deselected | — |
| `make bench-eval BENCH=locomo-mini CONFIG=baseline` | 89.27% (208/233) | N/A (71 skipped) |
| `BENCH_JUDGE_MODEL=gpt-4o make bench-eval BENCH=locomo-mini CONFIG=baseline` | 89.80% (273/304) | 91.55% (65/71) |
| `BENCH_JUDGE_MODEL=gpt-4o make bench-eval BENCH=locomo CONFIG=baseline` | 87.56% (1739/1986) | 95.74% (427/446), 0 judge skips/errors |

Notes
- Treat `88.24%` as a published reference point rather than a strict apples-to-apples leaderboard claim