chore(bench): experiment postmortems, artifact promotion, log update by jack-arturo · Pull Request #126 · verygoodplugins/automem

jack-arturo · 2026-03-13T17:20:23Z

Summary

Add benchmarks/postmortems/ directory with 3 postmortems:
- #79, "Bug: priority_ids parameter only boosts relevance, doesn't fetch specific memories" (accepted): priority_ids fetch bug fix, merged as PR #125 ("fix(recall): priority_ids parameter only boosts relevance (#79)"), zero benchmark delta
- #74, "recall: Graph expansion follows too many hops through generic/hub nodes" (rejected): entity expansion precision, zero delta; the benchmark doesn't exercise graph expansion
- PR #80, "feat: enhanced recall — BM25 search, LLM reranking, query expansion" (rejected): BM25+rerank+query expansion, -3.83pp regression driven by open-domain (-11.4pp)
Promote 5 comparison JSON artifacts to committed tests/benchmarks/results/
Update EXPERIMENT_LOG.md with 7 new rows + postmortem links
Prune 10 experiment worktrees and 9 local branches

Test plan

No code changes — documentation and artifacts only
make test passes (no functional changes)
Postmortem links in EXPERIMENT_LOG.md resolve correctly
Promoted artifacts match originals in benchmarks/results/
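
The last check in the test plan could be automated along these lines. This is a minimal sketch, not the script used for this PR: the function name `find_mismatched_artifacts` and the `compare_*.json` glob pattern are assumptions made for illustration.

```python
import filecmp
from pathlib import Path


def find_mismatched_artifacts(promoted_dir, original_dir, pattern="compare_*.json"):
    """Return names of promoted files that are missing from, or differ from, the originals.

    The glob pattern is an assumption based on the artifact names listed in this PR.
    """
    mismatched = []
    for promoted in sorted(Path(promoted_dir).glob(pattern)):
        original = Path(original_dir) / promoted.name
        # shallow=False forces a byte-by-byte comparison instead of
        # trusting file size and modification time alone.
        if not original.is_file() or not filecmp.cmp(promoted, original, shallow=False):
            mismatched.append(promoted.name)
    return mismatched
```

Calling `find_mismatched_artifacts("tests/benchmarks/results", "benchmarks/results")` from the repository root would return an empty list when every promoted file is byte-identical to its original.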

coderabbitai · 2026-03-13T17:20:44Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b8a5305a-4299-43f2-8c3a-1bbf2ade954a

📥 Commits

Reviewing files that changed from the base of the PR and between b2c20aa and 7d7a6be.

📒 Files selected for processing (10)

benchmarks/EXPERIMENT_LOG.md
benchmarks/postmortems/.gitkeep
benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md
benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md
benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
tests/benchmarks/results/compare_issue74_entity_precision_20260311.json
tests/benchmarks/results/compare_issue79_priority_ids_20260311.json
tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
tests/benchmarks/results/compare_pr80_judge_off_20260311.json
tests/benchmarks/results/compare_pr80_judge_on_20260311.json

✅ Files skipped from review due to trivial changes (9)

tests/benchmarks/results/compare_issue74_entity_precision_20260311.json
tests/benchmarks/results/compare_pr80_judge_off_20260311.json
tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
tests/benchmarks/results/compare_pr80_judge_on_20260311.json
tests/benchmarks/results/compare_issue79_priority_ids_20260311.json
benchmarks/EXPERIMENT_LOG.md
benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md
benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md

📝 Walkthrough

Summary by CodeRabbit

Release Notes

Documentation
- Added three benchmark postmortems documenting PR #80 and Issues #74 and #79 with outcomes, analysis, artifacts, and follow-up recommendations.
Tests
- Added multiple benchmark comparison snapshots and updated benchmark result tables to include new runs and per-category breakdowns (judge-on/off and BM25/rerank variants).

Walkthrough

This PR adds new LoCoMo-mini benchmark results (multiple runs and configurations) for 2026-03-10 to 2026-03-12, three postmortem documents for Issue #74, PR #80, and Issue #79, and corresponding JSON comparison artifacts recording baseline vs test metrics and per-category deltas.

Changes

Cohort / File(s)	Summary
Experiment Log Update `benchmarks/EXPERIMENT_LOG.md`	Appended new experiment rows for 2026-03-10 → 2026-03-12 covering fresh-main reruns, PR `#80` port variants (judge-off/on, BM25-only, BM25+rerank top-k), and experiments for Issue `#74` and Issue `#79`; also added category breakdown rows.
Postmortems `benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md`, `benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md`, `benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md`	Added three postmortem analyses documenting hypotheses, commands, LoCoMo-mini results, root-cause findings, promoted artifacts, outcomes (Issue `#74`: NON-PROMOTED/REJECTED direction; PR `#80`: rejected due to regressions; Issue `#79`: fix merged), and follow-up actions.
Benchmark Result Artifacts `tests/benchmarks/results/compare_issue74_entity_precision_20260311.json`, `tests/benchmarks/results/compare_issue79_priority_ids_20260311.json`, `tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`, `tests/benchmarks/results/compare_pr80_judge_off_20260311.json`, `tests/benchmarks/results/compare_pr80_judge_on_20260311.json`	Added 5 JSON comparison files containing `baseline_accuracy`, `test_accuracy`, `delta`, `category_deltas`, and `baseline_file`/`test_file` references for the new experiments.
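
To make the artifact fields named above concrete, here is a small consistency check for one comparison record. It is a sketch under two assumptions that the PR does not confirm: that `delta` equals `test_accuracy - baseline_accuracy` in the same units, and that all six fields are present at the top level. The sample values are invented for illustration and are not taken from the real artifacts.

```python
import json
import math

REQUIRED_FIELDS = (
    "baseline_accuracy", "test_accuracy", "delta",
    "category_deltas", "baseline_file", "test_file",
)


def check_comparison(record, tol=1e-6):
    """Return a list of problems found in one comparison record (empty means consistent)."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in record]
    if problems:
        return problems
    # Assumption: delta is the plain difference test - baseline.
    expected = record["test_accuracy"] - record["baseline_accuracy"]
    if not math.isclose(record["delta"], expected, abs_tol=tol):
        problems.append(f"delta {record['delta']} != test - baseline ({expected})")
    return problems


# Hypothetical record; the numbers and file names are placeholders.
sample = json.loads("""
{
  "baseline_accuracy": 0.80,
  "test_accuracy": 0.75,
  "delta": -0.05,
  "category_deltas": {},
  "baseline_file": "baseline.json",
  "test_file": "test.json"
}
""")
print(check_comparison(sample))
```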

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

docs(bench): add PR #73, #80, and #87 experiment results #103: Added initial/placeholder experiment entries for PR #80 in benchmarks/EXPERIMENT_LOG.md, directly related to the expanded runs and postmortems in this PR.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'chore(bench): experiment postmortems, artifact promotion, log update' accurately summarizes the main changes: adding postmortems, promoting artifacts, and updating the experiment log.
Description check	✅ Passed	The description clearly outlines the three postmortems added (`#79`, `#74`, PR `#80`), the promoted JSON artifacts, the EXPERIMENT_LOG.md updates, and the cleanup steps, all matching the changeset.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md (2)

61-71: Make reproduction commands commit-pinned.

Line 63 references the branch in a comment, but the command block should include an explicit checkout to commit a122ba2 so reruns stay deterministic.

Suggested doc patch

````diff
 ```bash
+# Reproduce exactly from recorded revision
+git checkout a122ba2
+
 # Full port evaluation
 make bench-eval BENCH=locomo-mini CONFIG=baseline  # on exp/pr80-enhanced-recall-v2
 make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline
````

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md` around lines 61 -
71, The reproduction commands in the code block (the make bench-eval and make
bench-compare invocations) are not pinned to a commit; prepend an explicit git
checkout to commit a122ba2 before the make commands so reruns are deterministic
(i.e., add a step that runs git checkout a122ba2 prior to the make
bench-eval/make bench-compare commands in the same snippet).

81-86: Consider adding artifact checksums for integrity tracking.

Since these are promoted benchmark records, adding SHA256 values would make future integrity verification straightforward.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md` around lines 81 -
86, Add SHA256 checksums for each promoted artifact to enable integrity
tracking: compute the SHA256 hash for each listed file
(`tests/benchmarks/results/compare_pr80_judge_off_20260311.json`,
`tests/benchmarks/results/compare_pr80_judge_on_20260311.json`,
`tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`) and append
the checksum next to each entry in the "Promoted Artifacts" list (e.g., "-
filename — SHA256: <hex>"), ensuring the exact filename strings from the diff
are used so the checksums clearly map to the artifacts.
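
One way to produce the checksum entries suggested above is sketched here. The helper names `sha256_of` and `checksum_lines` and the exact output format are assumptions; the reviewer only asks that each filename be paired with its SHA256 hex digest.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_lines(paths):
    """Format one list entry per artifact, pairing the path with its digest."""
    return [f"- `{path}` (SHA256: {sha256_of(path)})" for path in paths]
```

Passing the three `compare_pr80_*.json` paths from the "Promoted Artifacts" list to `checksum_lines` would yield lines ready to paste into the postmortem.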

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`:
- Around line 12-13: The JSON entries baseline_file and test_file currently
contain absolute local paths; replace them with repo-relative paths (e.g.,
"benchmarks/results/locomo-mini_baseline_20260310_233631.json" and
"benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json") so the
artifact references are portable and do not leak local filesystem details.
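
The path rewrite requested above could look roughly like this. It is a sketch only: the function names and the example repository root `/home/user/automem` are hypothetical, and POSIX-style paths are assumed because the actual absolute paths are not shown in this thread.

```python
from pathlib import PurePosixPath


def to_repo_relative(path_str, repo_root):
    """Strip repo_root from an absolute path; leave other paths untouched."""
    path = PurePosixPath(path_str)
    if not path.is_absolute():
        return path_str
    try:
        return str(path.relative_to(repo_root))
    except ValueError:
        # The path lies outside repo_root, so it cannot be made relative.
        return path_str


def relativize_record(record, repo_root, keys=("baseline_file", "test_file")):
    """Return a copy of a comparison record with the given path fields made repo-relative."""
    updated = dict(record)
    for key in keys:
        if key in updated:
            updated[key] = to_repo_relative(updated[key], repo_root)
    return updated


# Hypothetical absolute path, used only to show the transformation.
record = {"baseline_file": "/home/user/automem/benchmarks/results/locomo-mini_baseline_20260310_233631.json"}
print(relativize_record(record, "/home/user/automem"))
```

Returning a copy rather than mutating the input keeps the original record available for comparison before the file is rewritten.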



ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f1f949e9-14f3-48ce-90c2-303ad3278cda

📥 Commits

Reviewing files that changed from the base of the PR and between 5d3708c and b2c20aa.

📒 Files selected for processing (10)

benchmarks/EXPERIMENT_LOG.md
benchmarks/postmortems/.gitkeep
benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md
benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md
benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
tests/benchmarks/results/compare_issue74_entity_precision_20260311.json
tests/benchmarks/results/compare_issue79_priority_ids_20260311.json
tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
tests/benchmarks/results/compare_pr80_judge_off_20260311.json
tests/benchmarks/results/compare_pr80_judge_on_20260311.json

Add postmortem infrastructure and archive three experiments:
- #79 (accepted): priority_ids fetch bug fix, merged as PR #125
- #74 (rejected): entity expansion precision, zero benchmark delta
- PR #80 (rejected): BM25+rerank+query expansion, -3.83pp regression

Promote 5 comparison JSONs to tests/benchmarks/results/ for durable record.
Update EXPERIMENT_LOG.md with 7 new rows and postmortem links.
Prune 10 experiment worktrees and 9 local branches.

coderabbitai Bot added the enhancement label (New feature or request) on Mar 13, 2026

coderabbitai Bot reviewed Mar 13, 2026


Comment thread tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json

jack-arturo force-pushed the chore/experiment-hygiene-archive branch from b2c20aa to 7d7a6be on March 25, 2026 08:04

jack-arturo merged commit 9c461bd into main on Mar 25, 2026
6 of 7 checks passed

jack-arturo deleted the chore/experiment-hygiene-archive branch on March 25, 2026 08:08

jack-arturo merged 1 commit into main from chore/experiment-hygiene-archive
