chore(bench): experiment postmortems, artifact promotion, log update#126

Merged
jack-arturo merged 1 commit into main from chore/experiment-hygiene-archive on Mar 25, 2026

@jack-arturo
Member

Summary

Test plan

  • No code changes — documentation and artifacts only
  • make test passes (no functional changes)
  • Postmortem links in EXPERIMENT_LOG.md resolve correctly
  • Promoted artifacts match originals in benchmarks/results/

@coderabbitai
Contributor

coderabbitai Bot commented Mar 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b8a5305a-4299-43f2-8c3a-1bbf2ade954a

📥 Commits

Reviewing files that changed from the base of the PR and between b2c20aa and 7d7a6be.

📒 Files selected for processing (10)
  • benchmarks/EXPERIMENT_LOG.md
  • benchmarks/postmortems/.gitkeep
  • benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md
  • benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md
  • benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
  • tests/benchmarks/results/compare_issue74_entity_precision_20260311.json
  • tests/benchmarks/results/compare_issue79_priority_ids_20260311.json
  • tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
  • tests/benchmarks/results/compare_pr80_judge_off_20260311.json
  • tests/benchmarks/results/compare_pr80_judge_on_20260311.json
✅ Files skipped from review due to trivial changes (9)
  • tests/benchmarks/results/compare_issue74_entity_precision_20260311.json
  • tests/benchmarks/results/compare_pr80_judge_off_20260311.json
  • tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
  • tests/benchmarks/results/compare_pr80_judge_on_20260311.json
  • tests/benchmarks/results/compare_issue79_priority_ids_20260311.json
  • benchmarks/EXPERIMENT_LOG.md
  • benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
  • benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md
  • benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added three benchmark postmortems documenting PR #80 and Issues #74 and #79 with outcomes, analysis, artifacts, and follow-up recommendations.
  • Tests

    • Added multiple benchmark comparison snapshots and updated benchmark result tables to include new runs and per-category breakdowns (judge-on/off and BM25/rerank variants).

Walkthrough

This PR adds new LoCoMo-mini benchmark results (multiple runs and configurations) for 2026-03-10 to 2026-03-12, three postmortem documents for Issue #74, PR #80, and Issue #79, and corresponding JSON comparison artifacts recording baseline vs test metrics and per-category deltas.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Experiment Log Update**<br>`benchmarks/EXPERIMENT_LOG.md` | Appended new experiment rows for 2026-03-10 → 2026-03-12 covering fresh-main reruns, PR #80 port variants (judge-off/on, BM25-only, BM25+rerank top-k), and experiments for Issue #74 and Issue #79; also added category breakdown rows. |
| **Postmortems**<br>`benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md`, `benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md`, `benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md` | Added three postmortem analyses documenting hypotheses, commands, LoCoMo-mini results, root-cause findings, promoted artifacts, outcomes (Issue #74: NON-PROMOTED/REJECTED direction; PR #80: rejected due to regressions; Issue #79: fix merged), and follow-up actions. |
| **Benchmark Result Artifacts**<br>`tests/benchmarks/results/compare_issue74_entity_precision_20260311.json`, `tests/benchmarks/results/compare_issue79_priority_ids_20260311.json`, `tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`, `tests/benchmarks/results/compare_pr80_judge_off_20260311.json`, `tests/benchmarks/results/compare_pr80_judge_on_20260311.json` | Added 5 JSON comparison files containing `baseline_accuracy`, `test_accuracy`, `delta`, `category_deltas`, and `baseline_file`/`test_file` references for the new experiments. |
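
As a rough illustration of how these comparison artifacts might be consumed, the sketch below loads one record and checks it for internal consistency. The field names come from the summary above. The value types, the example numbers, and the relation `delta = test_accuracy - baseline_accuracy` are assumptions for illustration, not something this PR confirms.

```python
import json
import math

# Field names taken from the walkthrough summary; their types are assumed.
REQUIRED_FIELDS = (
    "baseline_accuracy", "test_accuracy", "delta",
    "category_deltas", "baseline_file", "test_file",
)


def check_comparison(record: dict, tol: float = 1e-6) -> list[str]:
    """Return a list of problems found in one comparison record."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in record]
    if problems:
        return problems
    # Assumption: delta is test accuracy minus baseline accuracy.
    expected = record["test_accuracy"] - record["baseline_accuracy"]
    if not math.isclose(record["delta"], expected, abs_tol=tol):
        problems.append(f"delta {record['delta']} != {expected:.6f}")
    return problems


# Made-up numbers for illustration only, not results from this PR.
example = json.loads("""
{"baseline_accuracy": 0.50, "test_accuracy": 0.45, "delta": -0.05,
 "category_deltas": {"example_category": -0.05},
 "baseline_file": "baseline.json", "test_file": "test.json"}
""")
print(check_comparison(example))
```

A check like this could run over every file in `tests/benchmarks/results/` before artifacts are promoted, though whether the real schema satisfies the assumed delta relation would need to be confirmed against the actual files.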

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'chore(bench): experiment postmortems, artifact promotion, log update' accurately summarizes the main changes: adding postmortems, promoting artifacts, and updating the experiment log. |
| Description check | ✅ Passed | The description clearly outlines the three postmortems added (#79, #74, PR #80), the promoted JSON artifacts, the EXPERIMENT_LOG.md updates, and the cleanup steps, all matching the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai Bot added the enhancement (New feature or request) label on Mar 13, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md (2)

61-71: Make reproduction commands commit-pinned.

Line 63 references the branch in a comment, but the command block should include an explicit checkout to commit a122ba2 so reruns stay deterministic.

Suggested doc patch
````diff
 ```bash
+# Reproduce exactly from recorded revision
+git checkout a122ba2
+
 # Full port evaluation
 make bench-eval BENCH=locomo-mini CONFIG=baseline  # on exp/pr80-enhanced-recall-v2
 make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md` around lines 61 -
71, The reproduction commands in the code block (the make bench-eval and make
bench-compare invocations) are not pinned to a commit; prepend an explicit git
checkout to commit a122ba2 before the make commands so reruns are deterministic
(i.e., add a step that runs git checkout a122ba2 prior to the make
bench-eval/make bench-compare commands in the same snippet).
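
Beyond adding the checkout step to the documented commands, a rerun script could also refuse to start when the working tree is on a different revision. The following is a minimal sketch of such a guard, assuming the abbreviated hash `a122ba2` from the comment above; the function names are hypothetical and not part of this repository.

```python
import subprocess

PINNED_COMMIT = "a122ba2"  # abbreviated hash quoted in the review comment


def matches_pin(head_sha: str, pinned: str = PINNED_COMMIT) -> bool:
    """True if the full HEAD hash starts with the abbreviated pinned hash."""
    return head_sha.strip().lower().startswith(pinned.lower())


def current_head() -> str:
    """Ask git for the full hash of the checked-out commit."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def require_pinned_checkout() -> None:
    """Raise if the working tree is not at the pinned commit."""
    head = current_head()
    if not matches_pin(head):
        raise RuntimeError(f"HEAD is {head[:7]}, expected {PINNED_COMMIT}")
```

Calling `require_pinned_checkout()` before the `make bench-eval` step would make a mismatched revision fail loudly instead of silently producing numbers from another commit.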

81-86: Consider adding artifact checksums for integrity tracking.

Since these are promoted benchmark records, adding SHA256 values would make future integrity verification straightforward.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md` around lines 81 -
86, Add SHA256 checksums for each promoted artifact to enable integrity
tracking: compute the SHA256 hash for each listed file
(`tests/benchmarks/results/compare_pr80_judge_off_20260311.json`,
`tests/benchmarks/results/compare_pr80_judge_on_20260311.json`,
`tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`) and append
the checksum next to each entry in the "Promoted Artifacts" list (e.g., "-
filename — SHA256: <hex>"), ensuring the exact filename strings from the diff
are used so the checksums clearly map to the artifacts.
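
One way to produce those checksum entries is sketched below using Python's standard `hashlib`. The helper names are hypothetical, and the output format is a close variation of the one the comment suggests rather than a convention the repository is known to use.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_lines(paths: list[str], root: Path = Path(".")) -> list[str]:
    """Format one Markdown list entry per artifact with its SHA256 digest."""
    return [f"- `{p}` (SHA256: {sha256_of(root / p)})" for p in paths]
```

Passing the three artifact paths named in the comment, with `root` set to the repository checkout, would yield lines ready to paste into the "Promoted Artifacts" list.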
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`:
- Around line 12-13: The JSON entries baseline_file and test_file currently
contain absolute local paths; replace them with repo-relative paths (e.g.,
"benchmarks/results/locomo-mini_baseline_20260310_233631.json" and
"benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json") so the
artifact references are portable and do not leak local filesystem details.

---

Nitpick comments:
In `@benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md`:
- Around line 61-71: The reproduction commands in the code block (the make
bench-eval and make bench-compare invocations) are not pinned to a commit;
prepend an explicit git checkout to commit a122ba2 before the make commands so
reruns are deterministic (i.e., add a step that runs git checkout a122ba2 prior
to the make bench-eval/make bench-compare commands in the same snippet).
- Around line 81-86: Add SHA256 checksums for each promoted artifact to enable
integrity tracking: compute the SHA256 hash for each listed file
(`tests/benchmarks/results/compare_pr80_judge_off_20260311.json`,
`tests/benchmarks/results/compare_pr80_judge_on_20260311.json`,
`tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`) and append
the checksum next to each entry in the "Promoted Artifacts" list (e.g., "-
filename — SHA256: <hex>"), ensuring the exact filename strings from the diff
are used so the checksums clearly map to the artifacts.
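
The inline comment about absolute paths could be addressed with a small normalization step before artifacts are promoted. The sketch below assumes `baseline_file` and `test_file` hold POSIX-style path strings; the function name and the repository root used in the usage note are hypothetical, since the actual local paths are not shown in this conversation.

```python
from pathlib import PurePosixPath

# Field names taken from the inline review comment.
PATH_FIELDS = ("baseline_file", "test_file")


def relativize_paths(record: dict, repo_root: str) -> dict:
    """Return a copy of record with path fields made relative to repo_root."""
    root = PurePosixPath(repo_root)
    cleaned = dict(record)
    for field in PATH_FIELDS:
        value = cleaned.get(field)
        if not isinstance(value, str):
            continue
        path = PurePosixPath(value)
        if not path.is_absolute():
            continue  # already relative, leave untouched
        try:
            cleaned[field] = str(path.relative_to(root))
        except ValueError:
            # Path lies outside the repository; leave it for manual review.
            pass
    return cleaned
```

For example, with a hypothetical root of `/home/user/repo`, a value of `/home/user/repo/benchmarks/results/locomo-mini_baseline_20260310_233631.json` would become `benchmarks/results/locomo-mini_baseline_20260310_233631.json`, matching the form the comment asks for.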
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f1f949e9-14f3-48ce-90c2-303ad3278cda

📥 Commits

Reviewing files that changed from the base of the PR and between 5d3708c and b2c20aa.

📒 Files selected for processing (10)
  • benchmarks/EXPERIMENT_LOG.md
  • benchmarks/postmortems/.gitkeep
  • benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md
  • benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md
  • benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
  • tests/benchmarks/results/compare_issue74_entity_precision_20260311.json
  • tests/benchmarks/results/compare_issue79_priority_ids_20260311.json
  • tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
  • tests/benchmarks/results/compare_pr80_judge_off_20260311.json
  • tests/benchmarks/results/compare_pr80_judge_on_20260311.json

Comment thread tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
Add postmortem infrastructure and archive three experiments:
- #79 (accepted): priority_ids fetch bug fix, merged as PR #125
- #74 (rejected): entity expansion precision, zero benchmark delta
- PR #80 (rejected): BM25+rerank+query expansion, -3.83pp regression

Promote 5 comparison JSONs to tests/benchmarks/results/ for durable
record. Update EXPERIMENT_LOG.md with 7 new rows and postmortem links.
Prune 10 experiment worktrees and 9 local branches.
jack-arturo force-pushed the chore/experiment-hygiene-archive branch from b2c20aa to 7d7a6be on March 25, 2026 08:04
jack-arturo merged commit 9c461bd into main on Mar 25, 2026
6 of 7 checks passed
jack-arturo deleted the chore/experiment-hygiene-archive branch on March 25, 2026 08:08
Labels

enhancement New feature or request
