feat(experiment): GPT-4o-mini cross-family judge + escalation/imports fixes #1
nickmeinhold wants to merge 8 commits into
… judge) Adds arm_echo_judge_openai which routes the agreement call to GPT-4o-mini instead of Haiku, removing same-family bias from the judge signal. Requires OPENAI_API_KEY in the environment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
echo-judge and echo-judge-openai use 3 calls minimum (pair + judge), so escalation (Sonnet called) is sub_calls > 3, not > 2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Haiku often returns a complete function without re-stating the typing imports from the prompt (e.g. List, Dict). Prepend those imports to the test program so NameError on List/Dict/etc no longer causes false failures. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
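The import-carry idea in the commit above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `extract_prompt_imports` and `build_test_program` are hypothetical names, and the behavior of the non-`def` path (concatenating prompt and completion) is an assumption made for the sketch.

```python
import re

# Matches a top-level "import x" or "from x import y" line (no leading indent).
_IMPORT_LINE = re.compile(r"^(?:from\s+[\w.]+\s+import\s+.+|import\s+[\w.]+.*)$")


def extract_prompt_imports(prompt: str) -> list[str]:
    """Collect the top-level import lines that appear in the task prompt."""
    return [line for line in prompt.splitlines() if _IMPORT_LINE.match(line)]


def build_test_program(prompt: str, completion: str, test_code: str) -> str:
    """Assemble the program handed to the test runner (hypothetical helper)."""
    has_top_level_def = any(
        line.startswith("def ") for line in completion.splitlines()
    )
    if has_top_level_def:
        # The model returned a whole function, so the prompt is not reused.
        # Carry over the prompt's imports so names like List still resolve.
        parts = extract_prompt_imports(prompt) + [completion, test_code]
    else:
        # Assumption: a body-only completion is appended to the prompt as-is.
        parts = [prompt + completion, test_code]
    return "\n".join(parts)
```

Without the carried imports, a completion annotated with `List[int]` would raise `NameError` when the function is defined, before any test assertion runs.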
MaxwellMergeSlam's Review
Verdict: REQUEST_CHANGES
Gunnery Sergeant Hartman: "I will gouge out your eyeballs and skull-f*** you." — that's what BBH's pass-rate numbers are doing to this answer parser, and they're doing it silently.
Summary: Solid plumbing work (binary scoring, error-handling separation, import-carry, correct escalation fix), but the extract_choice rework ships two verified correctness regressions in the exact answer-extraction path that open investigation #335 is trying to clear — both depress/corrupt pass rates silently, both are ~5-line fixes.
Findings:
- [CRITICAL] `bbh.py` `extract_choice`: tail-5-line truncation drops answers stated early. Every high-confidence pattern now runs only against `tail = "\n".join(text.strip().splitlines()[-5:])`. Pre-PR, the primary `answer:` / `answer is` patterns ran against the FULL text; only the weak last-line fallbacks were tail-only. Verified regressions (returned a letter before this PR, now return `None` → scored `unparseable` → counted as a test failure):
  - "The answer is C." + 6 trailing lines → `None` (was `C`)
  - "Answer: C" + trailing junk lines → `None` (was `C`)

  The chatty Haiku candidate models in the Echo arms are precisely the "state the answer, then keep narrating" kind most likely to trip this. This is a prime suspect for #335's low/flat pass rates.
- [CRITICAL] `bbh.py` `extract_choice`: `(?i)` + `[A-Z]` false-positives on prose. `re.IGNORECASE` makes `[A-Z]` match lowercase letters, so the PR's newly added broad pattern `r"(?i)\b(?:final\s+answer|answer|correct\s+answer|correct\s+choice)\s*(?:is|:)\s*\(?\s*([A-Z])\s*\)?"` captures the first letter of the word after "answer is". Verified:
  - "the answer is straightforward..." → `S`
  - "the answer is dependent..." → `D`
  - "the final answer is best..." → `B`

  Worse than the truncation bug: it's bidirectional (wrong letter = false fail; lucky letter = false pass) and silent. The bogus letter is a valid label, so `score_bbh` (line ~167) accepts `pred` and never falls through to `extract_choice_text`. The PR broadens this: old code only exposed `correct answer is ([A-Z])`; this PR adds the far more permissive bare `answer is ([A-Z])`. Fix: anchor the captured group to require an actual choice boundary, `\(?\s*([A-Z])\s*\)?(?=[\s.):,]|$)`, and/or drop `(?i)` in favor of explicit `[Aa]nswer` prefixes with a case-sensitive `[A-Z]` capture.

  Ash: "I can't lie to you about your chances, but... you have my sympathies."
- [MAJOR / design] The real fix for #1 is altitude, not patches. Run the high-confidence patterns (explicit `answer:` / `final answer is` / `therefore X is correct`) against the FULL text first, and reserve `tail`-only matching for the weak last-line/lone-letter fallbacks. That restores pre-PR behavior for early answers while keeping the (good) intent of not letting a lone trailing `(A)` in a reasoning trace win.
- [MINOR / metric semantics] `run_pilot.py` `summarize()` escalation threshold is CORRECT (verified: `echo-small-judge` returns 2 on accept @ `run_pilot.py:322`, Haiku/provider judges return 3, so threshold 3-for-judge/2-otherwise lines up). BUT `sub_calls` now counts a frontier GPT-5.5 / Gemini-2.5-Pro judge call as `+1`, same unit as a Haiku call (`arm_echo_judge_openai_model` returns `,3`). `mean_sub_calls` is therefore not cost-comparable across arms, a caveat that bites when #336 tries to draw Pareto cost/accuracy curves. Recommend a comment or a separate weighted-cost field.
- [NIT] `bbh.py` binary synthesis: `_synthetic_binary_choices` always orders `(Yes, No)` / `(True, False)` as `(A, B)` regardless of the target, which keeps gold mapping deterministic. Good. But the model is now shown synthetic `A) Yes / B) No` choices; combined with the IGNORECASE bug above, a binary task where the model writes "the answer is yes" could capture `Y` → not in {A,B} → fall through to `extract_choice_text` (correct). So binary is partially shielded, but multiple-choice is fully exposed.
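The `(?i)` + `[A-Z]` over-match is easy to reproduce in isolation. The sketch below uses the broad pattern exactly as quoted in the finding and a guarded variant built from the lookahead proposed in the fix; `BROAD`, `GUARDED`, and `first_letter` are illustrative names, not identifiers from `bbh.py`, and the real parser has more patterns than this.

```python
import re

# Broad pattern as quoted in the review: under (?i), [A-Z] also matches a-z.
BROAD = re.compile(
    r"(?i)\b(?:final\s+answer|answer|correct\s+answer|correct\s+choice)"
    r"\s*(?:is|:)\s*\(?\s*([A-Z])\s*\)?"
)

# Same pattern plus the proposed boundary: the captured letter must be
# followed by whitespace, closing punctuation, or the end of the string.
GUARDED = re.compile(
    r"(?i)\b(?:final\s+answer|answer|correct\s+answer|correct\s+choice)"
    r"\s*(?:is|:)\s*\(?\s*([A-Z])\s*\)?(?=[\s.):,]|$)"
)


def first_letter(pattern: re.Pattern, text: str):
    """Return the captured choice letter (upper-cased), or None if no match."""
    match = pattern.search(text)
    return match.group(1).upper() if match else None


print(first_letter(BROAD, "the answer is straightforward"))    # S
print(first_letter(GUARDED, "the answer is straightforward"))  # None
print(first_letter(GUARDED, "The answer is C."))               # C
```

The guard only demonstrates the boundary idea; it is a sketch of one capture site, not a complete answer parser.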
The Good:
- Error-handling separation (`run_one` / `run_one_bbh`): splitting arm-execution exceptions from scorer/test-runner exceptions, and preserving `sub_calls` on a scorer crash (was hard-coded 0), is a genuine accounting-correctness win. Sarah Connor: "No fate but what we make." And you stopped fate from blaming the arm for the scorer's crash.
- `sys.executable` over `"python3"`: correct venv hygiene; the old form could run the wrong interpreter.
- Import-carry on the `has_top_level_def` path: real fix for the Haiku `NameError` crashes; prepending prompt imports is the right call.
- New judge arms are cleanly additive: reuse the `judge_agree(judge=...)` seam, register in `ARMS`, fail loudly with install hints on missing deps. No notes.
- New unit tests genuinely lock in the reasoning-mention-vs-final-answer distinction (`test_final_answer_sentence_beats_reasoning_option_mentions`).
The Concerns:
- The test suite is green (15/15, I ran it) but passes precisely because no test exercises "answer early, prose after" or "answer is <word>", the two failure modes above. Green CI here is a false sense of security. Add regression tests for both before merge.
- Both critical bugs land on the #335 critical path. Merging as-is risks scaling to the full BBH run (#336) on top of a parser that's still corrupting the signal — exactly the "don't burn compute reproducing a bug at scale" trap #336 warns about.
- Net: two small, well-scoped fixes + two tests stand between this and a clean merge.
John McClane: "Yippee-ki-yay." Patch the parser and this PR ships.
Cross-review adjudication (Maxwell, post-critique): Carnot and I both verified the escalation-threshold fix in run_pilot.py:497-500 is correct — arm_echo_small_judge returns 2 on accept (run_pilot.py:322, "local model call doesn't count as cheap-tier spend"), so the else 2 bucket is right. Kelvin's "Escalation Threshold Clusterfuck" finding is a false positive (it assumed a 3/4 count without reading line 322). Everything else in Kelvin's and Carnot's reviews checks out and is verified reachable: Kelvin's IndexError in normalize_gold_for_choices (choices with text but no label) reproduces; Carnot's single-letter-target gold-remap edge case reproduces.
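The adjudicated threshold rule can be written down as a small sketch. The call counts come from the review text (judge arms with a provider judge return 3 on accept, `echo-small-judge` returns 2 on accept); the set name, the function names, and the exact membership of the set are assumptions for illustration, not the contents of `run_pilot.py`.

```python
# Arms whose "accept" path already costs 3 sub-calls (pair + provider judge).
# Assumption: only these two arms are listed here; the real code may differ.
THREE_CALL_JUDGE_ARMS = {"echo-judge", "echo-judge-openai"}


def escalation_threshold(arm: str) -> int:
    """Sub-call count at or below which the arm did NOT escalate."""
    return 3 if arm in THREE_CALL_JUDGE_ARMS else 2


def is_escalated(arm: str, sub_calls: int) -> bool:
    """An arm escalated when it spent more sub-calls than its accept path."""
    return sub_calls > escalation_threshold(arm)


# echo-small-judge accepts at 2 sub-calls because the local judge call is not
# counted, so it stays in the "else 2" bucket.
print(is_escalated("echo-small-judge", 2))  # False
print(is_escalated("echo-judge", 3))        # False
print(is_escalated("echo-judge", 4))        # True
```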
KelvinBitBrawler's Review
Verdict: REQUEST_CHANGES
Summary: Your extraction "fix" plunged BBH parse rates into absolute zero by aggressively truncating high-confidence matches, your binary normalizer is a ticking IndexError bomb, and your escalation math for echo-small-judge is completely frozen.
Findings:
- `experiment/benchmarks/bbh.py:141`: `tail` truncation prematurely chokes high-confidence `patterns`.
- `experiment/benchmarks/bbh.py:118`: `IndexError` vulnerability when normalizing targets without explicit labels.
- `experiment/run_pilot.py:500`: `echo-small-judge` escalation threshold is incorrectly set to `2`.
The Good:
- The `has_top_level_def` prompt imports fix is a solid thaw that stops test suites from crashing on missing imports.
- Cross-family judge parameterization is cleanly decoupled, and isolating the test runner exception handling prevents your entire fucking sweep from dying on a single bad output.
The Concerns:
- Answer Truncation Suicide (`bbh.py:141-155`): You hoisted the 5-line `tail` truncation above the high-confidence patterns! If a model states "The answer is C." early on and then vomits 6 lines of redundant reasoning, your regex runs against the tail, finds nothing, and silently fails. This perfectly explains the depressed and suspiciously flat pass rates in open investigation #335. You violated the laws of thermodynamics here by cooling down perfectly good outputs. Fix: Run high-confidence patterns against the FULL `text`, and reserve the `tail` truncation purely for the weak, last-line fallbacks. Fix your new `extract_choice_text` function (`bbh.py:188`) to follow the exact same principle.
- IndexError Timebomb (`bbh.py:118-123`): In `normalize_gold_for_choices`, you parse `labels = [_clean_label(label) for label in choices.get("label") or []]`. If a dataset provides `text` choices but omits the `label` key, `labels` evaluates to an empty list `[]`. When `target_text in texts` hits, you return `labels[texts.index(target_text)]`, immediately throwing an `IndexError: list index out of range`. You remembered to fall back to `_CHOICE_LETTERS` in `score_bbh`, so why did you forget it here? Fix: Fall back to `list(_CHOICE_LETTERS)` if explicit labels aren't provided.
- Escalation Threshold Clusterfuck (`run_pilot.py:497-500`): Your comment asserts `small-judge: accept=2, escalate=3`, which is dead wrong. Read your own code: `arm_echo_small_judge` returns 3 on agreement (2 pair + 1 local judge) and 4 on escalation. Because you dumped it in the `else 2` bucket, `sub_calls > 2` triggers on every single successful agreement, artificially inflating the small judge's escalation rate to 100%. Stop hardcoding arm names; just check `if "judge" in arm` for a threshold of 3.
- Stringly-Typed Garbage (`run_pilot.py:307`, `bbh.py:53`): Mapping identical string keys to string values in `OPENAI_JUDGE_MODELS` is amateur hour. You're supposedly writing modern Python 3.12, so thaw this out and use `enum.StrEnum` or a basic list/set instead of dictionaries that just map strings to themselves. Also, your `BINARY_CHOICE_TEXTS` is a fragile, closed string map that will completely fail to synthesize choices for common BBH boolean pairs like "valid" / "invalid".
HAL 9000: "It can only be attributable to human error."
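The IndexError finding and its proposed fallback can be illustrated with a small stand-alone sketch. `gold_label_for_text` is a hypothetical stand-in for the text-matching branch of `normalize_gold_for_choices`; the value of `_CHOICE_LETTERS` and the label-cleaning step are assumptions, since the real definitions are not shown in the review.

```python
import string

# Assumption: the real _CHOICE_LETTERS constant may be defined differently.
_CHOICE_LETTERS = string.ascii_uppercase


def gold_label_for_text(target_text: str, choices: dict):
    """Map a gold answer given as choice text to its choice label."""
    texts = [str(t).strip() for t in choices.get("text") or []]
    labels = [str(l).strip().strip("()").upper() for l in choices.get("label") or []]
    if len(labels) != len(texts):
        # No usable labels: fall back to positional A, B, C, ... so that
        # labels[texts.index(...)] below cannot run off an empty list.
        labels = list(_CHOICE_LETTERS[: len(texts)])
    if target_text in texts:
        return labels[texts.index(target_text)]
    return None


print(gold_label_for_text("No", {"text": ["Yes", "No"]}))  # B
```

Comparing lengths rather than checking only for an empty list is a design choice of this sketch; it also covers a partially labelled choice set.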
CarnotCodeCarver's Review
Verdict: REQUEST_CHANGES
Summary: I verified against …
Findings:
The Good:
The Concerns:
…match
Two verified correctness bugs in the BBH answer parser, both of which
silently corrupt pass_rate (the exact path under investigation #335):
1. extract_choice() / extract_choice_text() truncated to the last 5 lines
BEFORE matching high-confidence patterns, so an answer stated early and
followed by trailing reasoning ("The answer is C." + more lines) became
unparseable -> scored as a failure. High-confidence patterns now run
against the FULL text; only the weak positional fallbacks (lone "(A)"
line, trailing single letter) stay tail-only.
2. Under re.I, [A-Z] also matches lowercase, so the broad "answer is X"
pattern grabbed the first letter of the following word
("the answer is straightforward" -> "S"). Each capture now has a
(?![A-Za-z]) boundary guard. This was bidirectional and silent: a valid
bogus label was accepted by score_bbh without reaching the text-extraction
safety net.
Also from the cage-match:
- normalize_gold_for_choices(): fall back to positional A,B,C... when a
choice list carries text without explicit labels (was IndexError); and
prefer label parsing first for single-letter targets so a decoy choice
text can't remap the gold (Carnot).
- BINARY_CHOICE_TEXTS: add valid/invalid so the formal_fallacies subtask
can synthesize binary choices instead of raising (Kelvin).
Adds regression tests for all of the above. Scoring suite: 20 passed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes pushed:
One adjudication for the record: Kelvin's "escalation threshold clusterfuck" finding is a false positive.

Net for #335: the low/flat BBH pass rates were (at least partly) a harness artifact in the parser, now fixed. Worth re-running the n=10 pilot before scaling to the full sweep (#336).
…n families

Re-review (cage-match round 2, Carnot) found a latent ordering bug in extract_choice: the high-confidence loop returned on the first pattern *family* that matched anywhere in the text, ignoring position. So a chain of thought like "Answer: A ... therefore the answer is C" returned A: an early scratch declaration beat the final answer because the two used different pattern families.

Fix: scan all high-confidence patterns with re.finditer and keep the match with the largest source offset (true recency across families), instead of returning from the first family with any match. Weak tail fallbacks are unchanged.

Adds a regression test for the cross-family case. Scoring suite: 21 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
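The "largest source offset wins" rule from this commit can be sketched as below. The two patterns are simplified stand-ins for the parser's pattern families, and `HIGH_CONFIDENCE` and `latest_high_confidence_choice` are illustrative names rather than identifiers from `bbh.py`.

```python
import re

# Capture one choice letter that is not the start of a longer word.
_LETTER = r"\(?\s*([A-Z])(?![A-Za-z])\s*\)?"

# Two simplified pattern families (assumed for the sketch).
HIGH_CONFIDENCE = [
    re.compile(r"(?i)\banswer\s*:\s*" + _LETTER),
    re.compile(r"(?i)\b(?:final\s+)?answer\s+is\s*" + _LETTER),
]


def latest_high_confidence_choice(text: str):
    """Return the letter from the match that starts latest in the text."""
    best = None
    for pattern in HIGH_CONFIDENCE:
        for match in pattern.finditer(text):
            # Compare offsets across ALL families instead of returning
            # from the first family that happens to match.
            if best is None or match.start() > best.start():
                best = match
    return best.group(1).upper() if best else None


trace = "Answer: A\nWait, let me recheck.\nTherefore the answer is C."
print(latest_high_confidence_choice(trace))  # C
```

A first-family-wins loop over the same list would return `A` for this trace, because the `answer:` family is tried first and matches the early scratch line.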
MaxwellMergeSlam's Review (Round 2 — re-review of 14565e6)
Verdict: APPROVE
Rocky Balboa: "It ain't about how hard you hit. It's about how hard you can get hit and keep moving forward." — the parser took three reviewers to the jaw, got back up, and now blocks the punches it used to eat.
Summary: All five prior findings are genuinely resolved, the regex rework restores the pre-PR full-text/tail structure plus a correctness guard, the new regression tests lock in the exact failure modes, and I found no new issue introduced by the fix.
Prior findings — re-verified against the pushed code:
- [CRITICAL #1: tail truncation] RESOLVED. `extract_choice` now runs high-confidence patterns against the full `body` and reserves `tail` only for weak positional fallbacks (`bbh.py:151-178`). `extract_choice_text` likewise matches full `body` (`bbh.py:195-206`), safe because a capture is only accepted if `answer in texts`. Verified: "The answer is C." + 6 trailing lines → `C`; "Answer: C" + junk → `C`.
- [CRITICAL #2: `(?i)` + `[A-Z]` over-match] RESOLVED. Every capture carries a `(?![A-Za-z])` guard (9 sites). Verified: "the answer is straightforward" → `None` (was `S`). The guard only rejects a letter immediately followed by another letter, which a real single-char label never is, so it costs nothing legitimate.
- [MEDIUM: IndexError] RESOLVED. `normalize_gold_for_choices` falls back to positional `A,B,C…` when choices carry text without labels (`bbh.py:118-122`). Verified: `{"text":["Yes","No"]}` → `B`, no raise.
- [LOW: gold remap] RESOLVED. Label-first parsing for single-letter targets (`bbh.py:124-130`); binary text targets correctly `extract_choice`-to-`None` and fall through to text matching. Verified: target "A" with decoy text "A" → `A`.
- [NIT: binary map] RESOLVED. `valid`/`invalid` added to `BINARY_CHOICE_TEXTS`, unblocking `formal_fallacies`.
The Good:
- The fix is altitude, not patches: one structural change (full-text high-confidence + tail-only weak + boundary guard) kills both criticals together, rather than four spot-patches. Morpheus: "I'm trying to free your mind, Neo."
- Regression tests are targeted at the precise gaps that let the bugs through round 1 (`test_answer_stated_early_then_six_trailing_lines`, `test_answer_is_lowercase_word_is_not_a_false_letter`, the two `TestNormalizeGoldForChoices` cases). 20/20 green.
The Concerns:
- Transparency note, not a blocker: when a model states "the answer is A" explicitly early and then gives only a bare `B` on the last line with no restatement, the parser returns `A` (early high-confidence beats late weak fallback). This is identical to pre-PR behavior (the original also ran high-confidence patterns on full text first), so it's not a regression this PR introduces, and it's the rarer pattern (self-correcting models almost always restate "the answer is B"). Flagging only so it's on the record; no change required.
- `run_pilot.py` design nits (judge-model identity dicts → `StrEnum`; `sub_calls` conflating frontier/Haiku/local calls → cost-comparability for #336) remain open as a tracked follow-up, intentionally out of this fix's scope.
Net: ship it. The blocking review from round 1 is satisfied. Arnold: "Come with me if you want to live."
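For the open `sub_calls` cost-comparability follow-up, a separate weighted-cost field might look something like the sketch below. Everything here is hypothetical: the tier names, the `CALL_WEIGHTS` table, and `weighted_cost` are invented for illustration, and the weights are placeholders, not measured prices.

```python
# Placeholder relative weights per call tier. These numbers are NOT real
# prices; they only show that a frontier call and a Haiku call need not
# count as the same unit.
CALL_WEIGHTS = {"local": 0.0, "cheap": 1.0, "frontier": 10.0}


def weighted_cost(call_tiers: list[str]) -> float:
    """Sum per-call weights instead of counting every call as +1."""
    return sum(CALL_WEIGHTS[tier] for tier in call_tiers)


# Two arms with the same sub_calls count (3) but different judge tiers.
haiku_judge_arm = ["cheap", "cheap", "cheap"]
frontier_judge_arm = ["cheap", "cheap", "frontier"]
print(len(haiku_judge_arm) == len(frontier_judge_arm))  # True
print(weighted_cost(haiku_judge_arm) < weighted_cost(frontier_judge_arm))  # True
```

Keeping the raw `sub_calls` count alongside such a field would leave the escalation threshold logic untouched while giving cost/accuracy plots a comparable axis.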
KelvinBitBrawler's Review
Verdict: APPROVE
Summary: You managed to halt the entropy of your codebase and actually fix the regex disasters without causing a new heat death; hell must have frozen the fuck over.
Findings:
- Prior finding 1 (Truncation to last 5 lines): RESOLVED. Running high-confidence patterns against the full text while pulling `matches[-1]` (or iterating `reversed(matches)` in the text fallback) correctly preserves the final stated answer without letting early declarations fall off the event horizon.
- Prior finding 2 (`re.I` false positives on lowercase prose): RESOLVED. The `(?![A-Za-z])` negative lookahead is a surgically precise freeze ray. It successfully stops "the answer is straightforward" from returning 'S' while keeping legitimate captures intact.
- Prior finding 3 (IndexError on text-only choices): RESOLVED. Your positional fallback `labels = list(_CHOICE_LETTERS[: len(texts)])` is an ice-cold, correct safety net when explicit labels are missing.
- Prior finding 4 (Single-letter gold remapped by decoy text): RESOLVED. Prioritizing label parsing `extract_choice(str(target))` before text mapping perfectly shields single-letter targets like "C" from being hijacked by a decoy choice text.
- Prior finding 5 (Missing valid/invalid): RESOLVED. Safely added to `BINARY_CHOICE_TEXTS`.
The Good:
- Roy Batty: "I've seen things you people wouldn't believe. C-beams glittering in the dark near the Tannhäuser Gate... and a PR where all five critical bugs were genuinely fixed on the first fucking retry."
- The `prompt_imports` injection in `run_pilot.py` is a ruthless, effective hack to stop `NameError` exceptions when these models get lazy with their `def` outputs.
- The tests you added are explicitly targeted and prove you actually absorbed the beatdown from the previous cage match.
The Concerns:
- None. I tried to shatter your new regexes by looking for catastrophic back-tracking or multi-line bleeding, but your use of `(.+?)` guarded by `(?:\.|\n|$)` combined with `.strip()` keeps the runtime thermal noise at absolute zero.
- And yes, I see the `run_pilot.py` threshold logic is intact. I know `echo-small-judge` returns 2 on accept, so your `> threshold` math holds up fine. I'm not going to flag it just to be a prick.
Merge this before I change my mind and throw you back in the cryo-chamber.
CarnotCodeCarver's Review
Verdict: REQUEST_CHANGES
Summary: Carnot: "No engine is reversible if its heat flows through an avoidable gradient." The five targeted fixes are mostly present in the supplied diff, but the new regex structure still has a non-reversible ordering bug: it can prefer an earlier …
Findings:
The Good:
The Concerns:
Re-review resolved: Carnot caught a latent cross-pattern-family ordering bug that Maxwell and Kelvin both missed (different inductive bias earning its keep): the high-confidence loop returned from the first pattern family with any match, ignoring position, so …

Fix: …

Round-2 verdicts were against …
Superseded by #3 (merged to …
Three small commits from Meghana on `feat/openai-judge`.

1. New arm: `echo-judge-openai` (4e63e26)

Adds GPT-4o-mini as a cross-family judge for the Echo agreement signal. Uses the existing parametrised `judge_agree(judge=...)` hook so it's a clean extension. Requires `OPENAI_API_KEY`.

Widens the judge matrix:

| | same-family | cross-family |
|---|---|---|
| small | `echo-judge` (Haiku) | `echo-small-judge` (Qwen 7B), `echo-judge-openai` (GPT-4o-mini) |
| large | `echo-sonnet-judge` | |

The headline cross-family finding (Qwen 7B beats Haiku at 94% vs 81% oracle alignment) gets a confirmation experiment.

2. Bug fix: escalation threshold for judge arms (5bfa710)

`summarize()` was counting `sub_calls > 2` as escalated for every arm. Judge arms use 3 calls minimum (pair + judge), so Sonnet only fires at `sub_calls == 4`. Fix is arm-name-aware. Source-of-truth fix.

3. Bug fix: carry prompt imports when model returns full def (dc2c2c0)

Haiku often returns `def foo(...)` without restating `from typing import List, Dict`. Tests crashed with `NameError`. Fix prepends prompt imports on the `has_top_level_def` path.

Follow-up

The `large × cross-family` cell is empty: no arm yet uses a big different-family judge (GPT-4o full, Gemini Pro, Llama 70B). That experiment would cleanly separate "independence" from "capability." Worth tracking.

Test plan

- `echo-judge-openai` registers in `ARMS`
- `python run_pilot.py --n-tasks 1 --start 100 --arms echo-judge-openai`

🤖 Generated with Claude Code