
feat(experiment): GPT-4o-mini cross-family judge + escalation/imports fixes#1

Closed
nickmeinhold wants to merge 8 commits into main from feat/openai-judge


@nickmeinhold

Collaborator

Three small commits from Meghana on feat/openai-judge.

1. New arm — echo-judge-openai (4e63e26)

Adds GPT-4o-mini as a cross-family judge for the Echo agreement signal. Uses the existing parametrised judge_agree(judge=...) hook so it's a clean extension. Requires OPENAI_API_KEY.

Widens the judge matrix:

| Judge size | Same family (Anthropic) | Cross family |
|---|---|---|
| Small (~7B) | echo-judge (Haiku) | echo-small-judge (Qwen 7B), echo-judge-openai (GPT-4o-mini) |
| Large | echo-sonnet-judge | gap (see follow-up) |

The headline cross-family finding (Qwen 7B beats Haiku at 94% vs 81% oracle alignment) gets a confirmation experiment.

2. Bug fix — escalation threshold for judge arms (5bfa710)

summarize() was counting sub_calls > 2 as escalated for every arm. Judge arms use 3 calls minimum (pair + judge), so Sonnet only fires at sub_calls == 4. Fix is arm-name-aware. Source-of-truth fix.
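
For reviewers who want the threshold rule in one place, here is a minimal sketch of an arm-name-aware escalation check. It assumes each result record carries an `arm` name and a `sub_calls` count; `THREE_CALL_ARMS`, `accept_calls`, `is_escalated`, and `escalation_rate` are hypothetical names for illustration, not the identifiers used in run_pilot.py.

```python
# Sketch only: names below are hypothetical, not copied from run_pilot.py.
# Arms whose accept path already costs 3 sub-calls (pair + judge).
THREE_CALL_ARMS = frozenset({"echo-judge", "echo-judge-openai"})


def accept_calls(arm: str) -> int:
    """Sub-calls an arm spends when it accepts without calling Sonnet."""
    return 3 if arm in THREE_CALL_ARMS else 2


def is_escalated(arm: str, sub_calls: int) -> bool:
    """True only when the arm went past its own accept-path call count."""
    return sub_calls > accept_calls(arm)


def escalation_rate(records: list[dict]) -> float:
    """Fraction of records (dicts with 'arm' and 'sub_calls') that escalated."""
    if not records:
        return 0.0
    hits = sum(is_escalated(r["arm"], r["sub_calls"]) for r in records)
    return hits / len(records)
```

With a single global `sub_calls > 2` check, every accepted judge-arm task (3 calls) would be counted as escalated; keying the threshold on the arm avoids that.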

3. Bug fix — carry prompt imports when model returns full def (dc2c2c0)

Haiku often returns def foo(...) without restating from typing import List, Dict. Tests crashed with NameError. Fix prepends prompt imports on the has_top_level_def path.
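
A rough sketch of the import-carry idea, under the assumption that the prompt's imports are single-line top-level `import` / `from ... import` statements. `prompt_imports` and `build_test_program` are hypothetical helpers written for this comment; only the `has_top_level_def` concept comes from the branch.

```python
import re

# Single-line, unindented import statements only (an assumption of this sketch).
_IMPORT_RE = re.compile(r"^(?:import\s+\S.*|from\s+\S+\s+import\s+\S.*)$")
_TOP_LEVEL_DEF_RE = re.compile(r"^def\s+\w+\s*\(", re.MULTILINE)


def prompt_imports(prompt: str) -> str:
    """Collect the top-level import lines that appear in the task prompt."""
    lines = [line for line in prompt.splitlines() if _IMPORT_RE.match(line)]
    return "\n".join(lines)


def build_test_program(prompt: str, completion: str, test_code: str) -> str:
    """Assemble the program that the test runner would execute."""
    if _TOP_LEVEL_DEF_RE.search(completion):
        # The model returned a full function: the prompt is dropped, so its
        # imports must be carried over or names like List raise NameError.
        parts = [prompt_imports(prompt), completion, test_code]
    else:
        # The model returned only a body: it continues the prompt as-is.
        parts = [prompt + completion, test_code]
    return "\n\n".join(part for part in parts if part)
```

Parenthesised multi-line imports would need a real parser (for example the `ast` module); the sketch deliberately ignores them.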

Follow-up

The large × cross-family cell is empty — no arm yet uses a big different-family judge (GPT-4o full, Gemini Pro, Llama 70B). That experiment would cleanly separate "independence" from "capability." Worth tracking.

Test plan

  • echo-judge-openai registers in ARMS
  • 1-task sanity: python run_pilot.py --n-tasks 1 --start 100 --arms echo-judge-openai
  • 64-task sweep on HumanEval 100-163 with full arm set, replace earlier numbers

🤖 Generated with Claude Code

meghanaganapa and others added 6 commits June 2, 2026 22:12
… judge)

Adds arm_echo_judge_openai which routes the agreement call to GPT-4o-mini
instead of Haiku, removing same-family bias from the judge signal.
Requires OPENAI_API_KEY in the environment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
echo-judge and echo-judge-openai use 3 calls minimum (pair + judge),
so escalation (Sonnet called) is sub_calls > 3, not > 2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Haiku often returns a complete function without re-stating the typing
imports from the prompt (e.g. List, Dict). Prepend those imports to
the test program so NameError on List/Dict/etc no longer causes false
failures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@maxwell-merge-slam maxwell-merge-slam Bot left a comment


MaxwellMergeSlam's Review

Verdict: REQUEST_CHANGES

Gunnery Sergeant Hartman: "I will gouge out your eyeballs and skull-f*** you." — that's what BBH's pass-rate numbers are doing to this answer parser, and they're doing it silently.

Summary: Solid plumbing work (binary scoring, error-handling separation, import-carry, correct escalation fix), but the extract_choice rework ships two verified correctness regressions in the exact answer-extraction path that open investigation #335 is trying to clear — both depress/corrupt pass rates silently, both are ~5-line fixes.

Findings:

  • [CRITICAL] bbh.py extract_choice — tail-5-line truncation drops answers stated early. Every high-confidence pattern now runs only against tail = "\n".join(text.strip().splitlines()[-5:]). Pre-PR, the primary answer:/answer is patterns ran against the FULL text; only the weak last-line fallbacks were tail-only. Verified regressions (returned a letter before this PR, now return None → scored unparseable → counted as a test failure):

    • "The answer is C." + 6 trailing lines → None (was C)
    • "Answer: C" + trailing junk lines → None (was C)
      The chatty Haiku candidate models in the Echo arms are precisely the "state the answer, then keep narrating" kind most likely to trip this. This is a prime suspect for #335's low/flat pass rates.
  • [CRITICAL] bbh.py extract_choice — (?i) + [A-Z] false-positives on prose. re.IGNORECASE makes [A-Z] match lowercase letters, so the PR's newly added broad pattern r"(?i)\b(?:final\s+answer|answer|correct\s+answer|correct\s+choice)\s*(?:is|:)\s*\(?\s*([A-Z])\s*\)?" captures the first letter of the word after "answer is". Verified:

    • "the answer is straightforward..." → S
    • "the answer is dependent..." → D
    • "the final answer is best..." → B
      Worse than the truncation bug: it's bidirectional (wrong letter = false fail; lucky letter = false pass) and silent — the bogus letter is a valid label, so score_bbh (line ~167) accepts pred and never falls through to extract_choice_text. The PR broadens this: old code only exposed correct answer is ([A-Z]); this PR adds the far more permissive bare answer is ([A-Z]). Fix: anchor the captured group to require an actual choice boundary — \(?\s*([A-Z])\s*\)?(?=[\s.):,]|$) and/or drop (?i) in favor of explicit [Aa]nswer prefixes with a case-sensitive [A-Z] capture. Ash: "I can't lie to you about your chances, but... you have my sympathies."
  • [MAJOR / design] The real fix for #1 is altitude, not patches. Run the high-confidence patterns (explicit answer: / final answer is / therefore X is correct) against the FULL text first, and reserve tail-only matching for the weak last-line/lone-letter fallbacks. That restores pre-PR behavior for early answers while keeping the (good) intent of not letting a lone trailing (A) in a reasoning trace win.

  • [MINOR / metric semantics] run_pilot.py summarize() escalation threshold is CORRECT (verified: echo-small-judge returns 2 on accept @ run_pilot.py:322, Haiku/provider judges return 3, so threshold 3-for-judge/2-otherwise lines up). BUT sub_calls now counts a frontier GPT-5.5 / Gemini-2.5-Pro judge call as +1, same unit as a Haiku call (arm_echo_judge_openai_model returns ,3). mean_sub_calls is therefore not cost-comparable across arms — a caveat that bites when #336 tries to draw Pareto cost/accuracy curves. Recommend a comment or a separate weighted-cost field.

  • [NIT] bbh.py binary synthesis_synthetic_binary_choices always orders (Yes, No) / (True, False) as (A, B) regardless of the target, which keeps gold mapping deterministic. Good. But the model is now shown synthetic A) Yes / B) No choices; combined with the IGNORECASE bug above, a binary task where the model writes "the answer is yes" could capture Y→ not in {A,B} → fall through to extract_choice_text (correct) — so binary is partially shielded, but multiple-choice is fully exposed.
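
The IGNORECASE interaction in the second finding can be reproduced in isolation. The sketch below uses the broad pattern quoted above and, as one possible fix, a negative lookahead placed right after the capture; `BROAD`, `GUARDED`, and `capture` are names made up for this demo, and the guard placement is an assumption rather than the exact change that should land in bbh.py.

```python
import re

# The broad pattern quoted in the review: (?i) makes [A-Z] match a-z too.
BROAD = re.compile(
    r"(?i)\b(?:final\s+answer|answer|correct\s+answer|correct\s+choice)"
    r"\s*(?:is|:)\s*\(?\s*([A-Z])\s*\)?"
)

# Same pattern with one possible guard: the captured letter must not be
# followed by another letter, so the first letter of a word is rejected.
GUARDED = re.compile(
    r"(?i)\b(?:final\s+answer|answer|correct\s+answer|correct\s+choice)"
    r"\s*(?:is|:)\s*\(?\s*([A-Z])(?![A-Za-z])\s*\)?"
)


def capture(pattern: re.Pattern, text: str):
    """Return the captured letter upper-cased, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1).upper() if match else None


for sample in ("the answer is straightforward...", "The answer is (C).", "answer: c"):
    print(repr(sample), capture(BROAD, sample), capture(GUARDED, sample))
```

The guard leaves real single-letter answers alone, but it does not help when the next word is itself one letter long ("the answer is a ..."), so a regression test for that case may still be worth adding.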

The Good:

  • Error-handling separation (run_one / run_one_bbh): splitting arm-execution exceptions from scorer/test-runner exceptions, and preserving sub_calls on a scorer crash (was hard-coded 0), is a genuine accounting-correctness win. Sarah Connor: "No fate but what we make." — and you stopped fate from blaming the arm for the scorer's crash.
  • sys.executable over "python3" — correct venv hygiene; the old form could run the wrong interpreter.
  • Import-carry on the has_top_level_def path — real fix for the Haiku NameError crashes; prepending prompt imports is the right call.
  • New judge arms are cleanly additive — reuse the judge_agree(judge=...) seam, register in ARMS, fail loudly with install hints on missing deps. No notes.
  • New unit tests genuinely lock in the reasoning-mention-vs-final-answer distinction (test_final_answer_sentence_beats_reasoning_option_mentions).

The Concerns:

  • The test suite is green (15/15, I ran it) but passes precisely because no test exercises "answer early, prose after" or "answer is" followed by a lowercase word, the two failure modes above. Green CI here is a false sense of security. Add regression tests for both before merge.
  • Both critical bugs land on the #335 critical path. Merging as-is risks scaling to the full BBH run (#336) on top of a parser that's still corrupting the signal — exactly the "don't burn compute reproducing a bug at scale" trap #336 warns about.
  • Net: two small, well-scoped fixes + two tests stand between this and a clean merge. John McClane: "Yippee-ki-yay" — patch the parser and this PR ships.

Cross-review adjudication (Maxwell, post-critique): Carnot and I both verified the escalation-threshold fix in run_pilot.py:497-500 is correct: arm_echo_small_judge returns 2 on accept (run_pilot.py:322, "local model call doesn't count as cheap-tier spend"), so the else 2 bucket is right. Kelvin's "Escalation Threshold Clusterfuck" finding is a false positive (it assumed a 3/4 count without reading line 322). Everything else in Kelvin's and Carnot's reviews checks out and is verified reachable: Kelvin's IndexError in normalize_gold_for_choices (choices with text but no label) reproduces; Carnot's single-letter-target gold-remap edge case reproduces.

@kelvin-bit-brawler kelvin-bit-brawler Bot left a comment


Ripgrep is not available. Falling back to GrepTool.

KelvinBitBrawler's Review

Verdict: REQUEST_CHANGES
Summary: Your extraction "fix" plunged BBH parse rates into absolute zero by aggressively truncating high-confidence matches, your binary normalizer is a ticking IndexError bomb, and your escalation math for echo-small-judge is completely frozen.

Findings:

  • experiment/benchmarks/bbh.py:141: tail truncation prematurely chokes high-confidence patterns.
  • experiment/benchmarks/bbh.py:118: IndexError vulnerability when normalizing targets without explicit labels.
  • experiment/run_pilot.py:500: echo-small-judge escalation threshold is incorrectly set to 2.

The Good:

  • The has_top_level_def prompt imports fix is a solid thaw that stops test suites from crashing on missing imports.
  • Cross-family judge parameterization is cleanly decoupled, and isolating the test runner exception handling prevents your entire fucking sweep from dying on a single bad output.

The Concerns:

  • Answer Truncation Suicide (bbh.py:141-155): You hoisted the 5-line tail truncation above the high-confidence patterns! If a model states "The answer is C." early on and then vomits 6 lines of redundant reasoning, your regex runs against the tail, finds nothing, and silently fails. This perfectly explains the depressed and suspiciously flat pass rates in open investigation #335. You violated the laws of thermodynamics here by cooling down perfectly good outputs. Fix: Run high-confidence patterns against the FULL text, and reserve the tail truncation purely for the weak, last-line fallbacks. Fix your new extract_choice_text function (bbh.py:188) to follow the exact same principle.
  • IndexError Timebomb (bbh.py:118-123): In normalize_gold_for_choices, you parse labels = [_clean_label(label) for label in choices.get("label") or []]. If a dataset provides text choices but omits the label key, labels evaluates to an empty list []. When target_text in texts hits, you return labels[texts.index(target_text)], immediately throwing an IndexError: list index out of range. You remembered to fallback to _CHOICE_LETTERS in score_bbh—why did you forget it here? Fix: Fallback to list(_CHOICE_LETTERS) if explicit labels aren't provided.
  • Escalation Threshold Clusterfuck (run_pilot.py:497-500): Your comment asserts small-judge: accept=2, escalate=3, which is dead wrong. Read your own code: arm_echo_small_judge returns 3 on agreement (2 pair + 1 local judge) and 4 on escalation. Because you dumped it in the else 2 bucket, sub_calls > 2 triggers on every single successful agreement, artificially inflating the small judge's escalation rate to 100%. Stop hardcoding arm names; just check if "judge" in arm for a threshold of 3.
  • Stringly-Typed Garbage (run_pilot.py:307, bbh.py:53): Mapping identical string keys to string values in OPENAI_JUDGE_MODELS is amateur hour. You're supposedly writing modern Python 3.12—thaw this out and use enum.StrEnum or a basic list/set instead of dictionaries that just map strings to themselves. Also, your BINARY_CHOICE_TEXTS is a fragile, closed string map that will completely fail to synthesize choices for common BBH boolean pairs like "valid" / "invalid".
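
To make the IndexError concern concrete, here is a small sketch of a positional label fallback. It assumes choices arrive as a dict with a `text` list and an optional `label` list, as described above; `labels_for` and `gold_label_for_text` are hypothetical helpers, and `_CHOICE_LETTERS` is assumed to be the uppercase alphabet.

```python
import string

# Assumption: the module-level constant is simply "ABC...Z".
_CHOICE_LETTERS = string.ascii_uppercase


def labels_for(choices: dict) -> list:
    """Explicit labels when present, else positional A, B, C... per choice text."""
    texts = choices.get("text") or []
    labels = [str(label).strip().strip("()").upper() for label in (choices.get("label") or [])]
    if not labels:
        labels = list(_CHOICE_LETTERS[: len(texts)])
    return labels


def gold_label_for_text(target: str, choices: dict):
    """Map a textual target such as 'No' to its choice label, or None."""
    texts = [str(text).strip().lower() for text in (choices.get("text") or [])]
    labels = labels_for(choices)
    key = str(target).strip().lower()
    if key in texts:
        # Safe: labels has one entry per text whenever explicit labels are missing.
        return labels[texts.index(key)]
    return None
```

Without the fallback, `labels` would be empty for label-less choices and the final index lookup is what raises.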

HAL 9000: "It can only be attributable to human error."

@nickmeinhold

Collaborator Author

CarnotCodeCarver's Review

Verdict: REQUEST_CHANGES

Summary: I verified against origin/feat/openai-judge and did not run the suite, per instruction. The BBH scorer still has a pass-rate-depressing extraction regression in the exact path under investigation, and the binary fallback repeats the same thermodynamic loss: useful signal is discarded before work can be extracted. Carnot: "No reversible engine can be more efficient than the ideal cycle." This parser is making the cycle irreversible too early.

Findings:

  • High: experiment/benchmarks/bbh.py:141-154 truncates to the last 5 lines before running the high-confidence answer patterns. That confirms and extends the seeded finding: an output like The answer is C. followed by 6 explanatory lines becomes None, and Answer: C followed by trailing text also becomes None. Since score_bbh() maps None to False, "unparseable" at bbh.py:172-173, this silently lowers pass_rate and can produce the suspiciously flat BBH rates from #335. Fix: run explicit/high-confidence patterns over full text, then reserve the 5-line tail only for weak last-line fallbacks like bare C or option is C.
  • High: experiment/benchmarks/bbh.py:188-194 repeats the same tail-only mistake for binary answer text. Synthetic binary tasks can now score Answer: No, but only if it appears in the last 5 lines; Answer: No followed by 6 trailing lines still becomes unparseable after extract_choice() fails to recover a valid A/B label. Turing: "We can only see a short distance ahead." Here the code literally only sees a short distance behind. Apply the same fix to extract_choice_text(): explicit answer/final answer/correct answer text patterns should scan the full response, with tail-only used for weak fallbacks.
  • Medium: normalize_gold_for_choices() at experiment/benchmarks/bbh.py:122-127 prefers matching target text before recognizing that a target is already a valid label. If a real multiple-choice row has target == "A" and any choice text also normalizes to "a", the gold label is remapped to that choice’s label instead of label A. It is a low-entropy edge case, but BBH scoring is a measurement apparatus; Maxwell: "The same causes will always produce the same effects." Prefer valid-label parsing first for single-letter targets, then fall back to text matching for binary/textual targets.
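
The label-first ordering from the last finding can be sketched as follows. This is an illustration under the assumption that labels and choice texts are already available as parallel lists; `normalize_gold` is a hypothetical stand-in, not the body of normalize_gold_for_choices().

```python
from typing import List, Optional


def normalize_gold(target: str, labels: List[str], texts: List[str]) -> Optional[str]:
    """Resolve a gold target to a label, trying label parsing before text matching."""
    cleaned = target.strip().strip("()").upper()
    # Step 1: a single-letter target that is already a valid label wins outright.
    if len(cleaned) == 1 and cleaned in labels:
        return cleaned
    # Step 2: only then fall back to matching the target against choice text.
    key = target.strip().lower()
    normalized_texts = [text.strip().lower() for text in texts]
    if key in normalized_texts:
        return labels[normalized_texts.index(key)]
    return None
```

With choices whose text happens to be a bare letter, a text-first implementation would send target "A" to whichever label carries the text "a"; the label-first order keeps it at A.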

The Good:

  • experiment/run_pilot.py:497-499 fixes the Haiku/provider judge escalation threshold correctly: echo-judge and provider judge arms accept at 3 calls and count escalation only above 3. That is the right source-of-truth change for Sonnet-fired accounting.
  • The PR adds focused tests for reasoning option mentions versus explicit final-answer sentences, which is the right pressure point for the parser. They need the early-answer-plus-trailing-lines cases added before this can close #335.

The Concerns:

  • I did not run tests, as requested. Review was by direct file reads from origin/feat/openai-judge plus dataset-shape context from Hugging Face: https://huggingface.co/datasets/Joschka/big_bench_hard
  • The OpenAI/Gemini judge arms are additive and lower priority for this scope; I did not spend review budget on provider model availability or API naming beyond noting they do not affect the BBH extraction defect.

…match

Two verified correctness bugs in the BBH answer parser, both of which
silently corrupt pass_rate (the exact path under investigation #335):

1. extract_choice() / extract_choice_text() truncated to the last 5 lines
   BEFORE matching high-confidence patterns, so an answer stated early and
   followed by trailing reasoning ("The answer is C." + more lines) became
   unparseable -> scored as a failure. High-confidence patterns now run
   against the FULL text; only the weak positional fallbacks (lone "(A)"
   line, trailing single letter) stay tail-only.

2. Under re.I, [A-Z] also matches lowercase, so the broad "answer is X"
   pattern grabbed the first letter of the following word
   ("the answer is straightforward" -> "S"). Each capture now has a
   (?![A-Za-z]) boundary guard. This was bidirectional and silent: a valid
   bogus label was accepted by score_bbh without reaching the text-extraction
   safety net.

Also from the cage-match:
- normalize_gold_for_choices(): fall back to positional A,B,C... when a
  choice list carries text without explicit labels (was IndexError); and
  prefer label parsing first for single-letter targets so a decoy choice
  text can't remap the gold (Carnot).
- BINARY_CHOICE_TEXTS: add valid/invalid so the formal_fallacies subtask
  can synthesize binary choices instead of raising (Kelvin).

Adds regression tests for all of the above. Scoring suite: 20 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@nickmeinhold

Collaborator Author

Fixes pushed: 14565e6 — addresses the cage-match consensus findings (Maxwell + Kelvin + Carnot all REQUEST_CHANGES).

  • [CRITICAL] tail-truncation: extract_choice/extract_choice_text now run high-confidence answer:/the answer is X patterns against the full text; only the weak positional fallbacks (lone (A) line, trailing single letter) stay tail-only. "Answer stated early, prose after" no longer goes unparseable.
  • [CRITICAL] (?i) + [A-Z] over-match — added a (?![A-Za-z]) boundary guard so "the answer is straightforward" no longer yields S. This one was bidirectional + silent (a bogus-but-valid label was accepted before reaching the text fallback).
  • normalize_gold_for_choices — positional label fallback (no more IndexError on label-less choices; Kelvin) and label-first parsing for single-letter targets (decoy text can't remap gold; Carnot).
  • BINARY_CHOICE_TEXTS — added valid/invalid so formal_fallacies can synthesize binary choices (Kelvin).
  • Regression tests added for every case above. Scoring suite: 20 passed.
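
For readers who have not opened the diff, the two-tier structure described in the bullets above looks roughly like this. It is a simplified sketch with a single high-confidence pattern and a single weak fallback; the regexes, the `tail_lines` parameter, and this `extract_choice` body are assumptions for illustration, not the code in bbh.py.

```python
import re

# Tier 1: explicit declarations, matched anywhere in the response.
HIGH_CONFIDENCE = re.compile(
    r"(?i)\b(?:final\s+answer|answer)\s*(?:is|:)\s*\(?\s*([A-Z])(?![A-Za-z])\s*\)?"
)
# Tier 2: a line that is nothing but a label, e.g. "(A)" or "C." (case-sensitive).
WEAK_LINE = re.compile(r"^\s*\(?([A-Z])\)?\.?\s*$")


def extract_choice(text: str, tail_lines: int = 5):
    """Return a choice letter, preferring explicit declarations over position."""
    declared = HIGH_CONFIDENCE.findall(text)
    if declared:
        # Last explicit declaration in the full text.
        return declared[-1].upper()
    # Weak fallback: only the last few lines, scanned from the bottom up.
    for line in reversed(text.strip().splitlines()[-tail_lines:]):
        match = WEAK_LINE.match(line)
        if match:
            return match.group(1)
    return None
```

Keeping the weak tier tail-only means a lone "(A)" buried in early reasoning cannot win, while an explicit "The answer is C." is found no matter how much prose follows it.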

One adjudication for the record: Kelvin's "escalation threshold clusterfuck" finding is a false positive: arm_echo_small_judge returns 2 on accept (run_pilot.py:322), so the else 2 bucket is correct. Verified by Carnot and Maxwell; the threshold fix was left as-is.

Net for #335: the low/flat BBH pass rates were (at least partly) a harness artifact in the parser, now fixed. Worth re-running the n=10 pilot before scaling to the full sweep (#336).

…n families

Re-review (cage-match round 2, Carnot) found a latent ordering bug in
extract_choice: the high-confidence loop returned on the first pattern
*family* that matched anywhere in the text, ignoring position. So a chain
of thought like "Answer: A ... therefore the answer is C" returned A — an
early scratch declaration beat the final answer because the two used
different pattern families.

Fix: scan all high-confidence patterns with re.finditer and keep the match
with the largest source offset (true recency across families), instead of
returning from the first family with any match. Weak tail fallbacks are
unchanged.

Adds a regression test for the cross-family case. Scoring suite: 21 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@maxwell-merge-slam maxwell-merge-slam Bot left a comment


MaxwellMergeSlam's Review (Round 2 — re-review of 14565e6)

Verdict: APPROVE

Rocky Balboa: "It ain't about how hard you hit. It's about how hard you can get hit and keep moving forward." — the parser took three reviewers to the jaw, got back up, and now blocks the punches it used to eat.

Summary: All five prior findings are genuinely resolved, the regex rework restores the pre-PR full-text/tail structure plus a correctness guard, the new regression tests lock in the exact failure modes, and I found no new issue introduced by the fix.

Prior findings — re-verified against the pushed code:

  • [CRITICAL #1: tail truncation] RESOLVED. extract_choice now runs high-confidence patterns against the full body and reserves tail only for weak positional fallbacks (bbh.py:151-178). extract_choice_text likewise matches full body (bbh.py:195-206), safe because a capture is only accepted if answer in texts. Verified: "The answer is C." + 6 trailing lines → C; "Answer: C" + junk → C.
  • [CRITICAL #2: (?i)+[A-Z] over-match] RESOLVED. Every capture carries a (?![A-Za-z]) guard (9 sites). Verified: "the answer is straightforward" → None (was S). The guard only rejects a letter immediately followed by another letter, which a real single-char label never is, so it costs nothing legitimate.
  • [MEDIUM: IndexError] RESOLVED. normalize_gold_for_choices falls back to positional A,B,C… when choices carry text without labels (bbh.py:118-122). Verified: {"text":["Yes","No"]} → B, no raise.
  • [LOW: gold remap] RESOLVED. Label-first parsing for single-letter targets (bbh.py:124-130); binary text targets correctly extract_choice-to-None and fall through to text matching. Verified: target "A" with decoy text "A" → A.
  • [NIT: binary map] RESOLVED. valid/invalid added to BINARY_CHOICE_TEXTS, unblocking formal_fallacies.

The Good:

  • The fix is altitude, not patches: one structural change (full-text high-confidence + tail-only weak + boundary guard) kills both criticals together, rather than four spot-patches. Morpheus: "I'm trying to free your mind, Neo."
  • Regression tests are targeted at the precise gaps that let the bugs through round 1 (test_answer_stated_early_then_six_trailing_lines, test_answer_is_lowercase_word_is_not_a_false_letter, the two TestNormalizeGoldForChoices cases). 20/20 green.

The Concerns:

  • Transparency note, not a blocker: when a model states "the answer is A" explicitly early and then gives only a bare B on the last line with no restatement, the parser returns A (early high-confidence beats late weak fallback). This is identical to pre-PR behavior — the original also ran high-confidence patterns on full text first — so it's not a regression this PR introduces, and it's the rarer pattern (self-correcting models almost always restate "the answer is B"). Flagging only so it's on the record; no change required.
  • run_pilot.py design nits (judge-model identity dicts → StrEnum; sub_calls conflating frontier/Haiku/local calls → cost-comparability for #336) remain open as a tracked follow-up, intentionally out of this fix's scope.
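
As a starting point for that follow-up, one way to keep `sub_calls` as a plain count while adding a cost-comparable field is to log the tier of each call. Everything in this sketch is hypothetical: the `CallLog` class, the tier names, and especially the weights, which are illustrative placeholders and not real prices.

```python
from dataclasses import dataclass, field
from typing import List

# Placeholder relative weights for illustration only; NOT measured prices.
TIER_WEIGHT = {"local": 0.0, "cheap": 1.0, "frontier": 10.0}


@dataclass
class CallLog:
    """Records the tier of every model call made while running one task."""

    tiers: List[str] = field(default_factory=list)

    def record(self, tier: str) -> None:
        if tier not in TIER_WEIGHT:
            raise ValueError(f"unknown tier: {tier!r}")
        self.tiers.append(tier)

    @property
    def sub_calls(self) -> int:
        """Raw call count (here every recorded call counts as one)."""
        return len(self.tiers)

    @property
    def weighted_cost(self) -> float:
        """Call count weighted by tier, comparable across arms."""
        return sum(TIER_WEIGHT[tier] for tier in self.tiers)
```

Two arms can then report the same `sub_calls` and still be separated on a cost axis, which is what a Pareto cost/accuracy plot needs.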

Net: ship it. The blocking review from round 1 is satisfied. Arnold: "Come with me if you want to live."

@kelvin-bit-brawler kelvin-bit-brawler Bot left a comment


Ripgrep is not available. Falling back to GrepTool.

KelvinBitBrawler's Review

Verdict: APPROVE
Summary: You managed to halt the entropy of your codebase and actually fix the regex disasters without causing a new heat death; hell must have frozen the fuck over.

Findings:

  • Prior finding 1 (Truncation to last 5 lines): RESOLVED. Running high-confidence patterns against the full text while pulling matches[-1] (or iterating reversed(matches) in the text fallback) correctly preserves the final stated answer without letting early declarations fall off the event horizon.
  • Prior finding 2 (re.I false positives on lowercase prose): RESOLVED. The (?![A-Za-z]) negative lookahead is a surgically precise freeze ray. It successfully stops "the answer is straightforward" from returning 'S' while keeping legitimate captures intact.
  • Prior finding 3 (IndexError on text-only choices): RESOLVED. Your positional fallback labels = list(_CHOICE_LETTERS[: len(texts)]) is an ice-cold, correct safety net when explicit labels are missing.
  • Prior finding 4 (Single-letter gold remapped by decoy text): RESOLVED. Prioritizing label parsing extract_choice(str(target)) before text mapping perfectly shields single-letter targets like "C" from being hijacked by a decoy choice text.
  • Prior finding 5 (Missing valid/invalid): RESOLVED. Safely added to BINARY_CHOICE_TEXTS.

The Good:

  • Roy Batty: "I've seen things you people wouldn't believe. C-beams glittering in the dark near the Tannhäuser Gate... and a PR where all five critical bugs were genuinely fixed on the first fucking retry."
  • The prompt_imports injection in run_pilot.py is a ruthless, effective hack to stop NameError exceptions when these models get lazy with their def outputs.
  • The tests you added are explicitly targeted and prove you actually absorbed the beatdown from the previous cage match.

The Concerns:

  • None. I tried to shatter your new regexes by looking for catastrophic back-tracking or multi-line bleeding, but your use of (.+?) guarded by (?:\.|\n|$) combined with .strip() keeps the runtime thermal noise at absolute zero.
  • And yes, I see the run_pilot.py threshold logic is intact—I know echo-small-judge returns 2 on accept, so your > threshold math holds up fine. I'm not going to flag it just to be a prick.

Merge this before I change my mind and throw you back in the cryo-chamber.

@nickmeinhold

Collaborator Author

CarnotCodeCarver's Review

Verdict: REQUEST_CHANGES

Summary: Carnot: "No engine is reversible if its heat flows through an avoidable gradient." The five targeted fixes are mostly present in the supplied diff, but the new regex structure still has a non-reversible ordering bug: it can prefer an earlier Answer: over a later explicit final answer. I did not run tests, per instruction; I reviewed the code paths directly. Local checkout is not on the PR branch and does not contain 14565e6, so this is based on the provided diff.

Findings:

  • [HIGH] experiment/benchmarks/bbh.py extract_choice: high-confidence matching returns from the first regex family that has any match, not from the latest high-confidence declaration overall. Example: Answer: A\n... reconsidered ...\nTherefore the answer is C. returns A, because the ^answer: pattern matches first and exits before the later therefore the answer is pattern is considered. That is exactly the full-text regression risk called out in the brief. The fix should collect all high-confidence matches with positions and choose the last by source offset, or otherwise impose a true recency policy across pattern families.

The Good:

  • (1) Full-text matching for explicit answer declarations is present; early Answer: C followed by more than five trailing lines is no longer lost.
  • (2) The (?![A-Za-z]) guard correctly blocks answer is straightforward / lowercase prose first-letter captures under re.I, while preserving normal A, (A), A., and lowercase single-letter answers like answer: c.
  • (3) normalize_gold_for_choices now falls back to positional A, B, C labels when choice text exists without labels, avoiding the prior IndexError.
  • (4) Single-letter gold targets are parsed label-first before text matching, so a decoy choice text of A no longer remaps gold A to another label.
  • (5) BINARY_CHOICE_TEXTS includes valid / invalid, and synthetic binary choices cover those targets.

The Concerns:

  • The local workspace is on data/bbh-pilot-n10 at 2f1d813; experiment/benchmarks/bbh.py and its tests are still the old versions locally. I could not verify the pushed commit by reading the checked-out files.
  • extract_choice_text only accepts exact normalized choice text. Answer: No, because ... or The answer is No because ... remains unparseable for binary text answers. That may be acceptable if the prompt contract is strict, but it is not covered by the new tests.
  • Tests cover early explicit answers and lowercase false positives, but they do not cover conflicting high-confidence declarations across different regex families. Add the self-correction case above before approving.

@nickmeinhold

Collaborator Author

Re-review resolved: 4868b87 — Carnot's round-2 [HIGH] finding is fixed.

Carnot caught a latent cross-pattern-family ordering bug that Maxwell and Kelvin both missed (different inductive bias earning its keep): the high-confidence loop returned from the first pattern family with any match, ignoring position — so Answer: A ... therefore the answer is C returned A (early scratch line beat the final answer, because the two used different families).

Fix: extract_choice now scans all high-confidence patterns with re.finditer and keeps the match with the largest source offset — true recency across families. Added a regression test for the cross-family case. Scoring suite: 21 passed.
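
The difference between "first family with a match" and "largest source offset" is easiest to see side by side. The sketch below uses two simplified pattern families; `FAMILIES`, `first_family_wins`, and `latest_declaration` are hypothetical names, and the regexes are stand-ins for the real pattern list.

```python
import re

# Two simplified high-confidence families (stand-ins for the real list).
FAMILIES = [
    re.compile(r"(?im)^\s*answer\s*:\s*\(?([A-Z])(?![A-Za-z])\)?"),
    re.compile(r"(?i)\b(?:the\s+)?(?:final\s+)?answer\s+is\s*\(?([A-Z])(?![A-Za-z])\)?"),
]


def first_family_wins(text: str):
    """Old behaviour: return from the first family that matches anywhere."""
    for pattern in FAMILIES:
        matches = list(pattern.finditer(text))
        if matches:
            return matches[-1].group(1).upper()
    return None


def latest_declaration(text: str):
    """New behaviour: scan every family and keep the match that starts latest."""
    best = None  # (start offset, letter)
    for pattern in FAMILIES:
        for match in pattern.finditer(text):
            if best is None or match.start() > best[0]:
                best = (match.start(), match.group(1).upper())
    return best[1] if best else None
```

On a self-correcting trace such as "Answer: A", some reconsideration, then "Therefore the answer is C.", the first function stops at the `answer:` family while the second follows the later declaration.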

Round-2 verdicts were against 14565e6: Maxwell APPROVE, Kelvin APPROVE, Carnot REQUEST_CHANGES (now addressed in 4868b87). The two prior critical bugs + IndexError + gold-remap + binary-map remain fixed.

@nickmeinhold

Collaborator Author

Superseded by #3 (merged to main as 48d5284). The parser fixes from this branch (commits 14565e6 + 4868b87 — tail-truncation, (?![A-Za-z]) guard, cross-family recency, IndexError, gold-remap, binary map) were carried into integrate/judge-branches and are now on main, verified byte-identical. Closing as absorbed; no work lost.


2 participants