fix(bbh): support Yes/No targets (causal_judgement, web_of_lies, navigate, sports_understanding) by nickmeinhold · Pull Request #2 · enspyrco/echo

nickmeinhold · 2026-06-03T02:04:47Z

What this fixes

Adi's BBH harness (#39b767d) only handles multiple-choice (MCQ) subtasks, so loading `causal_judgement` crashes:

```
ValueError: Could not parse gold target: 'No'
at normalize_gold (bbh.py:87)
at _row_to_task (bbh.py:135)
```

Several BBH subtasks have boolean targets instead of MCQ letters: `causal_judgement`, `web_of_lies`, `navigate`, `sports_understanding`. Their dataset rows have no `choices` field, only `question` and `target` ("Yes" or "No"). The current `format_prompt` would also render a broken prompt for them, telling the model to answer with a letter when there are no letter choices.
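
To make the two row shapes concrete, here is a minimal sketch of gold normalization and prompt branching. It is not the code in `benchmarks/bbh.py`: the names `normalize_gold_sketch` and `format_prompt_sketch`, the prompt wording, and the assumption that MCQ gold targets look like `(A)` are all illustrative.

```python
import re
from typing import Optional, Sequence

# Hypothetical helpers for illustration only; the real functions in
# benchmarks/bbh.py may use different signatures and prompt wording.
_LETTER_GOLD = re.compile(r"^\(?([A-Za-z])\)?$")


def normalize_gold_sketch(target: str) -> str:
    """Map a gold target to "YES"/"NO" or to a single uppercase letter."""
    cleaned = target.strip()
    if cleaned.lower() in ("yes", "no"):
        return cleaned.upper()
    match = _LETTER_GOLD.match(cleaned)
    if match:
        return match.group(1).upper()
    raise ValueError(f"Could not parse gold target: {target!r}")


def format_prompt_sketch(question: str, choices: Optional[Sequence[str]] = None) -> str:
    """Ask for a letter only when the row actually has choices."""
    if choices:
        lettered = "\n".join(
            f"({chr(ord('A') + i)}) {choice}" for i, choice in enumerate(choices)
        )
        return f"{question}\n{lettered}\nFinish with 'Answer: X', where X is the letter."
    # Boolean subtasks: no letters exist, so ask for Yes or No instead.
    return f"{question}\nFinish with 'Answer: Yes' or 'Answer: No'."
```

With this shape, a row whose target is "No" normalizes to "NO" instead of raising, and a row without `choices` never gets a letter instruction.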

Changes

| File | Change |
| --- | --- |
| `benchmarks/bbh.py` | + `extract_yes_no()` parallel to `extract_choice()`, same layered-regex shape |
| `benchmarks/bbh.py` | + `extract_answer()` unified extractor, Yes/No first (see note below) |
| `benchmarks/bbh.py` | `format_prompt` branches: MCQ → "Answer: X"; Yes/No → "Answer: Yes / Answer: No" |
| `benchmarks/bbh.py` | `normalize_gold` accepts "Yes"/"No" targets, normalizes to "YES"/"NO" |
| `benchmarks/bbh.py` | `score_bbh` routes extraction by gold shape so Yes/No reasoning that mentions "(A)" in passing isn't mis-scored as MCQ |
| `benchmarks/bbh.py` | `_row_to_task` tolerates missing `choices` field |
| `benchmarks/bbh_arms.py` | `lexical_agree` uses `extract_answer` so Yes/No subtasks get the letter-equality fast path |
| `tests/test_bbh_scoring.py` | +9 new tests covering Yes/No extraction, scoring, prompt format, cross-shape edge cases |
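
The routing in `score_bbh` can be pictured roughly as follows. This is a sketch under assumptions, not the PR's implementation: the single-regex extractors, the "last yes/no wins" rule, and the name `score_sketch` are hypothetical stand-ins for the real layered regexes.

```python
import re
from typing import Optional


# Hypothetical single-layer extractors; the real ones are layered regexes.
def extract_choice_sketch(text: str) -> Optional[str]:
    match = re.search(r"\(([A-Z])\)", text)
    return match.group(1) if match else None


def extract_yes_no_sketch(text: str) -> Optional[str]:
    # Assumption: take the last standalone yes/no in the response.
    matches = re.findall(r"\b(yes|no)\b", text, flags=re.IGNORECASE)
    return matches[-1].upper() if matches else None


def score_sketch(response: str, gold: str) -> bool:
    """Pick the extractor from the gold's shape, then compare."""
    if gold in ("YES", "NO"):
        predicted = extract_yes_no_sketch(response)
    else:
        predicted = extract_choice_sketch(response)
    return predicted == gold


response = "Claim (A) looks plausible at first, but the act was not intentional. Answer: No"
print(extract_choice_sketch(response))  # A
print(score_sketch(response, "NO"))     # True
```

Because the gold is "NO", the MCQ extractor is never consulted, so the passing mention of "(A)" cannot affect the score.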

Why Yes/No-first in `extract_answer`

`extract_choice`'s "Answer: X" regex has no `$` anchor, so it matches the "Y" in "Answer: Yes" if it runs before `extract_yes_no` gets a look. Running `extract_yes_no` first in the unified extractor avoids that false match. For genuine letter answers ("Answer: B"), `extract_yes_no` returns `None` and the call falls through to `extract_choice` as before. A unit test caught this ordering problem.
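
The ordering issue can be reproduced with a pair of simplified patterns. The regexes and the `*_sketch` names below are assumptions made for illustration; the real extractors in `benchmarks/bbh.py` are layered and may differ.

```python
import re
from typing import Optional

# Hypothetical patterns standing in for the real layered regexes.
CHOICE_RE = re.compile(r"Answer:\s*\(?([A-Z])")  # no end anchor
YES_NO_RE = re.compile(r"Answer:\s*(yes|no)\b", re.IGNORECASE)


def extract_choice_sketch(text: str) -> Optional[str]:
    match = CHOICE_RE.search(text)
    return match.group(1) if match else None


def extract_yes_no_sketch(text: str) -> Optional[str]:
    match = YES_NO_RE.search(text)
    return match.group(1).upper() if match else None


def extract_answer_sketch(text: str) -> Optional[str]:
    """Try Yes/No first, then fall through to the letter extractor."""
    yes_no = extract_yes_no_sketch(text)
    if yes_no is not None:
        return yes_no
    return extract_choice_sketch(text)


print(extract_choice_sketch("Answer: Yes"))  # Y  (the false match)
print(extract_answer_sketch("Answer: Yes"))  # YES
print(extract_answer_sketch("Answer: B"))    # B
```

Anchoring the letter regex would be another way to avoid the false match, but trying Yes/No first leaves the existing `extract_choice` behaviour untouched.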

Test plan

```bash
cd experiment
.venv/bin/python -m unittest tests.test_bbh_scoring tests.test_bbh_arms -v # 24 tests, all green
.venv/bin/python -c "from benchmarks.bbh import load_bbh; print(len(load_bbh(['causal_judgement'], n_per_subtask=3)))" # prints 3
```

Follow-up (not in this PR)

Other BBH subtasks have shapes we still don't handle:

- `dyck_languages`, `word_sorting` → free-form sequence output
- `multistep_arithmetic_two`, `object_counting` → numeric output
- `formal_fallacies` → "valid"/"invalid" (similar to Yes/No, could reuse the pattern)

Worth a follow-up issue/PR to extend to numeric and free-form once the Yes/No path proves out.

🤖 Generated with Claude Code

The pilot harness only handled MCQ subtasks. Loading causal_judgement crashed at _row_to_task → normalize_gold('No') because extract_choice can't parse a Yes/No string. Several BBH subtasks have this shape: causal_judgement, web_of_lies, navigate, sports_understanding.

Changes:
- Add extract_yes_no() with the same layered regex approach as extract_choice
- Add unified extract_answer() (Yes/No first to avoid extract_choice's "Y" in "Yes" false-match), used by bbh_arms.lexical_agree
- format_prompt branches on whether choices is present: MCQ "Answer: X" or boolean "Answer: Yes / Answer: No"
- normalize_gold accepts Yes/No targets, normalizing to "YES"/"NO"
- score_bbh routes by gold shape so Yes/No reasoning that happens to mention "(A)" isn't mis-scored as MCQ
- _row_to_task handles missing 'choices' field on the row
- 9 new unit tests covering Yes/No extraction, scoring, prompt format, and the cross-shape edge cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

maxwell-merge-slam Bot mentioned this pull request Jun 9, 2026

feat(experiment): GPT-4o-mini cross-family judge + escalation/imports fixes #1

Closed

3 tasks

nickmeinhold wants to merge 1 commit into main from fix/bbh-yes-no-targets
