fix(bbh): support Yes/No targets (causal_judgement, web_of_lies, navigate, sports_understanding) #2

Open
nickmeinhold wants to merge 1 commit into main from fix/bbh-yes-no-targets

@nickmeinhold

Collaborator

What this fixes

Adi's BBH harness (#39b767d) only handled MCQ subtasks. Loading causal_judgement crashes:

```
ValueError: Could not parse gold target: 'No'
at normalize_gold (bbh.py:87)
at _row_to_task (bbh.py:135)
```

Several BBH subtasks have boolean targets instead of MCQ letters: causal_judgement, web_of_lies, navigate, sports_understanding. The dataset rows for these have no choices field, just question and target ("Yes" or "No"). The current format_prompt would also render a broken prompt for these (telling the model to answer with a letter when there are no letter choices).
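The prompt branching described above might look roughly like the following. This is a minimal sketch, not the code in `benchmarks/bbh.py`: the `format_prompt` signature and the exact instruction wording are assumptions made for illustration. Only the row shape (`question`, `target`, optional `choices`) comes from the description above.

```python
import string
from typing import Optional, Sequence


def format_prompt(question: str, choices: Optional[Sequence[str]] = None) -> str:
    """Build a prompt whose answer instruction matches the row's shape.

    Hypothetical signature and wording, for illustration only.
    """
    if choices:
        # MCQ row: list lettered options and ask for a letter.
        options = "\n".join(
            f"({string.ascii_uppercase[i]}) {choice}"
            for i, choice in enumerate(choices)
        )
        return f"{question}\n{options}\n\nEnd with a line of the form 'Answer: X'."
    # Boolean row: no options to list, so ask for Yes or No instead of a letter.
    return f"{question}\n\nEnd with a line of the form 'Answer: Yes' or 'Answer: No'."


# A boolean row has no 'choices' key, so read it with .get() rather than indexing.
boolean_row = {"question": "Did the driver cause the crash?", "target": "No"}
boolean_prompt = format_prompt(boolean_row["question"], boolean_row.get("choices"))

mcq_row = {"question": "Which is larger?", "choices": ["2", "3"], "target": "(B)"}
mcq_prompt = format_prompt(mcq_row["question"], mcq_row.get("choices"))
```

Reading `choices` with `.get()` is one way to tolerate its absence without a separate schema check per subtask.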

Changes

| File | Change |
| --- | --- |
| `benchmarks/bbh.py` | + `extract_yes_no()` parallel to `extract_choice()`, same layered-regex shape |
| `benchmarks/bbh.py` | + `extract_answer()` unified extractor, Yes/No first (see note below) |
| `benchmarks/bbh.py` | `format_prompt` branches: MCQ → "Answer: X"; Yes/No → "Answer: Yes / Answer: No" |
| `benchmarks/bbh.py` | `normalize_gold` accepts "Yes"/"No" targets, normalizes to "YES"/"NO" |
| `benchmarks/bbh.py` | `score_bbh` routes extraction by gold shape so Yes/No reasoning that mentions "(A)" in passing isn't mis-scored as MCQ |
| `benchmarks/bbh.py` | `_row_to_task` tolerates missing `choices` field |
| `benchmarks/bbh_arms.py` | `lexical_agree` uses `extract_answer` so Yes/No subtasks get the letter-equality fast path |
| `tests/test_bbh_scoring.py` | +9 new tests covering Yes/No extraction, scoring, prompt format, cross-shape edge cases |
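To make the "routes extraction by gold shape" change concrete, here is a rough sketch of the idea. The regexes, the fallback layers, and the `score_response` name are all assumptions for illustration. The real `score_bbh` and extractors in `benchmarks/bbh.py` may differ in signature and detail.

```python
import re
from typing import Optional


def normalize_gold(target: str) -> str:
    """Canonicalize a gold target to "YES"/"NO" or a single uppercase letter."""
    stripped = target.strip()
    if stripped.lower() in ("yes", "no"):
        return stripped.upper()
    match = re.fullmatch(r"\(?([A-Za-z])\)?", stripped)
    if match:
        return match.group(1).upper()
    raise ValueError(f"Could not parse gold target: {target!r}")


def extract_choice(text: str) -> Optional[str]:
    """Layer 1: an explicit 'Answer: X'. Layer 2: the last '(X)' mentioned."""
    match = re.search(r"Answer:\s*\(?([A-Z])\)?(?![A-Za-z])", text)
    if match:
        return match.group(1)
    hits = re.findall(r"\(([A-Z])\)", text)
    return hits[-1] if hits else None


def extract_yes_no(text: str) -> Optional[str]:
    """Layer 1: an explicit 'Answer: Yes/No'. Layer 2: the last bare yes/no."""
    match = re.search(r"Answer:\s*(yes|no)\b", text, re.IGNORECASE)
    if match:
        return match.group(1).upper()
    hits = re.findall(r"\b(yes|no)\b", text, re.IGNORECASE)
    return hits[-1].upper() if hits else None


def score_response(gold_target: str, response: str) -> bool:
    """Pick the extractor from the gold label's shape, then compare."""
    gold = normalize_gold(gold_target)
    extractor = extract_yes_no if gold in ("YES", "NO") else extract_choice
    return extractor(response) == gold


# Yes/No reasoning that mentions "(A)" in passing.
response = "Unlike case (A), the agent did not intend it, so no."
```

With this response, `extract_choice` alone would return "A", while routing on the gold label "No" selects `extract_yes_no` and scores the answer as correct.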

Why Yes/No-first in extract_answer

extract_choice's "Answer: X" regex has no $ anchor, so it matches the "Y" in "Answer: Yes" before extract_yes_no would get a look. Putting extract_yes_no first in the unified extractor short-circuits cleanly. For genuine letter answers ("Answer: B"), extract_yes_no returns None and the call falls through to extract_choice as before. Caught by a unit test.
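The false match and the ordering fix can be reproduced in a few lines. The patterns below are simplified stand-ins, assumed for illustration, and are not the layered regexes actually used in `benchmarks/bbh.py`.

```python
import re
from typing import Optional

# Hypothetical, simplified patterns. The unanchored letter pattern is the
# one that produces the false match on "Answer: Yes".
_CHOICE_RE = re.compile(r"Answer:\s*\(?([A-Z])\)?", re.IGNORECASE)
_YES_NO_RE = re.compile(r"Answer:\s*(yes|no)\b", re.IGNORECASE)


def extract_choice(text: str) -> Optional[str]:
    """Return the letter after 'Answer:', or None if there is none."""
    match = _CHOICE_RE.search(text)
    return match.group(1).upper() if match else None


def extract_yes_no(text: str) -> Optional[str]:
    """Return "YES" or "NO" after 'Answer:', or None if there is none."""
    match = _YES_NO_RE.search(text)
    return match.group(1).upper() if match else None


def extract_answer(text: str) -> Optional[str]:
    """Try Yes/No first, then fall through to the letter extractor."""
    return extract_yes_no(text) or extract_choice(text)


# On its own, the letter extractor grabs the "Y" of "Yes".
false_match = extract_choice("Answer: Yes")
# The unified extractor checks Yes/No first and avoids the false match.
unified_yes = extract_answer("Answer: Yes")
# Genuine letter answers still fall through to extract_choice.
unified_letter = extract_answer("Answer: B")
```

An alternative would be to tighten the letter pattern itself (for example with a negative lookahead for a following letter); the ordering approach leaves `extract_choice` untouched.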

Test plan

```bash
cd experiment
.venv/bin/python -m unittest tests.test_bbh_scoring tests.test_bbh_arms -v # 24 tests, all green
.venv/bin/python -c "from benchmarks.bbh import load_bbh; print(len(load_bbh(['causal_judgement'], n_per_subtask=3)))" # prints 3
```

Follow-up (not in this PR)

Other BBH subtasks have shapes we still don't handle:

  • dyck_languages, word_sorting → free-form sequence output
  • multistep_arithmetic_two, object_counting → numeric output
  • formal_fallacies → "valid"/"invalid" (similar to Yes/No, could reuse the pattern)

Worth a follow-up issue/PR to extend to numeric and free-form once the Yes/No path proves out.

🤖 Generated with Claude Code

The pilot harness only handled MCQ subtasks. Loading causal_judgement
crashed at _row_to_task → normalize_gold('No') because extract_choice
can't parse a Yes/No string. Several BBH subtasks have this shape:
causal_judgement, web_of_lies, navigate, sports_understanding.

Changes:
- Add extract_yes_no() with the same layered regex approach as extract_choice
- Add unified extract_answer() (Yes/No first to avoid extract_choice's "Y" in
  "Yes" false-match), used by bbh_arms.lexical_agree
- format_prompt branches on whether choices is present: MCQ "Answer: X"
  or boolean "Answer: Yes / Answer: No"
- normalize_gold accepts Yes/No targets, normalizing to "YES"/"NO"
- score_bbh routes by gold shape so Yes/No reasoning that happens to
  mention "(A)" isn't mis-scored as MCQ
- _row_to_task handles missing 'choices' field on the row
- 9 new unit tests covering Yes/No extraction, scoring, prompt format,
  and the cross-shape edge cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>