Batch eval generation across turns via --eval_batch_size by TonyChen06 · Pull Request #62 · ELM-Research/ECG-Language-Models

TonyChen06 · 2026-06-10T01:37:21Z

Third piece of splitting up #12 into simpler PRs — the batched-generation core, rewritten against current main (the original predates the thinking-split metrics, pretrain breakdown, max_new_tokens plumbing, and the current ELM names, so a rebase would have resurrected stale behavior).

What

The evaluator generated one turn at a time: per sample, per response range, one generate() call at batch size 1. This restructures evaluate() into two phases:

flatten_eval_turns — expands every (sample, turn) pair into a flat work item: prefix ids/mask, per-turn masked signal_id_indices, per-sample encoder outputs, ground truth. Same logic as the previous inner loop, run once up front.
Batched generation — chunks of --eval_batch_size turns, left-padded to the chunk max (collate_turns), signal indices shifted per item, one generate() per chunk. Results are re-assembled in original order before the unchanged metric tail.
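The left-padding and index shift described above can be sketched as follows. This is a minimal illustration, not the PR's `collate_turns`: the function name `collate_turns_sketch`, the dict keys, the pad id of 0, and the use of plain lists instead of tensors are all assumptions made to keep the example dependency-free. Left padding matters for decoder-only generation because new tokens are appended on the right, so each prefix has to end at the last position of its row.

```python
from typing import Dict, List


def collate_turns_sketch(
    items: List[Dict[str, List[int]]], pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Left-pad each prefix to the chunk max and shift its signal indices.

    Hypothetical stand-in for the PR's collate_turns; pad_id=0 is an assumption.
    """
    max_len = max(len(it["input_ids"]) for it in items)
    batch: Dict[str, List[List[int]]] = {
        "input_ids": [],
        "attention_mask": [],
        "signal_id_indices": [],
    }
    for it in items:
        n = len(it["input_ids"])
        pad = max_len - n  # number of pad tokens prepended to this item
        batch["input_ids"].append([pad_id] * pad + it["input_ids"])
        batch["attention_mask"].append([0] * pad + [1] * n)
        # Every original position moves right by `pad`, so the indices must too.
        batch["signal_id_indices"].append([i + pad for i in it["signal_id_indices"]])
    return batch


SIGNAL = 9  # made-up placeholder token id marking signal positions
chunk = [
    {"input_ids": [5, 9, 9, 7], "signal_id_indices": [1, 2]},
    {"input_ids": [5, 9, 7, 8, 8, 3], "signal_id_indices": [1]},
]
batch = collate_turns_sketch(chunk)

# After shifting, each index still points at a signal placeholder token.
for row, idxs in zip(batch["input_ids"], batch["signal_id_indices"]):
    assert all(row[i] == SIGNAL for i in idxs)

print(batch["input_ids"])          # [[0, 0, 5, 9, 9, 7], [5, 9, 7, 8, 8, 3]]
print(batch["signal_id_indices"])  # [[3, 4], [1]]
```

Without the per-item shift, the shorter item's indices would point at pad tokens or at the wrong prefix tokens once padding is prepended.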

--eval_batch_size defaults to 1, which preserves current behavior exactly — same generate() calls, same RNG consumption order.
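A rough sketch of why a batch size of 1 keeps the call sequence unchanged: flattening fixes the (sample, turn) order once, and chunking only changes how many consecutive items go into each call. The helper names (`flatten_turns_sketch`, `generate_in_chunks`), the sample layout, and the stand-in generator are hypothetical. The stand-in is deterministic, so it does not reproduce the bfloat16 divergence seen with real batched kernels.

```python
from typing import Callable, Dict, List, Tuple


def flatten_turns_sketch(samples: List[Dict]) -> List[Dict]:
    """Expand every (sample, turn) pair into one flat work item, in order."""
    items = []
    for s_idx, sample in enumerate(samples):
        for t_idx, turn in enumerate(sample["turns"]):
            items.append(
                {"sample": s_idx, "turn": t_idx,
                 "prefix": turn["prefix"], "truth": turn["truth"]}
            )
    return items


def generate_in_chunks(
    items: List[Dict], batch_size: int, generate_fn: Callable[[List[Dict]], List[str]]
) -> List[str]:
    """Call generate_fn once per chunk; outputs stay in the flattened order."""
    outputs: List[str] = []
    for start in range(0, len(items), batch_size):
        outputs.extend(generate_fn(items[start:start + batch_size]))
    return outputs


def run(samples: List[Dict], batch_size: int) -> Tuple[List[str], List[int]]:
    call_sizes: List[int] = []  # records the size of each generate call

    def fake_generate(chunk: List[Dict]) -> List[str]:
        # Deterministic stand-in for model.generate(); an assumption for the demo.
        call_sizes.append(len(chunk))
        return [it["prefix"].upper() for it in chunk]

    outputs = generate_in_chunks(flatten_turns_sketch(samples), batch_size, fake_generate)
    return outputs, call_sizes


samples = [
    {"turns": [{"prefix": "q1", "truth": "a1"}, {"prefix": "q2", "truth": "a2"}]},
    {"turns": [{"prefix": "q3", "truth": "a3"}]},
]
out1, sizes1 = run(samples, batch_size=1)
out2, sizes2 = run(samples, batch_size=2)
print(sizes1, sizes2)  # [1, 1, 1] [2, 1]
print(out1 == out2)    # True
```

With `batch_size=1` there is one call per turn in the original order, which is the property the default relies on to consume the RNG in the same sequence as before.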

Verification (real eval: qwen2.5-0.5b + patch_elf, ECG-QA PTB-XL subset, 277 turn pairs from 55 multi-turn samples)

Default path (bs=1, sampling on): 277/277 generations bit-identical to main, all metrics identical.
Batched greedy (bs=4 vs bs=1): 208/277 (signal) and 214/277 (rgb+CLIP) generations are exact matches; the rest differ because bfloat16 batched kernels resolve near-tie argmaxes differently, the known caveat from #12 (Batched and distributed eval (without full determinism feature)). Note that this is measured with an untrained connector, i.e. maximally tie-prone logits; aggregate metrics agree to within about 0.002 (F1 0.0859 vs 0.0880).
Speed: 115.6 s → 31.4 s wall (×3.7) at --eval_batch_size 8 for the 277-turn eval, including model load.
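As a quick check of the figures above, the exact-match rates, the speedup, and the F1 gap can be recomputed from the reported numbers. Only values stated in this description are used; the variable names are illustrative.

```python
total_turns = 277
exact_matches = {"signal": 208, "rgb+CLIP": 214}

# Share of bs=4 generations identical to bs=1, per configuration.
rates = {name: count / total_turns for name, count in exact_matches.items()}

# Wall-clock speedup at --eval_batch_size 8, model load included in both timings.
speedup = 115.6 / 31.4

# Absolute gap between the two reported aggregate F1 values.
f1_gap = abs(0.0880 - 0.0859)

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")        # signal: 75.1%, rgb+CLIP: 77.3%
print(f"speedup: x{speedup:.1f}")       # speedup: x3.7
print(f"F1 gap: {f1_gap:.4f}")          # F1 gap: 0.0021
```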

Full determinism across batch sizes (the fp64 variant of #12) is intentionally out of scope, per review feedback there.

The evaluator generated one turn at a time: for every sample it looped over response ranges and called generate() per turn with batch size 1.

Restructure into two phases: flatten_eval_turns expands every (sample, turn) pair into a flat work item (prefix ids, masked signal indices, per-sample encoder outputs, ground truth), then generation runs over chunks of --eval_batch_size turns, left-padded to the chunk max with signal indices shifted per item. Results are reassembled in the original order before the (unchanged) metric computation.

--eval_batch_size defaults to 1, which preserves today's behavior exactly: same generate() calls in the same order. This was verified on a real eval (277 turn pairs, sampling on): 277/277 generations and all metrics are identical to main.

With batching, greedy decoding at eval_batch_size=4 reproduces 75-77% of eval_batch_size=1 generations exactly (signal and rgb configs, untrained connector); the remainder differ because bfloat16 batched kernels reach different logit argmaxes on near-ties. Aggregate metrics agree to within about 0.002.

index_nested now returns the squeezed per-sample entry; its only caller is the new flatten step.

TonyChen06 mentioned this pull request Jun 10, 2026

Support distributed eval via --distributed #63

Open

TonyChen06 wants to merge 1 commit into ELM-Research:main from TonyChen06:feat/batched-eval
