Support distributed eval via --distributed by TonyChen06 · Pull Request #63 · ELM-Research/ECG-Language-Models

TonyChen06 · 2026-06-10T01:38:15Z

Stacked on #62 — the first commit here is that PR; review/merge it first. The last commit is this PR. Final piece of splitting up #12.

What

evaluate() shards the flattened turn list across ranks with turns[rank::world_size]. This is an exact partition, so unlike the DistributedSampler approach in #12 (Batched and distributed eval (without full determinism feature)) there are no padding duplicates and no dedup step. Each rank generates its shard; the results are gathered with all_gather_object, sorted back into the original order, and every rank computes identical metrics from the full set.
The model is unwrapped from DDP for generate().
main_evaluator.py calls init_dist()/cleanup() when --distributed is passed (flag already exists) and gates prints/file writes to rank 0.
Single-process runs are unchanged (world_size == 1 short-circuits both the shard and the gather).
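For reviewers, the shard-and-reassemble idea above can be sketched in plain Python. This is a minimal illustration, not the code in this PR: `shard_turns` and `merge_shards` are hypothetical names, and the per-rank lists are simulated in one process instead of being exchanged with `torch.distributed.all_gather_object`.

```python
from typing import Any, List, Sequence, Tuple


def shard_turns(turns: Sequence[Any], rank: int, world_size: int) -> List[Tuple[int, Any]]:
    """Return this rank's strided shard, keeping each item's original index."""
    indexed = list(enumerate(turns))
    if world_size == 1:
        # Single-process short-circuit: keep the full turn list.
        return indexed
    # Strided slicing assigns every index to exactly one rank.
    return indexed[rank::world_size]


def merge_shards(gathered: Sequence[Sequence[Tuple[int, Any]]]) -> List[Any]:
    """Flatten per-rank results and restore the original turn order."""
    flat = [pair for shard in gathered for pair in shard]
    flat.sort(key=lambda pair: pair[0])
    return [item for _, item in flat]


# Simulate 3 ranks over 7 turns. In a real run, `gathered` would be the
# list filled in by all_gather_object on each rank (assumption).
turns = [f"turn-{i}" for i in range(7)]
world_size = 3
gathered = [shard_turns(turns, rank, world_size) for rank in range(world_size)]
restored = merge_shards(gathered)
```

Because the strided slices are disjoint and cover every index, shard sizes differ by at most one and nothing has to be padded or deduplicated after the gather.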

Launch: torchrun --standalone --nproc_per_node=N src/main_evaluator.py ... --distributed

Verification (2× RTX A5000 via torchrun vs the 1-GPU reference, greedy, 277 turn pairs)

Coverage exact: 277/277 pairs, no duplicates, no gaps; references identical and in original order.
Generations: 207/277 (74.7%) exact-match vs the single-GPU run. This is the same bfloat16 near-tie envelope as batched-vs-unbatched in #62 (Batch eval generation across turns via --eval_batch_size), and it was measured with an untrained connector across different GPU models, i.e. the worst case. Aggregate metrics agree to the third decimal.
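The two checks above (coverage and exact-match rate) could be reproduced with something like the following. This is a sketch under assumptions: `coverage_report` and `exact_match_rate` are hypothetical helpers, the toy inputs are made up for illustration, and only the 207/277 arithmetic comes from the numbers reported here.

```python
from typing import Dict, List, Sequence


def coverage_report(indices: Sequence[int], expected_total: int) -> Dict[str, List[int]]:
    """List duplicated and missing turn indices in a gathered result set."""
    seen = set()
    duplicates = []
    for i in indices:
        if i in seen:
            duplicates.append(i)
        seen.add(i)
    gaps = [i for i in range(expected_total) if i not in seen]
    return {"duplicates": duplicates, "gaps": gaps}


def exact_match_rate(reference: Sequence[str], candidate: Sequence[str]) -> float:
    """Fraction of positions where the two runs produced identical text."""
    if len(reference) != len(candidate):
        raise ValueError("runs must cover the same number of turns")
    if not reference:
        return 0.0
    matches = sum(a == b for a, b in zip(reference, candidate))
    return matches / len(reference)


# Toy data only: indices gathered from two simulated ranks over 5 turns.
report = coverage_report([0, 2, 4, 1, 3], expected_total=5)
rate = exact_match_rate(["a", "b", "c", "d"], ["a", "b", "x", "d"])

# The percentage quoted in the verification notes.
reported_pct = round(100 * 207 / 277, 1)
```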

Note: this touches the same main_evaluator.py region as #54; whichever merges second has a trivial conflict to resolve.

The evaluator generated one turn at a time: for every sample it looped over response ranges and called generate() per turn with batch size 1.

Restructure into two phases. First, flatten_eval_turns expands every (sample, turn) pair into a flat work item (prefix ids, masked signal indices, per-sample encoder outputs, ground truth). Then generation runs over chunks of --eval_batch_size turns, left-padded to the chunk max with signal indices shifted per item. Results are reassembled in the original order before the (unchanged) metric computation.

--eval_batch_size defaults to 1, which preserves today's behavior exactly: the same generate() calls in the same order. This was verified on a real eval (277 turn pairs, sampling on): 277/277 generations and all metrics are identical to main.

With batching, greedy decoding at eval_batch_size=4 reproduces 75-77% of eval_batch_size=1 generations exactly (signal and rgb configs, untrained connector); the remainder differ because bfloat16 batched kernels reach different logit argmaxes on near-ties. Aggregate metrics agree to the third decimal.

index_nested now returns the squeezed per-sample entry; its only caller is the new flatten step.
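The left-padding step described in this commit message, with signal indices shifted per item, might look roughly like this. It is a plain-Python sketch, not the evaluator's code: `chunked` and `left_pad_chunk` are hypothetical names, and the real implementation presumably works on tensors.

```python
from typing import Iterator, List, Sequence, Tuple


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def left_pad_chunk(
    prefixes: Sequence[Sequence[int]],
    signal_indices: Sequence[Sequence[int]],
    pad_id: int,
) -> Tuple[List[List[int]], List[List[int]], List[List[int]]]:
    """Left-pad prefixes to the chunk max and shift signal indices to match."""
    max_len = max(len(ids) for ids in prefixes)
    padded, masks, shifted = [], [], []
    for ids, sig in zip(prefixes, signal_indices):
        pad = max_len - len(ids)
        padded.append([pad_id] * pad + list(ids))
        masks.append([0] * pad + [1] * len(ids))
        # Every token moved right by `pad`, so its signal positions do too.
        shifted.append([i + pad for i in sig])
    return padded, masks, shifted


# Two turns of different prefix length in one chunk (toy token ids).
prefixes = [[5, 6, 7, 8], [9, 10]]
signal_indices = [[1, 2], [0]]
padded, masks, shifted = left_pad_chunk(prefixes, signal_indices, pad_id=0)
chunks = list(chunked(list(range(5)), 2))
```

Left padding keeps the last prompt token of every row at the same position, which is what decoder-only generation needs; the index shift keeps each signal position pointing at the same token it did before padding.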

Shard the flattened turn list across ranks (turns[rank::world_size] — an exact partition, so no post-gather deduplication is needed), generate locally, then all_gather_object the results and sort by the original order; every rank computes identical metrics from the full gathered set. The model is unwrapped from DDP for generate(). main_evaluator initializes/destroys the process group when --distributed is passed and gates prints and file writes to rank 0. Single-process behavior is unchanged (world_size == 1 keeps the full turn list and skips the gather).
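The DDP unwrap and the rank-0 gating mentioned above are both small patterns; one possible shape is sketched below. `unwrap_model` and `log_on_main` are hypothetical helpers, and the wrapper class is a stand-in used here so the example runs without torch. The only library fact relied on is that `DistributedDataParallel` exposes the wrapped model as `.module`.

```python
from typing import Any, Callable, List


def unwrap_model(model: Any) -> Any:
    """Return the wrapped model if `model` looks like a DDP wrapper.

    Assumption: the unwrapped model has no attribute of its own named
    `module`; otherwise an isinstance check against the wrapper is safer.
    """
    return getattr(model, "module", model)


def log_on_main(rank: int, message: str, sink: Callable[[str], None] = print) -> None:
    """Emit output only on rank 0 so multi-process runs do not repeat it."""
    if rank == 0:
        sink(message)


class ToyModel:
    def generate(self, prompt: str) -> str:
        return prompt + "!"


class ToyWrapper:
    """Stand-in for a DDP wrapper: holds the model but has no generate()."""

    def __init__(self, module: ToyModel) -> None:
        self.module = module


wrapped = ToyWrapper(ToyModel())
output = unwrap_model(wrapped).generate("ok")

messages: List[str] = []
for rank in range(2):
    log_on_main(rank, f"metrics from rank {rank}", sink=messages.append)
```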

TonyChen06 added 2 commits June 9, 2026 21:30

TonyChen06 wants to merge 2 commits into ELM-Research:main from TonyChen06:feat/distributed-eval
