Support multiple datasets in eval command #188
Merged
The eval command now takes one or more dataset positionals (Hugging Face ids or local manifests) and scores each separately in a single run. Each dataset is loaded, transcribed, scored, and reported on its own; under --json it emits one JSON object per dataset, and the final exit-code tally pools failures across every dataset. The --limit/--split/--subset/--column flags apply to all datasets.
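As a rough illustration of the command-line shape described above, the sketch below uses `argparse` with a variadic positional. This is an assumption made for illustration only: the PR does not say which CLI library the project uses, and the `build_parser` helper, the flag types, and the dataset names are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real command may be built with a different library.
    parser = argparse.ArgumentParser(prog="eval")
    # nargs="+" requires at least one dataset and collects them into a list.
    parser.add_argument(
        "datasets", nargs="+", help="Hugging Face ids or local manifests"
    )
    # Flags named in the PR; types and defaults here are assumptions.
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--split", default=None)
    parser.add_argument("--subset", default=None)
    parser.add_argument("--column", default=None)
    parser.add_argument("--json", action="store_true")
    return parser


# Placeholder dataset names, not real datasets.
args = build_parser().parse_args(
    ["org/set-a", "local/manifest.jsonl", "--limit", "10", "--json"]
)
print(args.datasets)
```

Because the flags are parsed once, outside any per-dataset loop, they naturally apply to every dataset in the run.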
Enable the `assembly eval` command to score multiple datasets in a single run, with each dataset loaded, scored, and reported separately.

Summary

The `eval` command now accepts multiple dataset arguments instead of a single dataset. Each dataset is evaluated independently and emits its own JSON payload (when `--json` is used), while failure counts are aggregated across all datasets in the exit message.

Key Changes

- Changed the `dataset: str` argument to `datasets: list[str]` to accept variadic dataset inputs
- Refactored `run_evaluate()` to iterate over multiple datasets, extracting the single-dataset logic into `_evaluate_one()`
- Aggregated `failed` and `total` counts across all datasets for the final error message
- One JSON object is emitted per dataset (`--json`), with human-readable output showing a block per dataset

Implementation Details

- The `_evaluate_one()` function encapsulates the scoring logic for a single dataset and returns its payload
- `run_evaluate()` now orchestrates the loop, emitting each payload and accumulating failure statistics
- The final error message reads `"{failed} of {total} items failed to transcribe."` where totals span all datasets
- The shared flags (`--limit`, `--split`, `--subset`, `--column`) apply uniformly to every dataset in the run

https://claude.ai/code/session_01JmbkV2dVWv8FZxbgGVrPXu