Support multiple datasets in eval command #188

Merged
alexkroman merged 1 commit into main from claude/lucid-cori-pz29mm
Jun 16, 2026

@alexkroman

Collaborator

Enable the assembly eval command to score multiple datasets in a single run, with each dataset loaded, scored, and reported separately.

Summary

The eval command now accepts multiple dataset arguments instead of a single dataset. Each dataset is evaluated independently and emits its own JSON payload (when --json is used), while failure counts are aggregated across all datasets in the exit message.
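Because a run with --json now produces several JSON objects rather than one, a consumer has to read a stream of objects. The sketch below is one way to do that, not code from this PR. It assumes only that the objects are written back to back on stdout; whether they are one per line or pretty-printed is not stated here, so it uses json.JSONDecoder.raw_decode, which handles both. The helper name iter_json_objects and the "dataset" key in the sample are hypothetical.

```python
import json
from typing import Any, Iterator


def iter_json_objects(text: str) -> Iterator[Any]:
    """Yield each top-level JSON value found in `text`, in order.

    Works for newline-delimited output and for pretty-printed objects
    that follow one another, since raw_decode reports where each value ends.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical captured stdout from a two-dataset run; field names are made up.
sample = '{"dataset": "first"}\n{\n  "dataset": "second"\n}\n'
payloads = list(iter_json_objects(sample))
```

Each element of payloads then corresponds to one dataset from the run, in the order the datasets were given on the command line.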

Key Changes

  • Command signature: Changed dataset: str argument to datasets: list[str] to accept variadic dataset inputs
  • Evaluation loop: Refactored run_evaluate() to iterate over multiple datasets, extracting the single-dataset logic into _evaluate_one()
  • Failure aggregation: Accumulated failed and total counts across all datasets for the final error message
  • Output format: Each dataset emits its own JSON object (under --json), with human-readable output showing a block per dataset
  • Help text: Updated docstrings and CLI help to reflect multi-dataset support with examples
  • Test coverage: Added four new tests covering multiple datasets with JSON output, human output, failure aggregation, and the requirement for at least one dataset
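For readers unfamiliar with variadic positionals, here is a minimal sketch of the signature change. It is an illustration under assumptions, not the code in this PR: the PR does not show which CLI framework the command uses, so the standard library's argparse stands in, and build_parser is a hypothetical name. Only the datasets positional and the --json and --limit flags come from the description above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-in parser with a one-or-more `datasets` positional."""
    parser = argparse.ArgumentParser(prog="assembly eval")
    # nargs="+" collects every positional into a list and rejects an empty
    # list, mirroring the "at least one dataset" requirement.
    parser.add_argument(
        "datasets",
        nargs="+",
        help="one or more Hugging Face ids or local manifests",
    )
    parser.add_argument("--json", action="store_true", help="emit JSON per dataset")
    parser.add_argument("--limit", type=int, default=None, help="applies to every dataset")
    return parser


# Hypothetical dataset names, used only to show the parsed shape.
args = build_parser().parse_args(["ds_a", "ds_b", "--json"])
```

With this shape, args.datasets is a list[str] even when a single dataset is passed, so the evaluation loop needs no special case for the old single-dataset invocation.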

Implementation Details

  • The _evaluate_one() function encapsulates the scoring logic for a single dataset and returns its payload
  • run_evaluate() now orchestrates the loop, emitting each payload and accumulating failure statistics
  • The error message format remains consistent: "{failed} of {total} items failed to transcribe." where totals span all datasets
  • All existing flags (--limit, --split, --subset, --column) apply uniformly to every dataset in the run
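The loop described above can be sketched roughly as follows. This is a simplified illustration, not the merged implementation: the real signatures of run_evaluate() and _evaluate_one() are not shown in this PR, the payload keys ("dataset", "total", "failed") are hypothetical, and FAKE_RESULTS replaces actual loading and transcription. Only the per-dataset emit, the pooled counters, and the error message format are taken from the description.

```python
import json
from typing import Callable, Optional

# Stand-in for loading and transcribing: dataset name -> per-item transcript,
# where None marks an item that failed to transcribe. Entirely made up.
FAKE_RESULTS = {
    "ds_a": ["hello", None, "world"],
    "ds_b": [None, "ok"],
}


def _evaluate_one(dataset: str, limit: Optional[int] = None) -> dict:
    """Score a single dataset and return its payload (hypothetical shape)."""
    items = FAKE_RESULTS[dataset][:limit]  # the same limit applies to every dataset
    failed = sum(1 for transcript in items if transcript is None)
    return {"dataset": dataset, "total": len(items), "failed": failed}


def run_evaluate(
    datasets: list,
    limit: Optional[int] = None,
    as_json: bool = False,
    emit: Callable[[str], None] = print,
) -> Optional[str]:
    """Evaluate each dataset in turn; return an error message if any item failed."""
    failed = total = 0
    for dataset in datasets:
        payload = _evaluate_one(dataset, limit=limit)
        if as_json:
            emit(json.dumps(payload))  # one JSON object per dataset
        else:
            emit(f"{dataset}: {payload['failed']} of {payload['total']} failed")
        # Counters span all datasets, so the final message is a pooled tally.
        failed += payload["failed"]
        total += payload["total"]
    if failed:
        return f"{failed} of {total} items failed to transcribe."
    return None


lines = []
message = run_evaluate(["ds_a", "ds_b"], as_json=True, emit=lines.append)
```

Keeping the counters in the caller rather than inside _evaluate_one() is what lets each dataset report independently while the exit message still reflects the whole run.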

https://claude.ai/code/session_01JmbkV2dVWv8FZxbgGVrPXu

The eval command now takes one or more dataset positionals (Hugging Face
ids or local manifests) and scores each separately in a single run. Each
dataset is loaded, transcribed, scored, and reported on its own; under
--json it emits one JSON object per dataset, and the final exit-code tally
pools failures across every dataset. The --limit/--split/--subset/--column
flags apply to all datasets.
@alexkroman alexkroman enabled auto-merge June 16, 2026 20:34
@alexkroman alexkroman added this pull request to the merge queue Jun 16, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 16, 2026
@alexkroman alexkroman added this pull request to the merge queue Jun 16, 2026
Merged via the queue into main with commit bf6689d Jun 16, 2026
19 checks passed
@alexkroman alexkroman deleted the claude/lucid-cori-pz29mm branch June 16, 2026 20:45