Support multiple datasets in eval command by alexkroman · Pull Request #188 · AssemblyAI/cli

alexkroman · 2026-06-16T20:34:20Z

Enable the assembly eval command to score multiple datasets in a single run, with each dataset loaded, scored, and reported separately.

Summary

The eval command now accepts multiple dataset arguments instead of a single dataset. Each dataset is evaluated independently and emits its own JSON payload (when --json is used), while failure counts are aggregated across all datasets in the exit message.

Key Changes

Command signature: Changed dataset: str argument to datasets: list[str] to accept variadic dataset inputs
Evaluation loop: Refactored run_evaluate() to iterate over multiple datasets, extracting the single-dataset logic into _evaluate_one()
Failure aggregation: Accumulated failed and total counts across all datasets for the final error message
Output format: Each dataset emits its own JSON object (under --json), with human-readable output showing a block per dataset
Help text: Updated docstrings and CLI help to reflect multi-dataset support with examples
Test coverage: Added four new tests covering multiple datasets with JSON output, human output, failure aggregation, and the requirement for at least one dataset
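
As a rough illustration of the signature change, the sketch below uses the standard-library argparse as a stand-in; the PR does not say which CLI framework the project uses, so build_parser and the dataset names here are hypothetical. It shows a variadic positional that collects one or more datasets into a list and rejects an invocation with none.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the real command definition; the project's
    # actual CLI framework and option set may differ.
    parser = argparse.ArgumentParser(prog="eval")
    # nargs="+" gathers one or more positionals into a list and makes
    # argparse exit with a usage error when none are given.
    parser.add_argument("datasets", nargs="+", metavar="DATASET")
    # Two of the flags named in the PR, applied to every dataset in the run.
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--json", action="store_true")
    return parser


args = build_parser().parse_args(["dataset_a", "dataset_b", "--json"])
print(args.datasets)
```

Collecting the positionals into a single list keeps the shared flags in one place, which matches the PR's statement that they apply uniformly to every dataset.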

Implementation Details

The _evaluate_one() function encapsulates the scoring logic for a single dataset and returns its payload
run_evaluate() now orchestrates the loop, emitting each payload and accumulating failure statistics
The error message format remains consistent: "{failed} of {total} items failed to transcribe." where totals span all datasets
All existing flags (--limit, --split, --subset, --column) apply uniformly to every dataset in the run
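
The loop described above might look roughly like the following minimal sketch. Only the function names _evaluate_one and run_evaluate and the error message format come from the PR; the payload keys, the FAKE_DATASETS table, the as_json and out parameters, and the exception types are assumptions made for illustration.

```python
import json
import sys

# Hypothetical in-memory datasets: each item is (reference, hypothesis),
# where a hypothesis of None stands for a failed transcription.
FAKE_DATASETS = {
    "dataset_a": [("hello world", "hello world"), ("good morning", None)],
    "dataset_b": [("one two", "one too")],
}


def _evaluate_one(dataset: str) -> dict:
    """Score a single dataset and return its payload (assumed shape)."""
    items = FAKE_DATASETS[dataset]
    failed = sum(1 for _, hypothesis in items if hypothesis is None)
    return {"dataset": dataset, "total": len(items), "failed": failed}


def run_evaluate(datasets: list[str], as_json: bool = False, out=sys.stdout) -> None:
    """Evaluate each dataset in turn and pool failure counts across all of them."""
    if not datasets:
        raise ValueError("at least one dataset is required")
    failed = total = 0
    for dataset in datasets:
        payload = _evaluate_one(dataset)
        if as_json:
            # One JSON object per dataset.
            out.write(json.dumps(payload) + "\n")
        else:
            ok = payload["total"] - payload["failed"]
            out.write(f"{payload['dataset']}: {ok}/{payload['total']} transcribed\n")
        failed += payload["failed"]
        total += payload["total"]
    if failed:
        # Totals span every dataset in the run, as in the PR's message format.
        raise RuntimeError(f"{failed} of {total} items failed to transcribe.")
```

Raising only after the loop finishes means every dataset is still reported even when an earlier one has failures.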

https://claude.ai/code/session_01JmbkV2dVWv8FZxbgGVrPXu

The eval command now takes one or more dataset positionals (Hugging Face ids or local manifests) and scores each separately in a single run. Each dataset is loaded, transcribed, scored, and reported on its own; under --json it emits one JSON object per dataset, and the final exit-code tally pools failures across every dataset. The --limit/--split/--subset/--column flags apply to all datasets.
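
Because --json now produces one JSON object per dataset rather than a single document, a consumer has to read a sequence of objects. The PR does not state how the objects are delimited, so the sketch below assumes only that they are separated by whitespace; iter_json_objects and the sample payload keys are hypothetical.

```python
import json


def iter_json_objects(text: str):
    """Yield each top-level JSON value found in text, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace between objects (newlines, indentation, etc.).
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode parses one value starting at pos and returns the
        # index just past it, so the next object can be picked up from there.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical captured output: one compact object, one pretty-printed.
sample = '{"dataset": "dataset_a", "failed": 1}\n{\n  "dataset": "dataset_b",\n  "failed": 0\n}\n'
payloads = list(iter_json_objects(sample))
print([p["dataset"] for p in payloads])
```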

alexkroman enabled auto-merge June 16, 2026 20:34

alexkroman added this pull request to the merge queue Jun 16, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 16, 2026

alexkroman added this pull request to the merge queue Jun 16, 2026

Merged via the queue into main with commit bf6689d Jun 16, 2026
19 checks passed

alexkroman deleted the claude/lucid-cori-pz29mm branch June 16, 2026 20:45

alexkroman merged 1 commit into main from claude/lucid-cori-pz29mm
