Add `assembly eval` command for WER/DER scoring of datasets by alexkroman · Pull Request #74 · AssemblyAI/cli

alexkroman · 2026-06-11T22:30:04Z

Adds a new assembly eval command that transcribes evaluation datasets and scores them against reference texts (WER) and/or speaker turns (DER). Supports both local manifests (.csv/.jsonl) and Hugging Face Hub datasets.

Key Changes

- New aai_cli/eval_data.py module: Dataset loader supporting local manifests and Hugging Face datasets via the datasets-server REST API (no heavyweight datasets dependency). Auto-detects audio/text columns, resolves relative paths, handles gated datasets via HF_TOKEN, and loads speaker-turn references for diarization scoring.
- New aai_cli/wer.py module: Word error rate (WER) scoring backed by jiwer. Normalizes text (lowercase, punctuation stripped, whitespace collapsed) before alignment so transcripts aren't penalized for casing/punctuation style differences.
- New aai_cli/der.py module: Diarization error rate (DER) scoring backed by pyannote.metrics. Computes missed detection, false alarm, and speaker confusion errors with optional collar tolerance around reference boundaries.
- New aai_cli/commands/evaluate.py module: The eval command that orchestrates transcription, scoring, and output. Supports --speech-model selection, --speaker-labels for diarization, --collar for DER tolerance, and --json output. Renders results as a table with per-item and pooled scores.
- Comprehensive test coverage:
  - tests/test_eval_command.py: End-to-end command behavior, WER/DER scoring, flag handling, error paths
  - tests/test_eval_data_manifest.py: Local manifest loading (CSV/JSONL), column resolution, path handling
  - tests/test_eval_data_hf.py: Hugging Face dataset loading, split/subset selection, authentication
  - tests/test_wer.py: WER normalization and alignment
  - tests/test_der.py: DER scoring and collar behavior
- Integration: Registered eval command in main app, added to help snapshots, updated README and smoke tests.
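
To illustrate the normalization step described for aai_cli/wer.py, here is a minimal, dependency-free sketch. It is not the module's code: the PR states that scoring is backed by jiwer, whereas this example uses a hand-written word-level edit distance, and the function names `normalize` and `wer` as well as the exact punctuation rules are assumptions made for illustration.

```python
import string

# Translation table that deletes ASCII punctuation (an assumed rule set;
# the real module may strip a different set of characters).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, collapse whitespace, and split into words."""
    return text.lower().translate(_PUNCT_TABLE).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Casing and punctuation differences alone do not count as errors.
print(wer("Hello, world!", "hello world"))  # 0.0
```

Without the normalization pass, "Hello," and "hello" would count as a substitution, which is the style penalty the description says the module avoids.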

Implementation Details

- Dataset loading is unified: eval_data.load() detects whether the input is a local manifest (by file extension or existence) or a Hugging Face dataset ID (validated against ^[\w.-]+(?:/[\w.-]+)?$ so that typos are not sent to the Hub).
- Hugging Face integration uses httpx directly with the datasets-server REST API, avoiding the heavyweight datasets library while still supporting gated/private datasets.
- Column auto-detection follows ASR dataset conventions (HF audio datasets, Common Voice, NeMo manifests) with explicit --audio-column/--text-column overrides.
- Speaker turns use the parallel-array convention from Hugging Face diarization datasets (speakers, timestamps_start, timestamps_end in seconds).
- All dataclasses are frozen for immutability; errors carry suggestions via the existing UsageError/APIError framework.
- Scoring is pooled across items for corpus-level metrics; JSON output includes per-item and aggregate scores.
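
As a quick check of what the dataset ID pattern accepts, the sketch below applies the regex quoted above to a few inputs. Only the pattern comes from the PR; the helper name `looks_like_hf_id` and the sample IDs are hypothetical.

```python
import re

# Pattern quoted in the PR description: one path segment, optionally
# followed by a single "/segment" (for example "org/name").
_HF_ID = re.compile(r"^[\w.-]+(?:/[\w.-]+)?$")


def looks_like_hf_id(value: str) -> bool:
    """Return True if value has the shape of a Hugging Face dataset ID."""
    return _HF_ID.match(value) is not None


for candidate in ["librispeech_asr", "some-org/some_dataset", "a/b/c",
                  "my dataset", "data/eval.csv"]:
    print(candidate, looks_like_hf_id(candidate))
```

Note that a relative path such as data/eval.csv also matches the pattern, so the regex alone cannot separate manifests from dataset IDs. That is presumably why the manifest check runs first in load().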

https://claude.ai/code/session_014o9JsqBmqmybikbtHNsbSf

Transcribe an evaluation dataset — a Hugging Face dataset id (via the datasets-server REST API, no datasets dependency) or a local .csv/.jsonl manifest — and score word error rate against its reference texts with jiwer. With --speaker-labels the run also diarizes and scores diarization error rate against reference speaker turns via pyannote.metrics (--collar for boundary forgiveness). Per-file rows plus pooled corpus scores render as a table or --json. https://claude.ai/code/session_014o9JsqBmqmybikbtHNsbSf
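
The difference between per-file rows and a pooled corpus score can be sketched as follows. This assumes "pooled" means total errors divided by total reference words, which is the usual convention but is not spelled out in the PR; the `ItemScore` class and `pooled_wer` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemScore:
    """Per-file error counts (hypothetical structure for illustration)."""
    errors: int     # substitutions + deletions + insertions
    ref_words: int  # number of words in the reference text

    @property
    def wer(self) -> float:
        return self.errors / self.ref_words


def pooled_wer(items: list[ItemScore]) -> float:
    """Corpus-level WER: sum the counts first, then divide once."""
    total_ref = sum(item.ref_words for item in items)
    if total_ref == 0:
        raise ValueError("no reference words to score against")
    return sum(item.errors for item in items) / total_ref


items = [ItemScore(errors=1, ref_words=2), ItemScore(errors=1, ref_words=98)]
mean_of_items = sum(item.wer for item in items) / len(items)

# Pooled: 2 errors over 100 reference words.
print(pooled_wer(items))  # 0.02
# The unweighted mean is much higher because the 2-word file dominates.
print(mean_of_items)
```

Pooling weights each file by its reference length, so a very short clip with one error does not distort the corpus score the way it would in a plain average of per-file rates.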

aikido-pr-checks · 2026-06-11T22:46:42Z

+) -> EvalDataset:
+    """Load evaluation items from a local manifest or a Hugging Face dataset id."""
+    path = Path(dataset)
+    if path.suffix in _MANIFEST_SUFFIXES or path.is_file():


load() treats any existing file as a manifest, even one without a .csv/.jsonl extension, so non-manifest files are parsed as dataset manifests.

Suggested change

-    if path.suffix in _MANIFEST_SUFFIXES or path.is_file():
+    if path.suffix in _MANIFEST_SUFFIXES and path.is_file():

Details

✨ AI Reasoning
The loader is trying to distinguish local manifests from Hugging Face dataset IDs. However, the decision condition accepts two cases: known manifest suffixes or any path that exists as a file. That makes the suffix check effectively optional for local files. As a result, an existing non-manifest file can be misclassified as a manifest and sent into CSV/JSONL parsing logic, which contradicts the documented accepted input shapes and can fail for reasons unrelated to dataset validity.
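
The two conditions can be compared side by side with a small sketch. The contents of `_MANIFEST_SUFFIXES` are assumed to be .csv and .jsonl based on the PR description, and the helper names and file names are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumed contents; the PR only says manifests are .csv/.jsonl files.
_MANIFEST_SUFFIXES = {".csv", ".jsonl"}


def is_manifest_or(path: Path) -> bool:
    """Condition as written in the PR: known suffix OR existing file."""
    return path.suffix in _MANIFEST_SUFFIXES or path.is_file()


def is_manifest_and(path: Path) -> bool:
    """Condition from the suggested change: known suffix AND existing file."""
    return path.suffix in _MANIFEST_SUFFIXES and path.is_file()


with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "notes.txt"       # exists, wrong suffix
    notes.write_text("not a manifest")
    manifest = Path(tmp) / "eval.csv"     # exists, right suffix
    manifest.write_text("audio,text\n")
    missing = Path(tmp) / "missing.csv"   # right suffix, does not exist

    # Map each file name to (result with "or", result with "and").
    results = {
        p.name: (is_manifest_or(p), is_manifest_and(p))
        for p in (notes, manifest, missing)
    }

for name, (with_or, with_and) in results.items():
    print(f"{name}: or={with_or} and={with_and}")
```

The suggested `and` form rejects notes.txt as intended, but it also stops classifying a missing .csv path as a manifest. In this sketch, such a path would fall through to the dataset ID branch, so whether a mistyped manifest path still yields a clear "file not found" error depends on how the rest of load() handles it.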

alexkroman enabled auto-merge (squash) June 11, 2026 22:30

alexkroman force-pushed the claude/eager-volta-vcxgph branch from bf75138 to f2e972c Compare June 11, 2026 22:46

aikido-pr-checks Bot reviewed Jun 11, 2026

alexkroman added 2 commits June 11, 2026 15:48

Merge branch 'main' into claude/eager-volta-vcxgph

33489a1

Merge branch 'main' into claude/eager-volta-vcxgph

b9e55f2

alexkroman disabled auto-merge June 11, 2026 22:56

alexkroman merged commit 73a1dec into main Jun 11, 2026
8 checks passed

alexkroman deleted the claude/eager-volta-vcxgph branch June 11, 2026 22:58

alexkroman merged 3 commits into main from claude/eager-volta-vcxgph
