Add assembly eval command for WER/DER scoring of datasets #74

Merged: alexkroman merged 3 commits into main from claude/eager-volta-vcxgph on Jun 11, 2026

Conversation

@alexkroman (Collaborator)

Adds a new assembly eval command that transcribes evaluation datasets and scores them against reference texts (WER) and/or speaker turns (DER). Supports both local manifests (.csv/.jsonl) and Hugging Face Hub datasets.

Key Changes

  • New aai_cli/eval_data.py module: Dataset loader supporting local manifests and Hugging Face datasets via the datasets-server REST API (no heavyweight datasets dependency). Auto-detects audio/text columns, resolves relative paths, handles gated datasets via HF_TOKEN, and loads speaker-turn references for diarization scoring.

  • New aai_cli/wer.py module: Word error rate (WER) scoring backed by jiwer. Normalizes text (lowercase, punctuation stripped, whitespace collapsed) before alignment so transcripts aren't penalized for casing/punctuation style differences.

  • New aai_cli/der.py module: Diarization error rate (DER) scoring backed by pyannote.metrics. Computes missed detection, false alarm, and speaker confusion errors with optional collar tolerance around reference boundaries.

  • New aai_cli/commands/evaluate.py module: The eval command that orchestrates transcription, scoring, and output. Supports --speech-model selection, --speaker-labels for diarization, --collar for DER tolerance, and --json output. Renders results as a table with per-item and pooled scores.

  • Comprehensive test coverage:

    • tests/test_eval_command.py: End-to-end command behavior, WER/DER scoring, flag handling, error paths
    • tests/test_eval_data_manifest.py: Local manifest loading (CSV/JSONL), column resolution, path handling
    • tests/test_eval_data_hf.py: Hugging Face dataset loading, split/subset selection, authentication
    • tests/test_wer.py: WER normalization and alignment
    • tests/test_der.py: DER scoring and collar behavior
  • Integration: Registered eval command in main app, added to help snapshots, updated README and smoke tests.
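
As an illustration of the manifest handling listed above, here is a minimal sketch of column auto-detection and relative-path resolution for a CSV manifest. It is not the code from aai_cli/eval_data.py: the names `EvalItem`, `pick_column`, `load_csv_manifest`, and both candidate-column tuples are assumptions made for this example, chosen to resemble common ASR layouts (Common Voice uses `path`/`sentence`, NeMo manifests use `audio_filepath`/`text`).

```python
import csv
import io
from dataclasses import dataclass
from pathlib import Path

# Hypothetical candidate names; the real loader's list may differ.
AUDIO_CANDIDATES = ("audio", "audio_filepath", "path", "file")
TEXT_CANDIDATES = ("text", "sentence", "transcript", "transcription")


@dataclass(frozen=True)
class EvalItem:
    audio: str      # local path or URL of the audio file
    reference: str  # reference transcript


def pick_column(fieldnames, candidates, override=None):
    """Return the explicit override if given, else the first candidate present."""
    if override is not None:
        if override not in fieldnames:
            raise ValueError(f"column {override!r} not in manifest: {fieldnames}")
        return override
    for name in candidates:
        if name in fieldnames:
            return name
    raise ValueError(f"none of {candidates} found in {fieldnames}")


def load_csv_manifest(text, base_dir, audio_column=None, text_column=None):
    """Parse CSV manifest text into EvalItem records."""
    reader = csv.DictReader(io.StringIO(text))
    fields = list(reader.fieldnames or [])
    audio_col = pick_column(fields, AUDIO_CANDIDATES, audio_column)
    text_col = pick_column(fields, TEXT_CANDIDATES, text_column)
    items = []
    for row in reader:
        audio = row[audio_col]
        # URLs and absolute paths pass through unchanged; relative paths
        # are resolved against the directory that holds the manifest.
        if "://" not in audio and not Path(audio).is_absolute():
            audio = str(Path(base_dir) / audio)
        items.append(EvalItem(audio=audio, reference=row[text_col]))
    return items
```

Passing `audio_column` or `text_column` plays the role of the `--audio-column`/`--text-column` overrides: an explicit choice wins over detection, and a misspelled override fails early instead of silently falling back.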

Implementation Details

  • Dataset loading is unified: eval_data.load() detects whether the input is a local manifest (by file extension or existence) or a Hugging Face dataset ID (validated against ^[\w.-]+(?:/[\w.-]+)?$ to prevent typos being sent to the hub).
  • Hugging Face integration uses httpx directly with the datasets-server REST API, avoiding the heavyweight datasets library while still supporting gated/private datasets.
  • Column auto-detection follows ASR dataset conventions (HF audio datasets, Common Voice, NeMo manifests) with explicit --audio-column/--text-column overrides.
  • Speaker turns use the parallel-array convention from Hugging Face diarization datasets (speakers, timestamps_start, timestamps_end in seconds).
  • All dataclasses are frozen for immutability; errors carry actionable suggestions via the existing UsageError/APIError framework.
  • Scoring is pooled across items for corpus-level metrics; JSON output includes per-item and aggregate scores.
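
The parallel-array convention and the frozen-dataclass choice mentioned above can be sketched together. The field names `speakers`, `timestamps_start`, and `timestamps_end` come from the description; the `SpeakerTurn` class and `turns_from_row` helper are hypothetical names used only for this example and may not match aai_cli/eval_data.py.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class SpeakerTurn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def turns_from_row(row):
    """Zip one row's parallel arrays into a tuple of SpeakerTurn records."""
    speakers = row["speakers"]
    starts = row["timestamps_start"]
    ends = row["timestamps_end"]
    if not (len(speakers) == len(starts) == len(ends)):
        raise ValueError("speakers and timestamp arrays must have equal length")
    turns = []
    for speaker, start, end in zip(speakers, starts, ends):
        if end <= start:
            raise ValueError(f"turn for {speaker!r} has a non-positive duration")
        turns.append(SpeakerTurn(str(speaker), float(start), float(end)))
    return tuple(turns)


row = {
    "speakers": ["A", "B"],
    "timestamps_start": [0.0, 1.5],
    "timestamps_end": [1.5, 3.0],
}
turns = turns_from_row(row)

try:
    turns[0].start = 9.9  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    pass
```

Validating the array lengths up front matters because `zip` would otherwise truncate silently to the shortest array and drop reference turns from the DER calculation.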

https://claude.ai/code/session_014o9JsqBmqmybikbtHNsbSf

@alexkroman enabled auto-merge (squash) on June 11, 2026 22:30
Transcribe an evaluation dataset — a Hugging Face dataset id (via the
datasets-server REST API, no datasets dependency) or a local .csv/.jsonl
manifest — and score word error rate against its reference texts with
jiwer. With --speaker-labels the run also diarizes and scores diarization
error rate against reference speaker turns via pyannote.metrics
(--collar for boundary forgiveness). Per-file rows plus pooled corpus
scores render as a table or --json.

https://claude.ai/code/session_014o9JsqBmqmybikbtHNsbSf
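
To show what "per-file rows plus pooled corpus scores" means for WER, here is a dependency-free sketch. The PR backs WER with jiwer; this example swaps in a hand-written word-level edit distance so it runs with the standard library alone, and its normalization (lowercase, ASCII punctuation stripped, whitespace collapsed) only approximates what aai_cli/wer.py does. The function names are assumptions for illustration.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip ASCII punctuation, and split on any whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, deletions, and insertions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def score(pairs):
    """Return (per-item WER list, pooled corpus WER) for (reference, hypothesis) pairs."""
    per_item = []
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref, hyp = normalize(reference), normalize(hypothesis)
        if not ref:
            raise ValueError("reference is empty after normalization")
        errors = word_errors(ref, hyp)
        per_item.append(errors / len(ref))
        total_errors += errors
        total_words += len(ref)
    return per_item, total_errors / total_words


per_item, pooled = score([
    ("Hello, world!", "hello world"),
    ("The quick brown fox jumps", "the quick brown dog"),
])
```

In this example the per-item rates are 0.0 and 0.4, whose mean is 0.2, while the pooled rate is 2 errors over 7 reference words (about 0.286). Pooling weights each file by its reference length, so a long file with many errors is not averaged away by short, clean ones.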
@alexkroman force-pushed the claude/eager-volta-vcxgph branch from bf75138 to f2e972c on June 11, 2026 22:46
Comment thread aai_cli/eval_data.py
) -> EvalDataset:
    """Load evaluation items from a local manifest or a Hugging Face dataset id."""
    path = Path(dataset)
    if path.suffix in _MANIFEST_SUFFIXES or path.is_file():

@aikido-pr-checks (Bot) commented on Jun 11, 2026

load() classifies any existing file as a manifest, even one without a .csv/.jsonl extension, so non-manifest files are incorrectly parsed as dataset manifests.

Suggested change:
- if path.suffix in _MANIFEST_SUFFIXES or path.is_file():
+ if path.suffix in _MANIFEST_SUFFIXES and path.is_file():

✨ AI Reasoning
The loader is trying to distinguish local manifests from Hugging Face dataset IDs. However, the decision condition accepts two cases: known manifest suffixes or any path that exists as a file. That makes the suffix check effectively optional for local files. As a result, an existing non-manifest file can be misclassified as a manifest and sent into CSV/JSONL parsing logic, which contradicts the documented accepted input shapes and can fail for reasons unrelated to dataset validity.
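
The difference between the current `or` condition and the suggested `and` can be checked with a small stand-alone sketch. It assumes that any input not classified as a manifest falls through to the dataset-ID pattern quoted in the PR description; `classify`, `MANIFEST_SUFFIXES`, and `HF_ID` are hypothetical stand-ins for this example, not names taken from aai_cli/eval_data.py.

```python
import re
import tempfile
from pathlib import Path

MANIFEST_SUFFIXES = {".csv", ".jsonl"}  # stand-in for _MANIFEST_SUFFIXES
HF_ID = re.compile(r"^[\w.-]+(?:/[\w.-]+)?$")  # pattern quoted in the PR description


def classify(dataset, require_both):
    """Classify an input string as 'manifest', 'hf_dataset', or 'invalid'.

    require_both=False mirrors the current `or` condition;
    require_both=True mirrors the suggested `and` condition.
    """
    path = Path(dataset)
    has_suffix = path.suffix in MANIFEST_SUFFIXES
    exists = path.is_file()
    is_manifest = (has_suffix and exists) if require_both else (has_suffix or exists)
    if is_manifest:
        return "manifest"
    if HF_ID.match(dataset):
        return "hf_dataset"
    return "invalid"


with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "notes.txt"
    notes.write_text("not a manifest")
    existing_or = classify(str(notes), require_both=False)
    existing_and = classify(str(notes), require_both=True)

# A manifest name that is assumed not to exist in the working directory.
missing_or = classify("missing_manifest_for_sketch.csv", require_both=False)
missing_and = classify("missing_manifest_for_sketch.csv", require_both=True)
```

Under these assumptions the existing `notes.txt` is treated as a manifest with `or` and rejected with `and`, which is the case the comment describes. The mistyped `.csv` name goes the other way: `or` still routes it to manifest loading, where a missing-file error can be raised, while `and` lets it match the dataset-ID pattern and be looked up on the Hub.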

@alexkroman disabled auto-merge on June 11, 2026 22:56
@alexkroman merged commit 73a1dec into main on Jun 11, 2026
8 checks passed
@alexkroman deleted the claude/eager-volta-vcxgph branch on June 11, 2026 22:58