Add assembly eval command for WER/DER scoring of datasets #74
Transcribe an evaluation dataset — a Hugging Face dataset id (via the datasets-server REST API, no datasets dependency) or a local .csv/.jsonl manifest — and score word error rate against its reference texts with jiwer. With --speaker-labels the run also diarizes and scores diarization error rate against reference speaker turns via pyannote.metrics (--collar for boundary forgiveness). Per-file rows plus pooled corpus scores render as a table or --json. https://claude.ai/code/session_014o9JsqBmqmybikbtHNsbSf
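To make the difference between per-file rows and a pooled corpus score concrete, here is a minimal pure-Python sketch. It is not the PR's implementation, which delegates alignment to jiwer. The normalization follows the steps named in the PR summary (lowercase, strip punctuation, collapse whitespace). The assumption that pooling means total errors divided by total reference words, and the function names `normalize`, `word_errors`, and `score`, are illustrative only.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, collapse whitespace, split into words."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return stripped.split()  # split() with no argument also collapses whitespace


def word_errors(reference: list[str], hypothesis: list[str]) -> int:
    """Word-level edit distance: substitutions + deletions + insertions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def score(pairs: list[tuple[str, str]]) -> tuple[list[float], float]:
    """Return (per-file WER list, pooled WER) for (reference, hypothesis) pairs.

    Assumes every reference has at least one word after normalization.
    """
    per_file: list[float] = []
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref_words = normalize(reference)
        errors = word_errors(ref_words, normalize(hypothesis))
        per_file.append(errors / len(ref_words))
        total_errors += errors
        total_words += len(ref_words)
    # Pooled score (assumed definition): all errors over all reference words.
    return per_file, total_errors / total_words
```

With a short clean file and a longer noisy one, the pooled score weights each file by its reference length, so it generally differs from the plain average of the per-file rows.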
bf75138 to f2e972c
    ) -> EvalDataset:
        """Load evaluation items from a local manifest or a Hugging Face dataset id."""
        path = Path(dataset)
        if path.suffix in _MANIFEST_SUFFIXES or path.is_file():
load() classifies any existing file as a manifest, even one without a .csv/.jsonl extension, so non-manifest files are incorrectly parsed as dataset manifests.
    -    if path.suffix in _MANIFEST_SUFFIXES or path.is_file():
    +    if path.suffix in _MANIFEST_SUFFIXES and path.is_file():
✨ AI Reasoning
The loader is trying to distinguish local manifests from Hugging Face dataset IDs. However, the decision condition accepts two cases: known manifest suffixes or any path that exists as a file. That makes the suffix check effectively optional for local files. As a result, an existing non-manifest file can be misclassified as a manifest and sent into CSV/JSONL parsing logic, which contradicts the documented accepted input shapes and can fail for reasons unrelated to dataset validity.
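The effect of the two conditions can be checked with a small sketch. It reproduces only the boolean test from the diff, not the rest of `load()`. The contents of `_MANIFEST_SUFFIXES` are assumed from the ".csv/.jsonl" wording in the PR description, and the helper names, file names, and CSV header are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumed value, inferred from the ".csv/.jsonl" wording in the PR description.
_MANIFEST_SUFFIXES = {".csv", ".jsonl"}


def is_manifest_current(dataset: str) -> bool:
    """Condition as written in the diff: suffix match OR existing file."""
    path = Path(dataset)
    return path.suffix in _MANIFEST_SUFFIXES or path.is_file()


def is_manifest_suggested(dataset: str) -> bool:
    """Condition from the review suggestion: suffix match AND existing file."""
    path = Path(dataset)
    return path.suffix in _MANIFEST_SUFFIXES and path.is_file()


def compare_conditions() -> dict[str, tuple[bool, bool]]:
    """Return (current, suggested) results for three hypothetical inputs."""
    with tempfile.TemporaryDirectory() as tmp:
        notes = Path(tmp) / "notes.txt"      # exists, wrong suffix
        notes.write_text("not a manifest")
        manifest = Path(tmp) / "eval.csv"    # exists, right suffix
        manifest.write_text("audio,text\n")  # hypothetical header
        missing = Path(tmp) / "missing.csv"  # right suffix, does not exist
        cases = [("notes.txt", notes), ("eval.csv", manifest), ("missing.csv", missing)]
        return {
            name: (is_manifest_current(str(p)), is_manifest_suggested(str(p)))
            for name, p in cases
        }
```

The existing `notes.txt` is treated as a manifest only by the current condition, which is the behavior the comment flags. Note that the suggested condition also stops treating a missing `.csv` path as a manifest, so such a path would fall through to whatever branch follows. The excerpt does not show how `load()` handles that case, so it may be worth confirming that a mistyped manifest path still produces a clear error.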
Adds a new `assembly eval` command that transcribes evaluation datasets and scores them against reference texts (WER) and/or speaker turns (DER). Supports both local manifests (.csv/.jsonl) and Hugging Face Hub datasets.

Key Changes

- New `aai_cli/eval_data.py` module: Dataset loader supporting local manifests and Hugging Face datasets via the datasets-server REST API (no heavyweight `datasets` dependency). Auto-detects audio/text columns, resolves relative paths, handles gated datasets via `HF_TOKEN`, and loads speaker-turn references for diarization scoring.
- New `aai_cli/wer.py` module: Word error rate (WER) scoring backed by jiwer. Normalizes text (lowercase, punctuation stripped, whitespace collapsed) before alignment so transcripts aren't penalized for casing/punctuation style differences.
- New `aai_cli/der.py` module: Diarization error rate (DER) scoring backed by pyannote.metrics. Computes missed detection, false alarm, and speaker confusion errors with optional collar tolerance around reference boundaries.
- New `aai_cli/commands/evaluate.py` module: The `eval` command that orchestrates transcription, scoring, and output. Supports `--speech-model` selection, `--speaker-labels` for diarization, `--collar` for DER tolerance, and `--json` output. Renders results as a table with per-item and pooled scores.
- Comprehensive test coverage:
  - `tests/test_eval_command.py`: End-to-end command behavior, WER/DER scoring, flag handling, error paths
  - `tests/test_eval_data_manifest.py`: Local manifest loading (CSV/JSONL), column resolution, path handling
  - `tests/test_eval_data_hf.py`: Hugging Face dataset loading, split/subset selection, authentication
  - `tests/test_wer.py`: WER normalization and alignment
  - `tests/test_der.py`: DER scoring and collar behavior
- Integration: Registered `eval` command in main app, added to help snapshots, updated README and smoke tests.

Implementation Details

- `eval_data.load()` detects whether the input is a local manifest (by file extension or existence) or a Hugging Face dataset ID (validated against `^[\w.-]+(?:/[\w.-]+)?$` to prevent typos being sent to the hub).
- … `datasets` library while still supporting gated/private datasets.
- … `--audio-column`/`--text-column` overrides.
- … `UsageError`/`APIError` framework.

https://claude.ai/code/session_014o9JsqBmqmybikbtHNsbSf
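As an illustration of the dataset ID check, the sketch below applies the pattern quoted above to a few inputs. Only the regular expression comes from the summary. The function name `looks_like_dataset_id` and the sample IDs are hypothetical, and how the real loader reports a rejected ID is not shown in this excerpt.

```python
import re

# Pattern quoted in the PR summary: one or two segments of word characters,
# dots, or hyphens, separated by at most one slash.
_DATASET_ID = re.compile(r"^[\w.-]+(?:/[\w.-]+)?$")


def looks_like_dataset_id(value: str) -> bool:
    """Return True when the value has the shape of a Hugging Face dataset id."""
    return _DATASET_ID.match(value) is not None


# Hypothetical inputs, chosen only to exercise the pattern.
SAMPLES = {
    "some_dataset": True,           # bare name
    "some-org/some_dataset": True,  # namespace/name
    "my data set": False,           # spaces are rejected
    "a/b/c": False,                 # more than one slash
    "org/name?": False,             # stray punctuation
    "": False,                      # empty string
}
```

One consequence of the pattern: a bare filename such as `data.csv` also has a valid shape, so the order in which the loader applies the manifest check and the ID check matters for inputs like that.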