Add per-row and pooled latency metrics to eval command by alexkroman · Pull Request #178 · AssemblyAI/cli

alexkroman · 2026-06-16T15:53:31Z

Instruments the eval command to measure and report transcription latency for each item and compute latency percentiles (p50, p90) across the dataset.

Summary

The evaluation command now tracks wall-clock latency for each transcription request using time.perf_counter() and includes per-row latency in both JSON and human-readable output. Latency metrics are pooled across all rows (including failed ones) to compute p50 and p90 percentiles.
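As a rough illustration of the timing approach described above, the sketch below wraps an arbitrary callable with `time.perf_counter()` and returns the outcome together with its latency, recording the latency even when the call raises. The names `Timed` and `timed_call` and the field layout are hypothetical stand-ins; the PR's actual `_Timed` dataclass and `_transcribe_one()` signature are not shown here and may differ.

```python
import time
from dataclasses import dataclass
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Timed(Generic[T]):
    """Hypothetical pairing of an outcome with its wall-clock latency."""

    value: Optional[T]     # result of the call, or None if it failed
    error: Optional[str]   # error message, or None if it succeeded
    latency: float         # elapsed seconds measured with perf_counter


def timed_call(fn: Callable[[], T]) -> Timed[T]:
    """Run fn and measure its duration on both the success and failure paths."""
    start = time.perf_counter()
    try:
        value = fn()
    except Exception as exc:  # failed requests still get a latency
        return Timed(value=None, error=str(exc), latency=time.perf_counter() - start)
    return Timed(value=value, error=None, latency=time.perf_counter() - start)
```

Returning the latency on the failure path is what allows failed rows to contribute to the pooled distribution.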

Key Changes

- Latency measurement: Wrapped _transcribe_one() with time.perf_counter() calls to measure request duration; introduced _Timed dataclass to pair transcription outcomes with their latency
- Per-row latency: Added latency field to all result rows (both scored and failed); updated _ItemResult to carry latency alongside words for pooling
- Percentile computation: Implemented _percentile() function using linear interpolation between ranks (numpy's default method) to compute p50 and p90 from latency values
- Pooled metrics: Extended _pooled_metrics() to compute latency_p50 and latency_p90 across all transcribed items
- Output formatting:
  - Added _secs() formatter for latency display (e.g., "1.50s")
  - Updated human-readable table to include LATENCY column when latency data is present
  - Extended summary line to show "latency p50 X.XXs · p90 Y.YYs"
- Test coverage: Added comprehensive tests for latency tracking in both success and failure paths, percentile interpolation edge cases, and output formatting
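The percentile and formatting steps listed above can be sketched as follows. This is a minimal sketch of linear interpolation between ranks, assuming `q` is a fraction in [0, 1]. The function names `percentile` and `secs` are illustrative, and the PR's `_percentile()` and `_secs()` may differ in signature and in how they treat empty input.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile by linear interpolation between the two nearest ranks."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)  # fractional index into the sorted data
    lo = math.floor(rank)
    hi = math.ceil(rank)
    # Interpolate between the neighbouring values; when rank is an integer,
    # lo == hi and the value at that index is returned unchanged.
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def secs(value: float) -> str:
    """Format seconds with two decimals, e.g. 1.5 -> '1.50s'."""
    return f"{value:.2f}s"


latencies = [1.0, 2.0, 3.0, 4.0]  # made-up sample data
summary = f"latency p50 {secs(percentile(latencies, 0.5))} · p90 {secs(percentile(latencies, 0.9))}"
```

With an even number of values the median falls halfway between the two middle items, and `q=0` and `q=1` return the minimum and maximum, which covers the edge cases the PR says its tests exercise.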

Implementation Details

- Latency is measured for every transcription attempt, including failed requests, ensuring the distribution reflects actual API behavior
- The _Timed wrapper is immutable (frozen dataclass) to prevent accidental mutation
- Failed rows include latency in their output row dict alongside the error message
- Percentile calculation handles edge cases: single values, odd/even counts, and boundary quantiles (q=0, q=1)
- Test helper _fake_perf_counter() pins the timer for deterministic latency assertions
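One way a pinned timer like the test helper mentioned above might work is sketched below: `time.perf_counter` is replaced with a fake that advances by a fixed step on every call, so a measured duration becomes an exact, predictable number. The helper `fake_perf_counter` and the `measure` function are hypothetical; the PR's `_fake_perf_counter()` is not shown here and may be implemented differently (for example with pytest's monkeypatch).

```python
import itertools
import time
from typing import Callable
from unittest import mock


def fake_perf_counter(step: float) -> Callable[[], float]:
    """Return a clock that reads 0, step, 2*step, ... on successive calls."""
    ticks = itertools.count()
    return lambda: next(ticks) * step


def measure(fn: Callable[[], object]) -> float:
    """Code under test: reads time.perf_counter() before and after fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


# Patch the timer only for the duration of the measurement.
with mock.patch("time.perf_counter", fake_perf_counter(1.5)):
    latency = measure(lambda: None)

assert latency == 1.5  # two clock reads, exactly one step apart
```

This only works because `measure` looks up `time.perf_counter` through the `time` module at call time; code that does `from time import perf_counter` would need the patch applied to its own module instead.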

https://claude.ai/code/session_01SWJQ3VVvR2YLyPDbtrw6tU

Time each transcription's wall-clock latency and surface it: a per-row LATENCY column / JSON field, plus pooled p50 and p90 percentiles in the summary. Latency is recorded for every row that ran a request (including failed ones), independent of WER scoring. https://claude.ai/code/session_01SWJQ3VVvR2YLyPDbtrw6tU

alexkroman enabled auto-merge June 16, 2026 15:53

alexkroman added this pull request to the merge queue Jun 16, 2026

Merged via the queue into main with commit fc19a5a Jun 16, 2026
19 checks passed

alexkroman deleted the claude/zen-albattani-5sg0hc branch June 16, 2026 16:01

alexkroman merged 1 commit into main from claude/zen-albattani-5sg0hc

2 participants