Add per-row and pooled latency metrics to eval command #178

Merged
alexkroman merged 1 commit into main from claude/zen-albattani-5sg0hc on Jun 16, 2026

@alexkroman

Collaborator

Instruments the eval command to measure and report transcription latency for each item and compute latency percentiles (p50, p90) across the dataset.

Summary

The evaluation command now tracks wall-clock latency for each transcription request using time.perf_counter() and includes per-row latency in both JSON and human-readable output. Latency metrics are pooled across all rows (including failed ones) to compute p50 and p90 percentiles.
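
As a rough illustration of the timing pattern described above, here is a minimal sketch. The names `Timed` and `timed_call` are hypothetical stand-ins, not the PR's actual `_Timed` or `_transcribe_one` code, and the choice to capture the exception as the outcome is an assumption made for the example.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Timed:
    """Pairs an outcome (a result or the raised exception) with its latency."""

    outcome: Any
    latency: float  # wall-clock seconds


def timed_call(fn: Callable[[], Any]) -> Timed:
    """Run fn and record its duration, even when fn raises."""
    start = time.perf_counter()
    try:
        outcome = fn()
    except Exception as exc:  # assumption: failures are kept as the outcome
        outcome = exc
    return Timed(outcome=outcome, latency=time.perf_counter() - start)
```

Because the elapsed time is computed after the `try` block, a failed request still yields a latency value, which is what lets failed rows contribute to the pooled percentiles.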

Key Changes

  • Latency measurement: Wrapped _transcribe_one() with time.perf_counter() calls to measure request duration; introduced _Timed dataclass to pair transcription outcomes with their latency
  • Per-row latency: Added latency field to all result rows (both scored and failed); updated _ItemResult to carry latency alongside words for pooling
  • Percentile computation: Implemented _percentile() function using linear interpolation between ranks (numpy's default method) to compute p50 and p90 from latency values
  • Pooled metrics: Extended _pooled_metrics() to compute latency_p50 and latency_p90 across every item for which a request ran, including failed ones
  • Output formatting:
    • Added _secs() formatter for latency display (e.g., "1.50s")
    • Updated human-readable table to include LATENCY column when latency data is present
    • Extended summary line to show "latency p50 X.XXs · p90 Y.YYs"
  • Test coverage: Added comprehensive tests for latency tracking in both success and failure paths, percentile interpolation edge cases, and output formatting
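
The percentile, pooling, and formatting changes listed above can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: `percentile`, `secs`, and `pooled_latency` are hypothetical names, `q` is assumed to be a fraction in [0, 1], and rows are assumed to be dicts carrying a `latency` key.

```python
import math
from typing import Dict, List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear interpolation between closest ranks (numpy's default method)."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)  # fractional index into the sorted values
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def secs(value: float) -> str:
    """Format a latency in seconds with two decimals."""
    return f"{value:.2f}s"


def pooled_latency(rows: List[dict]) -> Dict[str, float]:
    """Pool latency over every row that has one, scored or failed."""
    latencies = [row["latency"] for row in rows if "latency" in row]
    return {
        "latency_p50": percentile(latencies, 0.5),
        "latency_p90": percentile(latencies, 0.9),
    }


rows = [
    {"latency": 1.0, "wer": 0.1},
    {"latency": 2.0, "error": "timeout"},  # failed row still counts
    {"latency": 4.0, "wer": 0.0},
]
pooled = pooled_latency(rows)
summary = f"latency p50 {secs(pooled['latency_p50'])} · p90 {secs(pooled['latency_p90'])}"
```

With a single value, `lo` and `hi` coincide and the function returns that value unchanged; `q=0` and `q=1` return the minimum and maximum.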

Implementation Details

  • Latency is measured for every transcription attempt, including failed requests, ensuring the distribution reflects actual API behavior
  • The _Timed wrapper is immutable (frozen dataclass) to prevent accidental mutation
  • Failed rows include latency in their output row dict alongside the error message
  • Percentile calculation handles edge cases: single values, odd/even counts, and boundary quantiles (q=0, q=1)
  • Test helper _fake_perf_counter() pins the timer for deterministic latency assertions
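
One way a test could pin the timer is sketched below. This is an assumption about the general approach, not the PR's actual `_fake_perf_counter()` helper: `fake_perf_counter` and `measure` are hypothetical names, and patching `time.perf_counter` through `unittest.mock` is just one possible mechanism.

```python
import itertools
import time
from unittest import mock


def fake_perf_counter(step: float = 1.5):
    """Return a stand-in clock that advances by `step` seconds per call."""
    ticks = itertools.count()
    return lambda: next(ticks) * step


def measure(fn) -> float:
    """Time fn with time.perf_counter, looked up at call time so patches apply."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


# With the clock pinned, each measured call takes exactly one step.
with mock.patch("time.perf_counter", fake_perf_counter(1.5)):
    latency = measure(lambda: None)
```

The measured code has to reach the clock through the `time` module attribute (rather than a `from time import perf_counter` binding) for this particular patch target to take effect.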

https://claude.ai/code/session_01SWJQ3VVvR2YLyPDbtrw6tU

Time each transcription's wall-clock latency and surface it: a per-row
LATENCY column / JSON field, plus pooled p50 and p90 percentiles in the
summary. Latency is recorded for every row that ran a request (including
failed ones), independent of WER scoring.

https://claude.ai/code/session_01SWJQ3VVvR2YLyPDbtrw6tU
@alexkroman alexkroman enabled auto-merge June 16, 2026 15:53
@alexkroman alexkroman added this pull request to the merge queue Jun 16, 2026
Merged via the queue into main with commit fc19a5a Jun 16, 2026
19 checks passed
@alexkroman alexkroman deleted the claude/zen-albattani-5sg0hc branch June 16, 2026 16:01
2 participants