Add per-row and pooled latency metrics to eval command #178
Merged
Time each transcription's wall-clock latency and surface it: a per-row LATENCY column / JSON field, plus pooled p50 and p90 percentiles in the summary. Latency is recorded for every row that ran a request (including failed ones), independent of WER scoring. https://claude.ai/code/session_01SWJQ3VVvR2YLyPDbtrw6tU
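A minimal sketch of how latency could be recorded for every row that ran a request, including failed ones. The names `Timed`, `timed_call`, and the `clock` parameter are hypothetical and are not taken from the PR's code; the injectable clock is only there so the timing can be pinned in a test.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Timed:
    """Hypothetical pairing of a transcription outcome with its latency."""
    text: Optional[str]    # transcript, or None if the request failed
    error: Optional[str]   # error message, or None on success
    latency: float         # seconds elapsed, recorded on success and failure


def timed_call(
    transcribe: Callable[[str], str],
    path: str,
    clock: Callable[[], float] = time.perf_counter,
) -> Timed:
    """Run one request and record its duration even when it raises."""
    start = clock()
    try:
        text, error = transcribe(path), None
    except Exception as exc:  # illustrative; a real command may catch narrower errors
        text, error = None, str(exc)
    # The second clock reading happens after the try/except, so failed
    # requests still get a latency value.
    return Timed(text=text, error=error, latency=clock() - start)
```

Because the latency is captured outside the `try` block, it does not depend on whether the row is later scored for WER.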
Instruments the `eval` command to measure and report transcription latency for each item and compute latency percentiles (p50, p90) across the dataset.

Summary
The evaluation command now tracks wall-clock latency for each transcription request using `time.perf_counter()` and includes per-row latency in both JSON and human-readable output. Latency metrics are pooled across all rows (including failed ones) to compute p50 and p90 percentiles.

Key Changes
- `_transcribe_one()` timed with `time.perf_counter()` calls to measure request duration; introduced `_Timed` dataclass to pair transcription outcomes with their latency
- `latency` field added to all result rows (both scored and failed); updated `_ItemResult` to carry latency alongside words for pooling
- `_percentile()` function using linear interpolation between ranks (numpy's default method) to compute p50 and p90 from latency values
- `_pooled_metrics()` now computes `latency_p50` and `latency_p90` across all transcribed items
- `_secs()` formatter for latency display (e.g., "1.50s")

Implementation Details
- `_Timed` wrapper is immutable (frozen dataclass) to prevent accidental mutation
- `_fake_perf_counter()` pins the timer for deterministic latency assertions

https://claude.ai/code/session_01SWJQ3VVvR2YLyPDbtrw6tU
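For illustration, percentile computation by linear interpolation between ranks, plus a seconds formatter, might look like the sketch below. The function names `percentile` and `secs` are stand-ins, not the PR's `_percentile()` and `_secs()`, and the handling of an empty input (returning `None`, displayed as "-") is an assumption rather than something the description states.

```python
from typing import Optional, Sequence


def percentile(values: Sequence[float], q: float) -> Optional[float]:
    """q-th percentile (0..100) using linear interpolation between ranks."""
    if not values:
        return None  # assumption: no timed rows means no percentile
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0  # fractional index into the sorted list
    lo = int(rank)                          # floor, since rank is non-negative
    hi = min(lo + 1, len(ordered) - 1)      # clamp at the last element
    frac = rank - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


def secs(value: Optional[float]) -> str:
    """Format seconds with two decimals; '-' for a missing value (assumed)."""
    return "-" if value is None else f"{value:.2f}s"


latencies = [1.0, 2.0, 3.0, 4.0]
print(secs(percentile(latencies, 50)))  # median of four values: halfway between 2.0 and 3.0
print(secs(percentile(latencies, 90)))
```

With four samples, p50 falls at fractional rank 1.5, so the result is interpolated halfway between the second and third sorted values.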