fix: skip debug TPS on short streams #849
Merged
Decode TPS is meaningless when the streamed window is only ~1ms (short / single-chunk tool-call turns), since dividing output tokens by a timer-quantized duration reports inflated rates like tens of thousands of tok/s. Only compute TPS when the stream window reaches 50ms; otherwise show the raw token count and duration.
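A minimal sketch of the gating described above, not the actual diff. The names `MIN_STREAM_MS_FOR_TPS` and `formatDebugTps` are hypothetical. Only the 50ms threshold and the two output formats come from this PR's description.

```typescript
// Hypothetical constant name; the PR states the value (50ms), not the identifier.
const MIN_STREAM_MS_FOR_TPS = 50;

// Builds the debug line for one turn.
// outputTokens: tokens emitted while streaming.
// streamMs: duration of the streamed window in milliseconds.
function formatDebugTps(outputTokens: number, streamMs: number): string {
  const summary = `${outputTokens} tokens in ${streamMs}ms`;
  if (streamMs < MIN_STREAM_MS_FOR_TPS) {
    // Window too short: timer quantization dominates the ratio, so skip TPS.
    return `${summary} (stream too short for TPS)`;
  }
  // Decode TPS: output tokens divided by the stream duration in seconds.
  const tps = (outputTokens * 1000) / streamMs;
  return `TPS: ${tps.toFixed(1)} tok/s (${summary})`;
}
```

Under these assumptions, `formatDebugTps(44, 1)` yields `44 tokens in 1ms (stream too short for TPS)` rather than a 44000 tok/s figure, while a 100-token, 2000ms stream still reports `TPS: 50.0 tok/s (100 tokens in 2000ms)`.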
🦋 Changeset detected
Latest commit: be7f33b
The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
Related Issue
No linked issue. Supersedes #848 with an alternative fix that preserves decode-TPS semantics.
Problem
Debug mode computes TPS as `output tokens / stream duration`. For short or single-chunk tool-call turns the streamed window drains in ~1ms, so dividing by a timer-quantized duration reports inflated rates in the tens of thousands of tok/s, e.g. `TPS: 44000.0 tok/s (44 tokens in 1ms)`.
What changed
Only compute debug TPS when the streamed window reaches 50ms, which keeps the existing decode-TPS semantics for normal turns. Below that threshold the line shows the raw token count and duration (`44 tokens in 1ms (stream too short for TPS)`) instead of a meaningless ratio. This differs from #848, which redefined TPS over the full response window (TTFT + stream). Added regression tests and a patch changeset.
Checklist
gen-changesets skill, or this PR needs no changeset.
gen-docs skill, or this PR needs no doc update.