Reduce Skippy decode hot-path overhead (3/3) #798
Open
i386 wants to merge 2 commits into
cd1a007 to
aeae536
ndizazzo (Collaborator) approved these changes on Jun 6, 2026 and left a comment:
Impl looks good/layered correctly - rebasing and getting across the line
Summary
Skippy decode now does less CPU-side work on every token hop when debug telemetry is off, and downstream decode/verify frames get a correctly sized output activation buffer before entering the native stage runtime.
This is the first controlled cleanup PR on the path to higher decode throughput. It keeps the optimization small and reviewable so we can benchmark follow-up decode changes independently instead of bundling several theories together.
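As a rough illustration of the first change named in the summary, the sketch below gates attribute construction on the debug flag. Only `is_debug_enabled()` appears in this PR; the `Telemetry` struct, `build_debug_attrs`, `debug_attrs_if_enabled`, and the attribute names are hypothetical stand-ins, not the actual skippy-server types.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for the telemetry handle.
/// Only the `is_debug_enabled` name comes from the PR description.
struct Telemetry {
    debug: bool,
}

impl Telemetry {
    fn is_debug_enabled(&self) -> bool {
        self.debug
    }
}

/// Hypothetical attribute builder. It allocates and formats strings,
/// which is the per-frame cost the PR wants to skip. The counter only
/// exists so the saving can be observed in this sketch.
fn build_debug_attrs(
    frame_id: u64,
    token_count: usize,
    builds: &Cell<u32>,
) -> Vec<(&'static str, String)> {
    builds.set(builds.get() + 1);
    vec![
        ("frame_id", frame_id.to_string()),
        ("token_count", token_count.to_string()),
    ]
}

/// Builds attributes only when debug telemetry is enabled;
/// otherwise returns `None` without doing any formatting work.
fn debug_attrs_if_enabled(
    telemetry: &Telemetry,
    frame_id: u64,
    token_count: usize,
    builds: &Cell<u32>,
) -> Option<Vec<(&'static str, String)>> {
    if telemetry.is_debug_enabled() {
        Some(build_debug_attrs(frame_id, token_count, builds))
    } else {
        None
    }
}
```

With debug telemetry off, the builder is never called, so the per-frame string formatting and allocation disappear from the hot path.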
Before
Decode handling paid two recurring costs in the hot path:
- Debug attributes were built on every decode frame, even when debug telemetry was off.
- Decode passed 0 as the native output activation buffer capacity, leaving dense downstream activation output sizing to the slower fallback/probe path.

flowchart LR
    A["Receive decode frame"] --> B["Build debug attrs"]
    B --> C["Run llama stage with output capacity = 0"]
    C --> D["Maybe resize/probe output buffer"]
    D --> E["Build more debug attrs"]
    E --> F["Forward activation / return token"]

After
The hot path now skips that work unless it is actually needed:
- Debug attributes are built only when telemetry.is_debug_enabled() is true.
- The downstream output activation buffer is pre-sized to token_count * activation_width * f32, matching the existing wire activation size calculation.

flowchart LR
    A["Receive decode frame"] --> B{"Debug telemetry enabled?"}
    B -- "yes" --> C["Build debug attrs"]
    B -- "no" --> D["Skip debug attr construction"]
    C --> E["Estimate downstream activation capacity"]
    D --> E
    E --> F["Run llama stage with pre-sized output buffer"]
    F --> G["Forward activation / return token"]

Performance Impact
This targets fixed per-token overhead rather than model math:
No lab benchmark numbers are included here because the benchmark lab is currently delayed. This PR is meant to be the clean baseline for the next controlled benchmark pass.
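To make the buffer-sizing half of that overhead concrete, here is a minimal sketch of the `token_count * activation_width * f32` estimate. The function names, the overflow handling, and the fallback to a capacity of 0 are assumptions for illustration, not the actual skippy-server code.

```rust
use std::mem::size_of;

/// Bytes needed for a dense f32 activation covering `token_count` tokens,
/// each `activation_width` values wide. Returns `None` if the product
/// overflows `usize` instead of silently wrapping.
fn estimate_activation_capacity(token_count: usize, activation_width: usize) -> Option<usize> {
    token_count
        .checked_mul(activation_width)?
        .checked_mul(size_of::<f32>())
}

/// Hypothetical helper: reserve the output buffer up front.
/// Assumption: on overflow we fall back to capacity 0, which mirrors
/// the old behavior of leaving sizing to a later resize/probe step.
fn presized_output_buffer(token_count: usize, activation_width: usize) -> Vec<u8> {
    let capacity = estimate_activation_capacity(token_count, activation_width).unwrap_or(0);
    Vec::with_capacity(capacity)
}
```

For a single-token decode frame with a hypothetical activation width of 4096, the estimate is 1 * 4096 * 4 = 16384 bytes, reserved once instead of discovered through a resize.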
Compatibility
No protocol, ABI, topology, sampling, or activation dtype changes. This is internal hot-path cleanup only.
Validation
- cargo fmt -p skippy-server -- --check
- cargo check -p skippy-server
- cargo test -p skippy-server --lib (115 passed)
- cargo clippy -p skippy-server --all-targets -- -D warnings
- cargo check -p mesh-llm
- cargo clippy -p mesh-llm --all-targets -- -D warnings