Reduce Skippy decode hot-path overhead (3/3) by i386 · Pull Request #798 · Mesh-LLM/mesh-llm

i386 · 2026-06-05T06:24:49Z

Summary

Skippy decode now does less CPU-side work on every token hop when debug telemetry is off, and downstream decode/verify frames get a correctly sized output activation buffer before entering the native stage runtime.

This is the first controlled cleanup PR on the path to higher decode throughput. It keeps the optimization small and reviewable so we can benchmark follow-up decode changes independently instead of bundling several theories together.

Before

Decode handling paid two recurring costs in the hot path:

Binary transport and OpenAI frontend decode loops built detailed debug telemetry attribute maps for every token even when debug telemetry was disabled.
Decode and verify stage execution passed 0 as the native output activation buffer capacity, leaving dense downstream activation output sizing to the slower fallback/probe path.

flowchart LR
    A["Receive decode frame"] --> B["Build debug attrs"]
    B --> C["Run llama stage with output capacity = 0"]
    C --> D["Maybe resize/probe output buffer"]
    D --> E["Build more debug attrs"]
    E --> F["Forward activation / return token"]

After

The hot path now skips that work unless it is actually needed:

Debug span attributes are only constructed when telemetry.is_debug_enabled() is true.
Decode/verify output capacity is precomputed for downstream stages as token_count * activation_width * size_of(f32), matching the existing wire activation size calculation.
Final stages and empty-token paths still pass zero capacity because they do not need downstream activation output.
Prefill behavior is intentionally unchanged in this PR.
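For readers who want the shape of the telemetry gating, here is a minimal sketch rather than the code in this PR. Only telemetry.is_debug_enabled() comes from the description above; the Telemetry struct, build_debug_attrs, debug_attrs_if_enabled, and the attribute keys are hypothetical names chosen for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the telemetry handle. Only the
/// `is_debug_enabled` method name is taken from the PR description.
struct Telemetry {
    debug: bool,
}

impl Telemetry {
    fn is_debug_enabled(&self) -> bool {
        self.debug
    }
}

/// Hypothetical attribute builder. It allocates a map and formats strings,
/// which is the kind of per-token cost worth skipping when nothing reads it.
fn build_debug_attrs(token_index: usize, token_count: usize) -> BTreeMap<&'static str, String> {
    let mut attrs = BTreeMap::new();
    attrs.insert("token_index", token_index.to_string());
    attrs.insert("token_count", token_count.to_string());
    attrs
}

/// Returns `None` without building anything when debug telemetry is off.
/// The closure passed to `then` only runs when the flag is true.
fn debug_attrs_if_enabled(
    telemetry: &Telemetry,
    token_index: usize,
    token_count: usize,
) -> Option<BTreeMap<&'static str, String>> {
    telemetry
        .is_debug_enabled()
        .then(|| build_debug_attrs(token_index, token_count))
}
```

The point of returning an Option is that the caller attaches attributes to the span only in the Some case, so the non-debug path does no map allocation or string formatting per token.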

flowchart LR
    A["Receive decode frame"] --> B{"Debug telemetry enabled?"}
    B -- "yes" --> C["Build debug attrs"]
    B -- "no" --> D["Skip debug attr construction"]
    C --> E["Estimate downstream activation capacity"]
    D --> E
    E --> F["Run llama stage with pre-sized output buffer"]
    F --> G["Forward activation / return token"]
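The "Estimate downstream activation capacity" step can be sketched as follows. This is an illustration under assumptions, not the function from this PR: the name downstream_output_capacity, the is_final_stage flag, and the choice to return zero on arithmetic overflow (so the existing fallback sizing path would take over) are all hypothetical. The formula and the zero-capacity cases come from the description above.

```rust
use std::mem::size_of;

/// Byte capacity to reserve for the activations a stage hands downstream.
///
/// Returns 0 for final stages and empty-token frames, which do not need
/// downstream activation output. Returning 0 on overflow is an assumption
/// made for this sketch.
fn downstream_output_capacity(
    token_count: usize,
    activation_width: usize,
    is_final_stage: bool,
) -> usize {
    if is_final_stage || token_count == 0 {
        return 0;
    }
    // token_count * activation_width * size_of::<f32>(), with overflow checks.
    token_count
        .checked_mul(activation_width)
        .and_then(|elements| elements.checked_mul(size_of::<f32>()))
        .unwrap_or(0)
}
```

As a worked case, a single-token decode frame with a hypothetical activation width of 4096 gives 1 * 4096 * 4 = 16384 bytes.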

Performance Impact

This targets fixed per-token overhead rather than model math:

Lower CPU allocation/serialization work in the normal non-debug telemetry mode.
Fewer trips through the native output-buffer fallback/probe path for dense downstream decode and verify frames.
Expected benefit is most visible when decode TPOT is dominated by orchestration overhead, transport latency, or small per-token CPU costs around stage execution.

No lab benchmark numbers are included here because the benchmark lab is currently delayed. This PR is meant to be the clean baseline for the next controlled benchmark pass.

Compatibility

No protocol, ABI, topology, sampling, or activation dtype changes. This is internal hot-path cleanup only.

Validation

cargo fmt -p skippy-server -- --check
cargo check -p skippy-server
cargo test -p skippy-server --lib (115 passed)
cargo clippy -p skippy-server --all-targets -- -D warnings
cargo check -p mesh-llm
cargo clippy -p mesh-llm --all-targets -- -D warnings

ndizazzo

Impl looks good/layered correctly - rebasing and getting across the line

i386 mentioned this pull request Jun 5, 2026

Reuse Skippy decode wire messages (2/3) #799

Merged

ndizazzo self-requested a review June 6, 2026 04:49

ndizazzo assigned i386 Jun 6, 2026

Reduce Skippy decode hot-path overhead

aeae536

ndizazzo force-pushed the skippy-decode-hotpath-cleanup branch from cd1a007 to aeae536 June 6, 2026 05:07

ndizazzo approved these changes Jun 6, 2026

ndizazzo changed the title ~~Reduce Skippy decode hot-path overhead~~ Reduce Skippy decode hot-path overhead (3/3) Jun 6, 2026

Reuse Skippy decode wire messages

4d4752b

i386 wants to merge 2 commits into main from skippy-decode-hotpath-cleanup
