Reduce Skippy decode hot-path overhead (3/3)#798

Open
i386 wants to merge 2 commits into main from skippy-decode-hotpath-cleanup

@i386 i386 commented Jun 5, 2026

Summary

Skippy decode now does less CPU-side work on every token hop when debug telemetry is off, and downstream decode/verify frames get a correctly sized output activation buffer before entering the native stage runtime.

This is the first controlled cleanup PR on the path to higher decode throughput. It keeps the optimization small and reviewable so we can benchmark follow-up decode changes independently instead of bundling several theories together.

Before

Decode handling paid two recurring costs in the hot path:

  • Binary transport and OpenAI frontend decode loops built detailed debug telemetry attribute maps for every token even when debug telemetry was disabled.
  • Decode and verify stage execution passed 0 as the native output activation buffer capacity, leaving dense downstream activation output sizing to the slower fallback/probe path.
flowchart LR
    A["Receive decode frame"] --> B["Build debug attrs"]
    B --> C["Run llama stage with output capacity = 0"]
    C --> D["Maybe resize/probe output buffer"]
    D --> E["Build more debug attrs"]
    E --> F["Forward activation / return token"]

After

The hot path now skips that work unless it is actually needed:

  • Debug span attributes are only constructed when telemetry.is_debug_enabled() is true.
  • Decode/verify output capacity is precomputed for downstream stages as token_count * activation_width * size_of::<f32>() bytes (4 bytes per f32 element), matching the existing wire activation size calculation.
  • Final stages and empty-token paths still pass zero capacity because they do not need downstream activation output.
  • Prefill behavior is intentionally unchanged in this PR.
flowchart LR
    A["Receive decode frame"] --> B{"Debug telemetry enabled?"}
    B -- "yes" --> C["Build debug attrs"]
    B -- "no" --> D["Skip debug attr construction"]
    C --> E["Estimate downstream activation capacity"]
    D --> E
    E --> F["Run llama stage with pre-sized output buffer"]
    F --> G["Forward activation / return token"]
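
As a rough illustration of the capacity rule described above, here is a minimal sketch. The function name, its parameters, and the overflow fallback are assumptions for illustration and are not taken from the skippy-server code; only the token_count * activation_width * f32 sizing and the zero-capacity cases come from this PR description.

```rust
use std::mem::size_of;

/// Hypothetical helper (name and signature are assumptions, not the PR's code).
/// Returns the byte capacity to pre-size for a dense f32 downstream activation.
/// Final stages and empty-token frames get 0, since they produce no downstream
/// activation output.
fn downstream_output_capacity(
    token_count: usize,
    activation_width: usize,
    is_final_stage: bool,
) -> usize {
    if is_final_stage || token_count == 0 {
        return 0;
    }
    // token_count * activation_width elements, each size_of::<f32>() == 4 bytes.
    token_count
        .checked_mul(activation_width)
        .and_then(|elems| elems.checked_mul(size_of::<f32>()))
        // Assumption: on arithmetic overflow, fall back to the zero-capacity path
        // instead of panicking in the hot path.
        .unwrap_or(0)
}
```

Under these assumptions, a single decode token with an activation width of 4096 would request 16384 bytes, while a final stage or an empty frame would request 0.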

Performance Impact

This targets fixed per-token overhead rather than model math:

  • Lower CPU allocation/serialization work in the normal non-debug telemetry mode.
  • Fewer native output-buffer fallback/probe opportunities for dense downstream decode and verify frames.
  • Expected benefit is most visible when decode TPOT (time per output token) is dominated by orchestration overhead, transport latency, or small per-token CPU costs around stage execution.
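
To make the allocation saving concrete, the sketch below shows one way to defer attribute-map construction behind the debug check. The Telemetry struct, the debug_attrs helper, and the attribute map type are hypothetical stand-ins; only the is_debug_enabled() method name appears in this PR description.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the telemetry handle. Only the method name
/// `is_debug_enabled` is taken from the PR description.
struct Telemetry {
    debug: bool,
}

impl Telemetry {
    fn is_debug_enabled(&self) -> bool {
        self.debug
    }
}

/// Hypothetical helper: runs `build` (which allocates and formats the
/// per-token attributes) only when debug telemetry is enabled.
fn debug_attrs(
    telemetry: &Telemetry,
    build: impl FnOnce() -> BTreeMap<&'static str, String>,
) -> Option<BTreeMap<&'static str, String>> {
    if telemetry.is_debug_enabled() {
        Some(build())
    } else {
        // Non-debug mode: the closure is never called, so no map or strings
        // are allocated for this token.
        None
    }
}
```

Passing a closure rather than a prebuilt map is the design choice that matters here: in the non-debug mode the formatting and allocation inside the closure never run, which is the per-token cost this PR aims to remove.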

No lab benchmark numbers are included here because the benchmark lab is currently delayed. This PR is meant to be the clean baseline for the next controlled benchmark pass.

Compatibility

No protocol, ABI, topology, sampling, or activation dtype changes. This is internal hot-path cleanup only.

Validation

  • cargo fmt -p skippy-server -- --check
  • cargo check -p skippy-server
  • cargo test -p skippy-server --lib — 115 passed
  • cargo clippy -p skippy-server --all-targets -- -D warnings
  • cargo check -p mesh-llm
  • cargo clippy -p mesh-llm --all-targets -- -D warnings

@ndizazzo ndizazzo force-pushed the skippy-decode-hotpath-cleanup branch from cd1a007 to aeae536 Compare June 6, 2026 05:07
@ndizazzo ndizazzo left a comment

Implementation looks good and is layered correctly. Rebasing and getting it across the line.

@ndizazzo ndizazzo changed the title Reduce Skippy decode hot-path overhead Reduce Skippy decode hot-path overhead (3/3) Jun 6, 2026

2 participants