
Reuse Skippy forwarded decode frames (1/3) #800

Merged: ndizazzo merged 1 commit into skippy-decode-frame-reuse from skippy-decode-forward-reuse on Jun 6, 2026

@i386 (Collaborator) commented Jun 5, 2026

Summary

Skippy split decode now reuses the forwarded activation frame envelope and activation encode buffer across decode tokens.

This is stacked on #799. It keeps the existing writer and wire format unchanged, but removes repeated forwarded StageWireMessage construction and gives activation encoding a caller-owned buffer to refill.

What changed

  • Added an internal in-place activation encoder: encode_f32_activation_payload_with_state_flags_into(...).
  • Kept the existing Vec-returning activation encode API by delegating to the in-place helper.
  • Added ReusableForwardedStageMessage for the decode forwarding hot path.
  • Reused forwarded decode frame containers in normal split decode and multimodal split decode.
  • Preserved existing one-shot forwarded_stage_message_timed(...) for non-loop callsites.
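
The first two bullets describe a common pattern: an in-place encoder that refills a caller-owned buffer, plus the original Vec-returning function kept as a thin wrapper. A minimal sketch of that pattern follows. The names `encode_f32_into` and `encode_f32` are hypothetical stand-ins, the little-endian f32 layout is an assumption, and the state flags handled by the real `encode_f32_activation_payload_with_state_flags_into(...)` are omitted.

```rust
/// Hypothetical in-place encoder (not the PR's actual signature).
/// Refills `out` with the little-endian bytes of `values`, reusing
/// whatever capacity `out` already holds from earlier tokens.
fn encode_f32_into(values: &[f32], out: &mut Vec<u8>) {
    out.clear(); // drops the old contents but keeps the allocation
    out.reserve(values.len() * std::mem::size_of::<f32>());
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
}

/// Vec-returning wrapper for one-shot callers.
/// Delegates to the in-place helper so both paths emit identical bytes.
fn encode_f32(values: &[f32]) -> Vec<u8> {
    let mut out = Vec::new();
    encode_f32_into(values, &mut out);
    out
}
```

In a decode loop, the caller would create the buffer once before the loop and pass it to the in-place helper on every token. Because the helper clears the buffer first, a shorter payload never carries over bytes from a longer previous one.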

Before

flowchart LR
    A["Native decode output"] --> B["Encode activation into new Vec"]
    B --> C["Build new forwarded StageWireMessage"]
    C --> D["Clone sampling/tokens/positions"]
    D --> E["write_stage_message_conditioned"]

After

flowchart LR
    A["Before decode loop"] --> B["Create reusable forwarded frame"]
    B --> C["Native decode output"]
    C --> D["Refill activation encode buffer"]
    D --> E["Update forwarded frame fields"]
    E --> F["write_stage_message_conditioned"]

Performance Impact

This targets fixed CPU/allocation overhead in decode time per output token (TPOT):

  • Reuses the activation encode buffer instead of allocating a fresh activation Vec every forwarded decode token.
  • Reuses token/position/raw-byte containers on the forwarded message.
  • Avoids repeatedly cloning stable sampling config in the two split decode loops unless it actually changes.
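
As an illustration of the three points above, here is a rough sketch of a reusable frame that refills its containers in place and clones the sampling config only when it differs from the stored copy. `ReusableFrame`, `SamplingConfig`, and their fields are hypothetical and much simpler than the real `ReusableForwardedStageMessage` and `StageWireMessage`, whose layouts are not shown in this PR description.

```rust
/// Hypothetical sampling settings; the real config has other fields.
#[derive(Clone, Debug, PartialEq)]
struct SamplingConfig {
    temperature: f32,
    top_k: u32,
}

/// Hypothetical reusable frame, created once before the decode loop.
#[derive(Debug, Default)]
struct ReusableFrame {
    tokens: Vec<u32>,
    positions: Vec<u32>,
    activation: Vec<u8>,
    sampling: Option<SamplingConfig>,
}

impl ReusableFrame {
    /// Overwrites the frame for the next decode token without reallocating
    /// containers that already have enough capacity.
    fn refill(
        &mut self,
        tokens: &[u32],
        positions: &[u32],
        activation: &[f32],
        sampling: &SamplingConfig,
    ) {
        self.tokens.clear();
        self.tokens.extend_from_slice(tokens);
        self.positions.clear();
        self.positions.extend_from_slice(positions);
        self.activation.clear();
        for v in activation {
            // Assumed little-endian f32 payload layout.
            self.activation.extend_from_slice(&v.to_le_bytes());
        }
        // Clone the sampling config only if it actually changed.
        if self.sampling.as_ref() != Some(sampling) {
            self.sampling = Some(sampling.clone());
        }
    }
}
```

Once the containers have grown to the steady-state size, each token costs a few copies into existing storage rather than a fresh allocation per field.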

Expected impact is still modest because GPU forward, network wait, activation wire bytes, and downstream compute remain unchanged. This should be more meaningful than #799 when activation encode allocation shows up in profiles, but it is still a cleanup-class improvement rather than the main 30 tok/s lever.

High-impact follow-up

The bigger decode lever remains overlap/pipelining: hide stage0 setup, sampler/direct-return handling, or downstream wait behind work already in flight. That can remove or hide milliseconds of TPOT; this PR removes repeated local allocation work.

Compatibility

No wire protocol, ABI, topology, sampling, or activation dtype changes. The emitted stage messages keep the same fields and activation bytes.

Validation

  • cargo fmt -p skippy-protocol -p skippy-server -- --check
  • cargo check -p skippy-server
  • cargo test -p skippy-protocol --lib — 34 passed
  • cargo test -p skippy-server --lib — 117 passed
  • cargo clippy -p skippy-protocol --all-targets -- -D warnings
  • cargo clippy -p skippy-server --all-targets -- -D warnings
  • cargo check -p mesh-llm
  • cargo clippy -p mesh-llm --all-targets -- -D warnings

@ndizazzo changed the title from "Reuse Skippy forwarded decode frames" to "Reuse Skippy forwarded decode frames (1/3)" on Jun 6, 2026
@ndizazzo self-requested a review on June 6, 2026 05:23
@ndizazzo (Collaborator) commented Jun 6, 2026

Tipping the stack

@ndizazzo merged commit dafa489 into skippy-decode-frame-reuse on Jun 6, 2026
28 checks passed
@ndizazzo deleted the skippy-decode-forward-reuse branch on June 6, 2026 05:23