
Enable package-declared draft speculation#710

Draft: i386 wants to merge 8 commits into main from codex/package-declared-draft-spec-main


@i386 i386 commented May 27, 2026

Summary

This PR lets Skippy layer packages declare a known-good draft speculative decoding setup, and has Skippy serving adopt that setup by default. Operators can publish a package with its draft model and window policy encoded in model-package.json instead of remembering launch flags for every single-stage or multi-stage run.

The user-visible goal is draft speculation for Skippy multi-stage serving. Splitting a model across stages exposes stage and network latency, and package-declared speculation gives Skippy a default path to recover part of that wall time (about 20% in the benchmark reported under Benchmark Evidence) while keeping the target model split across nodes.

What Changed

  • Adds generation.speculative_decoding metadata to layer packages.
  • Adds package writer and HF Job support for recording a draft strategy in generated package metadata.
  • Adds config support for [defaults.speculative] and [models.speculative], with explicit config taking precedence over package defaults.
  • Wires resolved speculation settings into Skippy's embedded stage-0 OpenAI frontend.
  • Adds --no-draft so operators can disable package-declared draft defaults for a run.
  • Documents the package schema, config behavior, and operational defaults.

Default behavior is intentionally simple: if a package declares a usable draft strategy, Skippy enables it automatically. If the draft cannot be resolved, Skippy serves the baseline package without speculation instead of failing startup.
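The precedence and fallback rules above can be sketched as follows. This is a minimal illustration, not the PR's resolver: `DraftStrategy`, `Speculation`, `select_strategy`, and `resolve_speculation` are hypothetical names, and the sketch assumes `--no-draft` suppresses only the package default rather than explicit config.

```rust
/// Hypothetical, simplified view of a draft strategy; the PR's real types may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct DraftStrategy {
    pub draft_model: String,
    pub window: u32,
}

/// Outcome of resolution: either baseline serving or serving with a local draft.
#[derive(Debug, Clone, PartialEq)]
pub enum Speculation {
    Disabled,
    Draft { strategy: DraftStrategy, draft_path: String },
}

/// Explicit config wins over the package default. `no_draft` drops only the
/// package default here, which is an assumption about the flag's scope.
pub fn select_strategy(
    explicit: Option<DraftStrategy>,
    package_default: Option<DraftStrategy>,
    no_draft: bool,
) -> Option<DraftStrategy> {
    explicit.or(if no_draft { None } else { package_default })
}

/// Resolves the selected strategy to a local draft path. A failed resolve
/// falls back to baseline serving instead of failing startup.
pub fn resolve_speculation<F>(selected: Option<DraftStrategy>, resolve_draft: F) -> Speculation
where
    F: Fn(&str) -> Option<String>,
{
    match selected {
        None => Speculation::Disabled,
        Some(strategy) => match resolve_draft(&strategy.draft_model) {
            Some(draft_path) => Speculation::Draft { strategy, draft_path },
            None => Speculation::Disabled,
        },
    }
}
```

Passing the draft lookup as a closure keeps the fallback rule testable without network access.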

Skippy Protocol / Compatibility

This PR adds an additive Skippy speculation contract at the package/config/runtime boundary. It does not add a new mesh gossip field or a new protobuf StageControlRequest variant.

The new protocol/config payload is:

  • model-package.json may now declare generation.speculative_decoding.default plus named strategies.
  • The supported strategy for this PR is type = "draft-model" with a draft_model Hugging Face shorthand and a window policy.
  • The Skippy resolver carries the selected package/default/config state as ResolvedSpeculativeConfig.
  • The embedded stage-0 OpenAI runtime receives ResolvedEmbeddedOpenAiArgs fields: draft_model_path, speculative_window, adaptive_speculative_window, and draft_n_gpu_layers.
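As a rough illustration of the payload above, the sketch below splits a draft_model shorthand and mirrors the four runtime fields in a struct. The `repo:quant` reading of the shorthand is inferred from the example in this PR. `split_draft_shorthand` and `EmbeddedDraftArgs` are hypothetical, and the field types are guesses rather than the actual ResolvedEmbeddedOpenAiArgs definition.

```rust
use std::path::PathBuf;

/// Splits a shorthand such as `org/repo:QUANT` into (repo, quant).
/// Returns None when either part is missing or the repo has no `org/` prefix.
pub fn split_draft_shorthand(shorthand: &str) -> Option<(&str, &str)> {
    let (repo, quant) = shorthand.rsplit_once(':')?;
    if repo.is_empty() || quant.is_empty() || !repo.contains('/') {
        return None;
    }
    Some((repo, quant))
}

/// Hypothetical mirror of the fields passed to the stage-0 runtime.
/// Field names come from the PR text; the types are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct EmbeddedDraftArgs {
    pub draft_model_path: Option<PathBuf>,
    pub speculative_window: u32,
    pub adaptive_speculative_window: bool,
    pub draft_n_gpu_layers: Option<i32>,
}
```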

The existing Skippy stage-control protocol still carries split target-model lifecycle messages:

  • StageControlRequest::Claim reserves the generation/term on each stage.
  • StageControlRequest::Prepare asks each downstream stage to materialize or prefetch the target layer package range.
  • StageControlRequest::Inventory reports package/source readiness while stage 0 waits for exact prepared ranges.
  • StageControlRequest::Load(StageLoadRequest) starts the downstream target-model stage with its layer range, upstream/downstream peer links, package identity, wire dtype, KV/cache config, batch/ubatch settings, and load mode.
  • StageControlResponse::Ready returns the downstream stage endpoint that stage 0 uses for target verification.
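The ordering of these messages can be sketched as a small coordinator routine. This is a simplified model, not the PR's code: the `Request` and `Response` enums drop all payload fields, `ClaimAccepted` is a hypothetical name for the claim acknowledgement, and `bring_up_stage` with its `max_polls` bound is invented for illustration.

```rust
/// Payload-free stand-ins for the stage-control request kinds.
#[derive(Debug, Clone, PartialEq)]
pub enum Request {
    Claim,
    Prepare,
    Inventory,
    Load,
}

/// Payload-light stand-ins for the responses; `ClaimAccepted` is a hypothetical name.
#[derive(Debug, Clone, PartialEq)]
pub enum Response {
    ClaimAccepted,
    PrepareAccepted,
    Inventory { ready: bool },
    Ready { endpoint: String },
}

/// Drives one downstream stage through Claim, Prepare, Inventory polling, and Load,
/// returning the endpoint that stage 0 would use for target verification.
pub fn bring_up_stage<F>(mut send: F, max_polls: usize) -> Result<String, String>
where
    F: FnMut(Request) -> Response,
{
    if send(Request::Claim) != Response::ClaimAccepted {
        return Err("claim rejected".to_string());
    }
    if send(Request::Prepare) != Response::PrepareAccepted {
        return Err("prepare rejected".to_string());
    }
    // Poll inventory until the exact prepared range is reported as available.
    let mut ready = false;
    for _ in 0..max_polls {
        if let Response::Inventory { ready: true } = send(Request::Inventory) {
            ready = true;
            break;
        }
    }
    if !ready {
        return Err("prepared range never became available".to_string());
    }
    match send(Request::Load) {
        Response::Ready { endpoint } => Ok(endpoint),
        other => Err(format!("unexpected response to Load: {other:?}")),
    }
}
```

Nothing in this flow mentions a draft model, which matches the PR's statement that stage control stays target-model-only.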

Speculation itself is intentionally stage-0-owned:

  • The draft model is loaded only by stage 0.
  • Downstream stages do not load the draft model.
  • StageLoadRequest remains target-model-only; it does not carry draft_model_path or draft window fields.
  • Verification still uses the existing binary activation transport between stages after stage 0 chooses speculative verify inputs.
sequenceDiagram
  autonumber
  participant Pkg as model-package.json
  participant Host as mesh-llm host / resolver
  participant S0 as Skippy stage 0 + OpenAI frontend
  participant S1 as Skippy downstream stage(s)
  participant Draft as Draft GGUF
  participant Client as OpenAI client

  Pkg->>Host: generation.speculative_decoding default + strategy
  Host->>Host: resolve config; explicit config or --no-draft can override package defaults
  Host->>Draft: resolve/download draft_model shorthand
  Host->>S1: StageControlRequest::Claim
  S1-->>Host: claim accepted
  Host->>S1: StageControlRequest::Prepare(target layer range)
  S1-->>Host: StageControlResponse::PrepareAccepted
  loop until exact source range is available
    Host->>S1: StageControlRequest::Inventory
    S1-->>Host: StageControlResponse::Inventory
  end
  Host->>S1: StageControlRequest::Load(StageLoadRequest target config only)
  S1-->>Host: StageControlResponse::Ready(endpoint)
  Host->>S0: start embedded runtime with draft_model_path, speculative_window, adaptive flag
  Client->>S0: /v1/chat/completions or /v1/completions
  S0->>Draft: propose draft tokens
  S0->>S1: verify target tokens via existing binary activation transport
  S1-->>S0: target logits / verification output
  S0-->>Client: committed accepted tokens; restore/repair on rejection
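The propose/verify/commit steps at the end of the diagram can be illustrated with a toy greedy verifier. This is the textbook greedy acceptance rule for draft-model speculation, shown under the assumption of greedy sampling; it is not necessarily how Skippy's verifier or its restore/repair path is implemented, and `verify_draft` is a hypothetical name.

```rust
/// Greedy verification of a draft window.
///
/// `draft[i]` is the i-th proposed token. `target[i]` is the target model's greedy
/// token at the same position, given the prefix plus `draft[..i]`. The target
/// normally supplies one more token than the draft (the "bonus" token).
///
/// Returns the number of accepted draft tokens and the tokens to commit.
pub fn verify_draft(draft: &[u32], target: &[u32]) -> (usize, Vec<u32>) {
    // Accept draft tokens for as long as they match the target's own choice.
    let accepted = draft
        .iter()
        .zip(target)
        .take_while(|(d, t)| d == t)
        .count();
    let mut committed = draft[..accepted].to_vec();
    // On a mismatch this is the target's correction; on full acceptance it is the bonus token.
    if let Some(&next) = target.get(accepted) {
        committed.push(next);
    }
    (accepted, committed)
}
```

With a window of 16 and an accept rate near 90%, most verification passes commit many tokens per round trip to the downstream stage, which is where the wall-time savings come from.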

Compatibility notes:

  • Existing packages without generation.speculative_decoding keep the current no-speculation behavior.
  • Existing explicit config continues to win over package metadata.
  • Older Skippy stage nodes still see the same stage-control message kinds for split loading. The new draft behavior is local to the coordinator/stage-0 runtime path.
  • Multi-stage serving remains stage-0-driven: the draft model is local to stage 0, while accepted/rejected target verification still flows through the existing staged target-model pipeline.

Package Metadata Shape

{
  "generation": {
    "speculative_decoding": {
      "default": "llama32-1b-q4",
      "strategies": {
        "llama32-1b-q4": {
          "type": "draft-model",
          "draft_model": "unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_K_M",
          "window_policy": {
            "default": "adaptive",
            "initial_window": 16,
            "min_window": 2,
            "max_window": 16
          }
        }
      }
    }
  }
}
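The PR does not spell out how the adaptive policy moves the window between min_window and max_window, so the sketch below is only one plausible rule: double the window after a fully accepted proposal and halve it after any rejection. `WindowPolicy`, `AdaptiveWindow`, and the doubling/halving rule are assumptions for illustration, and the code assumes min_window <= max_window.

```rust
/// Mirrors the numeric fields of `window_policy` in the package metadata.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct WindowPolicy {
    pub initial_window: u32,
    pub min_window: u32,
    pub max_window: u32,
}

/// Tracks the current draft window under a hypothetical adaptation rule.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct AdaptiveWindow {
    policy: WindowPolicy,
    current: u32,
}

impl AdaptiveWindow {
    /// Starts at `initial_window`, clamped into [min_window, max_window].
    /// Panics if min_window > max_window, because `clamp` requires min <= max.
    pub fn new(policy: WindowPolicy) -> Self {
        let current = policy.initial_window.clamp(policy.min_window, policy.max_window);
        Self { policy, current }
    }

    pub fn current(&self) -> u32 {
        self.current
    }

    /// Assumed rule: grow on full acceptance, shrink on any rejection.
    pub fn observe(&mut self, proposed: u32, accepted: u32) {
        let next = if accepted >= proposed {
            self.current.saturating_mul(2)
        } else {
            self.current / 2
        };
        self.current = next.clamp(self.policy.min_window, self.policy.max_window);
    }
}
```

With the values from the example package (initial 16, min 2, max 16), the window starts at its ceiling and can only shrink until acceptance recovers.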

Benchmark Evidence

Clean table: clean-table-llama33q3-draft1b-w16-20260527.

Target: Llama-3.3-70B-Instruct-Q3_K_M
Draft: unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_K_M
Run shape: Studio 54, clean server per measurement, 3 measured runs, 3 warmups, max_tokens=192, prompt limit 2.

| Condition | Definition | Runs | Median tok/s | Median wall | Accept rate | Comparison |
| --- | --- | --- | --- | --- | --- | --- |
| llama baseline | vanilla llama-server, no draft | 3/3 | 8.02 | 47.90s | n/a | baseline |
| llama draft W16 | vanilla llama-server, fixed draft window 16 | 3/3 | 11.04 | 34.78s | 86.98% | +37.7% tok/s, -27.4% wall |
| llama adaptive W16 | vanilla llama-server, adaptive draft up to 16 | 3/3 | 10.60 | 36.22s | 83.18% | +32.3% tok/s, -24.4% wall |
| Skippy 2-stage baseline | Skippy split serving, no draft | 3/3 | 7.96 | 48.23s | n/a | 99.3% of llama baseline tok/s |
| Skippy 2-stage draft W16 | Skippy split serving, package-style fixed draft window 16 | 3/3 | 10.02 | 38.31s | 89.89% | +25.9% tok/s, -20.6% wall vs Skippy baseline |
| Skippy 2-stage adaptive W16 | Skippy split serving, package-style adaptive draft up to 16 | 3/3 | 10.02 | 38.33s | 89.89% | +25.9% tok/s, -20.5% wall vs Skippy baseline |

Pairwise comparisons:

| Comparison | Baseline tok/s | Candidate tok/s | tok/s change | Baseline wall | Candidate wall | Wall change |
| --- | --- | --- | --- | --- | --- | --- |
| vanilla llama-server draft W16 vs vanilla llama-server baseline | 8.02 | 11.04 | +37.7% | 47.90s | 34.78s | -27.4% |
| Skippy 2-stage adaptive W16 vs Skippy 2-stage baseline | 7.96 | 10.02 | +25.9% | 48.23s | 38.33s | -20.5% |
| Skippy 2-stage adaptive W16 vs vanilla llama-server draft W16 | 11.04 | 10.02 | -9.2% | 34.78s | 38.33s | +10.2% |
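The change columns are plain relative differences against the baseline. The helper names below are illustrative, and the inputs are the rounded medians from the table:

```rust
/// Relative change of `candidate` against `baseline`, in percent.
pub fn pct_change(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}

/// Formats a percentage with an explicit sign and one decimal place.
pub fn fmt_pct(value: f64) -> String {
    format!("{:+.1}%", value)
}
```

For example, `fmt_pct(pct_change(7.96, 10.02))` reproduces the Skippy throughput gain, and `fmt_pct(pct_change(48.23, 38.33))` reproduces the wall-time reduction.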

Charted comparison uses adjacent bars inside each condition group. Blue is llama-server; green is Skippy 2-stage. The speculative group compares llama-server Draft W16 against Skippy 2-stage Adaptive W16.

%%{init: {"themeVariables": {"xyChart": {"plotColorPalette": "#2563eb,#16a34a"}}}}%%
xychart-beta
  title "Completion throughput, higher is better"
  x-axis ["Baseline llama", "Baseline Skippy", "Spec llama", "Spec Skippy"]
  y-axis "tok/s" 0 --> 12
  bar "llama-server" [8.02, 0, 11.04, 0]
  bar "Skippy 2-stage" [0, 7.96, 0, 10.02]
%%{init: {"themeVariables": {"xyChart": {"plotColorPalette": "#2563eb,#16a34a"}}}}%%
xychart-beta
  title "Wall time, lower is better"
  x-axis ["Baseline llama", "Baseline Skippy", "Spec llama", "Spec Skippy"]
  y-axis "seconds" 0 --> 50
  bar "llama-server" [47.90, 0, 34.78, 0]
  bar "Skippy 2-stage" [0, 48.23, 0, 38.33]

Without speculation, Skippy split serving is close to vanilla llama-server (7.96 vs 8.02 tok/s), but it still pays distributed-stage latency. Package-declared draft speculation gives Skippy multi-stage a default path to recover part of that latency, cutting median wall time by about 20% in this run while keeping the model split across machines.

Validation

  • cargo fmt --all -- --check
  • cargo clippy -p mesh-llm-host-runtime --all-targets -- -D warnings
  • cargo test -p mesh-llm-host-runtime --lib
  • cargo test -p mesh-llm-host-runtime --lib runtime::local::tests::load_split_runtime_generation_stops_candidate_stages_after_partial_load_failure -- --nocapture
  • just with-lld cargo test -p skippy-runtime --lib package
  • just with-lld cargo test -p skippy-model-package package_generation
  • just with-lld cargo test -p model-package
  • just with-lld cargo test -p mesh-llm-config --lib
  • just release-build

Package job:

  • HF package job 6a168fc45c8d10ffa1103c11 completed successfully and published https://huggingface.co/meshllm/Llama-3.3-70B-Instruct-Q3_K_M-draft-layers.

Studio smoke tests:

  • single-node default package serve returned ok and loaded Llama-3.2-1B-Instruct as the package draft
  • single-node --no-draft returned ok and did not load Llama-3.2-1B-Instruct
  • two-stage split serve returned ok, loaded the package draft on the joined stage, and assigned layer_range=79..80

Generated with AI assistance.

@i386 i386 marked this pull request as draft May 27, 2026 11:15
@i386 i386 force-pushed the codex/package-declared-draft-spec-main branch from 8d15a52 to 926b5bb Compare June 5, 2026 07:30