Enable package-declared draft speculation by i386 · Pull Request #710 · Mesh-LLM/mesh-llm

i386 · 2026-05-27T08:31:49Z

Summary

This PR lets Skippy layer packages declare a known-good draft speculative decoding setup, then has Skippy serving pick that setup up by default. Operators can publish a package with its draft model and window policy encoded in model-package.json instead of remembering launch flags for every single-stage or multi-stage run.

The user-visible goal is draft speculation for Skippy multi-stage serving. Splitting a model across stages can expose stage and network latency, and package-declared speculation gives Skippy a default path to recover part of that wall time (about 20% in the two-stage benchmark below) while keeping the target model split across nodes.

What Changed

Adds generation.speculative_decoding metadata to layer packages.
Adds package writer and HF Job support for recording a draft strategy in generated package metadata.
Adds config support for [defaults.speculative] and [models.speculative], with explicit config taking precedence over package defaults.
Wires resolved speculation settings into Skippy's embedded stage-0 OpenAI frontend.
Adds --no-draft so operators can disable package-declared draft defaults for a run.
Documents the package schema, config behavior, and operational defaults.

Default behavior is intentionally simple: if a package declares a usable draft strategy, Skippy enables it automatically. If the draft cannot be resolved, Skippy serves the baseline package without speculation instead of failing startup.
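
The precedence and fallback rules above can be sketched as a small resolver. This is a minimal illustration, not the code in this PR: `DraftStrategy`, `Resolution`, and `resolve` are hypothetical names, and it assumes `--no-draft` suppresses only the package default, since the description does not say how the flag interacts with explicit config.

```rust
/// Hypothetical, simplified view of a draft strategy (illustrative fields only).
#[derive(Debug, Clone, PartialEq)]
pub struct DraftStrategy {
    pub draft_model: String,
    pub window: u32,
}

/// Outcome of resolving speculation for one serve run.
#[derive(Debug, Clone, PartialEq)]
pub enum Resolution {
    /// Serve with the given draft strategy.
    Speculative(DraftStrategy),
    /// Serve the target model without speculation.
    Baseline,
}

/// Explicit config beats the package default, `--no-draft` drops the package
/// default (assumption), and an unresolvable draft falls back to baseline
/// instead of failing startup.
pub fn resolve(
    explicit: Option<DraftStrategy>,
    package_default: Option<DraftStrategy>,
    no_draft: bool,
    draft_is_resolvable: impl Fn(&DraftStrategy) -> bool,
) -> Resolution {
    let package_default = if no_draft { None } else { package_default };
    match explicit.or(package_default) {
        Some(strategy) if draft_is_resolvable(&strategy) => Resolution::Speculative(strategy),
        _ => Resolution::Baseline,
    }
}
```

With a package default present and no explicit config, `resolve` returns `Speculative` for the package strategy; passing `no_draft = true` or a resolver that returns `false` yields `Baseline`.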

Skippy Protocol / Compatibility

This PR introduces an additive Skippy speculation contract at the package/config/runtime boundary. It adds no new mesh gossip field and no new protobuf StageControlRequest variant.

The new protocol/config payload is:

model-package.json may now declare generation.speculative_decoding.default plus named strategies.
The supported strategy for this PR is type = "draft-model" with a draft_model Hugging Face shorthand and a window policy.
The Skippy resolver carries the selected package/default/config state as ResolvedSpeculativeConfig.
The embedded stage-0 OpenAI runtime receives ResolvedEmbeddedOpenAiArgs fields: draft_model_path, speculative_window, adaptive_speculative_window, and draft_n_gpu_layers.
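
For orientation, the four runtime fields listed above could be grouped roughly as follows. Only the field names come from this description; the struct name `EmbeddedSpeculationArgs`, the field types, and the `speculation_enabled` helper are assumptions made for the sketch and may not match `ResolvedEmbeddedOpenAiArgs` in the codebase.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for the speculation-related fields that the
/// embedded stage-0 runtime receives. Types are guesses, not the real ones.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct EmbeddedSpeculationArgs {
    /// Local path of the resolved draft GGUF, if any.
    pub draft_model_path: Option<PathBuf>,
    /// Number of draft tokens proposed per verification round.
    pub speculative_window: u32,
    /// Whether the window may change at runtime.
    pub adaptive_speculative_window: bool,
    /// GPU layer count for the draft model; `None` leaves the choice to the runtime.
    pub draft_n_gpu_layers: Option<i32>,
}

impl EmbeddedSpeculationArgs {
    /// Speculation is only meaningful with a draft model and a non-zero window.
    pub fn speculation_enabled(&self) -> bool {
        self.draft_model_path.is_some() && self.speculative_window > 0
    }
}
```

The `Default` value models the baseline path: no draft path, window 0, so `speculation_enabled()` is `false`.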

The existing Skippy stage-control protocol still carries split target-model lifecycle messages:

StageControlRequest::Claim reserves the generation/term on each stage.
StageControlRequest::Prepare asks each downstream stage to materialize or prefetch the target layer package range.
StageControlRequest::Inventory reports package/source readiness while stage 0 waits for exact prepared ranges.
StageControlRequest::Load(StageLoadRequest) starts the downstream target-model stage with its layer range, upstream/downstream peer links, package identity, wire dtype, KV/cache config, batch/ubatch settings, and load mode.
StageControlResponse::Ready returns the downstream stage endpoint that stage 0 uses for target verification.
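
The ordering of these lifecycle messages can be read as a small state machine from the coordinator's point of view. The sketch below is an assumption-laden illustration: `ControlStep`, `StageState`, and `advance` are hypothetical names, the messages carry no payloads, and whether the real protocol rejects out-of-order requests this way is not stated in this PR.

```rust
/// Simplified control messages, in the order the coordinator sends them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ControlStep {
    Claim,
    Prepare,
    /// Inventory poll; the flag models whether the exact range is available yet.
    Inventory { range_available: bool },
    Load,
}

/// Coordinator-side view of one downstream stage.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StageState {
    Idle,
    Claimed,
    Preparing,
    Prepared,
    Ready,
}

/// Apply one control step; out-of-order steps are rejected in this sketch.
pub fn advance(state: StageState, step: ControlStep) -> Result<StageState, String> {
    use ControlStep::*;
    use StageState::*;
    match (state, step) {
        (Idle, Claim) => Ok(Claimed),
        (Claimed, Prepare) => Ok(Preparing),
        // Keep polling until the prepared range is reported as available.
        (Preparing, Inventory { range_available: false }) => Ok(Preparing),
        (Preparing, Inventory { range_available: true }) => Ok(Prepared),
        (Prepared, Load) => Ok(Ready),
        (s, m) => Err(format!("unexpected {:?} in state {:?}", m, s)),
    }
}
```

Nothing in this flow mentions the draft model, which matches the design: the downstream stage only ever sees target-model lifecycle steps.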

Speculation itself is intentionally stage-0-owned:

The draft model is loaded only by stage 0.
Downstream stages do not load the draft model.
StageLoadRequest remains target-model-only; it does not carry draft_model_path or draft window fields.
Verification still uses the existing binary activation transport between stages after stage 0 chooses speculative verify inputs.

sequenceDiagram
  autonumber
  participant Pkg as model-package.json
  participant Host as mesh-llm host / resolver
  participant S0 as Skippy stage 0 + OpenAI frontend
  participant S1 as Skippy downstream stage(s)
  participant Draft as Draft GGUF
  participant Client as OpenAI client

  Pkg->>Host: generation.speculative_decoding default + strategy
  Host->>Host: resolve config; explicit config or --no-draft can override package defaults
  Host->>Draft: resolve/download draft_model shorthand
  Host->>S1: StageControlRequest::Claim
  S1-->>Host: claim accepted
  Host->>S1: StageControlRequest::Prepare(target layer range)
  S1-->>Host: StageControlResponse::PrepareAccepted
  loop until exact source range is available
    Host->>S1: StageControlRequest::Inventory
    S1-->>Host: StageControlResponse::Inventory
  end
  Host->>S1: StageControlRequest::Load(StageLoadRequest target config only)
  S1-->>Host: StageControlResponse::Ready(endpoint)
  Host->>S0: start embedded runtime with draft_model_path, speculative_window, adaptive flag
  Client->>S0: /v1/chat/completions or /v1/completions
  S0->>Draft: propose draft tokens
  S0->>S1: verify target tokens via existing binary activation transport
  S1-->>S0: target logits / verification output
  S0-->>Client: committed accepted tokens; restore/repair on rejection
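
The draft/verify exchange at the end of the sequence above can be sketched as a greedy acceptance check. This is an illustration under assumptions, not Skippy's verifier: it assumes greedy decoding where the target yields one token per draft position plus one extra, and `Token` and `verify_greedy` are hypothetical names.

```rust
pub type Token = u32;

/// Compare drafted tokens against the target's tokens for the same positions.
/// Returns the tokens to commit and how many draft tokens were accepted.
///
/// `target` is expected to hold `draft.len() + 1` tokens: one per draft
/// position, plus the token that follows a fully accepted draft.
pub fn verify_greedy(draft: &[Token], target: &[Token]) -> (Vec<Token>, usize) {
    // Accept the longest prefix on which draft and target agree.
    let accepted = draft
        .iter()
        .zip(target.iter())
        .take_while(|(d, t)| d == t)
        .count();
    let mut committed = draft[..accepted].to_vec();
    // On a mismatch this is the target's correction; on full acceptance it is
    // the bonus token after the draft window.
    if let Some(&next) = target.get(accepted) {
        committed.push(next);
    }
    (committed, accepted)
}
```

For a draft of `[1, 2, 3]` and target output `[1, 2, 9, 4]`, the sketch commits `[1, 2, 9]`: two draft tokens are accepted, then the target's own token replaces the rejected one. Any state restore after a rejection is outside this sketch.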

Compatibility notes:

Existing packages without generation.speculative_decoding keep the current no-speculation behavior.
Existing explicit config continues to win over package metadata.
Older Skippy stage nodes still see the same stage-control message kinds for split loading. The new draft behavior is local to the coordinator/stage-0 runtime path.
Multi-stage serving remains stage-0-driven: the draft model is local to stage 0, while accepted/rejected target verification still flows through the existing staged target-model pipeline.

Package Metadata Shape

{
  "generation": {
    "speculative_decoding": {
      "default": "llama32-1b-q4",
      "strategies": {
        "llama32-1b-q4": {
          "type": "draft-model",
          "draft_model": "unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_K_M",
          "window_policy": {
            "default": "adaptive",
            "initial_window": 16,
            "min_window": 2,
            "max_window": 16
          }
        }
      }
    }
  }
}
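
A window policy like the one above invites a sanity check and, in adaptive mode, a rule for moving the window between its bounds. The sketch below is hypothetical throughout: the PR does not describe how the adaptive window is adjusted, so the grow/shrink rule in `next_window` is an invented placeholder, and the requirement that `min_window` be at least 1 is an assumption. `adaptive` stands in for `"default": "adaptive"`.

```rust
/// Hypothetical in-memory form of `window_policy`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct WindowPolicy {
    pub adaptive: bool,
    pub initial_window: u32,
    pub min_window: u32,
    pub max_window: u32,
}

impl WindowPolicy {
    /// Check that the bounds are ordered: 1 <= min <= initial <= max.
    pub fn validate(&self) -> Result<(), String> {
        if self.min_window == 0 {
            return Err("min_window must be at least 1".to_string());
        }
        if self.min_window > self.max_window {
            return Err("min_window must not exceed max_window".to_string());
        }
        if self.initial_window < self.min_window || self.initial_window > self.max_window {
            return Err("initial_window must lie within [min_window, max_window]".to_string());
        }
        Ok(())
    }

    /// Placeholder adaptive rule (not from the PR): double the window after a
    /// fully accepted draft, halve it otherwise, and stay within the bounds.
    /// Call `validate` first; `clamp` panics if min_window > max_window.
    pub fn next_window(&self, current: u32, accepted: u32) -> u32 {
        if !self.adaptive {
            return current;
        }
        let proposed = if accepted >= current {
            current.saturating_mul(2)
        } else {
            current / 2
        };
        proposed.clamp(self.min_window, self.max_window)
    }
}
```

With the values from the example package (initial 16, min 2, max 16), the window cannot grow past 16 and never drops below 2 under this placeholder rule.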

Benchmark Evidence

Clean table: clean-table-llama33q3-draft1b-w16-20260527.

Target: Llama-3.3-70B-Instruct-Q3_K_M
Draft: unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_K_M
Run shape: Studio 54, clean server per measurement, 3 measured runs, 3 warmups, max_tokens=192, prompt limit 2.

Condition	Definition	Runs	Median tok/s	Median wall	Accept rate	Comparison
llama baseline	vanilla `llama-server`, no draft	3/3	8.02	47.90s	n/a	baseline
llama draft W16	vanilla `llama-server`, fixed draft window 16	3/3	11.04	34.78s	86.98%	+37.7% tok/s, -27.4% wall
llama adaptive W16	vanilla `llama-server`, adaptive draft up to 16	3/3	10.60	36.22s	83.18%	+32.3% tok/s, -24.4% wall
Skippy 2-stage baseline	Skippy split serving, no draft	3/3	7.96	48.23s	n/a	99.3% of llama baseline tok/s
Skippy 2-stage draft W16	Skippy split serving, package-style fixed draft window 16	3/3	10.02	38.31s	89.89%	+25.9% tok/s, -20.6% wall vs Skippy baseline
Skippy 2-stage adaptive W16	Skippy split serving, package-style adaptive draft up to 16	3/3	10.02	38.33s	89.89%	+25.9% tok/s, -20.5% wall vs Skippy baseline

Pairwise comparisons:

Comparison	Baseline tok/s	Candidate tok/s	tok/s change	Baseline wall	Candidate wall	Wall change
vanilla `llama-server` draft W16 vs vanilla `llama-server` baseline	8.02	11.04	+37.7%	47.90s	34.78s	-27.4%
Skippy 2-stage adaptive W16 vs Skippy 2-stage baseline	7.96	10.02	+25.9%	48.23s	38.33s	-20.5%
Skippy 2-stage adaptive W16 vs vanilla `llama-server` draft W16	11.04	10.02	-9.2%	34.78s	38.33s	+10.2%
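
The change columns are plain relative differences against the baseline, for example (11.04 - 8.02) / 8.02 ≈ +37.7% tok/s and (38.33 - 48.23) / 48.23 ≈ -20.5% wall. A small helper for re-deriving them from the medians, with `pct_change` and `round1` as illustrative names that are not part of the benchmark tooling:

```rust
/// Relative change of `candidate` versus `baseline`, in percent.
pub fn pct_change(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}

/// Round to one decimal place, matching the precision used in the tables.
pub fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}
```

Because the tables list rounded medians, values recomputed this way can differ from a published figure in the last digit.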

Charted comparison uses adjacent bars inside each condition group. Blue is llama-server; green is Skippy 2-stage. The speculative group compares llama-server Draft W16 against Skippy 2-stage Adaptive W16.

%%{init: {"themeVariables": {"xyChart": {"plotColorPalette": "#2563eb,#16a34a"}}}}%%
xychart-beta
  title "Completion throughput, higher is better"
  x-axis ["Baseline llama", "Baseline Skippy", "Spec llama", "Spec Skippy"]
  y-axis "tok/s" 0 --> 12
  bar "llama-server" [8.02, 0, 11.04, 0]
  bar "Skippy 2-stage" [0, 7.96, 0, 10.02]

%%{init: {"themeVariables": {"xyChart": {"plotColorPalette": "#2563eb,#16a34a"}}}}%%
xychart-beta
  title "Wall time, lower is better"
  x-axis ["Baseline llama", "Baseline Skippy", "Spec llama", "Spec Skippy"]
  y-axis "seconds" 0 --> 50
  bar "llama-server" [47.90, 0, 34.78, 0]
  bar "Skippy 2-stage" [0, 48.23, 0, 38.33]

The data show that, without speculation, Skippy split serving is close to vanilla llama-server (7.96 vs. 8.02 tok/s), but split serving still pays distributed-stage latency. Package-declared draft speculation gives Skippy multi-stage a default path to recover part of that cost (about 20% less wall time than the Skippy 2-stage baseline) while keeping the model split across machines.

Validation

cargo fmt --all -- --check
cargo clippy -p mesh-llm-host-runtime --all-targets -- -D warnings
cargo test -p mesh-llm-host-runtime --lib
cargo test -p mesh-llm-host-runtime --lib runtime::local::tests::load_split_runtime_generation_stops_candidate_stages_after_partial_load_failure -- --nocapture
just with-lld cargo test -p skippy-runtime --lib package
just with-lld cargo test -p skippy-model-package package_generation
just with-lld cargo test -p model-package
just with-lld cargo test -p mesh-llm-config --lib
just release-build

Package job:

HF package job 6a168fc45c8d10ffa1103c11 completed successfully and published https://huggingface.co/meshllm/Llama-3.3-70B-Instruct-Q3_K_M-draft-layers.

Studio smoke tests:

single-node default package serve returned ok and loaded Llama-3.2-1B-Instruct as the package draft
single-node --no-draft returned ok and did not load Llama-3.2-1B-Instruct
two-stage split serve returned ok, loaded the package draft on the joined stage, and assigned layer_range=79..80

Generated with AI assistance.

i386 marked this pull request as draft May 27, 2026 11:15

i386 added 7 commits June 5, 2026 17:22

Enable package-declared draft speculation

0c1c14d

Retry HF artifact uploads during package jobs

f8f4315

Honor package draft disablement

f666e72

Add speculative config authoring API

df287b8

Document speculative config authoring re-exports

ff1a70a

Fix package spec CI failures

054b6ce

Fix package draft branch after main refactors

926b5bb

i386 force-pushed the codex/package-declared-draft-spec-main branch from 8d15a52 to 926b5bb June 5, 2026 07:30

Fix native runtime loading for draft packages

f56a90b

i386 wants to merge 8 commits into main from codex/package-declared-draft-spec-main
