Enable package-declared draft speculation #710
Draft
i386 wants to merge 8 commits into
Summary
This PR lets Skippy layer packages declare a known-good draft speculative decoding setup, then has Skippy serving pick that setup up by default. Operators can publish a package with its draft model and window policy encoded in `model-package.json` instead of remembering launch flags for every single-stage or multi-stage run.

The user-visible goal is draft speculation for Skippy multi-stage: splitting a model across stages can expose stage/network latency, and package-declared speculation gives Skippy a default path to recover a meaningful amount of that wall time while keeping the target model split across nodes.
What Changed
- `generation.speculative_decoding` metadata to layer packages.
- `[defaults.speculative]` and `[models.speculative]`, with explicit config taking precedence over package defaults.
- `--no-draft` so operators can disable package-declared draft defaults for a run.

Default behavior is intentionally simple: if a package declares a usable draft strategy, Skippy enables it automatically. If the draft cannot be resolved, Skippy serves the baseline package without speculation instead of failing startup.
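The precedence and fallback rules above can be sketched as a small decision function. This is a minimal illustration, not the code in this PR: `DraftSetup`, `Speculation`, `choose_speculation`, and the `resolve` callback are hypothetical names, and the sketch assumes `--no-draft` only drops the package-declared default while leaving explicit config untouched.

```rust
/// Hypothetical, simplified speculation setup (names are illustrative only).
#[derive(Debug, Clone, PartialEq)]
struct DraftSetup {
    draft_model: String, // e.g. a Hugging Face shorthand
    window: u32,         // speculative window size
}

/// Outcome of resolution: speculate with a resolved draft, or serve baseline.
#[derive(Debug, PartialEq)]
enum Speculation {
    Enabled { setup: DraftSetup, draft_path: String },
    Baseline,
}

/// Chooses the speculation setup for one run.
/// `resolve` stands in for whatever maps a draft shorthand to a local file;
/// it returns `None` when the draft cannot be resolved.
fn choose_speculation(
    explicit: Option<DraftSetup>,
    package_default: Option<DraftSetup>,
    no_draft: bool,
    resolve: impl Fn(&str) -> Option<String>,
) -> Speculation {
    // Assumption: --no-draft disables only the package-declared default.
    let package_default = if no_draft { None } else { package_default };

    // Explicit config takes precedence over the package default.
    match explicit.or(package_default) {
        Some(setup) => match resolve(&setup.draft_model) {
            Some(draft_path) => Speculation::Enabled { setup, draft_path },
            // Unresolvable draft: serve baseline instead of failing startup.
            None => Speculation::Baseline,
        },
        None => Speculation::Baseline,
    }
}
```

Returning a `Baseline` variant rather than an error mirrors the stated behavior that an unresolvable draft should not block startup.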
Skippy Protocol / Compatibility
This PR adds an additive Skippy speculation contract at the package/config/runtime boundary. It does not add a new mesh gossip field or a new protobuf `StageControlRequest` variant.

The new protocol/config payload is:
- `model-package.json` may now declare `generation.speculative_decoding.default` plus named strategies.
- `type = "draft-model"` with a `draft_model` Hugging Face shorthand and a window policy.
- `ResolvedSpeculativeConfig`.
- `ResolvedEmbeddedOpenAiArgs` fields: `draft_model_path`, `speculative_window`, `adaptive_speculative_window`, and `draft_n_gpu_layers`.

The existing Skippy stage-control protocol still carries split target-model lifecycle messages:
- `StageControlRequest::Claim` reserves the generation/term on each stage.
- `StageControlRequest::Prepare` asks each downstream stage to materialize or prefetch the target layer package range.
- `StageControlRequest::Inventory` reports package/source readiness while stage 0 waits for exact prepared ranges.
- `StageControlRequest::Load(StageLoadRequest)` starts the downstream target-model stage with its layer range, upstream/downstream peer links, package identity, wire dtype, KV/cache config, batch/ubatch settings, and load mode.
- `StageControlResponse::Ready` returns the downstream stage endpoint that stage 0 uses for target verification.

Speculation itself is intentionally stage-0-owned:
- `StageLoadRequest` remains target-model-only; it does not carry `draft_model_path` or draft window fields.

```mermaid
sequenceDiagram
    autonumber
    participant Pkg as model-package.json
    participant Host as mesh-llm host / resolver
    participant S0 as Skippy stage 0 + OpenAI frontend
    participant S1 as Skippy downstream stage(s)
    participant Draft as Draft GGUF
    participant Client as OpenAI client
    Pkg->>Host: generation.speculative_decoding default + strategy
    Host->>Host: resolve config; explicit config or --no-draft can override package defaults
    Host->>Draft: resolve/download draft_model shorthand
    Host->>S1: StageControlRequest::Claim
    S1-->>Host: claim accepted
    Host->>S1: StageControlRequest::Prepare(target layer range)
    S1-->>Host: StageControlResponse::PrepareAccepted
    loop until exact source range is available
        Host->>S1: StageControlRequest::Inventory
        S1-->>Host: StageControlResponse::Inventory
    end
    Host->>S1: StageControlRequest::Load(StageLoadRequest target config only)
    S1-->>Host: StageControlResponse::Ready(endpoint)
    Host->>S0: start embedded runtime with draft_model_path, speculative_window, adaptive flag
    Client->>S0: /v1/chat/completions or /v1/completions
    S0->>Draft: propose draft tokens
    S0->>S1: verify target tokens via existing binary activation transport
    S1-->>S0: target logits / verification output
    S0-->>Client: committed accepted tokens; restore/repair on rejection
```

Compatibility notes:
- Packages without `generation.speculative_decoding` keep the current no-speculation behavior.

Package Metadata Shape
```json
{
  "generation": {
    "speculative_decoding": {
      "default": "llama32-1b-q4",
      "strategies": {
        "llama32-1b-q4": {
          "type": "draft-model",
          "draft_model": "unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_K_M",
          "window_policy": {
            "default": "adaptive",
            "initial_window": 16,
            "min_window": 2,
            "max_window": 16
          }
        }
      }
    }
  }
}
```

Benchmark Evidence
Clean table: `clean-table-llama33q3-draft1b-w16-20260527`.

Target: `Llama-3.3-70B-Instruct-Q3_K_M`

Draft: `unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_K_M`

Run shape: Studio 54, clean server per measurement, 3 measured runs, 3 warmups, `max_tokens=192`, prompt limit 2.

- `llama-server`, no draft
- `llama-server`, fixed draft window 16
- `llama-server`, adaptive draft up to 16

Pairwise comparisons:
- `llama-server` draft W16 vs vanilla `llama-server` baseline
- `llama-server` draft W16

Charted comparison uses adjacent bars inside each condition group. Blue is `llama-server`; green is Skippy 2-stage. The speculative group compares `llama-server` Draft W16 against Skippy 2-stage Adaptive W16.

```mermaid
%%{init: {"themeVariables": {"xyChart": {"plotColorPalette": "#2563eb,#16a34a"}}}}%%
xychart-beta
    title "Completion throughput, higher is better"
    x-axis ["Baseline llama", "Baseline Skippy", "Spec llama", "Spec Skippy"]
    y-axis "tok/s" 0 --> 12
    bar "llama-server" [8.02, 0, 11.04, 0]
    bar "Skippy 2-stage" [0, 7.96, 0, 10.02]
```

```mermaid
%%{init: {"themeVariables": {"xyChart": {"plotColorPalette": "#2563eb,#16a34a"}}}}%%
xychart-beta
    title "Wall time, lower is better"
    x-axis ["Baseline llama", "Baseline Skippy", "Spec llama", "Spec Skippy"]
    y-axis "seconds" 0 --> 50
    bar "llama-server" [47.90, 0, 34.78, 0]
    bar "Skippy 2-stage" [0, 48.23, 0, 38.33]
```

The data story is that Skippy split serving is close to vanilla llama without speculation, but split serving still pays distributed-stage latency. Package-declared draft speculation gives Skippy multi-stage a default path to recover a meaningful chunk of that latency while keeping the model split across machines.
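For readers who want the relative gains rather than raw bars, the charted values can be turned into ratios with a couple of helper functions. This is only a worked-arithmetic sketch: `speedup` and `wall_time_saved` are hypothetical helper names, and the inputs are the tok/s and seconds figures from the two charts (note that the speculative pair compares fixed W16 on `llama-server` with adaptive W16 on Skippy 2-stage).

```rust
/// Ratio of speculative to baseline throughput; values above 1.0 mean faster.
fn speedup(baseline_tok_s: f64, spec_tok_s: f64) -> f64 {
    spec_tok_s / baseline_tok_s
}

/// Fraction of the baseline wall time that speculation removed.
fn wall_time_saved(baseline_s: f64, spec_s: f64) -> f64 {
    (baseline_s - spec_s) / baseline_s
}

/// Charted throughput (tok/s) as (baseline, speculative) pairs.
const LLAMA_TOK_S: (f64, f64) = (8.02, 11.04);
const SKIPPY_TOK_S: (f64, f64) = (7.96, 10.02);

/// Charted wall time (seconds) as (baseline, speculative) pairs.
const LLAMA_WALL_S: (f64, f64) = (47.90, 34.78);
const SKIPPY_WALL_S: (f64, f64) = (48.23, 38.33);
```

With these inputs, `speedup(SKIPPY_TOK_S.0, SKIPPY_TOK_S.1)` is about 1.26 and `wall_time_saved(SKIPPY_WALL_S.0, SKIPPY_WALL_S.1)` is about 0.205, against about 1.38 and 0.274 for `llama-server`.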
Validation
- `cargo fmt --all -- --check`
- `cargo clippy -p mesh-llm-host-runtime --all-targets -- -D warnings`
- `cargo test -p mesh-llm-host-runtime --lib`
- `cargo test -p mesh-llm-host-runtime --lib runtime::local::tests::load_split_runtime_generation_stops_candidate_stages_after_partial_load_failure -- --nocapture`
- `just with-lld cargo test -p skippy-runtime --lib package`
- `just with-lld cargo test -p skippy-model-package package_generation`
- `just with-lld cargo test -p model-package`
- `just with-lld cargo test -p mesh-llm-config --lib`
- `just release-build`

Package job: `6a168fc45c8d10ffa1103c11` completed successfully and published https://huggingface.co/meshllm/Llama-3.3-70B-Instruct-Q3_K_M-draft-layers.

Studio smoke tests:

- `ok` and loaded `Llama-3.2-1B-Instruct` as the package draft
- `--no-draft` returned `ok` and did not load `Llama-3.2-1B-Instruct`
- `ok`, loaded the package draft on the joined stage, and assigned `layer_range=79..80`

Generated with AI assistance.