Speculative Decoding via "Draft KV Stitching" #63

@Nitya-003

Description

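The current MTP (Multi-Token Prediction) path is experimental and provides "at most a slight speedup." We can improve on this by implementing Draft KV Stitching: a small draft model proposes tokens, the main model verifies them in one pass, and the accepted positions are kept in the main model's KV cache.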

Proposed Changes

  • Small Draft Model Integration: Allow the engine to load a tiny (e.g., DeepSeek 1.3B) model purely for drafting.
  • Speculative Verification: The main DS4 engine verifies 4-8 draft tokens in a single Metal graph pass.
  • KV Alignment: Ensure that when a draft is accepted, the "compressed KV" of the main model is updated without a full re-prefill.
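
To make the three points above concrete, here is a minimal sketch of one greedy draft-and-verify step. It is not the DS4 engine's API: `propose`, `verify`, `speculative_step`, the `NextToken` callable interface, and the toy models are all hypothetical names introduced for illustration. The per-position loop in `verify` stands in for what would be a single batched graph pass in the real engine, and `kv_len` assumes the main model's cache can be rolled back by truncating a length counter, which may not hold as-is for a compressed KV layout.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: given a token sequence, return the greedy next token.
NextToken = Callable[[Sequence[int]], int]


def propose(draft_next: NextToken, context: List[int], k: int) -> List[int]:
    """Let the draft model generate k tokens autoregressively."""
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(context + draft))
    return draft


def verify(target_next: NextToken, context: List[int],
           draft: List[int]) -> Tuple[List[int], int]:
    """Return (tokens to append, number of draft tokens accepted).

    A real engine would obtain all len(draft) + 1 predictions from one
    batched forward pass; this sketch simulates them position by position.
    """
    preds = [target_next(context + draft[:i]) for i in range(len(draft) + 1)]
    accepted = 0
    while accepted < len(draft) and draft[accepted] == preds[accepted]:
        accepted += 1
    # preds[accepted] is the main model's correction after a mismatch,
    # or a free extra token when every draft token was accepted.
    return draft[:accepted] + [preds[accepted]], accepted


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     context: List[int], k: int = 4) -> Tuple[List[int], int]:
    """Run one draft/verify round; return (new context, valid KV length)."""
    draft = propose(draft_next, context, k)
    new_tokens, accepted = verify(target_next, context, draft)
    # After the verification pass the main model holds KV entries for
    # context + draft. Only the accepted prefix stays valid (assumption:
    # the cache can be truncated to this length without a re-prefill).
    kv_len = len(context) + accepted
    return context + new_tokens, kv_len


# Toy stand-ins for the two models (assumptions, not real checkpoints).
def toy_target(seq: Sequence[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: Sequence[int]) -> int:
    # Agrees with the target except right after token 3.
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [1], k=4))
```

In this sketch the output matches what plain greedy decoding with the main model would produce, so the draft model only affects speed, never the generated text. The number of accepted tokens per round is what determines whether the extra draft model pays for itself.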
