Speculative Decoding via "Draft KV Stitching" #63

## Description
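The current MTP (Multi-Token Prediction) path is experimental and provides "at most a slight speedup." We can improve on this by implementing Draft KV Stitching: speculative decoding with a separate small draft model, with the main model's KV cache kept consistent as drafts are accepted.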

## Proposed Changes
* **Small Draft Model Integration:** Allow the engine to load a small model (e.g., DeepSeek 1.3B) that is used purely for drafting.
* **Speculative Verification:** The main DS4 engine verifies 4-8 draft tokens in a single Metal graph pass.
* **KV Alignment:** Ensure that when a draft is accepted, the "compressed KV" of the main model is updated without a full re-prefill.
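
To make the three points above concrete, here is a minimal sketch of one draft/verify/trim step. It is illustrative only and makes several assumptions that are not part of this issue: the function names (`propose`, `verify`, `trim_kv`) are hypothetical, the acceptance rule is plain greedy matching, the models are toy functions over integer tokens, and the KV cache is modeled as a flat list with one entry per token rather than the engine's actual compressed KV layout.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for a model: maps a token context to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def propose(draft_next: NextToken, context: List[int], k: int) -> List[int]:
    """Let the draft model extend the context by k tokens."""
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(context + draft))
    return draft


def verify(
    target_next: NextToken, context: List[int], draft: List[int]
) -> Tuple[List[int], int]:
    """Return (tokens to commit, number of draft tokens accepted).

    A real engine would score every draft position in one batched pass;
    the loop here checks them one by one for clarity.
    """
    committed: List[int] = []
    for tok in draft:
        expected = target_next(context + committed)
        if expected != tok:
            # First mismatch: commit the target's own token and stop.
            return committed + [expected], len(committed)
        committed.append(tok)
    # Every draft token matched, so the target adds one extra token.
    return committed + [target_next(context + committed)], len(draft)


def trim_kv(kv: List[int], base_len: int, n_accepted: int) -> None:
    """Drop cache entries written for rejected draft positions (in place)."""
    del kv[base_len + n_accepted:]


# Toy models (assumptions): the target counts upward, the draft errs after token 3.
def target_next(ctx: Sequence[int]) -> int:
    return ctx[-1] + 1


def draft_next(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] == 3 else ctx[-1] + 1


context = [1]
kv = [1]  # one cache entry per token the main model has already processed
draft = propose(draft_next, context, k=4)
kv_after_pass = kv + draft  # the verification pass writes entries for all draft positions
committed, n_accepted = verify(target_next, context, draft)
trim_kv(kv_after_pass, base_len=len(kv), n_accepted=n_accepted)
```

The point of `trim_kv` is that only the cache entries for rejected positions are discarded, so the accepted prefix never has to be prefilled again. How this maps onto a compressed KV representation is the open design question in this issue.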


