
Feature request: Natural Language Level Speculative Decoding with heterogeneous Draft Models (e.g. Qwen 2B) #80

Description

@vagrillo

Currently, speculative decoding in many implementations relies on a lower-quantized version of the same model as the draft (e.g., a Q2 draft for a Q4 target). In practice, however, the Q2 model decodes only marginally faster than the Q4 one, because decoding is heavily memory-bandwidth bound. The resulting gains are too small to justify the overhead.
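
A quick back-of-envelope shows why (every number below is an illustrative assumption, not a measurement): if decoding is purely bandwidth-bound, throughput is roughly memory bandwidth divided by the weight footprint streamed per token.

```python
# Rough bandwidth-bound decode estimate; all figures are hypothetical.
bandwidth_gb_s = 1000        # e.g. a GPU with ~1 TB/s effective bandwidth

q4_weights_gb = 16           # hypothetical Q4 target footprint
q2_weights_gb = 9            # same model at Q2 (overhead keeps it > half)
draft_2b_gb = 1.5            # a separate small 2B draft model

print(bandwidth_gb_s / q4_weights_gb)  # ~62 t/s  (Q4 target)
print(bandwidth_gb_s / q2_weights_gb)  # ~111 t/s (Q2 draft: only ~1.8x)
print(bandwidth_gb_s / draft_2b_gb)    # ~667 t/s (2B draft: ~10x gap)
```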

To achieve a significant speedup, we need a wider gap between the Draft and Target models' inference speeds. Using a highly optimized, small-parameter model with a different architecture (like Qwen 3.5 2B) as a drafter for DS4 could provide the necessary throughput to make speculation viable.

The Challenge: Tokenizer Mismatch

The main obstacle is that models like Qwen and DS4 use different tokenizers and vocabularies. Traditional speculative decoding (sampling/logit comparison) fails here because the token IDs do not map 1:1.
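
A two-line demonstration of the mismatch (the tokenizers below are arbitrary public stand-ins, not the actual DS4 one):

```python
# Two tokenizers segment the same text into different IDs, so token-level
# draft/target comparison is meaningless across them.
from transformers import AutoTokenizer

tok_a = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
tok_b = AutoTokenizer.from_pretrained("gpt2")

text = 'if __name__ == "__main__":'
print(tok_a.encode(text))  # different lengths, different IDs,
print(tok_b.encode(text))  # different segmentation boundaries
```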

Proposed Solution: Text-Level Speculative Decoding

I propose implementing a "Natural Language" or "Text-based" speculative layer. Instead of comparing logits at the token level, the system would (a code sketch follows the list):

  1. Context Warm-up: The generation begins normally with the Target model. Once a stable context (e.g., 100+ tokens) is established, the Speculative Decoding engine activates. This ensures the Draft model has enough semantic context to generate relevant code snippets or text.
  2. Drafting: The small model (e.g., Qwen 3.5 2B) takes the existing context and generates a speculative sequence of $N$ tokens (e.g., 10-15 tokens) as raw text.
  3. Trans-tokenization: This raw text is then appended to the context and re-tokenized using the DS4 (Target) tokenizer.
  4. Parallel Validation: The Target model performs a single forward pass on the entire sequence (original 100+ tokens + the new speculative tokens) to validate the drafted text in one go.
  5. Prefix Alignment & Correction: The system accepts the longest matching prefix where the Target model's predictions align with the drafted text. If a mismatch occurs at token $i$, the system accepts the first $i-1$ tokens, uses the Target model's correct prediction for token $i$, and discards the rest, restarting the cycle.
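
A minimal sketch of one such cycle, assuming greedy decoding and Hugging Face-style draft/target models and tokenizers (model loading, the warm-up check, and KV caching are omitted; names and the draft length are illustrative):

```python
import torch

def speculate_once(draft_model, draft_tok, target_model, target_tok,
                   context_text, n_draft=12):
    """One text-level speculation cycle (greedy). Illustrative sketch only."""
    # 2. Drafting: the small model continues the context as raw text.
    d_ids = draft_tok(context_text, return_tensors="pt").input_ids
    out = draft_model.generate(d_ids, max_new_tokens=n_draft, do_sample=False)
    draft_text = draft_tok.decode(out[0, d_ids.shape[1]:],
                                  skip_special_tokens=True)

    # 3. Trans-tokenization: re-encode context + draft with the TARGET tokenizer.
    # Caveat: tokens can merge across the context/draft boundary, so a real
    # implementation must realign rather than assume ctx_ids is a clean prefix.
    ctx_ids = target_tok(context_text, return_tensors="pt").input_ids[0]
    full_ids = target_tok(context_text + draft_text,
                          return_tensors="pt").input_ids[0]
    spec_ids = full_ids[len(ctx_ids):]

    # 4. Parallel validation: one forward pass over the whole sequence.
    with torch.no_grad():
        logits = target_model(full_ids.unsqueeze(0)).logits[0]
    # Logits at position j predict token j+1, so this slice covers the draft.
    preds = logits[len(ctx_ids) - 1 : -1].argmax(dim=-1)

    # 5. Prefix alignment & correction: accept the longest matching prefix,
    # then the target's own token at the first mismatch; discard the rest.
    accepted = []
    for spec, pred in zip(spec_ids.tolist(), preds.tolist()):
        accepted.append(spec if spec == pred else pred)
        if spec != pred:
            break
    return context_text + target_tok.decode(accepted)
```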

Drafting could also be event-triggered, by code-context cues or characters like "-", "=", "*", etc., so that speculation runs only when the next tokens have a good chance of being guessed. The draft size could likewise vary with context: seeing a "#" might trigger a 10-token draft, while in free-form prose it may be better not to roll the dice at all. A toy heuristic is sketched below.
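
A toy version of that heuristic (the trigger characters and lengths are invented for illustration):

```python
# Hypothetical trigger table: draft length chosen from the tail of the
# current context; 0 means "skip speculation for this step".
DRAFT_LEN_BY_TRIGGER = {"#": 10, "-": 8, "=": 8, "*": 6, "{": 12, ":": 12}

def draft_length(context_text: str, in_code_block: bool) -> int:
    last = context_text.rstrip(" ")[-1:]
    if last in DRAFT_LEN_BY_TRIGGER:
        return DRAFT_LEN_BY_TRIGGER[last]
    if in_code_block:
        return 6          # code is predictable enough to always try a bit
    return 0              # free-form prose: don't roll the dice
```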

Why this excels for Coding

Coding tasks are uniquely suited for this approach:

  • Predictable Boilerplate: Small models are excellent at guessing repetitive syntax (public static void, if __name__ == "__main__":, bracket closures).
  • Indentation & Formatting: Large chunks of whitespace and standard formatting can be offloaded to the 2B model.
  • Speed Gap: A 1.5B/2B model on modern hardware can often reach >100 t/s, creating a large lead that the DS4 model can verify in batches (see the rough estimate below).
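
To put a number on that lead, here is the standard expected-acceptance estimate (in the spirit of Leviathan et al.'s speculative decoding analysis), assuming each drafted token matches the target independently with probability `a`:

```python
def expected_tokens_per_cycle(a: float, n: int) -> float:
    # Mean tokens emitted per single target forward pass: the matched
    # prefix plus one correction/bonus token = sum of a^k for k in 0..n.
    return (1 - a ** (n + 1)) / (1 - a)

# Illustrative: 80% per-token acceptance on boilerplate, 12-token drafts
# -> ~4.7 tokens per target pass instead of 1.
print(expected_tokens_per_cycle(0.8, 12))
```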

And why not go a step further: a Qwen 3.5 2B fine-tuned on datasets generated by DS4's coding output might make the draft "luckier", i.e., raise its acceptance rate.

Potential Implementation in DS4

While this introduces a small re-tokenization overhead, the cost is negligible compared to the time saved by skipping full autoregressive steps on the Target model. This would allow DS4 to leverage the "intelligence" of specialized small coding models to accelerate generation from the larger, more capable Flash model.
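
The overhead claim is easy to sanity-check with a tokenizer micro-benchmark (gpt2's tokenizer is an arbitrary stand-in; exact timings will vary by machine):

```python
import time
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in, not the DS4 tokenizer
text = "for i in range(10):\n    print(i)\n" * 100   # ~1k tokens of context

t0 = time.perf_counter()
for _ in range(100):
    tok(text)
per_call_ms = (time.perf_counter() - t0) / 100 * 1000
print(f"{per_call_ms:.3f} ms per re-tokenization")  # typically ~1 ms or less
```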

I believe this would be a significant architectural upgrade for DS4, making it much more responsive for developer-centric workflows.
This is a really messy time for me right now, and I'm not sure I'll be able to implement a test myself, but it might be worth looking into anyway.

Labels: help wanted, speed, wontfix
