ngram-mod: Reset i_last when low acceptance streak occurs #22168

Open
treo wants to merge 1 commit into ggml-org:master from treo:master

treo commented Apr 20, 2026

Overview

Resetting i_last to zero when a low acceptance streak occurs makes the rebuild of the speculative map include the current context.

The existing behavior skips the current context during the rebuild, which often loses the benefit of speculative decoding in later parts of the generation.

The size of the effect seems to depend on both the model and the speculation parameters.
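To illustrate the idea, here is a minimal toy sketch, not the actual ngram-mod code. It assumes that i_last is the first context position that has not yet been indexed and that a low acceptance streak clears the map; `toy_ngram_map`, `rebuild_after_streak` and the bigram key are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using token = std::int32_t;

// Toy stand-in for the speculative map: a bigram -> next-token table.
// All names here are hypothetical and are not the llama.cpp API.
struct toy_ngram_map {
    std::map<std::pair<token, token>, token> next;

    void reset() { next.clear(); }

    // Index every bigram that starts at position i_last or later.
    // Returns the new i_last (first position that is not yet indexed).
    std::size_t update(const std::vector<token> & ctx, std::size_t i_last) {
        assert(i_last <= ctx.size());
        for (; i_last + 2 < ctx.size(); ++i_last) {
            next[std::make_pair(ctx[i_last], ctx[i_last + 1])] = ctx[i_last + 2];
        }
        return i_last;
    }

    // Predicted token after the bigram (a, b), or -1 if it is unknown.
    token lookup(token a, token b) const {
        auto it = next.find(std::make_pair(a, b));
        return it == next.end() ? -1 : it->second;
    }
};

// Simulate: index the context, hit a low acceptance streak (map is cleared),
// then rebuild. `reset_i_last` toggles the one-line change (assumed semantics).
toy_ngram_map rebuild_after_streak(const std::vector<token> & ctx, bool reset_i_last) {
    toy_ngram_map map;
    std::size_t i_last = map.update(ctx, 0); // normal operation: context is indexed
    map.reset();                             // low acceptance streak: map is cleared
    if (reset_i_last) {
        i_last = 0;                          // re-index the whole context on rebuild
    }
    map.update(ctx, i_last);                 // rebuild from i_last onwards
    return map;
}
```

In this toy model, calling `rebuild_after_streak(ctx, false)` on an already indexed context leaves the map empty until new tokens arrive, while `rebuild_after_streak(ctx, true)` restores every entry from the existing context.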

Benchmark:
vllm bench serve --model google/gemma-4-26B-A4B-it --host 127.0.0.1 --port 9876 --num-prompts 96 --dataset-name hf --dataset-path vdaita/edit_5k_char --backend openai-chat --endpoint '/v1/chat/completions' --max-concurrency 1

Gemma 4 26B-A4B:

  • Baseline: 97.57 t/s (peak 108)
  • without i_last = 0: 141.08 t/s (peak 1032)
  • with i_last = 0: 155.10 t/s (peak 1032)

Qwen 3.6 35B-A3B:

  • Baseline: 112.54 t/s (peak 127)
  • without i_last = 0: 148.56 t/s (peak 774)
  • with i_last = 0: 153.65 t/s (peak 768)

As we can see, the effect isn't huge, but at roughly 3% (Qwen) to 10% (Gemma) over the run without the reset it is still measurable.
The price we pay for it is one line of code and a slightly longer speculative map repopulation whenever a low acceptance streak occurs.
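For reference, the relative gain can be recomputed from the throughput figures above with a small helper. This is only a sketch: `gain_percent` is a hypothetical name, and it measures the gain relative to the run without the reset.

```cpp
#include <cassert>

// Relative throughput gain of `with_tps` over `without_tps`, in percent.
// Hypothetical helper for checking the benchmark numbers by hand.
double gain_percent(double without_tps, double with_tps) {
    assert(without_tps > 0.0);
    return (with_tps / without_tps - 1.0) * 100.0;
}
```

With the numbers reported here, `gain_percent(141.08, 155.10)` comes out at about 9.9 for Gemma and `gain_percent(148.56, 153.65)` at about 3.4 for Qwen.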

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. Used to help understand the ngram-mod flow.

treo requested a review from a team as a code owner on April 20, 2026 at 13:05