Support Qwen2.5 VL eagle3 by rahul-tuli · Pull Request #118 · neuralmagic/vllm

rahul-tuli · 2025-09-26T20:49:10Z

VLLM_USE_V1=1 \
  CUDA_VISIBLE_DEVICES=6 \
  python examples/offline_inference/spec_decode.py \
    --method "eagle3" \
    --tp 1 \
    --print-output \
    --model-dir "Qwen/Qwen2.5-VL-7B-Instruct" \
    --eagle-dir "nm-testing/MOCK-UP-Eagle3ForQwen2.5VL7B" \
    --dataset_name "hf" \
    --dataset_path "philschmid/mt-bench" \
    --num-spec-tokens 3

Results:

--------------------------------------------------
--------------------------------------------------
total_num_output_tokens: 198111
num_drafts: 143816
num_draft_tokens: 431448
num_accepted_tokens: 53491
mean acceptance length: 1.37
--------------------------------------------------
acceptance at token 0: 0.31
acceptance at token 1: 0.06
acceptance at token 2: 0.01
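
The counters above are internally consistent, which can be checked with a few lines of arithmetic. The sketch below assumes that the mean acceptance length counts one token emitted by the verifier per draft plus the accepted draft tokens; that formula reproduces the reported 1.37 but is an assumption here, not something stated in the PR. The helper name `mean_acceptance_length` is hypothetical.

```python
def mean_acceptance_length(num_accepted_tokens: int, num_drafts: int) -> float:
    """Average tokens produced per verification step (assumed formula).

    Each draft round is assumed to yield one verifier token plus however
    many draft tokens were accepted in that round.
    """
    if num_drafts == 0:
        return 1.0
    return 1.0 + num_accepted_tokens / num_drafts


# Counters copied from the results above.
num_drafts = 143816
num_draft_tokens = 431448
num_accepted_tokens = 53491
num_spec_tokens = 3  # from --num-spec-tokens 3

# Every draft proposes num_spec_tokens tokens.
assert num_draft_tokens == num_drafts * num_spec_tokens

mean_len = mean_acceptance_length(num_accepted_tokens, num_drafts)
draft_token_acceptance_rate = num_accepted_tokens / num_draft_tokens

print(f"mean acceptance length: {mean_len:.2f}")
print(f"accepted share of all draft tokens: {draft_token_acceptance_rate:.3f}")
```

Under the same assumption, the per-position acceptance rates (0.31 + 0.06 + 0.01) should sum to roughly `num_accepted_tokens / num_drafts`, which they do up to rounding.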

- Add SupportsEagle3 interface to Llama4ForConditionalGeneration and Llama4ForCausalLM
- Implement custom auxiliary hidden state layers (1, 23, 44) for Eagle3 speculative decoding
- Enable multimodal input handling in Eagle3LlamaForCausalLM with text-only inference mode
- Add proper dimension adaptation for auxiliary hidden states from multimodal verifiers
- Implement dynamic Eagle3 auxiliary layer configuration from speculators config
- Add GPU model runner method to read eagle_aux_hidden_state_layer_ids from draft config
- Update auxiliary layer configuration logic to use speculative config dynamically
- Simplify model implementations to provide fallback defaults

This is the first successful implementation of Eagle3 speculative decoding with multimodal Llama4 models, supporting custom layer extraction and text-only drafter processing while leveraging multimodal context from auxiliary hidden states. The implementation now dynamically reads auxiliary layer configuration from the draft model's speculative config, eliminating hardcoded layer IDs.
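
The "read layer IDs from the draft config, otherwise fall back to defaults" behavior described in this commit message can be illustrated with a minimal sketch. It is not the code from this PR: `resolve_aux_layers` and the `SimpleNamespace` stand-in for a draft config are hypothetical, and only the attribute name `eagle_aux_hidden_state_layer_ids` and the layer tuple (1, 23, 44) come from the text above.

```python
from types import SimpleNamespace

# Layers named in the commit message; used here as the fallback default.
DEFAULT_AUX_LAYERS = (1, 23, 44)


def resolve_aux_layers(draft_config, default=DEFAULT_AUX_LAYERS):
    """Return auxiliary layer IDs from the draft config, else the default.

    `draft_config` is a hypothetical stand-in for whatever object carries
    the draft model's speculative configuration.
    """
    layer_ids = getattr(draft_config, "eagle_aux_hidden_state_layer_ids", None)
    if layer_ids:
        # Normalize to an immutable tuple of ints.
        return tuple(int(i) for i in layer_ids)
    return tuple(default)


# Draft config that specifies its own layers (values are illustrative).
configured = SimpleNamespace(eagle_aux_hidden_state_layer_ids=[2, 10, 20])
# Draft config that does not specify any layers.
unconfigured = SimpleNamespace()

print(resolve_aux_layers(configured))
print(resolve_aux_layers(unconfigured))
```

Keeping the default on the model side and the override in the draft config means a drafter trained against different layers can be swapped in without editing model code.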

- Fix aux_hidden_state_layers initialization syntax error in qwen2.py
- Add missing return statement in qwen2_5_vl.py get_eagle3_aux_hidden_state_layers
- Improve error handling with hasattr check instead of assert
- Clean up method delegation to use direct return from language_model
- Add fallback default auxiliary layers for Qwen2.5VL models

These fixes enable Eagle3 speculative decoding support for Qwen2.5VL models. Successfully tested with Qwen2.5VL-7B + Eagle3 configuration.
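
As an illustration of the delegation pattern named in these fixes (a `hasattr` check instead of `assert`, and a direct `return` from `language_model`), here is a toy sketch. The classes `ToyLanguageModel` and `ToyVLWrapper` are hypothetical and the layer values are placeholders; only the method name `get_eagle3_aux_hidden_state_layers` and the `language_model` attribute are taken from the commit message.

```python
class ToyLanguageModel:
    """Hypothetical text backbone that knows its own auxiliary layers."""

    def get_eagle3_aux_hidden_state_layers(self):
        # Placeholder values, not the defaults used in the PR.
        return (2, 14, 25)


class ToyVLWrapper:
    """Hypothetical multimodal wrapper that delegates to its backbone."""

    def __init__(self, language_model):
        self.language_model = language_model

    def get_eagle3_aux_hidden_state_layers(self):
        # An explicit check survives `python -O`, which strips asserts,
        # and produces a clearer error message.
        if not hasattr(self.language_model, "get_eagle3_aux_hidden_state_layers"):
            raise NotImplementedError(
                "language_model does not expose Eagle3 auxiliary layers"
            )
        # Return the delegate's result; omitting this `return` would make
        # the method silently yield None.
        return self.language_model.get_eagle3_aux_hidden_state_layers()


wrapper = ToyVLWrapper(ToyLanguageModel())
print(wrapper.get_eagle3_aux_hidden_state_layers())
```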

rahul-tuli added 3 commits September 25, 2025 14:09

Some fixes

780c072

rahul-tuli force-pushed the support-llama3-eagle3-head-with-llama4-verifier branch 7 times, most recently from 1f6fd40 to 5e93541 Compare October 3, 2025 08:45

rahul-tuli force-pushed the support-llama3-eagle3-head-with-llama4-verifier branch from cac1941 to 1037b36 Compare October 6, 2025 13:10

rahul-tuli wants to merge 3 commits into support-llama3-eagle3-head-with-llama4-verifier from support-qwen2.5-vl-eagle3
