Below is my startup script.
START_CMD="vllm serve
/model/llm/Qwen3.5-35B-A3B-AWQ
--served-model-name Qwen3.5-VL
--gpu-memory-utilization 0.95
--max-num-seqs 8
--tensor-parallel-size 2
--generation-config auto
--reasoning-parser qwen3
--enable-auto-tool-choice --tool-call-parser qwen3_coder
--speculative-config '{\"method\":\"dflash\",\"num_speculative_tokens\":15,\"model\":\"/model/llm/Qwen3.5-35B-A3B-DFlash\"}'
--distributed-executor-backend mp
--allowed-local-media-path /
--chat-template-content-format auto
--enable-prefix-caching
--kv-cache-dtype auto
--max-num-batched-tokens 8192
--max-model-len auto
--attention-backend flash_attn
--host 0.0.0.0
--port 9997"
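As an aside, the JSON value for --speculative-config is fragile when embedded in a double-quoted START_CMD string: unescaped inner double quotes terminate the string, and whether escaped quotes survive depends on how the variable is later executed (plain expansion vs. eval vs. bash -c). A sketch of a more robust form, assuming the command is launched from bash, is a command array (remaining flags elided here, unchanged from the script above):

```shell
#!/usr/bin/env bash
# Sketch: keep the JSON in single quotes and the command in a bash array,
# so the config reaches vllm as one intact argument regardless of spaces
# or quotes inside it.
SPEC_CFG='{"method":"dflash","num_speculative_tokens":15,"model":"/model/llm/Qwen3.5-35B-A3B-DFlash"}'
START_CMD=(
  vllm serve /model/llm/Qwen3.5-35B-A3B-AWQ
  --served-model-name Qwen3.5-VL
  --speculative-config "$SPEC_CFG"
  # ...remaining flags unchanged from the script above...
  --host 0.0.0.0
  --port 9997
)
# Print rather than execute, to confirm each argument survives as one word:
printf '%s\n' "${START_CMD[@]}"
# To actually launch the server: "${START_CMD[@]}"
```

With the array form, `"${START_CMD[@]}"` expands each element as exactly one argument, so no escaping gymnastics are needed around the JSON.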
I noticed that generation has actually become slower with speculative decoding enabled, and the logs suggest the draft-token acceptance rate is very low.
(APIServer pid=11250) INFO 04-08 07:59:24 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=11250) INFO 04-08 07:59:24 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.50, Accepted throughput: 1.00 tokens/s, Drafted throughput: 30.00 tokens/s, Accepted: 10 tokens, Drafted: 300 tokens, Per-position acceptance rate: 0.300, 0.200, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 3.3%
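For what it's worth, the numbers in that metrics line are internally consistent, assuming the mean acceptance length counts the one verified token emitted per step plus the accepted draft tokens: only the first two of the 15 draft positions are ever accepted. A quick awk check with the per-position rates copied from the log:

```shell
# Recompute the SpecDecoding summary from the per-position rates in the log.
awk 'BEGIN {
  n = split("0.300 0.200 0 0 0 0 0 0 0 0 0 0 0 0 0", r, " ")  # 15 positions
  mean = 1                         # one verified (non-draft) token per step
  for (i = 1; i <= n; i++) mean += r[i]
  printf "mean acceptance length: %.2f\n", mean                  # log: 1.50
  printf "avg draft acceptance:   %.1f%%\n", 100 * (mean - 1) / n  # log: 3.3%
}'
```

So with num_speculative_tokens set to 15, 13 of the 15 drafted tokens per step are pure wasted compute, which explains why the server is slower than without speculation.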