Does this support using quantized models like AWQ for the main model? #52

@xbl916

Description

Below is my startup script.
```shell
START_CMD="vllm serve \
  /model/llm/Qwen3.5-35B-A3B-AWQ \
  --served-model-name Qwen3.5-VL \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 8 \
  --tensor-parallel-size 2 \
  --generation-config auto \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --speculative-config {\"method\":\"dflash\",\"num_speculative_tokens\":15,\"model\":\"/model/llm/Qwen3.5-35B-A3B-DFlash\"} \
  --distributed-executor-backend mp \
  --allowed-local-media-path / \
  --chat-template-content-format auto \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --max-num-batched-tokens 8192 \
  --max-model-len auto \
  --attention-backend flash_attn \
  --host 0.0.0.0 \
  --port 9997"
```

(Note: the inner quotes of the `--speculative-config` JSON must be escaped with `\"`; unescaped, they terminate the surrounding double-quoted string and the JSON reaches vLLM with its quotes stripped.)
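As a sanity check on the quoting above, the value passed to `--speculative-config` has to survive shell word splitting as a single argument containing valid JSON. A minimal sketch (using `shlex` to mimic POSIX splitting; the single-quoted form shown here is an alternative to backslash-escaping, not what vLLM itself does):

```python
import json
import shlex

# The speculative-config JSON exactly as it should reach vLLM
# (model path taken from the startup script above).
cfg = '{"method":"dflash","num_speculative_tokens":15,"model":"/model/llm/Qwen3.5-35B-A3B-DFlash"}'

# shlex.split approximates POSIX shell word splitting: the single-quoted
# JSON stays one token, so vLLM receives it intact.
tokens = shlex.split("--speculative-config '" + cfg + "'")

# json.loads raises ValueError if the quoting mangled the payload.
parsed = json.loads(tokens[1])
print(parsed["method"], parsed["num_speculative_tokens"])
```

If the quotes are lost (as happens with the unescaped form inside a double-quoted shell string), `json.loads` fails immediately, which is a quick way to rule quoting out as the problem.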

I noticed that generation has actually become slower, which suggests the draft acceptance rate is very low.
```
(APIServer pid=11250) INFO 04-08 07:59:24 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=11250) INFO 04-08 07:59:24 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.50, Accepted throughput: 1.00 tokens/s, Drafted throughput: 30.00 tokens/s, Accepted: 10 tokens, Drafted: 300 tokens, Per-position acceptance rate: 0.300, 0.200, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 3.3%
```
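The logged numbers are internally consistent, which can be verified with a quick back-of-the-envelope check. This sketch assumes (inferred from the log, not from vLLM's source) that each verification step drafts the full 15 tokens and that mean acceptance length counts the one bonus token the target model always emits per step:

```python
# Metrics copied from the SpecDecoding log line above.
num_speculative_tokens = 15
drafted = 300    # total draft tokens
accepted = 10    # total accepted draft tokens

# 300 drafted / 15 per step -> 20 verification steps.
steps = drafted // num_speculative_tokens

# One bonus token per step plus the accepted draft tokens:
# 1 + 10/20 = 1.5, matching "Mean acceptance length: 1.50".
mean_accept_len = 1 + accepted / steps

# 10/300 = 0.0333..., matching "Avg Draft acceptance rate: 3.3%".
draft_accept_rate = accepted / drafted

print(steps, mean_accept_len, round(100 * draft_accept_rate, 1))
```

With a mean acceptance length of only 1.5, each step produces barely more than the single token the target model would have produced anyway, while still paying for 15 drafted tokens of extra work, so a net slowdown is the expected outcome at this acceptance rate.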
