Does this support using quantized models like AWQ for the main model? #52

@xbl916

Description

Below is my startup script.
```shell
START_CMD="vllm serve \
  /model/llm/Qwen3.5-35B-A3B-AWQ \
  --served-model-name Qwen3.5-VL \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 8 \
  --tensor-parallel-size 2 \
  --generation-config auto \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --speculative-config {\"method\":\"dflash\",\"num_speculative_tokens\":15,\"model\":\"/model/llm/Qwen3.5-35B-A3B-DFlash\"} \
  --distributed-executor-backend mp \
  --allowed-local-media-path / \
  --chat-template-content-format auto \
  --enable-prefix-caching \
  --kv-cache-dtype auto \
  --max-num-batched-tokens 8192 \
  --max-model-len auto \
  --attention-backend flash_attn \
  --host 0.0.0.0 \
  --port 9997"
```

(Note: the inner quotes of the `--speculative-config` JSON must be escaped with `\"`; unescaped, they terminate the surrounding double-quoted string and the JSON reaches vLLM with its quotes stripped.)
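As a sanity check on the quoting above, the value passed to `--speculative-config` has to survive shell word splitting as a single argument containing valid JSON. A minimal sketch (using `shlex` to mimic POSIX splitting; the single-quoted form shown here is an alternative to backslash-escaping, not what vLLM itself does):

```python
import json
import shlex

# The speculative-config JSON exactly as it should reach vLLM
# (model path taken from the startup script above).
cfg = '{"method":"dflash","num_speculative_tokens":15,"model":"/model/llm/Qwen3.5-35B-A3B-DFlash"}'

# shlex.split approximates POSIX shell word splitting: the single-quoted
# JSON stays one token, so vLLM receives it intact.
tokens = shlex.split("--speculative-config '" + cfg + "'")

# json.loads raises ValueError if the quoting mangled the payload.
parsed = json.loads(tokens[1])
print(parsed["method"], parsed["num_speculative_tokens"])
```

If the quotes are lost (as happens with the unescaped form inside a double-quoted shell string), `json.loads` fails immediately, which is a quick way to rule quoting out as the problem.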

I noticed that generation has actually become slower, which suggests the draft acceptance rate is very low.
```
(APIServer pid=11250) INFO 04-08 07:59:24 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 3.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=11250) INFO 04-08 07:59:24 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.50, Accepted throughput: 1.00 tokens/s, Drafted throughput: 30.00 tokens/s, Accepted: 10 tokens, Drafted: 300 tokens, Per-position acceptance rate: 0.300, 0.200, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 3.3%
```
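The logged numbers are internally consistent, which can be verified with a quick back-of-the-envelope check. This sketch assumes (inferred from the log, not from vLLM's source) that each verification step drafts the full 15 tokens and that mean acceptance length counts the one bonus token the target model always emits per step:

```python
# Metrics copied from the SpecDecoding log line above.
num_speculative_tokens = 15
drafted = 300    # total draft tokens
accepted = 10    # total accepted draft tokens

# 300 drafted / 15 per step -> 20 verification steps.
steps = drafted // num_speculative_tokens

# One bonus token per step plus the accepted draft tokens:
# 1 + 10/20 = 1.5, matching "Mean acceptance length: 1.50".
mean_accept_len = 1 + accepted / steps

# 10/300 = 0.0333..., matching "Avg Draft acceptance rate: 3.3%".
draft_accept_rate = accepted / drafted

print(steps, mean_accept_len, round(100 * draft_accept_rate, 1))
```

With a mean acceptance length of only 1.5, each step produces barely more than the single token the target model would have produced anyway, while still paying for 15 drafted tokens of extra work, so a net slowdown is the expected outcome at this acceptance rate.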
