GGUF-native DFlash speculative decoding runtime for local models.
DFlash uses a lightweight block-diffusion draft model to propose a block of tokens in parallel, which the target model verifies in a single forward pass. Unlike autoregressive speculative decoding, draft latency does not grow with block size, because the draft proposes all positions at once rather than one token at a time.
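The verify side of this loop can be sketched in a few lines. This is an illustrative model of greedy block verification, not the repo's actual C++ implementation; `accept_prefix`, `draft_block`, and `target_argmax` are names invented for this example.

```python
def accept_prefix(draft_block: list[int], target_argmax: list[int]) -> list[int]:
    """Greedy block verification sketch: return the longest draft prefix the
    target agrees with, plus the target's own token at the first mismatch.
    `target_argmax[i]` is the target model's greedy token at position i,
    computed for every block position in one forward pass."""
    accepted = []
    for drafted, expected in zip(draft_block, target_argmax):
        if drafted == expected:
            accepted.append(drafted)
        else:
            # Take the target's correction and stop; later draft tokens
            # are conditioned on a token the target rejected.
            accepted.append(expected)
            break
    return accepted
```

Every step therefore emits at least one token (the target's correction), so DFlash can never be slower per step than plain decoding in token count; whether it is faster in wall-clock time depends on draft and verify latency, as the table below shows.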
dflash-robot is based on Luce DFlash, the first GGUF port of DFlash, with an adapter-based architecture for extending to any GGUF model that has a compatible DFlash draft.
v0.2.0 — Qwen3.6-27B + Qwen3.6-35B-A3B on RTX 3090:
| Model | AR tok/s | DFlash tok/s | Speedup | Status |
|---|---|---|---|---|
| Qwen3.6-27B | 37.73 | 90.90 | 2.41x | v0.1.0, cross-gen draft |
| Qwen3.6-35B-A3B | 106.11 | 28.29 | 0.27x | v0.2.0, matched draft, MoE overhead |
Qwen3.6-35B-A3B DFlash produces correct output, but MoE verification overhead dominates on the RTX 3090 (31.5 ms verify per step vs 3.9 ms draft compute). Future optimizations: reduce the MoE expert count per verify step, or quantize the draft to lower VRAM pressure.
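A back-of-envelope throughput model makes the bottleneck visible. The timings are the measured ones quoted above; the tokens-per-step value is a hypothetical input for illustration, and `dflash_tok_per_s` is a name invented for this sketch.

```python
def dflash_tok_per_s(draft_ms: float, verify_ms: float, tokens_per_step: float) -> float:
    """Estimated throughput: tokens emitted per step over per-step latency.
    Per-step latency is draft compute plus one target verify pass."""
    return tokens_per_step / ((draft_ms + verify_ms) / 1000.0)

# With 3.9 ms draft + 31.5 ms verify, one accepted token per step already
# lands near the measured 28 tok/s: the MoE verify pass, not drafting,
# sets the ceiling.
print(round(dflash_tok_per_s(3.9, 31.5, 1.0), 1))
```

This is why shaving draft time cannot help here; only cheaper verification (fewer experts per verify step, or reduced VRAM pressure) moves the number.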
- NVIDIA GPU: sm_75+ (RTX 2080 Ti, RTX 3090, RTX 4090, A10, A40, H100)
- CUDA 12+
- 22+ GB VRAM for Qwen3.6-27B
- ~80 GB disk
git clone --recurse-submodules https://github.com/am423/dflash-robot.git
cd dflash-robot
# Build
CUDAToolkit_ROOT=/path/to/cuda \
CUDACXX=/path/to/nvcc \
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build -j$(nproc)
# Download models
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
huggingface-cli download z-lab/Qwen3.5-27B-DFlash model.safetensors --local-dir models/draft/
# Python env
python3 -m venv .venv
.venv/bin/pip install transformers datasets huggingface_hub safetensors sentencepiece jinja2
# Run
DFLASH_TARGET=models/Qwen3.6-27B-Q4_K_M.gguf \
DFLASH_DRAFT=models/draft/model.safetensors \
.venv/bin/python scripts/run.py --prompt "def fibonacci(n):" --n-gen 128
# Inspect any GGUF
./build/dflash_inspect --target model.gguf --json

dflash-robot uses an adapter pattern for model generality:
- ModelAdapter: Reads GGUF target metadata, builds target prefill/verify graphs, exposes hidden states, manages KV/SSM cache rollback
- DraftAdapter: Loads draft models, validates compatibility, builds draft graph, provides candidate tokens
- RuntimeOrchestrator: Target prefill, hidden-state extraction, draft proposal, target verification, acceptance, rollback, metrics
- CompatibilityRegistry: Classifies any GGUF/draft pair as supported/incompatible/missing
See docs/model-support.md for the full matrix.
Key distinction: "any GGUF" means runtime-generic target support. Acceleration still requires a compatible trained DFlash draft model.
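That distinction is exactly what the CompatibilityRegistry encodes. The following is an illustrative Python sketch of the classification logic, not the repo's actual C++ API; the dataclasses, field names, and `classify` function are assumptions for this example, while the three verdicts mirror the supported/incompatible/missing taxonomy above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TargetMeta:
    arch: str          # architecture string read from GGUF metadata
    hidden_size: int   # hidden state width the draft must consume

@dataclass
class DraftMeta:
    target_arch: str   # architecture the draft was trained against
    hidden_size: int

def classify(target: TargetMeta, draft: Optional[DraftMeta]) -> str:
    """Classify a target/draft pair: the target always loads (runtime-generic),
    but acceleration requires a matching trained draft."""
    if draft is None:
        return "missing"       # target runs AR-only; no trained draft exists
    if draft.target_arch != target.arch or draft.hidden_size != target.hidden_size:
        return "incompatible"  # draft cannot read this target's hidden states
    return "supported"
```

Under this scheme a brand-new GGUF model classifies as "missing" rather than failing to load, which is the behavior the key distinction above describes.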
- docs/building.md — Build instructions
- docs/model-support.md — Supported models and expansion roadmap
- docs/architecture/gap-analysis.md — z-lab vs Luce vs dflash-robot
- docs/research/paper-to-code-map.md — DFlash paper to code mapping
- docs/research/luce-hardcoded-assumptions.md — Qwen35-specific assumptions audit
- Batch size 1, single-user local inference
- Greedy decoding only (temperature/top_p accepted but ignored in server)
- Qwen3.6-27B path only (Qwen35 adapter); more adapters planned
- No draft training pipeline yet
- CUDA/NVIDIA only
Based on Luce DFlash (MIT License). Algorithm from "DFlash: Block Diffusion for Flash Speculative Decoding". See UPSTREAM.md for full attribution.
MIT — see LICENSE.