dflash-robot

GGUF-native DFlash speculative decoding runtime for local models.

DFlash uses a lightweight block diffusion draft model to propose multiple tokens in parallel, which the target model then verifies in a single forward pass. Compared with speculative decoding that uses an autoregressive draft, this is faster because draft latency does not grow with block size.
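In pseudocode, one draft/verify step looks roughly like this (a toy sketch, not the actual implementation: `target` and `draft` are stand-in callables, and acceptance is exact-match greedy):

```python
def dflash_step(target, draft, context, block_size=8):
    """One speculative step: draft a block in parallel, verify in one pass.

    Toy model: `draft(context, n)` proposes n tokens in a single call
    (block diffusion, so its latency does not grow with n), and
    `target(context, candidates)` returns the target's token at each
    candidate position from one verify forward pass.
    """
    candidates = draft(context, block_size)
    verified = target(context, candidates)
    accepted = []
    for cand, true_tok in zip(candidates, verified):
        if cand != true_tok:
            accepted.append(true_tok)  # target's correction ends the block
            break
        accepted.append(cand)          # draft token confirmed, keep going
    return accepted
```

Every step emits at least one token (the target's correction), so throughput is accepted-tokens-per-step divided by per-step latency.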

dflash-robot is based on Luce DFlash, the first GGUF port of DFlash, with an adapter-based architecture for extending support to any GGUF model that has a compatible DFlash draft.

Current Status

v0.2.0 — Qwen3.6-27B + Qwen3.6-35B-A3B on RTX 3090:

| Model | AR tok/s | DFlash tok/s | Speedup | Status |
|---|---|---|---|---|
| Qwen3.6-27B | 37.73 | 90.90 | 2.41x | v0.1.0, cross-gen draft |
| Qwen3.6-35B-A3B | 106.11 | 28.29 | 0.27x | v0.2.0, matched draft, MoE overhead |

Qwen3.6-35B-A3B DFlash produces correct output, but MoE verification overhead dominates on the RTX 3090 (31.5 ms verify per step vs. 3.9 ms draft compute). Future optimizations: reduce the MoE expert count per verify step, or quantize the draft to lower VRAM pressure.
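To make the overhead concrete, here is a small latency model built from the numbers above (the per-step costs are the measurements quoted in this README; the accepted-tokens-per-step figure is inferred from them, not measured directly):

```python
# Illustrative latency model for the Qwen3.6-35B-A3B numbers above.
t_ar_ms = 1000 / 106.11        # AR cost per token: ~9.4 ms
t_draft_ms = 3.9               # draft compute per DFlash step
t_verify_ms = 31.5             # MoE verify per DFlash step
step_ms = t_draft_ms + t_verify_ms

# DFlash wins only if a step yields more tokens than AR decoding could
# produce in the same wall-clock time:
breakeven = step_ms / t_ar_ms
print(f"break-even accepted tokens/step: {breakeven:.2f}")  # 3.76

# The observed 28.29 tok/s implies roughly one accepted token per step,
# far below break-even -- hence the 0.27x result.
implied = 28.29 * step_ms / 1000
print(f"implied accepted tokens/step: {implied:.2f}")  # 1.00
```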

Requirements

  • NVIDIA GPU: sm_75+ (RTX 2080 Ti, RTX 3090, RTX 4090, A10, A40, H100)
  • CUDA 12+
  • 22+ GB VRAM for Qwen3.6-27B
  • ~80 GB disk

Quick Start

git clone --recurse-submodules https://github.com/am423/dflash-robot.git
cd dflash-robot

# Build
CUDAToolkit_ROOT=/path/to/cuda \
CUDACXX=/path/to/nvcc \
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build -j$(nproc)

# Download models
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
huggingface-cli download z-lab/Qwen3.5-27B-DFlash model.safetensors --local-dir models/draft/

# Python env
python3 -m venv .venv
.venv/bin/pip install transformers datasets huggingface_hub safetensors sentencepiece jinja2

# Run
DFLASH_TARGET=models/Qwen3.6-27B-Q4_K_M.gguf \
DFLASH_DRAFT=models/draft/model.safetensors \
.venv/bin/python scripts/run.py --prompt "def fibonacci(n):" --n-gen 128

# Inspect any GGUF
./build/dflash_inspect --target model.gguf --json

Architecture

dflash-robot uses an adapter pattern for model generality:

  • ModelAdapter: Reads GGUF target metadata, builds target prefill/verify graphs, exposes hidden states, manages KV/SSM cache rollback
  • DraftAdapter: Loads draft models, validates compatibility, builds draft graph, provides candidate tokens
  • RuntimeOrchestrator: Target prefill, hidden-state extraction, draft proposal, target verification, acceptance, rollback, metrics
  • CompatibilityRegistry: Classifies any GGUF/draft pair as supported/incompatible/missing
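As a rough sketch of how those pieces fit together (hypothetical Python interfaces; the actual runtime is C++, and these names and signatures are illustrative only):

```python
from abc import ABC, abstractmethod

class ModelAdapter(ABC):
    """Target-side seam: graphs, hidden states, cache rollback."""
    @abstractmethod
    def prefill(self, tokens): ...     # run the target prefill graph
    @abstractmethod
    def hidden_states(self): ...       # expose states for the draft
    @abstractmethod
    def verify(self, candidates): ...  # one forward pass; returns accepted tokens
    @abstractmethod
    def rollback(self, n_tokens): ...  # KV/SSM cache rollback

class DraftAdapter(ABC):
    """Draft-side seam: propose candidates from target hidden states."""
    @abstractmethod
    def propose(self, hidden_states, block_size): ...

class RuntimeOrchestrator:
    """Drives one draft -> verify -> accept -> rollback cycle."""
    def __init__(self, target: ModelAdapter, draft: DraftAdapter):
        self.target, self.draft = target, draft

    def step(self, block_size=8):
        candidates = self.draft.propose(self.target.hidden_states(), block_size)
        accepted = self.target.verify(candidates)
        self.target.rollback(block_size - len(accepted))  # discard rejected tail
        return accepted
```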

Model Support

See docs/model-support.md for the full matrix.

Key distinction: "any GGUF" means runtime-generic target support. Acceleration still requires a compatible trained DFlash draft model.

Limitations

  • Batch size 1, single-user local inference
  • Greedy decoding only (temperature/top_p are accepted but ignored by the server)
  • Qwen3.6-27B path only (Qwen35 adapter); more adapters planned
  • No draft training pipeline yet
  • CUDA/NVIDIA only

Upstream

Based on Luce DFlash (MIT License). Algorithm: DFlash: Block Diffusion for Flash Speculative Decoding. See UPSTREAM.md for full attribution.

License

MIT — see LICENSE.
