ATOM (AITER Optimized Model) is a lightweight vLLM-like inference engine, focused on integration and optimization on top of AITER.
- ROCm Optimized: Built on AMD's ROCm platform with AITER kernels (ASM, CK, Triton)
- OpenAI-Compatible API: Drop-in server with `/v1/chat/completions` and `/v1/completions` endpoints
- Piecewise torch.compile: 4 compilation levels with CUDA graph capture for low-latency decode
- Multi-GPU Parallelism: Tensor parallelism (TP), data parallelism (DP), and expert parallelism (EP) with MORI all-to-all
- Quantization: FP8, MXFP4, INT8, INT4 with auto-detection from HuggingFace configs
- Speculative Decoding: Multi-Token Prediction (MTP) with EAGLE proposer
- Prefix Caching: xxhash64-based KV cache block sharing across sequences
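The prefix-caching feature works by content-addressing KV cache blocks: each block's hash chains its parent block's hash with its own token IDs, so requests that share a prompt prefix resolve to the same blocks. A minimal sketch of the idea (using hashlib's blake2b as a stand-in for xxhash64; the block size and names here are illustrative, not ATOM's actual implementation):

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV cache block; illustrative (real engines use larger blocks)

def block_hashes(token_ids):
    """Chain-hash each full block of tokens: equal prefixes yield equal hashes."""
    hashes = []
    parent = b""
    n_full = len(token_ids) // BLOCK_SIZE * BLOCK_SIZE
    for start in range(0, n_full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.blake2b(parent + repr(block).encode(), digest_size=8).hexdigest()
        hashes.append(digest)
        parent = digest.encode()
    return hashes

# Requests sharing their first block of tokens share that block's hash,
# so its KV cache entries can be reused instead of recomputed.
shared = block_hashes([1, 2, 3, 4, 5, 6, 7, 8])
forked = block_hashes([1, 2, 3, 4, 9, 9, 9, 9])
```

Because each hash folds in the parent's hash, a block is only shared when the entire prefix before it matches, which is exactly the condition under which its KV cache contents are identical.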
| Model Family | HF Architecture | Dense/MoE | Notes |
|---|---|---|---|
| Llama | LlamaForCausalLM | Dense | Llama 2, Llama 3, Llama 3.1 |
| Qwen3 | Qwen3ForCausalLM | Dense | |
| Qwen3-MoE | Qwen3MoeForCausalLM | MoE | 128 experts, top-8 routing |
| Qwen3-Next | Qwen3NextForCausalLM | MoE | Hybrid full attention + Gated DeltaNet |
| DeepSeek V2/V3 | DeepseekV3ForCausalLM | MoE | MLA attention, MTP speculative decoding |
| Mixtral | MixtralForCausalLM | MoE | 8 experts, top-2 routing |
| GLM-4-MoE | Glm4MoeForCausalLM | MoE | |
| GLM-5 | GlmMoeDsaForCausalLM | MoE | MLA attention, similar to DeepSeek V3.2. See recipe |
| GPT-OSS | GptOssForCausalLM | MoE | Sliding window + attention sinks |
| Kimi-K2 | via --trust-remote-code | MoE | See recipe |
- AMD GPU with ROCm support
- Docker
Pre-built image with AITER + ATOM ready to use:
```bash
docker pull rocm/atom-dev:latest
docker run -it --network=host \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v $HOME:/home/$USER \
    -v /mnt:/mnt \
    -v /data:/data \
    --shm-size=16G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    rocm/atom-dev:latest
```

Alternatively, start from the ROCm PyTorch base image:

```bash
docker pull rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0
docker run -it --network=host \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v $HOME:/home/$USER \
    -v /mnt:/mnt \
    -v /data:/data \
    --shm-size=16G \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0
```

Then install AITER and ATOM inside the container:

```bash
pip install amd-aiter
git clone https://github.com/ROCm/ATOM.git && pip install ./ATOM
```

The default optimization level is 3 (piecewise torch.compile with CUDA graphs). Run a quick offline inference example:

```bash
python -m atom.examples.simple_inference --model meta-llama/Meta-Llama-3-8B --kv_cache_dtype fp8
```

Note: First-time execution may take approximately 10 minutes for model compilation.
Start an OpenAI-compatible server:
```bash
# Single GPU
python -m atom.entrypoints.openai_server --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8

# Multi-GPU with tensor parallelism
python -m atom.entrypoints.openai_server --model deepseek-ai/DeepSeek-R1 --kv_cache_dtype fp8 -tp 8

# With MTP speculative decoding
python -m atom.entrypoints.openai_server --model deepseek-ai/DeepSeek-R1 --kv_cache_dtype fp8 -tp 8 \
    --method mtp --num-speculative-tokens 3
```

Benchmark dashboard: rocm.github.io/ATOM/benchmark-dashboard
The dashboard tracks nightly performance across models and configurations:
- Interactive vs Throughput – tok/s/user vs tok/s/GPU tradeoff across concurrency levels
- Throughput & Latency trends – output throughput, TTFT, and TPOT over time, grouped by model
- Regression detection – automatic alerts when throughput drops >5% or latency increases >10%
- Profiler trace collection – on regression, automatically re-runs with the PyTorch profiler and uploads traces
Models tracked: DeepSeek-R1-0528 (BF16 & MTP3), GLM-5-FP8, gpt-oss-120b
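The regression alerts reduce to simple ratio thresholds against a tracked baseline. A rough sketch of the rule, with the thresholds listed above (function and field names are illustrative, not the dashboard's actual code):

```python
def check_regression(baseline, current, tput_drop=0.05, latency_rise=0.10):
    """Flag runs whose throughput fell more than 5% or whose per-token
    latency rose more than 10% relative to the baseline."""
    alerts = []
    if current["throughput"] < baseline["throughput"] * (1 - tput_drop):
        alerts.append("throughput regression")
    if current["tpot"] > baseline["tpot"] * (1 + latency_rise):
        alerts.append("latency regression")
    return alerts

baseline = {"throughput": 1000.0, "tpot": 20.0}  # tok/s and ms; illustrative values
check_regression(baseline, {"throughput": 990.0, "tpot": 21.0})  # within thresholds: no alerts
check_regression(baseline, {"throughput": 900.0, "tpot": 23.0})  # both thresholds breached
```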
For more information, visit InferenceX.
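The serving benchmark below reports percentile metrics over TTFT, TPOT, ITL, and E2E latency. As a reference for reading its output, here is one common way these quantities are derived from per-token arrival times (a sketch of the standard definitions, not ATOM's code):

```python
def latency_metrics(t_sent, token_times):
    """Derive per-request serving latencies from output-token arrival times.

    t_sent: wall-clock time the request was sent
    token_times: arrival time of each output token, in order
    """
    ttft = token_times[0] - t_sent                 # time to first token
    e2el = token_times[-1] - t_sent                # end-to-end latency
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token latencies
    # TPOT: average decode-step time, excluding the (prefill-bound) first token
    tpot = (e2el - ttft) / max(len(token_times) - 1, 1)
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2el": e2el}

m = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
# ttft = 0.5, e2el = 0.8, tpot = (0.8 - 0.5) / 3 ≈ 0.1
```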
Run an online throughput benchmark against a running server:
```bash
python -m atom.benchmarks.benchmark_serving \
    --model=deepseek-ai/DeepSeek-R1 --backend=vllm --base-url=http://localhost:8000 \
    --dataset-name=random \
    --random-input-len=1024 --random-output-len=1024 \
    --random-range-ratio=0.8 \
    --num-prompts=1280 --max-concurrency=128 \
    --request-rate=inf --ignore-eos \
    --save-result --percentile-metrics="ttft,tpot,itl,e2el"
```

To collect profiler traces, launch the server with `--torch-profiler-dir` and `--mark-trace`:
```bash
python -m atom.entrypoints.openai_server \
    --model deepseek-ai/DeepSeek-R1 --kv_cache_dtype fp8 -tp 8 \
    --torch-profiler-dir ./trace --mark-trace
```

Collect traces via the benchmark `--profile` flag (automatic start/stop):
```bash
python -m atom.benchmarks.benchmark_serving \
    --model=deepseek-ai/DeepSeek-R1 --backend=vllm --base-url=http://localhost:8000 \
    --dataset-name=random --random-input-len=1024 --random-output-len=1024 \
    --num-prompts=128 --max-concurrency=128 \
    --request-rate=inf --ignore-eos --profile
```

Or control profiling manually on a running server:
```bash
curl -X POST http://127.0.0.1:8000/start_profile
# ... run your workload ...
curl -X POST http://127.0.0.1:8000/stop_profile
```

Analyze the collected traces:

```bash
# Kernel breakdown per layer – Excel
python tools/parse_trace.py ./trace/rank_0/DeepSeek-R1_ts_*.json.gz --layer 3

# Performance summary – Markdown report
python tools/analyze_trace_summary.py ./trace/rank_0/DeepSeek-R1_ts_*.json.gz
```

| Output | Description |
|---|---|
| prefill_breakdown.xlsx | Per-kernel duration, call count, pct%, module grouping, cross-layer averages |
| decode_breakdown.xlsx | Same for the decode phase, with CUDAGraph kernel mapping |
| performance_summary.md | Prefill/decode/draft step timing, iteration breakdown |
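As a rough illustration of what the breakdown tools compute, the following aggregates per-kernel GPU time from a PyTorch profiler Chrome trace. It assumes the Kineto export format, where GPU kernels appear as complete (`"ph": "X"`) events with category `"kernel"`; this is a sketch, not the tools' actual implementation:

```python
import gzip
import json
from collections import defaultdict

def kernel_breakdown(trace_path):
    """Sum GPU kernel durations by kernel name from a Chrome-trace JSON,
    returning (name, total_us, calls, percent) rows sorted by total time."""
    opener = gzip.open if trace_path.endswith(".gz") else open
    with opener(trace_path, "rt") as f:
        events = json.load(f)["traceEvents"]
    totals, counts = defaultdict(float), defaultdict(int)
    for ev in events:
        # Complete-event kernels only; CPU ops have other categories.
        if ev.get("ph") == "X" and ev.get("cat") == "kernel":
            totals[ev["name"]] += ev["dur"]  # Kineto durations are microseconds
            counts[ev["name"]] += 1
    grand_total = sum(totals.values()) or 1.0
    return sorted(
        ((name, totals[name], counts[name], 100.0 * totals[name] / grand_total)
         for name in totals),
        key=lambda row: -row[1])
```

The real tools additionally map kernels back to model modules and layers (and, for decode, through CUDAGraph replay), but the core aggregation is this kind of group-by over trace events.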
```bash
pip install lm-eval[api]

# Start server, then run evaluation
lm_eval --model local-completions \
    --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
    --tasks gsm8k --num_fewshot 5
```

Full documentation: rocm.github.io/ATOM/docs
| Topic | Description | Guide |
|---|---|---|
| Architecture | System overview, request lifecycle, component design | Architecture Guide |
| Configuration | Config classes, CLI arguments, environment variables | Configuration Guide |
| Model Support | Supported models, weight loading, adding new architectures | Model Support Guide |
| Model Operations | AITER kernel integration, linear/attention/MoE/norm wrappers | Model Ops Guide |
| Scheduling & KV Cache | Batch scheduling, block allocation, prefix caching | Scheduling Guide |
| Compilation | torch.compile levels, CUDA graphs, piecewise compilation | Compilation Guide |
| Distributed | Tensor/data/expert parallelism, multi-GPU deployment | Distributed Guide |
| Serving & Benchmarks | OpenAI API server, benchmarking, profiling, speculative decoding | Serving Guide |
| Environment Variables | All ATOM_* variable definitions | Env Vars |
Deployment Recipes:
- DeepSeek-R1 – BF16/MXFP4 with MTP speculative decoding on 8 GPUs
- Qwen3-235B-A22B – TP8 + EP with FP8 KV cache
- Qwen3-Next – Hybrid GDN + MoE architecture
- Kimi-K2-Thinking – MXFP4 MoE on 4 GPUs
- GLM-5 – FP8 MoE with MLA on 8 GPUs
- GPT-OSS-120B – Single GPU or DP+EP on 2 GPUs
Framework Integration:
- vLLM Plugin Backend – ATOM as an out-of-tree plugin backend for vLLM
- SGLang Model Backend – ATOM as a model implementation backend for SGLang
This project was adapted from nano-vllm.
We welcome issues and contributions! Please use the GitHub Issues page to report bugs or request features: https://github.com/ROCm/ATOM/issues

