Add AMD ROCm/HIP Support and Strix Halo Optimizations by AyaSakura-comp · Pull Request #118 · antirez/ds4

AyaSakura-comp · 2026-05-13T16:57:51Z

Pull Request Description: Add AMD ROCm/HIP Support and Strix Halo Optimizations

Overview

This PR introduces a complete AMD ROCm/HIP backend to DwarfStar 4, optimized specifically for hardware with unified memory architectures like the AMD Strix Halo (gfx1151). It migrates the project from its original CUDA dependency to a portable HIP implementation while maintaining functional parity and performance.

Key Changes

ROCm/HIP Backend Migration:
- Ported ds4_cuda.cu to ds4_hip.cpp and transitioned all symbol dependencies from CUDA/cuBLAS to HIP/hipBLAS.
- Updated the Makefile to detect and support the ROCm stack using hipcc.

Strix Halo / HSA Optimizations:
- Zero-Copy Memory Access: Configured the engine to use HSA direct access (Zero-Copy) by default on AMD hardware. This avoids duplicating 83+ GiB of model weights in system RAM, significantly reducing memory overhead.
- Vectorized Kernels: Optimized F16 and F32 GEMV kernels using vectorized loads and warp-shuffle reductions for improved decoding throughput.
- Hardware Intrinsics: Replaced scalar loops with AMD-specific hardware dot-product intrinsics (v_dot4_i32_i8).

Unified Tooling:
- Added build.sh: A one-click script for ROCm compilation.
- Added rocm_start_server.sh: A unified script that handles stale process cleanup, system cache flushing, and optimized server launch.

Verification:
- Successfully validated with the rocm-regression long-context smoke test.
- End-to-end testing performed using DeepSeek-V4-Flash Q2-imatrix weights.
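
For readers unfamiliar with the dot-product intrinsic named under Hardware Intrinsics, the following is a minimal CPU reference sketch of what a 4-way int8 dot product with int32 accumulation computes. It is not code from this PR. The function names are hypothetical, and it assumes both operands are signed 8-bit lanes packed little-endian into one 32-bit word, with a wrapping (non-saturating) 32-bit accumulator. A model like this can be handy for checking kernel output on small inputs.

```python
import struct


def pack_i8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word (little-endian lanes)."""
    if len(values) != 4:
        raise ValueError("expected exactly four lanes")
    return struct.unpack("<I", struct.pack("<4b", *values))[0]


def unpack_i8x4(word):
    """Split a 32-bit word back into four signed 8-bit lanes."""
    return list(struct.unpack("<4b", struct.pack("<I", word & 0xFFFFFFFF)))


def dot4_i32_i8(a_word, b_word, acc=0):
    """Reference for one dot4 step: acc + sum(a[i] * b[i] for the 4 lanes).

    Assumption: signed lanes and a wrapping signed 32-bit accumulator.
    """
    a = unpack_i8x4(a_word)
    b = unpack_i8x4(b_word)
    total = (acc + sum(x * y for x, y in zip(a, b))) & 0xFFFFFFFF
    # Reinterpret the low 32 bits as a signed integer.
    return total - (1 << 32) if total >= (1 << 31) else total


def dot_i8(a, b):
    """Dot product of two int8 vectors, consumed four lanes at a time."""
    if len(a) != len(b) or len(a) % 4 != 0:
        raise ValueError("vectors must have equal length, a multiple of 4")
    acc = 0
    for i in range(0, len(a), 4):
        acc = dot4_i32_i8(pack_i8x4(a[i:i + 4]), pack_i8x4(b[i:i + 4]), acc)
    return acc
```

Compared with a scalar loop, each call to the reference step replaces four multiply-adds, which is the source of the speedup the intrinsic is meant to provide on hardware.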

Performance Benchmarks (AMD Strix Halo / Radeon Graphics)

Decoding Speed: 8.09 – 13.24 tokens/sec (Non-MTP, Zero-Copy mode).
Prefill Latency: ~4.45s for short prompts (post-warmup).
Startup: ~16s weight warmup for 83.60 GiB mapping.
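
As a quick sanity check on these figures, the sketch below converts the reported decoding rates into per-token latency and the reported warmup into an effective mapping rate. It only restates the numbers above; the helper names are illustrative and not part of the PR.

```python
def ms_per_token(tokens_per_sec):
    """Convert a decoding rate into average milliseconds per generated token."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return 1000.0 / tokens_per_sec


def warmup_rate_gib_per_sec(mapped_gib, seconds):
    """Effective rate at which the weight mapping was warmed up."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return mapped_gib / seconds


# Figures reported in this PR description.
DECODE_TOK_S = (8.09, 13.24)
WARMUP_GIB, WARMUP_S = 83.60, 16.0

if __name__ == "__main__":
    for rate in DECODE_TOK_S:
        print(f"{rate:.2f} tok/s -> {ms_per_token(rate):.1f} ms/token")
    print(f"warmup -> {warmup_rate_gib_per_sec(WARMUP_GIB, WARMUP_S):.2f} GiB/s")
```

The reported range corresponds to roughly 76 to 124 ms per token, and the warmup to about 5.2 GiB/s, keeping in mind that the 16 s figure is approximate.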

How to Test

Build: ./build.sh
Start: ./rocm_start_server.sh

Verify:

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "ds4flash",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 50
    }'
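
The same check can be scripted. The sketch below builds the identical request with the Python standard library. It assumes the server answers with an OpenAI-style body where the reply text lives at `choices[0].message.content`; that is inferred from the endpoint path and is not confirmed by this PR. The helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://127.0.0.1:8000",
                       model="ds4flash", max_tokens=50):
    """Build the same POST request as the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a parsed response.

    Assumption: OpenAI-style schema (choices[0].message.content).
    """
    return response_body["choices"][0]["message"]["content"]


def ask(prompt, timeout=120.0):
    """Send the prompt to a locally running server and return the reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With the server started via ./rocm_start_server.sh, calling `ask("Hello, how are you?")` should return the generated text. The generous default timeout is a guess meant to cover the first request after weight warmup.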

Summary of Work Done

Full Backend Port: Replaced all CUDA/cuBLAS APIs with HIP/hipBLAS equivalents.
Environmental Cleanup: Renamed all CUDA-specific environment variables to DS4_HIP_* (e.g., DS4_HIP_PREFILL_CHUNK).
Driver Compatibility: Added robust hipHostRegister fallbacks for diverse ROCm driver environments.
Unified Startup Flow: Fused cleanup and server launch into a single, reliable maintenance script.
Documentation Integrity: Updated README.md with dedicated ROCm onboarding instructions.

chihmin added 9 commits May 13, 2026 23:24

Migrate Linux GPU backend to ROCm/HIP for Strix Halo (gfx1151)

65836ad

Refine ROCm/CUDA build separation and clean up HIP backend

b01ab2a

Remove binary executable from repository

88a7e12

Remove binary data file

dab77b2

Port CUDA to ROCm/HIP, add build and unified start scripts, update RE…

5ed137a

…ADME

Remove legacy patches directory

349de47

Remove accidental log file 'udo sync'

c73fbeb

Remove perf.data

148d687

Add PR description markdown file

963cb87

colinyoyo26 approved these changes May 13, 2026

AyaSakura-comp wants to merge 9 commits into antirez:rocm from AyaSakura-comp:main
