
Add RCCL clustering playbook for two STX Halo systems #195

Draft
abdmalik-amd wants to merge 1 commit into main from abdmalik/clustering-rccl

@abdmalik-amd
Collaborator

  • Adds a new playbook for clustering two STX Halo systems using RCCL with vLLM
  • Covers VRAM allocation (amd-ttm), vLLM installation with ROCm, Ray cluster setup, and serving Qwen3.5-397B-A17B-GPTQ-Int4 across two nodes via tensor parallelism
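The steps listed above can be sketched roughly as follows. This is a minimal illustration and not the playbook's actual commands: the head-node IP, the Ray port, and the choice of tensor-parallel size 2 (assuming one GPU per node) are assumptions, and the model's exact repository ID is not given in this PR. The `amd-ttm` VRAM allocation step is not modeled here. The helper also works through the weight-size arithmetic implied by the model name (397B parameters at 4 bits each), which is a lower bound that ignores quantization metadata, activations, and KV cache.

```python
import shlex

# Assumptions (not taken from the playbook): placeholder IP, Ray's default
# head port, and one GPU per node so tensor-parallel size 2 spans both nodes.
HEAD_IP = "192.0.2.10"  # hypothetical address of the head node
RAY_PORT = 6379         # Ray's default head port
MODEL = "Qwen3.5-397B-A17B-GPTQ-Int4"  # name from the PR; exact repo ID not given


def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    """Lower-bound size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def ray_head_cmd(port: int = RAY_PORT) -> list[str]:
    """Command to start the Ray head process on the first node."""
    return ["ray", "start", "--head", f"--port={port}"]


def ray_worker_cmd(head_ip: str = HEAD_IP, port: int = RAY_PORT) -> list[str]:
    """Command to join the second node to the head's Ray cluster."""
    return ["ray", "start", f"--address={head_ip}:{port}"]


def vllm_serve_cmd(model: str = MODEL, tp_size: int = 2) -> list[str]:
    """Command to serve the model with tensor parallelism over Ray."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--distributed-executor-backend", "ray",
    ]


if __name__ == "__main__":
    total = weight_footprint_gb(397, 4)
    print(f"weights: at least {total} GB total, roughly {total / 2} GB per node")
    for cmd in (ray_head_cmd(), ray_worker_cmd(), vllm_serve_cmd()):
        print(shlex.join(cmd))
```

The sketch only prints the commands so they can be reviewed before being run on each node; the per-node figure assumes the weights shard evenly across the two nodes, which tensor parallelism only approximates.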

Note: This playbook is still being tested. It was last validated with RCCL via rocm-systems and the vLLM 0.17.0+rocm700 wheels, so the PR is marked as a draft accordingly.

@abdmalik-amd marked this pull request as draft on April 2, 2026, 15:01
