Forked from the TorchTitan repo (https://github.com/pytorch/torchtitan).
The default branch is release-v0.2.2, which is the latest release that still supports the .toml config file method (the easiest for our use case).
Torchtitan is a PyTorch native platform designed for rapid experimentation and large-scale training of generative AI models. As a minimal clean-room implementation of PyTorch native scaling techniques, torchtitan provides a flexible foundation for developers to build upon. With torchtitan extension points, one can easily create custom extensions tailored to specific needs.
We will do multinode fine-tuning of Llama3_8b and Llama3_70b using the C4 dataset. The results section at the bottom of the page will be extended to cover multiple GPU types, cluster sizes, and batch sizes. The dataset can be downloaded during training itself, pre-downloaded to a local filesystem/NFS, or saved in Crusoe Object Storage.
We expect the following to be set up as part of your environment:
- Terraform Slurm solution on Crusoe: https://github.com/crusoecloud/slurm, or:
- Crusoe Managed Slurm (on Kubernetes): https://docs.crusoecloud.com/orchestration/slurm/overview
Other pre-requisites include:
- A Slurm cluster set up with compute nodes using any of the GPU VMs provided by Crusoe (as shown in the results section, most of our benchmarking was done on H100, H200, B200, and GB200)
- Shared home and/or data directories mounted to all the nodes, where you can clone the repo and install dependencies
- If desired, Crusoe object storage
- A Hugging Face account with access to the Llama 3.1 8B and 70B models
Please follow the setup for the specific GPU compute node types as shown below:
- Clone the torchtitan repository
git clone https://github.com/crusoecloud/torchtitan
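The remaining commands assume you are working from inside the cloned repo. If your clone did not land on the release-v0.2.2 default branch mentioned above, a minimal sketch of switching to it:

```bash
cd torchtitan
# release-v0.2.2 is the default branch of this fork; check it out explicitly if needed
git checkout release-v0.2.2
```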
- Create a Python virtual environment, activate it, and install dependencies. For GB200, create the virtual environment on one of the compute nodes (not on the login or head node)
# install uv if needed (uv is pre-installed on Crusoe Managed Slurm)
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
uv pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128 --force-reinstall # or cu130 for CUDA 13
uv pip install --pre torchtitan --index-url https://download.pytorch.org/whl/nightly/cu128 # or cu130 for CUDA 13
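As an optional sanity check before launching any Slurm jobs, you can confirm that the nightly PyTorch build sees the GPUs (run this on a compute node, e.g. via srun, since login nodes typically have no GPUs):

```bash
# prints the installed torch version, whether CUDA is available, and the visible GPU count
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
```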
- Download the Llama 3.1 8B and 70B tokenizers from Hugging Face
# Download tokenizers from Hugging Face
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=<your huggingface token>
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-70B --assets tokenizer --hf_token=<your huggingface token>
- Edit your chosen Slurm batch file (the 8B or 70B version) to match your cluster setup; a sketch of the fields you typically adjust is shown below
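As a rough sketch of the typical edits (the exact directive and variable names may differ in the .sbatch files shipped in this fork; CONFIG_FILE is used here only for illustration):

```bash
#SBATCH --nodes=4                 # number of compute nodes in the training job
#SBATCH --gpus-per-node=8         # 8 for H100/H200/B200 nodes, 4 for GB200.4x nodes
# point the job at the 8B or 70B .toml training config
CONFIG_FILE="<path to the llama3_8b or llama3_70b .toml config>"
```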
**For quicker performance on repeated training runs: download the C4 dataset to the cluster's /data volume**
cd /data
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
# this next step takes about an hour
git lfs pull --include "en/*"
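An optional quick check that the pull completed (the exact size and shard count depend on what git-lfs fetched for the en split):

```bash
cd /data/c4
du -sh en/        # total size of the English split on disk
ls en/ | wc -l    # number of shard files pulled by git-lfs
```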
Update the .toml training config files to set `dataset_path = "/data/c4/"` in the `[training]` section.
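For example, the relevant part of the [training] section would look roughly like this (a sketch; all other keys in the section, such as batch size and sequence length, stay as they are):

```toml
[training]
# existing keys (batch size, seq_len, steps, ...) remain unchanged
dataset_path = "/data/c4/"
```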
**For testing with Crusoe Object Storage:** Follow the steps above to create a local download of the C4 dataset tar.gz files, then copy those files into your Crusoe Object Storage bucket. Update `dataset_path` in the .toml training config files to point to the bucket containing the files, e.g. `s3://my-c4-bucket/`. Also update torchtitan/hf_datasets/test_datasets.py to change the S3 endpoint URL to match the object storage URL of the Crusoe Cloud region where you are testing (it is set to us-east1-a by default). S3 credentials and config are read from the standard locations ~/.aws/credentials and ~/.aws/config.
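If the credentials file does not exist yet, a minimal sketch of creating it is below; the key values are placeholders for the access key pair generated for your Crusoe Object Storage bucket:

```bash
mkdir -p ~/.aws
# placeholder values; substitute the access key pair for your bucket
cat > ~/.aws/credentials <<'EOF'
[default]
aws_access_key_id = <YOUR_ACCESS_KEY_ID>
aws_secret_access_key = <YOUR_SECRET_ACCESS_KEY>
EOF
```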
Run the job from the Slurm login node: `sbatch multinode_trainer.sbatch`. Run `watch squeue` to see that the job is running; if it does not reach the R (running) state, you may not have enough nodes in the idle state, or you may have requested resources that no node has (e.g. too many GPUs or CPUs per node). A few standard Slurm commands for diagnosing this are shown below.
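In this sketch, <jobid> is the job ID printed by sbatch when the job is submitted:

```bash
sbatch multinode_trainer.sbatch   # prints "Submitted batch job <jobid>"
watch squeue                      # wait for the job to reach the R (running) state
scontrol show job <jobid>         # the Reason field explains why a job is still pending
sinfo                             # check how many nodes are actually in the idle state
```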
When the job is running, its logs are written to slurm-<jobid>.out, where <jobid> is the ID of the Slurm job. The outputs below show the performance to be expected on a cluster where all nodes are in the same InfiniBand network or IMEX partition (for GB200).
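To follow the run live:

```bash
tail -f slurm-<jobid>.out   # the per-step metrics appear in the same format as the table below
```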
| GPU Type | NVIDIA Driver/CUDA | Compute nodes in training job | Batch size | Performance indicators at step 100 of the torchtitan test described on this page |
|---|---|---|---|---|
| GB200 | 580/CUDA 13.0 | 17 (68 GPU) | 5 | step: 100 loss: 6.4277 grad_norm: 2.9062 memory: 158.61GiB(86.20%) tps: 17,581 tflops: 1,018.20 mfu: 45.25% |
| GB200 | 580/CUDA 13.0 | 9 (36 GPU) | 5 | step: 100 loss: 8.2954 grad_norm: 29.3788 memory: 159.31GiB(86.58%) tps: 17,975 tflops: 1,040.99 mfu: 46.27% |
| GB200 | 580/CUDA 13.0 | 3 (12 GPU) | 5 | step: 100 loss: 6.5572 grad_norm: 3.6680 memory: 161.65GiB(87.85%) tps: 18,284 tflops: 1,058.93 mfu: 47.06% |
| B200 | 580/CUDA 13.0 | 2 (16 GPU) | 5 | step: 100 loss: 6.1751 grad_norm: 1.4532 memory: 161.51GiB(90.56%) tps: 17,500 tflops: 1,013.53 mfu: 45.05% |
| B200 | 580/CUDA 13.0 | 1 (8 GPU) | 5 | step: 100 loss: 6.3314 grad_norm: 4.0237 memory: 165.16GiB(92.60%) tps: 17,469 tflops: 1,011.68 mfu: 44.96% |
| H200 | 570/CUDA 12.8 | 16 (128 GPU) | 4 | step: 100 loss: 7.2984 grad_norm: 5.6297 memory: 128.17GiB(91.66%) tps: 9,044 tflops: 523.76 mfu: 52.96% |
| H200 | 570/CUDA 12.8 | 8 (64 GPU) | 4 | step: 100 loss: 6.14398 grad_norm: 1.8245 memory: 128.79GiB(92.11%) tps: 8,844 tflops: 512.21 mfu: 51.79% |
| H200 | 570/CUDA 12.8 | 4 (32 GPU) | 4 | step: 100 loss: 6.15755 grad_norm: 1.5348 memory: 129.80GiB(92.84%) tps: 8,895 tflops: 515.14 mfu: 52.09% |
| H100 | 570/CUDA 12.8 | 16 (128 GPU) | 2 | step: 100 loss: 6.3626 grad_norm: 5.2285 memory: 66.38GiB(83.83%) tps: 7,169 tflops: 415.21 mfu: 41.98% |
| H100 | 570/CUDA 12.8 | 4 (32 GPU) | 2 | step: 100 loss: 6.35632 grad_norm: 4.4844 memory: 67.96GiB(85.82%) tps: 7,582 tflops: 439.14 mfu: 44.40% |
| H100 | 580/CUDA 13.0 | 2 (16 GPU) | 2 | step: 100 loss: 6.47847 grad_norm: 2.8199 memory: 69.85GiB(88.21%) tps: 8,714 tflops: 504.69 mfu: 51.03% |
When stress testing nodes and looking for issues, we look for consistent results across all GPUs, plus high memory utilization and MFU; the step-100 metrics can be pulled out of each job log as shown below.
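For example, to print the step-100 metrics from every job log in the current directory for a side-by-side comparison (assuming the slurm-<jobid>.out naming used above):

```bash
grep -h "step: 100 " slurm-*.out
```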
Cluster setups: 2 x B200.8x nodes vs 4 x GB200.4x nodes; the C4 training dataset stored on the same shared volume.
Wall clock time for 1000 steps:
- B200: 40 min 58 sec
- GB200: 48 min 42 sec