Distributed LLM training: code samples

Code samples showing how to distribute LLM training across GPUs and nodes. The samples are written from first principles.

Files

train_ffns.py: distributed training of a Transformer's FFN sub-blocks. Currently implemented: DDP (distributed data parallel), FSDP (fully sharded data parallel) and TP (tensor parallel).
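
To make the idea behind two of these strategies concrete, here is a minimal single-process sketch in NumPy. It is not the code in train_ffns.py. It only simulates the "ranks" with array shards, and every name in it (`world_size`, `ffn`, `grad_w2`, the tensor sizes, the toy loss) is an assumption chosen for illustration. It checks two identities: splitting the first FFN weight by columns and the second by rows and summing the partial outputs reproduces the full FFN output (TP), and averaging per-shard gradients over equal batch shards reproduces the full-batch gradient (DDP). FSDP is left out because its distinguishing feature, gathering sharded parameters on demand, needs real collectives to show meaningfully.

```python
import numpy as np

rng = np.random.default_rng(0)
world_size = 2                     # hypothetical number of simulated ranks
batch, d_model, d_ff = 4, 8, 16    # illustrative sizes, not from the repo

x = rng.standard_normal((batch, d_model))
w1 = rng.standard_normal((d_model, d_ff))
w2 = rng.standard_normal((d_ff, d_model))


def ffn(x, w1, w2):
    """Reference FFN: linear -> ReLU -> linear (biases omitted)."""
    return np.maximum(x @ w1, 0.0) @ w2


def grad_w2(x, w1, w2):
    """Gradient w.r.t. w2 of the toy loss mean_i(0.5 * ||ffn(x_i)||^2)."""
    h = np.maximum(x @ w1, 0.0)
    y = h @ w2
    return h.T @ y / x.shape[0]


# Tensor parallelism: each simulated rank holds a column slice of w1 and the
# matching row slice of w2. ReLU is elementwise, so it can be applied per
# slice; the partial outputs are then summed, standing in for an all-reduce.
w1_shards = np.split(w1, world_size, axis=1)
w2_shards = np.split(w2, world_size, axis=0)
partials = [np.maximum(x @ a, 0.0) @ b for a, b in zip(w1_shards, w2_shards)]
tp_out = sum(partials)

# Data parallelism: each simulated rank holds full weights and an equal slice
# of the batch. Averaging the per-rank gradients stands in for an all-reduce
# followed by division by world_size.
x_shards = np.split(x, world_size, axis=0)
ddp_grad = sum(grad_w2(xs, w1, w2) for xs in x_shards) / world_size

full_out = ffn(x, w1, w2)
full_grad = grad_w2(x, w1, w2)
print("TP output matches full FFN:", np.allclose(tp_out, full_out))
print("DDP gradient matches full batch:", np.allclose(ddp_grad, full_grad))
```

In a real multi-GPU run the two sums above would be collective operations across processes, and the DDP identity holds as written only when every rank receives the same number of samples.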

README.md
setup.sh
test_mp_barrier_gpus.py
test_nccl.py
test_torch_cuda_stream.py
test_torch_distributed.py
train_ffns.py