Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm
Jinrui Zhang¹,²
¹Department of Computing, The Hong Kong Polytechnic University
²OPPO Research Institute
📧 jin-rui.zhang@connect.polyu.hk
SPES (SParse Expert Sync) is a memory-efficient decentralized training framework for pretraining Mixture-of-Experts (MoE) LLMs across geographically distributed GPU nodes.
Unlike conventional paradigms that demand high-bandwidth interconnects, SPES enables collaborative pretraining of MoE models in which nodes operate semi-independently.
| Feature | Description |
|---|---|
| Decentralized Training | Operates without high-speed cross-node interconnects. Each node functions as an independent training unit with local DDP. |
| Memory Efficiency | Nodes only maintain gradients/optimizer states for their local subset of experts, drastically reducing the memory footprint. |
| Sparse Sync | Uses a lightweight gRPC parameter server to periodically synchronize only the trained parameters. |
| Smart Merging | Implements weighted merging with a decaying alpha schedule to ensure stable convergence during knowledge transfer. |
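The exact merge schedule lives in the training code; as a rough sketch of the "decaying alpha" idea above, the server could blend each incoming expert update into its current copy with a client weight that shrinks over sync rounds. `alpha0` and `decay` below are illustrative values, not SPES defaults:

```python
import torch

def merge_expert(server_param: torch.Tensor,
                 client_param: torch.Tensor,
                 sync_round: int,
                 alpha0: float = 0.5,
                 decay: float = 0.99) -> torch.Tensor:
    """Blend a client's freshly trained expert weights into the server copy.

    The client's contribution decays over sync rounds, so early merges
    transfer knowledge aggressively while later merges perturb the
    converged experts less (hypothetical schedule, for illustration only).
    """
    alpha = alpha0 * (decay ** sync_round)  # decaying client weight
    return alpha * client_param + (1.0 - alpha) * server_param
```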
- Release training code
- Release pretrained model checkpoints
- Release training logs
Pretrained checkpoints are available in the SPES Hugging Face collection.
| Model | Description | Weights |
|---|---|---|
| SPES-2B | 2B model trained from scratch. | 🤗 Hugging Face |
| SPES-7B | 7B model trained from scratch. | 🤗 Hugging Face |
| SPES-9B | 9B model initialized from Qwen3-1.7B. | 🤗 Hugging Face |
- Python: `>= 3.10`
- CUDA: `>= 12.1` (tested on 12.4)
- PyTorch: `2.5.1`
- Hardware: NVIDIA GPUs (tested on A100/A800/L40S)
```bash
# 1. Clone the repository
git clone https://github.com/zjr2000/SPES.git
cd SPES

# 2. Install PyTorch (adjust the CUDA version if necessary)
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124

# 3. Install SPES and core dependencies
pip install -e '.[all]'

# 4. Install gRPC components
pip install grpcio==1.73.1 grpcio-tools==1.73.1 protobuf==6.31.0
```
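After installation, a quick sanity check of the stack (a generic snippet, not an SPES utility):

```python
import sys
import torch

# Verify interpreter, PyTorch/CUDA build, and visible GPUs before training.
assert sys.version_info >= (3, 10), "SPES requires Python >= 3.10"
print(f"PyTorch {torch.__version__}, built for CUDA {torch.version.cuda}")
assert torch.cuda.is_available(), "no CUDA device visible"
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```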
To run benchmarks using the LM Evaluation Harness:

```bash
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
pip install "lm_eval[hf]"
```

SPES uses tokenized NumPy memmap files (`.npy`) for high-performance data loading.
Convert your `.jsonl` or `.parquet` files using the provided script:

```bash
python data_process_scripts/tokenize_data.py \
    --file_glob "/path/to/your/data/*.jsonl" \
    --tokenizer_name_or_path "Qwen/Qwen2.5-0.5B" \
    --output_prefix "/path/to/output/tokenized_" \
    --text_field "text" \
    --processes 8 \
    --batch_size 500 \
    --max_shard_bytes 4294967296 \
    --dtype "uint32"
```
Create a manifest file for the training configuration:

```bash
bash data_process_scripts/list_processed_files.sh /path/to/tokenized/data /path/to/output/file_list.txt
```

Point your YAML configuration file (in `configs/`) to `file_list.txt`.
SPES uses a Client-Server architecture:
- Parameter Server: Manages expert synchronization.
- Training Clients: Independent nodes performing local training.
Key SPES parameters in your YAML config:

```yaml
using_spes: true
spes_config:
  num_peers: 4                   # total training nodes
  peer_id: 0                     # current node ID (0-indexed)
  num_train_experts_per_node: 2  # local experts per node
  sync_steps: 100                # sync every N steps
  server_addr: 127.0.0.1:50051   # parameter server address
```
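The wire protocol is defined by the repository's gRPC service; purely to illustrate the control flow these parameters drive, here is a minimal, transport-agnostic sketch of a client's training loop. `push_experts`, `pull_experts`, and the `experts.{i}.` parameter-name pattern are hypothetical stand-ins, not SPES APIs:

```python
import torch

SYNC_STEPS = 100           # mirrors `sync_steps` in the YAML config
LOCAL_EXPERT_IDS = [0, 1]  # this node's experts (`num_train_experts_per_node: 2`)

def train_loop(model, optimizer, data_loader, push_experts, pull_experts):
    """Local training with periodic sparse expert synchronization.

    `push_experts(state_dict)` / `pull_experts()` are placeholders for the
    real gRPC calls; only the locally owned experts cross the network,
    which is what keeps synchronization traffic sparse.
    """
    for step, batch in enumerate(data_loader, start=1):
        loss = model(**batch).loss  # assumes an HF-style forward pass
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if step % SYNC_STEPS == 0:
            # Upload only the experts this node trains...
            local_state = {
                name: p.detach().cpu()
                for name, p in model.named_parameters()
                if any(f"experts.{i}." in name for i in LOCAL_EXPERT_IDS)
            }
            push_experts(local_state)
            # ...then refresh all experts with the server's merged copy.
            model.load_state_dict(pull_experts(), strict=False)
```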
1. Start Parameter Server

```bash
bash run_scripts/run_parameter_server.sh
```

2. Start Training Clients (on each node)
```bash
# Example: launching on node 1
bash run_scripts/run_single_node.sh 1

# Optional: resume from checkpoint
bash run_scripts/run_single_node.sh 0 --resume
```

For SLURM or other schedulers where `RANK`, `MASTER_ADDR`, and `NPROC_PER_NODE` are set automatically:
```bash
bash run_scripts/run_cluster.sh
```

This script automatically handles server startup on rank 0 and isolates DDP to the local node.
Convert the sharded FSDP checkpoints to Hugging Face format:

```bash
# Syntax: <RUN_DIR> <SAVE_STEP> <MODEL_SIZE>
bash eval_scripts/convet_model_to_hf_unshard.sh output/spes_moe_3b_9b/node0 10000 A3B-9B
```
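Once converted, the checkpoint should load like any Hugging Face causal LM. A standard `transformers` snippet (the output directory name is illustrative, and the conversion script's actual layout may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Directory written by the conversion script above (hypothetical path).
model_dir = "output/spes_moe_3b_9b/node0/hf_step10000"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
# Add trust_remote_code=True if the converted MoE uses custom modeling code.
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto")

prompt = "Decentralized pretraining works by"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```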
Evaluate using lm-evaluation-harness:

```bash
bash eval_scripts/eval_full.sh <MODEL_PATH> <MODEL_NAME>
```

Feel free to open an issue or email us if you have any questions!
Email: jin-rui.zhang@connect.polyu.hk
If you find SPES useful in your research, please consider citing:
```bibtex
@article{zhang2026pretraining,
  title={Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm},
  author={Zhang, Jinrui and Xiao, Chaodong and Wu, Aoqi and Zhang, Xindong and Zhang, Lei},
  journal={arXiv preprint arXiv:2602.11543},
  year={2026}
}
```

This project stands on the shoulders of giants. We explicitly thank the following projects and teams:
- OLMo (Allen Institute for AI): Our codebase is built upon the excellent modeling, training, and inference code provided by the Ai2 team.
- MegaBlocks (Databricks): We utilize MegaBlocks for efficient "dropless" Mixture-of-Experts (MoE) training and sparse operations.
- LM Evaluation Harness (EleutherAI): Used for our few-shot evaluation framework and benchmarking.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.