Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm
Jinrui Zhang¹,²
¹Department of Computing, The Hong Kong Polytechnic University
²OPPO Research Institute
📧 jin-rui.zhang@connect.polyu.hk
SPES (SParse Expert Sync) is a memory-efficient decentralized training framework for pretraining Mixture-of-Experts (MoE) LLMs across geographically distributed GPU nodes.
Unlike conventional paradigms that demand high-bandwidth interconnects, SPES enables collaborative pretraining of MoE models in which nodes operate semi-independently.
| Feature | Description |
|---|---|
| Decentralized Training | Operates without high-speed cross-node interconnects. Each node functions as an independent training unit with local DDP. |
| Memory Efficiency | Nodes only maintain gradients/optimizer states for their local subset of experts, drastically reducing the memory footprint. |
| Sparse Sync | Uses a lightweight gRPC parameter server to periodically synchronize only the trained parameters. |
| Smart Merging | Implements weighted merging with a decaying alpha schedule to ensure stable convergence during knowledge transfer. |
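The exact merge schedule lives in the training code; as a rough sketch of the "decaying alpha" idea above, the server could blend each incoming expert update into its current copy with a client weight that shrinks over sync rounds. `alpha0` and `decay` below are illustrative values, not SPES defaults:

```python
import torch

def merge_expert(server_param: torch.Tensor,
                 client_param: torch.Tensor,
                 sync_round: int,
                 alpha0: float = 0.5,
                 decay: float = 0.99) -> torch.Tensor:
    """Blend a client's freshly trained expert weights into the server copy.

    The client's contribution decays over sync rounds, so early merges
    transfer knowledge aggressively while later merges perturb the
    converged experts less (hypothetical schedule, for illustration only).
    """
    alpha = alpha0 * (decay ** sync_round)  # decaying client weight
    return alpha * client_param + (1.0 - alpha) * server_param
```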
- Release training code
- Release pretrained model checkpoints
- Release training logs
Pretrained checkpoints are available in the SPES Hugging Face collection.
| Model | Description | Weights |
|---|---|---|
| SPES-2B | 2B model trained from scratch. | 🤗 Hugging Face |
| SPES-7B | 7B model trained from scratch. | 🤗 Hugging Face |
| SPES-9B | 9B model initialized from Qwen3-1.7B. | 🤗 Hugging Face |
- Python: `>= 3.10`
- CUDA: `>= 12.1` (tested on 12.4)
- PyTorch: `2.5.1`
- Hardware: NVIDIA GPUs (tested on A100/A800/L40S)
```bash
# 1. Clone the repository
git clone https://github.com/zjr2000/SPES.git
cd SPES

# 2. Install PyTorch (adjust the CUDA version if necessary)
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124

# 3. Install SPES and core dependencies
pip install -e '.[all]'

# 4. Install gRPC components
pip install grpcio==1.73.1 grpcio-tools==1.73.1 protobuf==6.31.0
```
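After installation, a quick sanity check of the stack (a generic snippet, not an SPES utility):

```python
import sys
import torch

# Verify interpreter, PyTorch/CUDA build, and visible GPUs before training.
assert sys.version_info >= (3, 10), "SPES requires Python >= 3.10"
print(f"PyTorch {torch.__version__}, built for CUDA {torch.version.cuda}")
assert torch.cuda.is_available(), "no CUDA device visible"
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```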
To run benchmarks using the LM Evaluation Harness:

```bash
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
pip install "lm_eval[hf]"
```

SPES uses tokenized NumPy memmap files (`.npy`) for high-performance data loading.
Convert your `.jsonl` or `.parquet` files using the provided script:

```bash
python data_process_scripts/tokenize_data.py \
    --file_glob "/path/to/your/data/*.jsonl" \
    --tokenizer_name_or_path "Qwen/Qwen2.5-0.5B" \
    --output_prefix "/path/to/output/tokenized_" \
    --text_field "text" \
    --processes 8 \
    --batch_size 500 \
    --max_shard_bytes 4294967296 \
    --dtype "uint32"
```
Create a manifest file for the training configuration:

```bash
bash data_process_scripts/list_processed_files.sh /path/to/tokenized/data /path/to/output/file_list.txt
```

Point your YAML configuration file (in `configs/`) to `file_list.txt`.
SPES uses a Client-Server architecture:
- Parameter Server: Manages expert synchronization.
- Training Clients: Independent nodes performing local training.
Key SPES parameters in your YAML config:

```yaml
using_spes: true
spes_config:
  num_peers: 4                   # total training nodes
  peer_id: 0                     # current node ID (0-indexed)
  num_train_experts_per_node: 2  # local experts per node
  sync_steps: 100                # sync every N steps
  server_addr: 127.0.0.1:50051   # parameter server address
```
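The wire protocol is defined by the repository's gRPC service; purely to illustrate the control flow these parameters drive, here is a minimal, transport-agnostic sketch of a client's training loop. `push_experts`, `pull_experts`, and the `experts.{i}.` parameter-name pattern are hypothetical stand-ins, not SPES APIs:

```python
import torch

SYNC_STEPS = 100           # mirrors `sync_steps` in the YAML config
LOCAL_EXPERT_IDS = [0, 1]  # this node's experts (`num_train_experts_per_node: 2`)

def train_loop(model, optimizer, data_loader, push_experts, pull_experts):
    """Local training with periodic sparse expert synchronization.

    `push_experts(state_dict)` / `pull_experts()` are placeholders for the
    real gRPC calls; only the locally owned experts cross the network,
    which is what keeps synchronization traffic sparse.
    """
    for step, batch in enumerate(data_loader, start=1):
        loss = model(**batch).loss  # assumes an HF-style forward pass
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if step % SYNC_STEPS == 0:
            # Upload only the experts this node trains...
            local_state = {
                name: p.detach().cpu()
                for name, p in model.named_parameters()
                if any(f"experts.{i}." in name for i in LOCAL_EXPERT_IDS)
            }
            push_experts(local_state)
            # ...then refresh all experts with the server's merged copy.
            model.load_state_dict(pull_experts(), strict=False)
```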
1. Start Parameter Server

```bash
bash run_scripts/run_parameter_server.sh
```

2. Start Training Clients (on each node)
```bash
# Example: launching on node 1
bash run_scripts/run_single_node.sh 1

# Optional: resume from checkpoint
bash run_scripts/run_single_node.sh 0 --resume
```

For SLURM or other schedulers where `RANK`, `MASTER_ADDR`, and `NPROC_PER_NODE` are set automatically:
```bash
bash run_scripts/run_cluster.sh
```

This script automatically handles server startup on rank 0 and isolates DDP to the local node.
Convert the sharded FSDP checkpoints to Hugging Face format:

```bash
# Syntax: <RUN_DIR> <SAVE_STEP> <MODEL_SIZE>
bash eval_scripts/convet_model_to_hf_unshard.sh output/spes_moe_3b_9b/node0 10000 A3B-9B
```
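Once converted, the checkpoint should load like any Hugging Face causal LM. A standard `transformers` snippet (the output directory name is illustrative, and the conversion script's actual layout may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Directory written by the conversion script above (hypothetical path).
model_dir = "output/spes_moe_3b_9b/node0/hf_step10000"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
# Add trust_remote_code=True if the converted MoE uses custom modeling code.
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto")

prompt = "Decentralized pretraining works by"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```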
Evaluate using lm-evaluation-harness:

```bash
bash eval_scripts/eval_full.sh <MODEL_PATH> <MODEL_NAME>
```

Feel free to open an issue or email us if you have any questions!
Email: jin-rui.zhang@connect.polyu.hk
If you find SPES useful in your research, please consider citing:
```bibtex
@article{zhang2026pretraining,
  title={Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm},
  author={Zhang, Jinrui and Xiao, Chaodong and Wu, Aoqi and Zhang, Xindong and Zhang, Lei},
  journal={arXiv preprint arXiv:2602.11543},
  year={2026}
}
```

This project stands on the shoulders of giants. We explicitly thank the following projects and teams:
- OLMo (Allen Institute for AI): Our codebase is built upon the excellent modeling, training, and inference code provided by the Ai2 team.
- MegaBlocks (Databricks): We utilize MegaBlocks for efficient "dropless" Mixture-of-Experts (MoE) training and sparse operations.
- LM Evaluation Harness (EleutherAI): Used for our few-shot evaluation framework and benchmarking.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.