
Polymath Learning

One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling

Comprehensive learning aims to train on a comprehensive set of samples to achieve better imitation.

Polymath learning aims to maximize data efficiency in unlocking the reasoning ability of LLMs by learning intensively from one high-quality sample to lift broader reasoning ability.

Highlights

  • Data Efficiency: Achieves superior performance compared to training with datasets of thousands of samples by focusing on sample quality and design rather than quantity.
  • Cross-Domain Generalization: A single mathematical reasoning sample can trigger broad improvements in domains far beyond its original subject.
  • Sample Engineering: The deliberate selection and synthesis of training samples to unlock model capabilities efficiently.

Polymath Samples

  • The natural polymath samples are selected from the training set of MATH.
  • The synthetic polymath samples are generated by directly instructing strong LLMs.

Training

Environment Setup

git clone https://github.com/GAIR-NLP/polymath-learning.git
cd polymath-learning

conda create -n polymath
conda activate polymath
pip install -r requirements.txt

Configuration

Set up the configuration in train/config/exp.conf, in particular:

  • DATASET_NAME: the dataset to conduct training (should match the folder in train/data or add customized ones).
  • WANDB_DIR: the path to save the wandb result (create it if necessary).
  • POLICY_MODEL: name of the model.
  • POLICY_MODEL_PATH: path to the model checkpoint.
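As a sketch, an exp.conf might contain key-value pairs like the following; the specific values are illustrative assumptions, not defaults shipped with the repository:

```shell
# Illustrative exp.conf sketch -- all values below are assumptions, not repository defaults.
DATASET_NAME=polymath_prealgebra        # must match a folder under train/data (or a customized one)
WANDB_DIR=./wandb_logs                  # create this directory if it does not exist
POLICY_MODEL=Qwen2.5-7B-base            # name of the policy model
POLICY_MODEL_PATH=/path/to/checkpoints/Qwen2.5-7B-base  # local path to the model checkpoint
```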

Experiments

./code/train/train.sh ./code/train/config/<config>.conf

where <config>.conf specifies the name of the configuration file you create.

Evaluation

python code/eval/eval.py --result_path=<path to the result json file>

Customize --open_ended_sources and --mcp_sources to add or remove sources with open-ended answers or multiple-choice answers.

Results

Polymath learning with either natural samples or the synthetic sample demonstrates stronger multidiscipline reasoning ability than training with thousands of samples (comprehensive learning) on Qwen2.5-7B-Base.

Performance Breakdown

Comparison between polymath learning (Synthetic Prime, Prealgebra), another one-shot sample ($\pi_{1}$) from DeepScaleR, and comprehensive learning (MATH, LIMR).

Training Dynamics

Citation

@misc{li2026sampleruleallextreme,
      title={One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling}, 
      author={Yiyuan Li and Zhen Huang and Yanan Wu and Weixun Wang and Xuefeng Li and Yijia Luo and Wenbo Su and Bo Zheng and Pengfei Liu},
      year={2026},
      eprint={2601.03111},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.03111}, 
}
