Comprehensive learning aims to achieve better imitation by training on a comprehensive set of samples.
Polymath learning aims to maximize data efficiency in unlocking the reasoning ability of LLMs: the model learns intensively from one high-quality sample to lift its broader reasoning ability.
- Data Efficiency: Achieves superior performance compared to training with datasets of thousands of samples by focusing on sample quality and design rather than quantity.
- Cross-Domain Generalization: A single mathematical reasoning sample can trigger broad improvements in domains far beyond its original subject.
- Sample Engineering: The deliberate selection and synthesis of training samples to unlock model capabilities efficiently.
- The natural polymath samples are selected from the MATH training set.
- The synthetic polymath samples are generated by directly instructing strong LLMs.
git clone https://github.com/GAIR-NLP/polymath-learning.git
cd polymath-learning
conda create -n polymath
conda activate polymath
pip install -r requirements.txt

Set up the configuration in train/config/exp.conf, especially:
- DATASET_NAME: the dataset to train on (should match a folder in train/data; add customized ones there).
- WANDB_DIR: the path to save the wandb results (create it if necessary).
- POLICY_MODEL: name of the model.
- POLICY_MODEL_PATH: path to the model checkpoint.
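A minimal sketch of what exp.conf might look like. The key names come from the list above; all values are placeholders, not settings shipped with the repo.

```shell
# Hypothetical exp.conf — values are illustrative placeholders
DATASET_NAME=polymath_synthetic              # must match a folder under train/data
WANDB_DIR=./wandb_logs                       # create this directory first
POLICY_MODEL=Qwen2.5-7B-base
POLICY_MODEL_PATH=/path/to/checkpoints/Qwen2.5-7B-base
```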
./code/train/train.sh ./code/train/config/<config>.conf

where <config>.conf is the name of the configuration file you created.
python code/eval/eval.py --result_path=<path to the result json file>

Customize --open_ended_sources and --mcp_sources to add or remove sources with open-ended answers or multiple-choice answers.
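If you need to post-process a result file yourself, a per-source accuracy tally is straightforward. This is a hedged sketch: the JSON schema below (a list of records with `source`, `prediction`, and `answer` fields) is an assumption for illustration, not the repo's actual output format.

```python
import json

def accuracy_by_source(result_path):
    """Compute per-source accuracy from a result JSON file.

    Assumed (hypothetical) schema: a list of records, each with
    "source", "prediction", and "answer" string fields.
    """
    with open(result_path) as f:
        records = json.load(f)
    totals, correct = {}, {}
    for r in records:
        src = r["source"]
        totals[src] = totals.get(src, 0) + 1
        # Exact-match scoring; swap in a normalizer for open-ended answers
        if r["prediction"] == r["answer"]:
            correct[src] = correct.get(src, 0) + 1
    return {s: correct.get(s, 0) / totals[s] for s in totals}
```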
Polymath learning with either the natural samples or the synthetic sample demonstrates stronger multidiscipline reasoning ability on Qwen2.5-7B-base than training with thousands of samples (comprehensive learning).

Comparison between polymath learning (Synthetic Prime, Prealgebra), other one-shot samples (
@misc{li2026sampleruleallextreme,
title={One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling},
author={Yiyuan Li and Zhen Huang and Yanan Wu and Weixun Wang and Xuefeng Li and Yijia Luo and Wenbo Su and Bo Zheng and Pengfei Liu},
year={2026},
eprint={2601.03111},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.03111},
}
