
Polymath Learning

One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling

Comprehensive learning aims to train on a comprehensive set of samples to achieve better imitation.

Polymath learning aims to maximize data efficiency in unlocking the reasoning ability of LLMs by learning intensively from one high-quality sample to lift broader reasoning ability.

Highlights

  • Data Efficiency: Achieves superior performance compared to training with datasets of thousands of samples by focusing on sample quality and design rather than quantity.
  • Cross-Domain Generalization: A single mathematical reasoning sample can trigger broad improvements in domains far beyond its original subject.
  • Sample Engineering: The deliberate selection and synthesis of training samples to unlock model capabilities efficiently.

Polymath Samples

  • The natural polymath samples are selected from the training set of MATH.
  • The synthetic polymath samples are generated by directly instructing strong LLMs.

Training

Environment Setup

git clone https://github.com/GAIR-NLP/polymath-learning.git
cd polymath-learning

conda create -n polymath
conda activate polymath
pip install -r requirements.txt

Configuration

Set up the configuration in train/config/exp.conf, in particular:

  • DATASET_NAME: the dataset to conduct training (should match the folder in train/data or add customized ones).
  • WANDB_DIR: the path to save the wandb result (create it if necessary).
  • POLICY_MODEL: name of the model.
  • POLICY_MODEL_PATH: path to the model checkpoint.
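As a sketch, an exp.conf might contain key-value pairs like the following; the specific values are illustrative assumptions, not defaults shipped with the repository:

```shell
# Illustrative exp.conf sketch -- all values below are assumptions, not repository defaults.
DATASET_NAME=polymath_prealgebra        # must match a folder under train/data (or a customized one)
WANDB_DIR=./wandb_logs                  # create this directory if it does not exist
POLICY_MODEL=Qwen2.5-7B-base            # name of the policy model
POLICY_MODEL_PATH=/path/to/checkpoints/Qwen2.5-7B-base  # local path to the model checkpoint
```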

Experiments

./code/train/train.sh ./code/train/config/<config>.conf

where <config>.conf specifies the name of the configuration file you create.

Evaluation

python code/eval/eval.py --result_path=<path to the result json file>

Customize --open_ended_sources and --mcp_sources to add or remove sources with open-ended answers or multiple-choice answers.

Results

Polymath learning with either natural samples or the synthetic sample demonstrates stronger multidiscipline reasoning ability than training with thousands of samples (comprehensive learning) on Qwen2.5-7B-Base.

Performance Breakdown

Comparison between polymath learning (Synthetic Prime, Prealgebra), another one-shot sample ($\pi_{1}$) from DeepScaleR, and comprehensive learning (MATH, LIMR).

Training Dynamics

Citation

@misc{li2026sampleruleallextreme,
      title={One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling}, 
      author={Yiyuan Li and Zhen Huang and Yanan Wu and Weixun Wang and Xuefeng Li and Yijia Luo and Wenbo Su and Bo Zheng and Pengfei Liu},
      year={2026},
      eprint={2601.03111},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.03111}, 
}
