- [2026.02] Our paper is now available on arXiv! Check it out at arXiv:2602.05630
REAL (Rewards as Labels) is a novel framework for Reinforcement Learning with Verifiable Rewards (RLVR) that reformulates policy optimization as a classification problem. By treating verifiable rewards as categorical labels rather than scalar weights, REAL addresses fundamental gradient mismatches in existing GRPO-style methods and achieves superior training stability and performance on mathematical reasoning tasks.
Figure: Overview of our REAL framework
- 🎯 Identifies Critical Issues: Reveals Gradient Misassignment in Positives and Gradient Domination in Negatives in GRPO-style methods
- 🔄 Novel Perspective: Reformulates RLVR as classification by treating rewards as categorical labels
- 📈 State-of-the-Art Performance:
  - +6.7% over DAPO on 1.5B models
  - +6.2% over DAPO and +1.7% over GSPO on 7B models
- 🛡️ Superior Stability: Maintains stable training without entropy collapse or explosion
Figure: Performance comparison on the DeepScaleR-Preview-Dataset using the DeepSeek-R1-Distill-Qwen-1.5B / 7B backbone
Figure: Training dynamics comparison - REAL maintains stable entropy and achieves consistent improvement
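REAL reformulates policy optimization by treating verifiable rewards as categorical labels rather than scalar weights, so that training becomes a classification problem; the objective and its notation are defined in the paper (arXiv:2602.05630).
REAL induces monotonic and bounded gradient weighting; the exact upper bound on the magnitude is given in the paper.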
This effectively mitigates gradient issues in GRPO while ensuring stable optimization.
Figure: Gradient magnitude comparison - REAL provides monotonic and bounded gradients
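As a generic illustration of why a classification loss yields monotonic and bounded gradient weights, consider the standard binary logistic loss on a hypothetical scalar score $s$ with label $y \in \{0, 1\}$. This is an assumption made for intuition only and is not necessarily REAL's objective, whose exact form and bound are in the paper.

$$
\ell(s, y) = -y \log \sigma(s) - (1 - y) \log\bigl(1 - \sigma(s)\bigr), \qquad \sigma(s) = \frac{1}{1 + e^{-s}}
$$

$$
\frac{\partial \ell}{\partial s} = \sigma(s) - y, \qquad \left| \frac{\partial \ell}{\partial s} \right| \le 1
$$

For a positive sample ($y = 1$) the weight $1 - \sigma(s)$ shrinks monotonically as the score rises, and for a negative sample ($y = 0$) the weight $\sigma(s)$ shrinks monotonically as the score falls, so no single sample can contribute an unbounded gradient.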
git clone https://github.com/Red-RL/REAL.git
cd REAL
conda create -n real python=3.10
conda activate real
bash scripts/install.sh

Before executing the training script, make sure to modify your model path and wandb API key according to your local environment and account configuration.
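A minimal pre-flight sketch for the two settings mentioned above. It assumes the model is stored in a local directory and that the wandb key is supplied through the `WANDB_API_KEY` environment variable, which the wandb client reads; whether the training script itself takes the key this way is an assumption. The `preflight` helper is hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the REAL repository.
# Usage: preflight <model_path>
preflight() {
  local model_path="$1"
  # Assumption: the model is a local directory, not a hub identifier.
  if [ ! -d "$model_path" ]; then
    echo "model path not found: $model_path" >&2
    return 1
  fi
  # Assumption: the wandb key is provided via the environment.
  if [ -z "${WANDB_API_KEY:-}" ]; then
    echo "WANDB_API_KEY is not set" >&2
    return 1
  fi
  echo "ok"
}
```

Calling `preflight "path/to/your/model"` before launching training prints `ok` only when both checks pass, and otherwise reports which setting is missing.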
bash scripts/train/1.5b_real.sh

Modification Instructions
- To switch to DeepSeek-R1-Distill-Qwen-7B or another model, set the MODEL_PATH variable in the script to the storage path of that model.
MODEL_PATH="path/to/DeepSeek-R1-Distill-Qwen-7B"

- To use the DAPO-17k-Math dataset, replace the data.train_files path in the script from ./datasets/deepscaler/data/train.parquet to ./datasets/dapo17k/data/dapo-math-17k.parquet.
# Original dataset path
--data.train_files=./datasets/deepscaler/data/train.parquet \
# Modified path for DAPO-17k-Math dataset
--data.train_files=./datasets/dapo17k/data/dapo-math-17k.parquet \

The evaluation script supports the following datasets: aime, aime25, math, amc, minerva, olympiad_bench. Modify EXP_NAMES and STEPS in the script to match your setup. The two variables are used together to locate model checkpoints, so they must have the same number of elements, with the i-th experiment name paired with the i-th step.
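The one-to-one pairing of EXP_NAMES and STEPS can be checked with a sketch like the one below. It assumes both variables are bash arrays, which may not match how the evaluation script declares them; the values shown are placeholders and `check_pairing` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Placeholder values; replace with your own experiment names and steps.
EXP_NAMES=("exp_a" "exp_b")
STEPS=(100 200)

# Hypothetical helper: $1 = number of names, $2 = number of steps.
check_pairing() {
  if [ "$1" -ne "$2" ]; then
    echo "EXP_NAMES has $1 entries but STEPS has $2" >&2
    return 1
  fi
}

# Stop early if the two arrays cannot be paired element by element.
check_pairing "${#EXP_NAMES[@]}" "${#STEPS[@]}" || exit 1

# Print the pairing that would be used to locate checkpoints.
for i in "${!EXP_NAMES[@]}"; do
  echo "${EXP_NAMES[$i]} -> step ${STEPS[$i]}"
done
```

Running the check before the evaluation makes a length mismatch fail immediately instead of surfacing later as a missing checkpoint.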
bash scripts/eval/run_eval.sh

If you find our work helpful for your research, please consider citing:
@article{zhai2026real,
title={Rewards as Labels: Revisiting RLVR from a Classification Perspective},
author={Zhai, Zepeng and Chen, Meilin and Zhao, Jiaxuan and Qian, Junlang and Shen, Lei and Lu, Yuan},
journal={arXiv preprint arXiv:2602.05630},
year={2026}
}
