Rewards as Labels: Revisiting RLVR from a Classification Perspective

🔥 Latest News

[2026.02] Our paper is now available on arXiv! Check it out at arXiv:2602.05630

📢 Overview

REAL (Rewards as Labels) is a novel framework for Reinforcement Learning with Verifiable Rewards (RLVR) that reformulates policy optimization as a classification problem. By treating verifiable rewards as categorical labels rather than scalar weights, REAL addresses fundamental gradient mismatches in existing GRPO-style methods and achieves superior training stability and performance on mathematical reasoning tasks.

Figure: Overview of our REAL framework

Key Highlights

🎯 Identifies Critical Issues: Reveals Gradient Misassignment in Positives and Gradient Domination in Negatives in GRPO-style methods
🔄 Novel Perspective: Reformulates RLVR as classification by treating rewards as categorical labels
📈 State-of-the-Art Performance:
- +6.7% over DAPO on 1.5B models
- +6.2% over DAPO and +1.7% over GSPO on 7B models
🛡️ Superior Stability: Maintains stable training without entropy collapse or explosion

Figure: Performance comparison on the DeepScaleR-Preview-Dataset using the DeepSeek-R1-Distill-Qwen-1.5B / 7B backbone

Figure: Training dynamics comparison - REAL maintains stable entropy and achieves consistent improvement

🔍 Method

REAL reformulates policy optimization by treating verifiable rewards ($r \in \{0,1\}$) as categorical labels, enabling a natural classification objective:

where $\bar{s}$ is the length-normalized relative log-probability and $\tau$ is the temperature parameter.

Why REAL Works

REAL induces monotonic and bounded gradient weighting with magnitude upper-bounded by $\frac{1}{\tau}$:

This effectively mitigates gradient issues in GRPO while ensuring stable optimization.
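
As an illustration of where a $\frac{1}{\tau}$ bound can come from, consider a generic temperature-scaled binary cross-entropy on $\bar{s}$ with label $r$. This form is an assumption made for the sketch and may differ from the exact REAL objective given in the paper; $\sigma$ denotes the logistic sigmoid.

$$
\ell(\bar{s}, r) = -\,r \log \sigma\!\left(\frac{\bar{s}}{\tau}\right) - (1 - r) \log\!\left(1 - \sigma\!\left(\frac{\bar{s}}{\tau}\right)\right)
$$

$$
\frac{\partial \ell}{\partial \bar{s}} = \frac{1}{\tau}\left(\sigma\!\left(\frac{\bar{s}}{\tau}\right) - r\right),
\qquad
\left|\frac{\partial \ell}{\partial \bar{s}}\right| < \frac{1}{\tau}
$$

Because $\sigma(\cdot) \in (0,1)$ and $r \in \{0,1\}$, the weight on each sample stays strictly below $\frac{1}{\tau}$ in magnitude. It is also monotonic in $\bar{s}$: for $r = 1$ the magnitude $\frac{1}{\tau}(1 - \sigma)$ shrinks as $\bar{s}$ grows, and for $r = 0$ the magnitude $\frac{1}{\tau}\sigma$ grows with $\bar{s}$.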

Figure: Gradient magnitude comparison - REAL provides monotonic and bounded gradients

🔧 Installation

git clone https://github.com/Red-RL/REAL.git
cd REAL
conda create -n real python=3.10
conda activate real
bash scripts/install.sh

🚀 Training

Before running the training script, edit it to set the model path and wandb API key for your local environment and account.

bash scripts/train/1.5b_real.sh

Modification Instructions

To switch to DeepSeek-R1-Distill-Qwen-7B or another model, set the MODEL_PATH variable in the script to the local path of that model.

MODEL_PATH="path/to/DeepSeek-R1-Distill-Qwen-7B"

To use the DAPO-17k-Math dataset, replace the data.train_files path in the script from ./datasets/deepscaler/data/train.parquet to ./datasets/dapo17k/data/dapo-math-17k.parquet.

# Original dataset path
--data.train_files=./datasets/deepscaler/data/train.parquet \

# Modified path for DAPO-17k-Math dataset
--data.train_files=./datasets/dapo17k/data/dapo-math-17k.parquet \
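
If you prefer not to edit the script by hand, the dataset swap above can be sketched as a small shell helper that writes a modified copy of a training script. The helper name `swap_dataset` and the output filename in the example are hypothetical and not part of the repository; only the two dataset paths come from the instructions above.

```shell
# Dataset paths taken from the instructions above.
OLD_DATA="./datasets/deepscaler/data/train.parquet"
NEW_DATA="./datasets/dapo17k/data/dapo-math-17k.parquet"

# Write a copy of a training script with the dataset path swapped.
# $1: source script, $2: destination script (both chosen by the caller).
swap_dataset() {
  # Escape the dots in the old path so sed matches them literally.
  old_re=$(printf '%s' "$OLD_DATA" | sed 's/\./\\./g')
  sed "s#${old_re}#${NEW_DATA}#" "$1" > "$2"
}

# Example (the destination filename is hypothetical):
# swap_dataset scripts/train/1.5b_real.sh scripts/train/1.5b_real_dapo17k.sh
```

Writing to a separate file keeps the original script untouched, so you can diff the two before launching a run.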

🚀 Evaluation

The evaluation script supports the following datasets: aime, aime25, math, amc, minerva, olympiad_bench. Modify EXP_NAMES and STEPS in the script to match your setup. The two variables are used together to locate model checkpoints, so they must contain the same number of elements, with the i-th experiment name paired with the i-th step.

bash scripts/eval/run_eval.sh
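
The one-to-one pairing of EXP_NAMES and STEPS can be sanity-checked before launching evaluation. The sketch below assumes both are space-separated lists; the actual script may store them differently (for example as bash arrays). The function `pair_checkpoints` and the sample values are hypothetical.

```shell
# Hypothetical values; replace with your own experiment names and steps.
EXP_NAMES="exp_a exp_b"
STEPS="100 200"

# Print one "name step" pair per line, or fail if the lists differ in length.
# $1: space-separated experiment names, $2: space-separated checkpoint steps.
pair_checkpoints() {
  names=$1
  steps=$2
  set -- $names; n_names=$#
  set -- $steps; n_steps=$#
  if [ "$n_names" -ne "$n_steps" ]; then
    echo "EXP_NAMES has $n_names elements but STEPS has $n_steps" >&2
    return 1
  fi
  # Positional parameters now hold the steps; walk both lists in lockstep.
  for name in $names; do
    echo "$name $1"
    shift
  done
}

pair_checkpoints "$EXP_NAMES" "$STEPS"
```

A length mismatch is reported on stderr with a non-zero return code, which is cheaper to catch here than after the evaluation fails to find a checkpoint.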

📖 Citation

If you find our work helpful for your research, please consider citing:

@article{zhai2026real,
  title={Rewards as Labels: Revisiting RLVR from a Classification Perspective},
  author={Zhai, Zepeng and Chen, Meilin and Zhao, Jiaxuan and Qian, Junlang and Shen, Lei and Lu, Yuan},
  journal={arXiv preprint arXiv:2602.05630},
  year={2026}
}
