
Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

Say Goodbye to Scalar Reward Models: Exploring the Aha Moment in General RL



Introduction

OpenRS (Open Rubric System) is an LLM-as-a-Judge evaluation framework that replaces traditional Reward Models with adaptive, fine-grained rubric-based evaluation. The core idea is to let a large language model score response quality against query-type-specific rubrics, enabling multi-dimensional scoring with interpretable verdicts.

The framework supports three evaluation paradigms:

| Paradigm | Scenario | Description |
| --- | --- | --- |
| Pairwise | Chat, Code, Safety, etc. | A/B bi-directional comparison with weighted multi-criteria scoring |
| Verifiable | Math, Factuality | Ground-truth verification first, fallback to Pairwise if inconclusive |
| Precise IF | Instruction Following | Hard constraint checking, fallback to Pairwise on ties |
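The routing between the three paradigms can be sketched as follows. This is a minimal illustration, not the actual OpenRS API; the function name and query-type labels are hypothetical.

```python
# Illustrative sketch of paradigm routing by query type
# (names are hypothetical, not the real OpenRS interface).

def route(query_type: str) -> str:
    """Pick an evaluation paradigm from the query type."""
    if query_type in {"math", "factuality"}:
        return "verifiable"       # ground-truth check first, Pairwise fallback
    if query_type == "instruction_following":
        return "precise_if"       # hard constraint checking, Pairwise on ties
    return "pairwise"             # default: bi-directional A/B comparison

print(route("math"), route("instruction_following"), route("chat"))
# verifiable precise_if pairwise
```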


Figure 1: OpenRS Evaluation Pipeline — From pairwise responses, through verifiable and adaptive rubric generation, to multi-criteria scoring.

Key Features

  • 🎯 Open Rubric: 50+ query-type-specific rubrics with weighted criteria (critical / core / important / highlight)
  • ⚖️ Bi-directional Debiasing: Swaps A/B order to eliminate position bias
  • 🔍 Critical Flaw Veto: Fatal errors override all other scoring dimensions
  • 📊 4 Benchmarks: JudgeBench, PPE, RewardBench V2, RMBench
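The bi-directional debiasing idea can be illustrated with a toy sketch: judge the pair in both orders and accept a verdict only when the two orders agree. The `judge` callable here is a stand-in for a real LLM call, not the framework's actual interface.

```python
# Toy sketch of bi-directional debiasing (assumed shape, not OpenRS code):
# judge (A, B) and (B, A); keep the verdict only if both orders agree.

def debiased_verdict(judge, a: str, b: str) -> str:
    forward = judge(a, b)   # verdict for order (A, B): "A", "B", or "tie"
    backward = judge(b, a)  # verdict for order (B, A)
    # Map the swapped verdict back to the original labels.
    backward = {"A": "B", "B": "A", "tie": "tie"}[backward]
    return forward if forward == backward else "tie"

# Toy judge that prefers the longer response regardless of position.
longer = lambda x, y: "A" if len(x) > len(y) else "B" if len(y) > len(x) else "tie"
print(debiased_verdict(longer, "short", "a longer answer"))  # B
```

A judge with pure position bias (always answering "A") would disagree with itself after the swap and collapse to a tie, which is exactly the failure mode this step is meant to neutralize.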

Main Results

We evaluate five judge models across four benchmarks:


Table 1: Accuracy (%) of different judge models across four benchmarks.

Installation

git clone https://github.com/WyxBUPT-22/OpenRS.git
cd OpenRS
pip install -r requirements.txt

Dependencies: `openai`, `tenacity`, `json5`, `json-repair`, `tqdm`

Quick Start

1. Configure API

OpenRS is compatible with any OpenAI-compatible inference backend (vLLM, SGLang, Ollama, etc.):

export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="your-api-key"
export OPENAI_MODEL_NAME="your-model-name"
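A script consuming these variables might resolve them as below. This is only a sketch; the fallback values shown are illustrative (a local vLLM-style endpoint and a dummy key), not OpenRS's actual defaults.

```python
import os

# Sketch: resolve the backend settings exported above, with illustrative
# fallbacks (these defaults are assumptions, not OpenRS's real defaults).

def load_backend_config() -> dict:
    return {
        "base_url": os.environ.get("OPENAI_BASE_URL", "http://localhost:8000/v1"),
        "api_key": os.environ.get("OPENAI_API_KEY", "EMPTY"),
        "model": os.environ.get("OPENAI_MODEL_NAME", "local-model"),
    }

cfg = load_backend_config()
print(cfg["base_url"])
```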

2. Run Evaluation

JudgeBench / PPE
python judgebench_and_ppe.py \
    --input data/judgebench/gpt.jsonl \
    --output-dir results/judgebench \
    --annotation judgebench_gpt \
    --workers 50
RewardBench V2
python rewardbench_v2.py \
    --input data/rewardbench_v2/rewardbench_v2.jsonl \
    --output-dir results/rewardbench_v2 \
    --annotation rbv2 \
    --workers 10
RMBench
python rmbench.py \
    --input data/rmbench/rmbench.jsonl \
    --output results/rmbench_results.jsonl \
    --workers 10

Common Arguments

| Argument | Description | Default |
| --- | --- | --- |
| `--input` | Input data path | required |
| `--output-dir` | Output directory | `./results` |
| `--workers` | Concurrent threads | 10–50 |
| `--temperature` | Generation temperature | 0.0 |
| `--limit` | Max items to process (0 = all) | 0 |
| `--no-resume` | Disable checkpoint resume | False |
| `--stats-only` | Report stats without running | False |

Evaluation Pipeline

Input Data → Evaluation Router → Model Call → Score Parsing → Result Aggregation → Report
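The stages above can be sketched end to end as a toy pipeline. All function bodies here are stand-ins wired together to show the data flow, not the real OpenRS implementation.

```python
# Toy shape of the pipeline above; stage names follow the text,
# all callables are illustrative stand-ins.

def run_pipeline(items, router, call_model, parse_score):
    results = []
    for item in items:
        paradigm = router(item)              # Evaluation Router
        raw = call_model(paradigm, item)     # Model Call
        results.append(parse_score(raw))     # Score Parsing
    wins = sum(1 for r in results if r == "chosen")   # Result Aggregation
    return {"n": len(results), "accuracy": wins / max(len(results), 1)}  # Report

report = run_pipeline(
    items=[{"q": "1+1?"}, {"q": "2+2?"}],
    router=lambda item: "verifiable",
    call_model=lambda p, item: '{"verdict": "chosen"}',
    parse_score=lambda raw: "chosen" if "chosen" in raw else "rejected",
)
print(report)  # {'n': 2, 'accuracy': 1.0}
```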

Scoring Mechanism

Each criterion is weighted by importance:

| Category | Weight | Description |
| --- | --- | --- |
| Critical Flaw | Veto | Fatal errors trigger an immediate verdict; all other scores are ignored |
| Core | ×5 | Key quality dimensions |
| Important | ×2 | Meaningful but non-critical factors |
| Highlight | ×1 | Bonus items |
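A minimal sketch of this weighting, assuming each criterion reduces to a (category, passed) pair; the data layout and function name are hypothetical, but the weights and the critical-flaw veto follow the table above.

```python
# Illustrative weighted scoring with the critical-flaw veto described above.
# Core ×5, Important ×2, Highlight ×1; a failed critical criterion
# short-circuits everything else.

WEIGHTS = {"core": 5, "important": 2, "highlight": 1}

def score(criteria):
    """criteria: list of (category, passed) pairs for one response."""
    if any(cat == "critical" and not passed for cat, passed in criteria):
        return float("-inf")  # veto: a fatal error overrides all other scores
    return sum(WEIGHTS[cat] for cat, passed in criteria
               if cat != "critical" and passed)

print(score([("critical", True), ("core", True),
             ("important", False), ("highlight", True)]))  # 6
print(score([("critical", False), ("core", True)]))        # -inf
```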

Dataset-Specific Logic

  • JudgeBench / PPE: Full evaluate_pair per sample — Verifiable check → bi-directional Pairwise
  • RewardBench V2: 1-vs-N comparison, subset-specific routing (Chat, Math, Safety, Precise IF, Focus); ties excluded from accuracy
  • RMBench: 9 pairs (3 chosen × 3 rejected variants) × 2 orders = 18 evaluations per sample; stratified by Easy / Normal / Hard
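The RMBench pairing arithmetic above can be checked with a few lines: 3 chosen variants × 3 rejected variants gives 9 pairs, and judging each pair in both A/B orders gives 18 evaluations per sample. The variant names below are illustrative.

```python
# Sketch of the RMBench pairing scheme: 3 chosen × 3 rejected variants,
# each pair judged in both orders. Variant names are illustrative.
from itertools import product

chosen = ["c_easy", "c_normal", "c_hard"]
rejected = ["r_easy", "r_normal", "r_hard"]

evaluations = [(a, b, order)
               for a, b in product(chosen, rejected)
               for order in ("AB", "BA")]
print(len(evaluations))  # 18
```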

Project Structure

OpenRS/
├── tools.py                    # API calls, JSON parsing, file I/O
├── evaluator.py                # Core evaluation interface (evaluate_pair)
├── evaluator_precise_if.py     # Precise IF (Instruction Following) evaluator
├── robust_utils.py             # Robustness utilities (Unicode/JSON tolerance)
│
├── judgebench_and_ppe.py       # JudgeBench / PPE evaluation script
├── rewardbench_v2.py           # RewardBench V2 evaluation script
├── rmbench.py                  # RMBench evaluation script
│
├── prompts/
│   ├── pairwise_prompts/       # 50+ category-specific pairwise rubrics (.md)
│   ├── pointwise_prompts/      # Precise IF prompts
│   └── verifiable_prompts/     # Ground-truth verification prompts
│
├── data/                       # Evaluation datasets
│   ├── judgebench/
│   ├── ppe/
│   ├── rewardbench_v2/
│   └── rmbench/
│
├── requirements.txt
└── LICENSE                     # Apache License 2.0

Output Format

After evaluation, results are organized by verdict:

results/
├── all_results_{annotation}.jsonl            # All results
├── verifiable_good_cases_{annotation}.jsonl   # Verifiable: chosen wins
├── verifiable_bad_cases_{annotation}.jsonl    # Verifiable: rejected wins
├── pairwise_good_cases_{annotation}.jsonl     # Pairwise: chosen wins
├── pairwise_bad_cases_{annotation}.jsonl      # Pairwise: rejected wins
├── pairwise_same_cases_{annotation}.jsonl     # Pairwise: tie
├── error_cases_{annotation}.jsonl             # Evaluation errors
└── summary_{annotation}.json                  # Aggregated statistics
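These outputs are per-case JSONL files plus one aggregated JSON summary, so they can be inspected with a small loader like the sketch below. The field names inside the files are not documented here, so nothing is assumed about them; only the file layout follows the tree above.

```python
# Sketch: load the per-case JSONL outputs and the aggregated summary.
import json
from pathlib import Path

def read_jsonl(path: Path):
    """Read one record per non-empty line."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]

results_dir = Path("results")
annotation = "judgebench_gpt"  # matches the --annotation flag used at run time
# summary = json.loads((results_dir / f"summary_{annotation}.json").read_text())
# good = read_jsonl(results_dir / f"pairwise_good_cases_{annotation}.jsonl")
```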

Citation

If you find this work useful, please cite:

@misc{jia2026openrubricsystemscaling,
      title={Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric}, 
      author={Ruipeng Jia and Yunyi Yang and Yuxin Wu and Yongbo Gai and Siyuan Tao and Mengyu Zhou and Jianhe Lin and Xiaoxi Jiang and Guanjun Jiang},
      year={2026},
      eprint={2602.14069},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.14069}, 
}

License

This project is licensed under the Apache License 2.0.
