Say Goodbye to Scalar Reward Models, Exploring the Aha Moment in General RL
OpenRS (Open Rubric System) is an LLM-as-a-Judge evaluation framework that replaces traditional Reward Models with adaptive, fine-grained rubric-based evaluation. The core idea is to use large language models to evaluate response quality through adaptive, query-type-specific rubrics — enabling multi-dimensional scoring with interpretable verdicts.
The framework supports three evaluation paradigms:
| Paradigm | Scenario | Description |
|---|---|---|
| Pairwise | Chat, Code, Safety, etc. | A/B bi-directional comparison with weighted multi-criteria scoring |
| Verifiable | Math, Factuality | Ground-truth verification first, fallback to Pairwise if inconclusive |
| Precise IF | Instruction Following | Hard constraint checking, fallback to Pairwise on ties |
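The three-way routing can be sketched as follows. This is an illustrative mapping only (the function name and category strings are hypothetical; the real router dispatches per benchmark subset):

```python
def route(query_type: str) -> str:
    """Pick an evaluation paradigm from the query type (illustrative
    mapping; the actual router uses per-subset rules)."""
    if query_type in {"math", "factuality"}:
        return "verifiable"   # ground-truth check first, fall back to pairwise
    if query_type == "instruction_following":
        return "precise_if"   # hard constraint checks, fall back on ties
    return "pairwise"         # chat, code, safety, ...

print(route("math"))                   # verifiable
print(route("instruction_following"))  # precise_if
print(route("code"))                   # pairwise
```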
Figure 1: OpenRS Evaluation Pipeline — From pairwise responses, through verifiable and adaptive rubric generation, to multi-criteria scoring.
- 🎯 Open Rubric: 50+ query-type-specific rubrics with weighted criteria (critical / core / important / highlight)
- ⚖️ Bi-directional Debiasing: Swaps A/B order to eliminate position bias
- 🔍 Critical Flaw Veto: Fatal errors override all other scoring dimensions
- 📊 4 Benchmarks: JudgeBench, PPE, RewardBench V2, RMBench
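The bi-directional debiasing idea can be illustrated with a minimal sketch (the `judge` callable and verdict labels are hypothetical, not the project's API): run the judge twice with A/B swapped and only keep a verdict both orders agree on.

```python
def debiased_verdict(judge, prompt, resp_a, resp_b):
    """Judge the pair in both A/B orders; keep the verdict only if the
    two runs agree, otherwise declare a tie. `judge` is a hypothetical
    callable returning "A", "B", or "tie"."""
    first = judge(prompt, resp_a, resp_b)
    second = judge(prompt, resp_b, resp_a)
    # Map the swapped run's verdict back into the original A/B frame.
    swapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == swapped else "tie"

# A judge that always prefers the first slot is pure position bias —
# swapping the order exposes it and the verdict collapses to a tie:
biased = lambda prompt, x, y: "A"
print(debiased_verdict(biased, "q", "ans1", "ans2"))  # tie
```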
We evaluate five judge models across four benchmarks:
Table 1: Accuracy (%) of different judge models across four benchmarks.
```bash
git clone https://github.com/WyxBUPT-22/OpenRS.git
cd OpenRS
pip install -r requirements.txt
```

Dependencies: `openai`, `tenacity`, `json5`, `json-repair`, `tqdm`
OpenRS is compatible with any OpenAI-compatible inference backend (vLLM, SGLang, Ollama, etc.):
```bash
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="your-api-key"
export OPENAI_MODEL_NAME="your-model-name"
```

JudgeBench / PPE

```bash
python judgebench_and_ppe.py \
    --input data/judgebench/gpt.jsonl \
    --output-dir results/judgebench \
    --annotation judgebench_gpt \
    --workers 50
```

RewardBench V2
```bash
python rewardbench_v2.py \
    --input data/rewardbench_v2/rewardbench_v2.jsonl \
    --output-dir results/rewardbench_v2 \
    --annotation rbv2 \
    --workers 10
```

RMBench
```bash
python rmbench.py \
    --input data/rmbench/rmbench.jsonl \
    --output results/rmbench_results.jsonl \
    --workers 10
```

| Argument | Description | Default |
|---|---|---|
| `--input` | Input data path | required |
| `--output-dir` | Output directory | `./results` |
| `--workers` | Concurrent threads | 10–50 |
| `--temperature` | Generation temperature | 0.0 |
| `--limit` | Max items to process (0 = all) | 0 |
| `--no-resume` | Disable checkpoint resume | False |
| `--stats-only` | Report stats without running | False |
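The `--workers` flag controls client-side concurrency. Conceptually, each worker pulls items and issues a judge call to the inference backend; a minimal sketch of that fan-out (not the project's actual code) looks like:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_all(items, judge_fn, workers=10):
    """Fan judge calls out over a thread pool, preserving input order.
    Threads suit this workload: each call is I/O-bound, waiting on the
    OpenAI-compatible inference backend rather than on the CPU."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(judge_fn, items))

# Stand-in judge function for illustration:
results = evaluate_all([1, 2, 3], lambda x: x * 10, workers=2)
print(results)  # [10, 20, 30]
```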
Input Data → Evaluation Router → Model Call → Score Parsing → Result Aggregation → Report
Each criterion is weighted by importance:
| Category | Weight | Description |
|---|---|---|
| Critical Flaw | Veto | Fatal errors → immediate verdict, all other scores ignored |
| Core | ×5 | Key quality dimensions |
| Important | ×2 | Meaningful but non-critical factors |
| Highlight | ×1 | Bonus items |
- JudgeBench / PPE: Full `evaluate_pair` per sample — Verifiable check → bi-directional Pairwise
- RewardBench V2: 1-vs-N comparison with subset-specific routing (Chat, Math, Safety, Precise IF, Focus); ties excluded from accuracy
- RMBench: 9 pairs (3 chosen × 3 rejected variants) × 2 orders = 18 evaluations per sample; stratified by Easy / Normal / Hard
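The RMBench count of 18 evaluations per sample follows directly from the protocol: 3 chosen variants × 3 rejected variants × 2 A/B orders. A quick enumeration sketch (function name hypothetical):

```python
from itertools import product

def rmbench_jobs(chosen_variants, rejected_variants):
    """Enumerate RMBench evaluations for one sample: every
    chosen/rejected variant pair, judged in both A/B orders."""
    jobs = []
    for c, r in product(chosen_variants, rejected_variants):
        jobs.append((c, r))  # chosen in slot A
        jobs.append((r, c))  # swapped order for debiasing
    return jobs

jobs = rmbench_jobs(["c1", "c2", "c3"], ["r1", "r2", "r3"])
print(len(jobs))  # 18 = 3 × 3 pairs × 2 orders
```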
```
OpenRS/
├── tools.py                  # API calls, JSON parsing, file I/O
├── evaluator.py              # Core evaluation interface (evaluate_pair)
├── evaluator_precise_if.py   # Precise IF (Instruction Following) evaluator
├── robust_utils.py           # Robustness utilities (Unicode/JSON tolerance)
│
├── judgebench_and_ppe.py     # JudgeBench / PPE evaluation script
├── rewardbench_v2.py         # RewardBench V2 evaluation script
├── rmbench.py                # RMBench evaluation script
│
├── prompts/
│   ├── pairwise_prompts/     # 50+ category-specific pairwise rubrics (.md)
│   ├── pointwise_prompts/    # Precise IF prompts
│   └── verifiable_prompts/   # Ground-truth verification prompts
│
├── data/                     # Evaluation datasets
│   ├── judgebench/
│   ├── ppe/
│   ├── rewardbench_v2/
│   └── rmbench/
│
├── requirements.txt
└── LICENSE                   # Apache License 2.0
```
After evaluation, results are organized by verdict:
```
results/
├── all_results_{annotation}.jsonl              # All results
├── verifiable_good_cases_{annotation}.jsonl    # Verifiable: chosen wins
├── verifiable_bad_cases_{annotation}.jsonl     # Verifiable: rejected wins
├── pairwise_good_cases_{annotation}.jsonl      # Pairwise: chosen wins
├── pairwise_bad_cases_{annotation}.jsonl       # Pairwise: rejected wins
├── pairwise_same_cases_{annotation}.jsonl      # Pairwise: tie
├── error_cases_{annotation}.jsonl              # Evaluation errors
└── summary_{annotation}.json                   # Aggregated statistics
```
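Given the chosen-wins / rejected-wins / tie case counts, accuracy can be computed as below. This is a sketch of the metric, assuming simple integer counts; RewardBench V2 excludes ties from the denominator:

```python
def accuracy(good: int, bad: int, same: int, include_ties: bool = False) -> float:
    """Accuracy over pairwise verdicts: chosen-wins / decided cases.
    For RewardBench V2, ties are dropped from the denominator."""
    denom = good + bad + (same if include_ties else 0)
    return good / denom if denom else 0.0

print(round(accuracy(good=80, bad=15, same=5), 4))  # 0.8421 (ties excluded)
```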
If you find this work useful, please cite:
```bibtex
@misc{jia2026openrubricsystemscaling,
      title={Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric},
      author={Ruipeng Jia and Yunyi Yang and Yuxin Wu and Yongbo Gai and Siyuan Tao and Mengyu Zhou and Jianhe Lin and Xiaoxi Jiang and Guanjun Jiang},
      year={2026},
      eprint={2602.14069},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.14069},
}
```

This project is licensed under the Apache License 2.0.