
Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

Say Goodbye to Scalar Reward Models: Exploring the Aha Moment in General RL



Introduction

OpenRS (Open Rubric System) is an LLM-as-a-Judge evaluation framework that replaces traditional Reward Models with adaptive, fine-grained rubric-based evaluation. The core idea is to let a large language model score response quality against query-type-specific rubrics, enabling multi-dimensional scoring with interpretable verdicts.

The framework supports three evaluation paradigms:

| Paradigm | Scenario | Description |
| --- | --- | --- |
| Pairwise | Chat, Code, Safety, etc. | A/B bi-directional comparison with weighted multi-criteria scoring |
| Verifiable | Math, Factuality | Ground-truth verification first, fallback to Pairwise if inconclusive |
| Precise IF | Instruction Following | Hard constraint checking, fallback to Pairwise on ties |
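The routing between the three paradigms can be sketched as follows. This is a minimal illustration, not the actual OpenRS API; the function name and query-type labels are hypothetical.

```python
# Illustrative sketch of paradigm routing by query type
# (names are hypothetical, not the real OpenRS interface).

def route(query_type: str) -> str:
    """Pick an evaluation paradigm from the query type."""
    if query_type in {"math", "factuality"}:
        return "verifiable"       # ground-truth check first, Pairwise fallback
    if query_type == "instruction_following":
        return "precise_if"       # hard constraint checking, Pairwise on ties
    return "pairwise"             # default: bi-directional A/B comparison

print(route("math"), route("instruction_following"), route("chat"))
# verifiable precise_if pairwise
```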


Figure 1: OpenRS Evaluation Pipeline — From pairwise responses, through verifiable and adaptive rubric generation, to multi-criteria scoring.

Key Features

  • 🎯 Open Rubric: 50+ query-type-specific rubrics with weighted criteria (critical / core / important / highlight)
  • ⚖️ Bi-directional Debiasing: Swaps A/B order to eliminate position bias
  • 🔍 Critical Flaw Veto: Fatal errors override all other scoring dimensions
  • 📊 4 Benchmarks: JudgeBench, PPE, RewardBench V2, RMBench
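The bi-directional debiasing idea can be illustrated with a toy sketch: judge the pair in both orders and accept a verdict only when the two orders agree. The `judge` callable here is a stand-in for a real LLM call, not the framework's actual interface.

```python
# Toy sketch of bi-directional debiasing (assumed shape, not OpenRS code):
# judge (A, B) and (B, A); keep the verdict only if both orders agree.

def debiased_verdict(judge, a: str, b: str) -> str:
    forward = judge(a, b)   # verdict for order (A, B): "A", "B", or "tie"
    backward = judge(b, a)  # verdict for order (B, A)
    # Map the swapped verdict back to the original labels.
    backward = {"A": "B", "B": "A", "tie": "tie"}[backward]
    return forward if forward == backward else "tie"

# Toy judge that prefers the longer response regardless of position.
longer = lambda x, y: "A" if len(x) > len(y) else "B" if len(y) > len(x) else "tie"
print(debiased_verdict(longer, "short", "a longer answer"))  # B
```

A judge with pure position bias (always answering "A") would disagree with itself after the swap and collapse to a tie, which is exactly the failure mode this step is meant to neutralize.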

Main Results

We evaluate five judge models across four benchmarks:


Table 1: Accuracy (%) of different judge models across four benchmarks.

Installation

git clone https://github.com/WyxBUPT-22/OpenRS.git
cd OpenRS
pip install -r requirements.txt

Dependencies: `openai`, `tenacity`, `json5`, `json-repair`, `tqdm`

Quick Start

1. Configure API

OpenRS is compatible with any OpenAI-compatible inference backend (vLLM, SGLang, Ollama, etc.):

export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="your-api-key"
export OPENAI_MODEL_NAME="your-model-name"
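A script consuming these variables might resolve them as below. This is only a sketch; the fallback values shown are illustrative (a local vLLM-style endpoint and a dummy key), not OpenRS's actual defaults.

```python
import os

# Sketch: resolve the backend settings exported above, with illustrative
# fallbacks (these defaults are assumptions, not OpenRS's real defaults).

def load_backend_config() -> dict:
    return {
        "base_url": os.environ.get("OPENAI_BASE_URL", "http://localhost:8000/v1"),
        "api_key": os.environ.get("OPENAI_API_KEY", "EMPTY"),
        "model": os.environ.get("OPENAI_MODEL_NAME", "local-model"),
    }

cfg = load_backend_config()
print(cfg["base_url"])
```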

2. Run Evaluation

JudgeBench / PPE
python judgebench_and_ppe.py \
    --input data/judgebench/gpt.jsonl \
    --output-dir results/judgebench \
    --annotation judgebench_gpt \
    --workers 50
RewardBench V2
python rewardbench_v2.py \
    --input data/rewardbench_v2/rewardbench_v2.jsonl \
    --output-dir results/rewardbench_v2 \
    --annotation rbv2 \
    --workers 10
RMBench
python rmbench.py \
    --input data/rmbench/rmbench.jsonl \
    --output results/rmbench_results.jsonl \
    --workers 10

Common Arguments

| Argument | Description | Default |
| --- | --- | --- |
| `--input` | Input data path | required |
| `--output-dir` | Output directory | `./results` |
| `--workers` | Concurrent threads | 10–50 |
| `--temperature` | Generation temperature | 0.0 |
| `--limit` | Max items to process (0 = all) | 0 |
| `--no-resume` | Disable checkpoint resume | False |
| `--stats-only` | Report stats without running | False |

Evaluation Pipeline

Input Data → Evaluation Router → Model Call → Score Parsing → Result Aggregation → Report
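The stages above can be sketched end to end as a toy pipeline. All function bodies here are stand-ins wired together to show the data flow, not the real OpenRS implementation.

```python
# Toy shape of the pipeline above; stage names follow the text,
# all callables are illustrative stand-ins.

def run_pipeline(items, router, call_model, parse_score):
    results = []
    for item in items:
        paradigm = router(item)              # Evaluation Router
        raw = call_model(paradigm, item)     # Model Call
        results.append(parse_score(raw))     # Score Parsing
    wins = sum(1 for r in results if r == "chosen")   # Result Aggregation
    return {"n": len(results), "accuracy": wins / max(len(results), 1)}  # Report

report = run_pipeline(
    items=[{"q": "1+1?"}, {"q": "2+2?"}],
    router=lambda item: "verifiable",
    call_model=lambda p, item: '{"verdict": "chosen"}',
    parse_score=lambda raw: "chosen" if "chosen" in raw else "rejected",
)
print(report)  # {'n': 2, 'accuracy': 1.0}
```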

Scoring Mechanism

Each criterion is weighted by importance:

| Category | Weight | Description |
| --- | --- | --- |
| Critical Flaw | Veto | Fatal errors trigger an immediate verdict; all other scores are ignored |
| Core | ×5 | Key quality dimensions |
| Important | ×2 | Meaningful but non-critical factors |
| Highlight | ×1 | Bonus items |
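A minimal sketch of this weighting, assuming each criterion reduces to a (category, passed) pair; the data layout and function name are hypothetical, but the weights and the critical-flaw veto follow the table above.

```python
# Illustrative weighted scoring with the critical-flaw veto described above.
# Core ×5, Important ×2, Highlight ×1; a failed critical criterion
# short-circuits everything else.

WEIGHTS = {"core": 5, "important": 2, "highlight": 1}

def score(criteria):
    """criteria: list of (category, passed) pairs for one response."""
    if any(cat == "critical" and not passed for cat, passed in criteria):
        return float("-inf")  # veto: a fatal error overrides all other scores
    return sum(WEIGHTS[cat] for cat, passed in criteria
               if cat != "critical" and passed)

print(score([("critical", True), ("core", True),
             ("important", False), ("highlight", True)]))  # 6
print(score([("critical", False), ("core", True)]))        # -inf
```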

Dataset-Specific Logic

  • JudgeBench / PPE: Full evaluate_pair per sample — Verifiable check → bi-directional Pairwise
  • RewardBench V2: 1-vs-N comparison, subset-specific routing (Chat, Math, Safety, Precise IF, Focus); ties excluded from accuracy
  • RMBench: 9 pairs (3 chosen × 3 rejected variants) × 2 orders = 18 evaluations per sample; stratified by Easy / Normal / Hard
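The RMBench pairing arithmetic above can be checked with a few lines: 3 chosen variants × 3 rejected variants gives 9 pairs, and judging each pair in both A/B orders gives 18 evaluations per sample. The variant names below are illustrative.

```python
# Sketch of the RMBench pairing scheme: 3 chosen × 3 rejected variants,
# each pair judged in both orders. Variant names are illustrative.
from itertools import product

chosen = ["c_easy", "c_normal", "c_hard"]
rejected = ["r_easy", "r_normal", "r_hard"]

evaluations = [(a, b, order)
               for a, b in product(chosen, rejected)
               for order in ("AB", "BA")]
print(len(evaluations))  # 18
```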

Project Structure

OpenRS/
├── tools.py                    # API calls, JSON parsing, file I/O
├── evaluator.py                # Core evaluation interface (evaluate_pair)
├── evaluator_precise_if.py     # Precise IF (Instruction Following) evaluator
├── robust_utils.py             # Robustness utilities (Unicode/JSON tolerance)
│
├── judgebench_and_ppe.py       # JudgeBench / PPE evaluation script
├── rewardbench_v2.py           # RewardBench V2 evaluation script
├── rmbench.py                  # RMBench evaluation script
│
├── prompts/
│   ├── pairwise_prompts/       # 50+ category-specific pairwise rubrics (.md)
│   ├── pointwise_prompts/      # Precise IF prompts
│   └── verifiable_prompts/     # Ground-truth verification prompts
│
├── data/                       # Evaluation datasets
│   ├── judgebench/
│   ├── ppe/
│   ├── rewardbench_v2/
│   └── rmbench/
│
├── requirements.txt
└── LICENSE                     # Apache License 2.0

Output Format

After evaluation, results are organized by verdict:

results/
├── all_results_{annotation}.jsonl            # All results
├── verifiable_good_cases_{annotation}.jsonl   # Verifiable: chosen wins
├── verifiable_bad_cases_{annotation}.jsonl    # Verifiable: rejected wins
├── pairwise_good_cases_{annotation}.jsonl     # Pairwise: chosen wins
├── pairwise_bad_cases_{annotation}.jsonl      # Pairwise: rejected wins
├── pairwise_same_cases_{annotation}.jsonl     # Pairwise: tie
├── error_cases_{annotation}.jsonl             # Evaluation errors
└── summary_{annotation}.json                  # Aggregated statistics
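These outputs are per-case JSONL files plus one aggregated JSON summary, so they can be inspected with a small loader like the sketch below. The field names inside the files are not documented here, so nothing is assumed about them; only the file layout follows the tree above.

```python
# Sketch: load the per-case JSONL outputs and the aggregated summary.
import json
from pathlib import Path

def read_jsonl(path: Path):
    """Read one record per non-empty line."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]

results_dir = Path("results")
annotation = "judgebench_gpt"  # matches the --annotation flag used at run time
# summary = json.loads((results_dir / f"summary_{annotation}.json").read_text())
# good = read_jsonl(results_dir / f"pairwise_good_cases_{annotation}.jsonl")
```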

Citation

If you find this work useful, please cite:

@misc{jia2026openrubricsystemscaling,
      title={Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric}, 
      author={Ruipeng Jia and Yunyi Yang and Yuxin Wu and Yongbo Gai and Siyuan Tao and Mengyu Zhou and Jianhe Lin and Xiaoxi Jiang and Guanjun Jiang},
      year={2026},
      eprint={2602.14069},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.14069}, 
}

License

This project is licensed under the Apache License 2.0.
