# AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation

Task-adaptive rubrics and dense reward signals for LLM agent trajectory evaluation

📄 Paper • Core Idea • Installation • Quick Start • Architecture • Related projects • Citation
LLM-as-Judge evaluation fails on agent tasks because a fixed rubric cannot capture what matters for a given task: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency. AdaRubric generates task-specific rubrics on the fly, scores trajectories step by step with confidence-weighted per-dimension feedback, and filters preference pairs with DimensionAwareFilter, which enforces per-dimension score floors so that strong scores on some dimensions cannot mask a failure on another.
| Metric | Value |
|---|---|
| Human correlation (Pearson r) | 0.79 (+0.16 over best static baseline) |
| Inter-run reliability (Krippendorff's α) | 0.83 (deployment-grade) |
| DPO task success gain over Prometheus | +6.8–+8.5 pp across WebArena / ToolBench / AgentBench |
| Transfer to SWE-bench code repair | +4.9 pp resolve rate (zero rubric engineering) |
| PPO convergence acceleration | +6.6 pp SR at 5K steps |
## Core Idea

Standard LLM evaluation applies static dimensions (Helpfulness, Fluency, Safety) regardless of task type. For goal-directed agent tasks (multi-step tool calls, API orchestration, code repair), a static rubric systematically mis-measures quality.
AdaRubric addresses this with a three-stage pipeline:
```
TaskDescription
       │
       ▼
┌─────────────┐     ┌────────────────────┐     ┌──────────────────┐
│   Stage 1   │     │      Stage 2       │     │     Stage 3      │
│   Rubric    │────▶│     Trajectory     │────▶│       Data       │
│  Generator  │     │     Evaluator      │     │      Filter      │
│ (LLM→R(T))  │     │ (per-step×per-dim) │     │ (DimAwareFilter) │
└─────────────┘     └────────────────────┘     └──────────────────┘
       │                      │                         │
 DynamicRubric       {s_{k,j}, c_{k,j}}            DPO Pairs
 (N dimensions)     (score + confidence)         (margin-gated)
```
- Rubric Generator — Given a task description, an LLM generates N orthogonal evaluation dimensions with calibrated 5-point scoring criteria. Rubrics are cached per task type (>95% API cost reduction).
- Trajectory Evaluator — Each (Thought → Action → Observation) step is scored per dimension with a confidence weight c_{k,j} ∈ [0,1]. Three pluggable aggregators: Weighted Mean (default), Geometric Mean, Min Score.
- Data Filter — Four composable filters curate high-quality DPO preference pairs. The key innovation is DimensionAwareFilter: a trajectory with a perfect average score can still fail catastrophically on a single dimension, and DimensionAwareFilter blocks exactly this failure mode by enforcing per-dimension minimums.
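The confidence-weighted step aggregation can be sketched in a few lines. This is an illustrative reading of the Weighted Mean aggregator, not the library's actual implementation; the function name and the way recency decay is applied are assumptions:

```python
def weighted_mean(scores, confidences, decay=None):
    """Confidence-weighted mean over steps for one dimension.

    scores[k] and confidences[k] stand in for s_{k,j} and c_{k,j} at
    step k. An optional recency decay down-weights earlier steps.
    Illustrative sketch only; not AdaRubric's actual aggregator code.
    """
    K = len(scores)
    weights = []
    for k, c in enumerate(confidences):
        w = c
        if decay is not None:
            w *= decay ** (K - 1 - k)  # later steps carry more weight
        weights.append(w)
    total = sum(weights)
    if total == 0:
        return 0.0  # no confident evidence at all
    return sum(w * s for w, s in zip(weights, scores)) / total

# Two steps scored 4 and 2 with confidences 1.0 and 0.5:
# the low-confidence step pulls the mean down less than it would uniformly.
print(weighted_mean([4, 2], [1.0, 0.5]))
```

A low-confidence judgment thus contributes proportionally less to the dimension's score, which is what makes the per-dimension feedback robust to uncertain judge calls.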
## Installation

```bash
git clone https://github.com/alphadl/AdaRubrics.git
cd AdaRubrics
pip install -e ".[dev]"
```

Set `OPENAI_API_KEY` in your environment (or pass it via config). YAML config support requires `pip install pyyaml`.
## Quick Start

```python
import asyncio

from adarubric import AdaRubricPipeline, TaskDescription, Trajectory, TrajectoryStep
from adarubric.config import AdaRubricConfig

task = TaskDescription(
    task_id="demo-001",
    instruction=(
        "Use the weather API to check if it will rain in Tokyo tomorrow, "
        "and if so, suggest indoor activities."
    ),
    domain="Personal Assistant",
    expected_tools=["weather_api", "activity_search"],
)

trajectory = Trajectory(
    trajectory_id="traj-demo-001",
    task_id="demo-001",
    steps=[
        TrajectoryStep(
            step_id=0,
            thought="I need to check tomorrow's weather in Tokyo first.",
            action="weather_api",
            action_input={"city": "Tokyo", "date": "tomorrow"},
            observation="Tomorrow: 70% chance of rain, high 18°C, low 12°C.",
        ),
        TrajectoryStep(
            step_id=1,
            thought="It's likely to rain. Let me find indoor activities.",
            action="activity_search",
            action_input={"city": "Tokyo", "type": "indoor", "limit": 5},
            observation="1. TeamLab Borderless, 2. Tokyo National Museum, 3. Akihabara arcades...",
        ),
    ],
)

pipeline = AdaRubricPipeline.from_config(AdaRubricConfig())
result = asyncio.run(pipeline.run(task, [trajectory], num_dimensions=4))

print(f"Rubric dimensions: {result.rubric.dimension_names}")
print(f"Global score: {result.mean_score:.2f}/5.0")
print(f"Survival rate: {result.survival_rate:.0%}")
```

Run the full example:
```bash
export OPENAI_API_KEY="sk-..."
python examples/quickstart.py
```

| Strategy | Behavior | Use Case |
|---|---|---|
| `WeightedMeanAggregator` | Confidence-weighted mean with optional recency decay (λ) | Default — balanced evaluation |
| `GeometricMeanAggregator` | Geometric mean — penalises low outliers | Tasks requiring balanced per-step performance |
| `MinScoreAggregator` | Global score = worst dimension | Safety-critical evaluations |
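To see how the three aggregation strategies diverge, here is a small worked comparison on hypothetical per-dimension scores, written in plain Python and independent of the library:

```python
import math

# Hypothetical per-dimension scores on a 1-5 scale, with one weak outlier.
scores = {"Correctness": 5.0, "Error Handling": 4.0,
          "Goal Alignment": 4.5, "Action Efficiency": 1.0}
vals = list(scores.values())

arith = sum(vals) / len(vals)              # mean (uniform weights here)
geom = math.prod(vals) ** (1 / len(vals))  # geometric mean: dragged down by the outlier
worst = min(vals)                          # min score: the safety-critical view

print(f"mean={arith:.2f}  geometric={geom:.2f}  min={worst:.2f}")
```

With these numbers the mean sits at 3.62, the geometric mean drops to roughly 3.08, and the minimum exposes the 1.0 outlier outright, which is why the min aggregator suits safety-critical evaluation.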
| Filter | Behavior |
|---|---|
| `AbsoluteThresholdFilter` | Fixed overall score cutoff |
| `PercentileFilter` | Keep top-k% of batch |
| `DimensionAwareFilter` | Per-dimension minimums — blocks quality masking |
| `CompositeFilter` | Logical AND of multiple filters |
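The core check behind DimensionAwareFilter is simple to state: a trajectory survives only if every dimension clears a floor. A minimal sketch of the idea follows; the function signature and the threshold value are assumptions, not the library's API:

```python
def dimension_aware_filter(dim_scores, min_per_dim=3.0):
    """Keep a trajectory only if every dimension clears its minimum.

    A high average can hide a single failing dimension; a per-dimension
    floor blocks that masking. Sketch of the idea, not adarubric's code.
    """
    return all(score >= min_per_dim for score in dim_scores.values())

# Average 4.0/5.0, but Error Handling fails the per-dimension floor:
traj = {"Correctness": 5.0, "Goal Alignment": 5.0,
        "Action Efficiency": 5.0, "Error Handling": 1.0}
print(dimension_aware_filter(traj))  # rejected despite a 4.0 mean
```

An absolute-threshold filter on the mean would accept this trajectory; the per-dimension check is what prevents a catastrophic Error Handling failure from slipping into the DPO training set.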
## Architecture

```
adarubric/
├── core/          # Data models, exceptions, types
├── llm/           # LLM client abstraction (OpenAI, vLLM)
├── generator/     # Dynamic rubric generation + prompts
├── evaluator/     # Trajectory evaluation + aggregation
├── filter/        # Composable filtering strategies
├── analysis/      # Reliability (Krippendorff's α) and consistency
├── io/            # Trajectory/evaluation serialization, DPO export
├── reward/        # Score scalers, step reward assignment, DPO pair generation
├── pipeline.py    # End-to-end orchestration
└── config.py      # Layered configuration
```
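The "layered configuration" in config.py can be illustrated with a generic defaults-then-file-then-environment merge. This is a sketch of the pattern only; the keys, the env-var prefix, and the precedence order are assumptions, not adarubric's actual API:

```python
import os

# Hypothetical defaults; adarubric's real keys may differ.
DEFAULTS = {"model": "gpt-4o", "num_dimensions": 4, "temperature": 0.0}

def load_config(file_overrides=None, env_prefix="ADARUBRIC_"):
    """Merge defaults < file < environment, later layers winning.

    file_overrides would typically come from a parsed YAML file.
    Illustrative sketch of layered configuration, not the library's code.
    """
    cfg = dict(DEFAULTS)
    cfg.update(file_overrides or {})
    for key in list(cfg):
        env_val = os.environ.get(env_prefix + key.upper())
        if env_val is not None:
            cfg[key] = type(cfg[key])(env_val)  # coerce to the default's type
    return cfg

print(load_config({"num_dimensions": 6}))
```

The benefit of this layering is that a checked-in YAML file can pin experiment settings while a deployment can still override individual values (e.g. the model name) without touching any file.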
Run the test suite:

```bash
pytest tests/ -v
```

## Related projects

- AgentHER — Hindsight Experience Replay for LLM agents: relabels failed trajectories into valid training data (SFT/DPO). Pairs naturally with AdaRubric — use AdaRubric to score relabelled trajectories and filter by quality before training.
- AgentSynth — Synthetic agent data pipeline (forward + back-translation, execution-based reject sampling). Score and filter synthesized trajectories with AdaRubric before training.
- trajectory_tokenization — ReAct with trajectory tokenization: compresses long (Thought, Action, Observation) histories for long-horizon tasks. Addresses context length; AdaRubric addresses trajectory quality.
## Citation

If you find AdaRubric useful, please cite:

```bibtex
@article{ding2025adarubric,
  title  = {AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation},
  author = {Liang Ding},
  year   = {2025},
  url    = {https://github.com/alphadl/AdaRubrics}
}
```

## License

Apache 2.0
