From 70f5b018d5bfea1d4e2b1707fd64c356771865b3 Mon Sep 17 00:00:00 2001
From: "Vitaly D."
Date: Mon, 8 Jun 2026 13:47:12 +0300
Subject: [PATCH] docs(research): add Agent Bench Lab evaluation handoff

---
 docs/08-metrics-and-benchmarks.md             | 20 +++++++++++++++++++
 research-radar/README.md                      |  7 +++++--
 .../codex-experiment-proposal-prompt.md       |  4 ++++
 research-radar/guardrails.md                  |  4 ++++
 research-radar/scoring.md                     |  2 ++
 research-radar/templates/digest.json          |  3 +++
 research-radar/templates/digest.md            |  3 +++
 .../templates/experiment-proposal.md          |  6 ++++++
 8 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/docs/08-metrics-and-benchmarks.md b/docs/08-metrics-and-benchmarks.md
index 5105fb3..d75b114 100644
--- a/docs/08-metrics-and-benchmarks.md
+++ b/docs/08-metrics-and-benchmarks.md
@@ -79,6 +79,26 @@ Task categories:
 9. avoid rejected hypothesis;
 10. verify diagnostic improvement.
 
+## Agent Bench Lab handoff
+
+`code-intel-kernel` owns the hypothesis, evidence contract, and expected workflow signal.
+Agent Bench Lab owns benchmark task families, scorers, run records, and compare protocol.
+
+Every Research Radar experiment proposal should map the idea to:
+
+```text
+hypothesis
+expected_signal
+candidate Agent Bench Lab suite or task family
+public smoke check vs private holdout need
+run-validity or harness blocker
+baseline setup
+candidate setup
+comparison metric
+```
+
+If Agent Bench Lab cannot evaluate the idea yet, keep that as a benchmark-layer blocker. Do not implement the idea in `code-intel-kernel` merely because it is interesting.
+
 ## Execution-based evaluation
 
 Prefer execution-based checks where possible:
diff --git a/research-radar/README.md b/research-radar/README.md
index 59c6e05..8ba30a9 100644
--- a/research-radar/README.md
+++ b/research-radar/README.md
@@ -11,8 +11,9 @@ core runtime paused
   -> research-radar/
   -> daily digest
   -> human approval
-  -> experiment proposal
+  -> experiment proposal with Agent Bench Lab evaluation handoff
   -> only then Codex prototype
+  -> Agent Bench Lab run/compare when the benchmark layer is ready
 ```
 
 ## Current Scope
@@ -22,7 +23,7 @@ Research Radar v0.1 tracks public sources that may affect `code-intel-kernel`:
 - structural retrieval and repo intelligence;
 - LSP diagnostics, references, and disambiguation;
 - Tree-sitter and parser infrastructure;
-- code intelligence benchmarks;
+- code intelligence benchmarks and Agent Bench Lab evaluation handoff;
 - Codebase-Memory, RIG/SPADE, SWE-bench, and adjacent systems.
 
 The v0.1 scaffold is config and docs only. R2-A adds a bounded collector for reports/state only; it still does not modify runtime code or implement ideas.
@@ -38,6 +39,7 @@ The v0.1 scaffold is config and docs only. R2-A adds a bounded collector for rep
    - `research-radar/reports/YYYY-MM-DD.json`
 6. Do not modify source code.
 7. Do not propose an implementation unless the item scores at least 85 and has an available artifact.
+8. If an experiment candidate is proposed, state whether Agent Bench Lab can evaluate it, which suite or task family would be needed, and what benchmark-layer blockers remain.
 
 For local manual runs, use dry-run first:
 
@@ -61,3 +63,4 @@ It must never modify runtime code, import external code, create prototypes, comm
 ## Output Rule
 
 Daily output is candidate evidence. It cannot trigger code changes automatically.
+Experiment candidates may define an Agent Bench Lab evaluation handoff, but they still require human approval before prototype work or benchmark repo changes.
diff --git a/research-radar/codex-experiment-proposal-prompt.md b/research-radar/codex-experiment-proposal-prompt.md
index 9d71b02..568f52a 100644
--- a/research-radar/codex-experiment-proposal-prompt.md
+++ b/research-radar/codex-experiment-proposal-prompt.md
@@ -17,6 +17,9 @@ Required fields:
 - minimal_reversible_change
 - expected_signal
 - evaluation_plan
+- agent_bench_lab_fit
+- agent_bench_lab_eval_handoff
+- agent_bench_lab_blockers
 - fixtures_or_benchmarks_needed
 - contract_risk
 - licensing_attribution_notes
@@ -25,3 +28,4 @@ Required fields:
 - reason_not_to_implement_immediately
 
 The proposal must explain why the experiment should remain separate from mainline feature work until approved.
+The proposal must not assume Agent Bench Lab is complete. If the benchmark layer cannot evaluate the idea yet, record the blocker instead of converting the idea into implementation work.
diff --git a/research-radar/guardrails.md b/research-radar/guardrails.md
index dac5644..af8a400 100644
--- a/research-radar/guardrails.md
+++ b/research-radar/guardrails.md
@@ -16,6 +16,7 @@ Research Radar is intake, not implementation.
 - Treat external repositories as research input, not dependencies, unless separately approved.
 - Generated experiment proposals must include a stop condition.
 - Generated experiment proposals must include a reason not to implement immediately.
+- Generated experiment proposals must state whether Agent Bench Lab can evaluate the expected signal, or why it cannot yet.
 - Codex App scheduled automation may write only `research-radar/reports/**` and `research-radar/state/**`.
 - Codex App scheduled automation must fail instead of continuing if runtime or configuration files change unexpectedly.
 - Codex App scheduled automation must not commit automatically.
@@ -28,6 +29,8 @@ Research Radar is intake, not implementation.
 - Do not use generated code from papers or repos without attribution and license review.
 - Do not turn Research Radar into `where-to-edit`, roadmap automation, or a repo-owned scheduler.
 - Do not use Research Radar automation to create patches, code-intelligence features, or runtime changes.
+- Do not treat Agent Bench Lab as a `code-intel-kernel` runtime dependency.
+- Do not edit, run, publish, or create tasks in Agent Bench Lab from a Research Radar item without separate human approval.
 
 ## Human Approval Gate
 
@@ -39,6 +42,7 @@ A daily digest may propose one experiment candidate only when:
 - licensing status is recorded;
 - security concerns are recorded;
 - minimal reversible change is clear;
+- Agent Bench Lab fit or blocker is clear;
 - stop condition is clear.
 
 Even then, implementation requires explicit human approval.
diff --git a/research-radar/scoring.md b/research-radar/scoring.md
index c19cdc9..456a566 100644
--- a/research-radar/scoring.md
+++ b/research-radar/scoring.md
@@ -12,6 +12,7 @@ Use these dimensions as a checklist. Weights are intentionally rough for v0.1.
 - Evidence quality: Are claims backed by code, data, benchmarks, or clear methodology?
 - Source credibility: Is the source a known lab, maintained repo, benchmark, or peer-reviewed venue?
 - Reproducibility: Can the result be checked locally without private access or fragile services?
+- Agent Bench Lab fit: Can the expected signal be evaluated through an existing or clearly proposed Agent Bench Lab suite, task family, scorer, or compare protocol?
 - Local-first fit: Can the idea work without hosted dependencies or login-gated APIs?
 - Rust/Rust-compatible feasibility: Does it fit a Rust-first kernel or expose a clean protocol/data boundary?
 - Safety/security risk: Does it avoid unsafe scraping, untrusted code execution, or unclear licensing?
@@ -33,3 +34,4 @@ Score does not override guardrails.
 - Do not copy code without license review.
 - Do not scrape login-gated or restricted sources unless explicitly configured.
 - Do not treat benchmark claims as validated until locally reviewed.
+- Do not raise an item to experiment proposal if the Agent Bench Lab evaluation path is unknown and no explicit benchmark-layer blocker is recorded.
diff --git a/research-radar/templates/digest.json b/research-radar/templates/digest.json
index e5b1b93..220f31a 100644
--- a/research-radar/templates/digest.json
+++ b/research-radar/templates/digest.json
@@ -54,6 +54,9 @@
     "hypothesis": "",
     "minimal_reversible_change": "",
     "evaluation_plan": "",
+    "agent_bench_lab_fit": "",
+    "agent_bench_lab_eval_handoff": "",
+    "agent_bench_lab_blockers": "",
     "stop_condition": "",
     "reason_not_to_implement_immediately": ""
   },
diff --git a/research-radar/templates/digest.md b/research-radar/templates/digest.md
index c3d920c..8c9fdca 100644
--- a/research-radar/templates/digest.md
+++ b/research-radar/templates/digest.md
@@ -30,6 +30,9 @@ Date: `YYYY-MM-DD`
 - Hypothesis:
 - Minimal reversible change:
 - Evaluation plan:
+- Agent Bench Lab fit:
+- Agent Bench Lab evaluation handoff:
+- Agent Bench Lab blockers:
 - Stop condition:
 - Reason not to implement immediately:
 
diff --git a/research-radar/templates/experiment-proposal.md b/research-radar/templates/experiment-proposal.md
index d091c63..2f0e3b2 100644
--- a/research-radar/templates/experiment-proposal.md
+++ b/research-radar/templates/experiment-proposal.md
@@ -16,6 +16,12 @@
 
 ## Evaluation Plan
 
+## Agent Bench Lab Fit
+
+## Agent Bench Lab Evaluation Handoff
+
+## Agent Bench Lab Blockers
+
 ## Fixtures or Benchmarks Needed
 
 ## Contract Risk
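
Reviewer note (not part of the patch): the approval-gate rule "Agent Bench Lab fit or blocker is clear" could be checked mechanically on a digest experiment candidate. The following is a minimal sketch only. The three field names come from `research-radar/templates/digest.json` in this patch; the function name `handoff_problems`, the reading of "clear" as "non-empty string", and the rule that a stated fit needs a handoff description are assumptions for illustration, not something the patch defines.

```python
"""Sketch: check the Agent Bench Lab handoff fields on one experiment candidate."""

# Field names taken from research-radar/templates/digest.json in this patch.
HANDOFF_FIELDS = (
    "agent_bench_lab_fit",
    "agent_bench_lab_eval_handoff",
    "agent_bench_lab_blockers",
)


def handoff_problems(candidate: dict) -> list[str]:
    """Return human-readable problems; an empty list means the handoff is stated.

    Hypothetical helper: the patch does not ship any validator.
    """
    problems = [f"missing field: {name}" for name in HANDOFF_FIELDS if name not in candidate]
    if problems:
        return problems

    fit = str(candidate["agent_bench_lab_fit"]).strip()
    handoff = str(candidate["agent_bench_lab_eval_handoff"]).strip()
    blockers = str(candidate["agent_bench_lab_blockers"]).strip()

    # Assumed reading of "fit or blocker is clear": at least one is non-empty.
    if not fit and not blockers:
        problems.append("neither agent_bench_lab_fit nor agent_bench_lab_blockers is filled in")
    # Assumed: a claimed fit should say how the evaluation is handed off.
    if fit and not handoff:
        problems.append("agent_bench_lab_fit is set but agent_bench_lab_eval_handoff is empty")
    return problems


if __name__ == "__main__":
    example = {
        "agent_bench_lab_fit": "",
        "agent_bench_lab_eval_handoff": "",
        "agent_bench_lab_blockers": "no scorer for this signal yet",
    }
    # A recorded blocker alone satisfies the gate under the assumptions above.
    print(handoff_problems(example))
```

A check like this would only report problems for a human reviewer; consistent with the guardrails in the patch, it would not approve, implement, or touch Agent Bench Lab itself.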