Merged
20 changes: 20 additions & 0 deletions docs/08-metrics-and-benchmarks.md
@@ -79,6 +79,26 @@ Task categories:
9. avoid rejected hypothesis;
10. verify diagnostic improvement.

## Agent Bench Lab handoff

`code-intel-kernel` owns the hypothesis, evidence contract, and expected workflow signal.
Agent Bench Lab owns benchmark task families, scorers, run records, and compare protocol.

Every Research Radar experiment proposal should map the idea to:

```text
hypothesis
expected_signal
candidate Agent Bench Lab suite or task family
public smoke check vs private holdout need
run-validity or harness blocker
baseline setup
candidate setup
comparison metric
```

If Agent Bench Lab cannot evaluate the idea yet, keep that as a benchmark-layer blocker. Do not implement the idea in `code-intel-kernel` merely because it is interesting.
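
The mapping above could be held as a small record so that an unmapped entry is easy to spot. This is only a sketch: the class name, the snake_case field names, and the `unmapped_entries` helper are hypothetical and do not come from `code-intel-kernel` or Agent Bench Lab. It also assumes every entry is filled in explicitly, for example with "none" when no harness blocker exists.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BenchHandoff:
    """Hypothetical record mirroring the eight mapping entries listed above."""

    hypothesis: str
    expected_signal: str
    suite_or_task_family: str
    public_smoke_vs_private_holdout: str
    run_validity_or_harness_blocker: str
    baseline_setup: str
    candidate_setup: str
    comparison_metric: str


def unmapped_entries(handoff: BenchHandoff) -> list[str]:
    """Return the names of entries that are empty or whitespace only."""
    return [
        f.name
        for f in fields(handoff)
        if not getattr(handoff, f.name).strip()
    ]
```

A proposal with a non-empty result from `unmapped_entries` would, under this sketch, go back to its author rather than forward to a prototype.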

## Execution-based evaluation

Prefer execution-based checks where possible:
7 changes: 5 additions & 2 deletions research-radar/README.md
@@ -11,8 +11,9 @@ core runtime paused
-> research-radar/
-> daily digest
-> human approval
-> experiment proposal
-> experiment proposal with Agent Bench Lab evaluation handoff
-> only then Codex prototype
-> Agent Bench Lab run/compare when the benchmark layer is ready
```

## Current Scope
@@ -22,7 +23,7 @@ Research Radar v0.1 tracks public sources that may affect `code-intel-kernel`:
- structural retrieval and repo intelligence;
- LSP diagnostics, references, and disambiguation;
- Tree-sitter and parser infrastructure;
- code intelligence benchmarks;
- code intelligence benchmarks and Agent Bench Lab evaluation handoff;
- Codebase-Memory, RIG/SPADE, SWE-bench, and adjacent systems.

The v0.1 scaffold is config and docs only. R2-A adds a bounded collector for reports/state only; it still does not modify runtime code or implement ideas.
@@ -38,6 +39,7 @@ The v0.1 scaffold is config and docs only. R2-A adds a bounded collector for rep
- `research-radar/reports/YYYY-MM-DD.json`
6. Do not modify source code.
7. Do not propose an implementation unless the item scores at least 85 and has an available artifact.
8. If an experiment candidate is proposed, state whether Agent Bench Lab can evaluate it, which suite or task family would be needed, and what benchmark-layer blockers remain.

For local manual runs, use dry-run first:

@@ -61,3 +63,4 @@ It must never modify runtime code, import external code, create prototypes, comm
## Output Rule

Daily output is candidate evidence. It cannot trigger code changes automatically.
Experiment candidates may define an Agent Bench Lab evaluation handoff, but they still require human approval before prototype work or benchmark repo changes.
4 changes: 4 additions & 0 deletions research-radar/codex-experiment-proposal-prompt.md
@@ -17,6 +17,9 @@ Required fields:
- minimal_reversible_change
- expected_signal
- evaluation_plan
- agent_bench_lab_fit
- agent_bench_lab_eval_handoff
- agent_bench_lab_blockers
- fixtures_or_benchmarks_needed
- contract_risk
- licensing_attribution_notes
@@ -25,3 +28,4 @@ Required fields:
- reason_not_to_implement_immediately

The proposal must explain why the experiment should remain separate from mainline feature work until approved.
The proposal must not assume Agent Bench Lab is complete. If the benchmark layer cannot evaluate the idea yet, record the blocker instead of converting the idea into implementation work.
4 changes: 4 additions & 0 deletions research-radar/guardrails.md
@@ -16,6 +16,7 @@ Research Radar is intake, not implementation.
- Treat external repositories as research input, not dependencies, unless separately approved.
- Generated experiment proposals must include a stop condition.
- Generated experiment proposals must include a reason not to implement immediately.
- Generated experiment proposals must state whether Agent Bench Lab can evaluate the expected signal, or why it cannot yet.
- Codex App scheduled automation may write only `research-radar/reports/**` and `research-radar/state/**`.
- Codex App scheduled automation must fail instead of continuing if runtime or configuration files change unexpectedly.
- Codex App scheduled automation must not commit automatically.
@@ -28,6 +29,8 @@ Research Radar is intake, not implementation.
- Do not use generated code from papers or repos without attribution and license review.
- Do not turn Research Radar into `where-to-edit`, roadmap automation, or a repo-owned scheduler.
- Do not use Research Radar automation to create patches, code-intelligence features, or runtime changes.
- Do not treat Agent Bench Lab as a `code-intel-kernel` runtime dependency.
- Do not edit, run, publish, or create tasks in Agent Bench Lab from a Research Radar item without separate human approval.

## Human Approval Gate

@@ -39,6 +42,7 @@ A daily digest may propose one experiment candidate only when:
- licensing status is recorded;
- security concerns are recorded;
- minimal reversible change is clear;
- Agent Bench Lab fit or blocker is clear;
- stop condition is clear.

Even then, implementation requires explicit human approval.
2 changes: 2 additions & 0 deletions research-radar/scoring.md
@@ -12,6 +12,7 @@ Use these dimensions as a checklist. Weights are intentionally rough for v0.1.
- Evidence quality: Are claims backed by code, data, benchmarks, or clear methodology?
- Source credibility: Is the source a known lab, maintained repo, benchmark, or peer-reviewed venue?
- Reproducibility: Can the result be checked locally without private access or fragile services?
- Agent Bench Lab fit: Can the expected signal be evaluated through an existing or clearly proposed Agent Bench Lab suite, task family, scorer, or compare protocol?
- Local-first fit: Can the idea work without hosted dependencies or login-gated APIs?
- Rust/Rust-compatible feasibility: Does it fit a Rust-first kernel or expose a clean protocol/data boundary?
- Safety/security risk: Does it avoid unsafe scraping, untrusted code execution, or unclear licensing?
@@ -33,3 +34,4 @@ Score does not override guardrails.
- Do not copy code without license review.
- Do not scrape login-gated or restricted sources unless explicitly configured.
- Do not treat benchmark claims as validated until locally reviewed.
- Do not raise an item to experiment proposal if the Agent Bench Lab evaluation path is unknown and no explicit benchmark-layer blocker is recorded.
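
The last rule, combined with the README threshold of 85 and the available-artifact requirement, might be checked along these lines. This is a sketch under stated assumptions: the function name and its parameters are hypothetical, and a `True` result only means a proposal may be drafted. Human approval is still required before any implementation.

```python
PROPOSAL_THRESHOLD = 85  # "scores at least 85", per research-radar/README.md


def may_raise_to_proposal(
    score: int,
    has_artifact: bool,
    eval_handoff: str,
    blockers: str,
) -> bool:
    """Hypothetical gate: True only if the item may become an experiment proposal.

    The item needs a qualifying score and an available artifact, plus either
    a known Agent Bench Lab evaluation path or an explicitly recorded
    benchmark-layer blocker.
    """
    if score < PROPOSAL_THRESHOLD or not has_artifact:
        return False
    path_known = bool(eval_handoff.strip())
    blocker_recorded = bool(blockers.strip())
    return path_known or blocker_recorded
```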
3 changes: 3 additions & 0 deletions research-radar/templates/digest.json
@@ -54,6 +54,9 @@
"hypothesis": "",
"minimal_reversible_change": "",
"evaluation_plan": "",
"agent_bench_lab_fit": "",
"agent_bench_lab_eval_handoff": "",
"agent_bench_lab_blockers": "",
Comment on lines +57 to +59

**P2:** Ensure the collector emits the new handoff fields

When `run_daily.py` promotes any item with score >= 85, `build_experiment_candidate()` still returns only the old keys (`title`, `source_url`, `source_type`, `hypothesis`, `minimal_reversible_change`, `evaluation_plan`, `stop_condition`, and `reason_not_to_implement_immediately`). With these new required template fields, automated daily JSON/Markdown reports for that scenario will violate the updated Research Radar contract and omit the Agent Bench Lab fit/blocker that the guardrails now require. The generator should be updated alongside the template.
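
One possible shape for that generator update is sketched below. The real `build_experiment_candidate()` is not shown in this diff, so `with_handoff_fields`, `NEW_HANDOFF_KEYS`, and the placeholder blocker text are all hypothetical. The sketch wraps whatever dict the existing function returns instead of guessing at its signature.

```python
# Hypothetical helper; the actual code in run_daily.py is not part of this diff.
NEW_HANDOFF_KEYS = (
    "agent_bench_lab_fit",
    "agent_bench_lab_eval_handoff",
    "agent_bench_lab_blockers",
)

# Assumed placeholder wording, not taken from the repository.
UNASSESSED_BLOCKER = "Agent Bench Lab evaluation path not yet assessed"


def with_handoff_fields(candidate: dict) -> dict:
    """Return a copy of the candidate that always carries the three new keys.

    If neither an evaluation handoff nor a blocker is present, record an
    explicit blocker so the report never silently omits the handoff status.
    """
    out = dict(candidate)
    for key in NEW_HANDOFF_KEYS:
        out.setdefault(key, "")
    has_handoff = bool(out["agent_bench_lab_eval_handoff"].strip())
    has_blocker = bool(out["agent_bench_lab_blockers"].strip())
    if not has_handoff and not has_blocker:
        out["agent_bench_lab_blockers"] = UNASSESSED_BLOCKER
    return out
```

Defaulting to an explicit blocker rather than an empty string is a design choice in this sketch. It keeps automated reports consistent with the guardrail that fit or blocker must be stated, and leaves the real assessment to a human.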


"stop_condition": "",
"reason_not_to_implement_immediately": ""
},
3 changes: 3 additions & 0 deletions research-radar/templates/digest.md
@@ -30,6 +30,9 @@ Date: `YYYY-MM-DD`
- Hypothesis:
- Minimal reversible change:
- Evaluation plan:
- Agent Bench Lab fit:
- Agent Bench Lab evaluation handoff:
- Agent Bench Lab blockers:
- Stop condition:
- Reason not to implement immediately:

6 changes: 6 additions & 0 deletions research-radar/templates/experiment-proposal.md
@@ -16,6 +16,12 @@

## Evaluation Plan

## Agent Bench Lab Fit

## Agent Bench Lab Evaluation Handoff

## Agent Bench Lab Blockers

## Fixtures or Benchmarks Needed

## Contract Risk