docs(research): add Agent Bench Lab evaluation handoff #5

Merged: t3chn merged 1 commit into main from codex/research-agent-bench-handoff on Jun 8, 2026

Conversation

@t3chn t3chn commented Jun 8, 2026

Contributor

Summary

  • Add Agent Bench Lab fit, evaluation handoff, and blocker fields to Research Radar experiment proposal templates.
  • Clarify that Radar output remains intake-only and cannot trigger Agent Bench Lab repo changes without approval.
  • Document how code-intel-kernel R&D hypotheses should map to Agent Bench Lab suites, scorers, run validity, and compare metrics.

Issue

  • N/A

Test plan

  • python3 -m json.tool research-radar/templates/digest.json >/dev/null
  • git diff --check
  • scripts/run-deterministic-tests.sh

@t3chn t3chn marked this pull request as ready for review June 8, 2026 10:49
@t3chn t3chn merged commit 0454071 into main Jun 8, 2026
1 check passed
@t3chn t3chn deleted the codex/research-agent-bench-handoff branch June 8, 2026 10:49

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 70f5b018d5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +57 to +59
"agent_bench_lab_fit": "",
"agent_bench_lab_eval_handoff": "",
"agent_bench_lab_blockers": "",

[P2] Ensure the collector emits the new handoff fields

When run_daily.py promotes any item with score >= 85, build_experiment_candidate() still returns only the old keys (title, source_url, source_type, hypothesis, minimal_reversible_change, evaluation_plan, stop_condition, and reason_not_to_implement_immediately). Because the template now requires the new fields, automated daily JSON/Markdown reports for that scenario will violate the updated Research Radar contract and omit the Agent Bench Lab fit and blocker fields that the guardrails now require. The generator should be updated alongside the template.
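
One possible shape for that fix is sketched below. This is a minimal sketch, not the repository's actual code: the real signature and body of build_experiment_candidate() are not shown in this PR, so the `item` argument, the empty-string defaults, and the `missing_handoff_keys` helper are assumptions made for illustration. Only the key names come from the review comment and the template diff.

```python
from __future__ import annotations

from typing import Any

# Keys the review comment lists as the existing candidate contract.
LEGACY_KEYS = (
    "title",
    "source_url",
    "source_type",
    "hypothesis",
    "minimal_reversible_change",
    "evaluation_plan",
    "stop_condition",
    "reason_not_to_implement_immediately",
)

# Keys added to the template in this PR (diff lines +57 to +59).
HANDOFF_KEYS = (
    "agent_bench_lab_fit",
    "agent_bench_lab_eval_handoff",
    "agent_bench_lab_blockers",
)


def build_experiment_candidate(item: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in: copy contract keys from `item`, defaulting to ""."""
    candidate = {key: item.get(key, "") for key in LEGACY_KEYS}
    # Always emit the handoff fields so reports match the updated template,
    # even when the collector has nothing to say about them yet.
    for key in HANDOFF_KEYS:
        candidate[key] = item.get(key, "")
    return candidate


def missing_handoff_keys(candidate: dict[str, Any]) -> list[str]:
    """Hypothetical helper: list the handoff keys absent from a candidate."""
    return [key for key in HANDOFF_KEYS if key not in candidate]
```

A check like `missing_handoff_keys(build_experiment_candidate(item)) == []` could sit in the deterministic test script so that the template and the generator cannot drift apart unnoticed.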

Useful? React with 👍 / 👎.
