Hi, thank you for the great work on SkillLearnBench! The benchmark and three-level evaluation framework are very well designed.
I noticed that the paper evaluates skill generation across six LLMs from two model families (Claude Haiku/Sonnet/Opus and Gemini Flash-Lite/Flash/Pro), and GPT-5-mini appears as the LLM judge for Level 1 & 2 metrics. However, OpenAI models don't appear as agent backbones for skill generation or task solving.
Interestingly, the codebase already includes a fully functional codex agent in agents/__init__.py:
"codex": {
"name": "Codex (OpenAI GPT)",
"env": ["OPENAI_API_KEY"],
"install": "npm install -g @openai/codex",
"run": 'codex exec --dangerously-bypass-approvals-and-sandbox ...',
...
}
And the introduction mentions that skills have become an open standard across platforms including OpenAI Codex — so it seems natural to include it in the evaluation.
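For concreteness, here is a minimal sketch of how we imagine an entry like the one above being consumed. The AGENTS dict is abridged from the snippet (the repo's actual "run" command has additional arguments elided as "..." above), and the run_agent helper is purely our assumption, not SkillLearnBench's actual harness code:

import os
import subprocess

# Abridged copy of the registry entry quoted above; the real "run" command
# in the repo includes additional arguments that are elided here.
AGENTS = {
    "codex": {
        "name": "Codex (OpenAI GPT)",
        "env": ["OPENAI_API_KEY"],
        "install": "npm install -g @openai/codex",
        "run": "codex exec --dangerously-bypass-approvals-and-sandbox",
    },
}

def run_agent(agent_id: str, prompt: str) -> str:
    """Hypothetical helper: check required env vars, then invoke the agent CLI."""
    spec = AGENTS[agent_id]
    missing = [var for var in spec["env"] if var not in os.environ]
    if missing:
        raise RuntimeError(f"{spec['name']} needs env vars: {missing}")
    # Pass the task prompt as the final CLI argument (our guess at how the
    # harness shells out to the registered command).
    result = subprocess.run(spec["run"].split() + [prompt],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Example: run_agent("codex", "Summarize the repository's skill format.")

If the harness really does drive agents through this registry, wiring in a GPT-family backbone looks like a small change rather than new infrastructure.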
Could you share why GPT-family models were not included as agent backbones in the main experiments? One possible reason we thought of: since GPT-5-mini is used as the LLM judge for skill quality metrics, including GPT as a skill-generation agent might introduce evaluation bias — the judge could inadvertently favor skills written in a style similar to its own outputs.
This would be helpful context for practitioners who primarily use OpenAI models and are wondering how well these methods transfer. Thanks in advance!