Hi, thank you for the great work on SkillLearnBench! The benchmark and three-level evaluation framework are very well designed.
I noticed that the paper evaluates skill generation across six LLMs from two model families (Claude Haiku/Sonnet/Opus and Gemini Flash-Lite/Flash/Pro), and GPT-5-mini appears as the LLM judge for Level 1 & 2 metrics. However, OpenAI models don't appear as agent backbones for skill generation or task solving.
Interestingly, the codebase already includes a fully functional codex agent in agents/__init__.py:
"codex": {
"name": "Codex (OpenAI GPT)",
"env": ["OPENAI_API_KEY"],
"install": "npm install -g @openai/codex",
"run": 'codex exec --dangerously-bypass-approvals-and-sandbox ...',
...
}
And the introduction mentions that skills have become an open standard across platforms including OpenAI Codex — so it seems natural to include it in the evaluation.
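For concreteness, here is a minimal sketch of how we imagine an entry like the one above being consumed. The AGENTS dict is abridged from the snippet (the repo's actual "run" command has additional arguments elided as "..." above), and the run_agent helper is purely our assumption, not SkillLearnBench's actual harness code:

import os
import subprocess

# Abridged copy of the registry entry quoted above; the real "run" command
# in the repo includes additional arguments that are elided here.
AGENTS = {
    "codex": {
        "name": "Codex (OpenAI GPT)",
        "env": ["OPENAI_API_KEY"],
        "install": "npm install -g @openai/codex",
        "run": "codex exec --dangerously-bypass-approvals-and-sandbox",
    },
}

def run_agent(agent_id: str, prompt: str) -> str:
    """Hypothetical helper: check required env vars, then invoke the agent CLI."""
    spec = AGENTS[agent_id]
    missing = [var for var in spec["env"] if var not in os.environ]
    if missing:
        raise RuntimeError(f"{spec['name']} needs env vars: {missing}")
    # Pass the task prompt as the final CLI argument (our guess at how the
    # harness shells out to the registered command).
    result = subprocess.run(spec["run"].split() + [prompt],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Example: run_agent("codex", "Summarize the repository's skill format.")

If the harness really does drive agents through this registry, wiring in a GPT-family backbone looks like a small change rather than new infrastructure.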
Could you share why GPT-family models were not included as agent backbones in the main experiments? One possible reason we thought of: since GPT-5-mini is used as the LLM judge for skill quality metrics, including GPT as a skill-generation agent might introduce evaluation bias — the judge could inadvertently favor skills written in a style similar to its own outputs.
This would be helpful context for practitioners who primarily use OpenAI models and are wondering how well these methods transfer. Thanks in advance!