feat(benchmark): refresh model lineup and fix N/A scoring #17
Adds openai/gpt-5.4, openai/gpt-5.4-nano, openai/gpt-5.3-codex, openai/gpt-oss-120b, openai/gpt-oss-20b and anthropic/claude-opus-4.7 to the default benchmark matrix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Updates the GitHub Actions benchmark workflow to include newly released default LLMs in the benchmark matrix so they run both on manual dispatch and on PR-triggered runs.
Changes:
- Expanded the workflow_dispatch models input default to include 5 new OpenAI models and anthropic/claude-opus-4.7.
- Updated the matrix fallback model list to match the new default set.
| # yamllint disable-line rule:line-length | ||
| default: '["openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]' | ||
| default: '["openai/gpt-5.4","openai/gpt-5.4-nano","openai/gpt-5.3-codex","openai/gpt-oss-120b","openai/gpt-oss-20b","openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.7","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]' |
The # yamllint disable-line rule:line-length directive is on the line above, so it won’t suppress line-length warnings for the long default: value. If yamllint is used, move the directive onto the default: line (or use a yamllint disable/enable block) so the intended line is actually ignored.
| # yamllint disable-line rule:line-length | ||
| matrix: | ||
| model: ${{ fromJson(inputs.models || '["openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]') }} | ||
| model: ${{ fromJson(inputs.models || '["openai/gpt-5.4","openai/gpt-5.4-nano","openai/gpt-5.3-codex","openai/gpt-oss-120b","openai/gpt-oss-20b","openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.7","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]') }} |
Same as above: the yamllint disable-line comment is on line 30, but the long line that would trigger line-length is model: ... on line 32. If yamllint is run, this won’t have the intended effect; consider moving the directive onto the long line or wrapping this section in a yamllint disable/enable block.
| matrix: | ||
| model: ${{ fromJson(inputs.models || '["openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]') }} | ||
| model: ${{ fromJson(inputs.models || '["openai/gpt-5.4","openai/gpt-5.4-nano","openai/gpt-5.3-codex","openai/gpt-oss-120b","openai/gpt-oss-20b","openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.7","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]') }} |
The default model list is duplicated in two places (the workflow_dispatch input default and the matrix fallback). Now that the list is longer, it’s easier for these to drift out of sync. Consider factoring the JSON string into a YAML anchor/alias (or another single source of truth) so updates only need to be made once.
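Until the two lists share a single source, a small check can at least flag drift. The following is a minimal sketch, assuming both defaults appear in the workflow file as single-quoted JSON arrays (as in the diff above); the function names and the regex are hypothetical and not part of this repository.

```python
import json
import re

# Assumption: each model list is a single-quoted JSON array literal,
# e.g. '["openai/gpt-4o","openai/gpt-4o-mini"]'.
QUOTED_JSON_ARRAY = re.compile(r"'(\[[^']*\])'")


def find_model_lists(workflow_text):
    """Return every single-quoted JSON array in the text, parsed into a list."""
    return [json.loads(raw) for raw in QUOTED_JSON_ARRAY.findall(workflow_text)]


def lists_in_sync(workflow_text):
    """True when at least two lists are found and all of them are identical."""
    lists = find_model_lists(workflow_text)
    return len(lists) >= 2 and all(lst == lists[0] for lst in lists[1:])
```

Such a check could run in CI against the workflow file's text; it only detects drift and does not remove the duplication itself.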
The model is unavailable on OpenRouter and caused the matrix job to fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The benchmark and aggregate scripts still parsed LLM output with a local _strip_code_fences + json.loads pair, which fails when models wrap JSON in explanatory text or extra content. Production code switched to extract_json (regex-based) in #13; the benchmark scripts were missed, which caused most models to score N/A and rank with composite 0.0/10. Replace the local helper with reddit_digest.nodes.llm_utils.extract_json in both bench_model.py and aggregate.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
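The failure mode described above can be illustrated with a short sketch. Neither function below is the project's code: strip_fences_then_load approximates the old fence-stripping approach, and extract_json_sketch is a hypothetical stand-in for a tolerant extractor (the real reddit_digest.nodes.llm_utils.extract_json may work differently).

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable
FENCED = re.compile(r"^" + FENCE + r"(?:json)?\s*\n(.*?)\n" + FENCE + r"\s*$", re.DOTALL)


def strip_fences_then_load(text):
    """Old-style parsing: only works when the reply is nothing but (fenced) JSON."""
    stripped = text.strip()
    match = FENCED.match(stripped)
    if match:
        stripped = match.group(1)
    return json.loads(stripped)  # raises JSONDecodeError on any surrounding prose


def extract_json_sketch(text):
    """Tolerant parsing: return the first decodable JSON object found in the text."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue  # this brace did not start valid JSON; try the next one
        return obj
    return None  # no JSON object anywhere in the reply
```

With a reply such as "Sure! Here is the result:" followed by a fenced JSON object and a closing remark, the first function raises and the second returns the object, which is the difference between an N/A score and a usable one.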
The helper was inlined in bench_model.py / aggregate.py and was replaced by reddit_digest.nodes.llm_utils.extract_json. Drop the four obsolete tests that imported the removed function. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update the README with the run after #17 (5 new models, fixed JSON parsing). Recommend google/gemma-3-12b-it as the best quality/price self-hostable option (composite 0.9627 at $0.0009 per run, 100% JSON valid). Note anomalies: gpt-oss-120b summary at 0.0/10, gpt-4.1-nano 17% JSON, phi-4 67% JSON. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep README consistent — the rest of the file is in English. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mark Update the table to reflect the 25-model run, add gpt-5.4-nano and the Gemma 4-31B / 3-27B variants, and reframe the recommendation around gemma-3-12b-it as the best self-hostable option (no longer top-2 on composite, but still cheapest at $0.0009/run with 100% JSON validity). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Latest benchmark run is much more conclusive for the Gemma 3 family: gemma-3-27b-it tops the composite ranking (0.9819) and gemma-3-12b-it moves up to #2 (0.9781) with the same 7.9/10 summary quality. Three Gemma 3 variants plus the Gemma 4-31B fill 4 of the top 5 spots. Conclusion preserved: gemma-3-12b-it remains the recommended self-hostable default, with the same summary quality as the 27B at less than half the cost ($0.0005 vs $0.0012/run) and runnable on a single consumer GPU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Added openai/gpt-5.4, openai/gpt-5.4-nano, openai/gpt-oss-120b, openai/gpt-oss-20b and anthropic/claude-opus-4.7 to the default benchmark matrix.
- Removed openai/gpt-5.3-codex (unavailable on OpenRouter, the matrix job failed).
- bench_model.py and aggregate.py were still parsing LLM output with a local _strip_code_fences + json.loads pair. PR feat(collector): replace Reddit scraping with MCP server #13 switched the production scorer/summarizer to the regex-based extract_json helper, but the benchmark scripts were missed, so any model wrapping its JSON in extra text was scored as N/A and ranked with a composite of 0.0/10. Both scripts now use reddit_digest.nodes.llm_utils.extract_json.

Test plan
- LLM Benchmark workflow completes
- … (microsoft/phi-4, openai/gpt-4.1-nano)

🤖 Generated with Claude Code