
feat(benchmark): refresh model lineup and fix N/A scoring #17

Merged
using-system merged 8 commits into main from chore/benchmark-add-new-models
Apr 17, 2026

Conversation

using-system (Owner) commented Apr 17, 2026

Summary

  • Add new OpenAI and Anthropic models to the default benchmark matrix: openai/gpt-5.4, openai/gpt-5.4-nano, openai/gpt-oss-120b, openai/gpt-oss-20b, anthropic/claude-opus-4.7
  • Drop openai/gpt-5.3-codex (unavailable on OpenRouter, the matrix job failed)
  • Fix the N/A scoring regression: bench_model.py and aggregate.py were still parsing LLM output with a local _strip_code_fences + json.loads pair. PR #13 (feat(collector): replace Reddit scraping with MCP server) switched the production scorer/summarizer to the regex-based extract_json helper, but the benchmark scripts were missed, so any model that wrapped its JSON in extra text was scored as N/A and ranked with a composite of 0.0/10. Both scripts now use reddit_digest.nodes.llm_utils.extract_json.
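
To illustrate the failure mode described in the last bullet, here is a minimal sketch of the two parsing strategies. Both functions are hypothetical stand-ins written for this example: `strip_fences_then_load` only approximates the removed `_strip_code_fences` + `json.loads` pair, and `extract_json_sketch` is not the actual `reddit_digest.nodes.llm_utils.extract_json` implementation, whose regex and return type may differ.

```python
import json
import re


def strip_fences_then_load(text: str):
    """Hypothetical stand-in for the old helper pair: remove a leading and
    trailing Markdown code fence, then parse whatever is left."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    return json.loads(cleaned)


def extract_json_sketch(text: str):
    """Hypothetical regex-based extractor (an assumption, not the real helper):
    grab the span from the first '{' to the last '}' and parse it."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


# A reply that wraps its JSON in explanatory text, as some models do.
reply = 'Here is the score you asked for:\n```json\n{"score": 7}\n```\nHope this helps!'

try:
    strip_fences_then_load(reply)
    old_result = "parsed"
except json.JSONDecodeError:
    # The fence is not at the start or end of the reply, so nothing is
    # stripped and json.loads rejects the surrounding prose.
    old_result = "N/A"

new_result = extract_json_sketch(reply)
```

With this reply, the fence-stripping path raises and the model would be recorded as N/A, while the regex path recovers the object. The greedy first-brace-to-last-brace match is a simplification; it would mis-handle a reply that contains several separate JSON objects.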

Test plan

  • PR-triggered LLM Benchmark workflow completes
  • Aggregate report shows real scores and summary judge ratings (not 0.0/10) for the previously-failing models (e.g. microsoft/phi-4, openai/gpt-4.1-nano)

🤖 Generated with Claude Code

Adds openai/gpt-5.4, openai/gpt-5.4-nano, openai/gpt-5.3-codex,
openai/gpt-oss-120b, openai/gpt-oss-20b and anthropic/claude-opus-4.7
to the default benchmark matrix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 17, 2026 08:31

Copilot AI left a comment


Pull request overview

Updates the GitHub Actions benchmark workflow to include newly released default LLMs in the benchmark matrix so they run both on manual dispatch and on PR-triggered runs.

Changes:

  • Expanded the workflow_dispatch models input default to include 5 new OpenAI models and anthropic/claude-opus-4.7.
  • Updated the matrix fallback model list to match the new default set.


Comment thread .github/workflows/benchmark.yml Outdated
Comment on lines +9 to +10
# yamllint disable-line rule:line-length
- default: '["openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]'
+ default: '["openai/gpt-5.4","openai/gpt-5.4-nano","openai/gpt-5.3-codex","openai/gpt-oss-120b","openai/gpt-oss-20b","openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.7","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]'

Copilot AI Apr 17, 2026


The # yamllint disable-line rule:line-length directive is on the line above, so it won’t suppress line-length warnings for the long default: value. If yamllint is used, move the directive onto the default: line (or use a yamllint disable/enable block) so the intended line is actually ignored.

Comment thread .github/workflows/benchmark.yml Outdated
Comment on lines +30 to +32
# yamllint disable-line rule:line-length
matrix:
- model: ${{ fromJson(inputs.models || '["openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]') }}
+ model: ${{ fromJson(inputs.models || '["openai/gpt-5.4","openai/gpt-5.4-nano","openai/gpt-5.3-codex","openai/gpt-oss-120b","openai/gpt-oss-20b","openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.7","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]') }}

Copilot AI Apr 17, 2026


Same as above: the yamllint disable-line comment is on line 30, but the long line that would trigger line-length is model: ... on line 32. If yamllint is run, this won’t have the intended effect; consider moving the directive onto the long line or wrapping this section in a yamllint disable/enable block.

Comment thread .github/workflows/benchmark.yml Outdated
Comment on lines +31 to +32
matrix:
- model: ${{ fromJson(inputs.models || '["openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]') }}
+ model: ${{ fromJson(inputs.models || '["openai/gpt-5.4","openai/gpt-5.4-nano","openai/gpt-5.3-codex","openai/gpt-oss-120b","openai/gpt-oss-20b","openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.7","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]') }}

Copilot AI Apr 17, 2026


The default model list is duplicated in two places (the workflow_dispatch input default and the matrix fallback). Now that the list is longer, it’s easier for these to drift out of sync. Consider factoring the JSON string into a YAML anchor/alias (or another single source of truth) so updates only need to be made once.
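
One way to guard against the drift described above, short of restructuring the workflow, is a small check that compares the two embedded lists. The sketch below is an assumption-laden illustration rather than anything in this PR: the function names are hypothetical, the `bench` job name in the sample is made up, and it assumes each list is a single-quoted JSON array on one line, as in the snippets quoted in this thread.

```python
import json
import re


def model_lists(workflow_text: str) -> list[list[str]]:
    """Return every single-quoted JSON array found in the workflow text.

    Assumption: each model list is written as '[...]' on a single line.
    """
    return [json.loads(raw) for raw in re.findall(r"'(\[.*?\])'", workflow_text)]


def lists_in_sync(workflow_text: str) -> bool:
    """True when at least two lists are present and all of them are identical."""
    lists = model_lists(workflow_text)
    return len(lists) >= 2 and all(lst == lists[0] for lst in lists)


# Shortened, hypothetical workflow text mirroring the two locations:
# the workflow_dispatch input default and the matrix fallback.
sample = """
on:
  workflow_dispatch:
    inputs:
      models:
        default: '["openai/gpt-4o","microsoft/phi-4"]'
jobs:
  bench:
    strategy:
      matrix:
        model: ${{ fromJson(inputs.models || '["openai/gpt-4o","microsoft/phi-4"]') }}
"""

# Simulate drift: only the matrix fallback gains a model.
drifted = sample.replace(
    '"microsoft/phi-4"]\') }}',
    '"microsoft/phi-4","x-ai/grok-3-mini"]\') }}',
)

in_sync = lists_in_sync(sample)
in_sync_after_drift = lists_in_sync(drifted)
```

Such a check could run as a test or pre-commit step against `.github/workflows/benchmark.yml`; it does not remove the duplication, it only makes a mismatch fail loudly.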

using-system and others added 2 commits April 17, 2026 10:44
The model is unavailable on OpenRouter and caused the matrix job to fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The benchmark and aggregate scripts still parsed LLM output with a local
_strip_code_fences + json.loads pair, which fails when models wrap JSON
in explanatory text or extra content. Production code switched to
extract_json (regex-based) in #13; the benchmark scripts were missed,
which caused most models to score N/A and rank with composite 0.0/10.

Replace the local helper with reddit_digest.nodes.llm_utils.extract_json
in both bench_model.py and aggregate.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
using-system changed the title from "chore(benchmark): add latest OpenAI and Anthropic models to defaults" to "feat(benchmark): refresh model lineup and fix N/A scoring" Apr 17, 2026
using-system and others added 5 commits April 17, 2026 10:46
The helper was inlined in bench_model.py / aggregate.py and was replaced
by reddit_digest.nodes.llm_utils.extract_json. Drop the four obsolete
tests that imported the removed function.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update the README with the run after #17 (5 new models, fixed JSON
parsing). Recommend google/gemma-3-12b-it as the best quality/price
self-hostable option (composite 0.9627 at $0.0009 per run, 100% JSON
valid). Note anomalies: gpt-oss-120b summary at 0.0/10, gpt-4.1-nano
17% JSON, phi-4 67% JSON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep README consistent — the rest of the file is in English.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mark

Update the table to reflect the 25-model run, add gpt-5.4-nano and the
Gemma 4-31B / 3-27B variants, and reframe the recommendation around
gemma-3-12b-it as the best self-hostable option (no longer top-2 on
composite, but still cheapest at $0.0009/run with 100% JSON validity).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Latest benchmark run is much more conclusive for the Gemma 3 family —
gemma-3-27b-it tops the composite ranking (0.9819) and gemma-3-12b-it
moves up to #2 (0.9781) with the same 7.9/10 summary quality. Three
Gemma 3 variants plus the Gemma 4-31B fill 4 of the top 5 spots.

Conclusion preserved: gemma-3-12b-it remains the recommended
self-hostable default — same summary quality as the 27B at less than
half the cost ($0.0005 vs $0.0012/run) and runnable on a single
consumer GPU.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
using-system merged commit f243ebe into main Apr 17, 2026
8 checks passed
using-system deleted the chore/benchmark-add-new-models branch April 17, 2026 09:25