feat(benchmark): refresh model lineup and fix N/A scoring by using-system · Pull Request #17 · using-system/reddit-digest-agent

using-system · 2026-04-17T08:31:31Z

Summary

Add new OpenAI and Anthropic models to the default benchmark matrix: openai/gpt-5.4, openai/gpt-5.4-nano, openai/gpt-oss-120b, openai/gpt-oss-20b, anthropic/claude-opus-4.7
Drop openai/gpt-5.3-codex (unavailable on OpenRouter, the matrix job failed)
Fix the N/A scoring regression: bench_model.py and aggregate.py were still parsing LLM output with a local _strip_code_fences + json.loads pair. PR feat(collector): replace Reddit scraping with MCP server #13 switched the production scorer/summarizer to the regex-based extract_json helper but the benchmark scripts were missed, so any model wrapping its JSON in extra text was scored as N/A and ranked with a composite of 0.0/10. Both scripts now use reddit_digest.nodes.llm_utils.extract_json.
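The failure mode described in the last item can be reproduced with a minimal sketch. This is not the project's `extract_json` implementation, which the PR only describes as regex-based. `strip_fences_and_load` and `extract_first_json_object` are hypothetical names, and the tolerant parser here is a stand-in that uses a regex to find candidate `{` positions and `json.JSONDecoder.raw_decode` to decode from each one.

```python
import json
import re

# Three backticks, built programmatically so this example stays fence-safe.
FENCE = "`" * 3
FENCE_RE = re.compile(
    r"^" + FENCE + r"(?:json)?\s*\n(.*?)\n" + FENCE + r"\s*$", re.DOTALL
)


def strip_fences_and_load(text: str):
    """Strict parsing (hypothetical): the reply must be JSON, optionally fenced."""
    stripped = text.strip()
    match = FENCE_RE.match(stripped)
    if match:
        stripped = match.group(1)
    # Raises json.JSONDecodeError if any prose surrounds the JSON.
    return json.loads(stripped)


def extract_first_json_object(text: str):
    """Tolerant parsing (hypothetical): return the first decodable JSON object."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue  # this brace did not start valid JSON; try the next one
        return obj
    return None  # no JSON object found anywhere in the reply


reply = 'Here is my rating: {"score": 7} Hope this helps.'
try:
    strip_fences_and_load(reply)
except json.JSONDecodeError:
    print("strict parser failed")
print(extract_first_json_object(reply))
```

With the strict parser, a reply like this one cannot be decoded at all, which matches the N/A scores described above. The tolerant parser recovers the object regardless of the surrounding text.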

Test plan

PR-triggered LLM Benchmark workflow completes
Aggregate report shows real scores and summary judge ratings (not 0.0/10) for the previously-failing models (e.g. microsoft/phi-4, openai/gpt-4.1-nano)

🤖 Generated with Claude Code

Adds openai/gpt-5.4, openai/gpt-5.4-nano, openai/gpt-5.3-codex, openai/gpt-oss-120b, openai/gpt-oss-20b and anthropic/claude-opus-4.7 to the default benchmark matrix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Updates the GitHub Actions benchmark workflow to include newly released default LLMs in the benchmark matrix so they run both on manual dispatch and on PR-triggered runs.

Changes:

Expanded the workflow_dispatch models input default to include 5 new OpenAI models and anthropic/claude-opus-4.7.
Updated the matrix fallback model list to match the new default set.

Copilot · 2026-04-17T08:35:19Z

        # yamllint disable-line rule:line-length
-        default: '["openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]'
+        default: '["openai/gpt-5.4","openai/gpt-5.4-nano","openai/gpt-5.3-codex","openai/gpt-oss-120b","openai/gpt-oss-20b","openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.7","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]'


The # yamllint disable-line rule:line-length directive is on the line above, so it won’t suppress line-length warnings for the long default: value. If yamllint is used, move the directive onto the default: line (or use a yamllint disable/enable block) so the intended line is actually ignored.

Copilot · 2026-04-17T08:35:20Z

      # yamllint disable-line rule:line-length
      matrix:
-        model: ${{ fromJson(inputs.models || '["openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]') }}
+        model: ${{ fromJson(inputs.models || '["openai/gpt-5.4","openai/gpt-5.4-nano","openai/gpt-5.3-codex","openai/gpt-oss-120b","openai/gpt-oss-20b","openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.7","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]') }}


Same as above: the yamllint disable-line comment is on line 30, but the long line that would trigger line-length is model: ... on line 32. If yamllint is run, this won’t have the intended effect; consider moving the directive onto the long line or wrapping this section in a yamllint disable/enable block.

Copilot · 2026-04-17T08:35:20Z

      matrix:
-        model: ${{ fromJson(inputs.models || '["openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]') }}
+        model: ${{ fromJson(inputs.models || '["openai/gpt-5.4","openai/gpt-5.4-nano","openai/gpt-5.3-codex","openai/gpt-oss-120b","openai/gpt-oss-20b","openai/gpt-4o","openai/gpt-4o-mini","openai/gpt-4.1","openai/gpt-4.1-mini","openai/gpt-4.1-nano","anthropic/claude-opus-4.7","anthropic/claude-opus-4.6","anthropic/claude-haiku-4.5","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it","google/gemma-4-31b-it","mistralai/mistral-small-3.1-24b-instruct","mistralai/mistral-medium-3.1","meta-llama/llama-4-scout","meta-llama/llama-4-maverick","deepseek/deepseek-chat-v3-0324","deepseek/deepseek-v3.2","microsoft/phi-4","x-ai/grok-3-mini","x-ai/grok-4-fast"]') }}


The default model list is duplicated in two places (the workflow_dispatch input default and the matrix fallback). Now that the list is longer, it’s easier for these to drift out of sync. Consider factoring the JSON string into a YAML anchor/alias (or another single source of truth) so updates only need to be made once.
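Until the list has a single source of truth, the drift risk raised in this comment could be caught by a small check. The sketch below is an assumption-laden illustration, not part of the PR: `model_lists`, `lists_in_sync`, and the sample workflow text are hypothetical, and the regex assumes each list is written as a single-quoted JSON array on one line, as in the diff above.

```python
import json
import re

# Matches a single-quoted JSON array literal on one line, e.g. '["a","b"]'.
ARRAY_LITERAL_RE = re.compile(r"'(\[.*?\])'")


def model_lists(workflow_text: str) -> list[list[str]]:
    """Return every single-quoted JSON array found in the workflow text."""
    return [
        json.loads(match.group(1))
        for match in ARRAY_LITERAL_RE.finditer(workflow_text)
    ]


def lists_in_sync(workflow_text: str) -> bool:
    """True when at least two lists are found and all of them are identical."""
    lists = model_lists(workflow_text)
    return len(lists) >= 2 and all(lst == lists[0] for lst in lists)


# Shortened, hypothetical workflow excerpt used only for illustration.
SAMPLE = """
on:
  workflow_dispatch:
    inputs:
      models:
        default: '["openai/gpt-4o","microsoft/phi-4"]'
jobs:
  bench:
    strategy:
      matrix:
        model: ${{ fromJson(inputs.models || '["openai/gpt-4o","microsoft/phi-4"]') }}
"""

print(lists_in_sync(SAMPLE))
```

A check like this could run in CI or as a unit test against the real workflow file. It complements the suggestion above rather than replacing it, since it detects drift but does not remove the duplication.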

The model is unavailable on OpenRouter and caused the matrix job to fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The benchmark and aggregate scripts still parsed LLM output with a local _strip_code_fences + json.loads pair, which fails when models wrap JSON in explanatory text or extra content. Production code switched to extract_json (regex-based) in #13; the benchmark scripts were missed, which caused most models to score N/A and rank with composite 0.0/10. Replace the local helper with reddit_digest.nodes.llm_utils.extract_json in both bench_model.py and aggregate.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The helper was inlined in bench_model.py / aggregate.py and was replaced by reddit_digest.nodes.llm_utils.extract_json. Drop the four obsolete tests that imported the removed function. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Update the README with the run after #17 (5 new models, fixed JSON parsing). Recommend google/gemma-3-12b-it as the best quality/price self-hostable option (composite 0.9627 at $0.0009 per run, 100% JSON valid). Note anomalies: gpt-oss-120b summary at 0.0/10, gpt-4.1-nano 17% JSON, phi-4 67% JSON. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Keep README consistent — the rest of the file is in English. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…mark Update the table to reflect the 25-model run, add gpt-5.4-nano and the Gemma 4-31B / 3-27B variants, and reframe the recommendation around gemma-3-12b-it as the best self-hostable option (no longer top-2 on composite, but still cheapest at $0.0009/run with 100% JSON validity). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Latest benchmark run is much more conclusive for the Gemma 3 family: gemma-3-27b-it tops the composite ranking (0.9819) and gemma-3-12b-it moves up to #2 (0.9781) with the same 7.9/10 summary quality. Three Gemma 3 variants plus the Gemma 4-31B fill 4 of the top 5 spots. Conclusion preserved: gemma-3-12b-it remains the recommended self-hostable default, with the same summary quality as the 27B at less than half the cost ($0.0005 vs $0.0012/run) and runnable on a single consumer GPU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 17, 2026 08:31

Copilot started reviewing on behalf of using-system April 17, 2026 08:31

Copilot AI reviewed Apr 17, 2026

using-system and others added 2 commits April 17, 2026 10:44

chore(benchmark): remove openai/gpt-5.3-codex from defaults

46272b2

using-system changed the title ~~chore(benchmark): add latest OpenAI and Anthropic models to defaults~~ feat(benchmark): refresh model lineup and fix N/A scoring Apr 17, 2026

using-system and others added 5 commits April 17, 2026 10:46

docs(benchmark): translate recommendation section to English

9779b21

using-system merged commit f243ebe into main Apr 17, 2026
8 checks passed

using-system deleted the chore/benchmark-add-new-models branch April 17, 2026 09:25

using-system merged 8 commits into main from chore/benchmark-add-new-models
