
Add category-level score summary at end of benchmark run#44

Open
ScuttleBot wants to merge 1 commit into main from scuttlebot/category-summary

Conversation

@ScuttleBot

Summary

Addresses #41 — adds a summary at the end of benchmark runs that matches the PinchBench website format.

Output Example

================================================================================
🦀 PINCHBENCH SCORE SUMMARY
================================================================================

   Overall Score: 86.5% (19.9 / 23.0)

   CATEGORY               SCORE        TASKS
   --------------------------------------------
   🟢 BASIC              100.0%      1 task
   🟡 CALENDAR            83.0%      1 task
   🟢 CODING             100.0%      1 task
   🟢 COMPREHENSION       93.0%      4 tasks
   🔴 CREATIVE             0.0%      1 task
   ...
   --------------------------------------------
================================================================================

Features

  • Overall score — percentage with earned/possible points
  • Per-category breakdown — grouped by task category
  • Color indicators — 🟢 ≥90%, 🟡 ≥70%, 🔴 <70%
  • Task counts — shows how many tasks in each category

Use Case

Users running benchmarks locally with --no-upload can now compare their results directly against the PinchBench leaderboard without digging through verbose output.

Closes #41

Adds a new summary section after benchmark completion that shows:
- Overall score percentage (matching website format)
- Per-category breakdown with task counts
- Color indicators (🟢 ≥90%, 🟡 ≥70%, 🔴 <70%)

This allows users running locally (with --no-upload) to compare their
results directly against scores on pinchbench.com without digging
through the detailed output.

Closes #41

kilo-code-bot bot commented Mar 10, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Overview

This PR adds a _log_category_summary() function that groups benchmark task scores by category and logs a formatted summary table. The implementation is clean and well-structured:

  • Edge cases are handled properly (empty task_entries, missing grading data, empty/missing categories)
  • Division-by-zero is guarded with if total_possible > 0 / if data["possible"] > 0 checks
  • The task.category attribute defaults to '' in lib_tasks.py, and the code correctly falls back to "UNCATEGORIZED" for falsy values
  • The function is purely for logging/display and has no side effects on data flow

Files Reviewed (1 file)

  • scripts/benchmark.py - 0 issues
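The grouping and guards the review describes could look roughly like the sketch below. This is a minimal reconstruction, assuming each task entry carries a category plus earned/possible points; the function and field names are hypothetical, not taken from scripts/benchmark.py.

```python
from collections import defaultdict

def summarize_by_category(task_entries):
    """Group (category, earned, possible) tuples into per-category totals.

    Hypothetical sketch of the behavior described in the review:
    - falsy categories fall back to "UNCATEGORIZED"
    - percentage is only computed when possible > 0 (division-by-zero guard)
    """
    groups = defaultdict(lambda: {"earned": 0.0, "possible": 0.0, "tasks": 0})
    for category, earned, possible in task_entries:
        data = groups[category or "UNCATEGORIZED"]
        data["earned"] += earned
        data["possible"] += possible
        data["tasks"] += 1
    return {
        name: {
            **data,
            "pct": 100.0 * data["earned"] / data["possible"]
            if data["possible"] > 0 else 0.0,
        }
        for name, data in groups.items()
    }
```

An empty `task_entries` list simply yields an empty summary, matching the empty-input edge case the review notes.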

Development

Successfully merging this pull request may close these issues.

Summary at end of test run

2 participants