
Add category-level score summary at end of benchmark run#44

Open
ScuttleBot wants to merge 1 commit into main from scuttlebot/category-summary

Conversation

@ScuttleBot

Summary

Addresses #41 — adds a summary at the end of benchmark runs that matches the PinchBench website format.

Output Example

================================================================================
🦀 PINCHBENCH SCORE SUMMARY
================================================================================

   Overall Score: 86.5% (19.9 / 23.0)

   CATEGORY               SCORE        TASKS
   --------------------------------------------
   🟢 BASIC              100.0%      1 task
   🟡 CALENDAR            83.0%      1 task
   🟢 CODING             100.0%      1 task
   🟢 COMPREHENSION       93.0%      4 tasks
   🔴 CREATIVE             0.0%      1 task
   ...
   --------------------------------------------
================================================================================

Features

  • Overall score — percentage with earned/possible points
  • Per-category breakdown — grouped by task category
  • Color indicators — 🟢 ≥90%, 🟡 ≥70%, 🔴 <70%
  • Task counts — shows how many tasks in each category

Use Case

Users running benchmarks locally with --no-upload can now compare their results directly against the PinchBench leaderboard without digging through verbose output.

Closes #41

Adds a new summary section after benchmark completion that shows:
- Overall score percentage (matching website format)
- Per-category breakdown with task counts
- Color indicators (🟢 ≥90%, 🟡 ≥70%, 🔴 <70%)

This allows users running locally (with --no-upload) to compare their
results directly against scores on pinchbench.com without digging
through the detailed output.

Closes #41

kilo-code-bot bot commented Mar 10, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Overview

This PR adds a _log_category_summary() function that groups benchmark task scores by category and logs a formatted summary table. The implementation is clean and well-structured:

  • Edge cases are handled properly (empty task_entries, missing grading data, empty/missing categories)
  • Division-by-zero is guarded with if total_possible > 0 / if data["possible"] > 0 checks
  • The task.category attribute defaults to '' in lib_tasks.py, and the code correctly falls back to "UNCATEGORIZED" for falsy values
  • The function is purely for logging/display and has no side effects on data flow

Files Reviewed (1 file)

  • scripts/benchmark.py - 0 issues
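The grouping and guards the review describes could look roughly like the sketch below. This is a minimal reconstruction, assuming each task entry carries a category plus earned/possible points; the function and field names are hypothetical, not taken from scripts/benchmark.py.

```python
from collections import defaultdict

def summarize_by_category(task_entries):
    """Group (category, earned, possible) tuples into per-category totals.

    Hypothetical sketch of the behavior described in the review:
    - falsy categories fall back to "UNCATEGORIZED"
    - percentage is only computed when possible > 0 (division-by-zero guard)
    """
    groups = defaultdict(lambda: {"earned": 0.0, "possible": 0.0, "tasks": 0})
    for category, earned, possible in task_entries:
        data = groups[category or "UNCATEGORIZED"]
        data["earned"] += earned
        data["possible"] += possible
        data["tasks"] += 1
    return {
        name: {
            **data,
            "pct": 100.0 * data["earned"] / data["possible"]
            if data["possible"] > 0 else 0.0,
        }
        for name, data in groups.items()
    }
```

An empty `task_entries` list simply yields an empty summary, matching the empty-input edge case the review notes.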

Development

Successfully merging this pull request may close these issues.

Summary at end of test run

2 participants