feat: benchmark calibration, semantic validation & improved prompts#68

Open
jmlweb wants to merge 3 commits into main from feat/analysis-improvements

@jmlweb jmlweb commented Jan 30, 2026

Summary

This PR adds several improvements to make analysis more reliable, especially for small models:

🎯 Gold Standard Benchmark

  • 50 curated prompts with human-rated quality scores
  • Covers all tiers: excellent (10), good (15), fair (15), poor (10)
  • Includes correlation calculation for model accuracy measurement
  • Used for calibrating scores across different providers
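
As a rough illustration of the correlation step, here is a minimal sketch that assumes a Pearson correlation between human and model scores. The `BenchmarkEntry` shape and the `scoreCorrelation` name are hypothetical and may not match the code in this PR.

```typescript
// Hypothetical entry shape: one curated prompt with both scores attached.
interface BenchmarkEntry {
  humanScore: number; // human-rated quality score from the gold standard
  modelScore: number; // score the model under test produced
}

// Pearson correlation between human and model scores.
// Returns a value in [-1, 1]; returns 0 when either series has no variance.
function scoreCorrelation(entries: BenchmarkEntry[]): number {
  const n = entries.length;
  if (n < 2) {
    throw new Error("need at least 2 benchmark entries");
  }

  let sumHuman = 0;
  let sumModel = 0;
  for (const entry of entries) {
    sumHuman += entry.humanScore;
    sumModel += entry.modelScore;
  }
  const meanHuman = sumHuman / n;
  const meanModel = sumModel / n;

  let covariance = 0;
  let varianceHuman = 0;
  let varianceModel = 0;
  for (const entry of entries) {
    const dh = entry.humanScore - meanHuman;
    const dm = entry.modelScore - meanModel;
    covariance += dh * dm;
    varianceHuman += dh * dh;
    varianceModel += dm * dm;
  }

  const denominator = Math.sqrt(varianceHuman * varianceModel);
  return denominator === 0 ? 0 : covariance / denominator;
}
```

A value close to 1 would mean the model ranks prompts much like the human raters do; a rank-based measure such as Spearman could be swapped in if only ordering matters.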

✅ Semantic Validation

  • Validates that scores correlate with issue counts
  • Detects when examples are not found in original prompts
  • Auto-corrects results when validation fails
  • Prevents logically inconsistent outputs
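
The checks above could look roughly like the sketch below. Everything here is an assumption made for illustration: the `AnalysisResult` shape, a 0 to 100 score scale, and the two thresholds are hypothetical and are not taken from this PR.

```typescript
// Hypothetical result shape; the real types in this PR may differ.
interface Issue {
  example: string; // text the model claims to quote from the prompt
}

interface AnalysisResult {
  score: number; // assumed to be on a 0-100 scale
  issues: Issue[];
}

interface ValidationReport {
  valid: boolean;
  problems: string[];
  corrected: AnalysisResult;
}

// Hypothetical thresholds, chosen only for this sketch.
const MANY_ISSUES = 3;
const MAX_SCORE_WITH_MANY_ISSUES = 70;

function validateSemantics(prompt: string, result: AnalysisResult): ValidationReport {
  const problems: string[] = [];

  // Drop issues whose quoted example does not appear in the original prompt.
  const keptIssues = result.issues.filter((issue) => {
    const found = prompt.includes(issue.example);
    if (!found) {
      problems.push(`example not found in prompt: "${issue.example}"`);
    }
    return found;
  });

  // A high score alongside many issues is inconsistent, so cap the score.
  let score = result.score;
  if (keptIssues.length >= MANY_ISSUES && score > MAX_SCORE_WITH_MANY_ISSUES) {
    problems.push(
      `score ${String(score)} is too high for ${String(keptIssues.length)} issues`,
    );
    score = MAX_SCORE_WITH_MANY_ISSUES;
  }

  return {
    valid: problems.length === 0,
    problems,
    corrected: { score, issues: keptIssues },
  };
}
```

Returning the corrected result next to the list of problems lets the caller decide whether to accept the auto-correction or surface the inconsistency.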

🔄 Temperature Fallback Retry

  • When JSON parsing fails, retries with lower temperatures
  • Sequence: 0.3 → 0.1 → 0.0
  • More deterministic outputs reduce parse failures
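
A minimal sketch of the retry loop, assuming the sequence above is the full list of attempts. The `Generate` callback and the `generateJsonWithFallback` name are hypothetical stand-ins for the provider call and are not the functions added in this PR.

```typescript
// Temperatures to try in order, as listed in the PR description.
const FALLBACK_TEMPERATURES = [0.3, 0.1, 0.0] as const;

// Hypothetical provider call: returns the raw model output as text.
type Generate = (prompt: string, temperature: number) => Promise<string>;

// Tries each temperature in turn and returns the first output that parses as JSON.
async function generateJsonWithFallback(
  prompt: string,
  generate: Generate,
): Promise<unknown> {
  let lastError: unknown;
  for (const temperature of FALLBACK_TEMPERATURES) {
    const raw = await generate(prompt, temperature);
    try {
      return JSON.parse(raw) as unknown;
    } catch (error) {
      // Parsing failed: remember why, then retry at the next lower temperature.
      lastError = error;
    }
  }
  throw new Error(`JSON parsing failed at all temperatures: ${String(lastError)}`);
}
```

Errors thrown by the provider call itself are deliberately not caught here, so only parse failures trigger a retry.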

📝 Enhanced SYSTEM_PROMPT_MINIMAL

  • More contrastive examples showing score progression
  • Clear examples for each tier (POOR, FAIR, GOOD, EXCELLENT)
  • Better calibrated scoring guidelines

Testing

  • All existing tests pass
  • New benchmark tests with 24 test cases
  • Typecheck passes

…ompts

- Add gold-standard benchmark with 50 curated prompts for calibration
- Add semantic validator to detect score/issue inconsistencies
- Implement temperature fallback retry (0.3 -> 0.1 -> 0.0) for Ollama
- Enhance SYSTEM_PROMPT_MINIMAL with more contrastive examples
- Auto-correct results when semantic validation fails

This improves analysis reliability, especially for small models.

- Fix @typescript-eslint/restrict-template-expressions by converting numbers to strings
- Fix @typescript-eslint/no-unnecessary-condition by removing redundant checks
- Fix test expectation to match actual error message
- Sort imports in benchmark/index.ts