feat: benchmark calibration, semantic validation & improved prompts by jmlweb · Pull Request #68 · jmlweb/hyntx

jmlweb · 2026-01-30T13:04:12Z

Summary

This PR adds several improvements to make analysis more reliable, especially for small models:

🎯 Gold Standard Benchmark

50 curated prompts with human-rated quality scores
Covers all tiers: excellent (10), good (15), fair (15), poor (10)
Includes correlation calculation for model accuracy measurement
Used for calibrating scores across different providers
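
As an illustration of the correlation step, here is a minimal sketch that compares human ratings with model scores. It assumes a Pearson coefficient; the PR does not say which coefficient the benchmark uses, and `ScorePair` and `pearsonCorrelation` are hypothetical names rather than identifiers from the hyntx codebase.

```typescript
// Hypothetical pairing of a human rating with the score a model produced
// for the same benchmark prompt. Field names are assumptions.
interface ScorePair {
  human: number;
  model: number;
}

// Pearson correlation between human and model scores.
// Returns 0 when either series has no variance (correlation is undefined).
function pearsonCorrelation(pairs: ScorePair[]): number {
  if (pairs.length < 2) {
    throw new Error("need at least two score pairs");
  }
  const n = pairs.length;
  const meanHuman = pairs.reduce((sum, p) => sum + p.human, 0) / n;
  const meanModel = pairs.reduce((sum, p) => sum + p.model, 0) / n;

  let covariance = 0;
  let varHuman = 0;
  let varModel = 0;
  for (const { human, model } of pairs) {
    const dh = human - meanHuman;
    const dm = model - meanModel;
    covariance += dh * dm;
    varHuman += dh * dh;
    varModel += dm * dm;
  }

  const denominator = Math.sqrt(varHuman * varModel);
  return denominator === 0 ? 0 : covariance / denominator;
}
```

A value close to 1 would suggest the provider ranks prompts much as the human raters did; a value near 0 would suggest its scores need recalibration.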

✅ Semantic Validation

Validates that scores correlate with issue counts
Detects when examples are not found in original prompts
Auto-corrects results when validation fails
Prevents logically inconsistent outputs
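
The checks above could look roughly like the following sketch. Everything in it is an assumption made for illustration: the `AnalysisResult` shape, the 0-10 score scale, the thresholds, and the correction strategy (dropping ungrounded issues and capping the score) are not taken from the PR.

```typescript
// Hypothetical result shape; the real hyntx types may differ.
interface Issue {
  description: string;
  example: string; // text the model claims to quote from the prompt
}

interface AnalysisResult {
  score: number; // assumed 0-10 scale
  issues: Issue[];
}

interface ValidationOutcome {
  valid: boolean;
  problems: string[];
  corrected: AnalysisResult;
}

// Assumed thresholds, chosen only for illustration.
const HIGH_SCORE = 8;
const MAX_ISSUES_FOR_HIGH_SCORE = 1;

function validateSemantics(
  prompt: string,
  result: AnalysisResult,
): ValidationOutcome {
  const problems: string[] = [];

  // Check 1: every quoted example must appear verbatim in the original prompt.
  const grounded = result.issues.filter((issue) =>
    prompt.includes(issue.example),
  );
  const dropped = result.issues.length - grounded.length;
  if (dropped > 0) {
    problems.push(`${String(dropped)} issue example(s) not found in prompt`);
  }

  // Check 2: a high score alongside several issues is logically inconsistent.
  let score = result.score;
  if (score > HIGH_SCORE && grounded.length > MAX_ISSUES_FOR_HIGH_SCORE) {
    problems.push(
      `score ${String(score)} is too high for ${String(grounded.length)} issues`,
    );
    score = HIGH_SCORE; // auto-correct by capping the score
  }

  return {
    valid: problems.length === 0,
    problems,
    corrected: { score, issues: grounded },
  };
}
```

Returning both the list of problems and a corrected result lets a caller log why validation failed while still using a consistent output.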

🔄 Temperature Fallback Retry

When JSON parsing fails, retries with lower temperatures
Sequence: 0.3 → 0.1 → 0.0
More deterministic outputs reduce parse failures
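
A minimal sketch of this retry loop follows. It treats the sequence above as the ordered list of attempts, which is one reading of the PR text, and it takes the model call as a parameter: `generate` and `generateJsonWithFallback` are hypothetical names, not the actual Ollama provider code.

```typescript
// Temperatures to try in order, taken from the sequence in the PR description.
const FALLBACK_TEMPERATURES: readonly number[] = [0.3, 0.1, 0.0];

// Hypothetical model call: given a temperature, resolves to the raw text output.
type Generate = (temperature: number) => Promise<string>;

// Calls the model at each temperature in turn and returns the first output
// that parses as JSON. Throws if no attempt yields valid JSON.
async function generateJsonWithFallback<T>(
  generate: Generate,
  temperatures: readonly number[] = FALLBACK_TEMPERATURES,
): Promise<T> {
  let lastError: unknown;
  for (const temperature of temperatures) {
    const raw = await generate(temperature);
    try {
      // The cast is unchecked; a real caller would validate the parsed shape.
      return JSON.parse(raw) as T;
    } catch (error) {
      lastError = error; // remember the failure and try a lower temperature
    }
  }
  throw new Error(
    `JSON parsing failed at all temperatures: ${String(lastError)}`,
  );
}
```

Only parse failures trigger a retry here; errors thrown by `generate` itself (for example, a network failure) propagate immediately, since lowering the temperature would not help with those.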

📝 Enhanced SYSTEM_PROMPT_MINIMAL

More contrastive examples showing score progression
Clear examples for each tier (POOR, FAIR, GOOD, EXCELLENT)
Better calibrated scoring guidelines

Testing

All existing tests pass
New benchmark tests with 24 test cases
Typecheck passes

…ompts

- Add gold-standard benchmark with 50 curated prompts for calibration
- Add semantic validator to detect score/issue inconsistencies
- Implement temperature fallback retry (0.3 -> 0.1 -> 0.0) for Ollama
- Enhance SYSTEM_PROMPT_MINIMAL with more contrastive examples
- Auto-correct results when semantic validation fails

This improves analysis reliability, especially for small models.

- Fix @typescript-eslint/restrict-template-expressions by converting numbers to strings
- Fix @typescript-eslint/no-unnecessary-condition by removing redundant checks
- Fix test expectation to match actual error message
- Sort imports in benchmark/index.ts

jmlweb added 3 commits January 30, 2026 14:03

fix: update ollama tests for temperature retry behavior

13ac8f2

jmlweb wants to merge 3 commits into main from feat/analysis-improvements
