feat(ml): logit-derived confidence on PINJ-ML-001 findings by kurtpayne · Pull Request #205 · kurtpayne/skillscan-security

kurtpayne · 2026-04-26T02:27:37Z

Summary

Adds Finding.logit_confidence: a continuous P(predicted_verdict) ∈ [0, 1] derived from a softmax over the model's "benign" / "malicious" token logits at the verdict position.
Adds --ml-threshold FLOAT CLI flag (and SKILLSCAN_ML_THRESHOLD env var) to filter PINJ-ML-001 findings whose logit_confidence falls below the threshold. Default 0.0 = no filtering (existing behaviour).
Replaces the discrete confidence field's 3 buckets (0.9 / 0.95 / 1.0) with a continuous signal that actually separates correct from wrong predictions. Same model, no retraining.
Backward compatible at every layer: older llama-cpp-python that rejects logprobs falls back gracefully; older clients/responses without a logprobs payload yield logit_confidence=None; advisory findings (NO-MODEL/STALE/LARGE-FILE/UNAVAIL) and findings without logit_confidence are never filtered.
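
The filtering rule described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `Finding` dataclass here is a hypothetical stand-in with an assumed `id` field, and `filter_by_ml_threshold` is an invented name.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for the project's Finding type; only the two
# fields the filter needs are modelled, and the field name `id` is assumed.
@dataclass
class Finding:
    id: str
    logit_confidence: Optional[float] = None


def filter_by_ml_threshold(findings: list[Finding], threshold: float = 0.0) -> list[Finding]:
    """Drop PINJ-ML-001 findings whose logit_confidence is below threshold."""
    if threshold <= 0.0:
        return list(findings)  # default 0.0: no filtering
    kept = []
    for f in findings:
        is_ml_verdict = f.id == "PINJ-ML-001"
        has_score = f.logit_confidence is not None
        # Advisory IDs and findings without a score always pass through.
        if is_ml_verdict and has_score and f.logit_confidence < threshold:
            continue
        kept.append(f)
    return kept
```

With a threshold of 0.80, a PINJ-ML-001 finding at 0.62 would be dropped, while one at 0.97, one with `logit_confidence=None`, and any advisory finding would be kept.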

Eval evidence

On v4.7's 431-file held-out set (data: skillscan-corpus/eval_results/v47_logit_confidence_eval.json):

signal	range	wrong-prediction distribution
discrete `confidence`	3 buckets {0.9, 0.95, 1.0}	100% of errors at conf=0.95 — invisible
logit_confidence	0.519 → 1.000 (continuous)	4/4 errors at conf ∈ [0.58, 0.76]

Threshold semantics:

`--ml-threshold`	kept_pct	kept_acc	flagged	flagged_acc
0.99	59.5%	100.0%	174	97.7%
0.95	78.8%	100.0%	91	95.6%
0.90	86.7%	100.0%	57	93.0%
0.80	94.0%	100.0%	26	84.6%
0.70	97.4%	99.5%	11	81.8%

--ml-threshold 0.80 drops every model error while accepting 94% of files.
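
For readers who want to reproduce a row of the table from their own scored predictions, here is a minimal sketch of how the four columns could be computed. It assumes "kept" means `logit_confidence >= threshold` and "flagged" means below it; the function name and the toy data are illustrative and are not taken from the eval JSON.

```python
def threshold_row(scored, threshold):
    """scored: list of (logit_confidence, is_correct) pairs."""
    kept = [ok for conf, ok in scored if conf >= threshold]
    flagged = [ok for conf, ok in scored if conf < threshold]

    def acc(group):
        # Accuracy is undefined for an empty group.
        return sum(group) / len(group) if group else None

    return {
        "threshold": threshold,
        "kept_pct": len(kept) / len(scored),
        "kept_acc": acc(kept),
        "flagged": len(flagged),
        "flagged_acc": acc(flagged),
    }


# Synthetic toy data, NOT the v4.7 held-out set.
toy = [
    (0.999, True), (0.995, True), (0.97, True), (0.91, True),
    (0.85, True), (0.78, True), (0.74, False), (0.61, False),
]
row = threshold_row(toy, 0.80)
```

On the toy data, a threshold of 0.80 keeps 5 of 8 predictions (all correct) and flags 3, of which one is correct.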

Implementation

Load GGUF with logits_all=True (memory cost; required for logprobs).
Inference passes logprobs=True, top_logprobs=5 alongside the existing GBNF grammar.
_extract_logit_confidence() scans token-level logprobs for the first ben/mal token, softmaxes the candidate-token logits, returns P(predicted_verdict). Handles missing-from-top-K with soft floor (-20).
One-shot retry without logprobs if an older llama-cpp-python rejects the args.
--ml-threshold plumbed through scanner.scan() at three CLI callsites; the filter runs after ml_prompt_injection_findings() returns and only touches PINJ-ML-001 IDs.
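
The extraction step can be illustrated with a simplified sketch. It is not the PR's `_extract_logit_confidence()`: the payload shape (a list of per-token dicts with `token`, `logprob`, and `top_logprobs` keys, in the style of OpenAI chat logprobs) and the prefix matching on "ben" / "mal" are assumptions made for the example.

```python
import math
from typing import Optional

LOGPROB_FLOOR = -20.0  # soft floor for a class missing from the top-K


def _verdict_class(token: str) -> Optional[str]:
    """Map a token to a verdict class by prefix (assumed tokenisation)."""
    t = token.strip().strip('"').lower()
    if t.startswith("ben"):
        return "benign"
    if t.startswith("mal"):
        return "malicious"
    return None


def extract_logit_confidence(token_logprobs: list[dict]) -> Optional[float]:
    """Return P(predicted_verdict) from the first verdict-starting token."""
    for entry in token_logprobs:
        predicted = _verdict_class(entry.get("token", ""))
        if predicted is None:
            continue
        # Best logprob seen per class; a class absent from the top-K
        # stays at the floor.
        best = {"benign": LOGPROB_FLOOR, "malicious": LOGPROB_FLOOR}
        for cand in [entry] + list(entry.get("top_logprobs") or []):
            cls = _verdict_class(cand.get("token", ""))
            if cls is not None:
                best[cls] = max(best[cls], cand.get("logprob", LOGPROB_FLOOR))
        # Two-way softmax, shifted by the max for numerical stability.
        m = max(best.values())
        exp = {k: math.exp(v - m) for k, v in best.items()}
        return exp[predicted] / sum(exp.values())
    return None  # no verdict token found, or no logprobs payload
```

Renormalising over just the two classes means probability mass on unrelated tokens is ignored, so the result always lies in [0, 1] and equals P(predicted) / (P(benign) + P(malicious)).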

Severity mapping is unchanged in this PR — logit_confidence is an additional signal downstream tooling can threshold against. A future PR can fold it into severity demotion (e.g., MED → LOW when logit < 0.7).
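
As a purely hypothetical illustration of the demotion idea mentioned above (nothing like this exists in this PR), a follow-up could look something like the sketch below. The function name, the `DEMOTE` table, and the 0.7 cutoff default are assumptions taken from the example in the sentence above.

```python
from typing import Optional

# Only the example pair from the text; other severities are left untouched.
DEMOTE = {"MED": "LOW"}


def demote_severity(severity: str, logit_confidence: Optional[float], cutoff: float = 0.7) -> str:
    """Demote severity when the model's logit confidence is below cutoff."""
    if logit_confidence is None or logit_confidence >= cutoff:
        return severity  # no score, or confident enough: leave as-is
    return DEMOTE.get(severity, severity)
```

Findings without a `logit_confidence` keep their severity, which matches the backward-compatibility stance taken elsewhere in this PR.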

Test plan

SKILLSCAN_NO_USER_RULES=1 pytest tests/test_ml_detector.py -q — 27/27 passing (18 existing + 9 new)
Full suite (excluding the 7 stale-rule tests already fixed by #204, "test: align test_rules.py with rules-snapshot at 2026.04.25 (unblocks main)"): 721 passed, 8 skipped
ruff check — clean on all touched files
Manual smoke against the bundled v4 GGUF on a few held-out files (requires skillscan model sync)

🤖 Generated with Claude Code

codecov · 2026-04-26T22:18:08Z

Codecov Report

❌ Patch coverage is 77.96610% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.90%. Comparing base (f72c00e) to head (c58ab0c).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
src/skillscan/ml_detector.py	78.57%	12 Missing ⚠️
src/skillscan/analysis_pkg/_scanner.py	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #205      +/-   ##
==========================================
+ Coverage   75.87%   75.90%   +0.03%     
==========================================
  Files          41       41              
  Lines        5994     6052      +58     
==========================================
+ Hits         4548     4594      +46     
- Misses       1446     1458      +12


The discrete `confidence` field the model emits buckets at 0.9 / 0.95 / 1.0 (83% at 0.95), which is useless for thresholding because every wrong prediction also lands at 0.95.

Eval data on v4.7's 431-file held-out set: all 4 model errors had logit_confidence ∈ [0.58, 0.76]; all 426 correct predictions had logit_confidence ≥ 0.99 except a handful in [0.80, 0.99]. Threshold 0.80 flags 100% of errors while accepting 94% of files.

Implementation:
- Load the GGUF with `logits_all=True` so llama-cpp-python returns per-token logprobs.
- Inference passes `logprobs=True, top_logprobs=5` alongside the existing GBNF grammar. Falls back gracefully (one-shot retry without logprobs) when an older llama-cpp-python rejects the args.
- _extract_logit_confidence() finds the verdict-starting token and softmaxes the logp(ben) vs logp(mal) entries to produce continuous P(predicted_verdict) ∈ [0, 1]. Handles the missing-from-top-K case with a soft floor.
- Surfaced as Finding.logit_confidence (Optional[float]). Older clients without logprobs payloads get None, so the change is fully backward-compatible.

Severity mapping is unchanged in this commit; logit_confidence is an additional signal that downstream tooling can threshold against. Future PR can fold it into severity demotion (e.g., MED → LOW when logit < 0.7).

Eval evidence: skillscan-corpus/eval_results/v47_logit_confidence_eval.json

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes Item B's user-facing promise: "Enables --threshold 0.85 for CI gates". The earlier commit added Finding.logit_confidence; this commit makes it actually usable from the command line.

Adds --ml-threshold (also SKILLSCAN_ML_THRESHOLD env var, default 0.0). When > 0, drops PINJ-ML-001 findings whose logit_confidence is below the threshold. Advisory findings (PINJ-ML-NO-MODEL/STALE/LARGE-FILE/UNAVAIL) are never filtered. Findings without logit_confidence (older clients) are also never filtered, which keeps the change backward-safe.

Recommended thresholds (per the 431-file v4.7 held-out eval):
  --ml-threshold 0.99: keeps 60%, all correct (strictest CI gate)
  --ml-threshold 0.90: keeps 87%, all correct
  --ml-threshold 0.80: keeps 94%, all correct, drops every model error
  --ml-threshold 0.70: keeps 97%, 99.5% correct (lenient)

Plumbed through scanner.scan() at three CLI callsites + the underlying _scanner.scan(). No CLI flag at default = no behaviour change for existing users.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kurtpayne mentioned this pull request Apr 26, 2026

feat(scan): defense-in-depth triage hints (Item E) #211

Open

4 tasks

kurtpayne force-pushed the feat/ml-logit-confidence branch from bfd7398 to fd0df9e Compare April 26, 2026 23:10

kurtpayne and others added 2 commits April 26, 2026 17:09

kurtpayne force-pushed the feat/ml-logit-confidence branch from fd0df9e to c58ab0c Compare April 27, 2026 00:09

kurtpayne wants to merge 2 commits into main from feat/ml-logit-confidence
