feat(ml): logit-derived confidence on PINJ-ML-001 findings #205

Open
kurtpayne wants to merge 2 commits into main from feat/ml-logit-confidence


kurtpayne commented Apr 26, 2026

Summary

  • Adds Finding.logit_confidence: a continuous P(predicted_verdict) ∈ [0, 1], derived from a softmax over the model's "benign" / "malicious" token logits at the verdict position.
  • Adds --ml-threshold FLOAT CLI flag (and SKILLSCAN_ML_THRESHOLD env var) to filter PINJ-ML-001 findings whose logit_confidence falls below the threshold. Default 0.0 = no filtering (existing behaviour).
  • Complements the discrete confidence field, which only takes 3 values (0.9 / 0.95 / 1.0), with a continuous signal on which wrong predictions cluster at the low end. Same model, no retraining.
  • Backward compatible at every layer: older llama-cpp-python that rejects logprobs falls back gracefully; older clients/responses without a logprobs payload yield logit_confidence=None; advisory findings (NO-MODEL/STALE/LARGE-FILE/UNAVAIL) and findings without logit_confidence are never filtered.
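
The filtering rules in the bullets above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `Finding` dataclass is a hypothetical two-field stand-in for the real class, and `apply_ml_threshold` is an assumed name.

```python
from dataclasses import dataclass
from typing import List, Optional

ML_FINDING_ID = "PINJ-ML-001"


@dataclass
class Finding:
    """Hypothetical stand-in; the real Finding carries more fields."""
    id: str
    logit_confidence: Optional[float] = None


def apply_ml_threshold(findings: List[Finding], threshold: float = 0.0) -> List[Finding]:
    """Drop PINJ-ML-001 findings whose logit_confidence is below threshold."""
    # Default 0.0 means no filtering at all (existing behaviour).
    if threshold <= 0.0:
        return list(findings)
    kept = []
    for f in findings:
        # Only ML findings that actually carry a logit_confidence are filterable;
        # advisory IDs and findings with logit_confidence=None always pass.
        filterable = f.id == ML_FINDING_ID and f.logit_confidence is not None
        if filterable and f.logit_confidence < threshold:
            continue
        kept.append(f)
    return kept
```

With `threshold=0.80`, a PINJ-ML-001 finding at 0.62 is dropped, while one at 0.97, one with `logit_confidence=None`, and any advisory finding are kept.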

Eval evidence

On v4.7's 431-file held-out set (data: skillscan-corpus/eval_results/v47_logit_confidence_eval.json):

| signal | range | wrong-prediction distribution |
| --- | --- | --- |
| discrete confidence | 3 buckets {0.9, 0.95, 1.0} | 100% of errors at conf=0.95 (invisible) |
| logit_confidence | 0.519 → 1.000 (continuous) | 4/4 errors at conf ∈ [0.58, 0.76] |

Threshold semantics:

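| --ml-threshold | kept_pct | kept_acc | flagged | flagged_acc |
| --- | --- | --- | --- | --- |
| 0.99 | 59.5% | 100.0% | 174 | 97.7% |
| 0.95 | 78.8% | 100.0% | 91 | 95.6% |
| 0.90 | 86.7% | 100.0% | 57 | 93.0% |
| 0.80 | 94.0% | 100.0% | 26 | 84.6% |
| 0.70 | 97.4% | 99.5% | 11 | 81.8% |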

--ml-threshold 0.80 drops every model error while accepting 94% of files.
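
The threshold table can be cross-checked with a few lines of arithmetic. The sketch below rests on two assumptions that this PR does not state: that the percentages are computed over 430 scored predictions (the 4 errors plus the 426 correct predictions quoted in the first commit message), and that the number of errors among the flagged files is what the `flagged_acc` column implies (4 of 4 for thresholds ≥ 0.80, 2 of 4 at 0.70). Under those assumptions every row of the table is reproduced.

```python
TOTAL = 430        # assumption: 4 errors + 426 correct predictions
TOTAL_ERRORS = 4   # from the eval summary

# threshold -> (flagged count from the table, errors among flagged [inferred])
ROWS = {
    0.99: (174, 4),
    0.95: (91, 4),
    0.90: (57, 4),
    0.80: (26, 4),
    0.70: (11, 2),
}


def threshold_row(flagged, flagged_errors, total=TOTAL, total_errors=TOTAL_ERRORS):
    """Recompute kept_pct, kept_acc and flagged_acc (percent, 1 decimal)."""
    kept = total - flagged
    kept_errors = total_errors - flagged_errors
    return {
        "kept_pct": round(100 * kept / total, 1),
        "kept_acc": round(100 * (kept - kept_errors) / kept, 1),
        "flagged_acc": round(100 * (flagged - flagged_errors) / flagged, 1),
    }


for thr, (flagged, errs) in ROWS.items():
    print(f"{thr:.2f}", threshold_row(flagged, errs))
```

Read this way, "flagged" means "below the threshold", and `flagged_acc` is the share of flagged files the model had in fact classified correctly.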

Implementation

  • Load GGUF with logits_all=True (memory cost; required for logprobs).
  • Inference passes logprobs=True, top_logprobs=5 alongside the existing GBNF grammar.
  • _extract_logit_confidence() scans the token-level logprobs for the first ben/mal token, applies a softmax over the two candidate-token logprobs, and returns P(predicted_verdict). A candidate missing from the top-K list is assigned a soft floor of -20.
  • One-shot retry without logprobs if an older llama-cpp-python rejects the args.
  • --ml-threshold plumbed through scanner.scan() at three CLI callsites; the filter runs after ml_prompt_injection_findings() returns and only touches PINJ-ML-001 IDs.
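
The extraction step described above might look roughly like this. It is a sketch, not the PR's `_extract_logit_confidence()`: the input shape (one `(emitted_token, {candidate: logprob})` pair per generated token) is a simplified, hypothetical stand-in for the real llama-cpp-python payload, and the helper names are assumptions.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

LOGPROB_FLOOR = -20.0  # soft floor for a candidate missing from the top-K list

# Hypothetical input shape: (emitted token, {candidate token: logprob}) per position.
TokenLogprobs = Sequence[Tuple[str, Dict[str, float]]]


def _normalize(token: str) -> str:
    # Strip whitespace and quote characters that may surround a verdict token.
    return token.strip(" \"'\n").lower()


def _best_logprob(candidates: Dict[str, float], prefix: str) -> float:
    matches = [lp for tok, lp in candidates.items()
               if _normalize(tok).startswith(prefix)]
    return max(matches) if matches else LOGPROB_FLOOR


def extract_logit_confidence(tokens: TokenLogprobs) -> Optional[float]:
    """Return P(predicted verdict) from a two-way softmax, or None."""
    for emitted, candidates in tokens:
        word = _normalize(emitted)
        if not word.startswith(("ben", "mal")):
            continue  # not the verdict position yet
        lp_ben = _best_logprob(candidates, "ben")
        lp_mal = _best_logprob(candidates, "mal")
        lp_pred = lp_ben if word.startswith("ben") else lp_mal
        # Numerically stable softmax over the two candidates.
        top = max(lp_ben, lp_mal)
        denom = math.exp(lp_ben - top) + math.exp(lp_mal - top)
        return math.exp(lp_pred - top) / denom
    return None  # no verdict token found, e.g. no logprobs payload
```

Renormalising over only the two verdict candidates keeps the result in [0, 1] even when other tokens hold some probability mass at that position. The floor means a missing alternative pushes the confidence close to, but never exactly to, 1.0.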

Severity mapping is unchanged in this PR — logit_confidence is an additional signal downstream tooling can threshold against. A future PR can fold it into severity demotion (e.g., MED → LOW when logit < 0.7).

Test plan

  • SKILLSCAN_NO_USER_RULES=1 pytest tests/test_ml_detector.py -q — 27/27 passing (18 existing + 9 new)
  • Full suite (excluding the 7 stale-rule tests already fixed by #204, "test: align test_rules.py with rules-snapshot at 2026.04.25 (unblocks main)"): 721 passed, 8 skipped
  • ruff check — clean on all touched files
  • Manual smoke against the bundled v4 GGUF on a few held-out files (requires skillscan model sync)

🤖 Generated with Claude Code


codecov Bot commented Apr 26, 2026

Codecov Report

❌ Patch coverage is 77.96610% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.90%. Comparing base (f72c00e) to head (c58ab0c).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/skillscan/ml_detector.py | 78.57% | 12 Missing ⚠️ |
| src/skillscan/analysis_pkg/_scanner.py | 50.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #205      +/-   ##
==========================================
+ Coverage   75.87%   75.90%   +0.03%     
==========================================
  Files          41       41              
  Lines        5994     6052      +58     
==========================================
+ Hits         4548     4594      +46     
- Misses       1446     1458      +12     


kurtpayne and others added 2 commits April 26, 2026 17:09
The discrete `confidence` field the model emits buckets at 0.9 / 0.95 / 1.0
(83% at 0.95) — useless for thresholding because every wrong prediction
also lands at 0.95. Eval data on v4.7's 431-file held-out set: all 4 model
errors had logit_confidence ∈ [0.58, 0.76], while 404 of the 426 correct
predictions had logit_confidence ≥ 0.80 (256 of them ≥ 0.99). Threshold 0.80
flags 100% of errors while accepting 94% of files.

Implementation:
- Load the GGUF with `logits_all=True` so llama-cpp-python returns
  per-token logprobs.
- Inference passes `logprobs=True, top_logprobs=5` alongside the existing
  GBNF grammar. Falls back gracefully (one-shot retry without logprobs)
  when an older llama-cpp-python rejects the args.
- _extract_logit_confidence() finds the verdict-starting token and
  softmaxes the logp(ben) vs logp(mal) entries to produce continuous
  P(predicted_verdict) ∈ [0, 1]. Handles the missing-from-top-K case with
  a soft floor.
- Surfaced as Finding.logit_confidence (Optional[float]). Older clients
  without logprobs payloads get None — fully backward-compatible.
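
The one-shot retry mentioned in the list above could be structured like this. It is a sketch under assumptions: `create` is a hypothetical callable standing in for whichever llama-cpp-python completion method the detector uses, and catching `TypeError`/`ValueError` is a guess at how an older version rejects the arguments.

```python
from typing import Any, Callable, Dict


def complete_with_logprobs_fallback(
    create: Callable[..., Dict[str, Any]], **kwargs: Any
) -> Dict[str, Any]:
    """Request logprobs; retry exactly once without them if rejected."""
    try:
        return create(logprobs=True, top_logprobs=5, **kwargs)
    except (TypeError, ValueError):
        # One-shot retry: same request, no logprobs arguments. A second
        # failure propagates to the caller instead of looping.
        return create(**kwargs)
```

A response obtained through the fallback path carries no logprobs, which is the case where logit_confidence stays None.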

Severity mapping is unchanged in this commit; logit_confidence is an
additional signal that downstream tooling can threshold against. Future
PR can fold it into severity demotion (e.g., MED → LOW when logit < 0.7).

Eval evidence: skillscan-corpus/eval_results/v47_logit_confidence_eval.json

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes Item B's user-facing promise: "Enables --threshold 0.85 for CI gates".
The earlier commit added Finding.logit_confidence; this commit makes it
actually usable from the command line.

Adds --ml-threshold (also SKILLSCAN_ML_THRESHOLD env var, default 0.0).
When > 0, drops PINJ-ML-001 findings whose logit_confidence is below the
threshold. Advisory findings (PINJ-ML-NO-MODEL/STALE/LARGE-FILE/UNAVAIL)
are never filtered. Findings without logit_confidence (older clients) are
also never filtered — backward-safe.
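
One way the flag and its environment-variable default could be wired, shown with `argparse` purely for illustration. The CLI framework skillscan actually uses is not stated here, so `build_parser` and the precedence (explicit flag over env var over 0.0) are assumptions.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    """Hypothetical parser: --ml-threshold defaults to SKILLSCAN_ML_THRESHOLD."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="scan-sketch")
    parser.add_argument(
        "--ml-threshold",
        type=float,
        # Env var supplies the default; 0.0 means no filtering.
        default=float(env.get("SKILLSCAN_ML_THRESHOLD", "0.0")),
        help="drop PINJ-ML-001 findings with logit_confidence below this value",
    )
    return parser
```

For example, `build_parser({"SKILLSCAN_ML_THRESHOLD": "0.85"}).parse_args([])` yields `ml_threshold == 0.85`, and passing `--ml-threshold 0.99` on the command line overrides it.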

Recommended thresholds (per the 431-file v4.7 held-out eval):
  --ml-threshold 0.99   — keeps 60%, all correct (strictest CI gate)
  --ml-threshold 0.90   — keeps 87%, all correct
  --ml-threshold 0.80   — keeps 94%, all correct, drops every model error
  --ml-threshold 0.70   — keeps 97%, 99.5% correct (lenient)

Plumbed through scanner.scan() at three CLI callsites and the underlying
_scanner.scan(). With the flag left at its default, behaviour is unchanged
for existing users.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kurtpayne force-pushed the feat/ml-logit-confidence branch from fd0df9e to c58ab0c on April 27, 2026 00:09