feat(ml): logit-derived confidence on PINJ-ML-001 findings #205
Open
kurtpayne wants to merge 2 commits into
Codecov Report
❌ Patch coverage is […]

| | main | #205 | +/- |
|---|---|---|---|
| Coverage | 75.87% | 75.90% | +0.03% |
| Files | 41 | 41 | |
| Lines | 5994 | 6052 | +58 |
| Hits | 4548 | 4594 | +46 |
| Misses | 1446 | 1458 | +12 |
The discrete `confidence` field the model emits buckets at 0.9 / 0.95 / 1.0 (83% at 0.95), which makes it useless for thresholding because every wrong prediction also lands at 0.95.

Eval data on v4.7's 431-file held-out set: all 4 model errors had logit_confidence ∈ [0.58, 0.76]; all 426 correct predictions had logit_confidence ≥ 0.99 except a handful in [0.80, 0.99]. Threshold 0.80 flags 100% of errors while accepting 94% of files.

Implementation:
- Load the GGUF with `logits_all=True` so llama-cpp-python returns per-token logprobs.
- Inference passes `logprobs=True, top_logprobs=5` alongside the existing GBNF grammar. Falls back gracefully (one-shot retry without logprobs) when an older llama-cpp-python rejects the args.
- `_extract_logit_confidence()` finds the verdict-starting token and softmaxes the logp(ben) vs logp(mal) entries to produce continuous P(predicted_verdict) ∈ [0, 1]. Handles the missing-from-top-K case with a soft floor.
- Surfaced as `Finding.logit_confidence` (Optional[float]). Older clients without logprobs payloads get None, so the change is fully backward-compatible.

Severity mapping is unchanged in this commit; logit_confidence is an additional signal that downstream tooling can threshold against. Future PR can fold it into severity demotion (e.g., MED → LOW when logit < 0.7).

Eval evidence: skillscan-corpus/eval_results/v47_logit_confidence_eval.json

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
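The two-way softmax described in this commit message can be sketched as follows. This is a minimal illustration, not the code in the PR: the function name `two_way_confidence`, the plain `dict` of top-K logprobs, and the exact token strings `"ben"` and `"mal"` as dictionary keys are assumptions for illustration. The floor of -20 is the value quoted in the PR description.

```python
import math
from typing import Mapping, Optional

# Logprob assumed for a verdict token that is missing from the top-K list.
SOFT_FLOOR = -20.0


def two_way_confidence(
    top_logprobs: Mapping[str, float], predicted: str
) -> Optional[float]:
    """Return P(predicted) from a softmax over the 'ben' and 'mal' logprobs.

    `top_logprobs` maps candidate token text to its log-probability at the
    verdict position (hypothetical input shape, for illustration only).
    """
    if predicted not in ("ben", "mal"):
        return None
    lp_ben = top_logprobs.get("ben", SOFT_FLOOR)
    lp_mal = top_logprobs.get("mal", SOFT_FLOOR)
    # Subtract the max before exponentiating for numerical stability.
    m = max(lp_ben, lp_mal)
    e_ben = math.exp(lp_ben - m)
    e_mal = math.exp(lp_mal - m)
    p_ben = e_ben / (e_ben + e_mal)
    return p_ben if predicted == "ben" else 1.0 - p_ben
```

Renormalising over only the two verdict tokens discards probability mass the model placed on unrelated tokens, so the result reads as a probability of the predicted verdict given that one of the two verdicts is emitted.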
Closes Item B's user-facing promise: "Enables --threshold 0.85 for CI gates". The earlier commit added Finding.logit_confidence; this commit makes it actually usable from the command line.

Adds --ml-threshold (also SKILLSCAN_ML_THRESHOLD env var, default 0.0). When > 0, drops PINJ-ML-001 findings whose logit_confidence is below the threshold. Advisory findings (PINJ-ML-NO-MODEL/STALE/LARGE-FILE/UNAVAIL) are never filtered. Findings without logit_confidence (older clients) are also never filtered, which keeps the flag backward-safe.

Recommended thresholds (per the 431-file v4.7 held-out eval):
- `--ml-threshold 0.99`: keeps 60%, all correct (strictest CI gate)
- `--ml-threshold 0.90`: keeps 87%, all correct
- `--ml-threshold 0.80`: keeps 94%, all correct, drops every model error
- `--ml-threshold 0.70`: keeps 97%, 99.5% correct (lenient)

Plumbed through scanner.scan() at three CLI callsites + the underlying _scanner.scan(). No CLI flag at default = no behaviour change for existing users.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
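The filtering rules in this commit message can be sketched roughly as below. The `Finding` dataclass here is a hypothetical two-field stand-in for the project's real type, and `filter_by_ml_threshold` is an illustrative name, not a function from the PR. The sketch assumes a finding exactly at the threshold is kept, since the message only says findings below the threshold are dropped.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Finding:
    """Hypothetical stand-in for the project's Finding type."""
    id: str
    logit_confidence: Optional[float] = None


def filter_by_ml_threshold(findings: List[Finding], threshold: float) -> List[Finding]:
    """Drop PINJ-ML-001 findings whose logit_confidence is below `threshold`."""
    if threshold <= 0.0:
        # Default: no filtering, existing behaviour is preserved.
        return list(findings)
    kept = []
    for f in findings:
        if f.id != "PINJ-ML-001":
            kept.append(f)  # advisory and non-ML findings are never filtered
        elif f.logit_confidence is None:
            kept.append(f)  # no logprobs payload: never filtered
        elif f.logit_confidence >= threshold:
            kept.append(f)
    return kept
```

Keeping findings that lack `logit_confidence` means an older client can never silently lose findings just because the flag is set.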
Summary
- `Finding.logit_confidence`: continuous P(predicted_verdict) ∈ [0, 1] derived from a softmax over the model's "benign"/"malicious" token logits at the verdict position.
- `--ml-threshold FLOAT` CLI flag (and `SKILLSCAN_ML_THRESHOLD` env var) to filter `PINJ-ML-001` findings whose `logit_confidence` falls below the threshold. Default `0.0` = no filtering (existing behaviour).
- Supplements the `confidence` field's 3 buckets (0.9 / 0.95 / 1.0) with a continuous signal that actually separates correct from wrong predictions. Same model, no retraining.
- An older `llama-cpp-python` that rejects `logprobs` falls back gracefully; older clients/responses without a `logprobs` payload yield `logit_confidence=None`; advisory findings (NO-MODEL/STALE/LARGE-FILE/UNAVAIL) and findings without `logit_confidence` are never filtered.

Eval evidence
Data from v4.7's 431-file held-out set: `skillscan-corpus/eval_results/v47_logit_confidence_eval.json`.

Threshold semantics: `--ml-threshold 0.80` drops every model error while accepting 94% of files.

Implementation
- The model is loaded with `logits_all=True` (memory cost; required for logprobs).
- Inference passes `logprobs=True, top_logprobs=5` alongside the existing GBNF grammar.
- `_extract_logit_confidence()` scans token-level logprobs for the first `ben`/`mal` token, softmaxes the candidate-token logits, and returns P(predicted_verdict). Handles missing-from-top-K with a soft floor (-20).
- One-shot retry without `logprobs` if an older `llama-cpp-python` rejects the args.
- `--ml-threshold` is plumbed through `scanner.scan()` at three CLI callsites; the filter runs after `ml_prompt_injection_findings()` returns and only touches `PINJ-ML-001` IDs.

Severity mapping is unchanged in this PR; `logit_confidence` is an additional signal downstream tooling can threshold against. A future PR can fold it into severity demotion (e.g., MED → LOW when logit < 0.7).

Test plan
- `SKILLSCAN_NO_USER_RULES=1 pytest tests/test_ml_detector.py -q`: 27/27 passing (18 existing + 9 new)
- `ruff check`: clean on all touched files
- […] `skillscan model sync`)

🤖 Generated with Claude Code
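The one-shot retry listed under Implementation might look something like the sketch below. It is written against a generic callable rather than llama-cpp-python itself, so `complete_with_logprob_fallback` and the `create` parameter are hypothetical names. Catching `TypeError` is an assumption about how an older client rejects the unknown keyword arguments; the PR may detect the rejection differently.

```python
from typing import Any, Callable, Dict


def complete_with_logprob_fallback(
    create: Callable[..., Dict[str, Any]], **kwargs: Any
) -> Dict[str, Any]:
    """Call `create` with logprobs enabled, retrying once without them.

    `create` stands in for whatever completion function the detector calls.
    """
    try:
        return create(logprobs=True, top_logprobs=5, **kwargs)
    except TypeError:
        # Assumed failure mode: an older client does not accept these
        # keyword arguments. Retry exactly once with the original arguments.
        return create(**kwargs)
```

A response produced by the retry carries no logprobs payload, which corresponds to the `logit_confidence=None` case described in the Summary.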