ML augmented SMB share hunter. Snaffler successor with a two stage classifier pipeline.
ShareSift ranks files on SMB shares by likelihood of containing credentials or secrets. Stage 1 runs a LightGBM path classifier on every path. Stage 2 runs a Qwen3 1.7B LoRA content classifier on the flagged files to confirm. Run Stage 1 alone or both stages together.
Snaffler catches the obvious patterns like id_rsa, NTDS.dit, and .kdbx. It misses the long tail. Custom scripts on shares, secrets in unusual filenames, and password directories with unconventional names all slip through.
ShareSift adds an ML layer on top. The path classifier beats Snaffler on recall by 29.3 percentage points on the Snaffler blind benchmark. The content classifier closes most of the gap to Biringa and Kul 2025 at one quarter the parameter count.
| Metric | ShareSift | Baseline |
|---|---|---|
| Windows path classifier PR AUC, Snaffler blind benchmark | 0.985 | Snaffler has no ML |
| Linux path classifier PR AUC, Linux rule blind benchmark | 0.99 | Rule pack F1 0.45 |
| Linux F1 vs hand curated rule pack | +52 pp | Rule pack F1 0.45 |
| Content classifier F1 on docx benchmark (v0p6) | 0.776 | v0p5 0.385 |
| Content classifier precision on docx benchmark | 0.974 | 2.6% of flagged files are false positives |
| End to end F1 on constructed share benchmark | 0.387 | v0p5 0.166 |
See docs/audit_2026-05-31.md for calibration details and the full audit story.
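The table reports F1 and precision for the content classifier but not recall. Assuming both docx rows describe the same model version and test set (the table does not say so explicitly), the implied recall follows from the definition of F1. A small sketch of that arithmetic, with `recall_from_f1` as an illustrative helper name:

```python
def recall_from_f1(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for recall R, giving R = F1*P / (2P - F1)."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("F1 and precision are inconsistent")
    return f1 * precision / denom


# Values copied from the docx benchmark rows of the table above.
# Assumption: both rows refer to the same model and test set.
implied_recall = recall_from_f1(f1=0.776, precision=0.974)
print(f"implied recall: {implied_recall:.3f}")
```

Under that assumption the classifier is tuned for precision: when it says yes it is almost always right, while recall comes out near 0.645.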
git clone https://github.com/byevincent/ShareSift.git
cd ShareSift
# Stage 1 only (path classifier, ~100MB)
uv sync
# Both stages (adds ~3GB of torch and transformers)
uv sync --group content-inference
Add --group content-training for LoRA fine-tuning. That pulls another 5GB.
Pipe output from your enumeration tool directly into ShareSift.
manspider --target \\fileserver -d corp.local | \
uv run sharesift score-paths --stdin
# Or from a file
uv run sharesift score-paths \
--input enumerated_paths.txt \
--output scored.jsonl
Output is JSONL with path, probability, and tier (Black, Red, Yellow, or null).
{"path": "\\\\fileserver\\Finance\\backups\\creds.kdbx", "probability": 0.987, "tier": "Black"}
{"path": "\\\\fileserver\\Dev\\notes.txt", "probability": 0.523, "tier": "Yellow"}
{"path": "\\\\fileserver\\Marketing\\Q4.pdf", "probability": 0.012, "tier": null}
Work through Black first, then Red, then Yellow. Use jq to sort and filter.
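If jq is not at hand, the same triage ordering can be sketched in Python. This is a minimal example that assumes only the three fields shown in the output above (path, probability, tier). The `triage` function and `TIER_ORDER` mapping are illustrative names, not part of ShareSift.

```python
import json

# Tier names come from the output format above; None means "not flagged".
TIER_ORDER = {"Black": 0, "Red": 1, "Yellow": 2, None: 3}


def triage(lines):
    """Parse JSONL records and sort by tier, then by descending probability."""
    records = [json.loads(line) for line in lines if line.strip()]
    return sorted(
        records,
        key=lambda r: (TIER_ORDER.get(r.get("tier"), 3), -r["probability"]),
    )


# The three sample records from the output above, deliberately shuffled.
sample = [
    r'{"path": "\\\\fileserver\\Marketing\\Q4.pdf", "probability": 0.012, "tier": null}',
    r'{"path": "\\\\fileserver\\Dev\\notes.txt", "probability": 0.523, "tier": "Yellow"}',
    r'{"path": "\\\\fileserver\\Finance\\backups\\creds.kdbx", "probability": 0.987, "tier": "Black"}',
]

ordered = triage(sample)
for record in ordered:
    print(record["tier"], record["probability"], record["path"])
```

To run it over real output, pass an open file handle instead of the sample list, for example `triage(open("scored.jsonl"))`.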
find ./downloaded_share -type f | \
uv run sharesift scan-files --stdin \
--output deep_scan.jsonl
Stage 2 adds content_check and content_excerpt to each record.
{"path": "./downloaded_share/Dev/notes.txt", "path_probability": 0.52, "path_tier": "Yellow", "content_check": "yes", "content_excerpt": "API_KEY = 'sk_live_...'"}
Stage 2 runs only on tier flagged paths. Override with --force-content. On CPU this takes 5 to 8 seconds per file. On CUDA it runs in about 150ms.
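The per-file timings above explain why Stage 2 is limited to the tier flagged subset. A rough budget sketch, assuming files are processed one at a time and using a hypothetical count of 10,000 flagged files (not a ShareSift figure):

```python
def stage2_hours(n_files: int, seconds_per_file: float) -> float:
    """Wall-clock hours to run Stage 2 sequentially over n_files."""
    return n_files * seconds_per_file / 3600


n_flagged = 10_000  # hypothetical number of tier flagged files

cpu_low = stage2_hours(n_flagged, 5.0)   # CPU, lower bound quoted above
cpu_high = stage2_hours(n_flagged, 8.0)  # CPU, upper bound quoted above
gpu = stage2_hours(n_flagged, 0.150)     # CUDA, about 150ms per file

print(f"CPU:  {cpu_low:.1f} to {cpu_high:.1f} hours")
print(f"CUDA: {gpu * 60:.0f} minutes")
```

At that scale the CPU estimate lands between roughly 14 and 22 hours, against about 25 minutes on CUDA, so --force-content on a large share is best reserved for GPU hosts.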
Two stage pipeline. Each stage runs independently.
            ┌─────────────────────────────────────────┐
            │  Stage 1 router (by path shape)         │
            │                                         │
            │  UNC path  → Windows model              │
path list → │  Unix path → Linux model                │ → (probability, tier)
            │                                         │
            │  LightGBM + char n-grams +              │
            │  8 hand features, calibrated            │
            │  probability, per-model tier band       │
            └─────────────────────────────────────────┘
                                 │
                       (tier flagged subset)
                                 ↓
                   ┌──────────────────────────┐
                   │ Stage 2: Qwen3-1.7B LoRA │ → (yes / no on secret presence)
                   │ via transformers + PEFT  │
                   │ 4-bit base on CUDA       │
                   │ bf16 on CPU              │
                   └──────────────────────────┘
Stage 1 is trained on 11,190 Windows and 1,685 Linux records and scores each path in under one millisecond. The Stage 2 model takes 1.5 to 3.4GB depending on your hardware (4-bit base on CUDA, bf16 on CPU).
Full design in docs/architecture.md and docs/build_plan.md.
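The Stage 1 router in the diagram picks a model by path shape. The sketch below shows one way such routing could work. It is not the logic in path.py: the `route` function, the regular expressions, and the treatment of drive letter paths and relative paths are all assumptions made for illustration.

```python
import re

# Assumed shapes: UNC (\\server\share\...) and drive letter paths count as
# Windows; anything starting with "/" counts as Unix.
_UNC = re.compile(r"^\\\\[^\\]+\\")
_DRIVE = re.compile(r"^[A-Za-z]:[\\/]")


def route(path: str) -> str:
    """Return the model family a path would be sent to: 'windows' or 'linux'."""
    if _UNC.match(path) or _DRIVE.match(path):
        return "windows"
    if path.startswith("/"):
        return "linux"
    # Relative or ambiguous paths: fall back on the dominant separator.
    return "windows" if path.count("\\") > path.count("/") else "linux"


for example in (
    "\\\\fileserver\\Finance\\backups\\creds.kdbx",
    "/home/deploy/.ssh/id_rsa",
    "./downloaded_share/Dev/notes.txt",
):
    print(f"{route(example):8s} {example}")
```

Routing on shape rather than on a user supplied flag means a mixed path list can be scored in one pass.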
ShareSiftPathRule is a SnaffleRule subclass that plugs the path classifier into pysnaffler's SMB enumeration loop.
uv sync --group pysnaffler-integration
from sharesift.pysnaffler_run import build_ruleset
from pysnaffler.snaffler import pySnaffler
# ML only: ShareSift replaces Snaffler's rule pack
ruleset = build_ruleset()
# Hybrid: Snaffler defaults plus ShareSift
ruleset = build_ruleset(include_defaults=True)
snaffler = pySnaffler(ruleset=ruleset, dry_run=True)
No validation against real engagement findings. ShareSift labels come from a Claude rule and Codex audit pipeline on public corpus data. That is useful signal, but it is a different class from internal engagement grade ground truth.
Calibration holds in distribution. The tier band precision contracts are reliable on data from the same source as training. On out of distribution data the Windows model ECE rises from 0.007 to 0.30. Treat tier assignments as triage ordering, not probability contracts, when scanning real SMB shares.
Cross source generalization is weaker than the headline numbers suggest. Windows PR AUC drops from 0.97 to 0.76 when you train on Stack Exchange and test on GitHub Code Search. Real SMB shares are a third distribution that neither training nor evaluation covers.
The content classifier sits 13 F1 points below Biringa and Kul 2025. The gap is attributed to model size: their Mistral 7B is about four times larger than Qwen3 1.7B, which ShareSift uses because it targets an RTX 4070 deployment.
Rare credential categories are undertrained. Private keys, SSH credentials, cloud credentials, and IAC each have three or fewer training records. Recall on those classes is weak.
Stage 2 is slow on CPUs without AVX 512. It was benchmarked at 5 to 8 seconds per file on a Ryzen 5 3600.
See docs/audit_2026-05-31.md and docs/audit_2026-05-30.md for the full audit history.
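The calibration caveat above quotes expected calibration error (ECE). For readers who have not met the metric, here is a minimal sketch of the common equal width binned form for a binary classifier. The source does not state which ECE variant or bin count ShareSift uses, so the 10 bins and the function name are assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean of |observed positive rate - mean probability|."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        pos_rate = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(pos_rate - mean_prob)
    return ece


# Toy data: the model says 0.9 four times but is right only three times.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(f"toy ECE: {toy_ece:.2f}")
```

An ECE of 0.30 means predicted probabilities are off by 30 percentage points on average, which is why the tiers are better read as an ordering than as precision guarantees on out of distribution shares.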
src/sharesift/ runtime package
path.py PathClassifier router (Windows and Linux models)
content.py content classifier wrapper
tier.py probability to tier band
features.py char n-gram and hand features
prompt.py content classifier chat template formatter
pysnaffler_rule.py SnaffleRule plugin
pysnaffler_run.py build_ruleset() helper
cli.py score-paths and scan-files entry point
src/eval/ training and evaluation scripts
tools/ training, dataset builders, audit tools
docs/ engineering log and architecture docs
models/ trained model weights
Apache 2.0. See NOTICE for GPLv3 components (vendored Snaffler ruleset and pysnaffler).
This is an active solo build. Major design decisions are tracked in docs/build_plan.md and docs/journal.md. Open an issue before sending a PR.