fix(security): disk-persistent prompt-injection hardening (inject_claude_md content-sanitization + path guards) by WRG-11 · Pull Request #47 · WRG-11/instinct

WRG-11 · 2026-05-31T19:34:52Z

Summary — R89-132b (security)

inject_claude_md embedded promoted-rule fields (pattern/category/explain) into a CLAUDE.md file with no neutralization. An attacker who promotes a rule via observe(explain="legit\n- malicious instruction") × 10 could inject arbitrary Markdown instruction lines into ~/.claude/CLAUDE.md — prompt injection via disk that later Claude sessions load.
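To make the break-out concrete, here is a minimal sketch of an unsanitized sink. The helper name `render_bullet_unsafe` and the bullet layout are assumptions for illustration only; they are not taken from the project's code.

```python
def render_bullet_unsafe(pattern: str, category: str, explain: str) -> str:
    """Hypothetical pre-fix sink: fields are interpolated verbatim."""
    return f"- `{pattern}` ({category}): {explain}"


# A newline inside `explain` ends the bullet early, so the text after it
# becomes a separate list item that a later session reads as an instruction.
poisoned = render_bullet_unsafe(
    "some pattern", "other", "legit\n- malicious instruction"
)
lines = poisoned.splitlines()
print(lines)
```

The single rule now occupies two lines of the file, and the second line is fully attacker-controlled.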

A path guard alone is not sufficient: the headline target (~/.claude/CLAUDE.md) is already a legitimate .md destination, so blocking the path would break the tool. The real defense is content-sanitization at the sink.
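A rough sketch of a suffix and symlink guard shows why it cannot be the primary control. The function name `check_markdown_target`, the allow-list constant, the error messages, and the example paths are all hypothetical.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".md", ".mdx"}  # assumed allow-list, per the PR description


def check_markdown_target(path: Path) -> None:
    """Hypothetical guard: reject non-Markdown suffixes and symlinks."""
    # The suffix check needs no filesystem access, so running it first means
    # the error never reveals whether an arbitrary path exists.
    # Case-insensitive matching is an assumption of this sketch.
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError("target must be a .md or .mdx file")
    if path.is_symlink():
        raise ValueError("target must not be a symlink")


# An example CLAUDE.md path is a legitimate Markdown target, so it passes.
check_markdown_target(Path("/home/alice/.claude/CLAUDE.md"))

# A non-Markdown path is rejected before any existence check could run.
try:
    check_markdown_target(Path("/etc/passwd"))
    rejected = False
except ValueError:
    rejected = True
```

Because the headline target passes the guard, the guard narrows where content can be written but does nothing about what is written.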

3-layer fix

| Layer | Finding | Change |
| --- | --- | --- |
| L1 (primary) | INSTINCT-M-001 (MED) | `_sanitize_inline()` neutralizes every field embedded into a CLAUDE.md bullet: collapses CR/LF/control chars → space (no line break-out), backticks → `'` (no code-span escape), and breaks the block's own `<!--`/`-->` fences (no smuggled `instinct:end` marker). Applied to `pattern`, `category`, `explain`. Defends even rows poisoned by another write path / pre-fix. |
| L2 (def-in-depth) | INSTINCT-M-001 / L-001 (LOW) | `.md`/`.mdx` suffix + symlink guard on `inject_claude_md`; `.md`/`.mdx` suffix guard on `import_claude_md` checked before the existence check, closing the "file not found: <path>" existence oracle. |
| L3 | INSTINCT-L-002 (LOW) | `_validate_category` coerces unknown categories → `"other"` and `observe()` now uses the return value. The old code discarded it, so the coercion was a no-op; fixed here. |
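The L1 behavior can be sketched as follows. This is an illustration under assumptions, not the merged `_sanitize_inline()`: the function name `sanitize_inline`, the exact control-character range, and the way the comment fences are broken (inserting a space) are guesses at one reasonable implementation.

```python
import re

# Runs of C0 control characters (including CR, LF, tab) and DEL.
_CONTROL_RUN = re.compile(r"[\x00-\x1f\x7f]+")


def sanitize_inline(value: str) -> str:
    """Hypothetical sketch: make one field safe to embed in a single bullet."""
    text = _CONTROL_RUN.sub(" ", value)  # no line break-out
    text = text.replace("`", "'")        # no code-span escape
    # Break HTML-comment fences so a field cannot open or close the
    # managed block (for example by smuggling an end marker).
    text = text.replace("<!--", "<! --").replace("-->", "-- >")
    return text.strip()


cleaned = sanitize_inline("legit\n- malicious instruction")
print(cleaned)  # legit - malicious instruction
```

Sanitizing at the sink rather than at `observe()` means rows that were already stored, or that arrive through another write path, are neutralized as well.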

Adapted from the H R89-130h reference patches, not blind-applied: H's M-001 suffix/symlink guard does not close the headline attack (a .md target passes it), so the primary control here is sink-side content-sanitization. The L-002 patch as written was a no-op until observe() was wired to use the validated category.
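The L-002 no-op is easiest to see side by side. In this sketch the category set and the `observe_before`/`observe_after` functions are hypothetical stand-ins; only the pattern of discarding versus using the validator's return value reflects the description above.

```python
VALID_CATEGORIES = {"style", "safety", "other"}  # hypothetical category set


def validate_category(category: str) -> str:
    """Return the category if it is known, otherwise coerce it to "other"."""
    return category if category in VALID_CATEGORIES else "other"


def observe_before(category: str) -> str:
    validate_category(category)  # result discarded: the coercion has no effect
    return category


def observe_after(category: str) -> str:
    category = validate_category(category)  # result used: the coercion applies
    return category


print(repr(observe_before("x\n- evil")))  # 'x\n- evil'
print(repr(observe_after("x\n- evil")))   # 'other'
```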

Tests (TDD, RED→GREEN proven)

tests/test_injection_hardening_r89_132b.py — 11 new tests:

8 negatives: explain/pattern newline injection, poisoned-DB-row sink defense, end-marker smuggling, category coerce-on-observe, non-.md suffix reject (inject + import), symlink reject.
3 "legit still works": valid category preserved, normal .md inject, normal .md import.

RED proven: with the store.py fix stashed, exactly the 8 security tests fail; the 3 legit-use tests pass in both states (non-vacuous). GREEN after. Full suite: 145 passed. Legitimate ~/.claude/CLAUDE.md use is preserved.

Notes

One pre-existing ruff I001 (cosmetic whitespace before a # type: ignore on the tomli fallback line in load_config) is left untouched — it predates this change and is unrelated to the security fix. All added code is ruff-clean.
Do not merge — pending security review + operator merge.

…(M-001 + L-001/L-002)

inject_claude_md embedded promoted-rule fields (pattern/category/explain) into CLAUDE.md with no neutralization. An attacker who promotes a rule via observe(explain="x\n- malicious instruction") x THRESHOLD_RULE could inject arbitrary Markdown instruction lines into ~/.claude/CLAUDE.md (prompt injection via disk; later sessions load it). A path guard alone is insufficient: the headline target is already a legitimate .md file, so the real defense is at the content sink.

3-layer fix:
- L1 (primary): _sanitize_inline() neutralizes EVERY embedded field at the inject_claude_md sink. It collapses CR/LF/controls to a space (no line break-out), neutralizes backticks (no code-span escape) and the block's own HTML-comment fences (no smuggled instinct:end marker). Defends even rows poisoned by another write path / pre-fix.
- L2 (defense-in-depth): .md/.mdx suffix + symlink guard on inject_claude_md; .md/.mdx suffix guard on import_claude_md, checked BEFORE the existence check so the "file not found" existence oracle is closed (INSTINCT-L-001).
- L3: _validate_category coerces unknown categories to "other" AND observe() now USES the return value. The old code discarded it, so the coercion was a no-op (INSTINCT-L-002).

Tests: 11 new (8 injection/guard negatives + 3 legit-still-works). RED proven (exactly the 8 security tests fail pre-fix), GREEN after. Full suite 145 passed. Legitimate ~/.claude/CLAUDE.md use preserved.

WRG-11 merged commit fd2b000 into main May 31, 2026
12 checks passed

WRG-11 deleted the fix/r89-132b-injection-hardening branch May 31, 2026 20:32

WRG-11 merged 1 commit into main from fix/r89-132b-injection-hardening
