Cross-cutting: chat-template backdoor classification (LLM03 ↔ LLM04) + horizontal runtime governance pattern (LLM03/04/06/08) #68

KeystoneSmartQuotes · 2026-05-06T19:12:06Z


Following the April 29 thread discussion in #team-genai-top-10-llm-llm04 (with Mark Roxberry, Anitha Dakamarri, Ariel Fogel, and Tom Kompare on chat-template scope) and the May 4 horizontal cross-reference document I circulated to that channel ("Action-Boundary Policy Evaluation: A Cross-Cutting Cross-Reference for LLM 2026"), opening this thread for Sprint 2 cross-linking discussion.

Two cross-cutting items:

LLM03 ↔ LLM04 reciprocal cross-reference for chat-template backdoors

The LLM04 merged Description correctly captures chat-template backdoors as a bundled-non-weight-artifact concern under the persistence axis. The LLM03 entry covers the supply-chain origin (distributors, registries, model files). Per Tom Kompare's threat-actor / entry-point decomposition: chat-template backdoors are jointly classified — supply-chain origin (LLM03) plus inference-time persistence (LLM04). Readers arriving at either entry would benefit from a brief reciprocal pointer to the other.

The substantive deep-dive should stay in LLM04. A short pair of reciprocal cross-references (in LLM03: "see LLM04 for inference-pipeline persistence detail"; in LLM04: "see LLM03 for supply-chain origin context") preserves navigation between the entries without duplicating content.

Horizontal runtime defense-in-depth pattern across LLM03 / LLM04 / LLM06 / LLM08

Several entries describe the same architectural defense pattern from different angles:

LLM03 — runtime evaluation for artifact-borne persistence that passes signature verification
LLM04 — runtime evaluation for inference-time-triggered persistence (the chat-template intake mitigation covers intake; runtime is the complement; see issue 39 for the LLM04-specific proposal)
LLM06 — Mitigation 7 Complete Mediation provides the canonical control body ("an independent pre-execution policy decision point with graduated enforcement: audit, warn, block, escalate")
LLM08 — runtime evaluation for actions taken on retrieved content where retrieval-layer validation is insufficient

The horizontal pattern is consistent. Keeping the canonical control body in LLM06 Mitigation 7 and having LLM03/04/08 reference it via short cross-reference language (rather than each entry duplicating the control description) preserves consistency and avoids fragmentation.
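To make the shared control body concrete, here is a minimal sketch of "an independent pre-execution policy decision point with graduated enforcement: audit, warn, block, escalate." Everything in it is a hypothetical illustration, not text from LLM06 Mitigation 7 or the cross-reference draft: the names `ActionRequest`, `Rule`, `decide`, and `enforce`, the `source` provenance label, and the choice to resolve conflicting rules by taking the most severe outcome (with ESCALATE ranked highest, following the order in the quoted text).

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable


class Enforcement(IntEnum):
    # Ordered as listed in the quoted control text; ranking ESCALATE as the
    # most severe outcome is an assumption of this sketch.
    ALLOW = 0
    AUDIT = 1
    WARN = 2
    BLOCK = 3
    ESCALATE = 4


@dataclass
class ActionRequest:
    tool: str
    args: dict
    source: str  # hypothetical provenance label, e.g. "user" or "retrieved"


@dataclass
class Rule:
    name: str
    matches: Callable[[ActionRequest], bool]
    outcome: Enforcement


def decide(request: ActionRequest, rules: list[Rule]) -> tuple[Enforcement, list[str]]:
    """Evaluate every rule; return the strictest matching outcome and the
    names of the rules that fired. No matching rule means ALLOW."""
    fired = [rule for rule in rules if rule.matches(request)]
    if not fired:
        return Enforcement.ALLOW, []
    return max(rule.outcome for rule in fired), [rule.name for rule in fired]


def enforce(request: ActionRequest, rules: list[Rule],
            execute: Callable[[ActionRequest], object], audit_log: list) -> object:
    """Decide before the side effect runs, record the decision, then act."""
    outcome, reasons = decide(request, rules)
    audit_log.append((request.tool, outcome.name, reasons))
    if outcome >= Enforcement.BLOCK:
        # BLOCK denies outright; ESCALATE would hand off to a human review
        # queue (not modeled here). In both cases the action does not run.
        return None
    # ALLOW, AUDIT, and WARN all execute; in this sketch they differ only in
    # the log entry, whereas a real deployment would surface WARN to someone.
    return execute(request)
```

For example, a rule that escalates any `send_email` request whose `source` is `"retrieved"` stops a retrieved document from triggering email at the boundary and logs why, while ordinary audited calls still run. Taking the maximum over all matching rules keeps the decision independent of rule order.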

Cross-reference language proposals for each direction are in the document circulated to the LLM04 channel on May 4. Happy to share the draft document directly if helpful for Sprint 2 working group review.

Evidence and prior context:

April 29, 2026 LLM04 channel thread (Q1/Q2 framing on chat-template scope)
May 4, 2026 horizontal cross-reference document circulated to LLM04 channel (publicly endorsed by Mark Roxberry on May 5)
Fogel et al. arXiv:2602.04653 sections 4 and 5 (chat-template properties demonstrating weight-based-backdoor signature; cited in LLM04 Reference 20 and Scenario 6)
Hubinger et al. arXiv:2401.05566 (Sleeper Agents; cited in LLM04 Scenario 5)
2026 LLM06 Excessive Agency Mitigation 7 (Complete Mediation, merged May 2026)

rocklambros · 2026-05-07T21:07:49Z

Maintainer

Whew! Thanks for the comprehensive write-up!

Two quick reads, with the deeper conversation moving to Slack/our call on 5/20.

I agree with the chat-template cross-reference, in principle. Tom's threat-actor/entry-point lens looks right for vulnerabilities that span supply-chain origins and inference-time exploitation. @KeystoneSmartQuotes can you please attach the May 4 cross-reference document directly to this discussion so we're all reading from the same draft. One question worth raising... should the LLM03/LLM04 pointer name chat templates specifically, or generalize to bundled non-weight inference-time artifacts (templates plus tokenizer_config.json, generation_config.json, processor configs, and custom modeling code under trust_remote_code)? The wider attack path "class" (if you want to call it that) shares enough mechanics that picking one risks looking selective.
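A rough sketch of what intake checking could look like if the pointer covers the whole bundled-artifact class rather than chat templates alone. It assumes a locally downloaded model directory and a hypothetical pinned manifest mapping relative paths to SHA-256 digests recorded at review time. The file names and suffixes in scope are illustrative assumptions (Hugging Face tokenizers commonly store `chat_template` in `tokenizer_config.json`, and code loaded under `trust_remote_code` ships as `.py` files referenced from `auto_map` entries), and the function names are hypothetical, not an existing tool. A digest match only shows an artifact is unchanged since review, not that it is benign, which is why the runtime complement discussed above still matters.

```python
import hashlib
from pathlib import Path

# Hypothetical scope for "bundled non-weight inference-time artifacts".
ARTIFACT_NAMES = {
    "tokenizer_config.json",     # commonly carries the chat_template
    "generation_config.json",
    "preprocessor_config.json",  # processor configs
    "processor_config.json",
    "config.json",               # auto_map entries can point at custom code
}
ARTIFACT_SUFFIXES = {".jinja", ".py"}  # standalone templates, remote-code modules


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large files are not read at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def bundled_artifacts(model_dir: Path) -> list[Path]:
    """Return in-scope artifacts under model_dir as sorted relative paths."""
    return [
        p.relative_to(model_dir)
        for p in sorted(model_dir.rglob("*"))
        if p.is_file() and (p.name in ARTIFACT_NAMES or p.suffix in ARTIFACT_SUFFIXES)
    ]


def check_against_manifest(model_dir: Path, pinned: dict[str, str]) -> dict[str, list[str]]:
    """Compare artifact digests with a pinned {relative_path: sha256} manifest."""
    report = {"changed": [], "unpinned": [], "missing": []}
    seen = set()
    for rel in bundled_artifacts(model_dir):
        key = rel.as_posix()
        seen.add(key)
        if key not in pinned:
            report["unpinned"].append(key)   # new artifact nobody reviewed
        elif pinned[key] != sha256_of(model_dir / rel):
            report["changed"].append(key)    # artifact modified since review
    report["missing"] = sorted(set(pinned) - seen)
    return report
```

Any non-empty `changed` or `unpinned` list would be a reason to hold the model at intake; weight files such as `.safetensors` are deliberately out of scope here because they fall under the weight-based checks.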

Now, for the horizontal runtime defense-in-depth pattern, I want to push back on anchoring four entries in LLM06 Mitigation 7. The more we anchor there, the more it looks like an Agentic Top 10 entry. Remember, our charter's line in the sand is "model-in-isolation vs. system-coordinating-components."

Async in Slack, be sure to tag me and Steve if you are looking for specific input from us (so many Slacks... so... so many Slacks...). If you want to sync up on a call before 5/20, happy to do so.


KeystoneSmartQuotes May 8, 2026
Author

Thanks @rocklambros, both points landed cleanly. Attaching two artifacts (also tagging @virtualsteve-star and @rossja):


LLM_2026_Cross_Reference_Draft.docx: the May 4 working draft you asked for.

LLM_2026_Cross_Reference_Draft_Rev6.docx: a revision incorporating both pushbacks. The LLM03/LLM04 pointer now covers bundled artifacts generally, per your list, and a peer cross-reference structure replaces the LLM06 anchoring (LLM06 Mitigation 7 keeps its current text; the cross-reference itself becomes the shared target). I also caught a few minor citation issues in self-review; the change log is inside the doc.

Won't be on the 5/20 call. Will tag you and @virtualsteve-star in the relevant Slack threads going forward.

— Boone

musaabhasan · 2026-05-09T08:38:56Z


The cross-reference makes sense because chat-template backdoors sit at an awkward boundary: they are supply-chain artifacts, but their runtime effect is behavioral control.

I would classify them with two linked labels rather than forcing a single home:

primary entry point: supply chain or model/package artifact governance
runtime impact: prompt injection, instruction hierarchy compromise, unsafe tool use, or data exfiltration

That lets LLM03 describe how the artifact enters the system, while LLM04 explains how the embedded instruction changes model behavior after loading. The same pattern also applies to poisoned tool descriptions, malicious skill files, and unsafe default system prompts.

For the runtime governance pattern, I would recommend one concrete control language across the cross-reference: separate trusted instructions, untrusted content, tool metadata, and generated outputs into explicit trust zones, then enforce action-boundary checks before side effects. That makes the mitigation actionable across LLM03, LLM04, LLM06, and LLM08 rather than only taxonomic.
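A minimal sketch of the trust-zone idea, assuming every piece of context carries a zone label assigned at ingestion. The zone names, the `Segment`/`ToolCall` shapes, and the single rule shown are illustrative assumptions, not proposed entry text. The rule: a side-effecting tool call is escalated if any argument value appears in untrusted content, tool metadata, or generated output but nowhere in trusted instructions. Verbatim substring matching is a deliberately crude stand-in for real provenance tracking.

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    TRUSTED_INSTRUCTION = "trusted_instruction"  # system/developer/user instructions
    UNTRUSTED_CONTENT = "untrusted_content"      # retrieved docs, web pages, files
    TOOL_METADATA = "tool_metadata"              # tool names and descriptions
    GENERATED_OUTPUT = "generated_output"        # earlier model outputs


@dataclass
class Segment:
    zone: Zone
    text: str


@dataclass
class ToolCall:
    name: str
    args: dict[str, str]
    has_side_effects: bool


def tainted_args(call: ToolCall, context: list[Segment]) -> list[str]:
    """Return argument names whose value appears verbatim in some non-trusted
    segment but in no trusted-instruction segment."""
    trusted = [s.text for s in context if s.zone is Zone.TRUSTED_INSTRUCTION]
    other = [s.text for s in context if s.zone is not Zone.TRUSTED_INSTRUCTION]
    flagged = []
    for key, value in call.args.items():
        if not value:
            continue  # an empty string is a substring of everything
        if any(value in t for t in other) and not any(value in t for t in trusted):
            flagged.append(key)
    return flagged


def action_boundary_check(call: ToolCall, context: list[Segment]) -> str:
    """Runs before any side effect and returns 'allow' or 'escalate'."""
    if not call.has_side_effects:
        return "allow"  # this sketch only gates side-effecting calls
    return "escalate" if tainted_args(call, context) else "allow"
```

The same check applies whether the untrusted text arrived through a chat template (LLM03/LLM04), a retrieved document (LLM08), or a poisoned tool description, which is the sense in which one control language can serve all four entries.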
