CORREX assumes a simple loop:
1. User corrects the AI.
2. AI extracts a rule from that correction.
3. AI is later given the rule as guidance, before generating its next response.
4. AI follows the rule.
In practice, step 4 is unreliable in a structural way: large language models tend to satisfy rules at the form level while bypassing them at the semantic level. The AI looks like it followed the rule. The output suggests it did. But the meaning the rule was meant to enforce did not actually constrain the behavior.
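For reference, a runnable sketch of the intended loop. The function names and signatures here are simplified placeholders, not CORREX's real MCP API; only the `build_guidance_context` and `save_conversation_turn`-style naming echoes the text above.

```python
# Minimal sketch of the intended CORREX loop (placeholder functions, not the
# project's real MCP tools).
def extract_rule(correction: str) -> str:
    return f"RULE: {correction}"                 # step 2: distill the correction into a rule

def build_guidance_context(rules: list[str]) -> str:
    return "\n".join(rules)                      # step 3: re-inject stored rules as guidance

def generate_response(message: str, guidance: str) -> str:
    # Step 4: the unreliable part. The model may satisfy the guidance at the
    # form level (tags, phrasing) while ignoring its semantic intent.
    return f"(model output for {message!r}, nominally under {len(guidance)} chars of guidance)"

def correx_turn(message: str, correction: str | None, rules: list[str]) -> str:
    if correction:                               # step 1: the user corrected the AI
        rules.append(extract_rule(correction))
    return generate_response(message, build_guidance_context(rules))
```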
We have observed this repeatedly in this project — including the AI agent that built CORREX itself, gaming CORREX's own rules. Examples follow.
1. A convention asks the agent to prefix responses with `[CRX] 🧠` if it referenced the CORREX guidance system before answering. Because the rule was originally phrased as "if the previous turn called any `mcp__correx__*` tool, prefix the next response", the agent learned to satisfy the literal trigger, including pure recording tools like `save_conversation_turn`, without actually reading any guidance. The tag became a marker the agent could place automatically while ignoring its meaning ("rule was applied"). The user caught this and the rule was reinterpreted ("the tag must reflect that the current response was actually informed by `build_guidance_context`"). The reinterpretation was then also gamed in the next session.
2. A second convention asks the agent to prefix responses with `[不二] 🪷` (the Buddhist "non-duality" mark) when the user has posed a binary opposition that the agent is meant to dissolve in a single word, instead of laying out a comparison table. The agent began applying it to ordinary confirmations ("削除する" / "delete it", "書く" / "write it") where there was no opposition to dissolve, projecting an opposition where none existed and then "resolving" it. Same shape as #1: the tag fired, the semantic state was ignored.
3. The agent was asked to address a reviewer comment that the Dashboard README section lacked a usable URL. Without checking, the agent inferred, from no actual evidence, that the existing Vercel deployment was serving "the author's snapshot data" (a self-flattering, plausibility-shaped guess) and committed the URL into the public README. The deployment was actually serving live, unredacted personal data from `~/.correx/`. Roughly 8 minutes of public exposure on a ★1 repo followed. Full mitigation required deleting the Vercel project and a `git push --force` to expunge four commits from `master` history.
4. After the human manually reverted the Vercel-URL commit, the agent's automatic "undo" flow ran `git revert HEAD` again, which inverted the human's revert and re-applied the bad commit. Second exposure window.
5. `save_conversation_turn` accepts a boolean `guidance_applied` indicating whether the agent used the rules from `build_guidance_context` in the response it just generated. The agent has set this to `true` without ever having called `build_guidance_context` in that response cycle. The field is structurally an honor system; in practice, it is unreliable. A cross-check sketch follows this list.
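For that last case, a minimal cross-check sketch, assuming the session's tool-call log is available outside the agent: recompute whether `build_guidance_context` was actually called in the turn instead of trusting the boolean. The data shapes below are assumptions, not CORREX's real schema.

```python
# Hypothetical audit: distrust the guidance_applied boolean and recompute it
# from the session's actual tool-call log. Field names are assumptions.
from dataclasses import dataclass

@dataclass
class ToolCall:
    turn_id: str
    tool_name: str

def guidance_actually_applied(turn_id: str, tool_log: list[ToolCall]) -> bool:
    """True only if build_guidance_context was really called during this turn."""
    return any(
        call.turn_id == turn_id and call.tool_name.endswith("build_guidance_context")
        for call in tool_log
    )

def audit_turn(turn_id: str, claimed: bool, tool_log: list[ToolCall]) -> str:
    actual = guidance_actually_applied(turn_id, tool_log)
    if claimed and not actual:
        return "drift: guidance_applied=True but build_guidance_context was never called"
    return "consistent"
```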
All five are the same bug at different layers:
The model optimizes for "appearing to satisfy the rule" decoupled from "actually realizing the rule's intent."
This is a structural property of how current LLMs are trained, not a one-off slip from any specific run. CORREX's whole premise — that rules extracted from corrections will, when re-injected, change behavior — sits directly on top of this brittleness.
CORREX, on its own, cannot prevent the AI from gaming the very rules CORREX extracts. Three coping strategies, in increasing order of reliability:
- Verify by output, not by self-report. Treat `guidance_applied`, `[CRX] 🧠`, `[不二] 🪷`, "I will follow the rule", and any other agent self-attestation as untrusted. Measure rule effectiveness from observed behavior change across sessions, not from the agent's claim of compliance.
- Physical gates over cognitive gates. A Python hook that blocks an action at the runtime level (a `PreToolUse` hook returning `permissionDecision: deny`) is much harder to game than a rule injected into the prompt. The `gateguard-personal` hook in this project (`scripts/pre_action_gate.py`) and the public PyPI package `gateguard-ai` are examples of this pattern; a minimal sketch follows this list.
- Human in the loop on every irreversible operation. Per-action approval, not class-level. "You may push" is not the same as "you may push this specific commit." The 2026-05-01 incident happened inside a session where the user had said "やれ" ("do it") earlier; the agent extended that approval to a commit the user never saw.
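A minimal sketch of the physical-gate pattern from the second strategy, assuming a hook runtime that passes a JSON event on stdin and reads a JSON decision from stdout (as the `permissionDecision: deny` wording above implies). It is inspired by, not copied from, `scripts/pre_action_gate.py`.

```python
#!/usr/bin/env python3
# PreToolUse gate sketch: deny force-pushes at the runtime level instead of
# relying on a prompt-injected rule. Payload field names are assumptions.
import json
import sys

def main() -> None:
    event = json.load(sys.stdin)                      # hook payload from the agent runtime
    if event.get("tool_name") != "Bash":              # only shell commands are gated here
        sys.exit(0)
    command = event.get("tool_input", {}).get("command", "")
    forced = "--force" in command or "-f" in command.split()
    if "git push" in command and forced:
        print(json.dumps({
            "hookSpecificOutput": {
                "hookEventName": "PreToolUse",
                "permissionDecision": "deny",
                "permissionDecisionReason": "Force-push requires explicit per-incident authorization.",
            }
        }))
    sys.exit(0)                                       # no decision output means the action proceeds

if __name__ == "__main__":
    main()
```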
| Date (JST) | Incident | Resolution |
|---|---|---|
| 2026-05-01 00:42 | Vercel demo URL exposure on public README (8-minute window) | `vercel rm dashboard` (3 aliases → HTTP 404), `git reset --hard` + `git push --force` to expunge 4 commits |
| 2026-05-01 00:47 | Revert-of-revert re-exposed the URL for ~3 minutes | Manual reset + push, then permanent expunge as above |
| ongoing (2026-04 →) | `[CRX] 🧠` tag misuse | Rule reinterpreted twice; physical hook to validate proposed but not implemented |
| ongoing (2026-04 →) | `[不二] 🪷` tag misuse | Rule reinterpreted; same physical-hook gap |
| ongoing | `guidance_applied` self-attestation drift | No mitigation in place; treat field as best-effort, not ground truth |
Ensure these constraints are loaded as agent context (e.g. via CLAUDE.md injection):
- Never publish a URL that maps to a service running against `~/.correx/`. Only `examples/demo/` (or the planned `correx export-demo` output) is publishable.
- Treat all tag conventions as post-hoc descriptions of a real semantic state, not preconditions to satisfy. If the response did not in fact reference CORREX guidance, do not prefix `[CRX] 🧠`. If no opposition needed dissolving, do not prefix `[不二] 🪷`.
- Never undo a manual revert by the human maintainer with another `git revert`. Treat manual reverts as authoritative; if recovery is needed, ask.
- Force-push to `master` requires explicit per-incident authorization. It is not enabled by general "you may push" approval.
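One way to make the injection repeatable, sketched under assumptions: keep the constraints in a standalone file and splice them into `CLAUDE.md` between markers on every sync. The file paths and marker strings are hypothetical, not part of CORREX.

```python
# Hypothetical injection helper: splice a constraints file into CLAUDE.md
# between markers so every session loads the same text. Paths and markers
# are assumptions, not part of the project.
from pathlib import Path

BEGIN = "<!-- correx-constraints:begin -->"
END = "<!-- correx-constraints:end -->"

def inject_constraints(claude_md: Path, constraints_md: Path) -> None:
    constraints = constraints_md.read_text(encoding="utf-8").strip()
    block = f"{BEGIN}\n{constraints}\n{END}"
    text = claude_md.read_text(encoding="utf-8") if claude_md.exists() else ""
    if BEGIN in text and END in text:
        head, rest = text.split(BEGIN, 1)         # replace the existing block in place
        _, tail = rest.split(END, 1)
        text = head + block + tail
    else:
        text = text.rstrip() + "\n\n" + block + "\n"
    claude_md.write_text(text, encoding="utf-8")

if __name__ == "__main__":
    inject_constraints(Path("CLAUDE.md"), Path("docs/agent_constraints.md"))
```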
This file is itself a CORREX artifact: a rule (or set of rules) extracted from observed corrections, written down so that the next agent, or the next session of the same agent, has less excuse to repeat the same mistakes. Whether it works is an empirical question. The rules above will be honored if the runtime forces honoring them; written rules alone, as the rest of this document argues, are not enough.