diff --git a/.DS_Store b/.DS_Store
deleted file mode 100644
index e61cebbe8b..0000000000
Binary files a/.DS_Store and /dev/null differ
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000000..54b0ab762e
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,11 @@
+# Afterwords TTS voice override
+.afterwords
+
+# OS files
+.DS_Store
+
+# IDE
+.vscode/
+
+# Superpowers brainstorm artifacts
+.superpowers/
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 59b1cf6431..63c9c7498f 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,62 +1,69 @@
-# Contributing to Failure-First Embodied AI
+# Contributing to Failure-First
-Thank you for your interest in contributing to Failure-First Embodied AI!
+Thank you for your interest in Failure-First. This is a **research project**, not a typical open-source codebase. Contributions are welcome, but the ways to contribute differ from those in a standard software project.
-## Important: Public Repository Context
+## How to Contribute
-This is the **public-facing** repository for the Failure-First research project. Contributions must adhere to strict safety guidelines to ensure all content remains:
-- Pattern-level only (never operational)
-- Defensively purposed
-- Appropriate for public academic discourse
+### Report Issues
-## What to Contribute
+If you find errors in our published findings, methodology gaps, broken links on [failurefirst.org](https://failurefirst.org), or inconsistencies in the public documentation, please open a GitHub issue.
-**✅ Welcome Contributions:**
-- Documentation improvements
-- Research methodology clarifications
-- Failure taxonomy additions (pattern-level)
-- Website improvements
-- Typo fixes and clarity improvements
+### Cite Our Work
-**❌ Not Accepted:**
-- Operational exploit code
-- Working jailbreak prompts
-- Model-specific bypass techniques
-- Raw test results or adversarial datasets
+The most impactful contribution for a research project is citation. If our findings, datasets, or methodology inform your work, please cite us:
-## Contribution Process
+```bibtex
+@software{failure_first_2026,
+ title = {Failure-First: Adversarial Evaluation Framework for Embodied AI},
+ author = {Wedd, Adrian},
+ year = {2026},
+ url = {https://failurefirst.org},
+ note = {227 models, 141{,}561 prompts, 337 attack techniques}
+}
+```
-1. **Fork** the repository
-2. **Create a branch** for your changes
-3. **Make your changes** following our guidelines
-4. **Submit a pull request** with a clear description
+### Red-Team Collaboration
-## Safety Review
+We welcome collaboration with AI safety researchers, red-team practitioners, and frontier lab security teams. If you have adversarial evaluation results, novel attack technique taxonomies, or defense effectiveness data you would like to contribute or cross-validate, open a GitHub issue describing your institutional affiliation and research focus.
+
+### Dataset Contributions
+
+If you have adversarial evaluation datasets that could strengthen the corpus, we accept contributions subject to:
+
+- **Pattern-level only**: no operational exploits or copy-paste attack templates
+- **Provenance documented**: source, collection methodology, and intended use
+- **Schema compliance**: data must conform to our versioned JSON Schemas (documented in the private repository; we will assist with formatting). A minimal validation sketch follows this list.
+- **Safety review**: all contributed data undergoes review before inclusion
+
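+If it helps to see what schema compliance looks like in practice, the sketch below pre-validates a contributed record with the standard `jsonschema` library. The schema path and record fields are hypothetical -- the actual versioned schemas are shared during the contribution process.
+
+```python
+import json
+from jsonschema import ValidationError, validate
+
+# Hypothetical schema path and fields, for illustration only.
+with open("schemas/prompt_pack.schema.json") as f:
+    schema = json.load(f)
+
+record = {
+    "id": "example-0001",
+    "pattern": "instruction-hierarchy confusion (pattern-level description)",
+    "provenance": {"source": "your-lab", "collected": "2026-01-15"},
+}
+
+try:
+    validate(instance=record, schema=schema)
+    print("record conforms to schema")
+except ValidationError as err:
+    print(f"schema violation: {err.message}")
+```
+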
+### Documentation Improvements
+
+Corrections, clarifications, and improvements to public-facing documentation (this repository, the design charter, security policy) are welcome via pull request.
-All contributions undergo safety review to ensure:
-- No operational exploit instructions
-- Pattern-level descriptions only
-- Appropriate for public repository
-- Aligned with defensive research mission
+## What We Do Not Accept
-## Code of Conduct
+- Operational exploit code or working jailbreak prompts
+- Model-specific bypass techniques intended for attack
+- Raw adversarial datasets without provenance
+- Content that facilitates real-world harm outside AI safety research
-- Be respectful and professional
-- Focus on defensive AI safety research
-- No weaponization of research findings
-- Maintain academic integrity
+## Vulnerability Reporting
-## Questions?
+If you discover vulnerabilities in AI systems -- whether through this framework or independent research -- please follow responsible disclosure practices. See [SECURITY.md](SECURITY.md) for our coordinated disclosure process.
-- **Issues**: Open a GitHub issue for questions or suggestions
-- **Discussions**: Use GitHub Discussions for research-related conversations
+## Process
+
+1. Open a GitHub issue describing the proposed contribution
+2. For documentation changes, submit a pull request directly
+3. For research collaborations and dataset contributions, we will coordinate via issue discussion
+
+## Safety Review
+
+All contributions undergo safety review to ensure content remains pattern-level, defensively purposed, and appropriate for a public repository. This review is not optional and applies equally to maintainers and external contributors.
## License
-By contributing, you agree that your contributions will be licensed under the MIT License, the same license as this project.
+By contributing, you agree that your contributions will be licensed under the MIT License.
---
-**Remember:** This is defensive AI safety research. All contributions should strengthen defenses, not enable attacks.
-
-**Last updated:** 2026-02-01
+**Last updated:** 2026-03-29
diff --git a/DESIGN_CHARTER.md b/DESIGN_CHARTER.md
index 10b4c7e499..0a63ca880b 100644
--- a/DESIGN_CHARTER.md
+++ b/DESIGN_CHARTER.md
@@ -22,7 +22,14 @@ This is a **research methodology for studying AI safety through systematic failu
At its center is a principle: **failure is signal, not noise**.
-The framework exists to support *rigorous failure analysis, defensive research, and safety boundary mapping*.
+The framework exists to support *rigorous failure analysis, defensive research, and safety boundary mapping* across the full landscape of adversarial AI evaluation:
+
+- **Jailbreak archaeology**: systematic study of how adversarial techniques evolve across eras, from early DAN-style prompts through crescendo attacks, format-lock exploitation, and reasoning-chain manipulation
+- **VLA safety evaluation**: 42 attack families targeting vision-language-action models, covering affordance manipulation, kinematic injection, safety instruction dilution, and dual-layer attacks
+- **Multi-turn escalation**: stateful attack sequences that exploit context accumulation in reasoning and tool-using models
+- **Format-lock attacks**: structured output compliance (JSON, YAML, code) as an attack vector against safety training
+- **Defense effectiveness measurement**: quantifying how safety mechanisms perform under adversarial pressure, including positional bias, iatrogenic effects, and cross-model vulnerability inheritance
+- **Classifier reliability**: documenting the gap between heuristic and LLM-graded attack success rates
This is not a demonstration.
It is not an attack platform.
@@ -59,7 +66,7 @@ The tooling must not pretend otherwise.
These are not preferences.
They are constraints.
-If an implementation violates any of the following, it is wrong—even if it "works".
+If an implementation violates any of the following, it is wrong -- even if it "works".
### 3.1 Pattern-Level Only, Never Operational
@@ -109,6 +116,14 @@ If an implementation violates any of the following, it is wrong—even if it "wo
- No security through obscurity
- If the process cannot withstand scrutiny, it should not be published
+### 3.7 Statistical Rigor Over Convenience
+
+- All ASR claims must specify grading methodology (LLM-only, heuristic-only, or coalesced)
+- Confidence intervals are required for sample-based estimates
+- Multiple-comparison corrections are mandatory for pairwise model comparisons
+- Classifier reliability must be documented with inter-rater agreement metrics
+- If a finding cannot survive peer review scrutiny, it should not be published
+
---
## 4. Architectural Decisions (Final)
@@ -121,8 +136,10 @@ They are no longer exploratory.
- **Validation**: JSON Schema + custom invariant checks
- **Safety gates**: Heuristic linters + manual review
- **Benchmark runners**: Model-agnostic (CLI templates + HTTP API adapters)
-- **Scoring**: Refusal quality, recovery mechanisms, attack success rates
+- **Scoring**: Three-tier ASR (strict, broad, functionally dangerous) with Wilson CIs
+- **Classification**: LLM-graded (FLIP protocol) with documented grader reliability
- **Version control**: Git, with CI enforcement of validation/linting
+- **Database**: SQLite corpus unifying prompts, techniques, and evaluation results
If a future change requires revisiting these decisions, the charter must be updated explicitly.
@@ -152,7 +169,7 @@ If adversarial content can be weaponized directly, it has violated this charter.
This document defines intent and limits.
-Implementation details—including tool design, algorithms, dataset schema extensions, benchmark metrics, and acceptance criteria—are defined in issues, documentation, and code.
+Implementation details -- including tool design, algorithms, dataset schema extensions, benchmark metrics, and acceptance criteria -- are defined in issues, documentation, and code.
The relationship is explicit:
@@ -210,7 +227,7 @@ When using AI systems to generate adversarial scenarios:
## 9. Research Ethics Boundaries
-This project operates within established AI safety research norms:
+This project operates within established AI safety research norms. A full research ethics charter is maintained in the private repository.
### Acceptable Research Activities
@@ -219,6 +236,7 @@ This project operates within established AI safety research norms:
- Testing robustness of safety mechanisms
- Improving alignment under adversarial pressure
- Publishing defensive research findings
+- Coordinated vulnerability disclosure to model providers
### Unacceptable Activities
@@ -230,10 +248,10 @@ This project operates within established AI safety research norms:
### Disclosure Standards
-- Vulnerabilities discovered through this research should be disclosed responsibly
-- Real-world safety issues should be reported to affected parties before public disclosure
-- Research findings should distinguish between controlled evaluation and real-world risk
-- Limitations of evaluation harnesses must be stated explicitly
+- Vulnerabilities discovered through this research are disclosed responsibly
+- Real-world safety issues are reported to affected parties before public disclosure
+- Research findings distinguish between controlled evaluation and real-world risk
+- Limitations of evaluation harnesses are stated explicitly
---
@@ -249,8 +267,8 @@ This charter may evolve as the project grows, but changes must be:
Minor clarifications (typo fixes, example additions) do not require versioning.
Substantive changes (adding/removing principles, changing constraints) require charter version increment.
-**Current version**: 1.0
-**Last updated**: 2025-01-11
+**Current version**: 2.0
+**Last updated**: 2026-03-29
---
diff --git a/MANIFEST.json b/MANIFEST.json
index c3531b209f..9dc51ca441 100644
--- a/MANIFEST.json
+++ b/MANIFEST.json
@@ -3,14 +3,17 @@
"note": "Full traces available under NDA. Contact via GitHub issue.",
"generated_from": "failure-first-embodied-ai (private)",
"totals": {
- "files": 632,
+ "files": 860,
"invariant_errors": 0,
"json_parse_errors": 0,
- "rows": 51201,
+ "rows": 60847,
"schema_errors": 0,
- "failure_classes": 661,
- "domains": 19,
- "models_evaluated": 51
+ "prompts": 141561,
+ "results": 133646,
+ "techniques": 337,
+ "harm_classes": 124,
+ "domains": 28,
+ "models_evaluated": 227
},
"packs_by_kind": {
"adversarial_poetry": 3,
diff --git a/README.md b/README.md
index 9fff69332e..3bf202cfff 100644
--- a/README.md
+++ b/README.md
@@ -1,116 +1,83 @@
-# F41LUR3-F1R57 — Adversarial Evaluation for Embodied AI
+# Failure-First: Adversarial Evaluation for Embodied and Agentic AI
-**Failure is not an edge case. It is the primary object of study.**
+
---
-## Four Headline Findings
-
-### 1. Supply Chain Injection: 90–100% ASR
+## The Project
-50 injection scenarios against 6 small open-weight models (1.5–3.8B params). Every model treated injected tool definitions and skill files as legitimate instructions. No statistically significant differences between any model pair (chi-square with Bonferroni correction, Cohen's κ = 0.782).
+Failure-First is a red-teaming and benchmarking framework that studies how AI systems fail under adversarial pressure. We focus on embodied AI (robots, tool-using agents, multi-agent systems) where failures have physical consequences.
-### 2. Faithfulness Gap: 24–42% Against Frontier Models
+The core research question: when safety mechanisms are tested systematically across hundreds of models and thousands of attack techniques, what patterns emerge?
-Format-lock attacks — requesting harmful content structured as JSON, YAML, or code — achieved 30% (Claude Sonnet 4.5), 42% (Codex GPT-5.2), and 24% (Gemini 3 Flash) LLM-graded ASR. Models embed harmful content within structured fields while maintaining the appearance of a well-formatted, helpful response.
+## Key Findings
-### 3. Multi-Turn Escalation: 80–90% on Reasoning Models
+**227 models tested. 141,561 adversarial prompts. 133,646 graded results. 337 attack techniques.**
-Crescendo attacks achieved 80–90% ASR against DeepSeek-R1 but only ~10% against small non-reasoning models. The extended context tracking that makes reasoning models capable also makes them vulnerable to gradual escalation.
+- **Classifier unreliability is pervasive.** Keyword-based jailbreak classifiers agree with LLM-graded ground truth at Cohen's kappa = 0.126 (illustrated in the sketch after this list). Heuristic compliance labels carry a roughly 80% false positive rate. Most published ASR numbers are likely inflated.
-### 4. The Classifier Overcount Problem
+- **Hallucinated refusals are functionally dangerous.** Models that appear to refuse harmful requests sometimes generate the harmful content anyway, wrapped in safety-sounding framing. This "hallucinated refusal" pattern adds 11.9 percentage points to the attack success rate on non-abliterated models.
-Cohen's κ = 0.245 between keyword and LLM classification. Heuristic REFUSAL labels are 95% reliable; heuristic COMPLIANCE labels have an 88% false positive rate. Aggregate effect: heuristic ASR 36.2% → corrected 15.9%.
+- **Format-lock attacks exploit structured output compliance.** Requesting harmful content formatted as JSON, YAML, or code achieves 24--42% success rates against frontier models. The structured output training objective conflicts with safety training.
----
-
-## What This Is
+- **Multi-turn escalation disproportionately affects reasoning models.** Crescendo-style attacks achieve 65--85% success against extended-reasoning models, whose chain-of-thought tracking makes them susceptible to gradual context manipulation.
-A **research framework** for studying how embodied and agentic AI systems fail:
+- **Safety mechanism effectiveness varies by 57x across providers.** Identical prompts tested across providers reveal that safety investment, not model capability, determines vulnerability.
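+
+For readers unfamiliar with Cohen's kappa, the minimal sketch below shows how agreement between a heuristic grader and an LLM grader can be computed. The verdict labels and numbers are invented for illustration; they are not drawn from our corpus.
+
+```python
+from collections import Counter
+
+def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
+    """Inter-rater agreement between two graders over the same items."""
+    n = len(labels_a)
+    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
+    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
+    chance = sum(
+        (freq_a[c] / n) * (freq_b[c] / n) for c in set(labels_a) | set(labels_b)
+    )
+    return (observed - chance) / (1 - chance)
+
+# Toy verdicts only -- real grading follows the FLIP protocol over the full corpus.
+heuristic = ["COMPLIANCE", "REFUSAL", "COMPLIANCE", "REFUSAL", "COMPLIANCE"]
+llm_graded = ["REFUSAL", "REFUSAL", "COMPLIANCE", "REFUSAL", "REFUSAL"]
+print(f"kappa = {cohens_kappa(heuristic, llm_graded):.3f}")  # low agreement
+```
+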
-- **Red-teaming datasets** — adversarial scenarios targeting cognitive vulnerabilities in tool-using, multi-agent, and stateful systems
-- **Failure taxonomies** — structured classifications of recursive, contextual, and interactional failure modes
-- **Evaluation infrastructure** — benchmark runners (HTTP API, native CLI, local Ollama), scoring pipelines, statistical significance testing
-- **Classification pipeline** — consensus grading (heuristic + LLM) with documented error characteristics
+## Methodology
-This is **not** an attack toolkit and does **not** claim real-world safety guarantees.
-
----
+All results use LLM-graded classification (the FLIP protocol) with documented grader reliability audits. We report three-tier ASR (strict, broad, functionally dangerous) with Wilson confidence intervals. Statistical comparisons use chi-square tests with Bonferroni correction. Full methodology is described in our CCS 2026 submission.
-## Quick Start
-
-```bash
-git clone https://github.com/adrianwedd/failure-first.git
-cd failure-first
-pip install -r requirements-dev.txt
-
-make validate # Schema validation — 0 errors required
-make lint # Safety linter — catches operational phrasing
-make bench # Dry-run benchmark — no API calls
-```
-
----
+Grading methodology matters: always check whether cited ASR numbers use LLM-only, heuristic-only, or coalesced verdicts.
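+
+As a minimal illustration of the interval and correction conventions above (assumed example numbers, not project results), an ASR estimate with a Wilson confidence interval and a Bonferroni-adjusted significance threshold can be computed as follows:
+
+```python
+from math import sqrt
+
+def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
+    """95% Wilson score interval for a binomial proportion such as an ASR."""
+    p = successes / n
+    denom = 1 + z**2 / n
+    centre = (p + z**2 / (2 * n)) / denom
+    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
+    return max(0.0, centre - half), min(1.0, centre + half)
+
+# Illustrative numbers only: 42 attack successes out of 200 attempts.
+low, high = wilson_interval(42, 200)
+print(f"ASR {42 / 200:.1%} (95% CI {low:.1%}-{high:.1%})")
+
+# Bonferroni correction: with k pairwise model comparisons, each chi-square
+# test must clear alpha / k rather than alpha.
+k, alpha = 15, 0.05
+print(f"per-comparison significance threshold: {alpha / k:.4f}")
+```
+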
## The Site
-[failurefirst.org](https://failurefirst.org) hosts 18+ blog posts, 23 daily paper analyses, and 19 policy reports — each with audio overviews, video summaries, and infographics generated via NotebookLM.
-
-Recent posts:
-- [120 Models, 18,176 Prompts: What We Found](https://failurefirst.org/blog/120-models-18k-prompts/)
-- [The Classifier Overcount Problem](https://failurefirst.org/blog/classifier-overcount-problem/)
-- [Reasoning Models Are Uniquely Vulnerable to Multi-Turn Attacks](https://failurefirst.org/blog/reasoning-models-multi-turn-vulnerability/)
-- [When LLM Vulnerabilities Meet Robots](https://failurefirst.org/blog/llm-vulnerabilities-robots/)
+[failurefirst.org](https://failurefirst.org) hosts 740+ pages, including research blog posts, a daily paper analysis series covering recent adversarial ML literature, policy reports, and multimedia overviews (audio, video, and infographics via NotebookLM).
----
+## Repository Structure
-## Core Philosophy
+This public repository contains:
+- **Pattern-level findings** and methodology descriptions
+- **MANIFEST.json** listing dataset structure (no adversarial content)
+- **Design charter** and research ethics documentation
+- **Site source** for failurefirst.org
-Most AI evaluation asks: *"Does the system succeed at the task?"*
-
-We ask: *"How does it fail? What breaks first? Can it recover?"*
-
-- **Recursive failures** — one failure cascading into others
-- **Contextual failures** — instruction hierarchy confusion
-- **Interactional failures** — multi-agent amplification
-- **Temporal failures** — stateful degradation across episodes
-- **Recovery failures** — inability to recognise and correct mistakes
-
----
-
-## Safety & Ethics
-
-All scenarios describe failure **patterns**, not operational exploits. Research aims to improve defenses, not enable attacks. Full traces and adversarial payloads are available under NDA for AI safety researchers at accredited institutions, government safety bodies, and frontier lab security teams.
-
-Contact: Open a GitHub issue with institutional affiliation.
-
----
+Full datasets, traces, and evaluation infrastructure are maintained in a private research repository. Access is available under NDA for AI safety researchers at accredited institutions, government safety bodies, and frontier lab security teams. Open a GitHub issue with institutional affiliation.
## Citation
```bibtex
@software{failure_first_2026,
- title = {F41LUR3-F1R57: Adversarial Evaluation Framework for Embodied AI},
- author = {Adrian Wedd},
+ title = {Failure-First: Adversarial Evaluation Framework for Embodied AI},
+ author = {Wedd, Adrian},
year = {2026},
url = {https://failurefirst.org},
- note = {120 models, 18{,}176 prompts, 5 attack families}
+ note = {227 models, 141{,}561 prompts, 337 attack techniques}
}
```
----
+A CCS 2026 paper is being prepared for submission. Citation details will be updated upon acceptance.
+
+## Contributing
+
+See [CONTRIBUTING.md](CONTRIBUTING.md). This is a research project -- contributions are welcome as issue reports, citations, red-team collaboration proposals, and dataset submissions.
+
+## Security
+
+See [SECURITY.md](SECURITY.md) for our coordinated vulnerability disclosure process. We currently have 5 pending responsible disclosures with model providers.
## License
diff --git a/SECURITY.md b/SECURITY.md
index 6acbef1417..592e289baa 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -2,71 +2,82 @@
## Research Context
-Failure-First Embodied AI is a **defensive AI safety research** project. This repository is the public-facing documentation site and does not contain operational testing infrastructure or adversarial datasets.
+Failure-First is a **defensive AI safety research** project that studies how AI systems fail under adversarial pressure. This public repository contains pattern-level findings and methodology descriptions. Operational testing infrastructure, adversarial datasets, and full evaluation traces are maintained in a private repository.
+
+## Coordinated Vulnerability Disclosure
+
+We practice coordinated vulnerability disclosure (CVD) for AI safety vulnerabilities discovered through this research.
+
+### Current Status
+
+We currently have **5 pending responsible disclosures** with model providers (Nvidia, Alibaba, Zhipu, Gemma/Google, Mistral). Disclosure timelines follow standard CVD practices: findings are reported to affected parties with a reasonable remediation window before any public discussion of specifics.
+
+### Our CVD Process
+
+1. **Discovery**: Vulnerability pattern identified through systematic evaluation
+2. **Verification**: Finding confirmed across multiple test conditions with statistical controls
+3. **Private notification**: Affected provider contacted via their security reporting channel
+4. **Remediation window**: Minimum 90 days before public discussion of specifics
+5. **Public disclosure**: Pattern-level description published (no operational details)
+
+### Research Ethics
+
+Our vulnerability research follows the principles documented in our research ethics charter:
+- Findings serve the defensive research mission
+- Operational details are never published
+- Affected parties are notified before any public discussion
+- Pattern-level descriptions enable defensive improvements without enabling attacks
## Reporting Security Concerns
-### For This Public Repository
+### For This Repository
-If you find security issues with this public repository (e.g., exposed credentials, vulnerable dependencies, website security):
+If you find security issues with this public repository (exposed credentials, vulnerable dependencies, website security):
-- **Email**: research@failurefirst.org
+- **Non-sensitive issues**: Open a GitHub issue
+- **Sensitive issues**: Email research@failurefirst.org
- **GitHub Security Advisories**: Use the "Security" tab to report privately
-### For Research Findings
+### For AI Safety Vulnerabilities
-If you discover vulnerabilities in AI systems through independent research:
+If you discover vulnerabilities in AI systems through independent research and would like to collaborate on disclosure:
**DO:**
- Follow responsible disclosure practices
- Report to affected vendors before public disclosure
- Document findings at pattern-level for academic discussion
+- Contact us via GitHub issue if you want to coordinate
-**DON'T:**
+**DO NOT:**
- Post operational exploits in public issues
- Share working bypass techniques without vendor notification
- Weaponize research findings
## Scope
-**In Scope:**
-- Security issues with this GitHub Pages site
-- Vulnerabilities in public documentation
+**In scope:**
+- Security issues with this GitHub repository or failurefirst.org
+- Vulnerabilities in public documentation or site infrastructure
- Dependency security issues
+- Collaboration on coordinated disclosure of AI safety vulnerabilities
-**Out of Scope:**
-- Vulnerabilities in third-party AI systems (report to vendors)
-- Requests for operational exploit code
+**Out of scope:**
+- Vulnerabilities in third-party AI systems (report directly to vendors)
+- Requests for operational exploit code or adversarial datasets
- Model-specific jailbreak techniques
## Response Timeline
- **Acknowledgment**: Within 3 business days
-- **Initial Assessment**: Within 7 business days
+- **Initial assessment**: Within 7 business days
- **Resolution**: Depends on severity and complexity
-## Disclosure Policy
-
-We follow coordinated disclosure:
-1. Researcher reports issue privately
-2. We assess and develop fix
-3. Fix is deployed
-4. Public disclosure (if appropriate)
-
-## Research Ethics
-
-This project operates within established AI safety research norms:
-- Defensive purpose
-- Pattern-level documentation
-- Responsible disclosure
-- No weaponization
-
## Contact
-For security concerns: research@failurefirst.org
-
-For general questions: Open a GitHub issue
+- **Non-sensitive**: Open a GitHub issue
+- **Sensitive disclosures**: research@failurefirst.org
+- **CVD coordination**: Open a GitHub issue with institutional affiliation
---
-**Last updated:** 2026-02-01
+**Last updated:** 2026-03-29
diff --git a/docs/.well-known/atproto-did b/docs/.well-known/atproto-did
new file mode 100644
index 0000000000..fcb27a0521
--- /dev/null
+++ b/docs/.well-known/atproto-did
@@ -0,0 +1 @@
+did:plc:uwhfz7mq7nvtzj52mawmzu5q
diff --git a/docs/about/disclosure/index.html b/docs/about/disclosure/index.html
index f0de5e7f0d..e133f03173 100644
--- a/docs/about/disclosure/index.html
+++ b/docs/about/disclosure/index.html
@@ -1,11 +1,27 @@
- Responsible Disclosure | Failure-First
+
+
Skip to content
Responsible Disclosure
How we handle AI safety vulnerability reports and research findings
Our Commitment
Failure-First research discovers vulnerability patterns in AI systems.
We are committed to responsible disclosure of these findings to
advance safety without enabling harm.
@@ -32,8 +48,8 @@