Turn any repository into an agentic QA lab. Works with Claude · Codex · Gemini · Copilot. Bun-first. Enterprise-ready.
Not a test runner. An operating system for agentic QA.
A standardized framework that turns coding agents into QA engineers guided by risk maps, invariants, scenarios, probes, oracles, and replay. It is not a prompt. It is the reusable framework that makes the prompt operational, reproducible, versionable, and adaptable to every project.
- Why this exists
- What makes it different
- Quick start (junior-friendly)
- The mental model in 7 words
- Multi-agent
- Architecture at a glance
- Roadmap
- Status
- Documentation
- Contributing
- Security
- License
- Maintainers
Coding agents (Claude Code, Codex CLI, Gemini CLI, GitHub Copilot CLI) are great at writing code. They are poor QA engineers by default: they will gladly add a feature without imagining how a malicious user might exploit it, how one tenant's data might leak to another, or how the LLM tool-calling layer can be tricked into refunding a payment without confirmation.
agentic-qa-kit provides the operating system the agent needs to behave like a senior QA engineer on your project:
- An explicit risk map with severity, invariants, probes, and oracles
- Pre-built scenario packs for APIs, web UIs, LLM agents, security, migrations
- Adapters that install the right skills for Claude / Codex / Gemini / Copilot
- A runner that executes profiles deterministically (smoke, exploratory, security, release-gate)
- Findings with three-level reproducibility, bug-level deterministic replay, and suggested regression tests
- Optional admin panel (React) and server (Bun/Node) for multi-team self-hosted deployments
- 🧠 Multi-agent native — Claude · Codex · Gemini · Copilot first-class adapters, not "Claude with the others bolted on". Adapter capability negotiation, so each agent uses its best primitives (subagents, skills, slash commands, hooks).
- 🎯 Deterministic replay where it matters — three-level reproducibility (bug / scenario / agent). The kit never lies about LLM determinism. Bug-level deterministic replay is required for any release-gate verified finding.
- 🔒 Sandbox by design — container-per-scenario isolation default for security and release-gate profiles. Egress allowlists. Tool-call budgets. Resource limits. Cost kill-switches.
- 💰 Cost governance built-in — per-org / project / profile / scenario budgets in USD and tokens, hard kill-switches, attribution to risk areas. No more "an agent loop burned $400 overnight".
- 🏠 BYOK + on-prem LLM — bring your own Anthropic/OpenAI keys, or use vLLM / Bedrock private / Azure OpenAI VNet / llama.cpp. Air-gap deploy supported.
- 📋 OWASP Top 10 Agentic (2026) built-in security pack. Plus STRIDE / FMEA risk discovery (v0.6).
- 🧾 Hash-chained audit log + WORM export. SOC2 / ISO 27001 / GDPR / HIPAA alignment on the roadmap (v0.3 self-hosted, v1.0 GA).
- 🔁 Process-first governance: every PR follows a documented loop with Copilot Code Review. Lessons are captured in docs/LESSON.md for permanent improvement.
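To make the cost kill-switch idea in the list above concrete, here is a minimal sketch of a budget guard that tracks spend in USD and tokens, attributes it to risk areas, and trips once a limit is exceeded. It is an illustration only: `Budget`, `BudgetGuard`, `record`, `attribution`, and the limits used are hypothetical names and values, not the kit's actual API.

```typescript
// Hypothetical budget guard: every name and limit here is illustrative.
interface Budget {
  maxUsd: number;
  maxTokens: number;
}

type BudgetStatus = "ok" | "killed";

class BudgetGuard {
  private budget: Budget;
  private usd = 0;
  private tokens = 0;
  private killed = false;
  private usdByRiskArea: Record<string, number> = {};

  constructor(budget: Budget) {
    this.budget = budget;
  }

  // Record the cost of one agent call and report whether the run may continue.
  record(riskArea: string, usd: number, tokens: number): BudgetStatus {
    if (this.killed) return "killed"; // once tripped, the switch stays tripped
    this.usd += usd;
    this.tokens += tokens;
    this.usdByRiskArea[riskArea] = (this.usdByRiskArea[riskArea] ?? 0) + usd;
    if (this.usd > this.budget.maxUsd || this.tokens > this.budget.maxTokens) {
      this.killed = true;
    }
    return this.killed ? "killed" : "ok";
  }

  // Spend in USD attributed to each risk area so far.
  attribution(): Record<string, number> {
    return { ...this.usdByRiskArea };
  }
}

// Usage: a 1 USD / 10,000 token budget for a single scenario.
const guard = new BudgetGuard({ maxUsd: 1, maxTokens: 10_000 });
const firstCall = guard.record("auth", 0.5, 1_000);
const secondCall = guard.record("billing", 0.75, 1_000);
console.log(firstCall, secondCall, guard.attribution());
```

In this sketch the second call pushes total spend past the USD limit, so the guard reports `"killed"` for it and for every later call. A real implementation would presumably layer such guards per org, project, profile, and scenario.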
Status note: the kit reached v1.0 GA (24-task roadmap complete) and is now at v1.3. The 18 workspace packages (@aqa/schemas, @aqa/kit, @aqa/runner, @aqa/reporter, @aqa/server, @aqa/admin, @aqa/compliance, @aqa/methodology, …) ship from this monorepo. Detailed walk-through: docs/getting-started.md.
Preview the v0.1.0 quick start
# macOS / Linux
curl -fsSL https://bun.sh/install | bash
# Windows (PowerShell)
powershell -c "irm bun.sh/install.ps1 | iex"
cd /path/to/your/project
bun add -d agentic-qa-kit
If you don't have a project yet, clone examples/bun-api from this repo (available in v0.1.0).
bunx aqa init
Detects your stack and creates .aqa/ with testing.md, risk-map.yaml, profiles.yaml, and scenarios for the packs your project matches.
bunx aqa install-agent-files --targets claude,codex,gemini,copilot
This generates CLAUDE.md + .claude/skills/aqa-*, AGENTS.md + .agents/skills/, GEMINI.md + .gemini/skills/, .github/copilot-instructions.md + .github/skills/.
bunx aqa run --profile smoke
A 10-minute, non-destructive sweep. When it finishes:
bunx aqa report
You'll see findings like:
AQA-2026-0001 [P1] Cross-tenant data leak (verified, 3/3 deterministic replay)
AQA-2026-0002 [P3] Missing rate limit on /api/search
bunx aqa replay AQA-2026-0001
Re-runs the deterministic bug reproduction (curl / Playwright / SQL) and tells you if it still reproduces. If it doesn't, the bug is fixed, which closes the loop.
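The replay step can be pictured as running the same reproduction several times and classifying the outcome, which is roughly what a label like "3/3 deterministic replay" suggests. The sketch below is an assumption-laden illustration: `replayFinding`, `ReplayResult`, and the `"flaky"` verdict are hypothetical, and the reproduction is reduced to a function returning a boolean, whereas the real one runs curl, Playwright, or SQL.

```typescript
// Hypothetical replay classifier; names and verdicts are illustrative.
type ReplayVerdict = "still-reproduces" | "fixed" | "flaky";

interface ReplayResult {
  attempts: number;
  reproduced: number;
  verdict: ReplayVerdict;
}

// Run the reproduction `attempts` times and summarise the outcome.
function replayFinding(reproduce: () => boolean, attempts = 3): ReplayResult {
  let reproduced = 0;
  for (let i = 0; i < attempts; i++) {
    if (reproduce()) reproduced++;
  }
  let verdict: ReplayVerdict = "flaky"; // some, but not all, attempts reproduced
  if (reproduced === attempts) verdict = "still-reproduces";
  else if (reproduced === 0) verdict = "fixed";
  return { attempts, reproduced, verdict };
}

// Usage: a bug that always reproduces, one that never does, and an unstable one.
const openBug = replayFinding(() => true);
const fixedBug = replayFinding(() => false);
let replayCalls = 0;
const unstableBug = replayFinding(() => replayCalls++ % 2 === 0);
console.log(openBug.verdict, fixedBug.verdict, unstableBug.verdict);
```

Treating a partial reproduction as a separate verdict is a design choice of this sketch: it avoids declaring a bug fixed when the reproduction is merely unstable.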
Risk → Invariant → Scenario → Probe → Oracle → Finding → Replay
Every concept in AQA is one of these seven things or a tool that operates on them. See docs/ecosystem-explained.md for the deep introduction.
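One way to read the seven-word chain is as a set of linked records, where each concept points back to the one before it. The sketch below is illustrative only: the field names, the `Severity` values, and the `traceFinding` helper are assumptions made for this example and are not taken from the kit's schemas.

```typescript
// Hypothetical shapes for the seven concepts; the real schemas may differ.
type Severity = "P1" | "P2" | "P3";

interface Risk { id: string; title: string; severity: Severity }
interface Invariant { id: string; riskId: string; statement: string }
interface Scenario { id: string; invariantId: string; description: string }
interface Probe { id: string; scenarioId: string; action: string }
interface Oracle { id: string; probeId: string; expectation: string }
interface Finding { id: string; oracleId: string; summary: string }
interface Replay { findingId: string; command: string }

interface Model {
  risks: Risk[];
  invariants: Invariant[];
  scenarios: Scenario[];
  probes: Probe[];
  oracles: Oracle[];
}

// Walk a finding back through oracle, probe, scenario, and invariant to its risk.
function traceFinding(model: Model, finding: Finding): string[] {
  const oracle = model.oracles.find((o) => o.id === finding.oracleId);
  const probe = model.probes.find((p) => p.id === oracle?.probeId);
  const scenario = model.scenarios.find((s) => s.id === probe?.scenarioId);
  const invariant = model.invariants.find((i) => i.id === scenario?.invariantId);
  const risk = model.risks.find((r) => r.id === invariant?.riskId);
  if (!oracle || !probe || !scenario || !invariant || !risk) {
    throw new Error(`Broken chain for finding ${finding.id}`);
  }
  return [risk.id, invariant.id, scenario.id, probe.id, oracle.id, finding.id];
}

// Usage with made-up data modelled on the cross-tenant example finding.
const demoModel: Model = {
  risks: [{ id: "R1", title: "Cross-tenant data leak", severity: "P1" }],
  invariants: [{ id: "I1", riskId: "R1", statement: "A tenant only reads its own rows" }],
  scenarios: [{ id: "S1", invariantId: "I1", description: "Tenant B requests tenant A's record" }],
  probes: [{ id: "P1", scenarioId: "S1", action: "GET a record owned by tenant A as tenant B" }],
  oracles: [{ id: "O1", probeId: "P1", expectation: "Response is 403 or 404" }],
};
const demoFinding: Finding = { id: "AQA-2026-0001", oracleId: "O1", summary: "Cross-tenant data leak" };
const demoReplay: Replay = { findingId: demoFinding.id, command: "bunx aqa replay AQA-2026-0001" };
console.log(traceFinding(demoModel, demoFinding).join(" -> "), demoReplay.command);
```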
| Target | Files generated | Capability highlights |
|---|---|---|
| 🟣 Claude Code | CLAUDE.md, .claude/skills/aqa-*, .claude/agents/aqa-* | Skills, subagents (isolated context), hooks, MCP |
| 🟢 Codex | AGENTS.md, .agents/skills/aqa-*, optional Codex plugin | Skills, explicit subagents, plugins, MCP |
| 🔵 Gemini CLI | GEMINI.md, .gemini/skills/aqa-*, .gemini/agents/, .gemini/commands/*.toml | Skills, subagents, slash commands, MCP |
| ⚫ GitHub Copilot CLI | .github/copilot-instructions.md, .github/skills/aqa-*, .github/agents/*.agent.md, .github/hooks/*.json | Skills (auto-detects .claude/skills), custom agents, hooks |
Capability negotiation happens at runtime: the kit asks the agent target what it supports and degrades gracefully when something is missing.
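Graceful degradation can be sketched as picking the first supported primitive from an ordered preference list. Everything named here is hypothetical: the `Capability` labels are loosely based on the table, `pickPrimitive` is an invented helper, and a static set stands in for whatever the agent target actually reports at runtime.

```typescript
// Hypothetical capability labels, loosely based on the targets table.
type Capability = "skills" | "subagents" | "slash-commands" | "hooks" | "plugins" | "mcp";

// Return the most preferred capability the target reports, or undefined if none match.
function pickPrimitive(
  reported: ReadonlySet<Capability>,
  preferences: readonly Capability[],
): Capability | undefined {
  return preferences.find((capability) => reported.has(capability));
}

// Usage: prefer isolated subagents, fall back to plain skills.
const isolationPreferences: Capability[] = ["subagents", "skills"];
const richTarget = new Set<Capability>(["skills", "subagents", "hooks", "mcp"]);
const minimalTarget = new Set<Capability>(["skills"]);

const richChoice = pickPrimitive(richTarget, isolationPreferences);
const minimalChoice = pickPrimitive(minimalTarget, isolationPreferences);
console.log(richChoice, minimalChoice);
```

Returning `undefined` rather than throwing leaves the caller free to skip an optional step when a target supports none of the preferred primitives.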
+- Local mode (single dev / CI) -----------------------------+
| bunx aqa CLI |
| |- engine + runner (sandboxed) |
| |- packs (core, api, web-ui, llm-agent, security, ...) |
| |- adapters (Claude/Codex/Gemini/Copilot) |
| `- .aqa/ (project state, runs, findings, replay) |
+------------------------------------------------------------+
+- Self-hosted (multi-team, post v0.3) ----------------------+
| Control Plane (HA) |
| |- agentic-qa-kit-server (Hono+Bun or Express+Node) |
| |- agentic-qa-kit-admin (React) |
| |- Postgres HA . Redis/NATS . S3-compat . Vault . OIDC |
| `- OTel Collector + Prometheus + Tempo + Loki |
| |
| Runners (per-team / CI shared / dev laptop) |
| - mTLS + OIDC to the control plane |
| - execute scenarios next to the code (code never leaves) |
+------------------------------------------------------------+
Full diagram: docs/architecture/reference.md (stub; expanded in v0.1.0).
| Version | Theme | Highlights |
|---|---|---|
| v0.0.1-governance | Bootstrap | Process docs, CI, Copilot review automation, admin spec |
| v0.1.x | Foundation | Schemas, CLI (init/doctor/validate), 5 base packs, 4 adapters, runner+smoke, reports, admin viewer |
| v0.2.x | Determinism & cost | 3-level replay, cost governance, container sandbox default |
| v0.3.x | Enterprise table-stakes | Postgres backend, SSO/RBAC, pack signing, on-prem LLM, Helm chart, air-gap installer |
| v0.4.x | Admin editing | Scenario Studio, AI-generation with review workflow |
| v0.5.x | Multi-team | Server + runner fleet, findings dedup, bug→fix→verify-fix loop |
| v0.6.x | Methodology rigor | STRIDE/FMEA/OWASP integration, oracle ensemble, judge calibration |
| v1.0 | GA enterprise (shipped) | SOC2/ISO controls catalog, aqa-audit-verify CLI, pen-test scope doc |
| v1.1 | Polish (shipped) | Banner, full Helm chart (runner StatefulSet, Ingress, NetworkPolicy, Postgres subchart), 3 example targets (Bun, Next.js, Laravel) |
| v1.2 | Admin SPA wired (shipped) | Tailwind 4 + TanStack Router + Query + 12 screens, audit-chain verification in-browser via Web Crypto |
| v1.3 | Quality batch (shipped) | Admin server↔UI mapping, 6 detail routes, 12 new admin tests, CLI E2E smoke gate, threat-model expansion, CHANGELOG backfill |
GA (v1.0 shipped, v1.3 current). The full 24-task roadmap is closed:
schemas, CLI (@aqa/kit), 5 baseline packs, multi-agent adapters
(Claude/Codex/Gemini/Copilot), runner with hash-chained audit, reporter
with 3-level replay, admin panel, server + runner fleet, on-prem LLM
adapters, SSO/RBAC, Postgres backend, pack signing + scanning,
container sandbox, cost governance, findings dedup + clustering,
STRIDE/FMEA/OWASP methodology layer, Helm chart + Terraform + air-gap
installer, SOC2/ISO controls catalog + aqa-audit-verify CLI.
Release notes per tag: Releases page.
Live state: docs/PROGRESS.md. Architectural
decisions: docs/adr/.
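For readers unfamiliar with the hash-chained audit log mentioned above, the general technique is that each entry stores the hash of the previous one, so editing any entry invalidates the chain from that point on. The sketch below shows the idea under stated assumptions: SHA-256, the `AuditEntry` layout, and the `appendEntry` / `verifyChain` helpers are choices made for this example, not the kit's actual format, and it uses Node's `node:crypto` module for brevity rather than Web Crypto.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry layout; the kit's real audit record may differ.
interface AuditEntry {
  index: number;
  event: string;
  prevHash: string;
  hash: string;
}

const GENESIS_HASH = "0".repeat(64); // placeholder "previous hash" for the first entry

// Hash the entry's position, its predecessor's hash, and its payload together.
function entryHash(index: number, event: string, prevHash: string): string {
  return createHash("sha256").update(`${index}\n${prevHash}\n${event}`).digest("hex");
}

// Return a new log with `event` appended and linked to the previous entry.
function appendEntry(log: AuditEntry[], event: string): AuditEntry[] {
  const prevHash = log.length > 0 ? log[log.length - 1].hash : GENESIS_HASH;
  const index = log.length;
  return [...log, { index, event, prevHash, hash: entryHash(index, event, prevHash) }];
}

// Return the index of the first invalid entry, or -1 if the chain is intact.
function verifyChain(log: AuditEntry[]): number {
  let prevHash = GENESIS_HASH;
  for (let i = 0; i < log.length; i++) {
    const entry = log[i];
    const expected = entryHash(entry.index, entry.event, entry.prevHash);
    if (entry.index !== i || entry.prevHash !== prevHash || entry.hash !== expected) {
      return i;
    }
    prevHash = entry.hash;
  }
  return -1;
}

// Usage: build a short log, then tamper with the middle entry of a copy.
let auditLog: AuditEntry[] = [];
for (const event of ["run.started", "finding.created", "run.finished"]) {
  auditLog = appendEntry(auditLog, event);
}
const tamperedLog = auditLog.map((entry) =>
  entry.index === 1 ? { ...entry, event: "finding.deleted" } : entry,
);
console.log(verifyChain(auditLog), verifyChain(tamperedLog));
```

Returning the index of the first bad entry, rather than a plain boolean, tells an auditor where the chain stops being trustworthy.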
- docs/getting-started.md: junior onboarding
- docs/PACK-AUTHORING.md: write your own pack (community guide)
- docs/ecosystem-explained.md: concepts deep-dive
- docs/RULES.md: contribution rules
- docs/adr/: architecture decisions
- docs/design/admin-panel-template.md: admin UI spec (for parallel template work)
- AGENTS.md: single source of truth for AI contributors
- docs/architecture/reference.md: full architecture (stub; expanded in v0.1.0)
- docs/security/threat-model.md: STRIDE applied to AQA (stub; expanded in v0.1.0)
- docs/methodology/agentic-qa.md: methodology paper (stub; expanded in v0.1.0)
Please read CONTRIBUTING.md, AGENTS.md, and docs/RULES.md first.
We follow a strict PR loop with Copilot Code Review on every PR (automated by .github/workflows/copilot-review.yml).
For vulnerabilities, use the private channel in SECURITY.md — do not file public issues.
Apache License 2.0. © Padosoft.
Padosoft — info@padosoft.com
