feat(verification): AirGapCrossEngine [W3.A.1] #37
Failing tests for the air-gap cross-engine quorum strategy
(ARCHITECTURE.md §1 quorum-dispatch table rows 3-5):
- Jaccard >=0.80 AND identical mitre_technique -> VETTED_AIRGAP
- Jaccard >=0.80, divergent mitre_technique -> CONTESTED
- Jaccard <0.80 -> CONTESTED
- empty-set rule: empty parsed_artifacts is treated as DISAGREEMENT
  (never a free pass for the non-empty engine)
- threshold imported from strategy.AIRGAP_JACCARD_THRESHOLD; never
  hard-coded
- compute_verdict(qwen, glm) is the pure consensus surface; the
  transport-level verify(...) raises NotImplementedError until W2.B
  wires SGLang clients (CLAUDE.md §3.10: no mocks against verdict
  internals; the agreement function is a pure data-path function and
  is unit-tested at the right granularity here)
- VerdictResult.notes carries Jaccard + mitre-divergence reason for
  ledger audit + replan_node disagreement-type routing
- engine identity is enforced (two outputs from the same family
  collapse cross-engine to self-consistency and are rejected)
Module under test does not yet exist; collection ERRORs with
ModuleNotFoundError for verdict.verification.airgap_cross_engine.
GREEN follows in the next commit.
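
To make the threshold arithmetic in the list above concrete, here is a small worked example of the Jaccard comparison on two artifact-path sets. The paths and the local THRESHOLD constant are illustrative assumptions; the code under review imports strategy.AIRGAP_JACCARD_THRESHOLD instead of defining its own value.

```python
# Illustrative stand-in only; the real value lives in strategy.AIRGAP_JACCARD_THRESHOLD.
THRESHOLD = 0.80


def jaccard(a: frozenset, b: frozenset) -> float:
    """|a ∩ b| / |a ∪ b|. Callers must reject empty inputs before calling."""
    return len(a & b) / len(a | b)


# Hypothetical artifact paths reported by the two engines.
qwen_paths = frozenset(
    {"/etc/passwd", "/var/log/auth.log", "/tmp/x.sh", "/root/.ssh/id_rsa"}
)
# Same four paths plus one extra: 4 shared / 5 total = 0.80.
glm_agree = qwen_paths | {"/home/u/.bashrc"}
# Only two paths in common: 2 shared / 6 total, roughly 0.33.
glm_split = frozenset({"/etc/passwd", "/var/log/auth.log", "/opt/a", "/opt/b"})

agree_score = jaccard(qwen_paths, glm_agree)
split_score = jaccard(qwen_paths, glm_split)

print(agree_score >= THRESHOLD)  # overlap meets the threshold
print(split_score >= THRESHOLD)  # overlap falls short, so the finding is contested
```

Note that meeting the threshold is necessary but not sufficient: the first case is only vetted if both engines also report the same mitre_technique.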
Implements the air-gap cross-engine quorum strategy
(ARCHITECTURE.md §1 quorum-dispatch table rows 3-5; CLAUDE.md §8):
- compute_verdict(qwen, glm) is the pure consensus surface.
- Jaccard(artifact_paths) >= AIRGAP_JACCARD_THRESHOLD (0.80, imported
from strategy.py — never hard-coded) AND identical mitre_technique
-> VETTED_AIRGAP.
- Jaccard >= threshold but divergent mitre_technique -> CONTESTED
(ARCHITECTURE.md §1 row 4).
- Jaccard < threshold -> CONTESTED.
- Empty-set rule (ARCHITECTURE.md §1): any empty artifact_paths is
treated as DISAGREEMENT, never a null vote that lets the non-empty
engine win by default. Pre-empts Jaccard so the mathematical
0/0=1.0 convention cannot vet by accident.
- Engine-family distinctness: passing two outputs from the same
family (e.g. qwen3 vs qwen3) raises ValueError. Air-gap quorum
collapses to self-consistency without independence; refused at
the consensus boundary.
- VerdictResult.notes records Jaccard score + disagreement reason
so the ledger has the audit handle and replan_node can route on
disagreement type.
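
The decision rules listed above can be sketched as a single pure function. This is a minimal sketch under stated assumptions, not the code in this PR: the Verdict enum, the layout of notes as a dict, the reason strings, the family() rule, and the mapping of the empty-set DISAGREEMENT case to CONTESTED are all hypothetical. Only the names compute_verdict, EngineOutput, VerdictResult, VETTED_AIRGAP, CONTESTED, and AIRGAP_JACCARD_THRESHOLD come from the description.

```python
from dataclasses import dataclass
from enum import Enum

AIRGAP_JACCARD_THRESHOLD = 0.80  # stand-in; the PR imports this from strategy.py


class Verdict(Enum):  # hypothetical enum; only the member names come from the PR
    VETTED_AIRGAP = "VETTED_AIRGAP"
    CONTESTED = "CONTESTED"


@dataclass(frozen=True)
class EngineOutput:
    engine: str
    artifact_paths: frozenset
    mitre_technique: str

    def family(self) -> str:
        # Assumption: the family is the engine name up to the first "-".
        return self.engine.split("-", 1)[0].lower()


@dataclass(frozen=True)
class VerdictResult:
    verdict: Verdict
    notes: dict  # assumed shape: {"jaccard": float | None, "reason": str | None}


def compute_verdict(qwen: EngineOutput, glm: EngineOutput) -> VerdictResult:
    # Independence check first: same-family inputs are refused outright.
    if qwen.family() == glm.family():
        raise ValueError("air-gap quorum needs two distinct engine families")
    # Empty-set rule runs before Jaccard, so 0/0 is never evaluated.
    if not qwen.artifact_paths or not glm.artifact_paths:
        return VerdictResult(
            Verdict.CONTESTED, {"jaccard": None, "reason": "empty_artifact_set"}
        )
    shared = qwen.artifact_paths & glm.artifact_paths
    combined = qwen.artifact_paths | glm.artifact_paths
    score = len(shared) / len(combined)
    if score < AIRGAP_JACCARD_THRESHOLD:
        return VerdictResult(
            Verdict.CONTESTED, {"jaccard": score, "reason": "low_overlap"}
        )
    if qwen.mitre_technique != glm.mitre_technique:
        return VerdictResult(
            Verdict.CONTESTED, {"jaccard": score, "reason": "mitre_divergence"}
        )
    return VerdictResult(Verdict.VETTED_AIRGAP, {"jaccard": score, "reason": None})


paths = frozenset({"/tmp/a", "/tmp/b", "/tmp/c", "/tmp/d"})
qwen = EngineOutput("qwen3-30b-a3b-thinking", paths, "T1059")
glm = EngineOutput("glm-4.5-air", paths, "T1059")
result = compute_verdict(qwen, glm)
```

Ordering the checks as family, empty set, overlap, then technique means each CONTESTED result carries exactly one reason, which is what disagreement-type routing would need.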
verify(...) raises NotImplementedError until W2.B wires the SGLang
clients (Qwen3 + GLM-4.5-Air) + ledger plumbing. CLAUDE.md §3.10
explicitly permits this backend-level stub: the consensus logic is
real and exercised by unit tests against EngineOutput records — it
is not a mock, it is the strategy's load-bearing decision function.
EngineOutput is a frozen dataclass carrier (engine, artifact_paths,
mitre_technique) with a family() helper. Deliberately not a Pydantic
model — the cross-engine consensus path runs once per hypothesis
under the quorum_node and must not allocate validators on the hot
path. Hypothesis / Finding schemas already validate upstream.
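
As an illustration of the carrier described above, a frozen dataclass with the three named fields might look like the sketch below. The field types, the example values, and the rule used by family() (text before the first "-") are assumptions; the description only states the field names and that a family() helper exists.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class EngineOutput:
    engine: str                      # e.g. "qwen3-30b-a3b-thinking" (hypothetical id)
    artifact_paths: frozenset[str]   # assumed type; immutable so the record stays hashable
    mitre_technique: str             # e.g. "T1059"

    def family(self) -> str:
        # Hypothetical rule: family = text before the first "-", lower-cased.
        return self.engine.split("-", 1)[0].lower()


out = EngineOutput("qwen3-30b-a3b-thinking", frozenset({"/tmp/x.sh"}), "T1059")

try:
    out.mitre_technique = "T1003"  # frozen: attribute assignment is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False

print(out.family(), mutated)
```

A frozen dataclass performs no field validation at construction time, which fits the stated reasoning that validation already happened upstream in the Hypothesis / Finding schemas.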
12/12 tests in tests/verification/test_airgap_cross_engine.py pass;
30/30 in the full verification suite pass; ruff clean.
Review: W3.A.1 AirGapCrossEngine [automated reviewer, tier-1]
CI result (local run on
Consensus-logic correctness
All seven spec invariants pass:
One nit: threshold via module reference, not attribute:
Minor:
Summary
W3.A.1 implementation of AirGapCrossEngine: the air-gap mode quorum strategy
that runs Qwen3-30B-A3B-Thinking against GLM-4.5-Air and accepts the
finding only when they agree (ARCHITECTURE.md §1 quorum-dispatch rows 3-5;
CLAUDE.md §8).

- compute_verdict(qwen, glm) is the pure consensus surface: input is two
  EngineOutput records, output is VerdictResult.
- Jaccard(artifact_paths) ≥ AIRGAP_JACCARD_THRESHOLD (0.80, imported from
  strategy.py, never hard-coded) AND identical mitre_technique →
  VETTED_AIRGAP.
- Jaccard ≥ threshold but divergent mitre_technique → CONTESTED
  (ARCHITECTURE.md §1 row 4).
- Jaccard < threshold → CONTESTED.
- Empty-set rule: an empty artifact_paths from either side is
  DISAGREEMENT, never a null vote. Pre-empts Jaccard so 0/0=1.0 cannot
  vet by accident.
- Engine-family distinctness: two outputs from the same family (e.g.
  qwen3 vs qwen3) raise ValueError. Air-gap quorum without independence
  collapses to self-consistency; refused at the consensus boundary.
- VerdictResult.notes records Jaccard score + disagreement reason for
  ledger audit + replan_node disagreement-type routing.

verify(...) raises NotImplementedError until W2.B wires the SGLang
clients + ledger plumbing. CLAUDE.md §3.10 explicitly permits this
backend-level stub: the consensus logic is real and exercised by unit
tests against EngineOutput records; it is not a mock.

EngineOutput is a frozen dataclass carrier (engine, artifact_paths,
mitre_technique) with a family() helper. Deliberately not Pydantic:
the consensus path runs once per hypothesis under quorum_node and
must not allocate validators on the hot path; the upstream
Hypothesis / Finding schemas already validate input.

Builds on PR #29 (feat/W1.C.3-strategy-protocol-and-usc); base set to
that branch so this PR's diff shows only the W3.A.1 surface.
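
The reason the empty-set rule has to run before the Jaccard computation can be shown with a short sketch. Both function names and the default threshold argument are hypothetical and exist only for this illustration; the PR takes its threshold from strategy.py.

```python
def jaccard_with_convention(a: frozenset, b: frozenset) -> float:
    """Textbook Jaccard that defines J(empty, empty) as 1.0."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def agrees(a: frozenset, b: frozenset, threshold: float = 0.80) -> bool:
    """Guarded comparison: any empty side counts as disagreement, checked first."""
    if not a or not b:
        return False
    return jaccard_with_convention(a, b) >= threshold


empty = frozenset()
one_path = frozenset({"/tmp/x.sh"})

# Without the guard, two engines that both found nothing score 1.0 and would pass.
naive = jaccard_with_convention(empty, empty) >= 0.80
# With the guard, the same inputs are treated as disagreement.
guarded = agrees(empty, empty)
# A one-sided empty set already scores 0.0, so the guard matters most when both are empty.
one_sided = jaccard_with_convention(one_path, empty)

print(naive, guarded, one_sided)
```

The guard also makes the one-sided case explicit, so a non-empty engine never wins by default regardless of how the similarity function treats empty input.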
Test plan
- tests/verification/test_airgap_cross_engine.py: 12 tests pass
- ruff check verdict/ tests/ clean
- commit subjects with [W3.A.1] per CLAUDE.md §3.7
- replace NotImplementedError in verify() with real SGLang transport
  that calls compute_verdict(...) (pending W2.B)