feat(verification): AirGapCrossEngine [W3.A.1] by TimothyVang · Pull Request #37 · TimothyVang/Verdict

TimothyVang · 2026-05-02T13:17:15Z

Summary

W3.A.1 implementation of AirGapCrossEngine — air-gap mode quorum strategy
that runs Qwen3-30B-A3B-Thinking against GLM-4.5-Air and accepts the
finding only when they agree (ARCHITECTURE.md §1 quorum-dispatch rows 3-5;
CLAUDE.md §8).

- compute_verdict(qwen, glm) is the pure consensus surface: input
  is two EngineOutput records, output is VerdictResult.
- Jaccard(artifact_paths) ≥ AIRGAP_JACCARD_THRESHOLD (0.80, imported
  from strategy.py, never hard-coded) AND identical
  mitre_technique → VETTED_AIRGAP.
- Jaccard ≥ threshold but divergent mitre_technique → CONTESTED
  (ARCHITECTURE.md §1 row 4).
- Jaccard < threshold → CONTESTED.
- Empty-set rule (ARCHITECTURE.md §1): empty artifact_paths from
  either side is DISAGREEMENT, never a null vote. Pre-empts Jaccard so
  0/0=1.0 cannot vet by accident.
- Engine-family distinctness: same-family pairs (e.g. qwen3 vs
  qwen3) raise ValueError. Air-gap quorum without independence
  collapses to self-consistency; refused at the consensus boundary.
- VerdictResult.notes records Jaccard score + disagreement reason for
  ledger audit + replan_node disagreement-type routing.
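
The rules above can be sketched as one pure function. This is a minimal illustration, not the code in this PR: the `Verdict` enum, the two-field `VerdictResult`, the note strings, and the inlined threshold are assumptions made so the example runs on its own (the real constant is imported from strategy.py).

```python
from dataclasses import dataclass
from enum import Enum

# Assumption: inlined so the sketch is self-contained; the PR imports this
# value from strategy.py instead of defining it locally.
AIRGAP_JACCARD_THRESHOLD = 0.80


class Verdict(Enum):  # hypothetical stand-in for the project's verdict labels
    VETTED_AIRGAP = "VETTED_AIRGAP"
    CONTESTED = "CONTESTED"


@dataclass(frozen=True)
class EngineOutput:
    engine: str
    artifact_paths: frozenset[str]
    mitre_technique: str

    def family(self) -> str:
        # Text before the first hyphen, or the whole string if there is none.
        return self.engine.split("-", 1)[0]


@dataclass(frozen=True)
class VerdictResult:  # hypothetical shape: only the fields this sketch needs
    verdict: Verdict
    notes: str


def compute_verdict(a: EngineOutput, b: EngineOutput) -> VerdictResult:
    # Independence first: same-family pairs are refused outright.
    if a.family() == b.family():
        raise ValueError(f"engines share family {a.family()!r}")

    # Empty-set rule runs before Jaccard so an empty side can never vet.
    if not a.artifact_paths or not b.artifact_paths:
        return VerdictResult(Verdict.CONTESTED, "empty artifact_paths: disagreement")

    union = a.artifact_paths | b.artifact_paths
    score = len(a.artifact_paths & b.artifact_paths) / len(union)

    if score < AIRGAP_JACCARD_THRESHOLD:
        return VerdictResult(Verdict.CONTESTED, f"jaccard={score:.2f} below threshold")
    if a.mitre_technique != b.mitre_technique:
        return VerdictResult(Verdict.CONTESTED, f"jaccard={score:.2f}; mitre_technique diverges")
    return VerdictResult(Verdict.VETTED_AIRGAP, f"jaccard={score:.2f}")


# Usage with made-up engine identifiers and paths.
paths = frozenset({"/a", "/b", "/c", "/d", "/e"})
qwen = EngineOutput("qwen3-30b-a3b-thinking", paths, "T1059")
glm = EngineOutput("glm-4.5-air", paths - {"/e"}, "T1059")
print(compute_verdict(qwen, glm))
```

Ordering matters in this sketch: the family check and the empty-set check both come before any division, so the Jaccard step only ever sees two non-empty sets from distinct families.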

verify(...) raises NotImplementedError until W2.B wires the SGLang
clients + ledger plumbing. CLAUDE.md §3.10 explicitly permits this
backend-level stub: the consensus logic is real and exercised by unit
tests against EngineOutput records — it is not a mock.

EngineOutput is a frozen dataclass carrier (engine, artifact_paths,
mitre_technique) with a family() helper. Deliberately not Pydantic —
the consensus path runs once per hypothesis under quorum_node and
must not allocate validators on the hot path; the upstream Hypothesis/
Finding schemas already validate input.
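
As a rough illustration of the carrier described above, the following sketch shows a frozen dataclass with the three named fields and a `family()` helper. The field types, the engine identifier, and the sample values are assumptions; only the field names and the frozen-dataclass choice come from this PR.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class EngineOutput:
    engine: str
    artifact_paths: frozenset[str]  # assumed type; a set suits Jaccard
    mitre_technique: str

    def family(self) -> str:
        # Text before the first hyphen, or the whole string if there is none.
        return self.engine.split("-", 1)[0]


out = EngineOutput(
    engine="qwen3-30b-a3b-thinking",  # hypothetical identifier
    artifact_paths=frozenset({"/var/log/auth.log"}),
    mitre_technique="T1110",
)

# frozen=True turns attribute assignment into an error, so a record cannot
# change between the engine call and the consensus step.
try:
    out.mitre_technique = "T1078"
    mutated = True
except FrozenInstanceError:
    mutated = False

print(out.family(), mutated)
```

Construction performs no validation at all, which is the trade-off the paragraph above describes: correctness of the fields is left to the upstream schemas.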

Builds on PR #29 (feat/W1.C.3-strategy-protocol-and-usc) — base set to
that branch so this PR's diff shows only the W3.A.1 surface.

Test plan

- tests/verification/test_airgap_cross_engine.py: 12 tests pass
- Full verification suite green (30/30)
- ruff check verdict/ tests/ clean
- RED commit lands first, GREEN commit lands second; per-task-ID
  commit subjects with [W3.A.1] per CLAUDE.md §3.7
- After W2.B: replace NotImplementedError in verify() with real
  SGLang transport that calls compute_verdict(...)

Failing tests for the air-gap cross-engine quorum strategy (ARCHITECTURE.md §1 quorum-dispatch table rows 3-5):

- Jaccard >=0.80 AND identical mitre_technique -> VETTED_AIRGAP
- Jaccard >=0.80, divergent mitre_technique -> CONTESTED
- Jaccard <0.80 -> CONTESTED
- empty-set rule: empty parsed_artifacts is treated as DISAGREEMENT (never a free pass for the non-empty engine)
- threshold imported from strategy.AIRGAP_JACCARD_THRESHOLD; never hard-coded
- compute_verdict(qwen, glm) is the pure consensus surface; the transport-level verify(...) raises NotImplementedError until W2.B wires SGLang clients (CLAUDE.md §3.10: no mocks against verdict internals; the agreement function is a pure data-path function and is unit-tested at the right granularity here)
- VerdictResult.notes carries Jaccard + mitre-divergence reason for ledger audit + replan_node disagreement-type routing
- engine identity is enforced (two outputs from the same family collapse cross-engine to self-consistency and are rejected)

Module under test does not yet exist; collection ERRORs with ModuleNotFoundError for verdict.verification.airgap_cross_engine. GREEN follows in the next commit.

Implements the air-gap cross-engine quorum strategy (ARCHITECTURE.md §1 quorum-dispatch table rows 3-5; CLAUDE.md §8):

- compute_verdict(qwen, glm) is the pure consensus surface.
- Jaccard(artifact_paths) >= AIRGAP_JACCARD_THRESHOLD (0.80, imported from strategy.py, never hard-coded) AND identical mitre_technique -> VETTED_AIRGAP.
- Jaccard >= threshold but divergent mitre_technique -> CONTESTED (ARCHITECTURE.md §1 row 4).
- Jaccard < threshold -> CONTESTED.
- Empty-set rule (ARCHITECTURE.md §1): any empty artifact_paths is treated as DISAGREEMENT, never a null vote that lets the non-empty engine win by default. Pre-empts Jaccard so the mathematical 0/0=1.0 convention cannot vet by accident.
- Engine-family distinctness: passing two outputs from the same family (e.g. qwen3 vs qwen3) raises ValueError. Air-gap quorum collapses to self-consistency without independence; refused at the consensus boundary.
- VerdictResult.notes records Jaccard score + disagreement reason so the ledger has the audit handle and replan_node can route on disagreement type.

verify(...) raises NotImplementedError until W2.B wires the SGLang clients (Qwen3 + GLM-4.5-Air) + ledger plumbing. CLAUDE.md §3.10 explicitly permits this backend-level stub: the consensus logic is real and exercised by unit tests against EngineOutput records. It is not a mock; it is the strategy's load-bearing decision function.

EngineOutput is a frozen dataclass carrier (engine, artifact_paths, mitre_technique) with a family() helper. Deliberately not a Pydantic model: the cross-engine consensus path runs once per hypothesis under the quorum_node and must not allocate validators on the hot path. Hypothesis / Finding schemas already validate upstream.

12/12 tests in tests/verification/test_airgap_cross_engine.py pass; 30/30 in the full verification suite pass; ruff clean.

TimothyVang · 2026-05-02T13:32:44Z

Review — W3.A.1 AirGapCrossEngine [automated reviewer, tier-1]

CI result (local run on feat/W3.A.1-airgap-cross-engine): 30/30 tests pass. ruff check verdict/verification/ clean.

Consensus-logic correctness

All seven spec invariants pass:

| Invariant | Location | Verdict |
| --- | --- | --- |
| Jaccard ≥ 0.80 AND identical `mitre_technique` → `VETTED_AIRGAP` | `compute_verdict` lines 86–101 | PASS |
| Jaccard ≥ 0.80, divergent `mitre_technique` → `CONTESTED` | lines 94–105 | PASS |
| Jaccard < 0.80 → `CONTESTED` | lines 81–92 | PASS |
| Empty `artifact_paths` from either side → `CONTESTED` (pre-empts Jaccard 0/0 = 1) | lines 61–79 | PASS |
| Same-family pair raises `ValueError` | lines 47–58 | PASS |
| `AIRGAP_JACCARD_THRESHOLD` read from `strategy._strategy` on every call, never hard-coded | line 83 | PASS |
| `VerdictResult.notes` records Jaccard score on VETTED and disagreement reason on CONTESTED | lines 96–101, 87–91 | PASS |

One nit — threshold via module reference, not attribute: compute_verdict reads _strategy.AIRGAP_JACCARD_THRESHOLD at call time via a module alias. The test test_threshold_is_imported_from_strategy_module asserts the constant equals 0.80 and the boundary-case vets, but does not monkeypatch the constant and re-run. That is fine for now (monkeypatching a frozen module attribute is fragile), but worth noting: the "live constant" guarantee is structurally verified by importing the same name, not by runtime mutation. Acceptable posture for W3.A.1.
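
To make the distinction in this nit concrete, here is a small sketch of what a runtime-mutation test would exercise. The stand-in module built with `types.ModuleType` and both helper functions are hypothetical; the point is only that a lookup through the module object sees a later change, while a value copied at import time does not.

```python
import types

# Hypothetical stand-in for strategy.py so the sketch runs on its own.
_strategy = types.ModuleType("strategy")
_strategy.AIRGAP_JACCARD_THRESHOLD = 0.80

# Early binding: the value is copied once, which is what
# `from strategy import AIRGAP_JACCARD_THRESHOLD` would do.
_COPIED_THRESHOLD = _strategy.AIRGAP_JACCARD_THRESHOLD


def passes_live(score: float) -> bool:
    # Late binding: the attribute is looked up on the module at call time.
    return score >= _strategy.AIRGAP_JACCARD_THRESHOLD


def passes_copied(score: float) -> bool:
    return score >= _COPIED_THRESHOLD


before = (passes_live(0.85), passes_copied(0.85))

# Roughly what a monkeypatch of the constant would do.
_strategy.AIRGAP_JACCARD_THRESHOLD = 0.90

after = (passes_live(0.85), passes_copied(0.85))
print(before, after)
```

After the mutation only the call-time lookup rejects 0.85, which is the behavior a monkeypatching test would pin if one were added later.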

Minor: _jaccard defensive 0.0 on union-empty is belt-and-braces. The docstring correctly explains the empty-set pre-check in compute_verdict guards this path. That comment is important to keep — future callers of _jaccard directly would not get the empty-set guard for free.

`EngineOutput` design

Frozen dataclass is the right choice: no Pydantic allocation on the quorum hot-path. The family() split-on-first-hyphen approach works for the current engine identifiers (qwen3-*, glm-*, claude-*). One forward-compatibility note: if a model name is ever introduced without a hyphen (e.g. a single-word alias), family() returns the full engine string rather than raising. The docstring acknowledges this ("If '-' not in self.engine: return self.engine"). Two identical single-word names still collide and are rejected, but a single-word alias for a model from the same family as its partner is a different string and would silently pass the family-distinctness check, which is an unlikely but real hazard. Low risk at W3.A.1; worth a # TODO(W-future): enforce hyphenated naming convention comment so it isn't forgotten.
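
The hyphen fallback can be illustrated with two free functions. `family` mirrors the behavior quoted from the docstring; `strict_family` and the alias `qwenlocal` are hypothetical, shown only as one way a future naming convention could be enforced.

```python
def family(engine: str) -> str:
    # Behavior described in the review: prefix before the first hyphen,
    # falling back to the whole string when there is no hyphen.
    if "-" not in engine:
        return engine
    return engine.split("-", 1)[0]


def strict_family(engine: str) -> str:
    # Hypothetical stricter variant: refuse names that do not follow a
    # "<family>-<rest>" convention instead of guessing.
    prefix, sep, rest = engine.partition("-")
    if not sep or not prefix or not rest:
        raise ValueError(f"engine name {engine!r} is not of the form <family>-<rest>")
    return prefix


# "qwenlocal" is an invented single-word alias for illustration.
lenient_pair = (family("qwen3-30b-a3b-thinking"), family("qwenlocal"))
looks_distinct = lenient_pair[0] != lenient_pair[1]

try:
    strict_family("qwenlocal")
    strict_rejected = False
except ValueError:
    strict_rejected = True

print(lenient_pair, looks_distinct, strict_rejected)
```

With the lenient helper the alias yields its own family string, so a distinctness check that compares families would treat the pair as independent; the strict variant surfaces the naming problem at the point of use.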

§3.7 commit discipline

Two commits, RED then GREEN, both carry [W3.A.1] in the subject. Prefixes are test(verification): and feat(verification): — both in the allowed set. No --no-verify. No watermarks. Clean.

§3.8 dependencies

No new dependencies. collections and dataclasses are stdlib. No forbidden packages touched.

§3.10 no-mocks

No Mock, patch, MagicMock, or HTTP-replay library anywhere in the diff. EngineOutput is a plain dataclass, not a mock. verify() raises NotImplementedError at the network boundary — the correct pattern per §3.10. The test for verify asserts the raise; the consensus surface is unit-tested via compute_verdict directly.

Actionable items

(Nit, non-blocking) engine_output.py line ~74: add a brief comment noting that single-word engine names will return themselves from family() — protects future contributors.
(Non-blocking) _jaccard is duplicated verbatim in dual_lane_cross_engine.py (PR #42, feat(verification): DualLaneCrossEngine three-way verification [W3.A.2]). A shared private helper in engine_output.py or a new _consensus_utils.py would eliminate drift risk. Not required for this PR but worth a follow-up task.

Recommendation: APPROVE. Logic is correct, tests pin every spec invariant, no mocks, commit hygiene clean.

TimothyVang added 2 commits May 2, 2026 08:14

TimothyVang mentioned this pull request May 2, 2026

feat(verification): DualLaneCrossEngine three-way verification [W3.A.2] #42

TimothyVang wants to merge 2 commits into feat/W1.C.3-strategy-protocol-and-usc from feat/W3.A.1-airgap-cross-engine
