feat(verification): AirGapCrossEngine [W3.A.1]#37

Draft
TimothyVang wants to merge 2 commits into feat/W1.C.3-strategy-protocol-and-usc from feat/W3.A.1-airgap-cross-engine

@TimothyVang

Owner

Summary

W3.A.1 implementation of AirGapCrossEngine — air-gap mode quorum strategy
that runs Qwen3-30B-A3B-Thinking against GLM-4.5-Air and accepts the
finding only when they agree (ARCHITECTURE.md §1 quorum-dispatch rows 3-5;
CLAUDE.md §8).

  • compute_verdict(qwen, glm) is the pure consensus surface — input
    is two EngineOutput records, output is VerdictResult.
    • Jaccard(artifact_paths) ≥ AIRGAP_JACCARD_THRESHOLD (0.80, imported
      from strategy.py — never hard-coded) AND identical
      mitre_technique → VETTED_AIRGAP.
    • Jaccard ≥ threshold but divergent mitre_technique → CONTESTED
      (ARCHITECTURE.md §1 row 4).
    • Jaccard < threshold → CONTESTED.
  • Empty-set rule (ARCHITECTURE.md §1): empty artifact_paths from
    either side is DISAGREEMENT, never a null vote. Pre-empts Jaccard so
    0/0=1.0 cannot vet by accident.
  • Engine-family distinctness: same-family pairs (e.g. qwen3 vs
    qwen3) raise ValueError. Air-gap quorum without independence
    collapses to self-consistency; refused at the consensus boundary.
  • VerdictResult.notes records Jaccard score + disagreement reason for
    ledger audit + replan_node disagreement-type routing.
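
The decision order described in the bullets above can be sketched as follows. This is a minimal illustration, not the code in this PR: the `Verdict` enum, the field layout of `VerdictResult`, the `frozenset` type for `artifact_paths`, and the wording of the notes are all assumptions, and the threshold is inlined only so the sketch runs on its own (the PR imports it from strategy.py).

```python
from dataclasses import dataclass
from enum import Enum

# Assumption: inlined so the sketch is self-contained; the PR imports this
# constant from strategy.py instead of defining it locally.
AIRGAP_JACCARD_THRESHOLD = 0.80


class Verdict(Enum):  # hypothetical; the real verdict type is not shown in the PR
    VETTED_AIRGAP = "VETTED_AIRGAP"
    CONTESTED = "CONTESTED"


@dataclass(frozen=True)
class EngineOutput:
    engine: str
    artifact_paths: frozenset  # assumed container type
    mitre_technique: str

    def family(self) -> str:
        # Text before the first hyphen; hyphen-less names return themselves.
        return self.engine.split("-", 1)[0]


@dataclass(frozen=True)
class VerdictResult:  # hypothetical shape: a verdict plus free-text notes
    verdict: Verdict
    notes: str


def compute_verdict(a: EngineOutput, b: EngineOutput) -> VerdictResult:
    # 1. Independence: two outputs from one family are self-consistency.
    if a.family() == b.family():
        raise ValueError(f"same engine family: {a.family()!r}")

    # 2. Empty-set rule: checked before Jaccard so 0/0 can never vet.
    if not a.artifact_paths or not b.artifact_paths:
        return VerdictResult(Verdict.CONTESTED, "empty artifact_paths")

    # 3. Jaccard overlap of the two artifact sets (union is non-empty here).
    shared = len(a.artifact_paths & b.artifact_paths)
    total = len(a.artifact_paths | b.artifact_paths)
    score = shared / total
    if score < AIRGAP_JACCARD_THRESHOLD:
        return VerdictResult(Verdict.CONTESTED, f"jaccard={score:.2f} below threshold")

    # 4. Overlap alone is not enough: the technique label must match too.
    if a.mitre_technique != b.mitre_technique:
        return VerdictResult(Verdict.CONTESTED, f"jaccard={score:.2f}; mitre_technique diverges")

    return VerdictResult(Verdict.VETTED_AIRGAP, f"jaccard={score:.2f}")
```

Checking family distinctness first means a misconfigured pair fails loudly instead of producing a CONTESTED verdict that looks like an ordinary disagreement.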

verify(...) raises NotImplementedError until W2.B wires the SGLang
clients + ledger plumbing. CLAUDE.md §3.10 explicitly permits this
backend-level stub: the consensus logic is real and exercised by unit
tests against EngineOutput records — it is not a mock.
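
A minimal sketch of the stub pattern described above, assuming a class named after the PR title. The real `verify` signature is not shown in this PR, so it is left open with `*args, **kwargs`; the helper `verify_is_stubbed` is hypothetical and only illustrates how a test can pin the raise without any mocking library.

```python
class AirGapCrossEngine:
    """Hypothetical skeleton: only the transport entry point is stubbed."""

    def verify(self, *args, **kwargs):
        # Signature left open on purpose; the PR does not show the real one.
        raise NotImplementedError("verify() needs the W2.B SGLang clients")


def verify_is_stubbed() -> bool:
    """Return True while verify() still raises NotImplementedError."""
    try:
        AirGapCrossEngine().verify()
    except NotImplementedError:
        return True
    return False
```

Once the transport exists, a check like this would start returning False, which signals that the stub test should be replaced by a real integration test.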

EngineOutput is a frozen dataclass carrier (engine, artifact_paths,
mitre_technique) with a family() helper. Deliberately not Pydantic —
the consensus path runs once per hypothesis under quorum_node and
must not allocate validators on the hot path; the upstream Hypothesis/
Finding schemas already validate input.

Builds on PR #29 (feat/W1.C.3-strategy-protocol-and-usc) — base set to
that branch so this PR's diff shows only the W3.A.1 surface.

Test plan

  • tests/verification/test_airgap_cross_engine.py — 12 tests pass
  • Full verification suite green (30/30)
  • ruff check verdict/ tests/ clean
  • RED commit lands first, GREEN commit lands second; per-task-ID
    commit subjects with [W3.A.1] per CLAUDE.md §3.7
  • After W2.B: replace NotImplementedError in verify() with real
    SGLang transport that calls compute_verdict(...)

TimothyVang added 2 commits May 2, 2026 08:14
Failing tests for the air-gap cross-engine quorum strategy
(ARCHITECTURE.md §1 quorum-dispatch table rows 3-5):

- Jaccard >=0.80 AND identical mitre_technique -> VETTED_AIRGAP
- Jaccard >=0.80, divergent mitre_technique -> CONTESTED
- Jaccard <0.80 -> CONTESTED
- empty-set rule: empty parsed_artifacts is treated as DISAGREEMENT
  (never a free pass for the non-empty engine)
- threshold imported from strategy.AIRGAP_JACCARD_THRESHOLD; never
  hard-coded
- compute_verdict(qwen, glm) is the pure consensus surface; the
  transport-level verify(...) raises NotImplementedError until W2.B
  wires SGLang clients (CLAUDE.md §3.10 — no mocks against verdict
  internals; the agreement function is a pure data-path function and
  is unit-tested at the right granularity here)
- VerdictResult.notes carries Jaccard + mitre-divergence reason for
  ledger audit + replan_node disagreement-type routing
- engine identity is enforced (two outputs from the same family
  collapses cross-engine to self-consistency and is rejected)

Module under test does not yet exist; collection ERRORs with
ModuleNotFoundError for verdict.verification.airgap_cross_engine.
GREEN follows in the next commit.
Implements the air-gap cross-engine quorum strategy
(ARCHITECTURE.md §1 quorum-dispatch table rows 3-5; CLAUDE.md §8):

- compute_verdict(qwen, glm) is the pure consensus surface.
  - Jaccard(artifact_paths) >= AIRGAP_JACCARD_THRESHOLD (0.80, imported
    from strategy.py — never hard-coded) AND identical mitre_technique
    -> VETTED_AIRGAP.
  - Jaccard >= threshold but divergent mitre_technique -> CONTESTED
    (ARCHITECTURE.md §1 row 4).
  - Jaccard < threshold -> CONTESTED.
- Empty-set rule (ARCHITECTURE.md §1): any empty artifact_paths is
  treated as DISAGREEMENT, never a null vote that lets the non-empty
  engine win by default. Pre-empts Jaccard so the mathematical
  0/0=1.0 convention cannot vet by accident.
- Engine-family distinctness: passing two outputs from the same
  family (e.g. qwen3 vs qwen3) raises ValueError. Air-gap quorum
  collapses to self-consistency without independence; refused at
  the consensus boundary.
- VerdictResult.notes records Jaccard score + disagreement reason
  so the ledger has the audit handle and replan_node can route on
  disagreement type.

verify(...) raises NotImplementedError until W2.B wires the SGLang
clients (Qwen3 + GLM-4.5-Air) + ledger plumbing. CLAUDE.md §3.10
explicitly permits this backend-level stub: the consensus logic is
real and exercised by unit tests against EngineOutput records — it
is not a mock, it is the strategy's load-bearing decision function.

EngineOutput is a frozen dataclass carrier (engine, artifact_paths,
mitre_technique) with a family() helper. Deliberately not a Pydantic
model — the cross-engine consensus path runs once per hypothesis
under the quorum_node and must not allocate validators on the hot
path. Hypothesis / Finding schemas already validate upstream.

12/12 tests in tests/verification/test_airgap_cross_engine.py pass;
30/30 in the full verification suite pass; ruff clean.
@TimothyVang

Owner Author

Review — W3.A.1 AirGapCrossEngine [automated reviewer, tier-1]

CI result (local run on feat/W3.A.1-airgap-cross-engine): 30/30 tests pass. ruff check verdict/verification/ clean.


Consensus-logic correctness

All seven spec invariants pass:

| Invariant | Location | Verdict |
| --- | --- | --- |
| Jaccard ≥ 0.80 AND identical mitre_technique → VETTED_AIRGAP | compute_verdict lines 86–101 | PASS |
| Jaccard ≥ 0.80, divergent mitre_technique → CONTESTED | lines 94–105 | PASS |
| Jaccard < 0.80 → CONTESTED | lines 81–92 | PASS |
| Empty artifact_paths from either side → CONTESTED (pre-empts Jaccard 0/0 = 1) | lines 61–79 | PASS |
| Same-family pair raises ValueError | lines 47–58 | PASS |
| AIRGAP_JACCARD_THRESHOLD read from strategy._strategy on every call, never hard-coded | line 83 | PASS |
| VerdictResult.notes records Jaccard score on VETTED and disagreement reason on CONTESTED | lines 96–101, 87–91 | PASS |

One nit — threshold via module reference, not attribute: compute_verdict reads _strategy.AIRGAP_JACCARD_THRESHOLD at call time via a module alias. The test test_threshold_is_imported_from_strategy_module asserts the constant equals 0.80 and the boundary-case vets, but does not monkeypatch the constant and re-run. That is fine for now (monkeypatching a frozen module attribute is fragile), but worth noting: the "live constant" guarantee is structurally verified by importing the same name, not by runtime mutation. Acceptable posture for W3.A.1.
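
To illustrate why reading the constant through a module reference matters, here is a small self-contained sketch. The `strategy` object is a stand-in built with `types.ModuleType`, and both helper functions are hypothetical. It only shows the general Python behavior: a lookup through the module sees a later rebinding, while a name copied at import time does not.

```python
import types

# Hypothetical stand-in for strategy.py so the sketch runs on its own.
strategy = types.ModuleType("strategy")
strategy.AIRGAP_JACCARD_THRESHOLD = 0.80

# Equivalent of `from strategy import AIRGAP_JACCARD_THRESHOLD`:
# the value is bound once and never refreshed.
BOUND_AT_IMPORT = strategy.AIRGAP_JACCARD_THRESHOLD


def vets_via_module(score: float) -> bool:
    # Attribute lookup happens on every call, so rebinding is visible.
    return score >= strategy.AIRGAP_JACCARD_THRESHOLD


def vets_via_bound_name(score: float) -> bool:
    # Compares against the value captured at import time.
    return score >= BOUND_AT_IMPORT


before = (vets_via_module(0.85), vets_via_bound_name(0.85))
strategy.AIRGAP_JACCARD_THRESHOLD = 0.90  # what a patching test would do
after = (vets_via_module(0.85), vets_via_bound_name(0.85))
```

A follow-up test that rebinds the constant and re-runs the boundary case would therefore distinguish the two styles, which is the runtime check the current test does not perform.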

Minor: _jaccard defensive 0.0 on union-empty is belt-and-braces. The docstring correctly explains the empty-set pre-check in compute_verdict guards this path. That comment is important to keep — future callers of _jaccard directly would not get the empty-set guard for free.

EngineOutput design

Frozen dataclass is the right choice — no Pydantic allocation on the quorum hot-path. The family() split-on-first-hyphen approach works for the current engine identifiers (qwen3-*, glm-*, claude-*). One forward-compatibility note: if a model name is ever introduced without a hyphen (e.g. a single-word alias), family() returns the full engine string rather than raising. The docstring acknowledges this ("If '-' not in self.engine: return self.engine"). It will silently pass the family-distinctness check if two such single-word engines share the same string, which is an unlikely but real hazard. Low risk at W3.A.1; worth a # TODO(W-future): enforce hyphenated naming convention comment so it isn't forgotten.
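
The split-on-first-hyphen behavior and one possible stricter variant can be sketched as below. `family` mirrors the behavior quoted from the docstring; `strict_family` and the alias `localthinker` are hypothetical and are shown only as one way a naming convention could be enforced.

```python
def family(engine: str) -> str:
    """Behavior as described in the review: text before the first hyphen."""
    if "-" not in engine:
        return engine
    return engine.split("-", 1)[0]


def strict_family(engine: str) -> str:
    """Hypothetical variant that rejects names without a family prefix."""
    head, sep, tail = engine.partition("-")
    if not sep or not head or not tail:
        raise ValueError(f"engine name must look like '<family>-<variant>': {engine!r}")
    return head


# A hyphen-less alias becomes its own family, so an alias for a qwen3
# model would compare as distinct from "qwen3" in a family check.
alias_family = family("localthinker")        # hypothetical alias
qwen_family = family("qwen3-30b-a3b-thinking")
```

Whether to raise or to keep the permissive fallback is a design choice for the maintainers; the sketch only makes the difference concrete.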

§3.7 commit discipline

Two commits, RED then GREEN, both carry [W3.A.1] in the subject. Prefixes are test(verification): and feat(verification): — both in the allowed set. No --no-verify. No watermarks. Clean.

§3.8 dependencies

No new dependencies. collections and dataclasses are stdlib. No forbidden packages touched.

§3.10 no-mocks

No Mock, patch, MagicMock, or HTTP-replay library anywhere in the diff. EngineOutput is a plain dataclass, not a mock. verify() raises NotImplementedError at the network boundary — the correct pattern per §3.10. The test for verify asserts the raise; the consensus surface is unit-tested via compute_verdict directly.

Actionable items

  1. (Nit, non-blocking) engine_output.py line ~74: add a brief comment noting that single-word engine names will return themselves from family() — protects future contributors.
  2. (Non-blocking) _jaccard is duplicated verbatim in dual_lane_cross_engine.py (PR #42, feat(verification): DualLaneCrossEngine three-way verification [W3.A.2]). A shared private helper in engine_output.py or a new _consensus_utils.py would eliminate drift risk. Not required for this PR but worth a follow-up task.

Recommendation: APPROVE. Logic is correct, tests pin every spec invariant, no mocks, commit hygiene clean.
