feat(verification): Universal Self-Consistency judge [W3.A.3] by TimothyVang · Pull Request #47 · TimothyVang/Verdict

TimothyVang · 2026-05-02T13:27:07Z

Summary

W3.A.3 upgrades UniversalSelfConsistency from the W1.C.3 stub to the
real Chen et al. 2023 (USC, arXiv:2311.17311) judge-of-last-resort
strategy (CLAUDE.md §8 / ARCHITECTURE.md §1).

USC is invoked AFTER another strategy returns CONTESTED. It reads the
prior strategy's candidate outputs (the n=3 CloudSelfConsistency trio
or the 2-engine cross-engine pair) and either:

1. Substance-majority case (deterministic): when a clear majority
   cluster exists keyed by (mitre_technique, frozenset(artifact_paths)),
   USC selects the first candidate of that cluster and returns
   USCJudgement(selected_index, status=vetted_status, notes). This is
   the load-bearing testable surface (BUILD_PLAN W3.A.3.a:
   test_judge_picks_most_consistent_rationale_among_n3).
2. No-majority case: Chen 2023 §3 prescribes an LLM-as-judge
   prompt; transport lands in W2.B. Until then judge() correctly
   returns CONTESTED rather than inventing a winner.
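The two cases above can be sketched as a small standalone function. This is a minimal illustration, not the code in the PR: the enum values, the two-field EngineOutput shape, the notes strings, and the error messages are assumptions made so the block runs on its own. The clustering key, the `len(candidates) // 2 + 1` threshold, and the guard behaviour follow what the PR and review describe.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence, Tuple


class VerdictStatus(Enum):
    # Assumed subset of the real enum: only members the PR names.
    VETTED_CLOUD = "vetted_cloud"
    CONTESTED = "contested"
    UNVERIFIABLE = "unverifiable"


@dataclass(frozen=True)
class EngineOutput:
    # Assumed shape: only the two fields the clustering key reads.
    mitre_technique: str
    artifact_paths: Tuple[str, ...]


@dataclass(frozen=True)
class USCJudgement:
    selected_index: Optional[int]
    status: VerdictStatus
    notes: str = ""


def _substance_key(candidate: EngineOutput):
    # Set semantics on artifacts, plain equality on the technique.
    return (candidate.mitre_technique, frozenset(candidate.artifact_paths))


def judge(candidates: Sequence[EngineOutput],
          vetted_status: VerdictStatus = VerdictStatus.VETTED_CLOUD) -> USCJudgement:
    if vetted_status in (VerdictStatus.CONTESTED, VerdictStatus.UNVERIFIABLE):
        raise ValueError("vetted_status must be a VETTED_* state")
    if len(candidates) < 2:
        raise ValueError("USC needs at least two candidates")

    # Empty artifact sets drop out of clustering but stay in the denominator.
    real_indexed = [(i, c) for i, c in enumerate(candidates) if c.artifact_paths]
    counts = Counter(_substance_key(c) for _, c in real_indexed)
    majority_threshold = len(candidates) // 2 + 1  # strict > 50% of originals

    if counts:
        top_key, top_count = counts.most_common(1)[0]
        if top_count >= majority_threshold:
            # First candidate (by original index) belonging to the winning cluster.
            winner = next(i for i, c in real_indexed if _substance_key(c) == top_key)
            return USCJudgement(winner, vetted_status, "substance majority")
    # No majority: admit it rather than pick a winner.
    return USCJudgement(None, VerdictStatus.CONTESTED, "no substance majority")


# Illustrative inputs; the technique IDs and paths are made up for the example.
trio = [
    EngineOutput("T1059", ("a.evtx", "b.evtx")),
    EngineOutput("T1059", ("b.evtx", "a.evtx")),  # same artifacts, different order
    EngineOutput("T1003", ("c.dmp",)),
]
result = judge(trio)
```

Here `result.selected_index` is 0 and `result.status` is the caller-supplied vetted status, because the first two candidates share one substance key and 2 meets the threshold for n=3.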

- USCJudgement carries selected_index: int | None and a
  VerdictStatus. Caller passes the locked mode's VETTED_* via
  vetted_status so USC stays mode-agnostic.
- Empty-set rule (ARCHITECTURE.md §1) carries: empty artifact_paths
  candidates drop out of clustering. The majority denominator is the
  original candidate count (not survivors-only) so empties count
  against the threshold, and a silently-crashing engine cannot carry a
  vetted verdict.
- Majority threshold is strict (>50% of originals). Tie-breaker is
  deterministic via Counter.most_common (Python 3.7+
  insertion-stable order).
- verify(...) Protocol method accepts optional candidates kwarg
  and delegates to judge(). Without candidates it raises
  NotImplementedError (standalone USC has nothing to judge over
  until the LLM transport lands in W2.B).
- STUB_FOR class marker cleared from
  "W3.A.3 (Chen et al. 2023 ...)" to "" to signal the
  stub-vs-real boundary moved (per W1.C.3 strategy.py comment
  "When W3.A.3 lands, set this to the empty string and remove the
  guard test").
- Two stub-anchored tests in test_strategy_protocol.py updated
  in the RED commit (renamed regression-guard test, updated
  test_strategy_returns_verdict_result to exercise the
  candidates-supplied path).
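The tie-breaker claim can be checked in isolation. The snippet below is a small sketch with made-up cluster labels; it relies only on the documented behaviour of `Counter.most_common`, which orders elements with equal counts by first encounter.

```python
from collections import Counter

# Hypothetical cluster labels for four candidates, split 2 versus 2.
keys = ["cluster_b", "cluster_a", "cluster_b", "cluster_a"]

counts = Counter(keys)
top_key, top_count = counts.most_common(1)[0]

# Strict majority over the original candidate count.
threshold = len(keys) // 2 + 1

# The first-encountered cluster wins the tie on ordering...
tie_goes_to_first_seen = top_key == "cluster_b"
# ...but a 2-2 split still falls short of the strict threshold of 3.
clears_threshold = top_count >= threshold
```

Under a strict >50% threshold two tied clusters can never both clear the bar, so the stable ordering matters mainly for reproducibility rather than for choosing between tied winners.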

Builds on PR #42 (feat/W3.A.2-dual-lane-cross-engine) — base set
to that branch so this PR's diff shows only the W3.A.3 surface.

Test plan

- tests/verification/test_universal_self_consistency.py:
  12 tests pass, including BUILD_PLAN W3.A.3.a's
  test_judge_picks_most_consistent_rationale_among_n3
- Two test_strategy_protocol.py tests updated for post-W3.A.3
  reality (RED commit)
- Full verification suite green (57/57)
- ruff check verdict/ tests/ clean
- RED commit lands first, GREEN commit lands second; per-task-ID
  commit subjects with [W3.A.3] per CLAUDE.md §3.7
- After W2.B: replace judge()'s no-majority CONTESTED return
  with a real LLM-as-judge call (Chen 2023 §3 prompt) and
  replace verify()'s standalone NotImplementedError with the
  candidate-construction transport

… [W3.A.3]

Failing tests for the W3.A.3 upgrade of UniversalSelfConsistency from the W1.C.3 stub to the real Chen et al. 2023 (USC) judge-of-last-resort strategy (CLAUDE.md §8 / ARCHITECTURE.md §1).

New file tests/verification/test_universal_self_consistency.py:

- judge(candidates) clusters by SUBSTANCE (artifact-set + mitre-technique). When a clear majority cluster exists, USC selects a majority member's index and returns a USCJudgement with status set to a VETTED_* state.
- BUILD_PLAN W3.A.3.a load-bearing test: test_judge_picks_most_consistent_rationale_among_n3. Given three candidates with two-of-three substance majority, judge(...).selected_index in {0,1} AND status != CONTESTED.
- test_judge_returns_contested_when_no_majority_exists: three pairwise-disagreeing candidates → CONTESTED. USC correctly admits "no winner" rather than inventing one (the LLM-judge fallback for the no-majority case lands in W2.B; until then CONTESTED is the honest answer).
- Substance-clustering uses set semantics on artifact_paths AND identity equality on mitre_technique.
- Empty-set rule (ARCHITECTURE.md §1) carries: candidates with empty artifact_paths drop out of clustering rather than counting as a "majority of empties".
- vetted_status kwarg (default VETTED_CLOUD) lets the dispatching quorum_node pass the locked mode's VETTED_* so USC is mode-agnostic. Passing CONTESTED / UNVERIFIABLE raises ValueError.
- Boundary: judge() with <2 candidates raises ValueError.
- verify(...) (Protocol method) raises NotImplementedError when called without candidates (USC is dispatched after another strategy returned CONTESTED, with prior outputs in hand).
- verify(..., candidates=...) delegates to judge() and returns a VerdictResult so the Protocol contract holds on the dispatch path.

tests/verification/test_strategy_protocol.py: two stub-anchored tests updated for the post-W3.A.3 reality (per W1.C.3 strategy.py comment "When W3.A.3 lands, set this to the empty string and remove the guard test"):

- test_strategy_returns_verdict_result now exercises the candidates-supplied path on verify() and asserts a non-CONTESTED result on a 2-of-3 substance majority. The stub-era assertion status == VETTED_CLOUD is no longer load-bearing.
- test_usc_stub_does_not_pretend_to_implement_chen_2023 renamed to test_usc_stub_marker_cleared_post_w3a3 and inverted: STUB_FOR must now be empty (W3.A.3 lands the real strategy).

Module under test does not yet expose USCJudgement; collection ERRORs with ImportError. GREEN follows in the next commit.

Upgrades UniversalSelfConsistency from the W1.C.3 stub to the real Chen et al. 2023 (USC, arXiv:2311.17311) judge-of-last-resort strategy (CLAUDE.md §8 / ARCHITECTURE.md §1).

USC is invoked AFTER another strategy returns CONTESTED. It reads the prior strategy's candidate outputs (the n=3 CloudSelfConsistency trio or the 2-engine cross-engine pair) and either:

1. Substance-majority case (deterministic): when a clear majority cluster exists in the candidate set, keyed by (mitre_technique, frozenset(artifact_paths)), USC selects the first candidate of that cluster and returns USCJudgement with the caller-specified vetted_status. This is the load-bearing testable surface (BUILD_PLAN W3.A.3.a: test_judge_picks_most_consistent_rationale_among_n3).
2. No-majority case (LLM-as-judge fallback): Chen 2023 §3 prescribes an LLM-as-judge prompt in which the model reads all candidate rationales and picks the most consistent one. The LLM transport lands in W2.B; until then judge() correctly returns CONTESTED for the no-majority case (USC admits "no winner" rather than inventing one).

USCJudgement carries selected_index (None on no-majority) and a VerdictStatus. The caller passes the locked mode's VETTED_* via vetted_status so USC stays mode-agnostic; passing CONTESTED / UNVERIFIABLE is rejected (the dispatcher would be asking USC to mislabel its own verdict). At-least-two-candidates is enforced.

Empty-set rule (ARCHITECTURE.md §1) carries: candidates with empty artifact_paths drop out of clustering rather than counting as a "majority of empties". A silently-crashing engine must NOT carry a vetted verdict by virtue of producing nothing. The majority denominator is the ORIGINAL candidate count (not survivors-only) so empties count against the threshold.

Majority threshold is strict: >50% of original candidates. n=3 requires 2 cluster-mates; n=4 requires 3. Tie-breaker is deterministic: first-insertion order via Counter.most_common (Python 3.7+ insertion-stable).

verify(...) (Protocol method) accepts an optional candidates kwarg and a vetted_status kwarg and delegates to judge(); without candidates it raises NotImplementedError (standalone USC has nothing to judge over without the LLM transport).

STUB_FOR class marker cleared (was "W3.A.3 (Chen et al. 2023 ...)") to signal the stub-vs-real boundary moved.

12 new tests in tests/verification/test_universal_self_consistency.py all pass; 2 stub-anchored tests in test_strategy_protocol.py updated in the RED commit; 57/57 in the full verification suite pass; ruff clean.

TimothyVang · 2026-05-02T13:33:52Z

Review — W3.A.3 UniversalSelfConsistency [automated reviewer, tier-1]

CI result (local run on feat/W3.A.3-universal-self-consistency): 57/57 tests pass (45 inherited from #42 + 12 new). ruff check verdict/verification/ clean.

Consensus-logic correctness

All spec invariants pass:

| Invariant | Location | Verdict |
| --- | --- | --- |
| Substance-clustering by `(mitre_technique, frozenset(artifact_paths))` (order-insensitive artifact sets) | `_substance_key` + `Counter` | PASS |
| Strict-majority threshold `> 50%` of original candidate count (not survivors) | `majority_threshold = len(candidates) // 2 + 1` | PASS |
| Empty `artifact_paths` candidates drop out of clustering; denominator is original count | `real_indexed` filter + `len(candidates)` denominator | PASS |
| No-majority → `CONTESTED` (LLM-judge deferred to W2.B, honest return) | tail of `judge()` | PASS |
| `STUB_FOR` cleared to `""` | `STUB_FOR: str = ""` | PASS |
| `vetted_status` must be `VETTED_*`; `CONTESTED`/`UNVERIFIABLE` raise `ValueError` | guard at top of `judge()` | PASS |
| `judge()` requires ≥ 2 candidates | guard at top of `judge()` | PASS |
| `verify(..., candidates=...)` delegates to `judge()` and returns `VerdictResult` | `verify()` body | PASS |
| `verify(...)` without candidates raises `NotImplementedError` | `if candidates is None: raise` | PASS |
| Tie-breaker is deterministic (first insertion order via `Counter.most_common` + first matching original index) | `next(i for i, c in real_indexed ...)` | PASS |
| USC is mode-agnostic; caller passes `vetted_status` | `vetted_status` kwarg default `VETTED_CLOUD` | PASS |

BUILD_PLAN W3.A.3.a load-bearing test

test_judge_picks_most_consistent_rationale_among_n3 passes and asserts selected_index in {0, 1} AND status != CONTESTED. Confirmed this test was in the RED commit before the implementation landed.

`_substance_key` correctness note

frozenset(candidate.artifact_paths) provides set semantics for clustering. This means two candidates that cite the same artifacts in different orders cluster as equal — the test test_judge_uses_artifact_set_semantics_not_list_order pins this. It also means duplicate paths within one candidate's list collapse (same artifact cited twice = cited once for clustering). This is the correct and desirable behaviour: artifact-list duplicates are a formatting artefact, not a substantive difference.
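Both consequences of the frozenset key (order-insensitivity and duplicate collapse) are easy to see in a few lines. This is a sketch: `substance_key` here takes the two fields directly rather than an EngineOutput, and the technique ID and paths are made-up examples.

```python
def substance_key(mitre_technique, artifact_paths):
    # Hypothetical free-function variant of the key described above.
    return (mitre_technique, frozenset(artifact_paths))


in_order = substance_key("T1059", ["logs/a.evtx", "logs/b.evtx"])
reordered = substance_key("T1059", ["logs/b.evtx", "logs/a.evtx"])
duplicated = substance_key("T1059", ["logs/a.evtx", "logs/b.evtx", "logs/a.evtx"])
other_technique = substance_key("T1003", ["logs/a.evtx", "logs/b.evtx"])

# Same artifacts in any order, with or without repeats, give one key.
same_cluster = in_order == reordered == duplicated
# A different technique never clusters with them, even on identical artifacts.
different_cluster = in_order != other_technique
```

Because the key is a tuple containing a frozenset it is hashable, which is what lets it serve directly as a `Counter` key.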

Majority denominator correctness

The comment in the code is precise: "the total denominator is len(candidates), NOT len(real_indexed)". For n=3 with one empty, majority requires 2 of 3, so both survivors must agree. This is correct and security-critical: with a survivors-only denominator, a lone survivor next to two empties would form a 1-of-1 "majority" and vet with only one real engine reporting. The test test_judge_drops_empty_artifact_candidates_from_clustering covers the two-empties-against-one case and correctly asserts CONTESTED.
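The arithmetic behind the denominator choice can be worked through directly. The helper name below is an assumption for illustration; the formula is the `len(candidates) // 2 + 1` expression quoted in the invariant table.

```python
def strict_majority_threshold(total: int) -> int:
    # Smallest cluster size that is strictly more than 50% of `total`.
    return total // 2 + 1


# Two-empties-against-one: three candidates, only one with artifacts.
original_count = 3
survivor_count = 1
largest_cluster = 1

# Denominator = original count: the lone survivor falls short (1 < 2).
vets_with_original = largest_cluster >= strict_majority_threshold(original_count)
# Denominator = survivors only: the lone survivor would clear the bar (1 >= 1).
vets_with_survivors = largest_cluster >= strict_majority_threshold(survivor_count)

# Thresholds for the candidate-set sizes the PR mentions.
thresholds = {n: strict_majority_threshold(n) for n in (2, 3, 4)}
```

With these values `vets_with_original` is False and `vets_with_survivors` is True, which is the gap the original-count denominator closes.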

`verify()` Protocol conformance after W3.A.3

The VerifierStrategy Protocol signature is:

def verify(self, *, case_id: str, hypothesis: str, mitre_technique: str, evidence_summary: str) -> VerdictResult

UniversalSelfConsistency.verify() adds candidates and vetted_status as optional kwargs with defaults. This is backward-compatible: a VerifierStrategy-typed dispatch site that calls verify(case_id=..., hypothesis=..., mitre_technique=..., evidence_summary=...) will get NotImplementedError (correct — USC without candidates has nothing to judge). The quorum_node at W2.B must call with candidates= populated. This is a deliberate design choice and is clearly documented. Worth keeping the W2.B wiring note visible at that dispatch site when it lands.
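A minimal sketch of this conformance pattern follows, assuming a one-field VerdictResult and a placeholder judge(); neither reflects the real classes, whose fields are not shown in the PR. It illustrates only that adding keyword arguments with defaults keeps the class structurally compatible with the Protocol while the Protocol-only call path raises.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence


@dataclass(frozen=True)
class VerdictResult:
    # Assumed stand-in with a single field.
    status: str


class VerifierStrategy(Protocol):
    def verify(self, *, case_id: str, hypothesis: str,
               mitre_technique: str, evidence_summary: str) -> VerdictResult: ...


class UniversalSelfConsistencySketch:
    def judge(self, candidates: Sequence[object], vetted_status: str) -> str:
        # Placeholder: stands in for the real substance clustering.
        return vetted_status if len(candidates) >= 2 else "CONTESTED"

    def verify(self, *, case_id: str, hypothesis: str,
               mitre_technique: str, evidence_summary: str,
               candidates: Optional[Sequence[object]] = None,
               vetted_status: str = "VETTED_CLOUD") -> VerdictResult:
        if candidates is None:
            # Standalone USC has nothing to judge over.
            raise NotImplementedError("USC needs candidates from a prior strategy")
        return VerdictResult(status=self.judge(candidates, vetted_status))


def dispatch(strategy: VerifierStrategy) -> VerdictResult:
    # A Protocol-typed call site that knows nothing about `candidates`.
    return strategy.verify(case_id="case-1", hypothesis="h",
                           mitre_technique="T1059", evidence_summary="s")


usc = UniversalSelfConsistencySketch()
with_candidates = usc.verify(case_id="case-1", hypothesis="h",
                             mitre_technique="T1059", evidence_summary="s",
                             candidates=["out-a", "out-b", "out-c"])
```

Calling `dispatch(usc)` raises NotImplementedError, while the call that supplies `candidates=` returns a VerdictResult, matching the two paths described above.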

One potential issue: the TYPE_CHECKING-guarded import of EngineOutput in strategy.py. EngineOutput is imported only under TYPE_CHECKING, so the name does not exist at runtime. This is fine because from __future__ import annotations is present (PEP 563 postponed evaluation): the annotation candidate: EngineOutput on _substance_key, the Sequence[EngineOutput] annotation in judge()'s signature, and verify()'s candidates parameter are all stored as strings at runtime and never evaluated. The Counter(keys) and frozenset(candidate.artifact_paths) calls operate on the actual EngineOutput dataclass instances at runtime without needing the type annotation resolved. No functional issue. Confirming this is intentional: the TYPE_CHECKING guard keeps strategy.py free of a circular or heavy import on the hot path.
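The mechanism can be reproduced in a standalone file. In this sketch `fractions.Fraction` stands in for EngineOutput purely so the example is self-contained; the function name is also an assumption.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Sequence

if TYPE_CHECKING:
    # Seen by type checkers only; this import never runs at runtime.
    from fractions import Fraction  # stand-in for EngineOutput


def first_value(values: Sequence[Fraction]) -> Fraction:
    # Works on whatever objects are passed in; the annotation is never evaluated.
    return values[0]


# Under postponed evaluation the annotation is stored as a plain string.
stored_annotation = first_value.__annotations__["values"]
runtime_result = first_value([7, 8, 9])
```

The function runs even though `Fraction` was never imported, because nothing at runtime resolves the stored string. Code that does resolve annotations at runtime, for example via `typing.get_type_hints`, would need the name to be importable.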

`test_strategy_protocol.py` updates

Two tests correctly updated:

- test_strategy_returns_verdict_result now exercises the candidates-supplied path on verify() with a 2-of-3 majority and asserts result.status != CONTESTED. The former assertion == VETTED_CLOUD is removed because the stub-era assumption no longer holds; VETTED_CLOUD remains the default vetted_status, but the test deliberately stays mode-neutral.
- test_usc_stub_marker_cleared_post_w3a3 inverts the old guard test to assert STUB_FOR == "". Correct regression guard for the new reality.

§3.7 commit discipline

Two commits, RED then GREEN, both carry [W3.A.3]. Prefixes test(verification): and feat(verification): — allowed. The RED commit subject was truncated in GitHub display (test(verification): UniversalSelfConsistency Chen-2023 invariants RED…) but the commit body is complete. No --no-verify. No watermarks.

§3.8 dependencies

No new dependencies. collections.Counter and collections.abc.Sequence are stdlib.

§3.10 no-mocks

No Mock / patch / replay anywhere in the diff. All tests drive judge() and verify() directly with real EngineOutput dataclass instances. The LLM-judge fallback path is honestly deferred via a CONTESTED return, not a mock.

Actionable items

- (Nit, non-blocking) verify() has _ = (case_id, hypothesis, mitre_technique, evidence_summary) in the candidates is None branch to suppress unused-variable warnings. This is fine but slightly obscures that those inputs will be consumed by the LLM-judge in W2.B. A comment # W2.B will use these to construct the LLM-judge prompt would make the intent clear for the implementer landing that work.
- (Non-blocking) Checked whether USCJudgement is exported via __all__ at module level: it is ("USCJudgement" is in the diff on the feat/W3.A.3 branch). No action needed.
- (Non-blocking, forward) When W2.B wires the LLM-judge, the judge() no-majority path must also be exercised end-to-end in the eval suite. The current test correctly marks it as CONTESTED; that test will need updating (or a new test added) when the LLM transport lands.

Recommendation: APPROVE. Substance-clustering is correct, majority denominator uses original count (not survivors), empty-set rule carries from the cross-engine strategies, STUB_FOR cleared, all 12 new tests plus 45 inherited pass, commit hygiene clean.

TimothyVang added 2 commits May 2, 2026 08:24

TimothyVang wants to merge 2 commits into feat/W3.A.2-dual-lane-cross-engine from feat/W3.A.3-universal-self-consistency
