feat(verification): Universal Self-Consistency judge [W3.A.3] #47
TimothyVang wants to merge 2 commits into
… [W3.A.3]
Failing tests for the W3.A.3 upgrade of UniversalSelfConsistency from
the W1.C.3 stub to the real Chen et al. 2023 (USC) judge-of-last-resort
strategy (CLAUDE.md §8 / ARCHITECTURE.md §1).
New file tests/verification/test_universal_self_consistency.py:
- judge(candidates) clusters by SUBSTANCE
(artifact-set + mitre-technique). When a clear majority cluster
exists, USC selects a majority member's index and returns a
USCJudgement with status set to a VETTED_* state.
- BUILD_PLAN W3.A.3.a load-bearing test:
test_judge_picks_most_consistent_rationale_among_n3 — given three
candidates with two-of-three substance majority,
judge(...).selected_index in {0,1} AND status != CONTESTED.
- test_judge_returns_contested_when_no_majority_exists — three
pairwise-disagreeing candidates → CONTESTED. USC correctly admits
"no winner" rather than invent one (the LLM-judge fallback for
the no-majority case lands in W2.B; until then CONTESTED is the
honest answer).
- Substance-clustering uses set semantics on artifact_paths AND
identity equality on mitre_technique.
- Empty-set rule (ARCHITECTURE.md §1) carries: candidates with empty
artifact_paths drop out of clustering rather than counting as a
"majority of empties".
- vetted_status kwarg (default VETTED_CLOUD) lets the dispatching
quorum_node pass the locked mode's VETTED_* so USC is mode-agnostic.
Passing CONTESTED / UNVERIFIABLE raises ValueError.
- Boundary: judge() with <2 candidates raises ValueError.
- verify(...) (Protocol method) raises NotImplementedError when
called without candidates (USC is dispatched after another
strategy returned CONTESTED, with prior outputs in hand).
- verify(..., candidates=...) delegates to judge() and returns a
VerdictResult so the Protocol contract holds on the dispatch path.
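The substance-clustering rule in the bullets above (set semantics on artifact_paths, equality on mitre_technique) can be illustrated with a minimal sketch. The Candidate dataclass, its field types, and the substance_key helper are hypothetical stand-ins for illustration, not the module's actual types.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for one prior-strategy output."""
    mitre_technique: str
    artifact_paths: tuple[str, ...]
    rationale: str = ""


def substance_key(c: Candidate) -> tuple[str, frozenset[str]]:
    # frozenset gives set semantics: ordering and duplicates are ignored.
    # The technique is compared by plain equality.
    return (c.mitre_technique, frozenset(c.artifact_paths))


a = Candidate("T1059", ("/var/log/auth.log", "/tmp/x.sh"), "wording A")
b = Candidate("T1059", ("/tmp/x.sh", "/var/log/auth.log"), "wording B")
c = Candidate("T1003", ("/tmp/x.sh", "/var/log/auth.log"), "wording C")

# Same artifacts in a different order, same technique: one cluster.
same_substance = substance_key(a) == substance_key(b)
# Same artifacts, different technique: separate clusters.
different_technique = substance_key(a) == substance_key(c)
```

The rationale text is deliberately left out of the key, so two candidates that word their reasoning differently still land in the same cluster.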
tests/verification/test_strategy_protocol.py — two stub-anchored
tests updated for the post-W3.A.3 reality (per W1.C.3 strategy.py
comment "When W3.A.3 lands, set this to the empty string and remove
the guard test"):
- test_strategy_returns_verdict_result now exercises the
candidates-supplied path on verify() and asserts a non-CONTESTED
result on a 2-of-3 substance majority. The stub-era assertion
status == VETTED_CLOUD is no longer load-bearing.
- test_usc_stub_does_not_pretend_to_implement_chen_2023 renamed to
test_usc_stub_marker_cleared_post_w3a3 and inverted: STUB_FOR
must now be empty (W3.A.3 lands the real strategy).
Module under test does not yet expose USCJudgement; collection
ERRORs with ImportError. GREEN follows in the next commit.
Upgrades UniversalSelfConsistency from the W1.C.3 stub to the real
Chen et al. 2023 (USC, arXiv:2311.17311) judge-of-last-resort strategy
(CLAUDE.md §8 / ARCHITECTURE.md §1).

USC is invoked AFTER another strategy returns CONTESTED. It reads the
prior strategy's candidate outputs (the n=3 CloudSelfConsistency trio
or the 2-engine cross-engine pair) and either:

1. Substance-majority case (deterministic): when a clear majority
   cluster exists in the candidate set, keyed by
   (mitre_technique, frozenset(artifact_paths)), USC selects the first
   candidate of that cluster and returns USCJudgement with the
   caller-specified vetted_status. This is the load-bearing testable
   surface (BUILD_PLAN W3.A.3.a:
   test_judge_picks_most_consistent_rationale_among_n3).
2. No-majority case (LLM-as-judge fallback): Chen 2023 §3 prescribes
   an LLM-as-judge prompt in which the model reads all candidate
   rationales and picks the most consistent one. The LLM transport
   lands in W2.B; until then judge() correctly returns CONTESTED for
   the no-majority case (USC admits "no winner" rather than invent
   one).

USCJudgement carries selected_index (None on no-majority) and a
VerdictStatus. The caller passes the locked mode's VETTED_* via
vetted_status so USC stays mode-agnostic; passing CONTESTED /
UNVERIFIABLE is rejected (the dispatcher would be asking USC to
mislabel its own verdict). At-least-two-candidates is enforced.

Empty-set rule (ARCHITECTURE.md §1) carries: candidates with empty
artifact_paths drop out of clustering rather than counting as a
"majority of empties". A silently-crashing engine must NOT carry a
vetted verdict by virtue of producing nothing. The majority
denominator is the ORIGINAL candidate count (not survivors-only) so
empties count against the threshold.

Majority threshold is strict: >50% of original candidates. n=3
requires 2 cluster-mates; n=4 requires 3. Tie-breaker is
deterministic: first-insertion order via Counter.most_common
(Python 3.7+ insertion-stable).

verify(...) (Protocol method) accepts an optional candidates kwarg and
a vetted_status kwarg and delegates to judge(); without candidates it
raises NotImplementedError (standalone USC has nothing to judge over
without the LLM transport).

STUB_FOR class marker cleared (was "W3.A.3 (Chen et al. 2023 ...)")
to signal the stub-vs-real boundary moved.

12 new tests in tests/verification/test_universal_self_consistency.py
all pass; 2 stub-anchored tests in test_strategy_protocol.py updated
in the RED commit; 57/57 in the full verification suite pass; ruff
clean.
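The deterministic path described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the enum values, the Candidate fields, and the notes strings are hypothetical, and the no-majority branch simply returns CONTESTED because the LLM-as-judge transport is out of scope here.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class VerdictStatus(Enum):
    # Hypothetical values; only these three member names appear in the PR.
    VETTED_CLOUD = "vetted_cloud"
    CONTESTED = "contested"
    UNVERIFIABLE = "unverifiable"


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for one prior-strategy output."""
    mitre_technique: str
    artifact_paths: tuple[str, ...]


@dataclass(frozen=True)
class USCJudgement:
    selected_index: Optional[int]
    status: VerdictStatus
    notes: str = ""


def judge(
    candidates: Sequence[Candidate],
    vetted_status: VerdictStatus = VerdictStatus.VETTED_CLOUD,
) -> USCJudgement:
    if len(candidates) < 2:
        raise ValueError("USC needs at least two candidates")
    if vetted_status in (VerdictStatus.CONTESTED, VerdictStatus.UNVERIFIABLE):
        raise ValueError("vetted_status must be a VETTED_* state")

    # Empty-set rule: candidates without artifacts never join a cluster.
    keyed = [
        (i, (c.mitre_technique, frozenset(c.artifact_paths)))
        for i, c in enumerate(candidates)
        if c.artifact_paths
    ]
    counts = Counter(key for _, key in keyed)
    if counts:
        top_key, top_count = counts.most_common(1)[0]
        # Strict majority over the ORIGINAL count, so empties count against it.
        if top_count * 2 > len(candidates):
            first = next(i for i, key in keyed if key == top_key)
            return USCJudgement(first, vetted_status, "substance majority")
    return USCJudgement(None, VerdictStatus.CONTESTED, "no substance majority")


trio = [
    Candidate("T1059", ("/tmp/a.sh",)),
    Candidate("T1059", ("/tmp/a.sh",)),
    Candidate("T1003", ("/tmp/b.dmp",)),
]
majority = judge(trio)  # 2 of 3 share substance

with_empties = [
    Candidate("T1059", ()),
    Candidate("T1059", ()),
    Candidate("T1059", ("/tmp/a.sh",)),
]
no_winner = judge(with_empties)  # 1 survivor out of 3 originals
```

Comparing top_count * 2 against len(candidates) keeps the threshold strict without floating-point division: n=3 needs 2 cluster-mates and n=4 needs 3, matching the numbers above.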
Review: W3.A.3 UniversalSelfConsistency [automated reviewer, tier-1]
CI result (local run on …)
Consensus-logic correctness
All spec invariants pass:
- BUILD_PLAN W3.A.3.a load-bearing test
Summary
W3.A.3 upgrades UniversalSelfConsistency from the W1.C.3 stub to the
real Chen et al. 2023 (USC, arXiv:2311.17311) judge-of-last-resort
strategy (CLAUDE.md §8 / ARCHITECTURE.md §1).

USC is invoked AFTER another strategy returns CONTESTED. It reads the
prior strategy's candidate outputs (the n=3 CloudSelfConsistency trio
or the 2-engine cross-engine pair) and either:
1. Substance-majority case (deterministic): when a clear majority
   cluster exists keyed by
   (mitre_technique, frozenset(artifact_paths)), USC selects the first
   candidate of that cluster and returns
   USCJudgement(selected_index, status=vetted_status, notes). This is
   the load-bearing testable surface (BUILD_PLAN W3.A.3.a:
   test_judge_picks_most_consistent_rationale_among_n3).
2. No-majority case (LLM-as-judge fallback): Chen 2023 §3 prescribes
   an LLM-as-judge prompt; transport lands in W2.B. Until then judge()
   correctly returns CONTESTED rather than invent a winner.

USCJudgement carries selected_index: int | None and a VerdictStatus.
Caller passes the locked mode's VETTED_* via vetted_status so USC
stays mode-agnostic.

Empty-set rule (ARCHITECTURE.md §1): candidates with empty
artifact_paths drop out of clustering. The majority denominator is the
original candidate count (not survivors-only), so empties count
against the threshold: a silently-crashing engine cannot carry a
vetted verdict.

Majority threshold is strict (>50% of original candidates).
Tie-breaker is deterministic via Counter.most_common (Python 3.7+
insertion-stable order).
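
The ordering guarantee that the tie-breaker relies on can be checked directly. The cluster names below are made up for illustration; only collections.Counter from the standard library is used.

```python
from collections import Counter

# Two clusters with equal counts; "cluster_b" is encountered first.
keys = ["cluster_b", "cluster_a", "cluster_a", "cluster_b"]
counts = Counter(keys)

# most_common() orders equal counts by first encounter (Python 3.7+).
ranked = counts.most_common()
top_key, top_count = counts.most_common(1)[0]
```

Here ranked lists cluster_b ahead of cluster_a even though both have a count of 2. Under a strict >50% threshold two tied clusters cannot both be majorities, so the ordering mainly ensures that repeated runs over the same candidates behave identically.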

verify(...) Protocol method accepts optional candidates kwarg and
delegates to judge(). Without candidates it raises NotImplementedError
(standalone USC has nothing to judge over until the LLM transport
lands in W2.B).
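
A rough sketch of that delegation follows. Everything beyond the names verify, judge, candidates, vetted_status, VerdictResult and STUB_FOR is assumed: the claim parameter, the VerdictResult fields, the string statuses, and the exact-string majority inside judge() are simplified placeholders, not the PR's substance clustering.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class VerdictResult:
    """Hypothetical shape; the real VerdictResult fields are not shown in the PR."""
    status: str
    selected_index: Optional[int]


class UniversalSelfConsistencySketch:
    STUB_FOR = ""  # cleared marker, as the PR describes

    def judge(
        self, candidates: Sequence[str], vetted_status: str
    ) -> tuple[Optional[int], str]:
        # Placeholder: exact-string strict majority stands in for clustering.
        for i, cand in enumerate(candidates):
            if candidates.count(cand) * 2 > len(candidates):
                return i, vetted_status
        return None, "CONTESTED"

    def verify(
        self,
        claim: str,  # hypothetical positional argument
        *,
        candidates: Optional[Sequence[str]] = None,
        vetted_status: str = "VETTED_CLOUD",
    ) -> VerdictResult:
        if candidates is None:
            # Standalone USC has nothing to judge over without prior outputs.
            raise NotImplementedError("USC requires prior candidates")
        index, status = self.judge(candidates, vetted_status)
        return VerdictResult(status=status, selected_index=index)


usc = UniversalSelfConsistencySketch()
dispatched = usc.verify("claim", candidates=["a", "a", "b"])
```

Keeping candidates keyword-only makes the dispatch path explicit at the call site, while the Protocol-shaped call without candidates fails loudly instead of returning a made-up verdict.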

STUB_FOR class marker cleared from "W3.A.3 (Chen et al. 2023 ...)" to
"" to signal the stub-vs-real boundary moved (per W1.C.3 strategy.py
comment "When W3.A.3 lands, set this to the empty string and remove
the guard test").

12 new tests in tests/verification/test_universal_self_consistency.py;
2 stub-anchored tests in test_strategy_protocol.py updated in the RED
commit (renamed regression-guard test, updated
test_strategy_returns_verdict_result to exercise the
candidates-supplied path).

Builds on PR #42 (feat/W3.A.2-dual-lane-cross-engine); base set to
that branch so this PR's diff shows only the W3.A.3 surface.
Test plan
- tests/verification/test_universal_self_consistency.py: 12 tests
  pass, including BUILD_PLAN W3.A.3.a's
  test_judge_picks_most_consistent_rationale_among_n3
- test_strategy_protocol.py tests updated for post-W3.A.3 reality
  (RED commit)
- ruff check verdict/ tests/ clean
- Commit subjects with [W3.A.3] per CLAUDE.md §3.7
- Follow-up (W2.B): replace judge()'s no-majority CONTESTED return
  with a real LLM-as-judge call (Chen 2023 §3 prompt) and replace
  verify()'s standalone NotImplementedError with the
  candidate-construction transport