feat(verification): Universal Self-Consistency judge [W3.A.3] #47

Draft
TimothyVang wants to merge 2 commits into feat/W3.A.2-dual-lane-cross-engine from feat/W3.A.3-universal-self-consistency

@TimothyVang

Summary

W3.A.3 upgrades UniversalSelfConsistency from the W1.C.3 stub to the
real Chen et al. 2023 (USC, arXiv:2311.17311) judge-of-last-resort
strategy (CLAUDE.md §8 / ARCHITECTURE.md §1).

USC is invoked AFTER another strategy returns CONTESTED. It reads the
prior strategy's candidate outputs (the n=3 CloudSelfConsistency trio
or the 2-engine cross-engine pair) and either:

  1. Substance-majority case (deterministic): when a clear majority
    cluster exists keyed by (mitre_technique, frozenset(artifact_paths)),
    USC selects the first candidate of that cluster and returns
    USCJudgement(selected_index, status=vetted_status, notes). This is
    the load-bearing testable surface — BUILD_PLAN W3.A.3.a:
    test_judge_picks_most_consistent_rationale_among_n3.
  2. No-majority case: Chen 2023 §3 prescribes an LLM-as-judge
    prompt; transport lands in W2.B. Until then judge() correctly
    returns CONTESTED rather than invent a winner.
  • USCJudgement carries selected_index: int | None and a
    VerdictStatus. Caller passes the locked mode's VETTED_* via
    vetted_status so USC stays mode-agnostic.
  • Empty-set rule (ARCHITECTURE.md §1) carries: empty artifact_paths
    candidates drop out of clustering. The majority denominator is the
    original candidate count (not survivors-only) so empties count
    against the threshold — a silently-crashing engine cannot carry a
    vetted verdict.
  • Majority threshold is strict (>50% of originals). Tie-breaker is
    deterministic via Counter.most_common (Python 3.7+
    insertion-stable order).
  • verify(...) Protocol method accepts optional candidates kwarg
    and delegates to judge(). Without candidates it raises
    NotImplementedError (standalone USC has nothing to judge over
    until the LLM transport lands in W2.B).
  • STUB_FOR class marker cleared from
    "W3.A.3 (Chen et al. 2023 ...)" to "" to signal the
    stub-vs-real boundary moved (per W1.C.3 strategy.py comment
    "When W3.A.3 lands, set this to the empty string and remove the
    guard test").
  • Two stub-anchored tests in test_strategy_protocol.py updated
    in the RED commit (renamed regression-guard test, updated
    test_strategy_returns_verdict_result to exercise the
    candidates-supplied path).
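The substance-majority path described above can be sketched as follows. This is a minimal illustration, not the code in the diff: `EngineOutput`, `VerdictStatus` and `USCJudgement` are cut-down stand-ins carrying only the fields this description names, the enum values are invented, and the input guards are left out.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class VerdictStatus(Enum):  # stand-in; the real enum has more members
    VETTED_CLOUD = "vetted_cloud"
    CONTESTED = "contested"
    UNVERIFIABLE = "unverifiable"


@dataclass(frozen=True)
class EngineOutput:  # stand-in with only the two clustering fields
    mitre_technique: str
    artifact_paths: tuple


@dataclass(frozen=True)
class USCJudgement:
    selected_index: Optional[int]
    status: VerdictStatus
    notes: str = ""


def substance_key(candidate: EngineOutput) -> tuple:
    # Order-insensitive key: technique plus the *set* of cited artifacts.
    return (candidate.mitre_technique, frozenset(candidate.artifact_paths))


def judge(
    candidates: Sequence[EngineOutput],
    vetted_status: VerdictStatus = VerdictStatus.VETTED_CLOUD,
) -> USCJudgement:
    # Empty artifact sets drop out of clustering...
    real_indexed = [(i, c) for i, c in enumerate(candidates) if c.artifact_paths]
    # ...but the threshold still uses the ORIGINAL candidate count.
    threshold = len(candidates) // 2 + 1
    counts = Counter(substance_key(c) for _, c in real_indexed)
    if counts:
        top_key, top_count = counts.most_common(1)[0]
        if top_count >= threshold:
            # Select the first original index belonging to the winning cluster.
            first = next(i for i, c in real_indexed if substance_key(c) == top_key)
            return USCJudgement(first, vetted_status, "substance majority")
    # No majority: admit it rather than invent a winner.
    return USCJudgement(None, VerdictStatus.CONTESTED, "no substance majority")
```

With two candidates citing the same artifacts in different order plus one dissenter, this sketch selects index 0 with the caller's `vetted_status`; with three pairwise-disagreeing candidates it returns `selected_index=None` and `CONTESTED`.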

Builds on PR #42 (feat/W3.A.2-dual-lane-cross-engine) — base set
to that branch so this PR's diff shows only the W3.A.3 surface.

Test plan

  • tests/verification/test_universal_self_consistency.py
    12 tests pass, including BUILD_PLAN W3.A.3.a's
    test_judge_picks_most_consistent_rationale_among_n3
  • Two test_strategy_protocol.py tests updated for post-W3.A.3
    reality (RED commit)
  • Full verification suite green (57/57)
  • ruff check verdict/ tests/ clean
  • RED commit lands first, GREEN commit lands second; per-task-ID
    commit subjects with [W3.A.3] per CLAUDE.md §3.7
  • After W2.B: replace judge()'s no-majority CONTESTED return
    with a real LLM-as-judge call (Chen 2023 §3 prompt) and
    replace verify()'s standalone NotImplementedError with the
    candidate-construction transport

TimothyVang added 2 commits May 2, 2026 08:24
… [W3.A.3]

Failing tests for the W3.A.3 upgrade of UniversalSelfConsistency from
the W1.C.3 stub to the real Chen et al. 2023 (USC) judge-of-last-resort
strategy (CLAUDE.md §8 / ARCHITECTURE.md §1).

New file tests/verification/test_universal_self_consistency.py:

- judge(candidates) clusters by SUBSTANCE
  (artifact-set + mitre-technique). When a clear majority cluster
  exists, USC selects a majority member's index and returns a
  USCJudgement with status set to a VETTED_* state.
- BUILD_PLAN W3.A.3.a load-bearing test:
  test_judge_picks_most_consistent_rationale_among_n3 — given three
  candidates with two-of-three substance majority,
  judge(...).selected_index in {0,1} AND status != CONTESTED.
- test_judge_returns_contested_when_no_majority_exists — three
  pairwise-disagreeing candidates → CONTESTED. USC correctly admits
  "no winner" rather than invent one (the LLM-judge fallback for
  the no-majority case lands in W2.B; until then CONTESTED is the
  honest answer).
- Substance-clustering uses set semantics on artifact_paths AND
  identity equality on mitre_technique.
- Empty-set rule (ARCHITECTURE.md §1) carries: candidates with empty
  artifact_paths drop out of clustering rather than counting as a
  "majority of empties".
- vetted_status kwarg (default VETTED_CLOUD) lets the dispatching
  quorum_node pass the locked mode's VETTED_* so USC is mode-agnostic.
  Passing CONTESTED / UNVERIFIABLE raises ValueError.
- Boundary: judge() with <2 candidates raises ValueError.
- verify(...) (Protocol method) raises NotImplementedError when
  called without candidates (USC is dispatched after another
  strategy returned CONTESTED, with prior outputs in hand).
- verify(..., candidates=...) delegates to judge() and returns a
  VerdictResult so the Protocol contract holds on the dispatch path.
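The two input guards listed above (fewer than two candidates, and a non-vetted `vetted_status`) could look roughly like this. It is a sketch under assumptions: `check_judge_inputs` is a hypothetical helper name, the enum is a stand-in with invented values, and the real guard may test for VETTED_* membership differently.

```python
from enum import Enum
from typing import Sequence


class VerdictStatus(Enum):  # stand-in; only the members named in this PR
    VETTED_CLOUD = "vetted_cloud"
    CONTESTED = "contested"
    UNVERIFIABLE = "unverifiable"


# States the dispatcher must never pass as the "vetted" label.
_NON_VETTED = frozenset({VerdictStatus.CONTESTED, VerdictStatus.UNVERIFIABLE})


def check_judge_inputs(candidates: Sequence[object], vetted_status: VerdictStatus) -> None:
    """Raise ValueError on inputs that judge() should refuse to process."""
    if len(candidates) < 2:
        raise ValueError("USC needs at least two candidates to judge over")
    if vetted_status in _NON_VETTED:
        raise ValueError(f"vetted_status must be a VETTED_* state, got {vetted_status.name}")
```

Running both checks before any clustering keeps the failure mode a loud `ValueError` instead of a mislabelled verdict.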

tests/verification/test_strategy_protocol.py — two stub-anchored
tests updated for the post-W3.A.3 reality (per W1.C.3 strategy.py
comment "When W3.A.3 lands, set this to the empty string and remove
the guard test"):

- test_strategy_returns_verdict_result now exercises the
  candidates-supplied path on verify() and asserts a non-CONTESTED
  result on a 2-of-3 substance majority. The stub-era assertion
  status == VETTED_CLOUD is no longer load-bearing.
- test_usc_stub_does_not_pretend_to_implement_chen_2023 renamed to
  test_usc_stub_marker_cleared_post_w3a3 and inverted: STUB_FOR
  must now be empty (W3.A.3 lands the real strategy).

Module under test does not yet expose USCJudgement, so collection
fails with ImportError. GREEN follows in the next commit.

Upgrades UniversalSelfConsistency from the W1.C.3 stub to the real
Chen et al. 2023 (USC, arXiv:2311.17311) judge-of-last-resort
strategy (CLAUDE.md §8 / ARCHITECTURE.md §1).

USC is invoked AFTER another strategy returns CONTESTED. It reads
the prior strategy's candidate outputs (the n=3 CloudSelfConsistency
trio or the 2-engine cross-engine pair) and either:

1. Substance-majority case (deterministic): when a clear majority
   cluster exists in the candidate set — keyed by
   (mitre_technique, frozenset(artifact_paths)) — USC selects the
   first candidate of that cluster and returns USCJudgement with the
   caller-specified vetted_status. This is the load-bearing
   testable surface (BUILD_PLAN W3.A.3.a:
   test_judge_picks_most_consistent_rationale_among_n3).
2. No-majority case (LLM-as-judge fallback): Chen 2023 §3 prescribes
   an LLM-as-judge prompt — the model reads all candidate rationales
   and picks the most consistent one. The LLM transport lands in
   W2.B; until then judge() correctly returns CONTESTED for the
   no-majority case (USC admits "no winner" rather than invent one).

USCJudgement carries selected_index (None on no-majority) and a
VerdictStatus. The caller passes the locked mode's VETTED_* via
vetted_status so USC stays mode-agnostic; passing CONTESTED /
UNVERIFIABLE is rejected (the dispatcher would be asking USC to
mislabel its own verdict). At-least-two-candidates is enforced.

Empty-set rule (ARCHITECTURE.md §1) carries: candidates with empty
artifact_paths drop out of clustering rather than counting as a
"majority of empties". A silently-crashing engine must NOT carry a
vetted verdict by virtue of producing nothing. The majority
denominator is the ORIGINAL candidate count (not survivors-only) so
empties count against the threshold.

Majority threshold is strict: >50% of original candidates. n=3
requires 2 cluster-mates; n=4 requires 3. Tie-breaker is
deterministic — first-insertion order via Counter.most_common
(Python 3.7+ insertion-stable).
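The threshold arithmetic and the tie-break ordering can be checked in isolation. A small sketch: `majority_threshold` is a hypothetical helper name wrapping the `len(candidates) // 2 + 1` expression, and the keys are placeholder strings.

```python
from collections import Counter


def majority_threshold(n_original: int) -> int:
    # Strict majority: the smallest integer greater than n_original / 2.
    return n_original // 2 + 1


# n=3 needs 2 cluster-mates, n=4 needs 3.
thresholds = {n: majority_threshold(n) for n in (2, 3, 4, 5)}

# Counter.most_common orders equal counts by first encounter, so a 2-2
# split always reports the cluster that appeared first in the input.
tied = Counter(["cluster_b", "cluster_a", "cluster_a", "cluster_b"])
first_reported, count = tied.most_common(1)[0]
```

Here `thresholds` is `{2: 2, 3: 2, 4: 3, 5: 3}` and `first_reported` is `"cluster_b"` with a count of 2. Because a strict majority is unique whenever it exists, the ordering only decides which cluster is inspected first; in the 2-2 split above neither cluster reaches the n=4 threshold of 3.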

verify(...) (Protocol method) accepts an optional candidates kwarg
and a vetted_status kwarg and delegates to judge(); without
candidates it raises NotImplementedError (standalone USC has
nothing to judge over without the LLM transport).
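The verify() dispatch described above might be shaped like the sketch below. Everything here is illustrative: `USCSketch`, the `VerdictResult` fields and the enum values are assumptions, and `judge()` is reduced to returning a bare status so the delegation stays visible.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class VerdictStatus(Enum):  # stand-in
    VETTED_CLOUD = "vetted_cloud"
    CONTESTED = "contested"


@dataclass(frozen=True)
class EngineOutput:  # stand-in with only the clustering fields
    mitre_technique: str
    artifact_paths: tuple


@dataclass(frozen=True)
class VerdictResult:  # hypothetical shape; the real type carries more
    status: VerdictStatus


class USCSketch:
    def judge(self, candidates: Sequence[EngineOutput],
              vetted_status: VerdictStatus) -> VerdictStatus:
        # Simplified: cluster non-empty candidates, compare against the
        # strict majority of the original candidate count.
        keys = [(c.mitre_technique, frozenset(c.artifact_paths))
                for c in candidates if c.artifact_paths]
        top = Counter(keys).most_common(1)
        if top and top[0][1] >= len(candidates) // 2 + 1:
            return vetted_status
        return VerdictStatus.CONTESTED

    def verify(self, *, case_id: str, hypothesis: str, mitre_technique: str,
               evidence_summary: str,
               candidates: Optional[Sequence[EngineOutput]] = None,
               vetted_status: VerdictStatus = VerdictStatus.VETTED_CLOUD,
               ) -> VerdictResult:
        if candidates is None:
            # Standalone USC has nothing to judge over yet.
            raise NotImplementedError("USC requires candidates until W2.B lands")
        return VerdictResult(status=self.judge(candidates, vetted_status))
```

Keeping `candidates` and `vetted_status` keyword-only with defaults means a call site that passes only the four Protocol arguments still type-checks and fails loudly at runtime.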

STUB_FOR class marker cleared (was "W3.A.3 (Chen et al. 2023 ...)")
to signal the stub-vs-real boundary moved.

12 new tests in tests/verification/test_universal_self_consistency.py
all pass; 2 stub-anchored tests in test_strategy_protocol.py updated
in the RED commit; 57/57 in the full verification suite pass; ruff
clean.
@TimothyVang

Review — W3.A.3 UniversalSelfConsistency [automated reviewer, tier-1]

CI result (local run on feat/W3.A.3-universal-self-consistency): 57/57 tests pass (45 inherited from #42 + 12 new). ruff check verdict/verification/ clean.


Consensus-logic correctness

All spec invariants pass:

| Invariant | Location | Verdict |
| --- | --- | --- |
| Substance-clustering by (mitre_technique, frozenset(artifact_paths)); artifact sets are order-insensitive | _substance_key + Counter | PASS |
| Strict-majority threshold > 50% of original candidate count (not survivors) | majority_threshold = len(candidates) // 2 + 1 | PASS |
| Empty artifact_paths candidates drop out of clustering; denominator is original count | real_indexed filter + len(candidates) denominator | PASS |
| No-majority → CONTESTED (LLM-judge deferred to W2.B, honest return) | tail of judge() | PASS |
| STUB_FOR cleared to "" | STUB_FOR: str = "" | PASS |
| vetted_status must be VETTED_*; CONTESTED/UNVERIFIABLE raise ValueError | guard at top of judge() | PASS |
| judge() requires ≥ 2 candidates | guard at top of judge() | PASS |
| verify(..., candidates=...) delegates to judge() and returns VerdictResult | verify() body | PASS |
| verify(...) without candidates raises NotImplementedError | if candidates is None: raise | PASS |
| Tie-breaker is deterministic (first insertion order via Counter.most_common + first matching original index) | next(i for i, c in real_indexed ...) | PASS |
| USC is mode-agnostic; caller passes vetted_status | vetted_status kwarg default VETTED_CLOUD | PASS |

BUILD_PLAN W3.A.3.a load-bearing test

test_judge_picks_most_consistent_rationale_among_n3 passes and asserts selected_index in {0, 1} AND status != CONTESTED. Confirmed this test was in the RED commit before the implementation landed.

_substance_key correctness note

frozenset(candidate.artifact_paths) provides set semantics for clustering. This means two candidates that cite the same artifacts in different orders cluster as equal — the test test_judge_uses_artifact_set_semantics_not_list_order pins this. It also means duplicate paths within one candidate's list collapse (same artifact cited twice = cited once for clustering). This is the correct and desirable behaviour: artifact-list duplicates are a formatting artefact, not a substantive difference.
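Both properties of the key are easy to confirm directly. The sketch below uses a free-standing `substance_key` that takes the two fields as plain arguments (the real `_substance_key` takes an EngineOutput), and the paths are made-up examples.

```python
from collections import Counter
from typing import Iterable, Tuple


def substance_key(mitre_technique: str, artifact_paths: Iterable[str]) -> Tuple[str, frozenset]:
    # frozenset is hashable, so the key can be counted by Counter.
    return (mitre_technique, frozenset(artifact_paths))


ordered = substance_key("T1059", ["/var/log/auth.log", "/tmp/payload"])
reordered = substance_key("T1059", ["/tmp/payload", "/var/log/auth.log"])
duplicated = substance_key("T1059", ["/tmp/payload", "/tmp/payload", "/var/log/auth.log"])
other_technique = substance_key("T1003", ["/tmp/payload", "/var/log/auth.log"])

# The first three keys are equal, so they land in one cluster of size 3;
# the different technique forms its own cluster of size 1.
clusters = Counter([ordered, reordered, duplicated, other_technique])
```

A list or tuple key would have split the first three candidates into separate clusters, which is the behaviour test_judge_uses_artifact_set_semantics_not_list_order guards against.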

Majority denominator correctness

The comment in the code is precise: "the total denominator is len(candidates), NOT len(real_indexed)". For n=3 with two empties, a majority still requires 2 of 3 (not 1 of 1 survivor). This is correct and security-critical: with a survivors-only denominator, a single surviving candidate would vet with only one real engine behind it. The test test_judge_drops_empty_artifact_candidates_from_clustering covers the two-empties-against-one case and correctly asserts CONTESTED.
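The effect of the denominator choice can be worked through numerically. In this sketch `reaches_majority` is a hypothetical helper, and the `survivors_only` flag exists purely to show the rejected alternative.

```python
def reaches_majority(cluster_size: int, n_original: int, n_empty: int,
                     survivors_only: bool = False) -> bool:
    # The reviewed behaviour divides by the ORIGINAL candidate count.
    # survivors_only=True models the rejected alternative that ignores
    # empty-artifact candidates in the denominator.
    denominator = n_original - n_empty if survivors_only else n_original
    return cluster_size >= denominator // 2 + 1


# n=3, two engines returned nothing, one real candidate remains.
with_original_count = reaches_majority(1, n_original=3, n_empty=2)
with_survivors_only = reaches_majority(1, n_original=3, n_empty=2, survivors_only=True)
```

`with_original_count` is `False` (1 is below the threshold of 2), while `with_survivors_only` is `True` (1 of 1), which is exactly the single-engine vetting the original-count denominator rules out.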

verify() Protocol conformance after W3.A.3

The VerifierStrategy Protocol signature is:

def verify(self, *, case_id: str, hypothesis: str, mitre_technique: str, evidence_summary: str) -> VerdictResult

UniversalSelfConsistency.verify() adds candidates and vetted_status as optional kwargs with defaults. This is backward-compatible: a VerifierStrategy-typed dispatch site that calls verify(case_id=..., hypothesis=..., mitre_technique=..., evidence_summary=...) will get NotImplementedError (correct — USC without candidates has nothing to judge). The quorum_node at W2.B must call with candidates= populated. This is a deliberate design choice and is clearly documented. Worth keeping the W2.B wiring note visible at that dispatch site when it lands.

One point checked: the TYPE_CHECKING-guarded import of EngineOutput in strategy.py. EngineOutput is imported only under TYPE_CHECKING, so the name does not exist at runtime. This is fine because from __future__ import annotations is present (PEP 563 postponed evaluation): the candidate: EngineOutput annotation on _substance_key, the Sequence[EngineOutput] annotation in judge()'s signature, and verify()'s candidates parameter are all stored as strings and never evaluated. The Counter(keys) and frozenset(candidate.artifact_paths) calls operate on the actual EngineOutput dataclass instances at runtime without needing the annotation resolved. No functional issue. Confirming this is intentional: the TYPE_CHECKING guard keeps strategy.py free of a circular or heavy import on the hot path.
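The import pattern can be reproduced in a few lines. This sketch quotes the annotation by hand, which has the same runtime effect as the module-wide from __future__ import annotations in strategy.py; the module name `hypothetical_engines` is invented and is never imported at runtime.

```python
from types import SimpleNamespace
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only static type checkers follow this branch; at runtime the
    # (hypothetical) module is never imported.
    from hypothetical_engines import EngineOutput


def substance_key(candidate: "EngineOutput") -> tuple:
    # The body relies on attribute access only, so any object with these
    # two attributes works even though EngineOutput is undefined here.
    return (candidate.mitre_technique, frozenset(candidate.artifact_paths))


# The annotation is stored as a plain string and never evaluated.
stored_annotation = substance_key.__annotations__["candidate"]

# A duck-typed stand-in is enough to exercise the function at runtime.
key = substance_key(SimpleNamespace(mitre_technique="T1059", artifact_paths=["/tmp/x"]))
```

`stored_annotation` is the string `"EngineOutput"`, and the call succeeds without the guarded import ever running. Anything that resolves annotations at runtime (for example `typing.get_type_hints`) would need the real import available.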

test_strategy_protocol.py updates

Two tests correctly updated:

  • test_strategy_returns_verdict_result now exercises the candidates-supplied path on verify() with a 2-of-3 majority and asserts result.status != CONTESTED. The former assertion == VETTED_CLOUD is removed because the stub-era assumption no longer holds: VETTED_CLOUD is only the default vetted_status, and the test deliberately stays mode-neutral.
  • test_usc_stub_marker_cleared_post_w3a3 inverts the old guard test to assert STUB_FOR == "". Correct regression guard for the new reality.

§3.7 commit discipline

Two commits, RED then GREEN, both carry [W3.A.3]. Prefixes test(verification): and feat(verification): — allowed. The RED commit subject was truncated in GitHub display (test(verification): UniversalSelfConsistency Chen-2023 invariants RED…) but the commit body is complete. No --no-verify. No watermarks.

§3.8 dependencies

No new dependencies. collections.Counter and collections.abc.Sequence are stdlib.

§3.10 no-mocks

No Mock / patch / replay anywhere in the diff. All tests drive judge() and verify() directly with real EngineOutput dataclass instances. The LLM-judge fallback path is honestly deferred via a CONTESTED return, not a mock.

Actionable items

  1. (Nit, non-blocking) verify() has _ = (case_id, hypothesis, mitre_technique, evidence_summary) in the candidates is None branch to suppress unused-variable warnings. This is fine but slightly obscures that those inputs will be consumed by the LLM-judge in W2.B. A comment # W2.B will use these to construct the LLM-judge prompt would make the intent clear for the implementer landing that work.
  2. (Non-blocking) Checked whether USCJudgement is exported via the module-level __all__: it is ("USCJudgement" appears in __all__ in the feat/W3.A.3 diff). Confirmed present. No action needed.
  3. (Non-blocking, forward) When W2.B wires the LLM-judge, the judge() no-majority path must also be exercised end-to-end in the eval suite. The current test correctly marks it as CONTESTED; that test will need updating (or a new test added) when the LLM transport lands.

Recommendation: APPROVE. Substance-clustering is correct, majority denominator uses original count (not survivors), empty-set rule carries from the cross-engine strategies, STUB_FOR cleared, all 12 new tests plus 45 inherited pass, commit hygiene clean.
