feat(observability): planner CoT capture (gzipped ledger + 8KB Langfuse) [W2.D.3] by TimothyVang · Pull Request #46 · TimothyVang/Verdict

TimothyVang · 2026-05-02T13:25:19Z

Summary

Adds verdict/planning/cot_capture.py with two entry points:
- extract_cot(response_text, mode) — strips Qwen3 <think>…</think> tags for air-gap mode; passthrough for cloud (Claude SDK). Handles nested/malformed tags via re.sub cleanup.
- capture_planner_cot(cot_text, case_id, ledger, mode) -> CotCaptureResult — gzip-compresses, SHA-256 hashes (over compressed bytes, not original), base64-encodes, writes planner_cot LedgerEntry, returns CotCaptureResult with first 8192 bytes as langfuse_span_payload.
CotCaptureResult(ledger_entry_id, langfuse_span_payload, cot_gzip_sha256) — the graph node attaches langfuse_span_payload to the Langfuse span attribute (live Langfuse API call not wired here per §3.10).
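
The storage steps listed above (gzip, SHA-256 over the compressed bytes, base64) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `encode_cot_payload` and `decode_cot_payload` are hypothetical helper names, and only the `cot_gzip` and `cot_gzip_sha256` keys are taken from the PR text.

```python
import base64
import gzip
import hashlib


def encode_cot_payload(cot_text: str) -> dict[str, str]:
    """Hypothetical helper: build the two ledger payload fields for a CoT string."""
    # Compress the UTF-8 bytes at the maximum level, as the PR describes.
    compressed = gzip.compress(cot_text.encode("utf-8"), compresslevel=9)
    # Hash the bytes that are actually stored, not the plaintext.
    # The gzip header can embed a timestamp, so this digest is tied to the
    # stored blob and is not expected to be reproducible from the plaintext.
    digest = hashlib.sha256(compressed).hexdigest()
    return {
        "cot_gzip": base64.b64encode(compressed).decode("ascii"),  # JSON-safe
        "cot_gzip_sha256": digest,
    }


def decode_cot_payload(payload: dict[str, str]) -> str:
    """Hypothetical helper: verify the digest, then recover the original text."""
    compressed = base64.b64decode(payload["cot_gzip"])
    if hashlib.sha256(compressed).hexdigest() != payload["cot_gzip_sha256"]:
        raise ValueError("cot_gzip_sha256 does not match the stored bytes")
    return gzip.decompress(compressed).decode("utf-8")
```

Hashing the compressed bytes means a reader can verify integrity before decompressing anything, which is the point of "hash what you store".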

Test plan

pytest tests/planning/test_cot_capture.py — 11 tests GREEN
All 28 planning tests GREEN (W2.D.1 + W2.D.2 + W2.D.3)
ruff check verdict/ tests/planning/ — clean
RED commit precedes GREEN commit

…ED [W2.D.3]

Failing tests per BUILD_PLAN W2.D.3.a:
- test_gzipped_cot_in_ledger: payload["cot_gzip"] base64+gzip roundtrip; payload["cot_gzip_sha256"] SHA-256 over compressed bytes
- test_8kb_attached_to_langfuse_span: CotCaptureResult.langfuse_span_payload <= 8192 bytes, first 8KB of original uncompressed CoT
- test_extract_cloud_cot_returns_full_text: cloud mode = no stripping
- test_extract_airgap_cot_strips_think_tags: Qwen3 <think>...</think> unwrap
- test_extract_airgap_cot_passthrough_if_no_think_tags: no-op for plain text
- test_cot_sha256_is_over_compressed_bytes: hash discipline check
- test_cot_capture_result_has_entry_id: ledger_entry_id round-trip

Currently RED: verdict.planning.cot_capture missing.

…se) [W2.D.3]

Adds verdict/planning/cot_capture.py:

extract_cot(response_text, mode):
- mode="cloud": passthrough (Claude SDK responses have no <think> tags)
- mode="airgap": strips Qwen3 <think>...</think> wrapper; nested tags are removed from inner content via re.sub; no-tag passthrough for GLM-4.5-Air non-thinking mode

capture_planner_cot(cot_text, case_id, ledger, mode) -> CotCaptureResult:
- gzip-compresses CoT at level 9
- SHA-256 over compressed bytes (hash what you store, not what you discard)
- base64-encodes for JSON-safe ledger storage
- writes planner_cot LedgerEntry with cot_gzip + cot_gzip_sha256 + metadata
- returns CotCaptureResult(ledger_entry_id, langfuse_span_payload, sha256) where langfuse_span_payload = first 8192 bytes of original text

ARCHITECTURE.md §9: "planner_cot" is the 13th canonical event type.
ARCHITECTURE.md §9: "planner CoT capture goes to ledger AND first 8KB to Langfuse span."

All 28 planning tests GREEN (11 new); ruff clean.
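
The `extract_cot` behaviour described in this commit message could look roughly like the sketch below. It is an assumption-laden reconstruction, not the committed code: the function name `extract_cot_sketch`, the two regex constants, and the choice to trim surrounding whitespace after unwrapping are all illustrative.

```python
import re

# Greedy match so the span runs from the first <think> to the last </think>,
# which keeps nested blocks inside a single capture group.
_THINK_BLOCK = re.compile(r"<think>(.*)</think>", re.DOTALL)
# Matches any leftover opening or closing tag for the inner cleanup pass.
_THINK_TAG = re.compile(r"</?think>")


def extract_cot_sketch(response_text: str, *, mode: str) -> str:
    """Illustrative stand-in for extract_cot; not the code from this PR."""
    if mode != "airgap":
        # Cloud responses carry no <think> wrapper, so return them unchanged.
        return response_text
    match = _THINK_BLOCK.search(response_text)
    if match is None:
        # No complete pair: plain text passes through; stray tags are dropped.
        return _THINK_TAG.sub("", response_text)
    # Unwrap the outer pair, then remove any nested tags from the inner text.
    return _THINK_TAG.sub("", match.group(1)).strip()
```

For example, `extract_cot_sketch("<think>a<think>b</think>c</think>final", mode="airgap")` returns `"abc"`, while the same input with `mode="cloud"` is returned untouched.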

TimothyVang · 2026-05-02T13:37:00Z

W2.D.3 — Planner CoT capture (gzipped ledger + 8KB Langfuse)

CI gate: 28/28 GREEN (11 new + 17 from W2.D.1+D.2). Ruff: clean.

PASS — hard rules

| Check | Status |
| --- | --- |
| `extract_cot(text, mode="cloud")` passthrough, no `<think>` stripping | PASS |
| `extract_cot(text, mode="airgap")` strips Qwen3 `<think>…</think>` | PASS |
| No `<think>` tags → passthrough (non-thinking mode / GLM-4.5-Air) | PASS |
| Nested/malformed `<think>` tags handled via greedy regex + inner cleanup | PASS |
| `gzip.compress` at level 9; SHA-256 over compressed bytes, not original | PASS |
| base64-encoded for JSON-safe ledger storage | PASS |
| `CotCaptureResult.langfuse_span_payload` = first 8192 characters of original CoT | PASS (see bug below) |
| `LedgerEntry(event_type="planner_cot")` written with `cot_gzip` + `cot_gzip_sha256` payload | PASS |
| `CotCaptureResult.ledger_entry_id` matches written entry | PASS |
| §3.7 RED `4bef12b` precedes GREEN `59c9ad5`, both carry `[W2.D.3]` | PASS |
| §3.10 no mocks, pure Python gzip/hashlib, Langfuse live attach deferred per §3.10 | PASS |

BUG — `langfuse_span_payload` truncation uses character count, not byte count

ARCHITECTURE.md §9 says "first 8KB to Langfuse span attribute." The implementation does:

langfuse_span_payload = cot_text[:LANGFUSE_SPAN_MAX_BYTES]   # character slice

The test validates len(span_bytes) <= 8192 where span_bytes = result.langfuse_span_payload.encode("utf-8"). For ASCII CoT the character count and the byte count are identical, so the test passes. For Qwen3 thinking output that includes CJK characters (common in multilingual reasoning), each character is typically 3 bytes in UTF-8, so an 8192-character slice of CJK text is 24,576 bytes, triple the limit. The Langfuse span attribute may then be rejected or silently truncated server-side.

Fix — truncate by encoded byte count, not character count:

encoded = cot_text.encode("utf-8")
langfuse_span_payload = encoded[:LANGFUSE_SPAN_MAX_BYTES].decode("utf-8", errors="ignore")

The test should be updated to use a fixture containing multi-byte characters to prevent this regressing.
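
A regression test along these lines would exercise the multi-byte case. This is a sketch only: `truncate_utf8` is a hypothetical helper wrapping the two-line fix above, and the test calls it directly rather than going through `capture_planner_cot`, whose fixtures are not shown in this review.

```python
LANGFUSE_SPAN_MAX_BYTES = 8192  # constant name as used in the snippet above


def truncate_utf8(text: str, max_bytes: int = LANGFUSE_SPAN_MAX_BYTES) -> str:
    """Hypothetical helper: keep at most max_bytes of UTF-8, never half a character."""
    encoded = text.encode("utf-8")
    # Slicing bytes can cut a multi-byte character in two; errors="ignore"
    # drops the incomplete trailing sequence instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def test_span_payload_is_byte_bounded_for_cjk() -> None:
    cot_text = "推理" * 5000  # 10,000 characters, 3 bytes each in UTF-8

    # The character slice from the current implementation overshoots the limit.
    assert len(cot_text[:LANGFUSE_SPAN_MAX_BYTES].encode("utf-8")) == 24576

    payload = truncate_utf8(cot_text)
    assert len(payload.encode("utf-8")) <= LANGFUSE_SPAN_MAX_BYTES
    # 8192 // 3 == 2730 whole characters fit; the partial one is dropped.
    assert len(payload) == 2730
    assert cot_text.startswith(payload)
```

The byte limit falls two bytes into a three-byte character here, so the fixture also confirms that the payload never ends in a split character.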

ISSUE — `capture_planner_cot` has no test coverage for `mode="airgap"`

All capture_planner_cot invocations in test_cot_capture.py use the default mode="cloud". The mode parameter is propagated to the LedgerEntry.mode_at_case_init field, so a test with mode="airgap" or mode="dual" is needed to confirm that the mode is threaded through correctly. A single additional test would cover this:

def test_capture_planner_cot_mode_propagated_to_ledger_entry() -> None:
    ledger = InMemoryLedger()
    capture_planner_cot(cot_text="reasoning", case_id="case-mode", ledger=ledger, mode="airgap")
    assert ledger.entries[0].mode_at_case_init == "airgap"

MINOR — `extract_cot` `mode` parameter typed as `str` not `Mode`

extract_cot(response_text: str, *, mode: str) accepts any string. The capture_planner_cot call correctly types mode: Mode (the Literal["cloud","airgap","dual"]). Making extract_cot also accept Mode (or at minimum document that only "cloud" and "airgap" are valid) prevents a caller passing "dual" and getting cloud passthrough silently. The "dual" mode has both Claude and Qwen3 lanes; the caller should extract from the appropriate lane before calling extract_cot. A docstring note or a ValueError guard for unknown mode values would close this.
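
One possible shape for the guard is sketched below. The `Mode` literal is as quoted above; `ExtractMode` and `validate_extract_mode` are hypothetical names introduced here for illustration and do not exist in the PR.

```python
from typing import Literal, cast, get_args

Mode = Literal["cloud", "airgap", "dual"]  # as quoted in this review
ExtractMode = Literal["cloud", "airgap"]   # hypothetical narrower alias


def validate_extract_mode(mode: str) -> ExtractMode:
    """Hypothetical guard: reject any mode extract_cot cannot handle directly."""
    allowed = get_args(ExtractMode)  # ("cloud", "airgap")
    if mode not in allowed:
        raise ValueError(
            f"extract_cot supports modes {allowed}, got {mode!r}; "
            "for 'dual', extract from the appropriate lane first"
        )
    return cast(ExtractMode, mode)
```

Calling this at the top of `extract_cot` would turn the silent cloud passthrough for "dual" into an explicit error, and typing the parameter as `ExtractMode` would let a type checker catch the same mistake earlier.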

CARRY-OVER — `EVTX_4624` and `USNJRNL` absent from `ArtifactClass`

Still absent; must be resolved before W1.B.9/W1.B.10 caveat-validator PRs.

One blocking bug (byte vs char truncation), one non-blocking coverage gap, one minor type annotation issue.

TimothyVang added 2 commits May 2, 2026 08:23

TimothyVang wants to merge 2 commits into feat/W2.D.2-critique-verdict-schema-ledger from feat/W2.D.3-planner-cot-capture
