feat(observability): planner CoT capture (gzipped ledger + 8KB Langfuse) [W2.D.3]#46

Open
TimothyVang wants to merge 2 commits into feat/W2.D.2-critique-verdict-schema-ledger from feat/W2.D.3-planner-cot-capture

@TimothyVang

Owner

Summary

  • Adds verdict/planning/cot_capture.py with two entry points:
    • extract_cot(response_text, mode) — strips Qwen3 <think>…</think> tags for air-gap mode; passthrough for cloud (Claude SDK). Handles nested/malformed tags via re.sub cleanup.
    • capture_planner_cot(cot_text, case_id, ledger, mode) -> CotCaptureResult — gzip-compresses, SHA-256 hashes (over compressed bytes, not original), base64-encodes, writes planner_cot LedgerEntry, returns CotCaptureResult with first 8192 bytes as langfuse_span_payload.
  • CotCaptureResult(ledger_entry_id, langfuse_span_payload, cot_gzip_sha256) — the graph node attaches langfuse_span_payload to the Langfuse span attribute (live Langfuse API call not wired here per §3.10).
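The extraction behaviour described above can be sketched as follows. This is a minimal illustration, not the code in verdict/planning/cot_capture.py: the function name extract_cot_sketch and both regex patterns are assumptions, and it assumes "unwrap" means returning the text inside the outermost <think>…</think> pair.

```python
import re

# Hypothetical patterns; the PR's actual regexes are not shown in this thread.
# Greedy match so an outer wrapper spans any nested <think> tags inside it.
_THINK_BLOCK = re.compile(r"<think>(.*)</think>", re.DOTALL)
_STRAY_TAG = re.compile(r"</?think>")


def extract_cot_sketch(response_text: str, *, mode: str) -> str:
    """Return chain-of-thought text for the given mode (illustrative only)."""
    if mode == "cloud":
        # Cloud responses carry no <think> wrapper, so nothing is stripped.
        return response_text
    match = _THINK_BLOCK.search(response_text)
    if match is None:
        # No well-formed wrapper: drop stray tags if any. Plain text is unchanged.
        return _STRAY_TAG.sub("", response_text)
    # Unwrap the outer pair, then remove nested tags left in the inner content.
    return _STRAY_TAG.sub("", match.group(1)).strip()
```

With the greedy pattern, a nested input such as `<think>a<think>b</think>c</think>done` yields the inner text with the nested tags removed.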

Test plan

  • pytest tests/planning/test_cot_capture.py — 11 tests GREEN
  • All 28 planning tests GREEN (W2.D.1 + W2.D.2 + W2.D.3)
  • ruff check verdict/ tests/planning/ — clean
  • RED commit precedes GREEN commit

TimothyVang added 2 commits May 2, 2026 08:23
…ED [W2.D.3]

Failing tests per BUILD_PLAN W2.D.3.a:
- test_gzipped_cot_in_ledger: payload["cot_gzip"] base64+gzip roundtrip;
  payload["cot_gzip_sha256"] SHA-256 over compressed bytes
- test_8kb_attached_to_langfuse_span: CotCaptureResult.langfuse_span_payload
  <= 8192 bytes, first 8KB of original uncompressed CoT
- test_extract_cloud_cot_returns_full_text: cloud mode = no stripping
- test_extract_airgap_cot_strips_think_tags: Qwen3 <think>...</think> unwrap
- test_extract_airgap_cot_passthrough_if_no_think_tags: no-op for plain text
- test_cot_sha256_is_over_compressed_bytes: hash discipline check
- test_cot_capture_result_has_entry_id: ledger_entry_id round-trip

Currently RED: verdict.planning.cot_capture missing.
…se) [W2.D.3]

Adds verdict/planning/cot_capture.py:

extract_cot(response_text, mode):
- mode="cloud": passthrough (Claude SDK responses have no <think> tags)
- mode="airgap": strips Qwen3 <think>...</think> wrapper; nested tags
  are removed from inner content via re.sub; no-tag passthrough for
  GLM-4.5-Air non-thinking mode

capture_planner_cot(cot_text, case_id, ledger, mode) -> CotCaptureResult:
- gzip-compresses CoT at level 9
- SHA-256 over compressed bytes (hash what you store, not what you discard)
- base64-encodes for JSON-safe ledger storage
- writes planner_cot LedgerEntry with cot_gzip + cot_gzip_sha256 + metadata
- returns CotCaptureResult(ledger_entry_id, langfuse_span_payload, sha256)
  where langfuse_span_payload = first 8192 bytes of original text

ARCHITECTURE.md §9: "planner_cot" is the 13th canonical event type.
ARCHITECTURE.md §9: "planner CoT capture goes to ledger AND first 8KB
to Langfuse span."

All 28 planning tests GREEN (11 new); ruff clean.
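The compress, hash, encode sequence in the commit message can be sketched with the standard library alone. The helper names build_cot_payload and verify_cot_payload are hypothetical, and the real function also writes a LedgerEntry, which is omitted here. Only the payload keys cot_gzip and cot_gzip_sha256 come from the PR.

```python
import base64
import gzip
import hashlib


def build_cot_payload(cot_text: str) -> dict[str, str]:
    """Gzip at level 9, hash the compressed bytes, base64-encode for JSON."""
    compressed = gzip.compress(cot_text.encode("utf-8"), compresslevel=9)
    return {
        "cot_gzip": base64.b64encode(compressed).decode("ascii"),
        # The hash covers the stored (compressed) bytes, not the original text.
        "cot_gzip_sha256": hashlib.sha256(compressed).hexdigest(),
    }


def verify_cot_payload(payload: dict[str, str]) -> str:
    """Check the stored hash, then recover the original CoT text."""
    compressed = base64.b64decode(payload["cot_gzip"])
    if hashlib.sha256(compressed).hexdigest() != payload["cot_gzip_sha256"]:
        raise ValueError("cot_gzip_sha256 does not match stored bytes")
    return gzip.decompress(compressed).decode("utf-8")
```

One reason to hash the compressed bytes: the gzip header can embed a timestamp, so compressing the same text twice may produce different bytes. A hash over the stored blob can be verified without recompressing anything.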
@TimothyVang

Owner Author

W2.D.3 — Planner CoT capture (gzipped ledger + 8KB Langfuse)

CI gate: 28/28 GREEN (11 new + 17 from W2.D.1+D.2). Ruff: clean.

PASS — hard rules

| Check | Status |
| --- | --- |
| extract_cot(text, mode="cloud") passthrough, no <think> stripping | PASS |
| extract_cot(text, mode="airgap") strips Qwen3 <think>…</think> | PASS |
| No <think> tags → passthrough (non-thinking mode / GLM-4.5-Air) | PASS |
| Nested/malformed <think> tags handled via greedy regex + inner cleanup | PASS |
| gzip.compress at level 9; SHA-256 over compressed bytes, not original | PASS |
| base64-encoded for JSON-safe ledger storage | PASS |
| CotCaptureResult.langfuse_span_payload = first 8192 characters of original CoT | PASS (see bug below) |
| LedgerEntry(event_type="planner_cot") written with cot_gzip + cot_gzip_sha256 payload | PASS |
| CotCaptureResult.ledger_entry_id matches written entry | PASS |
| §3.7 RED 4bef12b precedes GREEN 59c9ad5, both carry [W2.D.3] | PASS |
| §3.10 no mocks, pure Python gzip/hashlib, Langfuse live attach deferred per §3.10 | PASS |

BUG — langfuse_span_payload truncation uses character count, not byte count

ARCHITECTURE.md §9 says "first 8KB to Langfuse span attribute." The implementation does:

langfuse_span_payload = cot_text[:LANGFUSE_SPAN_MAX_BYTES]   # character slice

The test validates len(span_bytes) <= 8192, where span_bytes = result.langfuse_span_payload.encode("utf-8"). For ASCII-only CoT the character count and byte count are equal, so the test passes. Qwen3 thinking output often includes CJK characters (common in multilingual reasoning), and most of those take 3 bytes each in UTF-8. An 8192-character slice of such text can reach 24,576 bytes, triple the limit, so the Langfuse span attribute risks being rejected or silently truncated server-side.

Fix — truncate by encoded byte count, not character count:

encoded = cot_text.encode("utf-8")
langfuse_span_payload = encoded[:LANGFUSE_SPAN_MAX_BYTES].decode("utf-8", errors="ignore")

The test should be updated to use a fixture containing multi-byte characters to prevent this regressing.
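A regression test along these lines could pin the behaviour. This is a sketch: truncate_utf8 is a hypothetical helper wrapping the two-line fix above, and the fixture is a synthetic run of one 3-byte character rather than real Qwen3 output.

```python
LANGFUSE_SPAN_MAX_BYTES = 8192  # constant name as used in the implementation


def truncate_utf8(text: str, max_bytes: int = LANGFUSE_SPAN_MAX_BYTES) -> str:
    """Keep at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    # errors="ignore" drops the incomplete trailing sequence left by the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def test_truncation_respects_byte_limit_for_multibyte_text() -> None:
    cjk = "推" * 10_000  # each character is 3 bytes in UTF-8
    # The character slice overshoots the limit threefold.
    assert len(cjk[:LANGFUSE_SPAN_MAX_BYTES].encode("utf-8")) == 24_576
    truncated = truncate_utf8(cjk)
    assert len(truncated.encode("utf-8")) <= LANGFUSE_SPAN_MAX_BYTES
    # 8192 // 3 = 2730 whole characters; the 2 leftover bytes are dropped.
    assert truncated == "推" * 2730
```

Because the cut can land in the middle of a character, the payload may come out a byte or two under the limit. That seems preferable to emitting invalid UTF-8.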

ISSUE — capture_planner_cot has no test coverage for mode="airgap"

All capture_planner_cot invocations in test_cot_capture.py use the default mode="cloud". The mode parameter is propagated to the LedgerEntry.mode_at_case_init field, so a test with mode="airgap" or mode="dual" is needed to confirm that the mode is threaded through correctly. One additional test would cover this:

def test_capture_planner_cot_mode_propagated_to_ledger_entry() -> None:
    ledger = InMemoryLedger()
    capture_planner_cot(cot_text="reasoning", case_id="case-mode", ledger=ledger, mode="airgap")
    assert ledger.entries[0].mode_at_case_init == "airgap"

MINOR — extract_cot mode parameter typed as str not Mode

extract_cot(response_text: str, *, mode: str) accepts any string. The capture_planner_cot call correctly types mode: Mode (the Literal["cloud","airgap","dual"]). Making extract_cot also accept Mode (or at minimum document that only "cloud" and "airgap" are valid) prevents a caller passing "dual" and getting cloud passthrough silently. The "dual" mode has both Claude and Qwen3 lanes; the caller should extract from the appropriate lane before calling extract_cot. A docstring note or a ValueError guard for unknown mode values would close this.
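One possible shape for the guard is sketched below. The alias ExtractMode and the helper check_extract_mode are hypothetical names, not part of the PR. The idea is a narrower Literal for static checking plus a runtime ValueError for anything else, including "dual".

```python
from typing import Literal

# Hypothetical narrower alias: only the two lanes extract_cot understands.
ExtractMode = Literal["cloud", "airgap"]


def check_extract_mode(mode: str) -> ExtractMode:
    """Reject modes extract_cot cannot interpret instead of falling through."""
    if mode == "cloud":
        return "cloud"
    if mode == "airgap":
        return "airgap"
    # "dual" lands here: the caller is expected to pick a lane first.
    raise ValueError(
        f"extract_cot supports mode 'cloud' or 'airgap', got {mode!r}"
    )
```

Calling this at the top of extract_cot would turn a silent cloud passthrough for "dual" into an immediate error.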

CARRY-OVER — EVTX_4624 and USNJRNL absent from ArtifactClass

Still absent; must be resolved before W1.B.9/W1.B.10 caveat-validator PRs.

One blocking bug (byte vs char truncation), one non-blocking coverage gap, one minor type annotation issue.
