feat(bench): continuity-hell v1 coding-200 pilot harness by Davincc77 · Pull Request #126 · Davincc77/klickdskill

Davincc77 · 2026-06-02T13:19:16Z

Summary

Phase 1 of the continuity benchmark programme: a reproducible,
scientifically-defensible pilot stress-test harness for one skill lane,
x.klickd/coding (real artifact packages/@klickd/core/starter-skills/coding.klickd).

This is a real 200-task stress test on a single skill to find weaknesses
in its carried continuity/governance behaviour. It is not an A/B/C/D study
(ABCD comes later, after corrections), not a public release, and makes
no scientific-proof or market claim. Public release stays v4.1.

What's added (`benchmarks/continuity-hell-v1/coding-200/`)

- `tasks.json`: exactly 200 unique tasks (CH1-COD-001..200), each
  with ≥3 vectors across nine dimensions (continuity, constraint respect,
  source discipline, governance/human-veto, security/no-leakage, skill
  activation, handoff, actionability, no hallucinated facts). No easy tasks.
  Governance vectors are read from the real coding.klickd via the SDK, so
  the dataset can't drift from the artifact under test.
- `generate_tasks.py`: deterministic, byte-stable generator (`--check`).
- `run_benchmark.py`: deterministic dry-run lanes (floor/ceiling,
  always labelled `is_real_llm: false`) plus a gated real-LLM lane that refuses
  without explicit approval and ships the provider call unwired on purpose.
- `score_outputs.py`: deterministic, LLM-free scorer.
- `BENCHMARK_PROTOCOL.md` / `scoring_rubric.md`: frozen protocol
  (hypotheses, conditions, model/temp, thresholds, anti-mirage gate) and rubric.
- `README.md`, `reproducibility.md`, `failure_analysis.md` template,
  dry-run results, and a `llm_x_klickd.BLOCKED.md` marker.
- `tests/test_continuity_coding200.py`: 21 validations.
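
For reviewers unfamiliar with byte-stable generation, here is a minimal sketch of how a `--check` mode can work. It is not the code in `generate_tasks.py`: the function names, the three-dimension subset, and the task fields are assumptions for illustration. Only the `CH1-COD-001..200` ID scheme comes from this PR.

```python
import hashlib
import json


def build_tasks(n=200):
    """Build tasks in a fixed order, with no randomness or timestamps."""
    # Hypothetical subset of the nine dimensions, for illustration only.
    vectors = ["continuity", "constraint_respect", "source_discipline"]
    return [
        {"id": f"CH1-COD-{i:03d}", "vectors": list(vectors)}
        for i in range(1, n + 1)
    ]


def serialize(tasks):
    """Canonical JSON bytes: sorted keys, fixed indent, trailing newline."""
    text = json.dumps(tasks, sort_keys=True, indent=2, ensure_ascii=False)
    return (text + "\n").encode("utf-8")


def check(committed):
    """Regenerate the dataset and compare it byte for byte."""
    return serialize(build_tasks()) == committed


committed = serialize(build_tasks())
digest = hashlib.sha256(committed).hexdigest()  # handy for a reproducibility note
```

Comparing bytes rather than parsed JSON means that even a whitespace or key-order change in the committed file fails the check.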

Anti-mirage guarantees

- No deterministic path ever emits `is_real_llm: true`; dry-run output carries
  a `not_real_label`.
- Real provider call is unwired (`NotImplementedError`) behind `--execute` +
  `XKLICKD_BENCHMARK_FULL_APPROVED=1` + a provider key.
- Real 200-task LLM execution is BLOCKED and was not run. See the BLOCKED marker
  for the exact blocker and required input.
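
The gate described above can be pictured roughly as follows. This is a sketch under assumptions, not the contents of `run_benchmark.py`: the provider env var names, `RealRunBlocked`, `gate_blockers`, `call_provider`, and `run_real_lane` are hypothetical. The `--execute` flag, `XKLICKD_BENCHMARK_FULL_APPROVED=1`, and the `NotImplementedError` behaviour are taken from this PR.

```python
import os

APPROVAL_VAR = "XKLICKD_BENCHMARK_FULL_APPROVED"  # named in this PR
PROVIDER_KEY_VARS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")  # assumed names


class RealRunBlocked(RuntimeError):
    """Raised when the real-LLM lane refuses to start."""


def gate_blockers(execute, env=None):
    """Return every unmet precondition, so the refusal names the exact blocker."""
    env = os.environ if env is None else env
    blockers = []
    if not execute:
        blockers.append("missing --execute")
    if env.get(APPROVAL_VAR) != "1":
        blockers.append(f"{APPROVAL_VAR}=1 not set")
    if not any(env.get(name) for name in PROVIDER_KEY_VARS):
        blockers.append("no provider key in environment")
    return blockers


def call_provider(prompt):
    """Deliberately unwired: even a fully approved run cannot reach a provider."""
    raise NotImplementedError("provider call is unwired on purpose")


def run_real_lane(execute, env=None):
    blockers = gate_blockers(execute, env)
    if blockers:
        raise RealRunBlocked("; ".join(blockers))
    return call_provider("task prompt")
```

Collecting all blockers instead of failing on the first keeps the refusal message complete, and the unwired call is a second barrier behind the gate.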

Test plan

- `pytest tests/test_continuity_coding200.py -q`: 21 passed
- `pytest tests/ -q`: 309 passed (no regressions; +21 new)
- `pytest tests/test_dev_preview.py -q`: 5 passed
- supply-chain tests: 102 passed
- `generate_tasks.py --check`: dataset byte-stable
- dry-run lanes diverge: baseline 0/200, x.klickd 200/200
- real-LLM lane refuses with exact blocker (no provider called)

🤖 Generated by Computer

Add a reproducible, scientifically-defensible pilot stress-test harness for a single skill lane (x.klickd/coding). Phase 1 finds weaknesses in carried continuity/governance behaviour over 200 adversarial multi-vector tasks; it is not ABCD and makes no public/market or scientific-proof claim.

- tasks.json: exactly 200 unique tasks, >=3 vectors each across 9 dimensions, no easy tasks; governance vectors read from the real coding.klickd via SDK.
- generate_tasks.py: deterministic, byte-stable dataset generator (--check).
- run_benchmark.py: deterministic dry-run lanes (floor/ceiling, labelled not-real) + a gated real-LLM lane that refuses without explicit approval and ships the provider call unwired (anti-mirage).
- score_outputs.py: deterministic, LLM-free scorer.
- BENCHMARK_PROTOCOL.md / scoring_rubric.md: frozen protocol and rubric.
- README, reproducibility, failure_analysis template, dry-run results, real-lane BLOCKED marker.
- tests/test_continuity_coding200.py: 21 validations.

Real 200-task LLM execution is BLOCKED pending explicit human approval of provider spend and a wired output->contract mapping.

Add provider-key leakage guardrails before any future real LLM run:

- secret_guard.py: single source of truth for detecting provider-key shapes, auth headers, high-entropy tokens, and any live provider env var value; redacts to [REDACTED:<kind>] and asserts payloads clean.
- run_benchmark.py: redact + assert_clean every envelope before writing; new value-blind `preflight` mode (key present by name only + results/ clean); preflight wired into the real-LLM gate.
- scripts/check_benchmark_secret_leakage.py: CI-friendly artifact scanner, exits non-zero on any finding, prints only redacted previews.
- Docs (README, BENCHMARK_PROTOCOL sections 3/7/10, reproducibility, BLOCKED marker): keys live only in private env/secret manager, never committed, logged, or in artifacts; results record provider/model/run_id only.
- tests/test_benchmark_secret_guard.py: 18 tests covering detection, redaction, preflight value-blindness, and that no live env var value is ever serialized to an artifact or stdout.

Real 200-task LLM lane remains BLOCKED pending explicit human go.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Davincc77 · 2026-06-02T13:31:14Z

Secret-safety guardrails added (commit `c0e13aa`)

Mandatory provider-key leakage guardrails are now in place before any future real LLM run. The real 200-task LLM lane remains BLOCKED — no provider was called and no real run was performed.

What changed

- `secret_guard.py`: single source of truth for detecting provider-key shapes (Anthropic/OpenAI/Google/Groq/AWS), Authorization/Bearer headers, high-entropy tokens, and any live provider env var value. It redacts matches to `[REDACTED:<kind>]` and provides `assert_clean()` to verify that payloads contain no secrets. Findings never echo the raw secret.
- `run_benchmark.py`: every output envelope is passed through `redact()` and then `assert_clean()` before any write, so no env var value, token, or header can be serialized to an artifact. A new value-blind `preflight` mode verifies that a provider key exists (reporting the env var name only, never the value) and that `results/` is secret-clean; preflight is wired into the real-LLM gate (§7 item 5).
- `scripts/check_benchmark_secret_leakage.py`: CI-friendly artifact scanner; exits non-zero on any finding and prints only redacted previews. It reuses `secret_guard`, so detection is defined in one place.
- Docs: README, `BENCHMARK_PROTOCOL.md` (§3 provenance, §7 gate item 5, new §10), `reproducibility.md`, and the BLOCKED marker now state that API keys live only in the private environment / secret manager and are never committed, logged, or written to artifacts; results record provider/model/run_id only, never headers or tokens.
- `tests/test_benchmark_secret_guard.py`: 18 tests covering fake-key detection, redaction, preflight value-blindness, and that a live env var value is never serialized to an artifact or stdout.
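
To make the redact-then-assert flow concrete, here is a small sketch. It is not the actual `secret_guard.py`: the regular expressions are loose approximations of key prefixes (real provider key formats vary and change), only three providers are shown, and `find_secrets` and `write_envelope` are hypothetical helper names.

```python
import json
import re

# Illustrative patterns only. Order matters: the more specific Anthropic
# prefix is redacted before the generic "sk-" pattern can match it.
PATTERNS = {
    "anthropic_key": re.compile(r"sk-ant-[A-Za-z0-9_\-]{8,}"),
    "openai_key": re.compile(r"sk-[A-Za-z0-9_\-]{16,}"),
    "google_key": re.compile(r"AIza[A-Za-z0-9_\-]{16,}"),
    "bearer": re.compile(r"Bearer\s+[A-Za-z0-9._\-]{8,}"),
}


def redact(text):
    """Replace every secret-shaped match with a [REDACTED:<kind>] marker."""
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{kind}]", text)
    return text


def find_secrets(text):
    """Return only the kinds that matched, never the matched text."""
    return [kind for kind, pattern in PATTERNS.items() if pattern.search(text)]


def assert_clean(payload):
    """Raise if the serialized payload still contains a secret-shaped string."""
    kinds = find_secrets(json.dumps(payload))
    if kinds:
        raise ValueError(f"secret-like content found: {kinds}")


def write_envelope(envelope):
    """Redact, verify, then serialize; the caller writes the returned string."""
    cleaned = json.loads(redact(json.dumps(envelope)))
    assert_clean(cleaned)
    return json.dumps(cleaned, sort_keys=True)
```

Running `assert_clean` after `redact` looks redundant, but it catches a pattern that detection knows about and redaction misses, which is the failure mode that would otherwise leak silently.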

Validations

- `pytest tests/test_benchmark_secret_guard.py -q`: 18 passed
- `pytest tests/test_continuity_coding200.py -q`: 21 passed (no regressions)
- `pytest tests/test_dev_preview.py -q`: 5 passed
- supply-chain tests (`-k supply_chain`): 102 passed
- `run_benchmark.py preflight`: green
- `check_benchmark_secret_leakage.py`: reports clean
- raw grep for `sk-ant-`/`sk-`/`AIza` patterns in benchmark + scripts source: no matches
- the live env key value appears in no tracked/added file

Real-run status

Still BLOCKED. A real 200-task LLM run requires everything it did before plus the new secret gate: explicit human go-ahead, `--execute`, `XKLICKD_BENCHMARK_FULL_APPROVED=1`, a provider key, a reviewed `_call_provider` output→contract mapping, and a green secret-safety preflight. No run will happen until you explicitly approve.
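
For context on what a "green secret-safety preflight" could mean in practice, here is a rough value-blind sketch. It is an illustration under assumptions rather than the shipped preflight: the env var names, the report fields, and the plain substring scan are all hypothetical, and the real check reuses `secret_guard` for detection.

```python
from pathlib import Path

PROVIDER_KEY_VARS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")  # assumed names


def preflight(env, results_dir):
    """Report key presence by env var name only, and whether results are clean."""
    present = sorted(name for name in PROVIDER_KEY_VARS if env.get(name))
    live_values = [env[name] for name in present]  # used for matching, never returned

    dirty = []
    for path in sorted(Path(results_dir).rglob("*")):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            if any(value in text for value in live_values):
                dirty.append(path.name)  # file name only, no content preview

    return {
        "key_present": bool(present),
        "key_env_names": present,
        "results_clean": not dirty,
        "dirty_files": dirty,
        "ok": bool(present) and not dirty,
    }
```

The report is safe to print or log because it contains env var names and file names, never key values or file contents.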

🤖 Generated with Claude Code

Davincc77 · 2026-06-02T14:55:36Z

Closing immediately: benchmark work is internal/private and should not live in the public repo. No further public benchmark work will be pushed here.

klickd agent and others added 2 commits June 2, 2026 13:18

Davincc77 closed this Jun 2, 2026

Davincc77 deleted the bench/continuity-coding-200 branch June 2, 2026 14:56

Davincc77 wants to merge 2 commits into main from bench/continuity-coding-200
