feat(bench): continuity-hell v1 coding-200 pilot harness#126

Closed
Davincc77 wants to merge 2 commits into main from bench/continuity-coding-200

@Davincc77
Owner

Summary

Phase 1 of the continuity benchmark programme: a reproducible,
scientifically-defensible pilot stress-test harness for one skill lane,
x.klickd/coding (real artifact: packages/@klickd/core/starter-skills/coding.klickd).

This is a real 200-task stress test on a single skill to find weaknesses
in its carried continuity/governance behaviour. It is not an A/B/C/D study
(ABCD comes later, after corrections), not a public release, and makes
no scientific-proof or market claim. Public release stays v4.1.

What's added (benchmarks/continuity-hell-v1/coding-200/)

  • tasks.json — exactly 200 unique tasks (CH1-COD-001..200), each
    with ≥3 vectors across nine dimensions (continuity, constraint respect,
    source discipline, governance/human-veto, security/no-leakage, skill
    activation, handoff, actionability, no hallucinated facts). No easy tasks.
    Governance vectors are read from the real coding.klickd via the SDK, so
    the dataset can't drift from the artifact under test.
  • generate_tasks.py — deterministic, byte-stable generator (--check).
  • run_benchmark.py — deterministic dry-run lanes (floor/ceiling,
    always labelled is_real_llm: false) + a gated real-LLM lane that refuses
    without explicit approval and ships the provider call unwired on purpose.
  • score_outputs.py — deterministic, LLM-free scorer.
  • BENCHMARK_PROTOCOL.md / scoring_rubric.md — frozen protocol
    (hypotheses, conditions, model/temp, thresholds, anti-mirage gate) and rubric.
  • README.md, reproducibility.md, failure_analysis.md template,
    dry-run results, and a llm_x_klickd.BLOCKED.md marker.
  • tests/test_continuity_coding200.py — 21 validations.
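
The byte-stable generation and `--check` idea described above can be sketched as follows. This is a minimal illustration, not the contents of generate_tasks.py: the function names, the three placeholder dimension labels, and the JSON layout are all assumptions. The point is only that a generator with no clock or random input, serialized canonically, can be compared byte-for-byte against the committed file.

```python
import hashlib
import json


def build_tasks(n: int = 200) -> list[dict]:
    """Build n tasks from fixed inputs only (no clock, no randomness)."""
    # Placeholder vector labels; the real dataset draws on nine dimensions.
    vectors = ["continuity", "constraint_respect", "source_discipline"]
    return [
        {"id": f"CH1-COD-{i:03d}", "vectors": vectors}
        for i in range(1, n + 1)
    ]


def serialize(tasks: list[dict]) -> bytes:
    """Canonical JSON: sorted keys, fixed indent, trailing newline, UTF-8."""
    text = json.dumps(tasks, sort_keys=True, indent=2, ensure_ascii=False)
    return (text + "\n").encode("utf-8")


def check(committed: bytes) -> bool:
    """Return True when regenerating reproduces the committed bytes exactly."""
    fresh = hashlib.sha256(serialize(build_tasks())).hexdigest()
    return fresh == hashlib.sha256(committed).hexdigest()
```

A `--check` flag would then read the committed tasks.json as bytes, call `check`, and exit non-zero on a mismatch instead of rewriting the file.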

Anti-mirage guarantees

  • No deterministic path ever emits is_real_llm: true; dry-run output carries
    a not_real_label.
  • Real provider call is unwired (NotImplementedError) behind --execute +
    XKLICKD_BENCHMARK_FULL_APPROVED=1 + a provider key.
  • Real 200-task LLM execution is BLOCKED — not run. See the BLOCKED marker
    for the exact blocker and required input.
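
A rough sketch of how such a gate could behave, under stated assumptions: `XKLICKD_BENCHMARK_FULL_APPROVED` and the `--execute` flag come from the description above, while `PROVIDER_API_KEY`, `GateRefused`, `real_lane_blockers`, and `run_real_lane` are hypothetical names invented for illustration. The sketch lists every unmet condition and, even when all are met, still stops at an unwired provider call.

```python
APPROVAL_ENV = "XKLICKD_BENCHMARK_FULL_APPROVED"  # name taken from the PR text
PROVIDER_KEY_ENV = "PROVIDER_API_KEY"  # hypothetical env var name


class GateRefused(RuntimeError):
    """Raised when the real-LLM lane is requested without full approval."""


def real_lane_blockers(execute: bool, env: dict[str, str]) -> list[str]:
    """Return every unmet condition; an empty list means the gate is open."""
    blockers = []
    if not execute:
        blockers.append("--execute not passed")
    if env.get(APPROVAL_ENV) != "1":
        blockers.append(f"{APPROVAL_ENV}=1 not set")
    if not env.get(PROVIDER_KEY_ENV):
        # Report the variable name only, never its value.
        blockers.append(f"{PROVIDER_KEY_ENV} not set")
    return blockers


def call_provider(prompt: str) -> str:
    """Deliberately unwired: no network call exists in this sketch."""
    raise NotImplementedError("provider call intentionally unwired")


def run_real_lane(execute: bool, env: dict[str, str]) -> str:
    """Refuse with the exact blockers, or fall through to the unwired call."""
    blockers = real_lane_blockers(execute, env)
    if blockers:
        raise GateRefused("; ".join(blockers))
    return call_provider("task prompt")
```

Passing the environment in as a plain dict (for example `dict(os.environ)`) keeps the gate testable without touching the real process environment.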

Test plan

  • pytest tests/test_continuity_coding200.py -q — 21 passed
  • pytest tests/ -q — 309 passed (no regressions; +21 new)
  • pytest tests/test_dev_preview.py -q — 5 passed
  • supply-chain tests — 102 passed
  • generate_tasks.py --check — dataset byte-stable
  • dry-run lanes diverge: baseline 0/200, x.klickd 200/200
  • real-LLM lane refuses with exact blocker (no provider called)

🤖 Generated by Computer

klickd agent and others added 2 commits June 2, 2026 13:18
Add a reproducible, scientifically-defensible pilot stress-test harness for a
single skill lane (x.klickd/coding). Phase 1 finds weaknesses in carried
continuity/governance behaviour over 200 adversarial multi-vector tasks; it is
not ABCD and makes no public/market or scientific-proof claim.

- tasks.json: exactly 200 unique tasks, >=3 vectors each across 9 dimensions,
  no easy tasks; governance vectors read from the real coding.klickd via SDK.
- generate_tasks.py: deterministic, byte-stable dataset generator (--check).
- run_benchmark.py: deterministic dry-run lanes (floor/ceiling, labelled
  not-real) + a gated real-LLM lane that refuses without explicit approval and
  ships the provider call unwired (anti-mirage).
- score_outputs.py: deterministic, LLM-free scorer.
- BENCHMARK_PROTOCOL.md / scoring_rubric.md: frozen protocol and rubric.
- README, reproducibility, failure_analysis template, dry-run results,
  real-lane BLOCKED marker.
- tests/test_continuity_coding200.py: 21 validations.

Real 200-task LLM execution is BLOCKED pending explicit human approval of
provider spend and a wired output->contract mapping.

Add provider-key leakage guardrails before any future real LLM run:

- secret_guard.py: single source of truth for detecting provider-key
  shapes, auth headers, high-entropy tokens, and any live provider env
  var value; redacts to [REDACTED:<kind>] and asserts payloads clean.
- run_benchmark.py: redact + assert_clean every envelope before writing;
  new value-blind `preflight` mode (key present by name only + results/
  clean); preflight wired into the real-LLM gate.
- scripts/check_benchmark_secret_leakage.py: CI-friendly artifact scanner,
  exits non-zero on any finding, prints only redacted previews.
- Docs (README, BENCHMARK_PROTOCOL sections 3/7/10, reproducibility, BLOCKED
  marker): keys live only in private env/secret manager, never committed,
  logged, or in artifacts; results record provider/model/run_id only.
- tests/test_benchmark_secret_guard.py: 18 tests covering detection,
  redaction, preflight value-blindness, and that no live env var value is
  ever serialized to an artifact or stdout.

Real 200-task LLM lane remains BLOCKED pending explicit human go.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Davincc77
Owner Author

Secret-safety guardrails added (commit c0e13aa)

Mandatory provider-key leakage guardrails are now in place before any future real LLM run. The real 200-task LLM lane remains BLOCKED — no provider was called and no real run was performed.

What changed

  • secret_guard.py — single source of truth for detecting provider-key shapes (Anthropic/OpenAI/Google/Groq/AWS), Authorization/Bearer headers, high-entropy tokens, and any live provider env var value. Redacts to [REDACTED:<kind>] and assert_cleans payloads. Findings never echo the raw secret.
  • run_benchmark.py — every output envelope is redact()-ed then assert_clean()-ed before any write, so no env var value, token, or header can be serialized to an artifact. New value-blind preflight mode verifies a provider key exists (reports env var name only, never the value) and that results/ is secret-clean; preflight is wired into the real-LLM gate (§7 item 5).
  • scripts/check_benchmark_secret_leakage.py — CI-friendly artifact scanner; exits non-zero on any finding, prints only redacted previews. Reuses secret_guard so detection is defined in one place.
  • Docs — README, BENCHMARK_PROTOCOL.md (§3 provenance, §7 gate item 5, new §10), reproducibility.md, and the BLOCKED marker now state: API keys live only in the private environment / secret manager, never committed, logged, or in artifacts; results record provider/model/run_id only, never headers/tokens.
  • tests/test_benchmark_secret_guard.py — 18 tests: fake-key detection, redaction, preflight value-blindness, and that a live env var value is never serialized to an artifact or stdout.
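
The redact-then-assert flow described above might look roughly like this. It is a sketch under assumptions, not the actual secret_guard.py: the regular expressions are illustrative guesses at key shapes (only the `sk-ant-`, `sk-`, and `AIza` prefixes appear in this PR), the `live_values` parameter and `find_secrets` helper are hypothetical, and the high-entropy detection mentioned in the PR is omitted.

```python
import re

# Illustrative shapes only; real detection rules may differ.
PATTERNS = {
    "anthropic_key": re.compile(r"sk-ant-[A-Za-z0-9_-]{8,}"),
    "openai_key": re.compile(r"sk-[A-Za-z0-9]{20,}"),
    "google_key": re.compile(r"AIza[A-Za-z0-9_-]{20,}"),
    "bearer": re.compile(r"Bearer\s+[A-Za-z0-9._~+/-]+=*"),
}


def redact(text: str, live_values: tuple[str, ...] = ()) -> str:
    """Replace live env values and key-shaped tokens with [REDACTED:<kind>]."""
    for value in live_values:
        if value:
            text = text.replace(value, "[REDACTED:env_value]")
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{kind}]", text)
    return text


def find_secrets(text: str, live_values: tuple[str, ...] = ()) -> list[str]:
    """Return the kinds of secret found, never the matched text itself."""
    kinds = [kind for kind, pattern in PATTERNS.items() if pattern.search(text)]
    if any(value and value in text for value in live_values):
        kinds.append("env_value")
    return kinds


def assert_clean(text: str, live_values: tuple[str, ...] = ()) -> None:
    """Raise if anything secret-like remains; the message names kinds only."""
    kinds = find_secrets(text, live_values)
    if kinds:
        raise ValueError(f"secret-like content found: {kinds}")
```

A writer would call `redact` on each serialized envelope and then `assert_clean` on the result before writing, so that a gap in redaction fails loudly rather than reaching disk.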

Validations

  • pytest tests/test_benchmark_secret_guard.py -q: 18 passed
  • pytest tests/test_continuity_coding200.py -q: 21 passed (no regressions)
  • pytest tests/test_dev_preview.py -q: 5 passed
  • supply-chain tests (-k supply_chain): 102 passed
  • run_benchmark.py preflight green; check_benchmark_secret_leakage.py reports clean; raw grep for sk-ant-/sk-/AIza patterns in benchmark+scripts source: no matches; the live env key value appears in no tracked/added file.
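
For illustration, a CI-style artifact scan with redacted previews could be sketched as below. This is not the real check_benchmark_secret_leakage.py, which reuses secret_guard; here a single illustrative regex stands in for the detector, and `KEY_SHAPE`, `preview`, `scan`, and `main` are hypothetical names.

```python
import re
from pathlib import Path

# Illustrative stand-in for the shared detector; prefixes match the grep above.
KEY_SHAPE = re.compile(r"(sk-ant-|sk-|AIza)[A-Za-z0-9_-]{16,}")


def preview(match: re.Match) -> str:
    """Show only the recognised prefix, never the secret body."""
    return f"{match.group(1)}[REDACTED]"


def scan(root: Path) -> list[tuple[str, int, str]]:
    """Walk root and return (path, line number, redacted preview) per finding."""
    findings = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in KEY_SHAPE.finditer(line):
                findings.append((str(path), lineno, preview(match)))
    return findings


def main(argv: list[str]) -> int:
    """Print redacted findings and return a CI-friendly exit code."""
    root = Path(argv[1]) if len(argv) > 1 else Path("results")
    findings = scan(root)
    for path, lineno, shown in findings:
        print(f"{path}:{lineno}: {shown}")
    return 1 if findings else 0
```

Used as a script, this would end with `sys.exit(main(sys.argv))` so that any finding fails the CI job.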

Real-run status

Still BLOCKED. The requirements for a real 200-task LLM run are unchanged, with the new secret gate added: explicit human go-ahead + --execute + XKLICKD_BENCHMARK_FULL_APPROVED=1 + a provider key + a reviewed _call_provider output→contract mapping + a green secret-safety preflight. No run will happen until you explicitly approve.
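
The value-blind part of the preflight can be illustrated with a small sketch. The candidate variable names and the report shape are assumptions made for this example, and the sketch covers only the "name, never value" check, not the results/ cleanliness scan.

```python
import json

# Assumed candidate names; the harness may look for different variables.
PROVIDER_KEY_NAMES = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def preflight(env: dict[str, str]) -> dict:
    """Report which provider key variables are set, by name only."""
    present = sorted(name for name in PROVIDER_KEY_NAMES if env.get(name))
    return {"ok": bool(present), "provider_key_env": present}


def render(report: dict) -> str:
    """Serialize the report; it holds names and booleans, never key values."""
    return json.dumps(report, sort_keys=True)
```

Because the report is built from variable names alone, printing or logging it cannot leak the key, whatever the key looks like.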

🤖 Generated with Claude Code

@Davincc77
Owner Author

Closing immediately: benchmark work is internal/private and should not live in the public repo. No further public benchmark work will be pushed here.

@Davincc77 Davincc77 closed this Jun 2, 2026
@Davincc77 Davincc77 deleted the bench/continuity-coding-200 branch June 2, 2026 14:56