feat(bench): continuity-hell v1 coding-200 pilot harness #126
Closed
Davincc77 wants to merge 2 commits into
Add a reproducible, scientifically-defensible pilot stress-test harness for a single skill lane (x.klickd/coding). Phase 1 finds weaknesses in carried continuity/governance behaviour over 200 adversarial multi-vector tasks; it is not ABCD and makes no public/market or scientific-proof claim.
- tasks.json: exactly 200 unique tasks, >=3 vectors each across 9 dimensions, no easy tasks; governance vectors read from the real coding.klickd via SDK.
- generate_tasks.py: deterministic, byte-stable dataset generator (--check).
- run_benchmark.py: deterministic dry-run lanes (floor/ceiling, labelled not-real) + a gated real-LLM lane that refuses without explicit approval and ships the provider call unwired (anti-mirage).
- score_outputs.py: deterministic, LLM-free scorer.
- BENCHMARK_PROTOCOL.md / scoring_rubric.md: frozen protocol and rubric.
- README, reproducibility, failure_analysis template, dry-run results, real-lane BLOCKED marker.
- tests/test_continuity_coding200.py: 21 validations.
Real 200-task LLM execution is BLOCKED pending explicit human approval of provider spend and a wired output->contract mapping.
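A byte-stable generator with a --check style comparison could look roughly like the sketch below. It is not the contents of generate_tasks.py, which this page does not show: the function names (build_tasks, serialize, check) and the four-entry dimension list are assumptions made for illustration. Only the ID range CH1-COD-001..200 and the "at least three vectors per task" rule come from the description.

```python
import hashlib
import json


def build_tasks(n: int = 200) -> list[dict]:
    """Build a toy task list purely from the index, with no randomness or timestamps."""
    # Hypothetical subset of dimensions; the real dataset uses nine.
    dimensions = ["continuity", "constraint_respect", "source_discipline", "governance"]
    tasks = []
    for i in range(1, n + 1):
        # Rotate through the dimensions so every task gets three distinct vectors.
        vectors = [dimensions[(i + k) % len(dimensions)] for k in range(3)]
        tasks.append({"id": f"CH1-COD-{i:03d}", "vectors": vectors})
    return tasks


def serialize(tasks: list[dict]) -> bytes:
    """Serialize with fixed key order, indentation, and trailing newline."""
    text = json.dumps(tasks, sort_keys=True, indent=2, ensure_ascii=False)
    return (text + "\n").encode("utf-8")


def check(existing: bytes) -> bool:
    """Return True when regenerating reproduces the existing bytes exactly."""
    fresh = hashlib.sha256(serialize(build_tasks())).digest()
    return fresh == hashlib.sha256(existing).digest()
```

The point of the design is that every input to the output bytes is fixed (iteration order, key order, indentation, encoding), so a committed file can be compared against a fresh regeneration and any drift shows up as a failed check.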
Add provider-key leakage guardrails before any future real LLM run:
- secret_guard.py: single source of truth for detecting provider-key shapes, auth headers, high-entropy tokens, and any live provider env var value; redacts to [REDACTED:<kind>] and asserts payloads clean.
- run_benchmark.py: redact + assert_clean every envelope before writing; new value-blind `preflight` mode (key present by name only + results/ clean); preflight wired into the real-LLM gate.
- scripts/check_benchmark_secret_leakage.py: CI-friendly artifact scanner, exits non-zero on any finding, prints only redacted previews.
- Docs (README, BENCHMARK_PROTOCOL sections 3/7/10, reproducibility, BLOCKED marker): keys live only in private env/secret manager, never committed, logged, or in artifacts; results record provider/model/run_id only.
- tests/test_benchmark_secret_guard.py: 18 tests covering detection, redaction, preflight value-blindness, and that no live env var value is ever serialized to an artifact or stdout.
Real 200-task LLM lane remains BLOCKED pending explicit human go.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
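The redact-then-assert pattern described in this commit might be sketched as follows. This is an illustration only, not the code in secret_guard.py: the two regular expressions, the env_value kind, and the function signatures are assumptions, and the high-entropy token detection mentioned above is left out for brevity. Only the [REDACTED:<kind>] output shape and the redact / assert_clean names come from the commit message.

```python
import os
import re

# Hypothetical patterns: a generic "sk-" style key and a bearer auth header.
PATTERNS = {
    "provider_key": re.compile(r"sk-[A-Za-z0-9_-]{16,}"),
    "auth_header": re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]{16,}"),
}


def redact(text: str, env_names: tuple[str, ...] = ()) -> str:
    """Replace key-shaped substrings and live env var values with [REDACTED:<kind>]."""
    # Redact live env values first so a value is caught even if it matches no pattern.
    for name in env_names:
        value = os.environ.get(name)
        if value:
            text = text.replace(value, "[REDACTED:env_value]")
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{kind}]", text)
    return text


def assert_clean(text: str, env_names: tuple[str, ...] = ()) -> None:
    """Raise if redaction would change the text, i.e. a secret shape is still present."""
    if redact(text, env_names) != text:
        raise ValueError("payload contains a secret-shaped value")
```

A writer would call redact on each envelope and then assert_clean on the result before touching disk, so a missed pattern fails loudly instead of landing in an artifact. The error message deliberately contains no part of the offending value.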
Secret-safety guardrails added (commit c0e13aa)
Mandatory provider-key leakage guardrails are now in place before any future real LLM run. The real 200-task LLM lane remains BLOCKED: no provider was called and no real run was performed.
What changed
Validations
Real-run status
Still BLOCKED. The remaining requirement before a real 200-task LLM run is unchanged plus the new secret gate: explicit human go-ahead +
🤖 Generated with Claude Code
Closing immediately: benchmark work is internal/private and should not live in the public repo. No further public benchmark work will be pushed here.
Summary
Phase 1 of the continuity benchmark programme: a reproducible,
scientifically-defensible pilot stress-test harness for one skill lane,
x.klickd/coding (real artifact packages/@klickd/core/starter-skills/coding.klickd).
This is a real 200-task stress test on a single skill to find weaknesses
in its carried continuity/governance behaviour. It is not an A/B/C/D study
(ABCD comes later, after corrections), not a public release, and makes
no scientific-proof or market claim. Public release stays v4.1.
What's added (benchmarks/continuity-hell-v1/coding-200/)
- tasks.json: exactly 200 unique tasks (CH1-COD-001..200), each
  with ≥3 vectors across nine dimensions (continuity, constraint respect,
  source discipline, governance/human-veto, security/no-leakage, skill
  activation, handoff, actionability, no hallucinated facts). No easy tasks.
  Governance vectors are read from the real coding.klickd via the SDK, so
  the dataset can't drift from the artifact under test.
- generate_tasks.py: deterministic, byte-stable generator (--check).
- run_benchmark.py: deterministic dry-run lanes (floor/ceiling,
  always labelled is_real_llm: false) + a gated real-LLM lane that refuses
  without explicit approval and ships the provider call unwired on purpose.
- score_outputs.py: deterministic, LLM-free scorer.
- BENCHMARK_PROTOCOL.md / scoring_rubric.md: frozen protocol
  (hypotheses, conditions, model/temp, thresholds, anti-mirage gate) and rubric.
- README.md, reproducibility.md, failure_analysis.md template,
  dry-run results, and a llm_x_klickd.BLOCKED.md marker.
- tests/test_continuity_coding200.py: 21 validations.
Anti-mirage guarantees
- […] is_real_llm: true; dry-run output carries a not_real_label.
- […] NotImplementedError) behind --execute + XKLICKD_BENCHMARK_FULL_APPROVED=1 + a provider key.
- […] for the exact blocker and required input.
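The gate conditions named above (--execute, XKLICKD_BENCHMARK_FULL_APPROVED=1, a provider key, and a NotImplementedError at the end) could be arranged as in this minimal sketch. It is not taken from run_benchmark.py: RealLaneBlocked, call_provider, run_real_lane, and the key_env_name parameter are hypothetical names, and the order of the checks is an assumption.

```python
import os
from collections.abc import Mapping


class RealLaneBlocked(RuntimeError):
    """Raised when a precondition for the real-LLM lane is not met."""


def call_provider(prompt: str) -> str:
    # Left unwired on purpose in this sketch: no provider is ever contacted.
    raise NotImplementedError("provider call is not wired")


def run_real_lane(execute: bool, key_env_name: str,
                  env: Mapping[str, str] = os.environ) -> str:
    """Check every gate, then reach the (unwired) provider call."""
    if not execute:
        raise RealLaneBlocked("--execute was not passed")
    if env.get("XKLICKD_BENCHMARK_FULL_APPROVED") != "1":
        raise RealLaneBlocked("XKLICKD_BENCHMARK_FULL_APPROVED=1 is not set")
    # Value-blind check: only presence is tested; the value never enters a message.
    if not env.get(key_env_name):
        raise RealLaneBlocked(f"{key_env_name} is not set")
    return call_provider("placeholder prompt")
```

Even with all three gates satisfied, the call still ends in NotImplementedError, so passing the gate by accident cannot produce output that looks like a real run.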
Test plan
- pytest tests/test_continuity_coding200.py -q: 21 passed
- pytest tests/ -q: 309 passed (no regressions; +21 new)
- pytest tests/test_dev_preview.py -q: 5 passed
- generate_tasks.py --check: dataset byte-stable
🤖 Generated by Computer