diff --git a/CHANGELOG.md b/CHANGELOG.md index 72b866d..b980540 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,7 @@ All notable changes for the ContextGuard plugin are documented here. ## [Unreleased] +- Added `context-guard-bench --evidence-jsonl` replay and `--dashboard-md` rendering so synthetic/local benchmark evidence can regenerate CSV/report/dashboard artifacts while remaining non-public-claim-eligible unless provider-export provenance is complete. - Extended Batch 1 token-savings advisory reports with cache-score amortization risk fields, tool-prune deferred-schema proxy accounting, and a benchmark measurement-baseline contract while preserving local-only/no-savings-claim boundaries. - Clarified cache-score amortization output for cache-read multipliers above uncached cost by reporting a bounded `max_profitable_reuses` instead of a monotonic break-even reuse count. diff --git a/README.ko.md b/README.ko.md index 542bc8b..b46ef36 100644 --- a/README.ko.md +++ b/README.ko.md @@ -375,7 +375,7 @@ JSON 출력에는 여러 증거 surface가 포함될 수 있습니다. - 비용 필드가 0이거나 없으면 토큰 절감만 표시하고 실제 비용 절감은 주장하지 않습니다. - CSV 스키마는 엄격하게 검사합니다. 벤치마크 헬퍼를 업그레이드한 뒤에는 새 `--csv` 파일을 시작하거나 mismatch 오류가 알려주는 헤더로 마이그레이션하세요. -최소 보고서 형태 예시는 [`docs/benchmark-report.example.json`](docs/benchmark-report.example.json)을, 작업 유형별 합성 예시와 안전한 해석 경계는 [`docs/benchmark-workflow-examples.md`](docs/benchmark-workflow-examples.md)을, fixture-only 실험 시작 예시는 [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md)을 참고하세요. +최소 보고서 형태 예시는 [`docs/benchmark-report.example.json`](docs/benchmark-report.example.json)을, 작업 유형별 합성 예시와 안전한 해석 경계는 [`docs/benchmark-workflow-examples.md`](docs/benchmark-workflow-examples.md)을, fixture-only 실험 시작 예시는 [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md)을 참고하세요. live provider 실행 전 deterministic local replay가 필요하면 `--evidence-jsonl docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl --dashboard-md ... 
--baseline-variant baseline_full_context_fixture`를 사용하세요. Replay mode는 provider와 `success_command`를 실행하지 않고 CSV/report/dashboard를 만들지만 synthetic/manual evidence는 public hosted-savings claim 불가로 표시합니다. ### 실험 기능 opt-in 관리 diff --git a/README.md b/README.md index dd467e1..89c70f6 100644 --- a/README.md +++ b/README.md @@ -406,9 +406,12 @@ These fields can flag likely volatile content near the prompt prefix, stable-pre ```bash ./plugins/context-guard/bin/context-guard-bench \ --tasks bench/tasks.json --variants bench/variants.json --csv bench/results.csv \ - --ledger-jsonl bench/cost-shift.jsonl --report-json bench/report.json + --ledger-jsonl bench/cost-shift.jsonl --report-json bench/report.json \ + --dashboard-md bench/dashboard.md ``` +For deterministic local replay before a live provider run, add `--evidence-jsonl docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl` and, for the 12-task fixture, `--baseline-variant baseline_full_context_fixture`. Replay mode skips provider and `success_command` execution, writes the same CSV/report/dashboard surfaces, and marks synthetic/manual evidence as non-public-claim-eligible. + Read the report through its claim boundaries before writing any savings statement: - Successful baseline/variant runs are compared by real tokens and `cost_usd + external_cost_usd`; byte reductions stay proxy evidence. @@ -419,7 +422,7 @@ Read the report through its claim boundaries before writing any savings statemen - If cost fields are zero or unavailable, the report can still mark token savings but will not claim shifted-cost savings. - CSV schemas are strict; after upgrading the benchmark helper, start a new `--csv` file or migrate the header named in the mismatch error. 
-See [`docs/benchmark-report.example.json`](docs/benchmark-report.example.json) for a minimal report-shape example, [`docs/benchmark-workflow-examples.md`](docs/benchmark-workflow-examples.md) for workflow-specific synthetic examples, and [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md) for fixture-only experimental task/variant starters. +See [`docs/benchmark-report.example.json`](docs/benchmark-report.example.json) for a minimal report-shape example, [`docs/benchmark-workflow-examples.md`](docs/benchmark-workflow-examples.md) for workflow-specific synthetic examples, and [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md) for fixture-only experimental task/variant starters plus synthetic evidence replay. ### Manage experimental opt-ins diff --git a/context-guard-kit/README.md b/context-guard-kit/README.md index bdb71c9..79675f0 100644 --- a/context-guard-kit/README.md +++ b/context-guard-kit/README.md @@ -79,7 +79,7 @@ python3 context-guard-kit/sanitize_output.py -- git diff `experimental_registry.py plan local-proxy`는 localhost-only dry-run 안내 plan입니다. `experimental_registry.py plan local-proxy-external-forwarding`은 future external forwarding을 위한 design-only dry-run gate이며 explicit intent, HTTPS allowlist, threat model note, credential redaction policy, provider-evidence boundary를 요구하고 DNS lookup, external service call, traffic forwarding은 하지 않습니다. `experimental_registry.py record local-proxy-runtime-gate --ledger-jsonl ...`은 listener 시작, traffic forwarding, DNS lookup 없이 local gate row 하나만 기록하는 명시적 runtime입니다. `experimental_registry.py serve local-proxy`는 명시적 one-shot loopback forwarding MVP이며 `--runtime-gate-ack --forwarding-gate-ack --once`, private `--ready-file` nonce handoff, literal loopback bind/target IP, hostname DNS target 금지, nonzero port, byte/time limit, credential-free request가 필요합니다. 
API key를 저장하지 않고, external forwarding이나 CONNECT/TLS proxying을 지원하지 않으며, hosted savings claim도 만들지 않습니다. 선택적 `--diagnostic-ledger-jsonl`은 successful forwarded request 뒤에 raw header/body나 hosted-savings evidence 없이 shifted-cost diagnostic row 하나만 추가합니다. External proxy forwarding runtime은 shipped가 아니며, 나머지 roadmap lane은 별도 runtime gate가 생기기 전까지 안내 상태로 남습니다. -`benchmark_runner.py`는 `research/benchmark-plan.md`의 고정 task/variant 실험을 실행합니다. `variant_prompt_files`는 선택된 task/variant를 필터링한 뒤 필요한 file-backed prompt만 읽으므로 선택하지 않은 fixture의 누락 파일이 선택된 실행을 깨지 않습니다. `--ledger-jsonl`은 subagent·artifact 등 외부 실행 표면으로 옮겨간 token/cost와 run별 측정 가능 여부를 남기고, 선택적 `self_hosted_metrics` provider payload는 run별 sidecar로만 기록합니다. `--report-json`은 baseline 대비 실제 token/cost 절감과 proxy byte 감소를 분리한 A/B report를 생성하며, `self_hosted_metrics`는 CSV/report 요약에 접지 않습니다. Report의 `matched_pair_evidence`는 성공한 baseline/variant task bucket을 transform, quality gate, 측정 가능 여부, claim boundary와 연결하므로 절감 주장을 쓰기 전에 이 항목을 확인하세요. +`benchmark_runner.py`는 `research/benchmark-plan.md`의 고정 task/variant 실험을 실행합니다. `variant_prompt_files`는 선택된 task/variant를 필터링한 뒤 필요한 file-backed prompt만 읽으므로 선택하지 않은 fixture의 누락 파일이 선택된 실행을 깨지 않습니다. `--ledger-jsonl`은 subagent·artifact 등 외부 실행 표면으로 옮겨간 token/cost와 run별 측정 가능 여부를 남기고, 선택적 `self_hosted_metrics` provider payload는 run별 sidecar로만 기록합니다. `--report-json`은 baseline 대비 실제 token/cost 절감과 proxy byte 감소를 분리한 A/B report를 생성하며, `--dashboard-md`는 같은 report에서 Markdown dashboard를 렌더링합니다. `--evidence-jsonl` replay는 provider와 `success_command`를 실행하지 않는 deterministic import mode이고, synthetic/manual evidence는 public hosted-savings claim 불가로 강제됩니다. `self_hosted_metrics`는 CSV/report 요약에 접지 않습니다. Report의 `matched_pair_evidence`는 성공한 baseline/variant task bucket을 transform, quality gate, 측정 가능 여부, claim boundary와 연결하므로 절감 주장을 쓰기 전에 이 항목을 확인하세요. 
`../research/experimental-token-reduction-radar.md`는 learned compression, generated crop/OCR/visual-token pruning, self-hosted KV/latent inference optimization 같은 선택적 미래 실험을 문서화한 gate입니다. `../docs/experimental-benchmark-fixtures.md`에는 fixture-only task/variant 시작 예시가 있습니다. 이 radar와 fixture는 hosted API token/cost 절감을 보장하지 않습니다. 현재 제공되는 helper surface는 명시적 local context-diff emit, visual evidence-pack emit, learned candidate emit, self-hosted metrics record, local proxy gate record, one-shot literal-loopback local proxy serve, design-only external-forwarding plan 같은 좁은 local surface뿐이며, hosted API token/cost 절감 주장은 provider가 측정한 matched-task 근거가 있을 때만 허용합니다. Radar의 later-roadmap gate는 neural/semantic compression, trust-tiered injection-aware compression, generated visual-token reduction, broader external/daemon/hostname-DNS/credential-bearing local proxy forwarding constraints를 별도 미래 PR이 gate를 통과하기 전까지 experimental/non-shipped로 유지합니다. diff --git a/context-guard-kit/benchmark_runner.py b/context-guard-kit/benchmark_runner.py index e338b88..a1af3ab 100755 --- a/context-guard-kit/benchmark_runner.py +++ b/context-guard-kit/benchmark_runner.py @@ -178,6 +178,8 @@ ) MAX_USAGE_TOKEN_COUNT = 10**12 MAX_USAGE_COST_USD = 10**9 +MAX_EVIDENCE_JSONL_BYTES = 5_000_000 +MAX_EVIDENCE_JSONL_LINES = 100_000 # Byte -> token proxy 환산 계수. 측정된 모델 토큰이 아니라 byte delta 기반 보수적 # 추정치이며, report에서 evidence="inferred"로 분명히 라벨링한다. 영어 텍스트 기준 # ~4 bytes/token의 통용 근사값을 사용한다. 
@@ -188,6 +190,25 @@ SELF_HOSTED_METRICS_SCHEMA_VERSION = "contextguard.bench.self-hosted-metrics.v1" SELF_HOSTED_METRICS_KEY = "self_hosted_metrics" SELF_HOSTED_METRICS_CLAIM_BOUNDARY = "self_hosted_metrics_only_not_hosted_api_token_or_cost_savings" +EVIDENCE_REPLAY_SOURCE_TYPES = frozenset({"synthetic_fixture", "provider_export", "manual_audit"}) +PROVIDER_EXPORT_PUBLIC_CLAIM_SCOPES = frozenset({ + "provider_measured_matched_task", + "provider_measured_matched_task_public_claim", + "hosted_api_provider_measured_matched_task", +}) +REPLAY_PUBLIC_CLAIM_CANDIDATE_STATUS = "provider_export_public_claim_candidate" +REPLAY_PROVIDER_CLAIM_GATES_NOT_MET_STATUS = "provider_export_claim_gates_not_met" +REPLAY_NOT_PUBLIC_CLAIM_STATUS = "replay_only_not_public_claim" +REPLAY_UNKNOWN_MIXED_CSV_STATUS = "unknown_mixed_csv" +REPLAY_PUBLIC_CLAIM_ELIGIBLE_RAW_STATUSES = frozenset({ + "token_and_shifted_cost_savings_observed", +}) +REPLAY_CLAIM_BOUNDARY = ( + "Evidence replay is an import/replay mode. Synthetic fixtures and manual audits are never " + "hosted API token/cost savings evidence; public claims require complete provider_export " + "provenance for every report row plus the normal matched-task quality, token, cost, and " + "shifted-cost gates." 
+) MAX_SELF_HOSTED_LABEL_CHARS = 120 MAX_SELF_HOSTED_LATENCY_MS = 7 * 24 * 60 * 60 * 1000 MAX_SELF_HOSTED_MEMORY_MB = 10_000_000 @@ -401,6 +422,36 @@ class RunResult: self_hosted_metrics: dict[str, Any] | None = None +@dataclass +class EvidenceReplayRow: + result: RunResult + source_type: str + provider_name: str | None + capture_command_or_export_id: str | None + claim_scope: str + provider_export_provenance_complete: bool + public_claim_eligible: bool + line_number: int + + @property + def key(self) -> tuple[str, str]: + return (self.result.task_id, self.result.variant) + + def provenance_payload(self) -> dict[str, Any]: + return { + "schema_version": BENCH_RUN_EVIDENCE_SCHEMA_VERSION, + "mode": "evidence_jsonl_replay", + "evidence_source_type": self.source_type, + "provider_name": self.provider_name, + "capture_command_or_export_id": self.capture_command_or_export_id, + "claim_scope": self.claim_scope, + "provider_export_provenance_complete": self.provider_export_provenance_complete, + "public_claim_eligible": self.public_claim_eligible, + "line_number": self.line_number, + "claim_boundary": REPLAY_CLAIM_BOUNDARY, + } + + @dataclass class BoundedProcessResult: returncode: int @@ -1362,7 +1413,13 @@ def write_text_no_follow(path: Path, text: str) -> None: os.close(fd) -def append_cost_shift_ledger(path: Path, claude_ver: str, result: RunResult) -> None: +def append_cost_shift_ledger( + path: Path, + claude_ver: str, + result: RunResult, + *, + replay_provenance: dict[str, Any] | None = None, +) -> None: shifted_cost_known = cost_shift_measured(result) byte_metrics_observed = bool(result.bytes_before or result.bytes_after) payload = { @@ -1413,6 +1470,10 @@ def append_cost_shift_ledger(path: Path, claude_ver: str, result: RunResult) -> } if result.self_hosted_metrics is not None: payload["self_hosted_metrics"] = result.self_hosted_metrics + if replay_provenance is not None: + payload["replay_provenance"] = replay_provenance + payload["evidence_source_type"] = 
replay_provenance.get("evidence_source_type") + payload["public_claim_eligible"] = bool(replay_provenance.get("public_claim_eligible")) with csv_file_lock(path, create_parent=True): fd = _open_regular_no_symlink(path, os.O_CREAT | os.O_APPEND | os.O_WRONLY, 0o600, create_parent=True) try: @@ -1488,6 +1549,354 @@ def read_csv_rows(csv_path: Path) -> list[dict[str, str]]: os.close(fd) +def file_has_content_no_follow(path: Path) -> bool: + try: + fd = _open_regular_no_symlink(path) + except FileNotFoundError: + return False + try: + return os.fstat(fd).st_size > 0 + finally: + os.close(fd) + + +def require_evidence_object(raw: Any, *, owner: str) -> dict[str, Any]: + if not isinstance(raw, dict): + raise SystemExit(f"{owner} evidence row must be a JSON object") + return raw + + +def evidence_non_empty_string(raw: Any, *, field: str, owner: str, required: bool = True) -> str | None: + if raw is None: + if required: + raise SystemExit(f"{owner} {field} must be a non-empty string") + return None + if not isinstance(raw, str): + raise SystemExit(f"{owner} {field} must be a string") + text = sanitize_note_text(raw) + if not text: + if required: + raise SystemExit(f"{owner} {field} must be a non-empty string") + return None + return text + + +def evidence_bool(raw: Any, *, field: str, owner: str, default: bool = False) -> bool: + if raw is None: + return default + if not isinstance(raw, bool): + raise SystemExit(f"{owner} {field} must be a boolean") + return raw + + +def evidence_nonnegative_int( + raw: Any, + *, + field: str, + owner: str, + default: int = 0, + maximum: int = MAX_USAGE_TOKEN_COUNT, +) -> int: + if raw is None: + return default + value = normalize_usage_token(raw) + if value is None or value > maximum: + raise SystemExit(f"{owner} {field} must be a finite non-negative integer") + return value + + +def evidence_nonnegative_float( + raw: Any, + *, + field: str, + owner: str, + default: float = 0.0, + maximum: float = MAX_USAGE_COST_USD, +) -> float: + if raw 
is None: + return default + if isinstance(raw, bool) or not isinstance(raw, (int, float)): + raise SystemExit(f"{owner} {field} must be a finite non-negative number") + value = float(raw) + if not math.isfinite(value) or value < 0 or value > maximum: + raise SystemExit(f"{owner} {field} must be a finite non-negative number") + return value + + +def evidence_first(raw: dict[str, Any], *keys: str) -> Any: + for key in keys: + if key in raw: + return raw[key] + return None + + +def parse_evidence_provenance(raw: dict[str, Any], *, owner: str) -> dict[str, Any]: + provenance = raw.get("provenance") + if provenance is not None and not isinstance(provenance, dict): + raise SystemExit(f"{owner} provenance must be a JSON object") + source_raw = ( + provenance.get("evidence_source_type") + if isinstance(provenance, dict) and "evidence_source_type" in provenance + else raw.get("evidence_source_type") + ) + source_type = evidence_non_empty_string(source_raw, field="evidence_source_type", owner=owner) + assert source_type is not None + if source_type not in EVIDENCE_REPLAY_SOURCE_TYPES: + raise SystemExit( + f"{owner} evidence_source_type must be one of: {', '.join(sorted(EVIDENCE_REPLAY_SOURCE_TYPES))}" + ) + provider_name = evidence_non_empty_string( + provenance.get("provider_name") if isinstance(provenance, dict) else raw.get("provider_name"), + field="provider_name", + owner=owner, + required=False, + ) + capture_id = evidence_non_empty_string( + ( + provenance.get("capture_command_or_export_id") + if isinstance(provenance, dict) and "capture_command_or_export_id" in provenance + else raw.get("capture_command_or_export_id") + ), + field="capture_command_or_export_id", + owner=owner, + required=False, + ) + claim_scope = evidence_non_empty_string( + provenance.get("claim_scope") if isinstance(provenance, dict) else raw.get("claim_scope"), + field="claim_scope", + owner=owner, + ) + assert claim_scope is not None + provider_authority = ( + source_type == "provider_export" + 
and provider_name is not None + and capture_id is not None + and claim_scope in PROVIDER_EXPORT_PUBLIC_CLAIM_SCOPES + ) + return { + "source_type": source_type, + "provider_name": provider_name, + "capture_command_or_export_id": capture_id, + "claim_scope": claim_scope, + "provider_public_claim_authority": provider_authority, + } + + +def parse_evidence_tokens(raw: dict[str, Any], *, owner: str) -> tuple[dict[str, int], set[str]]: + token_block = raw.get("tokens") + if token_block is not None and not isinstance(token_block, dict): + raise SystemExit(f"{owner} tokens must be a JSON object") + tokens: dict[str, int] = {} + observed: set[str] = set() + source = token_block if isinstance(token_block, dict) else {} + for bucket, _keys in USAGE_KEY_GROUPS: + value = source.get(bucket) if bucket in source else raw.get(bucket) + if value is not None: + observed.add(bucket) + tokens[bucket] = evidence_nonnegative_int(value, field=bucket, owner=owner) + return tokens, observed + + +def parse_evidence_row(raw_value: Any, *, owner: str, line_number: int) -> EvidenceReplayRow: + raw = require_evidence_object(raw_value, owner=owner) + schema = evidence_non_empty_string(raw.get("schema_version"), field="schema_version", owner=owner) + if schema != BENCH_RUN_EVIDENCE_SCHEMA_VERSION: + raise SystemExit( + f"{owner} schema_version must be {BENCH_RUN_EVIDENCE_SCHEMA_VERSION}" + ) + task_id = evidence_non_empty_string(raw.get("task_id"), field="task_id", owner=owner) + variant = evidence_non_empty_string(raw.get("variant"), field="variant", owner=owner) + assert task_id is not None and variant is not None + provenance = parse_evidence_provenance(raw, owner=owner) + provider_authority = bool(provenance["provider_public_claim_authority"]) + raw_primary_tokens_measured = evidence_bool( + raw.get("primary_tokens_measured"), + field="primary_tokens_measured", + owner=owner, + ) + raw_cost_measured = evidence_bool( + evidence_first(raw, "cost_measured", "primary_cost_measured"), + 
field="cost_measured", + owner=owner, + ) + if provenance["source_type"] in {"synthetic_fixture", "manual_audit"}: + primary_tokens_measured = False + cost_measured = False + elif provider_authority: + primary_tokens_measured = raw_primary_tokens_measured + cost_measured = raw_cost_measured + else: + if raw_primary_tokens_measured or raw_cost_measured: + raise SystemExit( + f"{owner} provider_export measured flags require provider_name, " + "capture_command_or_export_id, and a provider-measured matched-task claim_scope" + ) + primary_tokens_measured = False + cost_measured = False + + tokens, observed_token_buckets = parse_evidence_tokens(raw, owner=owner) + if primary_tokens_measured and not {"input_tokens", "output_tokens"}.issubset(observed_token_buckets): + raise SystemExit( + f"{owner} primary_tokens_measured=true requires input_tokens and output_tokens evidence" + ) + cost_usd = evidence_nonnegative_float( + evidence_first(raw, "cost_usd", "primary_cost_usd"), + field="cost_usd", + owner=owner, + ) + if cost_measured and "cost_usd" not in raw and "primary_cost_usd" not in raw: + raise SystemExit(f"{owner} cost_measured=true requires cost_usd evidence") + + if "success" not in raw: + raise SystemExit(f"{owner} success must be a boolean") + success = evidence_bool(raw.get("success"), field="success", owner=owner) + notes = evidence_non_empty_string(raw.get("notes"), field="notes", owner=owner, required=False) + model = evidence_non_empty_string(raw.get("model"), field="model", owner=owner, required=False) or "evidence-replay" + effort = evidence_non_empty_string(raw.get("effort"), field="effort", owner=owner, required=False) or "" + self_hosted_metrics = None + if SELF_HOSTED_METRICS_KEY in raw: + self_hosted_metrics = normalize_self_hosted_metrics( + raw.get(SELF_HOSTED_METRICS_KEY), + source="evidence_jsonl.self_hosted_metrics", + ) + if self_hosted_metrics is None: + raise SystemExit(f"{owner} self_hosted_metrics must be normalized explicit metrics") + + 
result = RunResult( + task_id=task_id, + variant=variant, + model=model, + effort=effort, + tokens=tokens, + cost_usd=cost_usd, + success=success, + notes=notes or f"evidence replay ({provenance['source_type']})", + corrections=evidence_nonnegative_int(raw.get("corrections"), field="corrections", owner=owner), + cost_measured=cost_measured, + wall_time_seconds=evidence_nonnegative_float( + raw.get("wall_time_seconds"), + field="wall_time_seconds", + owner=owner, + maximum=MAX_SELF_HOSTED_LATENCY_MS / 1000, + ), + turns=evidence_nonnegative_int(raw.get("turns"), field="turns", owner=owner), + hook_triggers=evidence_nonnegative_int(raw.get("hook_triggers"), field="hook_triggers", owner=owner), + bytes_before=evidence_nonnegative_int(raw.get("bytes_before"), field="bytes_before", owner=owner), + bytes_after=evidence_nonnegative_int(raw.get("bytes_after"), field="bytes_after", owner=owner), + artifacts_used=evidence_nonnegative_int(raw.get("artifacts_used"), field="artifacts_used", owner=owner), + external_tokens=evidence_nonnegative_int(raw.get("external_tokens"), field="external_tokens", owner=owner), + external_tokens_measured=evidence_bool( + raw.get("external_tokens_measured"), + field="external_tokens_measured", + owner=owner, + ), + external_cost_usd=evidence_nonnegative_float( + raw.get("external_cost_usd"), + field="external_cost_usd", + owner=owner, + ), + external_cost_measured=evidence_bool( + raw.get("external_cost_measured"), + field="external_cost_measured", + owner=owner, + ), + provider_cached_tokens=evidence_nonnegative_int( + raw.get("provider_cached_tokens"), + field="provider_cached_tokens", + owner=owner, + ), + provider_cached_tokens_measured=evidence_bool( + raw.get("provider_cached_tokens_measured"), + field="provider_cached_tokens_measured", + owner=owner, + ), + primary_tokens_measured=primary_tokens_measured, + self_hosted_metrics=self_hosted_metrics, + ) + return EvidenceReplayRow( + result=result, + 
source_type=str(provenance["source_type"]), + provider_name=provenance["provider_name"], + capture_command_or_export_id=provenance["capture_command_or_export_id"], + claim_scope=str(provenance["claim_scope"]), + provider_export_provenance_complete=provider_authority, + public_claim_eligible=False, + line_number=line_number, + ) + + +def read_evidence_jsonl(path: Path) -> list[EvidenceReplayRow]: + fd = _open_regular_no_symlink(path) + try: + size = os.fstat(fd).st_size + if size > MAX_EVIDENCE_JSONL_BYTES: + raise SystemExit( + f"evidence JSONL exceeds {MAX_EVIDENCE_JSONL_BYTES} bytes: {path}" + ) + rows: list[EvidenceReplayRow] = [] + with os.fdopen(fd, "r", encoding="utf-8") as handle: + fd = -1 + for line_number, line in enumerate(handle, start=1): + if line_number > MAX_EVIDENCE_JSONL_LINES: + raise SystemExit( + f"evidence JSONL line limit exceeded for {path}: > {MAX_EVIDENCE_JSONL_LINES}" + ) + if not line.strip(): + continue + try: + payload = json.loads(line) + except json.JSONDecodeError as exc: + raise SystemExit( + f"{path}:{line_number} evidence row must be JSON: {exc.msg}" + ) from None + rows.append(parse_evidence_row(payload, owner=f"{path}:{line_number}", line_number=line_number)) + finally: + if fd != -1: + os.close(fd) + if not rows: + raise SystemExit(f"evidence JSONL contains no rows: {path}") + return rows + + +def validate_evidence_coverage( + evidence_rows: list[EvidenceReplayRow], + runnable_targets: list[tuple[TaskFixture, Variant]], +) -> dict[tuple[str, str], EvidenceReplayRow]: + by_key: dict[tuple[str, str], EvidenceReplayRow] = {} + for row in evidence_rows: + if row.key in by_key: + raise SystemExit( + f"duplicate evidence row for {row.key[0]}/{row.key[1]} " + f"(lines {by_key[row.key].line_number} and {row.line_number})" + ) + by_key[row.key] = row + missing = [ + f"{task.id}/{variant.name}" + for task, variant in runnable_targets + if (task.id, variant.name) not in by_key + ] + if missing: + raise SystemExit(f"missing evidence 
row(s) for selected targets: {', '.join(missing)}") + return { + (task.id, variant.name): by_key[(task.id, variant.name)] + for task, variant in runnable_targets + } + + +def run_evidence_fixture(task: TaskFixture, variant: Variant, evidence: EvidenceReplayRow) -> RunResult: + result = evidence.result + if result.task_id != task.id or result.variant != variant.name: + raise SystemExit( + f"evidence target mismatch: expected {task.id}/{variant.name}, " + f"got {result.task_id}/{result.variant}" + ) + if result.model == "evidence-replay": + result.model = task.model + if not result.effort: + result.effort = task.effort or "" + return result + + def row_int(row: dict[str, str], key: str) -> int: try: return int(float(row.get(key) or 0)) @@ -2277,18 +2686,230 @@ def matched_pair_evidence_entry( ), } +def annotate_replay_report( + report: dict[str, Any], + replay_rows: list[EvidenceReplayRow], + *, + mixed_csv: bool, +) -> dict[str, Any]: + source_types = sorted({row.source_type for row in replay_rows}) + provider_names = sorted({row.provider_name for row in replay_rows if row.provider_name}) + claim_scopes = sorted({row.claim_scope for row in replay_rows}) + same_run_complete = (not mixed_csv) and len(replay_rows) == int(report.get("row_count") or 0) + all_provider_claim_authority = bool(replay_rows) and all( + row.provider_export_provenance_complete for row in replay_rows + ) + raw_claim_status = str(report.get("claim_status") or "") + matched_pair_evidence = report.get("matched_pair_evidence") + matched_claim_gates_allow_public_claim = ( + isinstance(matched_pair_evidence, list) + and bool(matched_pair_evidence) + and all( + isinstance(item, dict) + and isinstance(item.get("claim_boundary"), dict) + and bool(item["claim_boundary"].get("token_savings_claim_allowed")) + and bool(item["claim_boundary"].get("shifted_cost_claim_allowed")) + for item in matched_pair_evidence + ) + ) + report_claim_gates_allow_public_claim = ( + raw_claim_status in 
REPLAY_PUBLIC_CLAIM_ELIGIBLE_RAW_STATUSES + and matched_claim_gates_allow_public_claim + ) + if not same_run_complete: + public_claim_status = REPLAY_UNKNOWN_MIXED_CSV_STATUS + public_claim_eligible = False + elif all_provider_claim_authority and report_claim_gates_allow_public_claim: + public_claim_status = REPLAY_PUBLIC_CLAIM_CANDIDATE_STATUS + public_claim_eligible = True + elif all_provider_claim_authority: + public_claim_status = REPLAY_PROVIDER_CLAIM_GATES_NOT_MET_STATUS + public_claim_eligible = False + else: + public_claim_status = REPLAY_NOT_PUBLIC_CLAIM_STATUS + public_claim_eligible = False + report["raw_metric_claim_status"] = raw_claim_status + report["public_claim_status"] = public_claim_status + report["public_claim_eligible"] = public_claim_eligible + if not public_claim_eligible: + report["claim_status"] = public_claim_status + report["replay_evidence"] = { + "schema_version": BENCH_RUN_EVIDENCE_SCHEMA_VERSION, + "mode": "evidence_jsonl_replay", + "row_count": len(replay_rows), + "source_types": source_types, + "provider_names": provider_names, + "claim_scopes": claim_scopes, + "same_run_complete": same_run_complete, + "mixed_csv": mixed_csv, + "provider_export_provenance_complete": all_provider_claim_authority, + "report_claim_gates_allow_public_claim": report_claim_gates_allow_public_claim, + "public_claim_status": public_claim_status, + "public_claim_eligible": public_claim_eligible, + "target_keys": [f"{row.result.task_id}/{row.result.variant}" for row in replay_rows], + "claim_boundary": REPLAY_CLAIM_BOUNDARY, + } + return report + + +def report_public_claim_status(report: dict[str, Any]) -> tuple[str, bool | None]: + if "public_claim_status" in report: + return str(report.get("public_claim_status")), bool(report.get("public_claim_eligible")) + return ( + "csv_provenance_unknown_requires_original_evidence_or_trusted_ledger", + None, + ) + + +def markdown_value(value: Any) -> str: + if value is None: + return "n/a" + if isinstance(value, bool): 
+ return "true" if value else "false" + if isinstance(value, float): + return f"{value:.6g}" + text = sanitize_note_text(value) + return text.replace("|", "\\|") or "n/a" + + +def render_dashboard_markdown(report: dict[str, Any]) -> str: + public_claim_status, public_claim_eligible = report_public_claim_status(report) + metric_claim_status = report.get("raw_metric_claim_status", report.get("claim_status")) + lines = [ + "# ContextGuard Benchmark Dashboard", + "", + f"- Schema: `{markdown_value(report.get('schema'))}`", + f"- Baseline variant: `{markdown_value(report.get('baseline_variant'))}`", + f"- Rows: {markdown_value(report.get('row_count'))}", + f"- Metric claim status: `{markdown_value(metric_claim_status)}`", + f"- Public claim status: `{markdown_value(public_claim_status)}`", + f"- Public claim eligible: `{markdown_value(public_claim_eligible)}`", + "", + "> Claim boundary: this dashboard is not a hosted savings claim unless report claim gates " + "allow it and public-claim provenance is complete. 
Proxy byte reductions are diagnostic " + "and are not hosted API token savings.", + "", + "## Variant summary", + "", + "| Variant | Runs | Successes | Failure rate | Tokens/success | Bytes saved | Token proxy saved | Quality notes |", + "| --- | ---: | ---: | ---: | ---: | ---: | ---: | --- |", + ] + summaries = report.get("summary_by_variant") if isinstance(report.get("summary_by_variant"), dict) else {} + comparison_by_variant = { + item.get("variant"): item + for item in report.get("comparisons", []) + if isinstance(item, dict) + } + for variant, summary in sorted(summaries.items()): + if not isinstance(summary, dict): + continue + comparison = comparison_by_variant.get(variant, {}) + quality = comparison.get("quality_gate") if isinstance(comparison, dict) else None + if quality is None and summary.get("is_baseline_strategy"): + quality = "baseline" + lines.append( + "| " + + " | ".join([ + markdown_value(variant), + markdown_value(summary.get("runs")), + markdown_value(summary.get("successful_runs")), + markdown_value(summary.get("failure_rate")), + markdown_value(summary.get("tokens_per_successful_task")), + markdown_value(summary.get("bytes_saved_successful")), + markdown_value(summary.get("token_proxy_saved_successful")), + markdown_value(quality), + ]) + + " |" + ) + lines.extend([ + "", + "## Comparisons", + "", + "| Variant | Quality gate | Matched tasks | Token paired tasks | Token savings % | Shifted cost savings % |", + "| --- | --- | ---: | ---: | ---: | ---: |", + ]) + comparisons = report.get("comparisons") if isinstance(report.get("comparisons"), list) else [] + if comparisons: + for item in comparisons: + if not isinstance(item, dict): + continue + lines.append( + "| " + + " | ".join([ + markdown_value(item.get("variant")), + markdown_value(item.get("quality_gate")), + markdown_value(item.get("matched_successful_task_count")), + markdown_value(item.get("paired_token_task_count")), + markdown_value(item.get("token_savings_pct")), + 
markdown_value(item.get("cost_savings_pct_with_shift")), + ]) + + " |" + ) + else: + lines.append("| n/a | n/a | 0 | 0 | n/a | n/a |") + replay = report.get("replay_evidence") if isinstance(report.get("replay_evidence"), dict) else None + if replay is not None: + lines.extend([ + "", + "## Replay evidence provenance", + "", + f"- Source types: `{markdown_value(', '.join(replay.get('source_types') or []))}`", + f"- Claim scopes: `{markdown_value(', '.join(replay.get('claim_scopes') or []))}`", + f"- Same-run complete: `{markdown_value(replay.get('same_run_complete'))}`", + f"- Mixed/pre-existing CSV: `{markdown_value(replay.get('mixed_csv'))}`", + f"- Boundary: {markdown_value(replay.get('claim_boundary'))}", + ]) + else: + lines.extend([ + "", + "## Provenance note", + "", + "- CSV-only dashboards have unknown public-claim provenance unless regenerated from " + "the original evidence JSONL or a future trusted provenance ledger.", + ]) + lines.extend([ + "", + "## Re-run context", + "", + "- Evidence replay: `context-guard-bench --tasks --variants " + "--evidence-jsonl --csv --report-json " + "--dashboard-md `", + ]) + return "\n".join(lines) + "\n" + + +def write_report_outputs( + csv_path: Path, + report_path: Path | None, + dashboard_path: Path | None, + baseline_variant: str, + *, + replay_rows: list[EvidenceReplayRow] | None = None, + mixed_csv: bool = False, +) -> dict[str, Any]: + # Keep lock order stable across all derived writes: source CSV first, then + # report, then dashboard. Do not introduce a derived-output -> CSV path. 
+ with csv_file_lock(csv_path, create_parent=True): + report = summarize_benchmark_rows(read_csv_rows(csv_path), baseline_variant) + if replay_rows is not None: + report = annotate_replay_report(report, replay_rows, mixed_csv=mixed_csv) + if report_path is not None: + with csv_file_lock(report_path, create_parent=True): + write_text_no_follow( + report_path, + json.dumps(report, ensure_ascii=False, indent=2, sort_keys=True) + "\n", + ) + if dashboard_path is not None: + with csv_file_lock(dashboard_path, create_parent=True): + write_text_no_follow(dashboard_path, render_dashboard_markdown(report)) + return report + + def write_report_json(csv_path: Path, report_path: Path, baseline_variant: str) -> dict[str, Any]: # Keep lock order stable across all report writes: source CSV first, derived # report second. Do not introduce a report -> CSV path; that can deadlock # concurrent report generation. - with csv_file_lock(csv_path, create_parent=True): - report = summarize_benchmark_rows(read_csv_rows(csv_path), baseline_variant) - with csv_file_lock(report_path, create_parent=True): - write_text_no_follow( - report_path, - json.dumps(report, ensure_ascii=False, indent=2, sort_keys=True) + "\n", - ) - return report + return write_report_outputs(csv_path, report_path, None, baseline_variant) def sanitize_note_text(value: Any) -> str: @@ -2351,8 +2972,18 @@ def existing_file_identity(path: Path) -> tuple[int, int] | None: os.close(fd) -def validate_distinct_output_paths(csv_path: Path, ledger_path: Path | None, report_path: Path | None) -> None: - outputs = [("csv", csv_path), ("ledger-jsonl", ledger_path), ("report-json", report_path)] +def validate_distinct_output_paths( + csv_path: Path, + ledger_path: Path | None, + report_path: Path | None, + dashboard_path: Path | None = None, +) -> None: + outputs = [ + ("csv", csv_path), + ("ledger-jsonl", ledger_path), + ("report-json", report_path), + ("dashboard-md", dashboard_path), + ] seen: dict[Path, str] = {} seen_identity: 
dict[tuple[int, int], str] = {} for label, path in outputs: @@ -2391,12 +3022,16 @@ def main() -> int: help="optional JSONL ledger path for cost-shift accounting per run") parser.add_argument("--report-json", default=None, type=Path, help="optional A/B summary report JSON path generated from --csv after real runs") + parser.add_argument("--dashboard-md", default=None, type=Path, + help="optional Markdown dashboard path generated from the benchmark report") + parser.add_argument("--evidence-jsonl", default=None, type=Path, + help="optional validated run-evidence JSONL replay input; skips provider invocation") parser.add_argument("--baseline-variant", default="baseline", help="variant name used as the report baseline (default: baseline)") args = parser.parse_args() require_no_follow_file_ops_supported() - validate_distinct_output_paths(args.csv, args.ledger_jsonl, args.report_json) + validate_distinct_output_paths(args.csv, args.ledger_jsonl, args.report_json, args.dashboard_md) variants = parse_variants(args.variants) tasks = parse_tasks(args.tasks, variants=variants) @@ -2411,6 +3046,61 @@ def main() -> int: for task, variant in targets if (task.id, variant.name) not in skip_keys ] + if args.evidence_jsonl is not None: + if args.dry_run: + for task, variant in targets: + if (task.id, variant.name) in skip_keys: + print(f"skip {task.id}/{variant.name} (already in {args.csv})") + continue + print(f"evidence replay dry-run: {task.id}/{variant.name} <- {args.evidence_jsonl}") + print("completed 0 run(s); results in (dry-run; no CSV writes)") + return 0 + csv_had_preexisting_content = file_has_content_no_follow(args.csv) + evidence_rows = read_evidence_jsonl(args.evidence_jsonl) + evidence_by_key = validate_evidence_coverage(evidence_rows, runnable_targets) + claude_ver = "evidence-replay" + completed = 0 + replay_rows_written: list[EvidenceReplayRow] = [] + for task, variant in targets: + if (task.id, variant.name) in skip_keys: + print(f"skip {task.id}/{variant.name} 
(already in {args.csv})") + continue + evidence = evidence_by_key[(task.id, variant.name)] + print(f"replay {task.id}/{variant.name} ...", flush=True) + result = run_evidence_fixture(task, variant, evidence) + wrote = append_csv(args.csv, claude_ver, result, skip_existing=args.resume) + if wrote: + replay_rows_written.append(evidence) + if args.ledger_jsonl is not None: + append_cost_shift_ledger( + args.ledger_jsonl, + claude_ver, + result, + replay_provenance=evidence.provenance_payload(), + ) + completed += 1 + status = "ok" if result.success else "FAIL" + suffix = "" if wrote else " (CSV not updated; row already present)" + print( + f" {status} tokens={sum(result.tokens.values())} cost=${result.cost_usd:.4f} " + f"wall_time={result.wall_time_seconds:.3f}s {sanitize_note_text(result.notes)}{suffix}" + ) + if args.report_json is not None or args.dashboard_md is not None: + report = write_report_outputs( + args.csv, + args.report_json, + args.dashboard_md, + args.baseline_variant, + replay_rows=replay_rows_written, + mixed_csv=csv_had_preexisting_content or bool(skip_keys) or len(replay_rows_written) != int(completed), + ) + if args.report_json is not None: + print(f"report {args.report_json}: {report['claim_status']}") + if args.dashboard_md is not None: + print(f"dashboard {args.dashboard_md}: {report_public_claim_status(report)[0]}") + print(f"completed {completed} run(s); results in {args.csv}") + return 0 + placeholder_targets = [ f"{task.id}/{variant.name}" for task, variant in runnable_targets @@ -2463,9 +3153,12 @@ def main() -> int: f"wall_time={result.wall_time_seconds:.3f}s {sanitize_note_text(result.notes)}{suffix}" ) target = args.csv if not args.dry_run else "(dry-run; no CSV writes)" - if args.report_json is not None and not args.dry_run: - report = write_report_json(args.csv, args.report_json, args.baseline_variant) - print(f"report {args.report_json}: {report['claim_status']}") + if (args.report_json is not None or args.dashboard_md is not None) 
and not args.dry_run: + report = write_report_outputs(args.csv, args.report_json, args.dashboard_md, args.baseline_variant) + if args.report_json is not None: + print(f"report {args.report_json}: {report['claim_status']}") + if args.dashboard_md is not None: + print(f"dashboard {args.dashboard_md}: {report_public_claim_status(report)[0]}") print(f"completed {completed} run(s); results in {target}") return 0 diff --git a/docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl b/docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl new file mode 100644 index 0000000..f632e00 --- /dev/null +++ b/docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl @@ -0,0 +1,24 @@ +{"artifacts_used": 0, "bytes_after": 9450, "bytes_before": 9450, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_01_bugfix", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1715, "output_tokens": 229}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.17} +{"artifacts_used": 1, "bytes_after": 5481, "bytes_before": 9450, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 
1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_01_bugfix", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1131, "output_tokens": 210}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.13} +{"artifacts_used": 0, "bytes_after": 9900, "bytes_before": 9900, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_02_exploration", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1830, "output_tokens": 238}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.34} +{"artifacts_used": 1, "bytes_after": 5742, "bytes_before": 9900, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, 
"external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_02_exploration", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1207, "output_tokens": 218}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.26} +{"artifacts_used": 0, "bytes_after": 10350, "bytes_before": 10350, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_03_code_review", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1945, "output_tokens": 247}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.51} +{"artifacts_used": 1, "bytes_after": 6003, "bytes_before": 10350, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": 
true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_03_code_review", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1283, "output_tokens": 227}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.39} +{"artifacts_used": 0, "bytes_after": 10800, "bytes_before": 10800, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_04_long_log_analysis", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2060, "output_tokens": 256}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.68} +{"artifacts_used": 1, "bytes_after": 6264, "bytes_before": 10800, "corrections": 0, "cost_measured": false, 
"cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_04_long_log_analysis", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1359, "output_tokens": 235}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.52} +{"artifacts_used": 0, "bytes_after": 11250, "bytes_before": 11250, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_05_migration", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2175, "output_tokens": 265}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.85} +{"artifacts_used": 1, "bytes_after": 6525, 
"bytes_before": 11250, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_05_migration", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1435, "output_tokens": 243}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.65} +{"artifacts_used": 0, "bytes_after": 11700, "bytes_before": 11700, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_06_docs", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2290, "output_tokens": 274}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.02} 
+{"artifacts_used": 1, "bytes_after": 6785, "bytes_before": 11700, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_06_docs", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1511, "output_tokens": 252}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.78} +{"artifacts_used": 0, "bytes_after": 12150, "bytes_before": 12150, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_07_refactor", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2405, "output_tokens": 283}, "turns": 3, "variant": 
"baseline_full_context_fixture", "wall_time_seconds": 12.19} +{"artifacts_used": 1, "bytes_after": 7046, "bytes_before": 12150, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_07_refactor", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1587, "output_tokens": 260}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.91} +{"artifacts_used": 0, "bytes_after": 12600, "bytes_before": 12600, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_08_performance", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 
2520, "output_tokens": 292}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.36} +{"artifacts_used": 1, "bytes_after": 7307, "bytes_before": 12600, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_08_performance", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1663, "output_tokens": 268}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.04} +{"artifacts_used": 0, "bytes_after": 13050, "bytes_before": 13050, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_09_telemetry", "tokens": 
{"cache_creation": 0, "cache_read": 0, "input_tokens": 2635, "output_tokens": 301}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.53} +{"artifacts_used": 1, "bytes_after": 7568, "bytes_before": 13050, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_09_telemetry", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1739, "output_tokens": 276}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.17} +{"artifacts_used": 0, "bytes_after": 13500, "bytes_before": 13500, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, 
"task_id": "token_savings_10_cache_layout", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2750, "output_tokens": 310}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.7} +{"artifacts_used": 1, "bytes_after": 7829, "bytes_before": 13500, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_10_cache_layout", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1815, "output_tokens": 285}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.3} +{"artifacts_used": 0, "bytes_after": 13950, "bytes_before": 13950, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": 
"contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_11_tool_schema", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2865, "output_tokens": 319}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.87} +{"artifacts_used": 1, "bytes_after": 8090, "bytes_before": 13950, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_11_tool_schema", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1890, "output_tokens": 293}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.43} +{"artifacts_used": 0, "bytes_after": 14400, "bytes_before": 14400, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, 
"provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_12_artifact_receipt", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2980, "output_tokens": 328}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 13.04} +{"artifacts_used": 1, "bytes_after": 8352, "bytes_before": 14400, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_12_artifact_receipt", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1966, "output_tokens": 301}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.56} diff --git a/docs/benchmark-workflow-examples.md b/docs/benchmark-workflow-examples.md index 246fd4c..0661ed3 100644 --- a/docs/benchmark-workflow-examples.md +++ b/docs/benchmark-workflow-examples.md @@ -26,6 +26,7 @@ Use them to decide what evidence a workflow has and what it does **not** prove: 3. Treat `comparisons[].quality_gate != "pass"` as a warning to inspect failures, correction burden, and unmatched tasks before discussing savings. 4. Keep byte-proxy, provider-cache, wall-time, and shifted-cost evidence in separate language from provider-measured token/cost claims. 
Provider-cache telemetry is not independent savings proof.
 5. Keep self-hosted local/model-server latency, memory, and quality metrics in the run-evidence ledger sidecar; do not fold them into hosted API token/cost savings claims unless provider-measured matched-task evidence separately supports that claim.
+6. For deterministic local replay, add `--evidence-jsonl ... --dashboard-md ...`. Synthetic/manual replay evidence regenerates CSV/report/dashboard artifacts, but the report is marked `replay_only_not_public_claim` or `unknown_mixed_csv` unless every report row has complete provider-export provenance.
 
 ## Safe wording
 
@@ -42,3 +43,5 @@ The `.example.json` fixtures intentionally use full `context-guard-bench-report-
 The self-hosted metrics example is a JSONL run-evidence sidecar, not a full report shape. Its fields are additive ledger evidence only: `latency_ms`, `peak_memory_mb`, and normalized `quality_score` describe local/model-server behavior and leave hosted API report calculations unchanged. Use `context-guard experiments plan self-hosted-metrics-ledger --json ...` only as a dry-run ledger-preview checker for explicit metrics; it does not write the benchmark ledger.
 
 For task/variant starter fixtures rather than full report-shape examples, see [`experimental-benchmark-fixtures.md`](experimental-benchmark-fixtures.md). Those files are fixture-only and synthetic dry-run-only starters until users replace the placeholder prompts and success checks; they are not shipped OCR, visual-token, learned-compression, or output-transform benchmark results, and real claims still require provider-measured matched successful tasks plus failure-rate, correction, and shifted-cost guardrails.
+
+The token-savings 12-task starter also includes [`benchmark-fixtures/token-savings-12task.evidence.example.jsonl`](benchmark-fixtures/token-savings-12task.evidence.example.jsonl) for `context-guard-bench --evidence-jsonl` replay. That file is synthetic local replay evidence, not provider-measured savings proof; use it to validate dashboards and claim-boundary handling before collecting real provider exports.
diff --git a/docs/experimental-benchmark-fixtures.md b/docs/experimental-benchmark-fixtures.md
index c8469c7..9d913ec 100644
--- a/docs/experimental-benchmark-fixtures.md
+++ b/docs/experimental-benchmark-fixtures.md
@@ -12,6 +12,23 @@ Use them when designing an experiment that starts from ContextGuard's existing b
 5. Treat byte counts, image dimensions, OCR confidence, and local compressor ratios as proxy evidence. Real token/cost claims require **provider-measured** primary token/cost fields on both sides.
 6. Keep private screenshots, raw secrets, and external service endpoints out of fixture files.
 
+## Local replay evidence
+
+`context-guard-bench --evidence-jsonl ` can replay pre-recorded run evidence into the normal CSV/report pipeline without invoking `claude` or any task `success_command`. Pair it with `--report-json` and `--dashboard-md` to regenerate a deterministic local dashboard:
+
+```bash
+context-guard-bench \
+  --tasks docs/benchmark-fixtures/token-savings-12task.tasks.example.json \
+  --variants docs/benchmark-fixtures/token-savings-12task.variants.example.json \
+  --evidence-jsonl docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl \
+  --csv /tmp/contextguard-token-savings.csv \
+  --report-json /tmp/contextguard-token-savings.report.json \
+  --dashboard-md /tmp/contextguard-token-savings.dashboard.md \
+  --baseline-variant baseline_full_context_fixture
+```
+
+The included token-savings evidence file is deliberately `synthetic_fixture` provenance. It validates replay/dashboard mechanics and byte-proxy reporting only: replay forces synthetic/manual rows to `primary_tokens_measured=false` and `cost_measured=false`, so it is not public hosted API token/cost savings evidence even when token-looking numbers are present. A public claim still requires matched successful tasks, provider-export provenance, provider-measured primary tokens/cost, quality non-inferiority, and shifted-cost accounting.
+
 ## Runner-native variant prompt files
 
 `context-guard-bench` supports optional file-backed `variant_prompt_files` in task fixtures. The map is keyed by variant name and lets a single logical task swap sanitized prompt evidence per variant, for example a baseline raw-output prompt versus a digest plus artifact receipt prompt. Prompt files are resolved relative to the task JSON, must be relative paths, and are read with the same no-follow/symlink-safe posture as task and variant fixtures.
@@ -20,12 +37,12 @@ This runner-native swap only proves command shape and prompt selection until the
 
 ## Included fixture sets
 
-| Fixture set | Task file | Variant file | Intended future experiment |
-| --- | --- | --- | --- |
-| Visual/OCR evidence | [`benchmark-fixtures/visual-ocr.tasks.example.json`](benchmark-fixtures/visual-ocr.tasks.example.json) | [`benchmark-fixtures/visual-ocr.variants.example.json`](benchmark-fixtures/visual-ocr.variants.example.json) | Compare full visual evidence against cropped or OCR-derived evidence after the user supplies sanitized textual evidence, missed-context notes, crop/OCR telemetry, and provider telemetry. |
-| Learned compression | [`benchmark-fixtures/learned-compression.tasks.example.json`](benchmark-fixtures/learned-compression.tasks.example.json) | [`benchmark-fixtures/learned-compression.variants.example.json`](benchmark-fixtures/learned-compression.variants.example.json) | Compare sanitized baseline context packs against a fixture-only compressed digest candidate after exact retrieval or receipt fallback, quality gates, and shifted costs are measured.
| -| Reversible output transform | [`benchmark-fixtures/output-transform.tasks.example.json`](benchmark-fixtures/output-transform.tasks.example.json) | [`benchmark-fixtures/output-transform.variants.example.json`](benchmark-fixtures/output-transform.variants.example.json) | Compare raw sanitized command output against a digest plus artifact receipt after variant prompt files, success checks, and provider telemetry are supplied. | -| Token-savings 12-task roadmap | [`benchmark-fixtures/token-savings-12task.tasks.example.json`](benchmark-fixtures/token-savings-12task.tasks.example.json) | [`benchmark-fixtures/token-savings-12task.variants.example.json`](benchmark-fixtures/token-savings-12task.variants.example.json) | Exercise a canonical 12-task spread for bugfix, exploration, review, log analysis, migration, docs, refactor, performance, telemetry, cache layout, tool-schema deferral, and artifact receipt experiments after real success commands and provider telemetry are supplied. | +| Fixture set | Task file | Variant file | Evidence replay file | Intended future experiment | +| --- | --- | --- | --- | --- | +| Visual/OCR evidence | [`benchmark-fixtures/visual-ocr.tasks.example.json`](benchmark-fixtures/visual-ocr.tasks.example.json) | [`benchmark-fixtures/visual-ocr.variants.example.json`](benchmark-fixtures/visual-ocr.variants.example.json) | n/a | Compare full visual evidence against cropped or OCR-derived evidence after the user supplies sanitized textual evidence, missed-context notes, crop/OCR telemetry, and provider telemetry. 
| +| Learned compression | [`benchmark-fixtures/learned-compression.tasks.example.json`](benchmark-fixtures/learned-compression.tasks.example.json) | [`benchmark-fixtures/learned-compression.variants.example.json`](benchmark-fixtures/learned-compression.variants.example.json) | n/a | Compare sanitized baseline context packs against a fixture-only compressed digest candidate after exact retrieval or receipt fallback, quality gates, and shifted costs are measured. | +| Reversible output transform | [`benchmark-fixtures/output-transform.tasks.example.json`](benchmark-fixtures/output-transform.tasks.example.json) | [`benchmark-fixtures/output-transform.variants.example.json`](benchmark-fixtures/output-transform.variants.example.json) | n/a | Compare raw sanitized command output against a digest plus artifact receipt after variant prompt files, success checks, and provider telemetry are supplied. | +| Token-savings 12-task roadmap | [`benchmark-fixtures/token-savings-12task.tasks.example.json`](benchmark-fixtures/token-savings-12task.tasks.example.json) | [`benchmark-fixtures/token-savings-12task.variants.example.json`](benchmark-fixtures/token-savings-12task.variants.example.json) | [`benchmark-fixtures/token-savings-12task.evidence.example.jsonl`](benchmark-fixtures/token-savings-12task.evidence.example.jsonl) | Exercise a canonical 12-task spread for bugfix, exploration, review, log analysis, migration, docs, refactor, performance, telemetry, cache layout, tool-schema deferral, and artifact receipt experiments after real success commands and provider telemetry are supplied. | ## Visual/OCR fixture notes @@ -41,7 +58,7 @@ The output-transform fixtures describe already-sanitized command output comparis ## Token-savings 12-task roadmap fixture notes -The token-savings 12-task fixtures are a canonical **fixture-only** spread for roadmap-level A/B design. 
They demonstrate `variant_prompt_files` for a baseline full-context prompt versus a ContextGuard advisory-foundations prompt that may later include cache layout lint, core-vs-deferred tool schemas, artifact receipts, and claim-safe telemetry. They do not execute `context-guard-cache-score`, `context-guard-tool-prune`, or any provider call. +The token-savings 12-task fixtures are a canonical **fixture-only** spread for roadmap-level A/B design. They demonstrate `variant_prompt_files` for a baseline full-context prompt versus a ContextGuard advisory-foundations prompt that may later include cache layout lint, core-vs-deferred tool schemas, artifact receipts, and claim-safe telemetry. They do not execute `context-guard-cache-score`, `context-guard-tool-prune`, or any provider call. The companion `token-savings-12task.evidence.example.jsonl` lets users replay deterministic synthetic rows into CSV/report/dashboard outputs while preserving the same non-claim boundary. For real non-dry-run experiments, replace every placeholder `success_command`, keep task IDs matched across baseline and candidate variants, and require provider-measured primary token/cost data before interpreting `tokens_per_successful_task`, `total_cost_with_shift_usd`, or `external_cost_usd`. Cache predictions, char/4 token proxies, local latency, and byte reductions remain diagnostic proxy evidence unless the generated report contains matched successful task evidence and stays within the 10%p failure-rate guardrail. 
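Reviewer note on the replay boundary described in the fixture docs above: the following is a minimal sketch, not the shipped `context-guard-bench` parser. It assumes a row shaped like the 12-task evidence fixture, and the names `summarize_replay_row` and `BYTES_PER_PROXY_TOKEN` are hypothetical. It illustrates two documented rules: synthetic/manual rows are reported as unmeasured even when the row sets a measured flag, and byte deltas convert only to a char/4 proxy. The sketch deliberately omits the additional provider-export checks (`provider_name`, `capture_command_or_export_id`, `claim_scope`) that the real tool applies.

```python
import json

# Hypothetical names throughout; this mirrors the documented replay rule and
# is not the shipped context-guard-bench parser.
BYTES_PER_PROXY_TOKEN = 4  # assumed char/4 proxy; never a measured token count


def summarize_replay_row(row: dict) -> dict:
    """Summarize what one replayed evidence row can and cannot support."""
    provenance = row.get("provenance") or {}
    source_type = provenance.get("evidence_source_type")
    # Only provider exports may keep their measured flags in this sketch.
    provider_backed = source_type == "provider_export"
    bytes_saved = max(0, row.get("bytes_before", 0) - row.get("bytes_after", 0))
    return {
        "source_type": source_type,
        # Synthetic/manual rows are reported as unmeasured whatever the row says.
        "primary_tokens_measured": provider_backed and bool(row.get("primary_tokens_measured")),
        "cost_measured": provider_backed and bool(row.get("cost_measured")),
        "bytes_saved": bytes_saved,
        # Diagnostic proxy only; it does not support a token-savings claim.
        "proxy_tokens_saved": bytes_saved // BYTES_PER_PROXY_TOKEN,
    }


# Byte counts copied from the fixture's artifact-receipt candidate row;
# primary_tokens_measured is flipped to true here only to show the override.
row = json.loads(
    '{"bytes_before": 14400, "bytes_after": 8352, '
    '"primary_tokens_measured": true, "cost_measured": false, '
    '"provenance": {"evidence_source_type": "synthetic_fixture"}}'
)
print(summarize_replay_row(row))
```

For this row the byte delta is 14400 - 8352 = 6048 bytes, or 1512 proxy tokens at 4 bytes per token. Both figures stay diagnostic proxy evidence, and the measured flags come back false because the provenance is `synthetic_fixture`.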
diff --git a/package.json b/package.json index 215b611..03f9d9b 100644 --- a/package.json +++ b/package.json @@ -59,6 +59,7 @@ "docs/benchmark-workflows/*.example.jsonl", "docs/benchmark-workflow-examples.md", "docs/benchmark-fixtures/*.example.json", + "docs/benchmark-fixtures/*.example.jsonl", "docs/benchmark-fixtures/*.prompt.example.md", "docs/experimental-benchmark-fixtures.md", "packaging/homebrew/context-guard.rb.template" diff --git a/plugins/context-guard/README.ko.md b/plugins/context-guard/README.ko.md index 9340a80..8345784 100644 --- a/plugins/context-guard/README.ko.md +++ b/plugins/context-guard/README.ko.md @@ -114,7 +114,7 @@ brief 모드는 코딩 에이전트가 군더더기를 줄이도록 요청하되 ## 절감 수치를 과장하지 않습니다 -이 헬퍼들은 흔히 컨텍스트를 불필요하게 키우는 원인을 줄이지만, 고정된 절감률을 보장하지 않습니다. 실제 전후 비교 증거가 필요하면 `context-guard-bench --ledger-jsonl ... --report-json ...`로 본인 작업에서 측정하세요. 토큰 절감 주장은 대응 태스크 양쪽 모두에 `primary_tokens_measured`가 있을 때만 계산하며, report의 `matched_pair_evidence`가 성공한 baseline/variant task bucket을 transform, quality gate, 측정 가능 여부, claim boundary와 연결합니다. wall-time과 provider-cache 필드는 진단용 텔레메트리이지 단독 절감 증거가 아닙니다. 감사의 `cache_friendliness`, [`cache_diagnostics`](https://github.com/ictechgy/context-guard/blob/main/docs/cache-diagnostics-schema.md), `cache_layout_advice`는 관측/추론/가설/불가 경계를 둔 휴리스틱 배치·cache-read 신호와 순위화된 확인/실험이며 청구 기준이나 provider-cache 증명이 아닙니다. 벤치마크 CSV 스키마는 엄격하므로 헬퍼 업그레이드 후에는 새 CSV를 시작하거나 헤더를 마이그레이션하세요. 작업 유형별 합성 예시는 [`docs/benchmark-workflow-examples.md`](https://github.com/ictechgy/context-guard/blob/main/docs/benchmark-workflow-examples.md)에 있고, fixture-only 실험 시작 예시는 [`docs/experimental-benchmark-fixtures.md`](https://github.com/ictechgy/context-guard/blob/main/docs/experimental-benchmark-fixtures.md)에 있습니다. +이 헬퍼들은 흔히 컨텍스트를 불필요하게 키우는 원인을 줄이지만, 고정된 절감률을 보장하지 않습니다. 실제 전후 비교 증거가 필요하면 `context-guard-bench --ledger-jsonl ... --report-json ... --dashboard-md ...`로 본인 작업에서 측정하세요. 
`--evidence-jsonl ...`는 deterministic local replay용이며 provider-export provenance가 완전하지 않으면 public claim 불가로 표시됩니다. 토큰 절감 주장은 대응 태스크 양쪽 모두에 `primary_tokens_measured`가 있을 때만 계산하며, report의 `matched_pair_evidence`가 성공한 baseline/variant task bucket을 transform, quality gate, 측정 가능 여부, claim boundary와 연결합니다. wall-time과 provider-cache 필드는 진단용 텔레메트리이지 단독 절감 증거가 아닙니다. 감사의 `cache_friendliness`, [`cache_diagnostics`](https://github.com/ictechgy/context-guard/blob/main/docs/cache-diagnostics-schema.md), `cache_layout_advice`는 관측/추론/가설/불가 경계를 둔 휴리스틱 배치·cache-read 신호와 순위화된 확인/실험이며 청구 기준이나 provider-cache 증명이 아닙니다. 벤치마크 CSV 스키마는 엄격하므로 헬퍼 업그레이드 후에는 새 CSV를 시작하거나 헤더를 마이그레이션하세요. 작업 유형별 합성 예시는 [`docs/benchmark-workflow-examples.md`](https://github.com/ictechgy/context-guard/blob/main/docs/benchmark-workflow-examples.md)에 있고, fixture-only 실험 시작 예시는 [`docs/experimental-benchmark-fixtures.md`](https://github.com/ictechgy/context-guard/blob/main/docs/experimental-benchmark-fixtures.md)에 있습니다. ContextGuard는 모델 토큰을 줄이기 위해 작업을 외부 AI 서비스로 전송하지 않습니다. 모든 헬퍼 명령은 로컬에서 동작합니다. 로컬 RAM/디스크 보관본은 다음에 보낼 컨텍스트를 줄이는 데 도움될 수 있지만 provider prompt cache를 대체하지 않습니다. Anthropic 배포나 청구 설명 전에는 공식 prompt caching/pricing 문서를 다시 확인하세요: https://docs.anthropic.com/en/build-with-claude/prompt-caching 및 https://platform.claude.com/docs/en/about-claude/pricing. diff --git a/plugins/context-guard/README.md b/plugins/context-guard/README.md index d3c10c1..d625ebd 100644 --- a/plugins/context-guard/README.md +++ b/plugins/context-guard/README.md @@ -123,7 +123,7 @@ Three deterministic levels — `lite`, `standard`, `ultra` — live under [`brie ## Conservative claims -These helpers reduce common sources of context bloat, but they do not guarantee a fixed percentage savings. Use `context-guard-bench --ledger-jsonl ... 
--report-json ...` when you need measured before/after evidence for your own tasks; token-savings claims require `primary_tokens_measured` on both matched sides, and the report's `matched_pair_evidence` links each successful baseline/variant task bucket to the transform, quality gate, measurement availability, and claim boundary. Wall-time/provider-cache fields are diagnostic telemetry, not standalone savings proof. Audit `cache_friendliness`, [`cache_diagnostics`](https://github.com/ictechgy/context-guard/blob/main/docs/cache-diagnostics-schema.md), and `cache_layout_advice` findings are heuristic layout/cache-read signals and ranked checks/experiments with observed/inferred/hypothesis/unavailable boundaries, not billing authority or provider-cache proof. Benchmark CSV schemas are strict, so start a new CSV or migrate the header after helper upgrades. Workflow-specific synthetic examples live in [`docs/benchmark-workflow-examples.md`](https://github.com/ictechgy/context-guard/blob/main/docs/benchmark-workflow-examples.md), and fixture-only experimental task/variant starters live in [`docs/experimental-benchmark-fixtures.md`](https://github.com/ictechgy/context-guard/blob/main/docs/experimental-benchmark-fixtures.md). +These helpers reduce common sources of context bloat, but they do not guarantee a fixed percentage savings. Use `context-guard-bench --ledger-jsonl ... --report-json ... --dashboard-md ...` when you need measured before/after evidence for your own tasks; add `--evidence-jsonl ...` only for deterministic local replay that remains non-claim-eligible unless provider-export provenance is complete; token-savings claims require `primary_tokens_measured` on both matched sides, and the report's `matched_pair_evidence` links each successful baseline/variant task bucket to the transform, quality gate, measurement availability, and claim boundary. Wall-time/provider-cache fields are diagnostic telemetry, not standalone savings proof. 
Audit `cache_friendliness`, [`cache_diagnostics`](https://github.com/ictechgy/context-guard/blob/main/docs/cache-diagnostics-schema.md), and `cache_layout_advice` findings are heuristic layout/cache-read signals and ranked checks/experiments with observed/inferred/hypothesis/unavailable boundaries, not billing authority or provider-cache proof. Benchmark CSV schemas are strict, so start a new CSV or migrate the header after helper upgrades. Workflow-specific synthetic examples live in [`docs/benchmark-workflow-examples.md`](https://github.com/ictechgy/context-guard/blob/main/docs/benchmark-workflow-examples.md), and fixture-only experimental task/variant starters live in [`docs/experimental-benchmark-fixtures.md`](https://github.com/ictechgy/context-guard/blob/main/docs/experimental-benchmark-fixtures.md). ContextGuard also does not send work to external AI providers to save model tokens. All helper commands run locally. Local RAM/disk receipts can reduce what you choose to send, but they do not replace a provider prompt cache. Before release or billing claims for Anthropic, recheck the official prompt-caching and pricing docs: https://docs.anthropic.com/en/build-with-claude/prompt-caching and https://platform.claude.com/docs/en/about-claude/pricing. diff --git a/plugins/context-guard/bin/context-guard-bench b/plugins/context-guard/bin/context-guard-bench index e338b88..a1af3ab 100755 --- a/plugins/context-guard/bin/context-guard-bench +++ b/plugins/context-guard/bin/context-guard-bench @@ -178,6 +178,8 @@ EXTERNAL_SOURCE_KEY_GROUPS: tuple[tuple[str, tuple[str, ...], tuple[str, ...]], ) MAX_USAGE_TOKEN_COUNT = 10**12 MAX_USAGE_COST_USD = 10**9 +MAX_EVIDENCE_JSONL_BYTES = 5_000_000 +MAX_EVIDENCE_JSONL_LINES = 100_000 # Byte -> token proxy 환산 계수. 측정된 모델 토큰이 아니라 byte delta 기반 보수적 # 추정치이며, report에서 evidence="inferred"로 분명히 라벨링한다. 영어 텍스트 기준 # ~4 bytes/token의 통용 근사값을 사용한다. 
@@ -188,6 +190,25 @@ MEASUREMENT_BASELINE_SCHEMA_VERSION = "contextguard.bench.measurement-baseline.v SELF_HOSTED_METRICS_SCHEMA_VERSION = "contextguard.bench.self-hosted-metrics.v1" SELF_HOSTED_METRICS_KEY = "self_hosted_metrics" SELF_HOSTED_METRICS_CLAIM_BOUNDARY = "self_hosted_metrics_only_not_hosted_api_token_or_cost_savings" +EVIDENCE_REPLAY_SOURCE_TYPES = frozenset({"synthetic_fixture", "provider_export", "manual_audit"}) +PROVIDER_EXPORT_PUBLIC_CLAIM_SCOPES = frozenset({ + "provider_measured_matched_task", + "provider_measured_matched_task_public_claim", + "hosted_api_provider_measured_matched_task", +}) +REPLAY_PUBLIC_CLAIM_CANDIDATE_STATUS = "provider_export_public_claim_candidate" +REPLAY_PROVIDER_CLAIM_GATES_NOT_MET_STATUS = "provider_export_claim_gates_not_met" +REPLAY_NOT_PUBLIC_CLAIM_STATUS = "replay_only_not_public_claim" +REPLAY_UNKNOWN_MIXED_CSV_STATUS = "unknown_mixed_csv" +REPLAY_PUBLIC_CLAIM_ELIGIBLE_RAW_STATUSES = frozenset({ + "token_and_shifted_cost_savings_observed", +}) +REPLAY_CLAIM_BOUNDARY = ( + "Evidence replay is an import/replay mode. Synthetic fixtures and manual audits are never " + "hosted API token/cost savings evidence; public claims require complete provider_export " + "provenance for every report row plus the normal matched-task quality, token, cost, and " + "shifted-cost gates." 
+) MAX_SELF_HOSTED_LABEL_CHARS = 120 MAX_SELF_HOSTED_LATENCY_MS = 7 * 24 * 60 * 60 * 1000 MAX_SELF_HOSTED_MEMORY_MB = 10_000_000 @@ -401,6 +422,36 @@ class RunResult: self_hosted_metrics: dict[str, Any] | None = None +@dataclass +class EvidenceReplayRow: + result: RunResult + source_type: str + provider_name: str | None + capture_command_or_export_id: str | None + claim_scope: str + provider_export_provenance_complete: bool + public_claim_eligible: bool + line_number: int + + @property + def key(self) -> tuple[str, str]: + return (self.result.task_id, self.result.variant) + + def provenance_payload(self) -> dict[str, Any]: + return { + "schema_version": BENCH_RUN_EVIDENCE_SCHEMA_VERSION, + "mode": "evidence_jsonl_replay", + "evidence_source_type": self.source_type, + "provider_name": self.provider_name, + "capture_command_or_export_id": self.capture_command_or_export_id, + "claim_scope": self.claim_scope, + "provider_export_provenance_complete": self.provider_export_provenance_complete, + "public_claim_eligible": self.public_claim_eligible, + "line_number": self.line_number, + "claim_boundary": REPLAY_CLAIM_BOUNDARY, + } + + @dataclass class BoundedProcessResult: returncode: int @@ -1362,7 +1413,13 @@ def write_text_no_follow(path: Path, text: str) -> None: os.close(fd) -def append_cost_shift_ledger(path: Path, claude_ver: str, result: RunResult) -> None: +def append_cost_shift_ledger( + path: Path, + claude_ver: str, + result: RunResult, + *, + replay_provenance: dict[str, Any] | None = None, +) -> None: shifted_cost_known = cost_shift_measured(result) byte_metrics_observed = bool(result.bytes_before or result.bytes_after) payload = { @@ -1413,6 +1470,10 @@ def append_cost_shift_ledger(path: Path, claude_ver: str, result: RunResult) -> } if result.self_hosted_metrics is not None: payload["self_hosted_metrics"] = result.self_hosted_metrics + if replay_provenance is not None: + payload["replay_provenance"] = replay_provenance + payload["evidence_source_type"] = 
replay_provenance.get("evidence_source_type") + payload["public_claim_eligible"] = bool(replay_provenance.get("public_claim_eligible")) with csv_file_lock(path, create_parent=True): fd = _open_regular_no_symlink(path, os.O_CREAT | os.O_APPEND | os.O_WRONLY, 0o600, create_parent=True) try: @@ -1488,6 +1549,354 @@ def read_csv_rows(csv_path: Path) -> list[dict[str, str]]: os.close(fd) +def file_has_content_no_follow(path: Path) -> bool: + try: + fd = _open_regular_no_symlink(path) + except FileNotFoundError: + return False + try: + return os.fstat(fd).st_size > 0 + finally: + os.close(fd) + + +def require_evidence_object(raw: Any, *, owner: str) -> dict[str, Any]: + if not isinstance(raw, dict): + raise SystemExit(f"{owner} evidence row must be a JSON object") + return raw + + +def evidence_non_empty_string(raw: Any, *, field: str, owner: str, required: bool = True) -> str | None: + if raw is None: + if required: + raise SystemExit(f"{owner} {field} must be a non-empty string") + return None + if not isinstance(raw, str): + raise SystemExit(f"{owner} {field} must be a string") + text = sanitize_note_text(raw) + if not text: + if required: + raise SystemExit(f"{owner} {field} must be a non-empty string") + return None + return text + + +def evidence_bool(raw: Any, *, field: str, owner: str, default: bool = False) -> bool: + if raw is None: + return default + if not isinstance(raw, bool): + raise SystemExit(f"{owner} {field} must be a boolean") + return raw + + +def evidence_nonnegative_int( + raw: Any, + *, + field: str, + owner: str, + default: int = 0, + maximum: int = MAX_USAGE_TOKEN_COUNT, +) -> int: + if raw is None: + return default + value = normalize_usage_token(raw) + if value is None or value > maximum: + raise SystemExit(f"{owner} {field} must be a finite non-negative integer") + return value + + +def evidence_nonnegative_float( + raw: Any, + *, + field: str, + owner: str, + default: float = 0.0, + maximum: float = MAX_USAGE_COST_USD, +) -> float: + if raw 
is None: + return default + if isinstance(raw, bool) or not isinstance(raw, (int, float)): + raise SystemExit(f"{owner} {field} must be a finite non-negative number") + value = float(raw) + if not math.isfinite(value) or value < 0 or value > maximum: + raise SystemExit(f"{owner} {field} must be a finite non-negative number") + return value + + +def evidence_first(raw: dict[str, Any], *keys: str) -> Any: + for key in keys: + if key in raw: + return raw[key] + return None + + +def parse_evidence_provenance(raw: dict[str, Any], *, owner: str) -> dict[str, Any]: + provenance = raw.get("provenance") + if provenance is not None and not isinstance(provenance, dict): + raise SystemExit(f"{owner} provenance must be a JSON object") + source_raw = ( + provenance.get("evidence_source_type") + if isinstance(provenance, dict) and "evidence_source_type" in provenance + else raw.get("evidence_source_type") + ) + source_type = evidence_non_empty_string(source_raw, field="evidence_source_type", owner=owner) + assert source_type is not None + if source_type not in EVIDENCE_REPLAY_SOURCE_TYPES: + raise SystemExit( + f"{owner} evidence_source_type must be one of: {', '.join(sorted(EVIDENCE_REPLAY_SOURCE_TYPES))}" + ) + provider_name = evidence_non_empty_string( + provenance.get("provider_name") if isinstance(provenance, dict) else raw.get("provider_name"), + field="provider_name", + owner=owner, + required=False, + ) + capture_id = evidence_non_empty_string( + ( + provenance.get("capture_command_or_export_id") + if isinstance(provenance, dict) and "capture_command_or_export_id" in provenance + else raw.get("capture_command_or_export_id") + ), + field="capture_command_or_export_id", + owner=owner, + required=False, + ) + claim_scope = evidence_non_empty_string( + provenance.get("claim_scope") if isinstance(provenance, dict) else raw.get("claim_scope"), + field="claim_scope", + owner=owner, + ) + assert claim_scope is not None + provider_authority = ( + source_type == "provider_export" + 
and provider_name is not None + and capture_id is not None + and claim_scope in PROVIDER_EXPORT_PUBLIC_CLAIM_SCOPES + ) + return { + "source_type": source_type, + "provider_name": provider_name, + "capture_command_or_export_id": capture_id, + "claim_scope": claim_scope, + "provider_public_claim_authority": provider_authority, + } + + +def parse_evidence_tokens(raw: dict[str, Any], *, owner: str) -> tuple[dict[str, int], set[str]]: + token_block = raw.get("tokens") + if token_block is not None and not isinstance(token_block, dict): + raise SystemExit(f"{owner} tokens must be a JSON object") + tokens: dict[str, int] = {} + observed: set[str] = set() + source = token_block if isinstance(token_block, dict) else {} + for bucket, _keys in USAGE_KEY_GROUPS: + value = source.get(bucket) if bucket in source else raw.get(bucket) + if value is not None: + observed.add(bucket) + tokens[bucket] = evidence_nonnegative_int(value, field=bucket, owner=owner) + return tokens, observed + + +def parse_evidence_row(raw_value: Any, *, owner: str, line_number: int) -> EvidenceReplayRow: + raw = require_evidence_object(raw_value, owner=owner) + schema = evidence_non_empty_string(raw.get("schema_version"), field="schema_version", owner=owner) + if schema != BENCH_RUN_EVIDENCE_SCHEMA_VERSION: + raise SystemExit( + f"{owner} schema_version must be {BENCH_RUN_EVIDENCE_SCHEMA_VERSION}" + ) + task_id = evidence_non_empty_string(raw.get("task_id"), field="task_id", owner=owner) + variant = evidence_non_empty_string(raw.get("variant"), field="variant", owner=owner) + assert task_id is not None and variant is not None + provenance = parse_evidence_provenance(raw, owner=owner) + provider_authority = bool(provenance["provider_public_claim_authority"]) + raw_primary_tokens_measured = evidence_bool( + raw.get("primary_tokens_measured"), + field="primary_tokens_measured", + owner=owner, + ) + raw_cost_measured = evidence_bool( + evidence_first(raw, "cost_measured", "primary_cost_measured"), + 
field="cost_measured", + owner=owner, + ) + if provenance["source_type"] in {"synthetic_fixture", "manual_audit"}: + primary_tokens_measured = False + cost_measured = False + elif provider_authority: + primary_tokens_measured = raw_primary_tokens_measured + cost_measured = raw_cost_measured + else: + if raw_primary_tokens_measured or raw_cost_measured: + raise SystemExit( + f"{owner} provider_export measured flags require provider_name, " + "capture_command_or_export_id, and a provider-measured matched-task claim_scope" + ) + primary_tokens_measured = False + cost_measured = False + + tokens, observed_token_buckets = parse_evidence_tokens(raw, owner=owner) + if primary_tokens_measured and not {"input_tokens", "output_tokens"}.issubset(observed_token_buckets): + raise SystemExit( + f"{owner} primary_tokens_measured=true requires input_tokens and output_tokens evidence" + ) + cost_usd = evidence_nonnegative_float( + evidence_first(raw, "cost_usd", "primary_cost_usd"), + field="cost_usd", + owner=owner, + ) + if cost_measured and "cost_usd" not in raw and "primary_cost_usd" not in raw: + raise SystemExit(f"{owner} cost_measured=true requires cost_usd evidence") + + if "success" not in raw: + raise SystemExit(f"{owner} success must be a boolean") + success = evidence_bool(raw.get("success"), field="success", owner=owner) + notes = evidence_non_empty_string(raw.get("notes"), field="notes", owner=owner, required=False) + model = evidence_non_empty_string(raw.get("model"), field="model", owner=owner, required=False) or "evidence-replay" + effort = evidence_non_empty_string(raw.get("effort"), field="effort", owner=owner, required=False) or "" + self_hosted_metrics = None + if SELF_HOSTED_METRICS_KEY in raw: + self_hosted_metrics = normalize_self_hosted_metrics( + raw.get(SELF_HOSTED_METRICS_KEY), + source="evidence_jsonl.self_hosted_metrics", + ) + if self_hosted_metrics is None: + raise SystemExit(f"{owner} self_hosted_metrics must be normalized explicit metrics") + + 
result = RunResult( + task_id=task_id, + variant=variant, + model=model, + effort=effort, + tokens=tokens, + cost_usd=cost_usd, + success=success, + notes=notes or f"evidence replay ({provenance['source_type']})", + corrections=evidence_nonnegative_int(raw.get("corrections"), field="corrections", owner=owner), + cost_measured=cost_measured, + wall_time_seconds=evidence_nonnegative_float( + raw.get("wall_time_seconds"), + field="wall_time_seconds", + owner=owner, + maximum=MAX_SELF_HOSTED_LATENCY_MS / 1000, + ), + turns=evidence_nonnegative_int(raw.get("turns"), field="turns", owner=owner), + hook_triggers=evidence_nonnegative_int(raw.get("hook_triggers"), field="hook_triggers", owner=owner), + bytes_before=evidence_nonnegative_int(raw.get("bytes_before"), field="bytes_before", owner=owner), + bytes_after=evidence_nonnegative_int(raw.get("bytes_after"), field="bytes_after", owner=owner), + artifacts_used=evidence_nonnegative_int(raw.get("artifacts_used"), field="artifacts_used", owner=owner), + external_tokens=evidence_nonnegative_int(raw.get("external_tokens"), field="external_tokens", owner=owner), + external_tokens_measured=evidence_bool( + raw.get("external_tokens_measured"), + field="external_tokens_measured", + owner=owner, + ), + external_cost_usd=evidence_nonnegative_float( + raw.get("external_cost_usd"), + field="external_cost_usd", + owner=owner, + ), + external_cost_measured=evidence_bool( + raw.get("external_cost_measured"), + field="external_cost_measured", + owner=owner, + ), + provider_cached_tokens=evidence_nonnegative_int( + raw.get("provider_cached_tokens"), + field="provider_cached_tokens", + owner=owner, + ), + provider_cached_tokens_measured=evidence_bool( + raw.get("provider_cached_tokens_measured"), + field="provider_cached_tokens_measured", + owner=owner, + ), + primary_tokens_measured=primary_tokens_measured, + self_hosted_metrics=self_hosted_metrics, + ) + return EvidenceReplayRow( + result=result, + 
source_type=str(provenance["source_type"]), + provider_name=provenance["provider_name"], + capture_command_or_export_id=provenance["capture_command_or_export_id"], + claim_scope=str(provenance["claim_scope"]), + provider_export_provenance_complete=provider_authority, + public_claim_eligible=False, + line_number=line_number, + ) + + +def read_evidence_jsonl(path: Path) -> list[EvidenceReplayRow]: + fd = _open_regular_no_symlink(path) + try: + size = os.fstat(fd).st_size + if size > MAX_EVIDENCE_JSONL_BYTES: + raise SystemExit( + f"evidence JSONL exceeds {MAX_EVIDENCE_JSONL_BYTES} bytes: {path}" + ) + rows: list[EvidenceReplayRow] = [] + with os.fdopen(fd, "r", encoding="utf-8") as handle: + fd = -1 + for line_number, line in enumerate(handle, start=1): + if line_number > MAX_EVIDENCE_JSONL_LINES: + raise SystemExit( + f"evidence JSONL line limit exceeded for {path}: > {MAX_EVIDENCE_JSONL_LINES}" + ) + if not line.strip(): + continue + try: + payload = json.loads(line) + except json.JSONDecodeError as exc: + raise SystemExit( + f"{path}:{line_number} evidence row must be JSON: {exc.msg}" + ) from None + rows.append(parse_evidence_row(payload, owner=f"{path}:{line_number}", line_number=line_number)) + finally: + if fd != -1: + os.close(fd) + if not rows: + raise SystemExit(f"evidence JSONL contains no rows: {path}") + return rows + + +def validate_evidence_coverage( + evidence_rows: list[EvidenceReplayRow], + runnable_targets: list[tuple[TaskFixture, Variant]], +) -> dict[tuple[str, str], EvidenceReplayRow]: + by_key: dict[tuple[str, str], EvidenceReplayRow] = {} + for row in evidence_rows: + if row.key in by_key: + raise SystemExit( + f"duplicate evidence row for {row.key[0]}/{row.key[1]} " + f"(lines {by_key[row.key].line_number} and {row.line_number})" + ) + by_key[row.key] = row + missing = [ + f"{task.id}/{variant.name}" + for task, variant in runnable_targets + if (task.id, variant.name) not in by_key + ] + if missing: + raise SystemExit(f"missing evidence 
row(s) for selected targets: {', '.join(missing)}") + return { + (task.id, variant.name): by_key[(task.id, variant.name)] + for task, variant in runnable_targets + } + + +def run_evidence_fixture(task: TaskFixture, variant: Variant, evidence: EvidenceReplayRow) -> RunResult: + result = evidence.result + if result.task_id != task.id or result.variant != variant.name: + raise SystemExit( + f"evidence target mismatch: expected {task.id}/{variant.name}, " + f"got {result.task_id}/{result.variant}" + ) + if result.model == "evidence-replay": + result.model = task.model + if not result.effort: + result.effort = task.effort or "" + return result + + def row_int(row: dict[str, str], key: str) -> int: try: return int(float(row.get(key) or 0)) @@ -2277,18 +2686,230 @@ def summarize_benchmark_rows(rows: list[dict[str, str]], baseline_variant: str) ), } +def annotate_replay_report( + report: dict[str, Any], + replay_rows: list[EvidenceReplayRow], + *, + mixed_csv: bool, +) -> dict[str, Any]: + source_types = sorted({row.source_type for row in replay_rows}) + provider_names = sorted({row.provider_name for row in replay_rows if row.provider_name}) + claim_scopes = sorted({row.claim_scope for row in replay_rows}) + same_run_complete = (not mixed_csv) and len(replay_rows) == int(report.get("row_count") or 0) + all_provider_claim_authority = bool(replay_rows) and all( + row.provider_export_provenance_complete for row in replay_rows + ) + raw_claim_status = str(report.get("claim_status") or "") + matched_pair_evidence = report.get("matched_pair_evidence") + matched_claim_gates_allow_public_claim = ( + isinstance(matched_pair_evidence, list) + and bool(matched_pair_evidence) + and all( + isinstance(item, dict) + and isinstance(item.get("claim_boundary"), dict) + and bool(item["claim_boundary"].get("token_savings_claim_allowed")) + and bool(item["claim_boundary"].get("shifted_cost_claim_allowed")) + for item in matched_pair_evidence + ) + ) + report_claim_gates_allow_public_claim = ( 
+ raw_claim_status in REPLAY_PUBLIC_CLAIM_ELIGIBLE_RAW_STATUSES + and matched_claim_gates_allow_public_claim + ) + if not same_run_complete: + public_claim_status = REPLAY_UNKNOWN_MIXED_CSV_STATUS + public_claim_eligible = False + elif all_provider_claim_authority and report_claim_gates_allow_public_claim: + public_claim_status = REPLAY_PUBLIC_CLAIM_CANDIDATE_STATUS + public_claim_eligible = True + elif all_provider_claim_authority: + public_claim_status = REPLAY_PROVIDER_CLAIM_GATES_NOT_MET_STATUS + public_claim_eligible = False + else: + public_claim_status = REPLAY_NOT_PUBLIC_CLAIM_STATUS + public_claim_eligible = False + report["raw_metric_claim_status"] = raw_claim_status + report["public_claim_status"] = public_claim_status + report["public_claim_eligible"] = public_claim_eligible + if not public_claim_eligible: + report["claim_status"] = public_claim_status + report["replay_evidence"] = { + "schema_version": BENCH_RUN_EVIDENCE_SCHEMA_VERSION, + "mode": "evidence_jsonl_replay", + "row_count": len(replay_rows), + "source_types": source_types, + "provider_names": provider_names, + "claim_scopes": claim_scopes, + "same_run_complete": same_run_complete, + "mixed_csv": mixed_csv, + "provider_export_provenance_complete": all_provider_claim_authority, + "report_claim_gates_allow_public_claim": report_claim_gates_allow_public_claim, + "public_claim_status": public_claim_status, + "public_claim_eligible": public_claim_eligible, + "target_keys": [f"{row.result.task_id}/{row.result.variant}" for row in replay_rows], + "claim_boundary": REPLAY_CLAIM_BOUNDARY, + } + return report + + +def report_public_claim_status(report: dict[str, Any]) -> tuple[str, bool | None]: + if "public_claim_status" in report: + return str(report.get("public_claim_status")), bool(report.get("public_claim_eligible")) + return ( + "csv_provenance_unknown_requires_original_evidence_or_trusted_ledger", + None, + ) + + +def markdown_value(value: Any) -> str: + if value is None: + return "n/a" + if 
isinstance(value, bool): + return "true" if value else "false" + if isinstance(value, float): + return f"{value:.6g}" + text = sanitize_note_text(value) + return text.replace("|", "\\|") or "n/a" + + +def render_dashboard_markdown(report: dict[str, Any]) -> str: + public_claim_status, public_claim_eligible = report_public_claim_status(report) + metric_claim_status = report.get("raw_metric_claim_status", report.get("claim_status")) + lines = [ + "# ContextGuard Benchmark Dashboard", + "", + f"- Schema: `{markdown_value(report.get('schema'))}`", + f"- Baseline variant: `{markdown_value(report.get('baseline_variant'))}`", + f"- Rows: {markdown_value(report.get('row_count'))}", + f"- Metric claim status: `{markdown_value(metric_claim_status)}`", + f"- Public claim status: `{markdown_value(public_claim_status)}`", + f"- Public claim eligible: `{markdown_value(public_claim_eligible)}`", + "", + "> Claim boundary: this dashboard is not a hosted savings claim unless report claim gates " + "allow it and public-claim provenance is complete. 
Proxy byte reductions are diagnostic " + "and are not hosted API token savings.", + "", + "## Variant summary", + "", + "| Variant | Runs | Successes | Failure rate | Tokens/success | Bytes saved | Token proxy saved | Quality notes |", + "| --- | ---: | ---: | ---: | ---: | ---: | ---: | --- |", + ] + summaries = report.get("summary_by_variant") if isinstance(report.get("summary_by_variant"), dict) else {} + comparison_by_variant = { + item.get("variant"): item + for item in report.get("comparisons", []) + if isinstance(item, dict) + } + for variant, summary in sorted(summaries.items()): + if not isinstance(summary, dict): + continue + comparison = comparison_by_variant.get(variant, {}) + quality = comparison.get("quality_gate") if isinstance(comparison, dict) else None + if quality is None and summary.get("is_baseline_strategy"): + quality = "baseline" + lines.append( + "| " + + " | ".join([ + markdown_value(variant), + markdown_value(summary.get("runs")), + markdown_value(summary.get("successful_runs")), + markdown_value(summary.get("failure_rate")), + markdown_value(summary.get("tokens_per_successful_task")), + markdown_value(summary.get("bytes_saved_successful")), + markdown_value(summary.get("token_proxy_saved_successful")), + markdown_value(quality), + ]) + + " |" + ) + lines.extend([ + "", + "## Comparisons", + "", + "| Variant | Quality gate | Matched tasks | Token paired tasks | Token savings % | Shifted cost savings % |", + "| --- | --- | ---: | ---: | ---: | ---: |", + ]) + comparisons = report.get("comparisons") if isinstance(report.get("comparisons"), list) else [] + if comparisons: + for item in comparisons: + if not isinstance(item, dict): + continue + lines.append( + "| " + + " | ".join([ + markdown_value(item.get("variant")), + markdown_value(item.get("quality_gate")), + markdown_value(item.get("matched_successful_task_count")), + markdown_value(item.get("paired_token_task_count")), + markdown_value(item.get("token_savings_pct")), + 
markdown_value(item.get("cost_savings_pct_with_shift")), + ]) + + " |" + ) + else: + lines.append("| n/a | n/a | 0 | 0 | n/a | n/a |") + replay = report.get("replay_evidence") if isinstance(report.get("replay_evidence"), dict) else None + if replay is not None: + lines.extend([ + "", + "## Replay evidence provenance", + "", + f"- Source types: `{markdown_value(', '.join(replay.get('source_types') or []))}`", + f"- Claim scopes: `{markdown_value(', '.join(replay.get('claim_scopes') or []))}`", + f"- Same-run complete: `{markdown_value(replay.get('same_run_complete'))}`", + f"- Mixed/pre-existing CSV: `{markdown_value(replay.get('mixed_csv'))}`", + f"- Boundary: {markdown_value(replay.get('claim_boundary'))}", + ]) + else: + lines.extend([ + "", + "## Provenance note", + "", + "- CSV-only dashboards have unknown public-claim provenance unless regenerated from " + "the original evidence JSONL or a future trusted provenance ledger.", + ]) + lines.extend([ + "", + "## Re-run context", + "", + "- Evidence replay: `context-guard-bench --tasks --variants " + "--evidence-jsonl --csv --report-json " + "--dashboard-md `", + ]) + return "\n".join(lines) + "\n" + + +def write_report_outputs( + csv_path: Path, + report_path: Path | None, + dashboard_path: Path | None, + baseline_variant: str, + *, + replay_rows: list[EvidenceReplayRow] | None = None, + mixed_csv: bool = False, +) -> dict[str, Any]: + # Keep lock order stable across all derived writes: source CSV first, then + # report, then dashboard. Do not introduce a derived-output -> CSV path. 
+ with csv_file_lock(csv_path, create_parent=True): + report = summarize_benchmark_rows(read_csv_rows(csv_path), baseline_variant) + if replay_rows is not None: + report = annotate_replay_report(report, replay_rows, mixed_csv=mixed_csv) + if report_path is not None: + with csv_file_lock(report_path, create_parent=True): + write_text_no_follow( + report_path, + json.dumps(report, ensure_ascii=False, indent=2, sort_keys=True) + "\n", + ) + if dashboard_path is not None: + with csv_file_lock(dashboard_path, create_parent=True): + write_text_no_follow(dashboard_path, render_dashboard_markdown(report)) + return report + + def write_report_json(csv_path: Path, report_path: Path, baseline_variant: str) -> dict[str, Any]: # Keep lock order stable across all report writes: source CSV first, derived # report second. Do not introduce a report -> CSV path; that can deadlock # concurrent report generation. - with csv_file_lock(csv_path, create_parent=True): - report = summarize_benchmark_rows(read_csv_rows(csv_path), baseline_variant) - with csv_file_lock(report_path, create_parent=True): - write_text_no_follow( - report_path, - json.dumps(report, ensure_ascii=False, indent=2, sort_keys=True) + "\n", - ) - return report + return write_report_outputs(csv_path, report_path, None, baseline_variant) def sanitize_note_text(value: Any) -> str: @@ -2351,8 +2972,18 @@ def existing_file_identity(path: Path) -> tuple[int, int] | None: os.close(fd) -def validate_distinct_output_paths(csv_path: Path, ledger_path: Path | None, report_path: Path | None) -> None: - outputs = [("csv", csv_path), ("ledger-jsonl", ledger_path), ("report-json", report_path)] +def validate_distinct_output_paths( + csv_path: Path, + ledger_path: Path | None, + report_path: Path | None, + dashboard_path: Path | None = None, +) -> None: + outputs = [ + ("csv", csv_path), + ("ledger-jsonl", ledger_path), + ("report-json", report_path), + ("dashboard-md", dashboard_path), + ] seen: dict[Path, str] = {} seen_identity: 
dict[tuple[int, int], str] = {} for label, path in outputs: @@ -2391,12 +3022,16 @@ def main() -> int: help="optional JSONL ledger path for cost-shift accounting per run") parser.add_argument("--report-json", default=None, type=Path, help="optional A/B summary report JSON path generated from --csv after real runs") + parser.add_argument("--dashboard-md", default=None, type=Path, + help="optional Markdown dashboard path generated from the benchmark report") + parser.add_argument("--evidence-jsonl", default=None, type=Path, + help="optional validated run-evidence JSONL replay input; skips provider invocation") parser.add_argument("--baseline-variant", default="baseline", help="variant name used as the report baseline (default: baseline)") args = parser.parse_args() require_no_follow_file_ops_supported() - validate_distinct_output_paths(args.csv, args.ledger_jsonl, args.report_json) + validate_distinct_output_paths(args.csv, args.ledger_jsonl, args.report_json, args.dashboard_md) variants = parse_variants(args.variants) tasks = parse_tasks(args.tasks, variants=variants) @@ -2411,6 +3046,61 @@ def main() -> int: for task, variant in targets if (task.id, variant.name) not in skip_keys ] + if args.evidence_jsonl is not None: + if args.dry_run: + for task, variant in targets: + if (task.id, variant.name) in skip_keys: + print(f"skip {task.id}/{variant.name} (already in {args.csv})") + continue + print(f"evidence replay dry-run: {task.id}/{variant.name} <- {args.evidence_jsonl}") + print("completed 0 run(s); results in (dry-run; no CSV writes)") + return 0 + csv_had_preexisting_content = file_has_content_no_follow(args.csv) + evidence_rows = read_evidence_jsonl(args.evidence_jsonl) + evidence_by_key = validate_evidence_coverage(evidence_rows, runnable_targets) + claude_ver = "evidence-replay" + completed = 0 + replay_rows_written: list[EvidenceReplayRow] = [] + for task, variant in targets: + if (task.id, variant.name) in skip_keys: + print(f"skip {task.id}/{variant.name} 
(already in {args.csv})") + continue + evidence = evidence_by_key[(task.id, variant.name)] + print(f"replay {task.id}/{variant.name} ...", flush=True) + result = run_evidence_fixture(task, variant, evidence) + wrote = append_csv(args.csv, claude_ver, result, skip_existing=args.resume) + if wrote: + replay_rows_written.append(evidence) + if args.ledger_jsonl is not None: + append_cost_shift_ledger( + args.ledger_jsonl, + claude_ver, + result, + replay_provenance=evidence.provenance_payload(), + ) + completed += 1 + status = "ok" if result.success else "FAIL" + suffix = "" if wrote else " (CSV not updated; row already present)" + print( + f" {status} tokens={sum(result.tokens.values())} cost=${result.cost_usd:.4f} " + f"wall_time={result.wall_time_seconds:.3f}s {sanitize_note_text(result.notes)}{suffix}" + ) + if args.report_json is not None or args.dashboard_md is not None: + report = write_report_outputs( + args.csv, + args.report_json, + args.dashboard_md, + args.baseline_variant, + replay_rows=replay_rows_written, + mixed_csv=csv_had_preexisting_content or bool(skip_keys) or len(replay_rows_written) != int(completed), + ) + if args.report_json is not None: + print(f"report {args.report_json}: {report['claim_status']}") + if args.dashboard_md is not None: + print(f"dashboard {args.dashboard_md}: {report_public_claim_status(report)[0]}") + print(f"completed {completed} run(s); results in {args.csv}") + return 0 + placeholder_targets = [ f"{task.id}/{variant.name}" for task, variant in runnable_targets @@ -2463,9 +3153,12 @@ def main() -> int: f"wall_time={result.wall_time_seconds:.3f}s {sanitize_note_text(result.notes)}{suffix}" ) target = args.csv if not args.dry_run else "(dry-run; no CSV writes)" - if args.report_json is not None and not args.dry_run: - report = write_report_json(args.csv, args.report_json, args.baseline_variant) - print(f"report {args.report_json}: {report['claim_status']}") + if (args.report_json is not None or args.dashboard_md is not None) 
and not args.dry_run: + report = write_report_outputs(args.csv, args.report_json, args.dashboard_md, args.baseline_variant) + if args.report_json is not None: + print(f"report {args.report_json}: {report['claim_status']}") + if args.dashboard_md is not None: + print(f"dashboard {args.dashboard_md}: {report_public_claim_status(report)[0]}") print(f"completed {completed} run(s); results in {target}") return 0 diff --git a/scripts/prepublish_check.py b/scripts/prepublish_check.py index c42db5b..2557322 100755 --- a/scripts/prepublish_check.py +++ b/scripts/prepublish_check.py @@ -227,6 +227,7 @@ def load_command_manifest(): "docs/benchmark-fixtures/visual-ocr-cropped-ocr.prompt.example.md", "docs/benchmark-fixtures/token-savings-12task.tasks.example.json", "docs/benchmark-fixtures/token-savings-12task.variants.example.json", + "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "docs/benchmark-fixtures/token-savings-12task-baseline.prompt.example.md", "docs/benchmark-fixtures/token-savings-12task-contextguard.prompt.example.md", "package.json", diff --git a/tests/test_context_guard_kit.py b/tests/test_context_guard_kit.py index a40f514..41ea260 100644 --- a/tests/test_context_guard_kit.py +++ b/tests/test_context_guard_kit.py @@ -24224,8 +24224,397 @@ def test_benchmark_runner_rejects_overlapping_output_paths(self): root / "results.csv", root / "cost-shift.jsonl", root / "report.json", + root / "dashboard.md", ) + with self.assertRaises(SystemExit) as dashboard_ctx: + module.validate_distinct_output_paths( + root / "results.csv", + root / "cost-shift.jsonl", + root / "report.json", + root / "bench" / ".." 
/ "results.csv", + ) + self.assertIn("--dashboard-md must not point to the same path as --csv", str(dashboard_ctx.exception)) + + def test_benchmark_runner_replays_evidence_without_provider_and_writes_dashboard(self): + for script in BENCH_SCRIPTS: + with self.subTest(script=script): + with tempfile.TemporaryDirectory() as tmp: + root = Path(tmp) + placeholder = "python3 -c \"raise SystemExit('fixture-only placeholder: replace success_command before real benchmark runs')\"" + tasks_path = root / "tasks.json" + variants_path = root / "variants.json" + evidence_path = root / "evidence.jsonl" + tasks_path.write_text(json.dumps([ + { + "id": "t01", + "prompt": "fixture prompt", + "model": "sonnet", + "effort": "medium", + "max_turns": 1, + "success_command": placeholder, + "success_cwd": ".", + } + ]), encoding="utf-8") + variants_path.write_text(json.dumps([ + {"name": "baseline", "extra_args": []}, + {"name": "optimized", "extra_args": []}, + ]), encoding="utf-8") + + def evidence_row(variant: str, input_tokens: int, output_tokens: int, bytes_after: int) -> dict: + return { + "schema_version": "contextguard.bench.run-evidence.v1", + "task_id": "t01", + "variant": variant, + "model": "sonnet", + "effort": "medium", + "success": True, + "tokens": {"input_tokens": input_tokens, "output_tokens": output_tokens}, + "primary_tokens_measured": True, + "cost_usd": 0.123, + "cost_measured": True, + "external_tokens": 0, + "external_tokens_measured": True, + "external_cost_usd": 0, + "external_cost_measured": True, + "bytes_before": 1000, + "bytes_after": bytes_after, + "corrections": 0, + "notes": f"synthetic {variant}", + "provenance": { + "evidence_source_type": "synthetic_fixture", + "capture_command_or_export_id": "unit-test-fixture", + "claim_scope": "local_replay_fixture_not_public_claim", + }, + } + + evidence_path.write_text( + "\n".join([ + json.dumps(evidence_row("baseline", 100, 20, 1000)), + json.dumps(evidence_row("optimized", 50, 10, 200)), + ]) + "\n", + 
encoding="utf-8", + ) + dry_csv = root / "dry-results.csv" + dry_proc = subprocess.run( + [ + sys.executable, + str(script), + "--tasks", + str(tasks_path), + "--variants", + str(variants_path), + "--csv", + str(dry_csv), + "--evidence-jsonl", + str(evidence_path), + "--dry-run", + "--claude-bin", + str(root / "missing-claude"), + ], + text=True, + capture_output=True, + check=True, + ) + self.assertIn("evidence replay dry-run", dry_proc.stdout) + self.assertFalse(dry_csv.exists()) + self.assertFalse((root / "dry-results.csv.lock").exists()) + + csv_path = root / "results.csv" + ledger_path = root / "ledger.jsonl" + report_path = root / "report.json" + dashboard_path = root / "dashboard.md" + proc = subprocess.run( + [ + sys.executable, + str(script), + "--tasks", + str(tasks_path), + "--variants", + str(variants_path), + "--csv", + str(csv_path), + "--evidence-jsonl", + str(evidence_path), + "--ledger-jsonl", + str(ledger_path), + "--report-json", + str(report_path), + "--dashboard-md", + str(dashboard_path), + "--claude-bin", + str(root / "missing-claude"), + ], + text=True, + capture_output=True, + check=True, + ) + self.assertIn("replay t01/baseline", proc.stdout) + self.assertIn("dashboard", proc.stdout) + with csv_path.open(encoding="utf-8", newline="") as f: + rows = list(csv.DictReader(f)) + self.assertEqual(len(rows), 2) + self.assertTrue(all(row["claude_version"] == "evidence-replay" for row in rows)) + self.assertTrue(all(row["primary_tokens_measured"] == "false" for row in rows)) + self.assertTrue(all(row["cost_measured"] == "false" for row in rows)) + + ledger_rows = [ + json.loads(line) + for line in ledger_path.read_text(encoding="utf-8").splitlines() + if line.strip() + ] + self.assertEqual(len(ledger_rows), 2) + self.assertEqual(ledger_rows[0]["evidence_source_type"], "synthetic_fixture") + self.assertFalse(ledger_rows[0]["public_claim_eligible"]) + self.assertIn("replay_provenance", ledger_rows[0]) + + report = 
json.loads(report_path.read_text(encoding="utf-8")) + self.assertEqual(report["claim_status"], "replay_only_not_public_claim") + self.assertEqual(report["raw_metric_claim_status"], "insufficient_paired_data") + self.assertEqual(report["public_claim_status"], "replay_only_not_public_claim") + self.assertFalse(report["public_claim_eligible"]) + self.assertEqual(report["replay_evidence"]["source_types"], ["synthetic_fixture"]) + + dashboard = dashboard_path.read_text(encoding="utf-8") + self.assertIn("Claim boundary", dashboard) + self.assertIn("Quality gate", dashboard) + self.assertIn("context-guard-bench --tasks", dashboard) + self.assertIn("--evidence-jsonl", dashboard) + + resumed_report = root / "resumed-report.json" + subprocess.run( + [ + sys.executable, + str(script), + "--tasks", + str(tasks_path), + "--variants", + str(variants_path), + "--csv", + str(csv_path), + "--evidence-jsonl", + str(evidence_path), + "--report-json", + str(resumed_report), + "--resume", + ], + text=True, + capture_output=True, + check=True, + ) + resumed = json.loads(resumed_report.read_text(encoding="utf-8")) + self.assertEqual(resumed["claim_status"], "unknown_mixed_csv") + self.assertFalse(resumed["public_claim_eligible"]) + + no_evidence_proc = subprocess.run( + [ + sys.executable, + str(script), + "--tasks", + str(tasks_path), + "--variants", + str(variants_path), + "--csv", + str(root / "no-evidence.csv"), + "--claude-bin", + str(root / "missing-claude"), + ], + text=True, + capture_output=True, + ) + self.assertEqual(no_evidence_proc.returncode, 2) + self.assertIn("fixture-only placeholder", no_evidence_proc.stderr) + + def test_benchmark_runner_evidence_replay_validation_fails_closed(self): + for index, script in enumerate(BENCH_SCRIPTS): + with self.subTest(script=script): + module = load_python_script_module(script, f"_bench_runner_evidence_validation_{index}") + with tempfile.TemporaryDirectory() as tmp: + root = Path(tmp) + evidence_path = root / "evidence.jsonl" + + def 
good_row(**updates): + row = { + "schema_version": "contextguard.bench.run-evidence.v1", + "task_id": "t01", + "variant": "baseline", + "success": True, + "tokens": {"input_tokens": 100, "output_tokens": 20}, + "primary_tokens_measured": False, + "cost_usd": 0.0, + "cost_measured": False, + "provenance": { + "evidence_source_type": "synthetic_fixture", + "claim_scope": "local_replay_fixture_not_public_claim", + }, + } + row.update(updates) + return row + + bad_cases = { + "schema": good_row(schema_version="wrong"), + "missing_provenance": {k: v for k, v in good_row().items() if k != "provenance"}, + "negative_metric": good_row(bytes_after=-1), + } + for name, row in bad_cases.items(): + with self.subTest(case=name): + evidence_path.write_text(json.dumps(row) + "\n", encoding="utf-8") + with self.assertRaises(SystemExit): + module.read_evidence_jsonl(evidence_path) + + evidence_path.write_text( + json.dumps(good_row(cost_usd=float("nan"))) + "\n", + encoding="utf-8", + ) + with self.assertRaises(SystemExit): + module.read_evidence_jsonl(evidence_path) + + manual = good_row( + primary_tokens_measured=True, + cost_measured=True, + cost_usd=1.23, + provenance={ + "evidence_source_type": "manual_audit", + "claim_scope": "manual_check_not_public_claim", + }, + ) + evidence_path.write_text(json.dumps(manual) + "\n", encoding="utf-8") + parsed = module.read_evidence_jsonl(evidence_path)[0] + self.assertFalse(parsed.result.primary_tokens_measured) + self.assertFalse(parsed.result.cost_measured) + self.assertFalse(parsed.public_claim_eligible) + + def provider_row(variant, *, input_tokens, output_tokens, cost_usd, corrections=0, + measured=True): + return { + "schema_version": "contextguard.bench.run-evidence.v1", + "task_id": "t01", + "variant": variant, + "success": True, + "tokens": {"input_tokens": input_tokens, "output_tokens": output_tokens}, + "primary_tokens_measured": measured, + "cost_usd": cost_usd, + "cost_measured": measured, + "external_tokens": 0, + 
"external_tokens_measured": True, + "external_cost_usd": 0, + "external_cost_measured": True, + "bytes_before": 1000, + "bytes_after": 800 if variant == "optimized" else 1000, + "corrections": corrections, + "provenance": { + "evidence_source_type": "provider_export", + "provider_name": "unit-provider", + "capture_command_or_export_id": "export-123", + "claim_scope": "provider_measured_matched_task_public_claim", + }, + } + + def csv_rows_from_replay(replay_rows): + csv_rows = [] + for replay in replay_rows: + result = replay.result + shifted = ( + result.cost_measured + and result.external_tokens_measured + and (result.external_tokens == 0 or result.external_cost_measured) + ) + csv_rows.append({ + "task_id": result.task_id, + "variant": result.variant, + "success": "true" if result.success else "false", + "total_tokens": str(sum(result.tokens.values())), + "primary_tokens_measured": "true" if result.primary_tokens_measured else "false", + "cost_usd": f"{result.cost_usd:.6f}", + "cost_measured": "true" if result.cost_measured else "false", + "external_tokens": str(result.external_tokens), + "external_tokens_measured": "true" if result.external_tokens_measured else "false", + "external_cost_usd": f"{result.external_cost_usd:.6f}", + "external_cost_measured": "true" if result.external_cost_measured else "false", + "total_cost_with_shift_usd": ( + f"{(result.cost_usd + result.external_cost_usd):.6f}" if shifted else "" + ), + "bytes_before": str(result.bytes_before), + "bytes_after": str(result.bytes_after), + "corrections": str(result.corrections), + }) + return csv_rows + + evidence_path.write_text( + "\n".join([ + json.dumps(provider_row("baseline", input_tokens=100, output_tokens=20, cost_usd=0.12, measured=False)), + json.dumps(provider_row("optimized", input_tokens=50, output_tokens=10, cost_usd=0.06, measured=False)), + ]) + "\n", + encoding="utf-8", + ) + provider_incomplete = module.read_evidence_jsonl(evidence_path) + incomplete_report = 
module.annotate_replay_report( + module.summarize_benchmark_rows(csv_rows_from_replay(provider_incomplete), "baseline"), + provider_incomplete, + mixed_csv=False, + ) + self.assertEqual(incomplete_report["raw_metric_claim_status"], "insufficient_paired_data") + self.assertEqual(incomplete_report["claim_status"], "provider_export_claim_gates_not_met") + self.assertFalse(incomplete_report["public_claim_eligible"]) + + evidence_path.write_text( + "\n".join([ + json.dumps(provider_row("baseline", input_tokens=100, output_tokens=20, cost_usd=0.12)), + json.dumps(provider_row("optimized", input_tokens=50, output_tokens=10, cost_usd=0.06, corrections=1)), + ]) + "\n", + encoding="utf-8", + ) + provider_quality_regression = module.read_evidence_jsonl(evidence_path) + quality_report = module.annotate_replay_report( + module.summarize_benchmark_rows(csv_rows_from_replay(provider_quality_regression), "baseline"), + provider_quality_regression, + mixed_csv=False, + ) + self.assertEqual(quality_report["raw_metric_claim_status"], "quality_gate_watch") + self.assertEqual(quality_report["claim_status"], "provider_export_claim_gates_not_met") + self.assertFalse(quality_report["public_claim_eligible"]) + + evidence_path.write_text( + "\n".join([ + json.dumps(provider_row("baseline", input_tokens=100, output_tokens=20, cost_usd=0.12)), + json.dumps(provider_row("optimized", input_tokens=50, output_tokens=10, cost_usd=0.06)), + ]) + "\n", + encoding="utf-8", + ) + provider_complete = module.read_evidence_jsonl(evidence_path) + complete_report = module.annotate_replay_report( + module.summarize_benchmark_rows(csv_rows_from_replay(provider_complete), "baseline"), + provider_complete, + mixed_csv=False, + ) + self.assertEqual( + complete_report["raw_metric_claim_status"], + "token_and_shifted_cost_savings_observed", + ) + self.assertEqual(complete_report["claim_status"], "token_and_shifted_cost_savings_observed") + self.assertEqual(complete_report["public_claim_status"], 
"provider_export_public_claim_candidate") + self.assertTrue(complete_report["public_claim_eligible"]) + + duplicate = "\n".join([json.dumps(good_row()), json.dumps(good_row())]) + "\n" + evidence_path.write_text(duplicate, encoding="utf-8") + rows = module.read_evidence_jsonl(evidence_path) + with self.assertRaises(SystemExit): + module.validate_evidence_coverage( + rows, + [(module.TaskFixture(id="t01", prompt="x"), module.Variant(name="baseline"))], + ) + + evidence_path.write_text(json.dumps(good_row()) + "\n", encoding="utf-8") + rows = module.read_evidence_jsonl(evidence_path) + with self.assertRaises(SystemExit): + module.validate_evidence_coverage( + rows, + [ + (module.TaskFixture(id="t01", prompt="x"), module.Variant(name="baseline")), + (module.TaskFixture(id="t01", prompt="x"), module.Variant(name="optimized")), + ], + ) + def test_benchmark_runner_preflight_fails_unsupported_platform_before_file_io(self): module = load_module_from_path(KIT_DIR / "benchmark_runner.py", "_bench_runner_unsupported_platform") with tempfile.TemporaryDirectory() as tmp: @@ -25212,6 +25601,8 @@ def _combined_experimental_benchmark_fixture_text(self, guide, fixture_dir, fixt for task_path, variant_path in fixture_pairs.values(): combined += "\n" + task_path.read_text(encoding="utf-8").lower() combined += "\n" + variant_path.read_text(encoding="utf-8").lower() + for evidence_path in sorted(fixture_dir.glob("*.example.jsonl")): + combined += "\n" + evidence_path.read_text(encoding="utf-8").lower() for prompt_path in sorted(fixture_dir.glob("*.prompt.example.md")): combined += "\n" + prompt_path.read_text(encoding="utf-8").lower() return combined @@ -25232,6 +25623,7 @@ def test_experimental_benchmark_fixtures_are_packaged_and_linked(self): package_files = set(json.loads((ROOT / "package.json").read_text(encoding="utf-8"))["files"]) self.assertIn("docs/experimental-benchmark-fixtures.md", package_files) self.assertIn("docs/benchmark-fixtures/*.example.json", package_files) + 
self.assertIn("docs/benchmark-fixtures/*.example.jsonl", package_files) self.assertIn("docs/benchmark-fixtures/*.prompt.example.md", package_files) prepublish = (ROOT / "scripts" / "prepublish_check.py").read_text(encoding="utf-8") @@ -25251,6 +25643,7 @@ def test_experimental_benchmark_fixtures_are_packaged_and_linked(self): "docs/benchmark-fixtures/visual-ocr-cropped-ocr.prompt.example.md", "docs/benchmark-fixtures/token-savings-12task.tasks.example.json", "docs/benchmark-fixtures/token-savings-12task.variants.example.json", + "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "docs/benchmark-fixtures/token-savings-12task-baseline.prompt.example.md", "docs/benchmark-fixtures/token-savings-12task-contextguard.prompt.example.md", 'ROOT / "docs" / "experimental-benchmark-fixtures.md"', @@ -25339,8 +25732,17 @@ def test_experimental_benchmark_fixtures_parse_and_bind_prompt_files(self): def test_token_savings_12task_fixture_parses_and_generates_claim_safe_report(self): fixture_dir, _guide, fixture_pairs = self._experimental_benchmark_fixture_paths() task_path, variant_path = fixture_pairs["token_savings"] + evidence_path = fixture_dir / "token-savings-12task.evidence.example.jsonl" task_raw = json.loads(task_path.read_text(encoding="utf-8")) self.assertEqual(len(task_raw), 12) + evidence_raw = [ + json.loads(line) + for line in evidence_path.read_text(encoding="utf-8").splitlines() + if line.strip() + ] + self.assertEqual(len(evidence_raw), 24) + self.assertTrue(all(row["provenance"]["evidence_source_type"] == "synthetic_fixture" for row in evidence_raw)) + self.assertTrue(all(row["primary_tokens_measured"] is False for row in evidence_raw)) expected_categories = { "bugfix", "exploration", @@ -25415,6 +25817,42 @@ def test_token_savings_12task_fixture_parses_and_generates_claim_safe_report(sel parsed_tasks = module.parse_tasks(task_path, variants=parsed_variants) self.assertEqual(len(parsed_tasks), 12) 
self.assertTrue(all(module.is_placeholder_success_command(task.success_command) for task in parsed_tasks)) + replay_rows = module.read_evidence_jsonl(evidence_path) + replay_targets = module.filter_targets(parsed_tasks, parsed_variants, None, None) + replay_by_key = module.validate_evidence_coverage(replay_rows, replay_targets) + self.assertEqual(len(replay_by_key), 24) + self.assertTrue(all(row.source_type == "synthetic_fixture" for row in replay_rows)) + self.assertTrue(all(not row.result.primary_tokens_measured for row in replay_rows)) + self.assertTrue(all(not row.result.cost_measured for row in replay_rows)) + replay_report = module.annotate_replay_report( + module.summarize_benchmark_rows( + [ + { + "task_id": row.result.task_id, + "variant": row.result.variant, + "success": "true" if row.result.success else "false", + "total_tokens": str(sum(row.result.tokens.values())), + "primary_tokens_measured": "false", + "cost_usd": f"{row.result.cost_usd:.6f}", + "cost_measured": "false", + "external_tokens": str(row.result.external_tokens), + "external_tokens_measured": "true" if row.result.external_tokens_measured else "false", + "external_cost_usd": f"{row.result.external_cost_usd:.6f}", + "external_cost_measured": "true" if row.result.external_cost_measured else "false", + "total_cost_with_shift_usd": "", + "bytes_before": str(row.result.bytes_before), + "bytes_after": str(row.result.bytes_after), + "corrections": str(row.result.corrections), + } + for row in replay_rows + ], + "baseline_full_context_fixture", + ), + replay_rows, + mixed_csv=False, + ) + self.assertEqual(replay_report["claim_status"], "replay_only_not_public_claim") + self.assertEqual(replay_report["public_claim_status"], "replay_only_not_public_claim") report = module.summarize_benchmark_rows(rows, "baseline_full_context_fixture") self.assertEqual(report["schema"], "context-guard-bench-report-v1") self.assertEqual(report["claim_status"], "token_and_shifted_cost_savings_observed")