diff --git a/CHANGELOG.md b/CHANGELOG.md
index 00447f6..aa48422 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,8 @@ All notable changes for the ContextGuard plugin are documented here.
 
 ## [Unreleased]
 
+- Extended Batch 1 token-savings advisory reports with cache-score amortization risk fields, tool-prune deferred-schema proxy accounting, and a benchmark measurement-baseline contract while preserving local-only/no-savings-claim boundaries.
+
 ## [0.4.10] - 2026-06-14
 
 - Added `context-guard-artifact search`, a local sanitized artifact sandbox search that returns capped literal matches with exact `get --lines` rehydration commands and no hosted savings claims.
diff --git a/README.ko.md b/README.ko.md
index 8ab8741..542bc8b 100644
--- a/README.ko.md
+++ b/README.ko.md
@@ -102,6 +102,7 @@ brief 모드는 코딩 에이전트가 군더더기를 줄이도록 요청하되
 - `context-guard-audit`가 보고한 대화 기록 사용량 집중 지점, `cache_friendliness` 프롬프트 배치 신호, `cache_layout_advice` 실험 우선순위
 - 상태표시줄의 `cache` / `reuse` 값: ContextGuard가 직접 만든 절감 효과가 아니라 관찰된 대화 기록·provider cache 신호입니다.
 - `context-guard cost preflight`로 Anthropic 요청 JSON의 추정 비용을 보고, 호출 뒤 `context-guard cost observe`로 provider usage 필드(`cache_creation_input_tokens`, `cache_read_input_tokens`)를 대조합니다.
+- `context-guard-cache-score`로 정적 cache layout과, 사용자가 직접 넣은 cache write/read multiplier 기반 amortization 위험을 안내받습니다. char/4 토큰 값은 provider 측정 절감이 아니라 추정 proxy입니다.
 - `context-guard-bench`로 성공한 기준/변형 실행을 쌍으로 맞춰 비교한 결과
 - 큰 tool/MCP catalog와 `context-guard-tool-prune` top-k 리포트 및 요약 기록 재조회 방식의 차이
 - [`research/experimental-token-reduction-radar.md`](research/experimental-token-reduction-radar.md)의 선택적 실험 lane과 마찬가지로, [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md)의 fixture-only 시작 예시도 절감 주장을 하려면 같은 matched-task benchmark gate를 먼저 통과해야 합니다.
@@ -282,10 +283,14 @@ long-command 2>&1 | ./plugins/context-guard/bin/context-guard-artifact store --c
   --catalog tools.json \
   --query "review failing tests" \
   --top 5 --budget-bytes 12000 --json
+./plugins/context-guard/bin/context-guard-tool-prune defer-report \
+  --catalog tools.json \
+  --query "review failing tests" \
+  --core-top 3 --deferred-top 20 --json
 ./plugins/context-guard/bin/context-guard-tool-prune get --tool read_file --json
 ```
 
-`context-guard-tool-prune`은 로컬 tool 또는 MCP catalog를 결정적 lexical heuristic(어휘 기반 휴리스틱)으로 순위화해 제한된 top-k 자문 리포트를 만듭니다. inline schema는 관측된 UTF-8 바이트 예산을 지키고, 누락되거나 예산 때문에 생략된 schema는 `.context-guard/tool-prune`의 compact 요약 기록과 별도 가림 처리 payload로 다시 조회할 수 있습니다. 이 기능은 안내용이며 MCP 설정을 변경하지 않습니다. 토큰 값은 provider가 측정한 절감 수치가 아니라 추정 proxy입니다.
+`context-guard-tool-prune`은 로컬 tool 또는 MCP catalog를 결정적 lexical heuristic(어휘 기반 휴리스틱)으로 순위화해 제한된 top-k 자문 리포트를 만듭니다. inline schema는 관측된 UTF-8 바이트 예산을 지키고, 누락되거나 예산 때문에 생략된 schema는 `.context-guard/tool-prune`의 compact 요약 기록과 별도 가림 처리 payload로 다시 조회할 수 있습니다. `defer-report`는 core inline tool과 deferred tool stub/namespace 요약을 나누고, 첫 프롬프트에서 빠진 schema의 gross/net char/4 proxy 회계를 함께 보여줍니다. 이 기능은 안내용이며 MCP 설정이나 native provider tool search를 변경하지 않습니다. 토큰 값은 provider가 측정한 절감 수치가 아니라 추정 proxy입니다.
 
 ### 총비용, batchability, routing 후보 자문
 
diff --git a/README.md b/README.md
index d78f25a..dd467e1 100644
--- a/README.md
+++ b/README.md
@@ -104,7 +104,7 @@ When you need a savings claim, measure it on your own tasks:
 - transcript hotspots reported by `context-guard-audit`, including `cache_friendliness` prompt-layout signals and `cache_layout_advice` experiment priorities
 - statusline `cache` / `reuse` as observed transcript/provider-cache signals, not savings caused by ContextGuard
 - `context-guard cost preflight` estimates for Anthropic request JSON, followed by `context-guard cost observe` using provider usage fields (`cache_creation_input_tokens`, `cache_read_input_tokens`) after the call
-- static prompt/request cache layout checks from `context-guard-cache-score`; its char/4 token estimates and warnings are advisory only until provider usage fields confirm real cache hits
+- static prompt/request cache layout checks from `context-guard-cache-score`, including optional user-supplied cache write/read multiplier amortization risk; its char/4 token estimates and warnings are advisory only until provider usage fields confirm real cache hits
 - matched successful baseline/variant runs from `context-guard-bench`
 - large tool/MCP catalogs versus `context-guard-tool-prune` top-k reports plus receipt retrieval
 - optional experimental lanes in [`research/experimental-token-reduction-radar.md`](research/experimental-token-reduction-radar.md); fixture-only starters in [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md) use the same matched-task benchmark gates before any savings claim
@@ -303,7 +303,7 @@ The packer uses deterministic standard-library heuristics only: no network, mode
 ./plugins/context-guard/bin/context-guard-tool-prune get --tool read_file --json
 ```
 
-`context-guard-tool-prune` ranks a local tool or MCP catalog with deterministic lexical heuristics and emits a bounded top-k advisory report. Inline selected schemas respect an observed UTF-8 byte budget, and omitted or budget-skipped schemas remain recoverable from a compact local receipt plus a separate sanitized payload under `.context-guard/tool-prune`. `defer-report` uses the same receipt path to split a catalog into core inline tools plus deferred tool stubs and namespace summaries. This is advisory only: it does not mutate MCP configuration, does not configure native provider tool search, and token counts remain estimated proxies rather than measured provider savings.
+`context-guard-tool-prune` ranks a local tool or MCP catalog with deterministic lexical heuristics and emits a bounded top-k advisory report. Inline selected schemas respect an observed UTF-8 byte budget, and omitted or budget-skipped schemas remain recoverable from a compact local receipt plus a separate sanitized payload under `.context-guard/tool-prune`. `defer-report` uses the same receipt path to split a catalog into core inline tools plus deferred tool stubs and namespace summaries, and reports gross deferred-schema plus net initial-report char/4 proxy accounting so you can see what moved out of the first prompt. This is advisory only: it does not mutate MCP configuration, does not configure native provider tool search, and token counts remain estimated proxies rather than measured provider savings.
 
 ### Score static prompt cacheability
 
@@ -312,7 +312,7 @@ The packer uses deterministic standard-library heuristics only: no network, mode
 ./plugins/context-guard/bin/context-guard cache-score --input prompt.txt --provider anthropic --json
 ```
 
-`context-guard-cache-score` is a local static lint for prompt/request layout. It estimates total and cacheable-prefix size with a tokenizer-free char/4 proxy, warns about dynamic-looking values near the prefix, and records provider caveats for OpenAI, Anthropic, Gemini, or a generic threshold. It does not call providers, store raw prompts, estimate prices, observe cache hits, or prove token/cost savings; verify real cache behavior with provider usage telemetry.
+`context-guard-cache-score` is a local static lint for prompt/request layout. It estimates total and cacheable-prefix size with a tokenizer-free char/4 proxy, warns about dynamic-looking values near the prefix, and records provider caveats for OpenAI, Anthropic, Gemini, or a generic threshold. Optional `--expected-reuses`, `--cache-write-multiplier`, and `--cache-read-multiplier` inputs add an advisory amortization-risk section using user-supplied economics only. It does not call providers, store raw prompts, estimate prices from bundled defaults, observe cache hits, or prove token/cost savings; verify real cache behavior with provider usage telemetry.
 
 ### Advise on total cost, batchability, and routing
 
diff --git a/context-guard-kit/README.md b/context-guard-kit/README.md
index 11222a6..bdb71c9 100644
--- a/context-guard-kit/README.md
+++ b/context-guard-kit/README.md
@@ -57,9 +57,9 @@ python3 context-guard-kit/sanitize_output.py -- git diff
 
 `context_filter.py`는 opt-in declarative output filter helper입니다. filter JSON은 사용자가 package code 밖(예: `.context-guard/filter-dsl.json`)에 두고 `validate`로 검증한 뒤 `run --config ... -- `로 적용합니다. invalid config, no-match, filter error, empty output, protected `git`/test/lint/`gh` failure는 원래 command stdout/stderr와 exit code를 passthrough합니다. filtered mode는 stdout+stderr를 합친 line에 filter를 적용해 stdout으로 쓰고, passthrough mode는 stdout/stderr stream을 그대로 보존합니다. `--json-report`는 stdout을 command/filter output 전용으로 두기 위해 stderr에만 diagnostic JSON을 쓰지만, protected nonzero passthrough에서는 stderr 원문 보존을 위해 report를 생략합니다. token/cost 절감 수치는 측정 claim이 아니라 local presentation 변화로만 다루세요.
 
-`cache_score.py`는 provider 호출 없이 prompt/request 파일 또는 stdin을 정적으로 검사하는 cacheability lint입니다. OpenAI/Anthropic/Gemini/generic threshold를 기준으로 stable prefix, 첫 dynamic marker, JSON/tool ordering hint, char/4 token proxy, provider caveat, claim boundary를 출력합니다. raw prompt를 저장하지 않으며, 가격/ledger/cache hit 관측은 `cost_guard.py`와 provider usage field의 영역입니다.
+`cache_score.py`는 provider 호출 없이 prompt/request 파일 또는 stdin을 정적으로 검사하는 cacheability lint입니다. OpenAI/Anthropic/Gemini/generic threshold를 기준으로 stable prefix, 첫 dynamic marker, JSON/tool ordering hint, char/4 token proxy, provider caveat, claim boundary를 출력합니다. 선택적으로 `--expected-reuses`, `--cache-write-multiplier`, `--cache-read-multiplier`를 받아 사용자가 제공한 경제성 가정으로만 amortization risk를 표시합니다. raw prompt를 저장하지 않으며, 번들 가격 추정/ledger/cache hit 관측은 `cost_guard.py`와 provider usage field의 영역입니다.
 
-`tool_schema_pruner.py`는 provider-neutral tool/MCP catalog helper입니다. `select`는 task query와 lexical overlap으로 top-k tool을 고르고, inline schema는 `--budget-bytes` 안에만 넣으며, compact receipt와 별도 sanitized payload를 `.context-guard/tool-prune`에 기록합니다. `defer-report`는 같은 receipt path를 사용해 core inline tools와 deferred tool stubs/namespace summaries를 분리합니다. `get`은 payload size/SHA-256을 검증한 뒤 전체 정제 schema를 반환합니다. 이 helper는 MCP 설정이나 native provider tool search를 바꾸지 않으며, token 절감은 측정값이 아니라 추정 proxy로만 표현합니다.
+`tool_schema_pruner.py`는 provider-neutral tool/MCP catalog helper입니다. `select`는 task query와 lexical overlap으로 top-k tool을 고르고, inline schema는 `--budget-bytes` 안에만 넣으며, compact receipt와 별도 sanitized payload를 `.context-guard/tool-prune`에 기록합니다. `defer-report`는 같은 receipt path를 사용해 core inline tools와 deferred tool stubs/namespace summaries를 분리하고, gross deferred-schema 및 net initial-report `chars_div_4` proxy 회계를 표시합니다. `get`은 payload size/SHA-256을 검증한 뒤 전체 정제 schema를 반환합니다. 이 helper는 MCP 설정이나 native provider tool search를 바꾸지 않으며, token 절감은 측정값이 아니라 추정 proxy로만 표현합니다.
 
 `context_compress.py --protected-policy`는 기본 압축 동작을 바꾸지 않고 code fence, diff, identifier, numeric constant, hash, path, stack frame, quoted string, JSON key 같은 보호-zone class/count 정책 메타데이터를 추가합니다. 보호-zone 정책은 semantic/paraphrase rewrite를 금지하고 structural dedupe/window/truncate 및 artifact retrieval만 허용합니다. raw span은 receipt에 저장하지 않으며, lossy structural transform에는 정확 재조회가 필요하다는 hint를 남깁니다. `context_compress.py --mode readable`은 가림 처리된 prose에만 deterministic sentence-window preview를 시도하고, prompt-like/high-risk protected signal이 있으면 보수 모드로 차단합니다. learned compressor, model, embedding, reranker, hosted savings claim은 포함하지 않습니다.
 
diff --git a/context-guard-kit/benchmark_runner.py b/context-guard-kit/benchmark_runner.py
index 70afd68..e338b88 100755
--- a/context-guard-kit/benchmark_runner.py
+++ b/context-guard-kit/benchmark_runner.py
@@ -184,6 +184,7 @@ TOKEN_PROXY_BYTES_PER_TOKEN = 4
 
 BENCH_RUN_EVIDENCE_SCHEMA_VERSION = "contextguard.bench.run-evidence.v1"
 MATCHED_PAIR_EVIDENCE_SCHEMA_VERSION = "contextguard.bench.matched-pair.v1"
+MEASUREMENT_BASELINE_SCHEMA_VERSION = "contextguard.bench.measurement-baseline.v1"
 SELF_HOSTED_METRICS_SCHEMA_VERSION = "contextguard.bench.self-hosted-metrics.v1"
 SELF_HOSTED_METRICS_KEY = "self_hosted_metrics"
 SELF_HOSTED_METRICS_CLAIM_BOUNDARY = "self_hosted_metrics_only_not_hosted_api_token_or_cost_savings"
@@ -1546,6 +1547,77 @@ def row_cost_shift_measured(row: dict[str, str]) -> bool:
     )
 
 
+def measurement_baseline_contract() -> dict[str, Any]:
+    """Describe the benchmark report's current measurement baseline contract.
+
+    This block is descriptive. It does not change the CSV schema and does not
+    grant token/cost savings claims by itself; those remain gated by matched
+    successful tasks, measured primary tokens/costs, shifted-cost accounting,
+    and quality gates.
+    """
+    return {
+        "schema_version": MEASUREMENT_BASELINE_SCHEMA_VERSION,
+        "csv_schema_unchanged": True,
+        "csv_columns": list(CSV_COLUMNS),
+        "captured_fields": {
+            "task_identity": ["task_id", "variant"],
+            "run_configuration": ["model", "effort", "claude_version"],
+            "primary_token_buckets": [
+                "input_tokens",
+                "output_tokens",
+                "cache_read",
+                "cache_creation",
+                "total_tokens",
+                "primary_tokens_measured",
+            ],
+            "primary_cost": ["cost_usd", "cost_measured"],
+            "provider_cache_telemetry": ["provider_cached_tokens", "provider_cached_tokens_measured"],
+            "latency": ["wall_time_seconds"],
+            "quality_and_result": ["success", "corrections", "notes"],
+            "tooling_and_proxy_metrics": ["turns", "hook_triggers", "bytes_before", "bytes_after", "artifacts_used"],
+            "shifted_cost_accounting": [
+                "external_tokens",
+                "external_tokens_measured",
+                "external_cost_usd",
+                "external_cost_measured",
+                "total_cost_with_shift_usd",
+            ],
+        },
+        "claim_eligible_fields": {
+            "token_savings": [
+                "matched successful baseline and variant tasks",
+                "primary_tokens_measured=true on both sides",
+                "quality_gate=pass",
+            ],
+            "shifted_cost_savings": [
+                "matched successful baseline and variant tasks",
+                "cost_measured=true on both sides",
+                "external_cost_measured=true when external_tokens are present",
+                "quality_gate=pass",
+            ],
+        },
+        "proxy_only_fields": {
+            "byte_metrics": ["bytes_before", "bytes_after"],
+            "token_proxy": "chars_div_4_proxy_only",
+            "provider_cache": "diagnostic_telemetry_not_contextguard_token_reduction",
+        },
+        "missing_future_run_identity_fields": [
+            "repo_revision",
+            "agent_harness",
+            "feature_flags",
+            "provider_name",
+            "success_command_identity",
+        ],
+        "claim_boundary": {
+            "descriptive_contract_only": True,
+            "enables_savings_claims_by_itself": False,
+            "requires_matched_successful_tasks": True,
+            "requires_shifted_cost_accounting_for_cost_claims": True,
+            "raw_proxy_estimates_are_not_hosted_api_token_savings": True,
+        },
+    }
+
+
 def summarize_benchmark_rows(rows: list[dict[str, str]], baseline_variant: str) -> dict[str, Any]:
     by_variant: dict[str, dict[str, Any]] = {}
     successful_rows_by_variant_task: dict[str, dict[str, list[dict[str, str]]]] = {}
@@ -2191,6 +2263,7 @@ def matched_pair_evidence_entry(
         "schema": "context-guard-bench-report-v1",
         "baseline_variant": baseline_variant,
         "row_count": len(rows),
+        "measurement_baseline": measurement_baseline_contract(),
         "summary_by_variant": by_variant,
         "comparisons": comparisons,
         "matched_pair_evidence": matched_pair_evidence,
diff --git a/context-guard-kit/cache_score.py b/context-guard-kit/cache_score.py
index db642cd..c330c9d 100755
--- a/context-guard-kit/cache_score.py
+++ b/context-guard-kit/cache_score.py
@@ -23,6 +23,9 @@ SCHEMA_VERSION = "contextguard.cache-score.v1"
 
 DEFAULT_MAX_INPUT_BYTES = 1_000_000
 TOKEN_PROXY_CHARS_PER_TOKEN = 4
+DEFAULT_EXPECTED_REUSES = 1
+MAX_EXPECTED_REUSES = 1_000_000
+MAX_CACHE_MULTIPLIER = 1_000_000.0
 PROVIDER_MINIMUM_CACHEABLE_TOKENS = {
     # Provider and model minimums move over time. These defaults are advisory
     # and can be overridden with --minimum-cacheable-tokens.
@@ -110,6 +113,30 @@ def bounded_int(value: object, *, default: int, minimum: int, maximum: int, name
     return number
 
 
+def bounded_float(
+    value: object,
+    *,
+    minimum: float,
+    maximum: float,
+    name: str,
+) -> float | None:
+    if value is None:
+        return None
+    if isinstance(value, bool):
+        fail(f"{name} must be a finite number")
+    try:
+        number = float(value)
+    except (TypeError, ValueError, OverflowError):
+        fail(f"{name} must be a finite number")
+    if not math.isfinite(number):
+        fail(f"{name} must be finite")
+    if number < minimum:
+        fail(f"{name} must be >= {minimum:g}")
+    if number > maximum:
+        fail(f"{name} must be <= {maximum:g}")
+    return number
+
+
 def normalized_link_target(parent: Path, raw_target: str) -> Path:
     target = Path(raw_target)
     if not target.is_absolute():
@@ -252,7 +279,103 @@ def json_shape_warnings(text: str) -> tuple[str, list[dict[str, Any]]]:
     return "json", warnings
 
 
-def score_prompt(text: str, *, provider: str, minimum_cacheable_tokens: int) -> dict[str, Any]:
+def build_amortization_report(
+    *,
+    eligible: bool,
+    prefix_tokens: int,
+    expected_reuses: int,
+    cache_write_multiplier: float | None,
+    cache_read_multiplier: float | None,
+) -> dict[str, Any]:
+    """Return advisory cache amortization math using user-supplied multipliers.
+
+    ``expected_reuses`` means future cache reads after the initial cache write.
+    Multipliers are relative to uncached prefix input cost = 1.0. Provider
+    pricing/cache policies change, so ContextGuard intentionally does not ship
+    provider-specific multiplier defaults.
+    """
+    supplied = cache_write_multiplier is not None and cache_read_multiplier is not None
+    break_even_reuses: int | None = None
+    expected_uncached_relative_cost: float | None = None
+    expected_cached_relative_cost: float | None = None
+    expected_relative_savings: float | None = None
+    status = "multipliers_not_supplied"
+    risk = "unknown"
+
+    if not eligible:
+        status = "not_cacheable"
+        risk = "high"
+    elif not supplied:
+        status = "multipliers_not_supplied"
+        risk = "unknown"
+    else:
+        expected_uncached_relative_cost = 1.0 + expected_reuses
+        expected_cached_relative_cost = cache_write_multiplier + (expected_reuses * cache_read_multiplier)
+        expected_relative_savings = expected_uncached_relative_cost - expected_cached_relative_cost
+        if cache_read_multiplier < 1.0:
+            if cache_write_multiplier <= 1.0:
+                break_even_reuses = 0
+            else:
+                break_even_reuses = int(math.ceil((cache_write_multiplier - 1.0) / (1.0 - cache_read_multiplier)))
+            if expected_reuses >= break_even_reuses:
+                status = "already_break_even_on_write" if break_even_reuses == 0 else "amortizes_with_expected_reuses"
+                risk = "low"
+            elif expected_reuses > 0:
+                status = "not_enough_expected_reuses"
+                risk = "medium"
+            else:
+                status = "not_enough_expected_reuses"
+                risk = "high"
+        elif cache_read_multiplier == 1.0 and cache_write_multiplier <= 1.0:
+            break_even_reuses = 0
+            status = "already_break_even_on_write"
+            risk = "low"
+        elif cache_read_multiplier > 1.0 and cache_write_multiplier <= 1.0 and expected_reuses == 0:
+            break_even_reuses = 0
+            status = "already_break_even_on_write"
+            risk = "low"
+        elif cache_read_multiplier > 1.0 and expected_relative_savings >= 0:
+            break_even_reuses = 0 if cache_write_multiplier <= 1.0 else None
+            status = "amortizes_with_expected_reuses"
+            risk = "medium"
+        else:
+            status = "no_read_discount"
+            risk = "high"
+
+    return {
+        "expected_reuses": expected_reuses,
+        "expected_reuses_semantics": "future_cache_reads_after_initial_write",
+        "cacheable_prefix_tokens": prefix_tokens,
+        "break_even_reuses": break_even_reuses,
+        "status": status,
+        "risk": risk,
+        "cache_write_multiplier": cache_write_multiplier,
+        "cache_read_multiplier": cache_read_multiplier,
+        "expected_uncached_relative_cost": expected_uncached_relative_cost,
+        "expected_cached_relative_cost": expected_cached_relative_cost,
+        "expected_relative_savings": expected_relative_savings,
+        "multiplier_baseline": "uncached_prefix_input_cost_equals_1.0",
+        "user_supplied_multipliers": supplied,
+        "formula": "expected_cached=write_multiplier + expected_reuses*read_multiplier; expected_uncached=1 + expected_reuses; break_even=ceil((write_multiplier - 1.0)/(1.0-read_multiplier)) only when read_multiplier<1",
+        "claim_boundary": {
+            "advisory_only": True,
+            "provider_pricing_defaults_included": False,
+            "provider_measured_cache_hit": False,
+            "hosted_api_token_or_cost_savings_claim_allowed": False,
+            "requires_user_supplied_or_provider_documented_multipliers": True,
+        },
+    }
+
+
+def score_prompt(
+    text: str,
+    *,
+    provider: str,
+    minimum_cacheable_tokens: int,
+    expected_reuses: int = DEFAULT_EXPECTED_REUSES,
+    cache_write_multiplier: float | None = None,
+    cache_read_multiplier: float | None = None,
+) -> dict[str, Any]:
     prompt_kind, shape_warnings = json_shape_warnings(text)
     dynamic_offset, dynamic_marker = first_dynamic_marker(text)
     prefix_text = text if dynamic_offset is None else text[:dynamic_offset]
@@ -282,13 +405,14 @@ def score_prompt(text: str, *, provider: str, minimum_cacheable_tokens: int) ->
             "message": "Anthropic caching usually requires cache_control around the reusable prefix.",
         })
 
+    eligible = prefix_estimated >= minimum_cacheable_tokens
     return {
         "tool": TOOL_NAME,
         "schema_version": SCHEMA_VERSION,
         "provider": provider,
         "prompt_kind": prompt_kind,
         "minimum_cacheable_tokens": minimum_cacheable_tokens,
-        "eligible": prefix_estimated >= minimum_cacheable_tokens,
+        "eligible": eligible,
         "estimated_tokens": estimated,
         "cacheable_prefix_tokens": prefix_estimated,
         "token_estimate": {
@@ -305,6 +429,13 @@ def score_prompt(text: str, *, provider: str, minimum_cacheable_tokens: int) ->
         "static_prefix_ratio": round(static_ratio, 6),
         "warnings": warnings,
         "provider_caveat": PROVIDER_CAVEATS[provider],
+        "amortization": build_amortization_report(
+            eligible=eligible,
+            prefix_tokens=prefix_estimated,
+            expected_reuses=expected_reuses,
+            cache_write_multiplier=cache_write_multiplier,
+            cache_read_multiplier=cache_read_multiplier,
+        ),
         "raw_prompt_stored": False,
         "claim_boundary": {
             "advisory_only": True,
@@ -320,11 +451,15 @@ def render_text(report: dict[str, Any]) -> str:
     status = "eligible" if report.get("eligible") else "not eligible"
     warnings = report.get("warnings") if isinstance(report.get("warnings"), list) else []
     warning_codes = ", ".join(str(item.get("code")) for item in warnings if isinstance(item, dict)) or "none"
+    amortization = report.get("amortization") if isinstance(report.get("amortization"), dict) else {}
     return (
         f"{TOOL_NAME}: {status} for {report['provider']} "
         f"(static_prefix≈{report['cacheable_prefix_tokens']} char/4 tokens, "
         f"minimum={report['minimum_cacheable_tokens']})\n"
         f"warnings: {warning_codes}\n"
+        f"amortization: {amortization.get('status', 'unknown')} "
+        f"(risk={amortization.get('risk', 'unknown')}, "
+        f"break_even_reuses={amortization.get('break_even_reuses')})\n"
         "claim boundary: advisory static lint only; not a measured provider cache hit or cost saving.\n"
     )
 
@@ -344,6 +479,24 @@ def build_parser() -> argparse.ArgumentParser:
         help="override provider threshold for model/platform-specific cache minimums",
     )
     parser.add_argument("--max-input-bytes", default=DEFAULT_MAX_INPUT_BYTES, help=f"maximum input bytes (default: {DEFAULT_MAX_INPUT_BYTES})")
+    parser.add_argument(
+        "--expected-reuses",
+        default=DEFAULT_EXPECTED_REUSES,
+        help=(
+            "future cache reads expected after the initial write; advisory only "
+            f"(default: {DEFAULT_EXPECTED_REUSES})"
+        ),
+    )
+    parser.add_argument(
+        "--cache-write-multiplier",
+        default=None,
+        help="optional user-supplied cache write multiplier relative to uncached prefix input cost=1.0",
+    )
+    parser.add_argument(
+        "--cache-read-multiplier",
+        default=None,
+        help="optional user-supplied cache read multiplier relative to uncached prefix input cost=1.0",
+    )
     parser.add_argument("--json", action="store_true", help="emit stable JSON")
     return parser
 
@@ -362,8 +515,34 @@ def main(argv: list[str] | None = None) -> int:
         maximum=10_000_000,
         name="--minimum-cacheable-tokens",
     )
+    expected_reuses = bounded_int(
+        args.expected_reuses,
+        default=DEFAULT_EXPECTED_REUSES,
+        minimum=0,
+        maximum=MAX_EXPECTED_REUSES,
+        name="--expected-reuses",
+    )
+    cache_write_multiplier = bounded_float(
+        args.cache_write_multiplier,
+        minimum=0.0,
+        maximum=MAX_CACHE_MULTIPLIER,
+        name="--cache-write-multiplier",
+    )
+    cache_read_multiplier = bounded_float(
+        args.cache_read_multiplier,
+        minimum=0.0,
+        maximum=MAX_CACHE_MULTIPLIER,
+        name="--cache-read-multiplier",
+    )
     text = read_limited_path(Path(args.input), max_input_bytes) if args.input else read_limited_stdin(max_input_bytes)
-    report = score_prompt(text, provider=provider, minimum_cacheable_tokens=minimum)
+    report = score_prompt(
+        text,
+        provider=provider,
+        minimum_cacheable_tokens=minimum,
+        expected_reuses=expected_reuses,
+        cache_write_multiplier=cache_write_multiplier,
+        cache_read_multiplier=cache_read_multiplier,
+    )
     if args.json:
         sys.stdout.write(json_bytes(report, indent=2) + "\n")
     else:
diff --git a/context-guard-kit/tool_schema_pruner.py b/context-guard-kit/tool_schema_pruner.py
index c070c42..d2ae4a1 100755
--- a/context-guard-kit/tool_schema_pruner.py
+++ b/context-guard-kit/tool_schema_pruner.py
@@ -844,7 +844,14 @@ def defer_report(args: argparse.Namespace) -> str:
         namespace_top=namespace_top,
     )
     all_schema_bytes = sum(byte_len_json(cand.schema) for cand in ranked)
+    listed_deferred_schema_bytes = sum(byte_len_json(cand.schema) for cand in deferred_candidates)
+    total_deferred_schema_bytes = sum(byte_len_json(cand.schema) for cand in ranked[core_top:])
     tool_stub_report_bytes = byte_len_json(core_tools) + byte_len_json(deferred_tools)
+    all_schema_tokens = proxy_tokens(all_schema_bytes)
+    inline_core_schema_tokens = proxy_tokens(core_schema_bytes)
+    listed_deferred_schema_tokens = proxy_tokens(listed_deferred_schema_bytes)
+    total_deferred_schema_tokens = proxy_tokens(total_deferred_schema_bytes)
+    tool_stub_report_tokens = proxy_tokens(tool_stub_report_bytes)
     result = {
         "tool": TOOL_NAME,
         "schema_version": DEFER_SCHEMA_VERSION,
@@ -862,6 +869,7 @@ def defer_report(args: argparse.Namespace) -> str:
         "deferred_tools_truncated_count": max(0, len(ranked) - core_top - len(deferred_tools)),
         "deferred_namespaces": deferred_namespaces,
         "deferred_namespaces_truncated_count": deferred_namespaces_truncated_count,
+        "deferred_schema_retrieval_required_before_use": True,
         "receipt": {
             **receipt,
             "bytes": receipt_size,
@@ -871,9 +879,21 @@ def defer_report(args: argparse.Namespace) -> str:
             "method": "char4_proxy",
             "chars_per_token": TOKEN_PROXY_CHARS_PER_TOKEN,
             "all_schema_bytes": all_schema_bytes,
+            "inline_core_schema_bytes": core_schema_bytes,
+            "listed_deferred_schema_bytes": listed_deferred_schema_bytes,
+            "total_deferred_schema_bytes": total_deferred_schema_bytes,
             "tool_stub_report_bytes": tool_stub_report_bytes,
-            "all_schema_tokens_estimated": proxy_tokens(all_schema_bytes),
-            "tool_stub_report_tokens_estimated": proxy_tokens(tool_stub_report_bytes),
+            "all_schema_tokens_estimated": all_schema_tokens,
+            "inline_core_schema_tokens_estimated": inline_core_schema_tokens,
+            "listed_deferred_schema_tokens_estimated": listed_deferred_schema_tokens,
+            "total_deferred_schema_tokens_estimated": total_deferred_schema_tokens,
+            "tool_stub_report_tokens_estimated": tool_stub_report_tokens,
+            "gross_listed_deferred_schema_tokens_avoided": listed_deferred_schema_tokens,
+            "gross_total_deferred_schema_tokens_avoided": total_deferred_schema_tokens,
+            "net_initial_report_tokens_delta": tool_stub_report_tokens - all_schema_tokens,
+            "net_initial_report_tokens_delta_semantics": "tool_stub_report_tokens_estimated_minus_all_schema_tokens_estimated",
+            "estimated_initial_schema_tokens_avoided": max(0, all_schema_tokens - tool_stub_report_tokens),
+            "estimated_initial_schema_tokens_avoided_semantics": "max(0, all_schema_tokens_estimated - tool_stub_report_tokens_estimated)",
             "claim_boundary": "proxy_only_not_provider_billed_tokens",
         },
         "provider_patterns": [
@@ -899,11 +919,13 @@ def defer_report(args: argparse.Namespace) -> str:
             "provider_tool_search_configured": False,
             "hosted_api_token_or_cost_savings_claim_allowed": False,
             "requires_provider_measured_matched_tasks_for_savings_claims": True,
+            "deferred_schema_retrieval_required_before_use": True,
         },
         "redaction": {"redacted_values": total_redactions},
         "caveats": [
             "Deferred loading is an application strategy report, not a native provider integration.",
             "Token proxy values are char/4 estimates over sanitized local JSON, not billed provider tokens.",
+            "Deferred schema token fields are initial-prompt proxy accounting; full schemas must be retrieved before deferred tool use.",
             "Use receipt get commands to retrieve full sanitized schemas before using deferred tools.",
         ],
     }
diff --git a/plugins/context-guard/README.ko.md b/plugins/context-guard/README.ko.md
index a86ec77..9340a80 100644
--- a/plugins/context-guard/README.ko.md
+++ b/plugins/context-guard/README.ko.md
@@ -79,7 +79,9 @@ context-guard-sanitize-output -- git diff
 context-guard-pack auto --root . --query "failing tests review" --diff HEAD --manifest-out suggested-pack.json --pack-out context-pack.md --budget-bytes 12000 --json --explain
 context-guard-pack build --root . --manifest suggested-pack.json --budget-bytes 12000 --json
 context-guard-pack slice --root . --path README.md --lines 1:40 --json
+context-guard-cache-score --input prompt.json --provider openai --json
 context-guard-tool-prune select --catalog tools.json --query "review failing tests" --top 5 --budget-bytes 12000 --json
+context-guard-tool-prune defer-report --catalog tools.json --query "review failing tests" --core-top 3 --deferred-top 20 --json
 context-guard-tool-prune get --tool read_file --json
 context-guard-statusline
 context-guard-statusline-merged
@@ -92,15 +94,15 @@ context-guard-statusline-merged
 - **대용량 읽기 가드와 심볼 리더**는 파일 전체 읽기 전에 검색, 심볼 구간, 작은 줄 범위 읽기 순서로 에이전트를 안내합니다. Python, JavaScript/TypeScript, Go, Rust 소스 구간 읽기를 지원합니다.
 - **로컬 로그 보관소**는 큰 명령 출력을 기본적으로 `.context-guard/artifacts`에 가림 처리해 저장하고, 줄 번호가 있는 top error, 중복 라인 그룹, 가림 처리된 bounded suggested query가 담긴 요약 기록이나 요청한 정확한 줄 범위만 반환합니다. `get`과 `list`는 리브랜딩 이전의 `.claude-token-optimizer/artifacts` 요약 기록도 읽을 수 있습니다.
 - **예산 기반 컨텍스트 패커**는 우선순위가 있는 로컬 파일 근거를 렌더링된 바이트 예산 안의 Markdown pack으로 조립하고, 포함·부분 포함·누락 source 메타데이터, bounded `.context-guard/packs` 요약 기록, 안전할 때만 정확한 가림 처리 `slice` 명령, 안전하지 않을 때의 `retrieval_omitted_reason`을 남깁니다. 추가된 `auto` 하위 명령은 추천과 pack build를 한 번에 실행하고, `auto --explain`은 manifest, pack 본문, receipt, byte budget을 바꾸지 않으면서 결정적 로컬 선택/build 이유를 짧게 추가합니다. JSON explain의 bounded repo-map은 sampled byte/token-proxy tree, category-only secret risk count, signature-first hint, explain-only graph rank, 기존 `slice`/symbol 재조회 힌트를 제공하지만 pack 선택이나 provider savings claim은 아닙니다. `suggest`는 로컬 query, diff, 명시 파일, 가림 처리된 output/test-output 신호를 `build`와 호환되는 manifest로 순위화하며 네트워크·모델 호출·임베딩·provider 비용 추정은 하지 않습니다. 토큰 수는 측정된 provider token 절감이 아니라 추정 `chars_div_4` proxy입니다.
-- **Tool/MCP schema pruner**는 로컬 tool catalog를 bounded top-k 자문 리포트로 순위화하고, compact 요약 기록과 payload integrity check로 전체 가림 처리된 schema 재조회를 보존합니다.
+- **Tool/MCP schema pruner**는 로컬 tool catalog를 bounded top-k 자문 리포트로 순위화하고, compact 요약 기록과 payload integrity check로 전체 가림 처리된 schema 재조회를 보존합니다. `defer-report`는 core inline tool과 deferred stub/namespace 요약을 나누고 gross deferred-schema 및 net initial-report `chars_div_4` proxy 회계를 보여주지만, deferred tool을 쓰기 전에는 전체 schema를 다시 조회해야 합니다.
 - **보수적 압축기**는 가림 처리된 stdin을 JSON, diff, 로그, 검색 출력, 코드, 산문으로 분류하고, 관측 바이트 근거와 추정 토큰 proxy를 함께 노출합니다.
-- **Anthropic 비용 가드와 route advisor**는 `context-guard cost preflight/observe/ledger/compile`로 호출 전 비용 추정, provider usage 대조, keyed-HMAC cache 위험 기록, 안정적인 prefix 배치 안내를 제공합니다. `context-guard route-advisor`는 caller가 제공한 workload JSON, provider feature 선언, usage telemetry, 외부·로컬 shifted cost를 읽는 local-only passive advisor이며 queue를 시작하거나 provider를 호출하거나 pricing 문서를 새로 가져오거나 provider feature 지식을 authoritative하게 취급하지 않고 total-cost accounting, batchability blocker, route 후보를 출력합니다. 원문 프롬프트를 저장하지 않고 Anthropic/provider prompt cache를 대체하지 않으며, 추천은 matched successful task, 비열등 quality evidence, shifted-cost accounting 없이는 hosted token/cost 절감 주장이 아닙니다.
+- **정적 cache-score lint와 Anthropic 비용 가드/route advisor**는 `context-guard-cache-score`로 로컬 prompt/request cache layout과 사용자 제공 cache write/read multiplier 기반 amortization 위험을 안내하고, `context-guard cost preflight/observe/ledger/compile`로 호출 전 비용 추정, provider usage 대조, keyed-HMAC cache 위험 기록, 안정적인 prefix 배치 안내를 제공합니다. `context-guard route-advisor`는 caller가 제공한 workload JSON, provider feature 선언, usage telemetry, 외부·로컬 shifted cost를 읽는 local-only passive advisor이며 queue를 시작하거나 provider를 호출하거나 pricing 문서를 새로 가져오거나 provider feature 지식을 authoritative하게 취급하지 않고 total-cost accounting, batchability blocker, route 후보를 출력합니다. 원문 프롬프트를 저장하지 않고 Anthropic/provider prompt cache를 대체하지 않으며, 추천은 matched successful task, 비열등 quality evidence, shifted-cost accounting 없이는 hosted token/cost 절감 주장이 아닙니다.
 - **출력 축약기**는 감싼 명령의 종료 코드를 보존하면서 긴 로그를 줄이고, `--digest markdown` 또는 `--digest json`으로 실행기 실패 정보, 가림 처리된 failure signature, 중복 라인 그룹, 다음 조회 제안이 담긴 요약을 만들 수 있습니다.
 - **민감정보 가림 도구**는 검색, diff, 로그 출력에서 자격 증명 패턴, 비공개 키 블록, 인증 헤더, 자격 증명이 포함된 URL, 민감해 보이는 경로를 가립니다.
- **상태표시줄**은 모델, 컨텍스트, 비용 신호를 짧게 보여주고, 대화 기록 데이터가 있으면 캐시 읽기와 캐시 재사용 신호도 함께 표시합니다. - **대화 기록 감사**는 usage/cost/cache bucket을 집계하고, 토큰 집중 지점, `cache_friendliness` 프롬프트 배치 신호, `cache_layout_advice` 확인/실험 우선순위를 제한된 가림 처리된 segment hash로 보고합니다. 원문 프롬프트는 출력하지 않습니다. - **반복 실패 알림**은 Bash 실패가 반복될 때 같은 경로를 계속 재시도하지 않고 전략을 바꾸도록 안내합니다. -- **벤치마크 헬퍼**는 기준/변형 실행을 대응해 실제 토큰·비용 필드, 별도의 바이트 감소 간접 증거, 진단용 `wall_time_seconds`, `provider_cached_tokens`, provider-cache 사용 가능성 텔레메트리, 파일 기반 `variant_prompt_files`, 선택적 run별 `self_hosted_metrics` JSONL ledger sidecar를 기록합니다. 이 sidecar는 hosted API 절감 주장에 합치지 않습니다. +- **벤치마크 헬퍼**는 기준/변형 실행을 대응해 실제 토큰·비용 필드, 별도의 바이트 감소 간접 증거, 진단용 `wall_time_seconds`, `provider_cached_tokens`, provider-cache 사용 가능성 텔레메트리, report-level measurement-baseline contract, 파일 기반 `variant_prompt_files`, 선택적 run별 `self_hosted_metrics` JSONL ledger sidecar를 기록합니다. 이 sidecar는 hosted API 절감 주장에 합치지 않습니다. 비용 가드의 로컬 HMAC 키는 기본적으로 `.context-guard/cost-ledger/hmac.key`에 자동 생성됩니다. 관리자가 직접 주입하는 경우 파일에는 필수 padding을 포함한 canonical URL-safe base64 32바이트 키만 정확히 들어 있어야 하며, trailing newline이나 공백은 허용하지 않습니다. 리포트는 키와 원문 프롬프트를 출력하지 않고, 로컬 ledger는 Anthropic/provider prompt cache를 대체하지 않습니다. diff --git a/plugins/context-guard/README.md b/plugins/context-guard/README.md index b628282..d3c10c1 100644 --- a/plugins/context-guard/README.md +++ b/plugins/context-guard/README.md @@ -103,15 +103,15 @@ context-guard-statusline-merged - **Declarative output filter** validates user-owned JSON filter files outside package code and applies the first matching line filter only as an explicit `run --config ... -- ` wrapper. Invalid configs, no-match commands, filter errors, empty filtered output, and protected `git`/test/lint/`gh` command failures preserve original stdout/stderr and exit code. 
Filtered mode applies line rules to combined stdout+stderr and writes the filtered result to stdout; `--json-report` diagnostics go to stderr, except protected nonzero passthrough suppresses reports to keep stderr raw. It is local and opt-in, with no savings guarantee. - **Artifact store** saves large sanitized command output under `.context-guard/artifacts` by default and returns compact receipts, local sandbox search results, or exact requested slices. JSON receipts include line-numbered top errors, duplicate-line groups, and sanitized bounded suggested queries. `search` scans sanitized local artifacts by literal substring, emits capped match/context records, and includes `get --lines START:END` rehydration commands without hosted token/cost savings claims. Custom `--dir` raw paths stay redacted by default; reuse the same `--dir` or opt into `search --show-paths` for a directly executable local command. In suggested `--lines START:END` queries, `--max-lines` is only the returned-line cap for that selected range, not a wider selector. `get`, `list`, and `search` can also read legacy `.claude-token-optimizer/artifacts` receipts. - **Budgeted context packer** assembles prioritized local file evidence into a rendered byte-budgeted Markdown pack with included/partial/omitted source metadata, bounded `.context-guard/packs` receipts, exact sanitized `slice` commands when safe, and `retrieval_omitted_reason` when a path/root should not be echoed. The additive `auto` subcommand runs that recommendation and pack build in one step, and `auto --explain` adds compact deterministic local selection/build reasons without changing the manifest, pack body, receipt, or byte budget. JSON explain also includes bounded repo-map metadata: sampled byte/token-proxy tree entries, category-only secret-risk counts, signature-first hints, explain-only graph ranks, and exact `slice`/symbol retrieval hints. 
`suggest` remains available to rank local query, diff, explicit file, and sanitized output/test-output signals into a build-compatible manifest without network, model, embedding, or provider-cost calls. `suggest/auto --adaptive-k` adds advisory-only shrink/expand top-k metadata from local score distribution, byte-budget fit, and score-mass recall/precision proxies; it never applies the recommendation automatically or changes the manifest, pack body, receipt, or byte budget. `auto --symbol-memory` adds repo-map-derived symbol/graph advisory metadata with exact `slice`/`read-symbol` verification hints and still does not change selection or pack output. Token counts are estimated `chars_div_4` proxies, not measured provider-token savings. -- **Tool/MCP schema pruner** ranks local tool catalogs into bounded top-k advisory reports while preserving full sanitized schema fallback through compact receipts and payload integrity checks. +- **Tool/MCP schema pruner** ranks local tool catalogs into bounded top-k advisory reports while preserving full sanitized schema fallback through compact receipts and payload integrity checks. `defer-report` additionally separates core inline tools from deferred stubs/namespaces and reports gross deferred-schema plus net initial-report char/4 proxy accounting; full schemas still must be retrieved before deferred tool use. - **Conservative compressor** classifies sanitized stdin as JSON, diff, log, search output, code, or prose and shrinks it with observed byte evidence plus estimated token proxies. Add `--protected-policy` for opt-in protected-zone class/count metadata that denies semantic rewrites for code fences, diffs, identifiers, numeric constants, hashes, paths, stack frames, quoted strings, and JSON keys while preserving exact-retrieval guidance. 
Add `--mode readable` only for sanitized prose previews: it uses deterministic sentence windows, blocks prompt-like/high-risk protected signals, stores no raw protected spans, and does not run learned compressors, models, embeddings, or rerankers. -- **Anthropic cost guard and route advisor** provides `context-guard cost preflight/observe/ledger/compile` for passive pre-call estimates, provider-usage reconciliation, keyed-HMAC cache-risk history, and stable-prefix layout advice. `context-guard route-advisor` is a local-only passive advisor for caller-supplied workload JSON, provider feature declarations, usage telemetry, and shifted external/local costs; it emits total-cost accounting, batchability blockers, and route candidates without starting a queue, calling providers, refreshing pricing docs, or treating provider feature knowledge as authoritative. It stores no raw prompt text, does not replace Anthropic/provider prompt caching, and its recommendations are not hosted token/cost savings claims without matched successful tasks, non-inferior quality evidence, and shifted-cost accounting. +- **Static cache-score lint plus Anthropic cost guard and route advisor** provides `context-guard-cache-score` for local prompt/request cache layout checks, with optional user-supplied cache write/read multiplier amortization risk, and `context-guard cost preflight/observe/ledger/compile` for passive pre-call estimates, provider-usage reconciliation, keyed-HMAC cache-risk history, and stable-prefix layout advice. `context-guard route-advisor` is a local-only passive advisor for caller-supplied workload JSON, provider feature declarations, usage telemetry, and shifted external/local costs; it emits total-cost accounting, batchability blockers, and route candidates without starting a queue, calling providers, refreshing pricing docs, or treating provider feature knowledge as authoritative. 
It stores no raw prompt text, does not replace Anthropic/provider prompt caching, and its recommendations are not hosted token/cost savings claims without matched successful tasks, non-inferior quality evidence, and shifted-cost accounting. - **Output trimmer** preserves the wrapped command exit code, trims long logs, and can emit `--digest markdown` or `--digest json` summaries with runner failure facts, sanitized failure signatures, duplicate-line groups, and suggested next queries. Add `--artifact-receipt` with digest mode to store the exact sanitized full output as a local artifact receipt and re-expand omitted slices with the emitted `context-guard-artifact get ...` command. - **Sanitizer** redacts common credential patterns, private key blocks, auth headers, credential URLs, and sensitive-looking paths from search, diff, and log output. - **Statusline** displays compact model/context/cost signals and, when transcript data is available, cache-read and cache-reuse signals. - **Transcript audit** aggregates usage/cost/cache buckets, flags likely token hotspots, and exposes `cache_friendliness`, additive [`cache_diagnostics`](https://github.com/ictechgy/context-guard/blob/main/docs/cache-diagnostics-schema.md), and `cache_layout_advice` experiment priorities from bounded usage fields, timestamped cache telemetry records, and redacted segment hashes without printing raw prompt text or claiming provider-cache savings. - **Repeated-failure nudge** warns after repeated Bash failures so the agent switches strategy instead of retrying the same context-heavy path. -- **Benchmark helper** records matched baseline/variant runs with real token and cost fields, separate byte-reduction proxy evidence, diagnostic `wall_time_seconds`, `provider_cached_tokens`, provider-cache availability telemetry, file-backed `variant_prompt_files`, and optional per-run `self_hosted_metrics` JSONL ledger sidecars that stay out of hosted API savings claims. 
+- **Benchmark helper** records matched baseline/variant runs with real token and cost fields, separate byte-reduction proxy evidence, diagnostic `wall_time_seconds`, `provider_cached_tokens`, provider-cache availability telemetry, a report-level measurement-baseline contract, file-backed `variant_prompt_files`, and optional per-run `self_hosted_metrics` JSONL ledger sidecars that stay out of hosted API savings claims. Cost guard creates its local HMAC key automatically at `.context-guard/cost-ledger/hmac.key`. If you provision that file yourself, it must contain exactly one canonical URL-safe base64 32-byte key with required padding and no trailing newline or whitespace. Reports never emit the key or raw prompt text, and the local ledger does not replace Anthropic/provider prompt caching. diff --git a/plugins/context-guard/bin/context-guard-bench b/plugins/context-guard/bin/context-guard-bench index 70afd68..e338b88 100755 --- a/plugins/context-guard/bin/context-guard-bench +++ b/plugins/context-guard/bin/context-guard-bench @@ -184,6 +184,7 @@ MAX_USAGE_COST_USD = 10**9 TOKEN_PROXY_BYTES_PER_TOKEN = 4 BENCH_RUN_EVIDENCE_SCHEMA_VERSION = "contextguard.bench.run-evidence.v1" MATCHED_PAIR_EVIDENCE_SCHEMA_VERSION = "contextguard.bench.matched-pair.v1" +MEASUREMENT_BASELINE_SCHEMA_VERSION = "contextguard.bench.measurement-baseline.v1" SELF_HOSTED_METRICS_SCHEMA_VERSION = "contextguard.bench.self-hosted-metrics.v1" SELF_HOSTED_METRICS_KEY = "self_hosted_metrics" SELF_HOSTED_METRICS_CLAIM_BOUNDARY = "self_hosted_metrics_only_not_hosted_api_token_or_cost_savings" @@ -1546,6 +1547,77 @@ def row_cost_shift_measured(row: dict[str, str]) -> bool: ) +def measurement_baseline_contract() -> dict[str, Any]: + """Describe the benchmark report's current measurement baseline contract. + + This block is descriptive. 
It does not change the CSV schema and does not + grant token/cost savings claims by itself; those remain gated by matched + successful tasks, measured primary tokens/costs, shifted-cost accounting, + and quality gates. + """ + return { + "schema_version": MEASUREMENT_BASELINE_SCHEMA_VERSION, + "csv_schema_unchanged": True, + "csv_columns": list(CSV_COLUMNS), + "captured_fields": { + "task_identity": ["task_id", "variant"], + "run_configuration": ["model", "effort", "claude_version"], + "primary_token_buckets": [ + "input_tokens", + "output_tokens", + "cache_read", + "cache_creation", + "total_tokens", + "primary_tokens_measured", + ], + "primary_cost": ["cost_usd", "cost_measured"], + "provider_cache_telemetry": ["provider_cached_tokens", "provider_cached_tokens_measured"], + "latency": ["wall_time_seconds"], + "quality_and_result": ["success", "corrections", "notes"], + "tooling_and_proxy_metrics": ["turns", "hook_triggers", "bytes_before", "bytes_after", "artifacts_used"], + "shifted_cost_accounting": [ + "external_tokens", + "external_tokens_measured", + "external_cost_usd", + "external_cost_measured", + "total_cost_with_shift_usd", + ], + }, + "claim_eligible_fields": { + "token_savings": [ + "matched successful baseline and variant tasks", + "primary_tokens_measured=true on both sides", + "quality_gate=pass", + ], + "shifted_cost_savings": [ + "matched successful baseline and variant tasks", + "cost_measured=true on both sides", + "external_cost_measured=true when external_tokens are present", + "quality_gate=pass", + ], + }, + "proxy_only_fields": { + "byte_metrics": ["bytes_before", "bytes_after"], + "token_proxy": "chars_div_4_proxy_only", + "provider_cache": "diagnostic_telemetry_not_contextguard_token_reduction", + }, + "missing_future_run_identity_fields": [ + "repo_revision", + "agent_harness", + "feature_flags", + "provider_name", + "success_command_identity", + ], + "claim_boundary": { + "descriptive_contract_only": True, + 
"enables_savings_claims_by_itself": False, + "requires_matched_successful_tasks": True, + "requires_shifted_cost_accounting_for_cost_claims": True, + "raw_proxy_estimates_are_not_hosted_api_token_savings": True, + }, + } + + def summarize_benchmark_rows(rows: list[dict[str, str]], baseline_variant: str) -> dict[str, Any]: by_variant: dict[str, dict[str, Any]] = {} successful_rows_by_variant_task: dict[str, dict[str, list[dict[str, str]]]] = {} @@ -2191,6 +2263,7 @@ def summarize_benchmark_rows(rows: list[dict[str, str]], baseline_variant: str) "schema": "context-guard-bench-report-v1", "baseline_variant": baseline_variant, "row_count": len(rows), + "measurement_baseline": measurement_baseline_contract(), "summary_by_variant": by_variant, "comparisons": comparisons, "matched_pair_evidence": matched_pair_evidence, diff --git a/plugins/context-guard/bin/context-guard-cache-score b/plugins/context-guard/bin/context-guard-cache-score index db642cd..c330c9d 100755 --- a/plugins/context-guard/bin/context-guard-cache-score +++ b/plugins/context-guard/bin/context-guard-cache-score @@ -23,6 +23,9 @@ TOOL_NAME = "context-guard-cache-score" SCHEMA_VERSION = "contextguard.cache-score.v1" DEFAULT_MAX_INPUT_BYTES = 1_000_000 TOKEN_PROXY_CHARS_PER_TOKEN = 4 +DEFAULT_EXPECTED_REUSES = 1 +MAX_EXPECTED_REUSES = 1_000_000 +MAX_CACHE_MULTIPLIER = 1_000_000.0 PROVIDER_MINIMUM_CACHEABLE_TOKENS = { # Provider and model minimums move over time. These defaults are advisory # and can be overridden with --minimum-cacheable-tokens. 
@@ -110,6 +113,30 @@ def bounded_int(value: object, *, default: int, minimum: int, maximum: int, name return number +def bounded_float( + value: object, + *, + minimum: float, + maximum: float, + name: str, +) -> float | None: + if value is None: + return None + if isinstance(value, bool): + fail(f"{name} must be a finite number") + try: + number = float(value) + except (TypeError, ValueError, OverflowError): + fail(f"{name} must be a finite number") + if not math.isfinite(number): + fail(f"{name} must be finite") + if number < minimum: + fail(f"{name} must be >= {minimum:g}") + if number > maximum: + fail(f"{name} must be <= {maximum:g}") + return number + + def normalized_link_target(parent: Path, raw_target: str) -> Path: target = Path(raw_target) if not target.is_absolute(): @@ -252,7 +279,103 @@ def json_shape_warnings(text: str) -> tuple[str, list[dict[str, Any]]]: return "json", warnings -def score_prompt(text: str, *, provider: str, minimum_cacheable_tokens: int) -> dict[str, Any]: +def build_amortization_report( + *, + eligible: bool, + prefix_tokens: int, + expected_reuses: int, + cache_write_multiplier: float | None, + cache_read_multiplier: float | None, +) -> dict[str, Any]: + """Return advisory cache amortization math using user-supplied multipliers. + + ``expected_reuses`` means future cache reads after the initial cache write. + Multipliers are relative to uncached prefix input cost = 1.0. Provider + pricing/cache policies change, so ContextGuard intentionally does not ship + provider-specific multiplier defaults. 
+ """ + supplied = cache_write_multiplier is not None and cache_read_multiplier is not None + break_even_reuses: int | None = None + expected_uncached_relative_cost: float | None = None + expected_cached_relative_cost: float | None = None + expected_relative_savings: float | None = None + status = "multipliers_not_supplied" + risk = "unknown" + + if not eligible: + status = "not_cacheable" + risk = "high" + elif not supplied: + status = "multipliers_not_supplied" + risk = "unknown" + else: + expected_uncached_relative_cost = 1.0 + expected_reuses + expected_cached_relative_cost = cache_write_multiplier + (expected_reuses * cache_read_multiplier) + expected_relative_savings = expected_uncached_relative_cost - expected_cached_relative_cost + if cache_read_multiplier < 1.0: + if cache_write_multiplier <= 1.0: + break_even_reuses = 0 + else: + break_even_reuses = int(math.ceil((cache_write_multiplier - 1.0) / (1.0 - cache_read_multiplier))) + if expected_reuses >= break_even_reuses: + status = "already_break_even_on_write" if break_even_reuses == 0 else "amortizes_with_expected_reuses" + risk = "low" + elif expected_reuses > 0: + status = "not_enough_expected_reuses" + risk = "medium" + else: + status = "not_enough_expected_reuses" + risk = "high" + elif cache_read_multiplier == 1.0 and cache_write_multiplier <= 1.0: + break_even_reuses = 0 + status = "already_break_even_on_write" + risk = "low" + elif cache_read_multiplier > 1.0 and cache_write_multiplier <= 1.0 and expected_reuses == 0: + break_even_reuses = 0 + status = "already_break_even_on_write" + risk = "low" + elif cache_read_multiplier > 1.0 and expected_relative_savings >= 0: + break_even_reuses = 0 if cache_write_multiplier <= 1.0 else None + status = "amortizes_with_expected_reuses" + risk = "medium" + else: + status = "no_read_discount" + risk = "high" + + return { + "expected_reuses": expected_reuses, + "expected_reuses_semantics": "future_cache_reads_after_initial_write", + "cacheable_prefix_tokens": 
prefix_tokens, + "break_even_reuses": break_even_reuses, + "status": status, + "risk": risk, + "cache_write_multiplier": cache_write_multiplier, + "cache_read_multiplier": cache_read_multiplier, + "expected_uncached_relative_cost": expected_uncached_relative_cost, + "expected_cached_relative_cost": expected_cached_relative_cost, + "expected_relative_savings": expected_relative_savings, + "multiplier_baseline": "uncached_prefix_input_cost_equals_1.0", + "user_supplied_multipliers": supplied, + "formula": "expected_cached=write_multiplier + expected_reuses*read_multiplier; expected_uncached=1 + expected_reuses; break_even=ceil((write_multiplier - 1.0)/(1.0-read_multiplier)) only when read_multiplier<1", + "claim_boundary": { + "advisory_only": True, + "provider_pricing_defaults_included": False, + "provider_measured_cache_hit": False, + "hosted_api_token_or_cost_savings_claim_allowed": False, + "requires_user_supplied_or_provider_documented_multipliers": True, + }, + } + + +def score_prompt( + text: str, + *, + provider: str, + minimum_cacheable_tokens: int, + expected_reuses: int = DEFAULT_EXPECTED_REUSES, + cache_write_multiplier: float | None = None, + cache_read_multiplier: float | None = None, +) -> dict[str, Any]: prompt_kind, shape_warnings = json_shape_warnings(text) dynamic_offset, dynamic_marker = first_dynamic_marker(text) prefix_text = text if dynamic_offset is None else text[:dynamic_offset] @@ -282,13 +405,14 @@ def score_prompt(text: str, *, provider: str, minimum_cacheable_tokens: int) -> "message": "Anthropic caching usually requires cache_control around the reusable prefix.", }) + eligible = prefix_estimated >= minimum_cacheable_tokens return { "tool": TOOL_NAME, "schema_version": SCHEMA_VERSION, "provider": provider, "prompt_kind": prompt_kind, "minimum_cacheable_tokens": minimum_cacheable_tokens, - "eligible": prefix_estimated >= minimum_cacheable_tokens, + "eligible": eligible, "estimated_tokens": estimated, "cacheable_prefix_tokens": 
prefix_estimated, "token_estimate": { @@ -305,6 +429,13 @@ def score_prompt(text: str, *, provider: str, minimum_cacheable_tokens: int) -> "static_prefix_ratio": round(static_ratio, 6), "warnings": warnings, "provider_caveat": PROVIDER_CAVEATS[provider], + "amortization": build_amortization_report( + eligible=eligible, + prefix_tokens=prefix_estimated, + expected_reuses=expected_reuses, + cache_write_multiplier=cache_write_multiplier, + cache_read_multiplier=cache_read_multiplier, + ), "raw_prompt_stored": False, "claim_boundary": { "advisory_only": True, @@ -320,11 +451,15 @@ def render_text(report: dict[str, Any]) -> str: status = "eligible" if report.get("eligible") else "not eligible" warnings = report.get("warnings") if isinstance(report.get("warnings"), list) else [] warning_codes = ", ".join(str(item.get("code")) for item in warnings if isinstance(item, dict)) or "none" + amortization = report.get("amortization") if isinstance(report.get("amortization"), dict) else {} return ( f"{TOOL_NAME}: {status} for {report['provider']} " f"(static_prefix≈{report['cacheable_prefix_tokens']} char/4 tokens, " f"minimum={report['minimum_cacheable_tokens']})\n" f"warnings: {warning_codes}\n" + f"amortization: {amortization.get('status', 'unknown')} " + f"(risk={amortization.get('risk', 'unknown')}, " + f"break_even_reuses={amortization.get('break_even_reuses')})\n" "claim boundary: advisory static lint only; not a measured provider cache hit or cost saving.\n" ) @@ -344,6 +479,24 @@ def build_parser() -> argparse.ArgumentParser: help="override provider threshold for model/platform-specific cache minimums", ) parser.add_argument("--max-input-bytes", default=DEFAULT_MAX_INPUT_BYTES, help=f"maximum input bytes (default: {DEFAULT_MAX_INPUT_BYTES})") + parser.add_argument( + "--expected-reuses", + default=DEFAULT_EXPECTED_REUSES, + help=( + "future cache reads expected after the initial write; advisory only " + f"(default: {DEFAULT_EXPECTED_REUSES})" + ), + ) + 
parser.add_argument( + "--cache-write-multiplier", + default=None, + help="optional user-supplied cache write multiplier relative to uncached prefix input cost=1.0", + ) + parser.add_argument( + "--cache-read-multiplier", + default=None, + help="optional user-supplied cache read multiplier relative to uncached prefix input cost=1.0", + ) parser.add_argument("--json", action="store_true", help="emit stable JSON") return parser @@ -362,8 +515,34 @@ def main(argv: list[str] | None = None) -> int: maximum=10_000_000, name="--minimum-cacheable-tokens", ) + expected_reuses = bounded_int( + args.expected_reuses, + default=DEFAULT_EXPECTED_REUSES, + minimum=0, + maximum=MAX_EXPECTED_REUSES, + name="--expected-reuses", + ) + cache_write_multiplier = bounded_float( + args.cache_write_multiplier, + minimum=0.0, + maximum=MAX_CACHE_MULTIPLIER, + name="--cache-write-multiplier", + ) + cache_read_multiplier = bounded_float( + args.cache_read_multiplier, + minimum=0.0, + maximum=MAX_CACHE_MULTIPLIER, + name="--cache-read-multiplier", + ) text = read_limited_path(Path(args.input), max_input_bytes) if args.input else read_limited_stdin(max_input_bytes) - report = score_prompt(text, provider=provider, minimum_cacheable_tokens=minimum) + report = score_prompt( + text, + provider=provider, + minimum_cacheable_tokens=minimum, + expected_reuses=expected_reuses, + cache_write_multiplier=cache_write_multiplier, + cache_read_multiplier=cache_read_multiplier, + ) if args.json: sys.stdout.write(json_bytes(report, indent=2) + "\n") else: diff --git a/plugins/context-guard/bin/context-guard-tool-prune b/plugins/context-guard/bin/context-guard-tool-prune index c070c42..d2ae4a1 100755 --- a/plugins/context-guard/bin/context-guard-tool-prune +++ b/plugins/context-guard/bin/context-guard-tool-prune @@ -844,7 +844,14 @@ def defer_report(args: argparse.Namespace) -> str: namespace_top=namespace_top, ) all_schema_bytes = sum(byte_len_json(cand.schema) for cand in ranked) + 
listed_deferred_schema_bytes = sum(byte_len_json(cand.schema) for cand in deferred_candidates) + total_deferred_schema_bytes = sum(byte_len_json(cand.schema) for cand in ranked[core_top:]) tool_stub_report_bytes = byte_len_json(core_tools) + byte_len_json(deferred_tools) + all_schema_tokens = proxy_tokens(all_schema_bytes) + inline_core_schema_tokens = proxy_tokens(core_schema_bytes) + listed_deferred_schema_tokens = proxy_tokens(listed_deferred_schema_bytes) + total_deferred_schema_tokens = proxy_tokens(total_deferred_schema_bytes) + tool_stub_report_tokens = proxy_tokens(tool_stub_report_bytes) result = { "tool": TOOL_NAME, "schema_version": DEFER_SCHEMA_VERSION, @@ -862,6 +869,7 @@ def defer_report(args: argparse.Namespace) -> str: "deferred_tools_truncated_count": max(0, len(ranked) - core_top - len(deferred_tools)), "deferred_namespaces": deferred_namespaces, "deferred_namespaces_truncated_count": deferred_namespaces_truncated_count, + "deferred_schema_retrieval_required_before_use": True, "receipt": { **receipt, "bytes": receipt_size, @@ -871,9 +879,21 @@ def defer_report(args: argparse.Namespace) -> str: "method": "char4_proxy", "chars_per_token": TOKEN_PROXY_CHARS_PER_TOKEN, "all_schema_bytes": all_schema_bytes, + "inline_core_schema_bytes": core_schema_bytes, + "listed_deferred_schema_bytes": listed_deferred_schema_bytes, + "total_deferred_schema_bytes": total_deferred_schema_bytes, "tool_stub_report_bytes": tool_stub_report_bytes, - "all_schema_tokens_estimated": proxy_tokens(all_schema_bytes), - "tool_stub_report_tokens_estimated": proxy_tokens(tool_stub_report_bytes), + "all_schema_tokens_estimated": all_schema_tokens, + "inline_core_schema_tokens_estimated": inline_core_schema_tokens, + "listed_deferred_schema_tokens_estimated": listed_deferred_schema_tokens, + "total_deferred_schema_tokens_estimated": total_deferred_schema_tokens, + "tool_stub_report_tokens_estimated": tool_stub_report_tokens, + "gross_listed_deferred_schema_tokens_avoided": 
listed_deferred_schema_tokens, + "gross_total_deferred_schema_tokens_avoided": total_deferred_schema_tokens, + "net_initial_report_tokens_delta": tool_stub_report_tokens - all_schema_tokens, + "net_initial_report_tokens_delta_semantics": "tool_stub_report_tokens_estimated_minus_all_schema_tokens_estimated", + "estimated_initial_schema_tokens_avoided": max(0, all_schema_tokens - tool_stub_report_tokens), + "estimated_initial_schema_tokens_avoided_semantics": "max(0, all_schema_tokens_estimated - tool_stub_report_tokens_estimated)", "claim_boundary": "proxy_only_not_provider_billed_tokens", }, "provider_patterns": [ @@ -899,11 +919,13 @@ def defer_report(args: argparse.Namespace) -> str: "provider_tool_search_configured": False, "hosted_api_token_or_cost_savings_claim_allowed": False, "requires_provider_measured_matched_tasks_for_savings_claims": True, + "deferred_schema_retrieval_required_before_use": True, }, "redaction": {"redacted_values": total_redactions}, "caveats": [ "Deferred loading is an application strategy report, not a native provider integration.", "Token proxy values are char/4 estimates over sanitized local JSON, not billed provider tokens.", + "Deferred schema token fields are initial-prompt proxy accounting; full schemas must be retrieved before deferred tool use.", "Use receipt get commands to retrieve full sanitized schemas before using deferred tools.", ], } diff --git a/tests/test_context_guard_kit.py b/tests/test_context_guard_kit.py index 0a4141c..c5c8335 100644 --- a/tests/test_context_guard_kit.py +++ b/tests/test_context_guard_kit.py @@ -10321,7 +10321,19 @@ def test_cache_score_reports_static_prefix_and_claim_boundary(self): prompt = stable + "\nrequest_id: 123e4567-e89b-12d3-a456-426614174000\nuser: fix CI" for script in CACHE_SCORE_SCRIPTS: with self.subTest(script=script): - proc = self._run_cache_score(script, "--provider", "openai", "--json", input_data=prompt) + proc = self._run_cache_score( + script, + "--provider", + "openai", + 
"--expected-reuses",
+            "3",
+            "--cache-write-multiplier",
+            "1.25",
+            "--cache-read-multiplier",
+            "0.1",
+            "--json",
+            input_data=prompt,
+        )
         data = json.loads(proc.stdout)
         self.assertEqual(data["tool"], "context-guard-cache-score")
         self.assertEqual(data["schema_version"], "contextguard.cache-score.v1")
@@ -10334,10 +10346,44 @@ def test_cache_score_reports_static_prefix_and_claim_boundary(self):
         self.assertFalse(data["raw_prompt_stored"])
         self.assertFalse(data["claim_boundary"]["hosted_api_token_or_cost_savings_claim_allowed"])
         self.assertTrue(data["claim_boundary"]["requires_provider_usage_fields_for_claims"])
+        amortization = data["amortization"]
+        self.assertEqual(amortization["expected_reuses"], 3)
+        self.assertEqual(amortization["expected_reuses_semantics"], "future_cache_reads_after_initial_write")
+        self.assertEqual(amortization["cache_write_multiplier"], 1.25)
+        self.assertEqual(amortization["cache_read_multiplier"], 0.1)
+        self.assertEqual(amortization["break_even_reuses"], 1)
+        self.assertEqual(amortization["status"], "amortizes_with_expected_reuses")
+        self.assertEqual(amortization["risk"], "low")
+        self.assertAlmostEqual(amortization["expected_uncached_relative_cost"], 4.0)
+        self.assertAlmostEqual(amortization["expected_cached_relative_cost"], 1.55)
+        self.assertAlmostEqual(amortization["expected_relative_savings"], 2.45)
+        self.assertTrue(amortization["user_supplied_multipliers"])
+        self.assertFalse(amortization["claim_boundary"]["hosted_api_token_or_cost_savings_claim_allowed"])
         warning_codes = {item["code"] for item in data["warnings"]}
         self.assertIn("dynamic_marker_in_prompt", warning_codes)
         self.assertNotIn(stable[:80], proc.stdout)
 
+        premium_proc = self._run_cache_score(
+            script,
+            "--provider",
+            "openai",
+            "--expected-reuses",
+            "1",
+            "--cache-write-multiplier",
+            "0.5",
+            "--cache-read-multiplier",
+            "2",
+            "--json",
+            input_data=prompt,
+        )
+        premium = json.loads(premium_proc.stdout)["amortization"]
+        self.assertEqual(premium["status"], "no_read_discount")
+        self.assertEqual(premium["risk"], "high")
+        self.assertIsNone(premium["break_even_reuses"])
+        self.assertAlmostEqual(premium["expected_uncached_relative_cost"], 2.0)
+        self.assertAlmostEqual(premium["expected_cached_relative_cost"], 2.5)
+        self.assertLess(premium["expected_relative_savings"], 0)
+
     def test_cache_score_json_order_provider_thresholds_and_help(self):
         request = {
             "tools": [
@@ -10367,6 +10413,8 @@ def test_cache_score_json_order_provider_thresholds_and_help(self):
         self.assertIn("json_object_key_order_not_sorted", codes)
         self.assertIn("tool_order_not_sorted", codes)
         self.assertIn("anthropic_cache_control_not_detected", codes)
+        self.assertEqual(data["amortization"]["status"], "not_cacheable")
+        self.assertFalse(data["amortization"]["user_supplied_multipliers"])
         warning_paths = {item.get("path") for item in data["warnings"]}
         self.assertIn("$.[redacted-key]", warning_paths)
         self.assertNotIn("$.timestamp", warning_paths)
@@ -10399,6 +10447,12 @@ def test_cache_score_rejects_symlink_and_oversized_input(self):
         oversized = self._run_cache_score(script, "--max-input-bytes", "5", input_data="0123456789", check=False)
         self.assertNotEqual(oversized.returncode, 0)
         self.assertIn("max-input-bytes", oversized.stderr)
+        bad_reuses = self._run_cache_score(script, "--expected-reuses", "-1", input_data="stable", check=False)
+        self.assertNotEqual(bad_reuses.returncode, 0)
+        self.assertIn("expected-reuses", bad_reuses.stderr)
+        bad_multiplier = self._run_cache_score(script, "--cache-read-multiplier", "NaN", input_data="stable", check=False)
+        self.assertNotEqual(bad_multiplier.returncode, 0)
+        self.assertIn("cache-read-multiplier", bad_multiplier.stderr)
 
 
     def _run_tool_prune(self, script: Path, cwd: Path, *args: str, input_data: str | None = None, check: bool = True) -> subprocess.CompletedProcess[str]:
@@ -10514,6 +10568,8 @@ def test_tool_prune_defer_report_splits_core_deferred_and_preserves_receipt(self
         self.assertFalse(data["native_provider_integration"])
         self.assertFalse(data["claim_boundary"]["native_provider_integration"])
         self.assertFalse(data["claim_boundary"]["hosted_api_token_or_cost_savings_claim_allowed"])
+        self.assertTrue(data["claim_boundary"]["deferred_schema_retrieval_required_before_use"])
+        self.assertTrue(data["deferred_schema_retrieval_required_before_use"])
         self.assertEqual(len(data["core_tools"]), 1)
         self.assertEqual(len(data["deferred_tools"]), 2)
         self.assertFalse(data["core_tools"][0]["schema_included"])
@@ -10523,6 +10579,31 @@ def test_tool_prune_defer_report_splits_core_deferred_and_preserves_receipt(self
         self.assertEqual(data["token_proxy"]["chars_per_token"], 4)
         self.assertIn("tool_stub_report_bytes", data["token_proxy"])
         self.assertNotIn("inline_report_bytes", data["token_proxy"])
+        self.assertIn("inline_core_schema_bytes", data["token_proxy"])
+        self.assertIn("listed_deferred_schema_bytes", data["token_proxy"])
+        self.assertIn("total_deferred_schema_bytes", data["token_proxy"])
+        self.assertIn("gross_listed_deferred_schema_tokens_avoided", data["token_proxy"])
+        self.assertIn("gross_total_deferred_schema_tokens_avoided", data["token_proxy"])
+        self.assertIn("net_initial_report_tokens_delta", data["token_proxy"])
+        self.assertIn("estimated_initial_schema_tokens_avoided", data["token_proxy"])
+        self.assertEqual(
+            data["token_proxy"]["net_initial_report_tokens_delta"],
+            data["token_proxy"]["tool_stub_report_tokens_estimated"]
+            - data["token_proxy"]["all_schema_tokens_estimated"],
+        )
+        self.assertEqual(
+            data["token_proxy"]["estimated_initial_schema_tokens_avoided"],
+            max(
+                0,
+                data["token_proxy"]["all_schema_tokens_estimated"]
+                - data["token_proxy"]["tool_stub_report_tokens_estimated"],
+            ),
+        )
+        self.assertGreater(data["token_proxy"]["listed_deferred_schema_tokens_estimated"], 0)
+        self.assertGreaterEqual(
+            data["token_proxy"]["total_deferred_schema_tokens_estimated"],
+            data["token_proxy"]["listed_deferred_schema_tokens_estimated"],
+        )
         self.assertIn("proxy_only_not_provider_billed_tokens", data["token_proxy"]["claim_boundary"])
         self.assertEqual(data["listed_deferred_count"], 2)
         self.assertEqual(data["total_deferred_count"], 2)
@@ -23757,6 +23838,15 @@ def test_benchmark_report_does_not_claim_shifted_cost_when_cost_unmeasured(self)
         )
         self.assertEqual(report["claim_status"], "token_savings_observed_cost_unmeasured")
         self.assertIsNone(report["comparisons"][0]["cost_savings_pct_with_shift"])
+        baseline = report["measurement_baseline"]
+        self.assertEqual(baseline["schema_version"], "contextguard.bench.measurement-baseline.v1")
+        self.assertTrue(baseline["csv_schema_unchanged"])
+        self.assertIn("total_cost_with_shift_usd", baseline["csv_columns"])
+        self.assertIn("primary_token_buckets", baseline["captured_fields"])
+        self.assertIn("primary_tokens_measured", baseline["captured_fields"]["primary_token_buckets"])
+        self.assertIn("repo_revision", baseline["missing_future_run_identity_fields"])
+        self.assertFalse(baseline["claim_boundary"]["enables_savings_claims_by_itself"])
+        self.assertTrue(baseline["claim_boundary"]["requires_matched_successful_tasks"])
 
     def test_benchmark_report_treats_missing_external_cost_as_unmeasured(self):
         for index, script in enumerate(BENCH_SCRIPTS):
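Reviewer note: the amortization and token-proxy assertions in these hunks imply some simple arithmetic. The following is a minimal sketch of that arithmetic, inferred only from the asserted values; it is not the `context-guard-cache-score` or `context-guard-tool-prune` implementation. The function names `amortization_sketch` and `token_proxy_sketch` are hypothetical. The break-even rule used here (the smallest reuse count at which the cached cost no longer exceeds the uncached cost) is an assumption that happens to match `break_even_reuses == 1` and `None` in the two test cases. The token-proxy inputs in the usage lines are made-up numbers.

```python
import math
from typing import Optional


def amortization_sketch(expected_reuses: int, write_multiplier: float, read_multiplier: float) -> dict:
    """Relative-cost model inferred from the test assertions (an assumption, not the tool's code)."""
    # Uncached: the initial request plus every reuse is billed at the base rate (1.0 each).
    uncached = 1.0 + expected_reuses
    # Cached: one cache write, then every reuse is billed at the read multiplier.
    cached = write_multiplier + expected_reuses * read_multiplier

    # Assumed break-even rule: smallest n with write + n * read <= 1 + n.
    # If reads are not cheaper than uncached input, no such n is reported.
    break_even: Optional[int]
    if read_multiplier >= 1.0:
        break_even = None
    else:
        break_even = max(0, math.ceil((write_multiplier - 1.0) / (1.0 - read_multiplier)))

    return {
        "expected_uncached_relative_cost": uncached,
        "expected_cached_relative_cost": cached,
        "expected_relative_savings": uncached - cached,
        "break_even_reuses": break_even,
    }


def token_proxy_sketch(all_schema_tokens: int, stub_report_tokens: int) -> dict:
    """The two relations the defer-report test asserts between proxy fields."""
    return {
        # Negative when the stub report is smaller than sending every schema inline.
        "net_initial_report_tokens_delta": stub_report_tokens - all_schema_tokens,
        # Clamped at zero so a larger stub report never shows up as "avoided" tokens.
        "estimated_initial_schema_tokens_avoided": max(0, all_schema_tokens - stub_report_tokens),
    }


if __name__ == "__main__":
    print(amortization_sketch(3, 1.25, 0.1))   # the "anthropic" case in the test
    print(amortization_sketch(1, 0.5, 2.0))    # the "openai" premium-read case
    print(token_proxy_sketch(all_schema_tokens=900, stub_report_tokens=250))  # hypothetical inputs
```

Up to floating-point rounding, the first call reproduces the asserted 4.0, 1.55, and 2.45, and the second reproduces 2.0, 2.5, and a negative savings value. All figures are relative char/4-style proxies under user-supplied multipliers, so they stay on the same side of the claim boundary as the tests: none of them is a provider-measured saving.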