2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,8 @@ All notable changes for the ContextGuard plugin are documented here.

## [Unreleased]

- Extended Batch 1 token-savings advisory reports with cache-score amortization risk fields, tool-prune deferred-schema proxy accounting, and a benchmark measurement-baseline contract while preserving local-only/no-savings-claim boundaries.

## [0.4.10] - 2026-06-14

- Added `context-guard-artifact search`, a local sanitized artifact sandbox search that returns capped literal matches with exact `get --lines` rehydration commands and no hosted savings claims.
7 changes: 6 additions & 1 deletion README.ko.md
@@ -102,6 +102,7 @@ brief mode asks the coding agent to cut filler, but
- transcript usage hotspots reported by `context-guard-audit`, `cache_friendliness` prompt-layout signals, and `cache_layout_advice` experiment priorities
- statusline `cache` / `reuse` values: these are observed transcript/provider-cache signals, not savings produced by ContextGuard itself.
- Use `context-guard cost preflight` to see the estimated cost of an Anthropic request JSON, then after the call use `context-guard cost observe` to cross-check the provider usage fields (`cache_creation_input_tokens`, `cache_read_input_tokens`).
- Use `context-guard-cache-score` for guidance on static cache layout and on amortization risk based on cache write/read multipliers that you supply yourself. char/4 token values are estimated proxies, not provider-measured savings.
- results from `context-guard-bench` comparing matched pairs of successful baseline/variant runs
- large tool/MCP catalogs versus `context-guard-tool-prune` top-k reports plus receipt retrieval
- As with the optional experimental lanes in [`research/experimental-token-reduction-radar.md`](research/experimental-token-reduction-radar.md), the fixture-only starters in [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md) must first pass the same matched-task benchmark gates before any savings claim.
@@ -282,10 +283,14 @@ long-command 2>&1 | ./plugins/context-guard/bin/context-guard-artifact store --c
--catalog tools.json \
--query "review failing tests" \
--top 5 --budget-bytes 12000 --json
./plugins/context-guard/bin/context-guard-tool-prune defer-report \
--catalog tools.json \
--query "review failing tests" \
--core-top 3 --deferred-top 20 --json
./plugins/context-guard/bin/context-guard-tool-prune get <receipt_id> --tool read_file --json
```

`context-guard-tool-prune` ranks a local tool or MCP catalog with deterministic lexical heuristics and produces a bounded top-k advisory report. Inline schemas respect an observed UTF-8 byte budget, and schemas that are omitted or skipped for budget reasons can be retrieved again from a compact receipt plus a separate sanitized payload under `.context-guard/tool-prune`. This feature is advisory and does not change MCP configuration. Token values are estimated proxies, not provider-measured savings.
`context-guard-tool-prune` ranks a local tool or MCP catalog with deterministic lexical heuristics and produces a bounded top-k advisory report. Inline schemas respect an observed UTF-8 byte budget, and schemas that are omitted or skipped for budget reasons can be retrieved again from a compact receipt plus a separate sanitized payload under `.context-guard/tool-prune`. `defer-report` separates core inline tools from deferred tool stub/namespace summaries and also shows gross/net char/4 proxy accounting for the schemas left out of the first prompt. This feature is advisory and does not change MCP configuration or native provider tool search. Token values are estimated proxies, not provider-measured savings.

### Advise on total cost, batchability, and routing candidates

6 changes: 3 additions & 3 deletions README.md
@@ -104,7 +104,7 @@ When you need a savings claim, measure it on your own tasks:
- transcript hotspots reported by `context-guard-audit`, including `cache_friendliness` prompt-layout signals and `cache_layout_advice` experiment priorities
- statusline `cache` / `reuse` as observed transcript/provider-cache signals, not savings caused by ContextGuard
- `context-guard cost preflight` estimates for Anthropic request JSON, followed by `context-guard cost observe` using provider usage fields (`cache_creation_input_tokens`, `cache_read_input_tokens`) after the call
- static prompt/request cache layout checks from `context-guard-cache-score`; its char/4 token estimates and warnings are advisory only until provider usage fields confirm real cache hits
- static prompt/request cache layout checks from `context-guard-cache-score`, including optional user-supplied cache write/read multiplier amortization risk; its char/4 token estimates and warnings are advisory only until provider usage fields confirm real cache hits
- matched successful baseline/variant runs from `context-guard-bench`
- large tool/MCP catalogs versus `context-guard-tool-prune` top-k reports plus receipt retrieval
- optional experimental lanes in [`research/experimental-token-reduction-radar.md`](research/experimental-token-reduction-radar.md); fixture-only starters in [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md) use the same matched-task benchmark gates before any savings claim
@@ -303,7 +303,7 @@ The packer uses deterministic standard-library heuristics only: no network, mode
./plugins/context-guard/bin/context-guard-tool-prune get <receipt_id> --tool read_file --json
```

`context-guard-tool-prune` ranks a local tool or MCP catalog with deterministic lexical heuristics and emits a bounded top-k advisory report. Inline selected schemas respect an observed UTF-8 byte budget, and omitted or budget-skipped schemas remain recoverable from a compact local receipt plus a separate sanitized payload under `.context-guard/tool-prune`. `defer-report` uses the same receipt path to split a catalog into core inline tools plus deferred tool stubs and namespace summaries. This is advisory only: it does not mutate MCP configuration, does not configure native provider tool search, and token counts remain estimated proxies rather than measured provider savings.
`context-guard-tool-prune` ranks a local tool or MCP catalog with deterministic lexical heuristics and emits a bounded top-k advisory report. Inline selected schemas respect an observed UTF-8 byte budget, and omitted or budget-skipped schemas remain recoverable from a compact local receipt plus a separate sanitized payload under `.context-guard/tool-prune`. `defer-report` uses the same receipt path to split a catalog into core inline tools plus deferred tool stubs and namespace summaries, and reports gross deferred-schema plus net initial-report char/4 proxy accounting so you can see what moved out of the first prompt. This is advisory only: it does not mutate MCP configuration, does not configure native provider tool search, and token counts remain estimated proxies rather than measured provider savings.
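
The gross/net accounting that `defer-report` describes can be sketched roughly as follows. This is a minimal illustration, not the tool's implementation: the function names, the output keys, the stub shape, the JSON serialization, and the floor rounding in `chars_div_4` are all assumptions made for the example.

```python
import json


def chars_div_4(text: str) -> int:
    """Tokenizer-free proxy: one 'token' per four characters (floor rounding is assumed)."""
    return len(text) // 4


def defer_accounting(deferred_schemas: list[dict], stubs: list[dict]) -> dict[str, int]:
    """Hypothetical gross/net proxy accounting for schemas moved out of the first prompt."""
    # Gross: proxy size of every full schema that was deferred.
    gross = sum(chars_div_4(json.dumps(s, sort_keys=True)) for s in deferred_schemas)
    # Stubs still ship in the initial report, so their proxy size is subtracted.
    stub_cost = sum(chars_div_4(json.dumps(s, sort_keys=True)) for s in stubs)
    return {
        "gross_deferred_proxy_tokens": gross,
        "stub_proxy_tokens": stub_cost,
        "net_initial_proxy_tokens": gross - stub_cost,
    }


# Illustrative catalog entry and stub; both are made up for this sketch.
schemas = [
    {
        "name": "read_file",
        "description": "Read a file from disk",
        "parameters": {"path": {"type": "string"}},
    }
]
stubs = [{"name": "read_file"}]
report = defer_accounting(schemas, stubs)
```

The net figure is still a character-count proxy. It says how much text left the first prompt, not how many provider tokens or dollars were saved.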

### Score static prompt cacheability

@@ -312,7 +312,7 @@ The packer uses deterministic standard-library heuristics only: no network, mode
./plugins/context-guard/bin/context-guard cache-score --input prompt.txt --provider anthropic --json
```

`context-guard-cache-score` is a local static lint for prompt/request layout. It estimates total and cacheable-prefix size with a tokenizer-free char/4 proxy, warns about dynamic-looking values near the prefix, and records provider caveats for OpenAI, Anthropic, Gemini, or a generic threshold. It does not call providers, store raw prompts, estimate prices, observe cache hits, or prove token/cost savings; verify real cache behavior with provider usage telemetry.
`context-guard-cache-score` is a local static lint for prompt/request layout. It estimates total and cacheable-prefix size with a tokenizer-free char/4 proxy, warns about dynamic-looking values near the prefix, and records provider caveats for OpenAI, Anthropic, Gemini, or a generic threshold. Optional `--expected-reuses`, `--cache-write-multiplier`, and `--cache-read-multiplier` inputs add an advisory amortization-risk section using user-supplied economics only. It does not call providers, store raw prompts, estimate prices from bundled defaults, observe cache hits, or prove token/cost savings; verify real cache behavior with provider usage telemetry.
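
One way to reason about amortization risk from user-supplied multipliers is sketched below. The parameter names mirror the `--expected-reuses`, `--cache-write-multiplier`, and `--cache-read-multiplier` flags, but the function, its output keys, and the cost model (one cache write followed by `expected_reuses` cache reads, compared against paying the full prefix every time) are assumptions for illustration, not the tool's actual formula.

```python
def amortization_risk(
    prefix_units: float,
    expected_reuses: int,
    cache_write_multiplier: float,
    cache_read_multiplier: float,
) -> dict:
    """Compare cached vs. uncached relative cost of a prefix under user-supplied multipliers."""
    # Uncached: the prefix is billed at 1x on the first call and on every reuse.
    uncached = prefix_units * (1 + expected_reuses)
    # Cached: one write at the write multiplier, then each reuse at the read multiplier.
    cached = prefix_units * (cache_write_multiplier + expected_reuses * cache_read_multiplier)
    return {
        "uncached_units": uncached,
        "cached_units": cached,
        # The write premium is not amortized if caching costs at least as much as not caching.
        "at_risk": cached >= uncached,
    }


# Illustrative multipliers only; supply the values that match your own provider pricing.
no_reuse = amortization_risk(1000, 0, 1.25, 0.1)
some_reuse = amortization_risk(1000, 3, 1.25, 0.1)
```

Under this model a prefix that is written but never reused is flagged as at risk, because the write premium is paid without any discounted read to offset it. The units are relative char/4 proxy units, not measured tokens or prices.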

### Advise on total cost, batchability, and routing

4 changes: 2 additions & 2 deletions context-guard-kit/README.md
@@ -57,9 +57,9 @@ python3 context-guard-kit/sanitize_output.py -- git diff

`context_filter.py` is an opt-in declarative output filter helper. Keep the filter JSON outside the package code (for example `.context-guard/filter-dsl.json`), check it with `validate`, then apply it with `run --config ... -- <command>`. Invalid config, no-match, filter error, empty output, and protected `git`/test/lint/`gh` failures pass through the original command's stdout/stderr and exit code. Filtered mode applies the filter to the combined stdout+stderr lines and writes the result to stdout; passthrough mode preserves the stdout/stderr streams as-is. `--json-report` writes diagnostic JSON only to stderr so that stdout stays reserved for command/filter output, but on a protected nonzero passthrough it omits the report to preserve the original stderr. Treat token/cost reduction figures as local presentation changes only, not as measured claims.

`cache_score.py` is a cacheability lint that statically inspects a prompt/request file or stdin without calling a provider. Against OpenAI/Anthropic/Gemini/generic thresholds, it prints the stable prefix, the first dynamic marker, JSON/tool ordering hints, the char/4 token proxy, provider caveats, and the claim boundary. It does not store raw prompts; pricing, ledger, and cache-hit observation belong to `cost_guard.py` and the provider usage fields.
`cache_score.py` is a cacheability lint that statically inspects a prompt/request file or stdin without calling a provider. Against OpenAI/Anthropic/Gemini/generic thresholds, it prints the stable prefix, the first dynamic marker, JSON/tool ordering hints, the char/4 token proxy, provider caveats, and the claim boundary. It optionally accepts `--expected-reuses`, `--cache-write-multiplier`, and `--cache-read-multiplier` and reports amortization risk using only the economic assumptions the user supplies. It does not store raw prompts; bundled price estimation, ledger, and cache-hit observation belong to `cost_guard.py` and the provider usage fields.

`tool_schema_pruner.py` is a provider-neutral tool/MCP catalog helper. `select` picks the top-k tools by lexical overlap with the task query, inlines schemas only within `--budget-bytes`, and writes a compact receipt plus a separate sanitized payload to `.context-guard/tool-prune`. `defer-report` uses the same receipt path to separate core inline tools from deferred tool stubs/namespace summaries. `get` verifies the payload size/SHA-256 and then returns the full sanitized schema. This helper does not change MCP configuration or native provider tool search, and it expresses token reduction only as an estimated proxy, not a measured value.
`tool_schema_pruner.py` is a provider-neutral tool/MCP catalog helper. `select` picks the top-k tools by lexical overlap with the task query, inlines schemas only within `--budget-bytes`, and writes a compact receipt plus a separate sanitized payload to `.context-guard/tool-prune`. `defer-report` uses the same receipt path to separate core inline tools from deferred tool stubs/namespace summaries, and shows gross deferred-schema and net initial-report `chars_div_4` proxy accounting. `get` verifies the payload size/SHA-256 and then returns the full sanitized schema. This helper does not change MCP configuration or native provider tool search, and it expresses token reduction only as an estimated proxy, not a measured value.

`context_compress.py --protected-policy` does not change the default compression behavior; it adds protected-zone class/count policy metadata for things such as code fences, diffs, identifiers, numeric constants, hashes, paths, stack frames, quoted strings, and JSON keys. The protected-zone policy forbids semantic/paraphrase rewrites and allows only structural dedupe/window/truncate and artifact retrieval. Raw spans are not stored in the receipt, and lossy structural transforms leave a hint that exact retrieval is required. `context_compress.py --mode readable` attempts a deterministic sentence-window preview only on sanitized prose, and blocks it in conservative mode when prompt-like/high-risk protected signals are present. It includes no learned compressor, model, embedding, reranker, or hosted savings claim.

73 changes: 73 additions & 0 deletions context-guard-kit/benchmark_runner.py
@@ -184,6 +184,7 @@
TOKEN_PROXY_BYTES_PER_TOKEN = 4
BENCH_RUN_EVIDENCE_SCHEMA_VERSION = "contextguard.bench.run-evidence.v1"
MATCHED_PAIR_EVIDENCE_SCHEMA_VERSION = "contextguard.bench.matched-pair.v1"
MEASUREMENT_BASELINE_SCHEMA_VERSION = "contextguard.bench.measurement-baseline.v1"
SELF_HOSTED_METRICS_SCHEMA_VERSION = "contextguard.bench.self-hosted-metrics.v1"
SELF_HOSTED_METRICS_KEY = "self_hosted_metrics"
SELF_HOSTED_METRICS_CLAIM_BOUNDARY = "self_hosted_metrics_only_not_hosted_api_token_or_cost_savings"
@@ -1546,6 +1547,77 @@ def row_cost_shift_measured(row: dict[str, str]) -> bool:
)


def measurement_baseline_contract() -> dict[str, Any]:
    """Describe the benchmark report's current measurement baseline contract.

    This block is descriptive. It does not change the CSV schema and does not
    grant token/cost savings claims by itself; those remain gated by matched
    successful tasks, measured primary tokens/costs, shifted-cost accounting,
    and quality gates.
    """
    return {
        "schema_version": MEASUREMENT_BASELINE_SCHEMA_VERSION,
        "csv_schema_unchanged": True,
        "csv_columns": list(CSV_COLUMNS),
        "captured_fields": {
            "task_identity": ["task_id", "variant"],
            "run_configuration": ["model", "effort", "claude_version"],
            "primary_token_buckets": [
                "input_tokens",
                "output_tokens",
                "cache_read",
                "cache_creation",
                "total_tokens",
                "primary_tokens_measured",
            ],
            "primary_cost": ["cost_usd", "cost_measured"],
            "provider_cache_telemetry": ["provider_cached_tokens", "provider_cached_tokens_measured"],
            "latency": ["wall_time_seconds"],
            "quality_and_result": ["success", "corrections", "notes"],
            "tooling_and_proxy_metrics": ["turns", "hook_triggers", "bytes_before", "bytes_after", "artifacts_used"],
            "shifted_cost_accounting": [
                "external_tokens",
                "external_tokens_measured",
                "external_cost_usd",
                "external_cost_measured",
                "total_cost_with_shift_usd",
            ],
        },
        "claim_eligible_fields": {
            "token_savings": [
                "matched successful baseline and variant tasks",
                "primary_tokens_measured=true on both sides",
                "quality_gate=pass",
            ],
            "shifted_cost_savings": [
                "matched successful baseline and variant tasks",
                "cost_measured=true on both sides",
                "external_cost_measured=true when external_tokens are present",
                "quality_gate=pass",
            ],
        },
        "proxy_only_fields": {
            "byte_metrics": ["bytes_before", "bytes_after"],
            "token_proxy": "chars_div_4_proxy_only",
            "provider_cache": "diagnostic_telemetry_not_contextguard_token_reduction",
        },
        "missing_future_run_identity_fields": [
            "repo_revision",
            "agent_harness",
            "feature_flags",
            "provider_name",
            "success_command_identity",
        ],
        "claim_boundary": {
            "descriptive_contract_only": True,
            "enables_savings_claims_by_itself": False,
            "requires_matched_successful_tasks": True,
            "requires_shifted_cost_accounting_for_cost_claims": True,
            "raw_proxy_estimates_are_not_hosted_api_token_savings": True,
        },
    }


def summarize_benchmark_rows(rows: list[dict[str, str]], baseline_variant: str) -> dict[str, Any]:
    by_variant: dict[str, dict[str, Any]] = {}
    successful_rows_by_variant_task: dict[str, dict[str, list[dict[str, str]]]] = {}
@@ -2191,6 +2263,7 @@ def matched_pair_evidence_entry(
"schema": "context-guard-bench-report-v1",
"baseline_variant": baseline_variant,
"row_count": len(rows),
"measurement_baseline": measurement_baseline_contract(),
"summary_by_variant": by_variant,
"comparisons": comparisons,
"matched_pair_evidence": matched_pair_evidence,
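
As a rough illustration of how a report consumer might apply the `token_savings` conditions listed in `claim_eligible_fields`, the sketch below checks one baseline/variant row pair. The helper `token_claim_eligible` is hypothetical and not part of `benchmark_runner.py`. It assumes CSV booleans are encoded as the string `true` and that the quality-gate result arrives as a separate string, neither of which this diff confirms.

```python
def token_claim_eligible(
    baseline: dict[str, str],
    variant: dict[str, str],
    quality_gate: str,
) -> bool:
    """Hypothetical check mirroring the contract's token_savings conditions."""

    def is_true(value: str) -> bool:
        # Assumption: CSV booleans are serialized as "true"/"false".
        return value.strip().lower() == "true"

    # Both rows must describe the same task to count as a matched pair.
    same_task = bool(baseline.get("task_id")) and baseline.get("task_id") == variant.get("task_id")
    # Both sides must have succeeded with measured primary tokens.
    both_measured = all(
        is_true(row.get("success", "")) and is_true(row.get("primary_tokens_measured", ""))
        for row in (baseline, variant)
    )
    return same_task and both_measured and quality_gate == "pass"


# Made-up rows for illustration only.
baseline_row = {"task_id": "t1", "variant": "baseline", "success": "true", "primary_tokens_measured": "true"}
variant_row = {"task_id": "t1", "variant": "guarded", "success": "true", "primary_tokens_measured": "false"}
eligible = token_claim_eligible(baseline_row, variant_row, "pass")
```

Here `eligible` is `False` because the variant row lacks measured primary tokens, which matches the contract's stance that proxy-only rows do not support a savings claim.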