diff --git a/CHANGELOG.md b/CHANGELOG.md
index 72b866d..b980540 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,7 @@ All notable changes for the ContextGuard plugin are documented here.
 
 ## [Unreleased]
 
+- Added `context-guard-bench --evidence-jsonl` replay and `--dashboard-md` rendering so synthetic or local benchmark evidence can regenerate the CSV, report, and dashboard artifacts. Replayed evidence stays ineligible for public claims unless provider-export provenance is complete and the normal report claim gates pass.
 - Extended Batch 1 token-savings advisory reports with cache-score amortization risk fields, tool-prune deferred-schema proxy accounting, and a benchmark measurement-baseline contract while preserving local-only/no-savings-claim boundaries.
 - Clarified cache-score amortization output for cache-read multipliers above uncached cost by reporting a bounded `max_profitable_reuses` instead of a monotonic break-even reuse count.
 
diff --git a/README.ko.md b/README.ko.md
index 542bc8b..b46ef36 100644
--- a/README.ko.md
+++ b/README.ko.md
@@ -375,7 +375,7 @@ JSON 출력에는 여러 증거 surface가 포함될 수 있습니다.
 - 비용 필드가 0이거나 없으면 토큰 절감만 표시하고 실제 비용 절감은 주장하지 않습니다.
 - CSV 스키마는 엄격하게 검사합니다. 벤치마크 헬퍼를 업그레이드한 뒤에는 새 `--csv` 파일을 시작하거나 mismatch 오류가 알려주는 헤더로 마이그레이션하세요.
 
-최소 보고서 형태 예시는 [`docs/benchmark-report.example.json`](docs/benchmark-report.example.json)을, 작업 유형별 합성 예시와 안전한 해석 경계는 [`docs/benchmark-workflow-examples.md`](docs/benchmark-workflow-examples.md)을, fixture-only 실험 시작 예시는 [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md)을 참고하세요.
+최소 보고서 형태 예시는 [`docs/benchmark-report.example.json`](docs/benchmark-report.example.json)을, 작업 유형별 합성 예시와 안전한 해석 경계는 [`docs/benchmark-workflow-examples.md`](docs/benchmark-workflow-examples.md)을, fixture-only 실험 시작 예시는 [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md)을 참고하세요. live provider 실행 전 deterministic local replay가 필요하면 `--evidence-jsonl docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl --dashboard-md ... --baseline-variant baseline_full_context_fixture`를 사용하세요. Replay mode는 provider와 `success_command`를 실행하지 않고 CSV/report/dashboard를 만들지만 synthetic/manual evidence는 public hosted-savings claim 불가로 표시합니다.
 
 ### 실험 기능 opt-in 관리
 
diff --git a/README.md b/README.md
index dd467e1..89c70f6 100644
--- a/README.md
+++ b/README.md
@@ -406,9 +406,12 @@ These fields can flag likely volatile content near the prompt prefix, stable-pre
 ```bash
 ./plugins/context-guard/bin/context-guard-bench \
   --tasks bench/tasks.json --variants bench/variants.json --csv bench/results.csv \
-  --ledger-jsonl bench/cost-shift.jsonl --report-json bench/report.json
+  --ledger-jsonl bench/cost-shift.jsonl --report-json bench/report.json \
+  --dashboard-md bench/dashboard.md
 ```
 
+For deterministic local replay before a live provider run, add `--evidence-jsonl docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl` and, for the 12-task fixture, `--baseline-variant baseline_full_context_fixture`. Replay mode skips provider and `success_command` execution, writes the same CSV/report/dashboard surfaces, and marks synthetic/manual evidence as non-public-claim-eligible.
+
 Read the report through its claim boundaries before writing any savings statement:
 
 - Successful baseline/variant runs are compared by real tokens and `cost_usd + external_cost_usd`; byte reductions stay proxy evidence.
@@ -419,7 +422,7 @@ Read the report through its claim boundaries before writing any savings statemen
 - If cost fields are zero or unavailable, the report can still mark token savings but will not claim shifted-cost savings.
 - CSV schemas are strict; after upgrading the benchmark helper, start a new `--csv` file or migrate the header named in the mismatch error.
 
-See [`docs/benchmark-report.example.json`](docs/benchmark-report.example.json) for a minimal report-shape example, [`docs/benchmark-workflow-examples.md`](docs/benchmark-workflow-examples.md) for workflow-specific synthetic examples, and [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md) for fixture-only experimental task/variant starters.
+See [`docs/benchmark-report.example.json`](docs/benchmark-report.example.json) for a minimal report-shape example, [`docs/benchmark-workflow-examples.md`](docs/benchmark-workflow-examples.md) for workflow-specific synthetic examples, and [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md) for fixture-only experimental task/variant starters plus synthetic evidence replay.
 
 ### Manage experimental opt-ins
 
diff --git a/context-guard-kit/README.md b/context-guard-kit/README.md
index bdb71c9..79675f0 100644
--- a/context-guard-kit/README.md
+++ b/context-guard-kit/README.md
@@ -79,7 +79,7 @@ python3 context-guard-kit/sanitize_output.py -- git diff
 
 `experimental_registry.py plan local-proxy`는 localhost-only dry-run 안내 plan입니다. `experimental_registry.py plan local-proxy-external-forwarding`은 future external forwarding을 위한 design-only dry-run gate이며 explicit intent, HTTPS allowlist, threat model note, credential redaction policy, provider-evidence boundary를 요구하고 DNS lookup, external service call, traffic forwarding은 하지 않습니다. `experimental_registry.py record local-proxy-runtime-gate --ledger-jsonl ...`은 listener 시작, traffic forwarding, DNS lookup 없이 local gate row 하나만 기록하는 명시적 runtime입니다. `experimental_registry.py serve local-proxy`는 명시적 one-shot loopback forwarding MVP이며 `--runtime-gate-ack --forwarding-gate-ack --once`, private `--ready-file` nonce handoff, literal loopback bind/target IP, hostname DNS target 금지, nonzero port, byte/time limit, credential-free request가 필요합니다. API key를 저장하지 않고, external forwarding이나 CONNECT/TLS proxying을 지원하지 않으며, hosted savings claim도 만들지 않습니다. 선택적 `--diagnostic-ledger-jsonl`은 successful forwarded request 뒤에 raw header/body나 hosted-savings evidence 없이 shifted-cost diagnostic row 하나만 추가합니다. External proxy forwarding runtime은 shipped가 아니며, 나머지 roadmap lane은 별도 runtime gate가 생기기 전까지 안내 상태로 남습니다.
 
-`benchmark_runner.py`는 `research/benchmark-plan.md`의 고정 task/variant 실험을 실행합니다. `variant_prompt_files`는 선택된 task/variant를 필터링한 뒤 필요한 file-backed prompt만 읽으므로 선택하지 않은 fixture의 누락 파일이 선택된 실행을 깨지 않습니다. `--ledger-jsonl`은 subagent·artifact 등 외부 실행 표면으로 옮겨간 token/cost와 run별 측정 가능 여부를 남기고, 선택적 `self_hosted_metrics` provider payload는 run별 sidecar로만 기록합니다. `--report-json`은 baseline 대비 실제 token/cost 절감과 proxy byte 감소를 분리한 A/B report를 생성하며, `self_hosted_metrics`는 CSV/report 요약에 접지 않습니다. Report의 `matched_pair_evidence`는 성공한 baseline/variant task bucket을 transform, quality gate, 측정 가능 여부, claim boundary와 연결하므로 절감 주장을 쓰기 전에 이 항목을 확인하세요.
+`benchmark_runner.py`는 `research/benchmark-plan.md`의 고정 task/variant 실험을 실행합니다. `variant_prompt_files`는 선택된 task/variant를 필터링한 뒤 필요한 file-backed prompt만 읽으므로 선택하지 않은 fixture의 누락 파일이 선택된 실행을 깨지 않습니다. `--ledger-jsonl`은 subagent·artifact 등 외부 실행 표면으로 옮겨간 token/cost와 run별 측정 가능 여부를 남기고, 선택적 `self_hosted_metrics` provider payload는 run별 sidecar로만 기록합니다. `--report-json`은 baseline 대비 실제 token/cost 절감과 proxy byte 감소를 분리한 A/B report를 생성하며, `--dashboard-md`는 같은 report에서 Markdown dashboard를 렌더링합니다. `--evidence-jsonl` replay는 provider와 `success_command`를 실행하지 않는 deterministic import mode이고, synthetic/manual evidence는 public hosted-savings claim 불가로 강제됩니다. `self_hosted_metrics`는 CSV/report 요약에 접지 않습니다. Report의 `matched_pair_evidence`는 성공한 baseline/variant task bucket을 transform, quality gate, 측정 가능 여부, claim boundary와 연결하므로 절감 주장을 쓰기 전에 이 항목을 확인하세요.
 
 `../research/experimental-token-reduction-radar.md`는 learned compression, generated crop/OCR/visual-token pruning, self-hosted KV/latent inference optimization 같은 선택적 미래 실험을 문서화한 gate입니다. `../docs/experimental-benchmark-fixtures.md`에는 fixture-only task/variant 시작 예시가 있습니다. 이 radar와 fixture는 hosted API token/cost 절감을 보장하지 않습니다. 현재 제공되는 helper surface는 명시적 local context-diff emit, visual evidence-pack emit, learned candidate emit, self-hosted metrics record, local proxy gate record, one-shot literal-loopback local proxy serve, design-only external-forwarding plan 같은 좁은 local surface뿐이며, hosted API token/cost 절감 주장은 provider가 측정한 matched-task 근거가 있을 때만 허용합니다. Radar의 later-roadmap gate는 neural/semantic compression, trust-tiered injection-aware compression, generated visual-token reduction, broader external/daemon/hostname-DNS/credential-bearing local proxy forwarding constraints를 별도 미래 PR이 gate를 통과하기 전까지 experimental/non-shipped로 유지합니다.
 
diff --git a/context-guard-kit/benchmark_runner.py b/context-guard-kit/benchmark_runner.py
index e338b88..a1af3ab 100755
--- a/context-guard-kit/benchmark_runner.py
+++ b/context-guard-kit/benchmark_runner.py
@@ -178,6 +178,8 @@
 )
 MAX_USAGE_TOKEN_COUNT = 10**12
 MAX_USAGE_COST_USD = 10**9
+MAX_EVIDENCE_JSONL_BYTES = 5_000_000
+MAX_EVIDENCE_JSONL_LINES = 100_000
 # Byte -> token proxy 환산 계수. 측정된 모델 토큰이 아니라 byte delta 기반 보수적
 # 추정치이며, report에서 evidence="inferred"로 분명히 라벨링한다. 영어 텍스트 기준
 # ~4 bytes/token의 통용 근사값을 사용한다.
@@ -188,6 +190,25 @@
 SELF_HOSTED_METRICS_SCHEMA_VERSION = "contextguard.bench.self-hosted-metrics.v1"
 SELF_HOSTED_METRICS_KEY = "self_hosted_metrics"
 SELF_HOSTED_METRICS_CLAIM_BOUNDARY = "self_hosted_metrics_only_not_hosted_api_token_or_cost_savings"
+EVIDENCE_REPLAY_SOURCE_TYPES = frozenset({"synthetic_fixture", "provider_export", "manual_audit"})
+PROVIDER_EXPORT_PUBLIC_CLAIM_SCOPES = frozenset({
+    "provider_measured_matched_task",
+    "provider_measured_matched_task_public_claim",
+    "hosted_api_provider_measured_matched_task",
+})
+REPLAY_PUBLIC_CLAIM_CANDIDATE_STATUS = "provider_export_public_claim_candidate"
+REPLAY_PROVIDER_CLAIM_GATES_NOT_MET_STATUS = "provider_export_claim_gates_not_met"
+REPLAY_NOT_PUBLIC_CLAIM_STATUS = "replay_only_not_public_claim"
+REPLAY_UNKNOWN_MIXED_CSV_STATUS = "unknown_mixed_csv"
+REPLAY_PUBLIC_CLAIM_ELIGIBLE_RAW_STATUSES = frozenset({
+    "token_and_shifted_cost_savings_observed",
+})
+REPLAY_CLAIM_BOUNDARY = (
+    "Evidence replay is an import/replay mode. Synthetic fixtures and manual audits are never "
+    "hosted API token/cost savings evidence; public claims require complete provider_export "
+    "provenance for every report row plus the normal matched-task quality, token, cost, and "
+    "shifted-cost gates."
+)
 MAX_SELF_HOSTED_LABEL_CHARS = 120
 MAX_SELF_HOSTED_LATENCY_MS = 7 * 24 * 60 * 60 * 1000
 MAX_SELF_HOSTED_MEMORY_MB = 10_000_000
@@ -401,6 +422,36 @@ class RunResult:
     self_hosted_metrics: dict[str, Any] | None = None
 
 
+@dataclass
+class EvidenceReplayRow:
+    result: RunResult
+    source_type: str
+    provider_name: str | None
+    capture_command_or_export_id: str | None
+    claim_scope: str
+    provider_export_provenance_complete: bool
+    public_claim_eligible: bool
+    line_number: int
+
+    @property
+    def key(self) -> tuple[str, str]:
+        return (self.result.task_id, self.result.variant)
+
+    def provenance_payload(self) -> dict[str, Any]:
+        return {
+            "schema_version": BENCH_RUN_EVIDENCE_SCHEMA_VERSION,
+            "mode": "evidence_jsonl_replay",
+            "evidence_source_type": self.source_type,
+            "provider_name": self.provider_name,
+            "capture_command_or_export_id": self.capture_command_or_export_id,
+            "claim_scope": self.claim_scope,
+            "provider_export_provenance_complete": self.provider_export_provenance_complete,
+            "public_claim_eligible": self.public_claim_eligible,
+            "line_number": self.line_number,
+            "claim_boundary": REPLAY_CLAIM_BOUNDARY,
+        }
+
+
 @dataclass
 class BoundedProcessResult:
     returncode: int
@@ -1362,7 +1413,13 @@ def write_text_no_follow(path: Path, text: str) -> None:
             os.close(fd)
 
 
-def append_cost_shift_ledger(path: Path, claude_ver: str, result: RunResult) -> None:
+def append_cost_shift_ledger(
+    path: Path,
+    claude_ver: str,
+    result: RunResult,
+    *,
+    replay_provenance: dict[str, Any] | None = None,
+) -> None:
     shifted_cost_known = cost_shift_measured(result)
     byte_metrics_observed = bool(result.bytes_before or result.bytes_after)
     payload = {
@@ -1413,6 +1470,10 @@ def append_cost_shift_ledger(path: Path, claude_ver: str, result: RunResult) ->
     }
     if result.self_hosted_metrics is not None:
         payload["self_hosted_metrics"] = result.self_hosted_metrics
+    if replay_provenance is not None:
+        payload["replay_provenance"] = replay_provenance
+        payload["evidence_source_type"] = replay_provenance.get("evidence_source_type")
+        payload["public_claim_eligible"] = bool(replay_provenance.get("public_claim_eligible"))
     with csv_file_lock(path, create_parent=True):
         fd = _open_regular_no_symlink(path, os.O_CREAT | os.O_APPEND | os.O_WRONLY, 0o600, create_parent=True)
         try:
@@ -1488,6 +1549,354 @@ def read_csv_rows(csv_path: Path) -> list[dict[str, str]]:
             os.close(fd)
 
 
+def file_has_content_no_follow(path: Path) -> bool:
+    try:
+        fd = _open_regular_no_symlink(path)
+    except FileNotFoundError:
+        return False
+    try:
+        return os.fstat(fd).st_size > 0
+    finally:
+        os.close(fd)
+
+
+def require_evidence_object(raw: Any, *, owner: str) -> dict[str, Any]:
+    if not isinstance(raw, dict):
+        raise SystemExit(f"{owner} evidence row must be a JSON object")
+    return raw
+
+
+def evidence_non_empty_string(raw: Any, *, field: str, owner: str, required: bool = True) -> str | None:
+    if raw is None:
+        if required:
+            raise SystemExit(f"{owner} {field} must be a non-empty string")
+        return None
+    if not isinstance(raw, str):
+        raise SystemExit(f"{owner} {field} must be a string")
+    text = sanitize_note_text(raw)
+    if not text:
+        if required:
+            raise SystemExit(f"{owner} {field} must be a non-empty string")
+        return None
+    return text
+
+
+def evidence_bool(raw: Any, *, field: str, owner: str, default: bool = False) -> bool:
+    if raw is None:
+        return default
+    if not isinstance(raw, bool):
+        raise SystemExit(f"{owner} {field} must be a boolean")
+    return raw
+
+
+def evidence_nonnegative_int(
+    raw: Any,
+    *,
+    field: str,
+    owner: str,
+    default: int = 0,
+    maximum: int = MAX_USAGE_TOKEN_COUNT,
+) -> int:
+    if raw is None:
+        return default
+    value = normalize_usage_token(raw)
+    if value is None or value > maximum:
+        raise SystemExit(f"{owner} {field} must be a finite non-negative integer")
+    return value
+
+
+def evidence_nonnegative_float(
+    raw: Any,
+    *,
+    field: str,
+    owner: str,
+    default: float = 0.0,
+    maximum: float = MAX_USAGE_COST_USD,
+) -> float:
+    if raw is None:
+        return default
+    if isinstance(raw, bool) or not isinstance(raw, (int, float)):
+        raise SystemExit(f"{owner} {field} must be a finite non-negative number")
+    value = float(raw)
+    if not math.isfinite(value) or value < 0 or value > maximum:
+        raise SystemExit(f"{owner} {field} must be a finite non-negative number")
+    return value
+
+
+def evidence_first(raw: dict[str, Any], *keys: str) -> Any:
+    for key in keys:
+        if key in raw:
+            return raw[key]
+    return None
+
+
+def parse_evidence_provenance(raw: dict[str, Any], *, owner: str) -> dict[str, Any]:
+    provenance = raw.get("provenance")
+    if provenance is not None and not isinstance(provenance, dict):
+        raise SystemExit(f"{owner} provenance must be a JSON object")
+    source_raw = (
+        provenance.get("evidence_source_type")
+        if isinstance(provenance, dict) and "evidence_source_type" in provenance
+        else raw.get("evidence_source_type")
+    )
+    source_type = evidence_non_empty_string(source_raw, field="evidence_source_type", owner=owner)
+    assert source_type is not None
+    if source_type not in EVIDENCE_REPLAY_SOURCE_TYPES:
+        raise SystemExit(
+            f"{owner} evidence_source_type must be one of: {', '.join(sorted(EVIDENCE_REPLAY_SOURCE_TYPES))}"
+        )
+    provider_name = evidence_non_empty_string(
+        (provenance if isinstance(provenance, dict) and "provider_name" in provenance else raw).get("provider_name"),
+        field="provider_name",
+        owner=owner,
+        required=False,
+    )
+    capture_id = evidence_non_empty_string(
+        (
+            provenance.get("capture_command_or_export_id")
+            if isinstance(provenance, dict) and "capture_command_or_export_id" in provenance
+            else raw.get("capture_command_or_export_id")
+        ),
+        field="capture_command_or_export_id",
+        owner=owner,
+        required=False,
+    )
+    claim_scope = evidence_non_empty_string(
+        (provenance if isinstance(provenance, dict) and "claim_scope" in provenance else raw).get("claim_scope"),
+        field="claim_scope",
+        owner=owner,
+    )
+    assert claim_scope is not None
+    provider_authority = (
+        source_type == "provider_export"
+        and provider_name is not None
+        and capture_id is not None
+        and claim_scope in PROVIDER_EXPORT_PUBLIC_CLAIM_SCOPES
+    )
+    return {
+        "source_type": source_type,
+        "provider_name": provider_name,
+        "capture_command_or_export_id": capture_id,
+        "claim_scope": claim_scope,
+        "provider_public_claim_authority": provider_authority,
+    }
+
+
+def parse_evidence_tokens(raw: dict[str, Any], *, owner: str) -> tuple[dict[str, int], set[str]]:
+    token_block = raw.get("tokens")
+    if token_block is not None and not isinstance(token_block, dict):
+        raise SystemExit(f"{owner} tokens must be a JSON object")
+    tokens: dict[str, int] = {}
+    observed: set[str] = set()
+    source = token_block if isinstance(token_block, dict) else {}
+    for bucket, _keys in USAGE_KEY_GROUPS:
+        value = source.get(bucket) if bucket in source else raw.get(bucket)
+        if value is not None:
+            observed.add(bucket)
+        tokens[bucket] = evidence_nonnegative_int(value, field=bucket, owner=owner)
+    return tokens, observed
+
+
+def parse_evidence_row(raw_value: Any, *, owner: str, line_number: int) -> EvidenceReplayRow:
+    raw = require_evidence_object(raw_value, owner=owner)
+    schema = evidence_non_empty_string(raw.get("schema_version"), field="schema_version", owner=owner)
+    if schema != BENCH_RUN_EVIDENCE_SCHEMA_VERSION:
+        raise SystemExit(
+            f"{owner} schema_version must be {BENCH_RUN_EVIDENCE_SCHEMA_VERSION}"
+        )
+    task_id = evidence_non_empty_string(raw.get("task_id"), field="task_id", owner=owner)
+    variant = evidence_non_empty_string(raw.get("variant"), field="variant", owner=owner)
+    assert task_id is not None and variant is not None
+    provenance = parse_evidence_provenance(raw, owner=owner)
+    provider_authority = bool(provenance["provider_public_claim_authority"])
+    raw_primary_tokens_measured = evidence_bool(
+        raw.get("primary_tokens_measured"),
+        field="primary_tokens_measured",
+        owner=owner,
+    )
+    raw_cost_measured = evidence_bool(
+        evidence_first(raw, "cost_measured", "primary_cost_measured"),
+        field="cost_measured",
+        owner=owner,
+    )
+    if provenance["source_type"] in {"synthetic_fixture", "manual_audit"}:
+        primary_tokens_measured = False
+        cost_measured = False
+    elif provider_authority:
+        primary_tokens_measured = raw_primary_tokens_measured
+        cost_measured = raw_cost_measured
+    else:
+        if raw_primary_tokens_measured or raw_cost_measured:
+            raise SystemExit(
+                f"{owner} provider_export measured flags require provider_name, "
+                "capture_command_or_export_id, and a provider-measured matched-task claim_scope"
+            )
+        primary_tokens_measured = False
+        cost_measured = False
+
+    tokens, observed_token_buckets = parse_evidence_tokens(raw, owner=owner)
+    if primary_tokens_measured and not {"input_tokens", "output_tokens"}.issubset(observed_token_buckets):
+        raise SystemExit(
+            f"{owner} primary_tokens_measured=true requires input_tokens and output_tokens evidence"
+        )
+    cost_usd = evidence_nonnegative_float(
+        evidence_first(raw, "cost_usd", "primary_cost_usd"),
+        field="cost_usd",
+        owner=owner,
+    )
+    if cost_measured and evidence_first(raw, "cost_usd", "primary_cost_usd") is None:
+        raise SystemExit(f"{owner} cost_measured=true requires cost_usd evidence")
+
+    if raw.get("success") is None:
+        raise SystemExit(f"{owner} success must be a boolean")
+    success = evidence_bool(raw.get("success"), field="success", owner=owner)
+    notes = evidence_non_empty_string(raw.get("notes"), field="notes", owner=owner, required=False)
+    model = evidence_non_empty_string(raw.get("model"), field="model", owner=owner, required=False) or "evidence-replay"
+    effort = evidence_non_empty_string(raw.get("effort"), field="effort", owner=owner, required=False) or ""
+    self_hosted_metrics = None
+    if SELF_HOSTED_METRICS_KEY in raw:
+        self_hosted_metrics = normalize_self_hosted_metrics(
+            raw.get(SELF_HOSTED_METRICS_KEY),
+            source="evidence_jsonl.self_hosted_metrics",
+        )
+        if self_hosted_metrics is None:
+            raise SystemExit(f"{owner} self_hosted_metrics must be normalized explicit metrics")
+
+    result = RunResult(
+        task_id=task_id,
+        variant=variant,
+        model=model,
+        effort=effort,
+        tokens=tokens,
+        cost_usd=cost_usd,
+        success=success,
+        notes=notes or f"evidence replay ({provenance['source_type']})",
+        corrections=evidence_nonnegative_int(raw.get("corrections"), field="corrections", owner=owner),
+        cost_measured=cost_measured,
+        wall_time_seconds=evidence_nonnegative_float(
+            raw.get("wall_time_seconds"),
+            field="wall_time_seconds",
+            owner=owner,
+            maximum=MAX_SELF_HOSTED_LATENCY_MS / 1000,
+        ),
+        turns=evidence_nonnegative_int(raw.get("turns"), field="turns", owner=owner),
+        hook_triggers=evidence_nonnegative_int(raw.get("hook_triggers"), field="hook_triggers", owner=owner),
+        bytes_before=evidence_nonnegative_int(raw.get("bytes_before"), field="bytes_before", owner=owner),
+        bytes_after=evidence_nonnegative_int(raw.get("bytes_after"), field="bytes_after", owner=owner),
+        artifacts_used=evidence_nonnegative_int(raw.get("artifacts_used"), field="artifacts_used", owner=owner),
+        external_tokens=evidence_nonnegative_int(raw.get("external_tokens"), field="external_tokens", owner=owner),
+        external_tokens_measured=evidence_bool(
+            raw.get("external_tokens_measured"),
+            field="external_tokens_measured",
+            owner=owner,
+        ),
+        external_cost_usd=evidence_nonnegative_float(
+            raw.get("external_cost_usd"),
+            field="external_cost_usd",
+            owner=owner,
+        ),
+        external_cost_measured=evidence_bool(
+            raw.get("external_cost_measured"),
+            field="external_cost_measured",
+            owner=owner,
+        ),
+        provider_cached_tokens=evidence_nonnegative_int(
+            raw.get("provider_cached_tokens"),
+            field="provider_cached_tokens",
+            owner=owner,
+        ),
+        provider_cached_tokens_measured=evidence_bool(
+            raw.get("provider_cached_tokens_measured"),
+            field="provider_cached_tokens_measured",
+            owner=owner,
+        ),
+        primary_tokens_measured=primary_tokens_measured,
+        self_hosted_metrics=self_hosted_metrics,
+    )
+    return EvidenceReplayRow(
+        result=result,
+        source_type=str(provenance["source_type"]),
+        provider_name=provenance["provider_name"],
+        capture_command_or_export_id=provenance["capture_command_or_export_id"],
+        claim_scope=str(provenance["claim_scope"]),
+        provider_export_provenance_complete=provider_authority,
+        public_claim_eligible=False,
+        line_number=line_number,
+    )
+
+
+def read_evidence_jsonl(path: Path) -> list[EvidenceReplayRow]:
+    fd = _open_regular_no_symlink(path)
+    try:
+        size = os.fstat(fd).st_size
+        if size > MAX_EVIDENCE_JSONL_BYTES:
+            raise SystemExit(
+                f"evidence JSONL exceeds {MAX_EVIDENCE_JSONL_BYTES} bytes: {path}"
+            )
+        rows: list[EvidenceReplayRow] = []
+        with os.fdopen(fd, "r", encoding="utf-8") as handle:
+            fd = -1
+            for line_number, line in enumerate(handle, start=1):
+                if line_number > MAX_EVIDENCE_JSONL_LINES:
+                    raise SystemExit(
+                        f"evidence JSONL line limit exceeded for {path}: > {MAX_EVIDENCE_JSONL_LINES}"
+                    )
+                if not line.strip():
+                    continue
+                try:
+                    payload = json.loads(line)
+                except json.JSONDecodeError as exc:
+                    raise SystemExit(
+                        f"{path}:{line_number} evidence row must be JSON: {exc.msg}"
+                    ) from None
+                rows.append(parse_evidence_row(payload, owner=f"{path}:{line_number}", line_number=line_number))
+    finally:
+        if fd != -1:
+            os.close(fd)
+    if not rows:
+        raise SystemExit(f"evidence JSONL contains no rows: {path}")
+    return rows
+
+
+def validate_evidence_coverage(
+    evidence_rows: list[EvidenceReplayRow],
+    runnable_targets: list[tuple[TaskFixture, Variant]],
+) -> dict[tuple[str, str], EvidenceReplayRow]:
+    by_key: dict[tuple[str, str], EvidenceReplayRow] = {}
+    for row in evidence_rows:
+        if row.key in by_key:
+            raise SystemExit(
+                f"duplicate evidence row for {row.key[0]}/{row.key[1]} "
+                f"(lines {by_key[row.key].line_number} and {row.line_number})"
+            )
+        by_key[row.key] = row
+    missing = [
+        f"{task.id}/{variant.name}"
+        for task, variant in runnable_targets
+        if (task.id, variant.name) not in by_key
+    ]
+    if missing:
+        raise SystemExit(f"missing evidence row(s) for selected targets: {', '.join(missing)}")
+    return {
+        (task.id, variant.name): by_key[(task.id, variant.name)]
+        for task, variant in runnable_targets
+    }
+
+
+def run_evidence_fixture(task: TaskFixture, variant: Variant, evidence: EvidenceReplayRow) -> RunResult:
+    result = evidence.result
+    if result.task_id != task.id or result.variant != variant.name:
+        raise SystemExit(
+            f"evidence target mismatch: expected {task.id}/{variant.name}, "
+            f"got {result.task_id}/{result.variant}"
+        )
+    if result.model == "evidence-replay":
+        result.model = task.model
+    if not result.effort:
+        result.effort = task.effort or ""
+    return result
+
+
 def row_int(row: dict[str, str], key: str) -> int:
     try:
         return int(float(row.get(key) or 0))
@@ -2277,18 +2686,230 @@ def matched_pair_evidence_entry(
         ),
     }
 
+def annotate_replay_report(
+    report: dict[str, Any],
+    replay_rows: list[EvidenceReplayRow],
+    *,
+    mixed_csv: bool,
+) -> dict[str, Any]:
+    source_types = sorted({row.source_type for row in replay_rows})
+    provider_names = sorted({row.provider_name for row in replay_rows if row.provider_name})
+    claim_scopes = sorted({row.claim_scope for row in replay_rows})
+    same_run_complete = (not mixed_csv) and len(replay_rows) == int(report.get("row_count") or 0)
+    all_provider_claim_authority = bool(replay_rows) and all(
+        row.provider_export_provenance_complete for row in replay_rows
+    )
+    raw_claim_status = str(report.get("claim_status") or "")
+    matched_pair_evidence = report.get("matched_pair_evidence")
+    matched_claim_gates_allow_public_claim = (
+        isinstance(matched_pair_evidence, list)
+        and bool(matched_pair_evidence)
+        and all(
+            isinstance(item, dict)
+            and isinstance(item.get("claim_boundary"), dict)
+            and bool(item["claim_boundary"].get("token_savings_claim_allowed"))
+            and bool(item["claim_boundary"].get("shifted_cost_claim_allowed"))
+            for item in matched_pair_evidence
+        )
+    )
+    report_claim_gates_allow_public_claim = (
+        raw_claim_status in REPLAY_PUBLIC_CLAIM_ELIGIBLE_RAW_STATUSES
+        and matched_claim_gates_allow_public_claim
+    )
+    if not same_run_complete:
+        public_claim_status = REPLAY_UNKNOWN_MIXED_CSV_STATUS
+        public_claim_eligible = False
+    elif all_provider_claim_authority and report_claim_gates_allow_public_claim:
+        public_claim_status = REPLAY_PUBLIC_CLAIM_CANDIDATE_STATUS
+        public_claim_eligible = True
+    elif all_provider_claim_authority:
+        public_claim_status = REPLAY_PROVIDER_CLAIM_GATES_NOT_MET_STATUS
+        public_claim_eligible = False
+    else:
+        public_claim_status = REPLAY_NOT_PUBLIC_CLAIM_STATUS
+        public_claim_eligible = False
+    report["raw_metric_claim_status"] = raw_claim_status
+    report["public_claim_status"] = public_claim_status
+    report["public_claim_eligible"] = public_claim_eligible
+    if not public_claim_eligible:
+        report["claim_status"] = public_claim_status
+    report["replay_evidence"] = {
+        "schema_version": BENCH_RUN_EVIDENCE_SCHEMA_VERSION,
+        "mode": "evidence_jsonl_replay",
+        "row_count": len(replay_rows),
+        "source_types": source_types,
+        "provider_names": provider_names,
+        "claim_scopes": claim_scopes,
+        "same_run_complete": same_run_complete,
+        "mixed_csv": mixed_csv,
+        "provider_export_provenance_complete": all_provider_claim_authority,
+        "report_claim_gates_allow_public_claim": report_claim_gates_allow_public_claim,
+        "public_claim_status": public_claim_status,
+        "public_claim_eligible": public_claim_eligible,
+        "target_keys": [f"{row.result.task_id}/{row.result.variant}" for row in replay_rows],
+        "claim_boundary": REPLAY_CLAIM_BOUNDARY,
+    }
+    return report
+
+
+def report_public_claim_status(report: dict[str, Any]) -> tuple[str, bool | None]:
+    if "public_claim_status" in report:
+        return str(report.get("public_claim_status")), bool(report.get("public_claim_eligible"))
+    return (
+        "csv_provenance_unknown_requires_original_evidence_or_trusted_ledger",
+        None,
+    )
+
+
+def markdown_value(value: Any) -> str:
+    if value is None:
+        return "n/a"
+    if isinstance(value, bool):
+        return "true" if value else "false"
+    if isinstance(value, float):
+        return f"{value:.6g}"
+    text = sanitize_note_text(value)
+    return text.replace("|", "\\|") or "n/a"
+
+
+def render_dashboard_markdown(report: dict[str, Any]) -> str:
+    public_claim_status, public_claim_eligible = report_public_claim_status(report)
+    metric_claim_status = report.get("raw_metric_claim_status", report.get("claim_status"))
+    lines = [
+        "# ContextGuard Benchmark Dashboard",
+        "",
+        f"- Schema: `{markdown_value(report.get('schema'))}`",
+        f"- Baseline variant: `{markdown_value(report.get('baseline_variant'))}`",
+        f"- Rows: {markdown_value(report.get('row_count'))}",
+        f"- Metric claim status: `{markdown_value(metric_claim_status)}`",
+        f"- Public claim status: `{markdown_value(public_claim_status)}`",
+        f"- Public claim eligible: `{markdown_value(public_claim_eligible)}`",
+        "",
+        "> Claim boundary: this dashboard is not a hosted savings claim unless report claim gates "
+        "allow it and public-claim provenance is complete. Proxy byte reductions are diagnostic "
+        "and are not hosted API token savings.",
+        "",
+        "## Variant summary",
+        "",
+        "| Variant | Runs | Successes | Failure rate | Tokens/success | Bytes saved | Token proxy saved | Quality notes |",
+        "| --- | ---: | ---: | ---: | ---: | ---: | ---: | --- |",
+    ]
+    summaries = report.get("summary_by_variant") if isinstance(report.get("summary_by_variant"), dict) else {}
+    comparison_by_variant = {
+        item.get("variant"): item
+        for item in (report.get("comparisons") if isinstance(report.get("comparisons"), list) else [])
+        if isinstance(item, dict)
+    }
+    for variant, summary in sorted(summaries.items()):
+        if not isinstance(summary, dict):
+            continue
+        comparison = comparison_by_variant.get(variant, {})
+        quality = comparison.get("quality_gate") if isinstance(comparison, dict) else None
+        if quality is None and summary.get("is_baseline_strategy"):
+            quality = "baseline"
+        lines.append(
+            "| "
+            + " | ".join([
+                markdown_value(variant),
+                markdown_value(summary.get("runs")),
+                markdown_value(summary.get("successful_runs")),
+                markdown_value(summary.get("failure_rate")),
+                markdown_value(summary.get("tokens_per_successful_task")),
+                markdown_value(summary.get("bytes_saved_successful")),
+                markdown_value(summary.get("token_proxy_saved_successful")),
+                markdown_value(quality),
+            ])
+            + " |"
+        )
+    lines.extend([
+        "",
+        "## Comparisons",
+        "",
+        "| Variant | Quality gate | Matched tasks | Token paired tasks | Token savings % | Shifted cost savings % |",
+        "| --- | --- | ---: | ---: | ---: | ---: |",
+    ])
+    comparisons = report.get("comparisons") if isinstance(report.get("comparisons"), list) else []
+    if comparisons:
+        for item in comparisons:
+            if not isinstance(item, dict):
+                continue
+            lines.append(
+                "| "
+                + " | ".join([
+                    markdown_value(item.get("variant")),
+                    markdown_value(item.get("quality_gate")),
+                    markdown_value(item.get("matched_successful_task_count")),
+                    markdown_value(item.get("paired_token_task_count")),
+                    markdown_value(item.get("token_savings_pct")),
+                    markdown_value(item.get("cost_savings_pct_with_shift")),
+                ])
+                + " |"
+            )
+    else:
+        lines.append("| n/a | n/a | 0 | 0 | n/a | n/a |")
+    replay = report.get("replay_evidence") if isinstance(report.get("replay_evidence"), dict) else None
+    if replay is not None:
+        lines.extend([
+            "",
+            "## Replay evidence provenance",
+            "",
+            f"- Source types: `{markdown_value(', '.join(replay.get('source_types') or []))}`",
+            f"- Claim scopes: `{markdown_value(', '.join(replay.get('claim_scopes') or []))}`",
+            f"- Same-run complete: `{markdown_value(replay.get('same_run_complete'))}`",
+            f"- Mixed/pre-existing CSV: `{markdown_value(replay.get('mixed_csv'))}`",
+            f"- Boundary: {markdown_value(replay.get('claim_boundary'))}",
+        ])
+    else:
+        lines.extend([
+            "",
+            "## Provenance note",
+            "",
+            "- CSV-only dashboards have unknown public-claim provenance unless regenerated from "
+            "the original evidence JSONL or a future trusted provenance ledger.",
+        ])
+    lines.extend([
+        "",
+        "## Re-run context",
+        "",
+        "- Evidence replay: `context-guard-bench --tasks <tasks.json> --variants <variants.json> "
+        "--evidence-jsonl <evidence.jsonl> --csv <results.csv> --report-json <report.json> "
+        "--dashboard-md <dashboard.md>`",
+    ])
+    return "\n".join(lines) + "\n"
+
+
+def write_report_outputs(
+    csv_path: Path,
+    report_path: Path | None,
+    dashboard_path: Path | None,
+    baseline_variant: str,
+    *,
+    replay_rows: list[EvidenceReplayRow] | None = None,
+    mixed_csv: bool = False,
+) -> dict[str, Any]:
+    # Keep lock order stable across all derived writes: source CSV first, then
+    # report, then dashboard. Do not add a derived-output -> CSV path; it can deadlock.
+    with csv_file_lock(csv_path, create_parent=True):
+        report = summarize_benchmark_rows(read_csv_rows(csv_path), baseline_variant)
+        if replay_rows is not None:
+            report = annotate_replay_report(report, replay_rows, mixed_csv=mixed_csv)
+        if report_path is not None:
+            with csv_file_lock(report_path, create_parent=True):
+                write_text_no_follow(
+                    report_path,
+                    json.dumps(report, ensure_ascii=False, indent=2, sort_keys=True) + "\n",
+                )
+        if dashboard_path is not None:
+            with csv_file_lock(dashboard_path, create_parent=True):
+                write_text_no_follow(dashboard_path, render_dashboard_markdown(report))
+    return report
+
+
 def write_report_json(csv_path: Path, report_path: Path, baseline_variant: str) -> dict[str, Any]:
     # Keep lock order stable across all report writes: source CSV first, derived
     # report second. Do not introduce a report -> CSV path; that can deadlock
     # concurrent report generation.
-    with csv_file_lock(csv_path, create_parent=True):
-        report = summarize_benchmark_rows(read_csv_rows(csv_path), baseline_variant)
-        with csv_file_lock(report_path, create_parent=True):
-            write_text_no_follow(
-                report_path,
-                json.dumps(report, ensure_ascii=False, indent=2, sort_keys=True) + "\n",
-            )
-    return report
+    return write_report_outputs(csv_path, report_path, None, baseline_variant)
 
 
 def sanitize_note_text(value: Any) -> str:
@@ -2351,8 +2972,18 @@ def existing_file_identity(path: Path) -> tuple[int, int] | None:
         os.close(fd)
 
 
-def validate_distinct_output_paths(csv_path: Path, ledger_path: Path | None, report_path: Path | None) -> None:
-    outputs = [("csv", csv_path), ("ledger-jsonl", ledger_path), ("report-json", report_path)]
+def validate_distinct_output_paths(
+    csv_path: Path,
+    ledger_path: Path | None,
+    report_path: Path | None,
+    dashboard_path: Path | None = None,
+) -> None:
+    outputs = [
+        ("csv", csv_path),
+        ("ledger-jsonl", ledger_path),
+        ("report-json", report_path),
+        ("dashboard-md", dashboard_path),
+    ]
     seen: dict[Path, str] = {}
     seen_identity: dict[tuple[int, int], str] = {}
     for label, path in outputs:
@@ -2391,12 +3022,16 @@ def main() -> int:
                         help="optional JSONL ledger path for cost-shift accounting per run")
     parser.add_argument("--report-json", default=None, type=Path,
                         help="optional A/B summary report JSON path generated from --csv after real runs")
+    parser.add_argument("--dashboard-md", default=None, type=Path,
+                        help="optional Markdown dashboard path generated from the benchmark report")
+    parser.add_argument("--evidence-jsonl", default=None, type=Path,
+                        help="optional validated run-evidence JSONL replay input; skips provider invocation")
     parser.add_argument("--baseline-variant", default="baseline",
                         help="variant name used as the report baseline (default: baseline)")
     args = parser.parse_args()
 
     require_no_follow_file_ops_supported()
-    validate_distinct_output_paths(args.csv, args.ledger_jsonl, args.report_json)
+    validate_distinct_output_paths(args.csv, args.ledger_jsonl, args.report_json, args.dashboard_md)
 
     variants = parse_variants(args.variants)
     tasks = parse_tasks(args.tasks, variants=variants)
@@ -2411,6 +3046,61 @@ def main() -> int:
         for task, variant in targets
         if (task.id, variant.name) not in skip_keys
     ]
+    if args.evidence_jsonl is not None:
+        if args.dry_run:
+            for task, variant in targets:
+                if (task.id, variant.name) in skip_keys:
+                    print(f"skip {task.id}/{variant.name} (already in {args.csv})")
+                    continue
+                print(f"evidence replay dry-run: {task.id}/{variant.name} <- {args.evidence_jsonl}")
+            print("completed 0 run(s); results in (dry-run; no CSV writes)")
+            return 0
+        csv_had_preexisting_content = file_has_content_no_follow(args.csv)
+        evidence_rows = read_evidence_jsonl(args.evidence_jsonl)
+        evidence_by_key = validate_evidence_coverage(evidence_rows, runnable_targets)
+        claude_ver = "evidence-replay"
+        completed = 0
+        replay_rows_written: list[EvidenceReplayRow] = []
+        for task, variant in targets:
+            if (task.id, variant.name) in skip_keys:
+                print(f"skip {task.id}/{variant.name} (already in {args.csv})")
+                continue
+            evidence = evidence_by_key[(task.id, variant.name)]
+            print(f"replay {task.id}/{variant.name} ...", flush=True)
+            result = run_evidence_fixture(task, variant, evidence)
+            wrote = append_csv(args.csv, claude_ver, result, skip_existing=args.resume)
+            if wrote:
+                replay_rows_written.append(evidence)
+                if args.ledger_jsonl is not None:
+                    append_cost_shift_ledger(
+                        args.ledger_jsonl,
+                        claude_ver,
+                        result,
+                        replay_provenance=evidence.provenance_payload(),
+                    )
+            completed += 1
+            status = "ok" if result.success else "FAIL"
+            suffix = "" if wrote else " (CSV not updated; row already present)"
+            print(
+                f"  {status} tokens={sum(result.tokens.values())} cost=${result.cost_usd:.4f} "
+                f"wall_time={result.wall_time_seconds:.3f}s {sanitize_note_text(result.notes)}{suffix}"
+            )
+        if args.report_json is not None or args.dashboard_md is not None:
+            report = write_report_outputs(
+                args.csv,
+                args.report_json,
+                args.dashboard_md,
+                args.baseline_variant,
+                replay_rows=replay_rows_written,
+                mixed_csv=csv_had_preexisting_content or bool(skip_keys) or len(replay_rows_written) != completed,
+            )
+            if args.report_json is not None:
+                print(f"report {args.report_json}: {report['claim_status']}")
+            if args.dashboard_md is not None:
+                print(f"dashboard {args.dashboard_md}: {report_public_claim_status(report)[0]}")
+        print(f"completed {completed} run(s); results in {args.csv}")
+        return 0
+
     placeholder_targets = [
         f"{task.id}/{variant.name}"
         for task, variant in runnable_targets
@@ -2463,9 +3153,12 @@ def main() -> int:
             f"wall_time={result.wall_time_seconds:.3f}s {sanitize_note_text(result.notes)}{suffix}"
         )
     target = args.csv if not args.dry_run else "(dry-run; no CSV writes)"
-    if args.report_json is not None and not args.dry_run:
-        report = write_report_json(args.csv, args.report_json, args.baseline_variant)
-        print(f"report {args.report_json}: {report['claim_status']}")
+    if (args.report_json is not None or args.dashboard_md is not None) and not args.dry_run:
+        report = write_report_outputs(args.csv, args.report_json, args.dashboard_md, args.baseline_variant)
+        if args.report_json is not None:
+            print(f"report {args.report_json}: {report['claim_status']}")
+        if args.dashboard_md is not None:
+            print(f"dashboard {args.dashboard_md}: {report_public_claim_status(report)[0]}")
     print(f"completed {completed} run(s); results in {target}")
     return 0
 
diff --git a/docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl b/docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl
new file mode 100644
index 0000000..f632e00
--- /dev/null
+++ b/docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl
@@ -0,0 +1,24 @@
+{"artifacts_used": 0, "bytes_after": 9450, "bytes_before": 9450, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_01_bugfix", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1715, "output_tokens": 229}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.17}
+{"artifacts_used": 1, "bytes_after": 5481, "bytes_before": 9450, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_01_bugfix", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1131, "output_tokens": 210}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.13}
+{"artifacts_used": 0, "bytes_after": 9900, "bytes_before": 9900, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_02_exploration", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1830, "output_tokens": 238}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.34}
+{"artifacts_used": 1, "bytes_after": 5742, "bytes_before": 9900, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_02_exploration", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1207, "output_tokens": 218}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.26}
+{"artifacts_used": 0, "bytes_after": 10350, "bytes_before": 10350, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_03_code_review", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1945, "output_tokens": 247}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.51}
+{"artifacts_used": 1, "bytes_after": 6003, "bytes_before": 10350, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_03_code_review", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1283, "output_tokens": 227}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.39}
+{"artifacts_used": 0, "bytes_after": 10800, "bytes_before": 10800, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_04_long_log_analysis", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2060, "output_tokens": 256}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.68}
+{"artifacts_used": 1, "bytes_after": 6264, "bytes_before": 10800, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_04_long_log_analysis", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1359, "output_tokens": 235}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.52}
+{"artifacts_used": 0, "bytes_after": 11250, "bytes_before": 11250, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_05_migration", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2175, "output_tokens": 265}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.85}
+{"artifacts_used": 1, "bytes_after": 6525, "bytes_before": 11250, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_05_migration", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1435, "output_tokens": 243}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.65}
+{"artifacts_used": 0, "bytes_after": 11700, "bytes_before": 11700, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_06_docs", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2290, "output_tokens": 274}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.02}
+{"artifacts_used": 1, "bytes_after": 6785, "bytes_before": 11700, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_06_docs", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1511, "output_tokens": 252}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.78}
+{"artifacts_used": 0, "bytes_after": 12150, "bytes_before": 12150, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_07_refactor", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2405, "output_tokens": 283}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.19}
+{"artifacts_used": 1, "bytes_after": 7046, "bytes_before": 12150, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_07_refactor", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1587, "output_tokens": 260}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.91}
+{"artifacts_used": 0, "bytes_after": 12600, "bytes_before": 12600, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_08_performance", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2520, "output_tokens": 292}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.36}
+{"artifacts_used": 1, "bytes_after": 7307, "bytes_before": 12600, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_08_performance", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1663, "output_tokens": 268}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.04}
+{"artifacts_used": 0, "bytes_after": 13050, "bytes_before": 13050, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_09_telemetry", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2635, "output_tokens": 301}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.53}
+{"artifacts_used": 1, "bytes_after": 7568, "bytes_before": 13050, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_09_telemetry", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1739, "output_tokens": 276}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.17}
+{"artifacts_used": 0, "bytes_after": 13500, "bytes_before": 13500, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_10_cache_layout", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2750, "output_tokens": 310}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.7}
+{"artifacts_used": 1, "bytes_after": 7829, "bytes_before": 13500, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_10_cache_layout", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1815, "output_tokens": 285}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.3}
+{"artifacts_used": 0, "bytes_after": 13950, "bytes_before": 13950, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_11_tool_schema", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2865, "output_tokens": 319}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.87}
+{"artifacts_used": 1, "bytes_after": 8090, "bytes_before": 13950, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_11_tool_schema", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1890, "output_tokens": 293}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.43}
+{"artifacts_used": 0, "bytes_after": 14400, "bytes_before": 14400, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_12_artifact_receipt", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2980, "output_tokens": 328}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 13.04}
+{"artifacts_used": 1, "bytes_after": 8352, "bytes_before": 14400, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_12_artifact_receipt", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1966, "output_tokens": 301}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.56}
diff --git a/docs/benchmark-workflow-examples.md b/docs/benchmark-workflow-examples.md
index 246fd4c..0661ed3 100644
--- a/docs/benchmark-workflow-examples.md
+++ b/docs/benchmark-workflow-examples.md
@@ -26,6 +26,7 @@ Use them to decide what evidence a workflow has and what it does **not** prove:
 3. Treat `comparisons[].quality_gate != "pass"` as a warning to inspect failures, correction burden, and unmatched tasks before discussing savings.
 4. Keep byte-proxy, provider-cache, wall-time, and shifted-cost evidence in separate language from provider-measured token/cost claims. Provider-cache telemetry is not independent savings proof.
 5. Keep self-hosted local/model-server latency, memory, and quality metrics in the run-evidence ledger sidecar; do not fold them into hosted API token/cost savings claims unless provider-measured matched-task evidence separately supports that claim.
+6. For deterministic local replay, pass `--evidence-jsonl ... --dashboard-md ...` to `context-guard-bench`. Synthetic/manual replay evidence regenerates CSV/report/dashboard artifacts, but the report is marked `replay_only_not_public_claim` or `unknown_mixed_csv` unless every report row has complete provider-export provenance.
 
 ## Safe wording
 
@@ -42,3 +43,5 @@ The `.example.json` fixtures intentionally use full `context-guard-bench-report-
 The self-hosted metrics example is a JSONL run-evidence sidecar, not a full report shape. Its fields are additive ledger evidence only: `latency_ms`, `peak_memory_mb`, and normalized `quality_score` describe local/model-server behavior and leave hosted API report calculations unchanged. Use `context-guard experiments plan self-hosted-metrics-ledger --json ...` only as a dry-run ledger-preview checker for explicit metrics; it does not write the benchmark ledger.
 
 For task/variant starter fixtures rather than full report-shape examples, see [`experimental-benchmark-fixtures.md`](experimental-benchmark-fixtures.md). Those files are fixture-only and synthetic dry-run-only starters until users replace the placeholder prompts and success checks; they are not shipped OCR, visual-token, learned-compression, or output-transform benchmark results, and real claims still require provider-measured matched successful tasks plus failure-rate, correction, and shifted-cost guardrails.
+
+The token-savings 12-task starter also includes [`benchmark-fixtures/token-savings-12task.evidence.example.jsonl`](benchmark-fixtures/token-savings-12task.evidence.example.jsonl) for `context-guard-bench --evidence-jsonl` replay. That file is synthetic local replay evidence, not provider-measured savings proof; use it to validate dashboards and claim-boundary handling before collecting real provider exports.
diff --git a/docs/experimental-benchmark-fixtures.md b/docs/experimental-benchmark-fixtures.md
index c8469c7..9d913ec 100644
--- a/docs/experimental-benchmark-fixtures.md
+++ b/docs/experimental-benchmark-fixtures.md
@@ -12,6 +12,23 @@ Use them when designing an experiment that starts from ContextGuard's existing b
 5. Treat byte counts, image dimensions, OCR confidence, and local compressor ratios as proxy evidence. Real token/cost claims require **provider-measured** primary token/cost fields on both sides.
 6. Keep private screenshots, raw secrets, and external service endpoints out of fixture files.
 
+## Local replay evidence
+
+`context-guard-bench --evidence-jsonl <path>` can replay pre-recorded run evidence into the normal CSV/report pipeline without invoking `claude` or any task `success_command`. Pair it with `--report-json` and `--dashboard-md` to regenerate a deterministic local dashboard:
+
+```bash
+context-guard-bench \
+  --tasks docs/benchmark-fixtures/token-savings-12task.tasks.example.json \
+  --variants docs/benchmark-fixtures/token-savings-12task.variants.example.json \
+  --evidence-jsonl docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl \
+  --csv /tmp/contextguard-token-savings.csv \
+  --report-json /tmp/contextguard-token-savings.report.json \
+  --dashboard-md /tmp/contextguard-token-savings.dashboard.md \
+  --baseline-variant baseline_full_context_fixture
+```
+
+The included token-savings evidence file deliberately carries `synthetic_fixture` provenance. It validates replay/dashboard mechanics and byte-proxy reporting only: replay forces synthetic/manual rows to `primary_tokens_measured=false` and `cost_measured=false`, so the file is not public hosted API token/cost savings evidence even when token-looking numbers are present. A public claim still requires matched successful tasks, provider-export provenance, provider-measured primary tokens/cost, quality non-inferiority, and shifted-cost accounting.
+
 ## Runner-native variant prompt files
 
 `context-guard-bench` supports optional file-backed `variant_prompt_files` in task fixtures. The map is keyed by variant name and lets a single logical task swap sanitized prompt evidence per variant, for example a baseline raw-output prompt versus a digest plus artifact receipt prompt. Prompt files are resolved relative to the task JSON, must be relative paths, and are read with the same no-follow/symlink-safe posture as task and variant fixtures.
@@ -20,12 +37,12 @@ This runner-native swap only proves command shape and prompt selection until the
 
 ## Included fixture sets
 
-| Fixture set | Task file | Variant file | Intended future experiment |
-| --- | --- | --- | --- |
-| Visual/OCR evidence | [`benchmark-fixtures/visual-ocr.tasks.example.json`](benchmark-fixtures/visual-ocr.tasks.example.json) | [`benchmark-fixtures/visual-ocr.variants.example.json`](benchmark-fixtures/visual-ocr.variants.example.json) | Compare full visual evidence against cropped or OCR-derived evidence after the user supplies sanitized textual evidence, missed-context notes, crop/OCR telemetry, and provider telemetry. |
-| Learned compression | [`benchmark-fixtures/learned-compression.tasks.example.json`](benchmark-fixtures/learned-compression.tasks.example.json) | [`benchmark-fixtures/learned-compression.variants.example.json`](benchmark-fixtures/learned-compression.variants.example.json) | Compare sanitized baseline context packs against a fixture-only compressed digest candidate after exact retrieval or receipt fallback, quality gates, and shifted costs are measured. |
-| Reversible output transform | [`benchmark-fixtures/output-transform.tasks.example.json`](benchmark-fixtures/output-transform.tasks.example.json) | [`benchmark-fixtures/output-transform.variants.example.json`](benchmark-fixtures/output-transform.variants.example.json) | Compare raw sanitized command output against a digest plus artifact receipt after variant prompt files, success checks, and provider telemetry are supplied. |
-| Token-savings 12-task roadmap | [`benchmark-fixtures/token-savings-12task.tasks.example.json`](benchmark-fixtures/token-savings-12task.tasks.example.json) | [`benchmark-fixtures/token-savings-12task.variants.example.json`](benchmark-fixtures/token-savings-12task.variants.example.json) | Exercise a canonical 12-task spread for bugfix, exploration, review, log analysis, migration, docs, refactor, performance, telemetry, cache layout, tool-schema deferral, and artifact receipt experiments after real success commands and provider telemetry are supplied. |
+| Fixture set | Task file | Variant file | Evidence replay file | Intended future experiment |
+| --- | --- | --- | --- | --- |
+| Visual/OCR evidence | [`benchmark-fixtures/visual-ocr.tasks.example.json`](benchmark-fixtures/visual-ocr.tasks.example.json) | [`benchmark-fixtures/visual-ocr.variants.example.json`](benchmark-fixtures/visual-ocr.variants.example.json) | n/a | Compare full visual evidence against cropped or OCR-derived evidence after the user supplies sanitized textual evidence, missed-context notes, crop/OCR telemetry, and provider telemetry. |
+| Learned compression | [`benchmark-fixtures/learned-compression.tasks.example.json`](benchmark-fixtures/learned-compression.tasks.example.json) | [`benchmark-fixtures/learned-compression.variants.example.json`](benchmark-fixtures/learned-compression.variants.example.json) | n/a | Compare sanitized baseline context packs against a fixture-only compressed digest candidate after exact retrieval or receipt fallback, quality gates, and shifted costs are measured. |
+| Reversible output transform | [`benchmark-fixtures/output-transform.tasks.example.json`](benchmark-fixtures/output-transform.tasks.example.json) | [`benchmark-fixtures/output-transform.variants.example.json`](benchmark-fixtures/output-transform.variants.example.json) | n/a | Compare raw sanitized command output against a digest plus artifact receipt after variant prompt files, success checks, and provider telemetry are supplied. |
+| Token-savings 12-task roadmap | [`benchmark-fixtures/token-savings-12task.tasks.example.json`](benchmark-fixtures/token-savings-12task.tasks.example.json) | [`benchmark-fixtures/token-savings-12task.variants.example.json`](benchmark-fixtures/token-savings-12task.variants.example.json) | [`benchmark-fixtures/token-savings-12task.evidence.example.jsonl`](benchmark-fixtures/token-savings-12task.evidence.example.jsonl) | Exercise a canonical 12-task spread for bugfix, exploration, review, log analysis, migration, docs, refactor, performance, telemetry, cache layout, tool-schema deferral, and artifact receipt experiments after real success commands and provider telemetry are supplied. |
 
 ## Visual/OCR fixture notes
 
@@ -41,7 +58,7 @@ The output-transform fixtures describe already-sanitized command output comparis
 
 ## Token-savings 12-task roadmap fixture notes
 
-The token-savings 12-task fixtures are a canonical **fixture-only** spread for roadmap-level A/B design. They demonstrate `variant_prompt_files` for a baseline full-context prompt versus a ContextGuard advisory-foundations prompt that may later include cache layout lint, core-vs-deferred tool schemas, artifact receipts, and claim-safe telemetry. They do not execute `context-guard-cache-score`, `context-guard-tool-prune`, or any provider call.
+The token-savings 12-task fixtures are a canonical **fixture-only** spread for roadmap-level A/B design. They demonstrate `variant_prompt_files` for a baseline full-context prompt versus a ContextGuard advisory-foundations prompt that may later include cache layout lint, core-vs-deferred tool schemas, artifact receipts, and claim-safe telemetry. They do not execute `context-guard-cache-score`, `context-guard-tool-prune`, or any provider call. The companion `token-savings-12task.evidence.example.jsonl` lets users replay deterministic synthetic rows into CSV/report/dashboard outputs while preserving the same non-claim boundary.
 
 For real non-dry-run experiments, replace every placeholder `success_command`, keep task IDs matched across baseline and candidate variants, and require provider-measured primary token/cost data before interpreting `tokens_per_successful_task`, `total_cost_with_shift_usd`, or `external_cost_usd`. Cache predictions, char/4 token proxies, local latency, and byte reductions remain diagnostic proxy evidence unless the generated report contains matched successful task evidence and stays within the 10%p failure-rate guardrail.
 
diff --git a/package.json b/package.json
index 215b611..03f9d9b 100644
--- a/package.json
+++ b/package.json
@@ -59,6 +59,7 @@
     "docs/benchmark-workflows/*.example.jsonl",
     "docs/benchmark-workflow-examples.md",
     "docs/benchmark-fixtures/*.example.json",
+    "docs/benchmark-fixtures/*.example.jsonl",
     "docs/benchmark-fixtures/*.prompt.example.md",
     "docs/experimental-benchmark-fixtures.md",
     "packaging/homebrew/context-guard.rb.template"
diff --git a/plugins/context-guard/README.ko.md b/plugins/context-guard/README.ko.md
index 9340a80..8345784 100644
--- a/plugins/context-guard/README.ko.md
+++ b/plugins/context-guard/README.ko.md
@@ -114,7 +114,7 @@ brief 모드는 코딩 에이전트가 군더더기를 줄이도록 요청하되
 
 ## 절감 수치를 과장하지 않습니다
 
-이 헬퍼들은 흔히 컨텍스트를 불필요하게 키우는 원인을 줄이지만, 고정된 절감률을 보장하지 않습니다. 실제 전후 비교 증거가 필요하면 `context-guard-bench --ledger-jsonl ... --report-json ...`로 본인 작업에서 측정하세요. 토큰 절감 주장은 대응 태스크 양쪽 모두에 `primary_tokens_measured`가 있을 때만 계산하며, report의 `matched_pair_evidence`가 성공한 baseline/variant task bucket을 transform, quality gate, 측정 가능 여부, claim boundary와 연결합니다. wall-time과 provider-cache 필드는 진단용 텔레메트리이지 단독 절감 증거가 아닙니다. 감사의 `cache_friendliness`, [`cache_diagnostics`](https://github.com/ictechgy/context-guard/blob/main/docs/cache-diagnostics-schema.md), `cache_layout_advice`는 관측/추론/가설/불가 경계를 둔 휴리스틱 배치·cache-read 신호와 순위화된 확인/실험이며 청구 기준이나 provider-cache 증명이 아닙니다. 벤치마크 CSV 스키마는 엄격하므로 헬퍼 업그레이드 후에는 새 CSV를 시작하거나 헤더를 마이그레이션하세요. 작업 유형별 합성 예시는 [`docs/benchmark-workflow-examples.md`](https://github.com/ictechgy/context-guard/blob/main/docs/benchmark-workflow-examples.md)에 있고, fixture-only 실험 시작 예시는 [`docs/experimental-benchmark-fixtures.md`](https://github.com/ictechgy/context-guard/blob/main/docs/experimental-benchmark-fixtures.md)에 있습니다.
+이 헬퍼들은 흔히 컨텍스트를 불필요하게 키우는 원인을 줄이지만, 고정된 절감률을 보장하지 않습니다. 실제 전후 비교 증거가 필요하면 `context-guard-bench --ledger-jsonl ... --report-json ... --dashboard-md ...`로 본인 작업에서 측정하세요. `--evidence-jsonl ...`는 deterministic local replay용이며 provider-export provenance가 완전하지 않으면 public claim 불가로 표시됩니다. 토큰 절감 주장은 대응 태스크 양쪽 모두에 `primary_tokens_measured`가 있을 때만 계산하며, report의 `matched_pair_evidence`가 성공한 baseline/variant task bucket을 transform, quality gate, 측정 가능 여부, claim boundary와 연결합니다. wall-time과 provider-cache 필드는 진단용 텔레메트리이지 단독 절감 증거가 아닙니다. 감사의 `cache_friendliness`, [`cache_diagnostics`](https://github.com/ictechgy/context-guard/blob/main/docs/cache-diagnostics-schema.md), `cache_layout_advice`는 관측/추론/가설/불가 경계를 둔 휴리스틱 배치·cache-read 신호와 순위화된 확인/실험이며 청구 기준이나 provider-cache 증명이 아닙니다. 벤치마크 CSV 스키마는 엄격하므로 헬퍼 업그레이드 후에는 새 CSV를 시작하거나 헤더를 마이그레이션하세요. 작업 유형별 합성 예시는 [`docs/benchmark-workflow-examples.md`](https://github.com/ictechgy/context-guard/blob/main/docs/benchmark-workflow-examples.md)에 있고, fixture-only 실험 시작 예시는 [`docs/experimental-benchmark-fixtures.md`](https://github.com/ictechgy/context-guard/blob/main/docs/experimental-benchmark-fixtures.md)에 있습니다.
 
 ContextGuard는 모델 토큰을 줄이기 위해 작업을 외부 AI 서비스로 전송하지 않습니다. 모든 헬퍼 명령은 로컬에서 동작합니다. 로컬 RAM/디스크 보관본은 다음에 보낼 컨텍스트를 줄이는 데 도움될 수 있지만 provider prompt cache를 대체하지 않습니다. Anthropic 배포나 청구 설명 전에는 공식 prompt caching/pricing 문서를 다시 확인하세요: https://docs.anthropic.com/en/build-with-claude/prompt-caching 및 https://platform.claude.com/docs/en/about-claude/pricing.
 
diff --git a/plugins/context-guard/README.md b/plugins/context-guard/README.md
index d3c10c1..d625ebd 100644
--- a/plugins/context-guard/README.md
+++ b/plugins/context-guard/README.md
@@ -123,7 +123,7 @@ Three deterministic levels — `lite`, `standard`, `ultra` — live under [`brie
 
 ## Conservative claims
 
-These helpers reduce common sources of context bloat, but they do not guarantee a fixed percentage savings. Use `context-guard-bench --ledger-jsonl ... --report-json ...` when you need measured before/after evidence for your own tasks; token-savings claims require `primary_tokens_measured` on both matched sides, and the report's `matched_pair_evidence` links each successful baseline/variant task bucket to the transform, quality gate, measurement availability, and claim boundary. Wall-time/provider-cache fields are diagnostic telemetry, not standalone savings proof. Audit `cache_friendliness`, [`cache_diagnostics`](https://github.com/ictechgy/context-guard/blob/main/docs/cache-diagnostics-schema.md), and `cache_layout_advice` findings are heuristic layout/cache-read signals and ranked checks/experiments with observed/inferred/hypothesis/unavailable boundaries, not billing authority or provider-cache proof. Benchmark CSV schemas are strict, so start a new CSV or migrate the header after helper upgrades. Workflow-specific synthetic examples live in [`docs/benchmark-workflow-examples.md`](https://github.com/ictechgy/context-guard/blob/main/docs/benchmark-workflow-examples.md), and fixture-only experimental task/variant starters live in [`docs/experimental-benchmark-fixtures.md`](https://github.com/ictechgy/context-guard/blob/main/docs/experimental-benchmark-fixtures.md).
+These helpers reduce common sources of context bloat, but they do not guarantee a fixed percentage savings. Use `context-guard-bench --ledger-jsonl ... --report-json ... --dashboard-md ...` when you need measured before/after evidence for your own tasks. Add `--evidence-jsonl ...` only for deterministic local replay, which remains non-claim-eligible unless provider-export provenance is complete. Token-savings claims require `primary_tokens_measured` on both matched sides, and the report's `matched_pair_evidence` links each successful baseline/variant task bucket to the transform, quality gate, measurement availability, and claim boundary. Wall-time/provider-cache fields are diagnostic telemetry, not standalone savings proof. Audit `cache_friendliness`, [`cache_diagnostics`](https://github.com/ictechgy/context-guard/blob/main/docs/cache-diagnostics-schema.md), and `cache_layout_advice` findings are heuristic layout/cache-read signals and ranked checks/experiments with observed/inferred/hypothesis/unavailable boundaries, not billing authority or provider-cache proof. Benchmark CSV schemas are strict, so start a new CSV or migrate the header after helper upgrades. Workflow-specific synthetic examples live in [`docs/benchmark-workflow-examples.md`](https://github.com/ictechgy/context-guard/blob/main/docs/benchmark-workflow-examples.md), and fixture-only experimental task/variant starters live in [`docs/experimental-benchmark-fixtures.md`](https://github.com/ictechgy/context-guard/blob/main/docs/experimental-benchmark-fixtures.md).
 
 ContextGuard also does not send work to external AI providers to save model tokens. All helper commands run locally. Local RAM/disk receipts can reduce what you choose to send, but they do not replace a provider prompt cache. Before release or billing claims for Anthropic, recheck the official prompt-caching and pricing docs: https://docs.anthropic.com/en/build-with-claude/prompt-caching and https://platform.claude.com/docs/en/about-claude/pricing.
 
diff --git a/plugins/context-guard/bin/context-guard-bench b/plugins/context-guard/bin/context-guard-bench
index e338b88..a1af3ab 100755
--- a/plugins/context-guard/bin/context-guard-bench
+++ b/plugins/context-guard/bin/context-guard-bench
@@ -178,6 +178,8 @@ EXTERNAL_SOURCE_KEY_GROUPS: tuple[tuple[str, tuple[str, ...], tuple[str, ...]],
 )
 MAX_USAGE_TOKEN_COUNT = 10**12
 MAX_USAGE_COST_USD = 10**9
+MAX_EVIDENCE_JSONL_BYTES = 5_000_000
+MAX_EVIDENCE_JSONL_LINES = 100_000
 # Byte -> token proxy 환산 계수. 측정된 모델 토큰이 아니라 byte delta 기반 보수적
 # 추정치이며, report에서 evidence="inferred"로 분명히 라벨링한다. 영어 텍스트 기준
 # ~4 bytes/token의 통용 근사값을 사용한다.
@@ -188,6 +190,25 @@ MEASUREMENT_BASELINE_SCHEMA_VERSION = "contextguard.bench.measurement-baseline.v
 SELF_HOSTED_METRICS_SCHEMA_VERSION = "contextguard.bench.self-hosted-metrics.v1"
 SELF_HOSTED_METRICS_KEY = "self_hosted_metrics"
 SELF_HOSTED_METRICS_CLAIM_BOUNDARY = "self_hosted_metrics_only_not_hosted_api_token_or_cost_savings"
+EVIDENCE_REPLAY_SOURCE_TYPES = frozenset({"synthetic_fixture", "provider_export", "manual_audit"})
+PROVIDER_EXPORT_PUBLIC_CLAIM_SCOPES = frozenset({
+    "provider_measured_matched_task",
+    "provider_measured_matched_task_public_claim",
+    "hosted_api_provider_measured_matched_task",
+})
+REPLAY_PUBLIC_CLAIM_CANDIDATE_STATUS = "provider_export_public_claim_candidate"
+REPLAY_PROVIDER_CLAIM_GATES_NOT_MET_STATUS = "provider_export_claim_gates_not_met"
+REPLAY_NOT_PUBLIC_CLAIM_STATUS = "replay_only_not_public_claim"
+REPLAY_UNKNOWN_MIXED_CSV_STATUS = "unknown_mixed_csv"
+REPLAY_PUBLIC_CLAIM_ELIGIBLE_RAW_STATUSES = frozenset({
+    "token_and_shifted_cost_savings_observed",
+})
+REPLAY_CLAIM_BOUNDARY = (
+    "Evidence replay is an import/replay mode. Synthetic fixtures and manual audits are never "
+    "hosted API token/cost savings evidence; public claims require complete provider_export "
+    "provenance for every report row plus the normal matched-task quality, token, cost, and "
+    "shifted-cost gates."
+)
 MAX_SELF_HOSTED_LABEL_CHARS = 120
 MAX_SELF_HOSTED_LATENCY_MS = 7 * 24 * 60 * 60 * 1000
 MAX_SELF_HOSTED_MEMORY_MB = 10_000_000
@@ -401,6 +422,36 @@ class RunResult:
     self_hosted_metrics: dict[str, Any] | None = None
 
 
+@dataclass
+class EvidenceReplayRow:
+    result: RunResult
+    source_type: str
+    provider_name: str | None
+    capture_command_or_export_id: str | None
+    claim_scope: str
+    provider_export_provenance_complete: bool
+    public_claim_eligible: bool
+    line_number: int
+
+    @property
+    def key(self) -> tuple[str, str]:
+        return (self.result.task_id, self.result.variant)
+
+    def provenance_payload(self) -> dict[str, Any]:
+        return {
+            "schema_version": BENCH_RUN_EVIDENCE_SCHEMA_VERSION,
+            "mode": "evidence_jsonl_replay",
+            "evidence_source_type": self.source_type,
+            "provider_name": self.provider_name,
+            "capture_command_or_export_id": self.capture_command_or_export_id,
+            "claim_scope": self.claim_scope,
+            "provider_export_provenance_complete": self.provider_export_provenance_complete,
+            "public_claim_eligible": self.public_claim_eligible,
+            "line_number": self.line_number,
+            "claim_boundary": REPLAY_CLAIM_BOUNDARY,
+        }
+
+
 @dataclass
 class BoundedProcessResult:
     returncode: int
@@ -1362,7 +1413,13 @@ def write_text_no_follow(path: Path, text: str) -> None:
             os.close(fd)
 
 
-def append_cost_shift_ledger(path: Path, claude_ver: str, result: RunResult) -> None:
+def append_cost_shift_ledger(
+    path: Path,
+    claude_ver: str,
+    result: RunResult,
+    *,
+    replay_provenance: dict[str, Any] | None = None,
+) -> None:
     shifted_cost_known = cost_shift_measured(result)
     byte_metrics_observed = bool(result.bytes_before or result.bytes_after)
     payload = {
@@ -1413,6 +1470,10 @@ def append_cost_shift_ledger(path: Path, claude_ver: str, result: RunResult) ->
     }
     if result.self_hosted_metrics is not None:
         payload["self_hosted_metrics"] = result.self_hosted_metrics
+    if replay_provenance is not None:
+        payload["replay_provenance"] = replay_provenance
+        payload["evidence_source_type"] = replay_provenance.get("evidence_source_type")
+        payload["public_claim_eligible"] = bool(replay_provenance.get("public_claim_eligible"))
     with csv_file_lock(path, create_parent=True):
         fd = _open_regular_no_symlink(path, os.O_CREAT | os.O_APPEND | os.O_WRONLY, 0o600, create_parent=True)
         try:
@@ -1488,6 +1549,354 @@ def read_csv_rows(csv_path: Path) -> list[dict[str, str]]:
             os.close(fd)
 
 
+def file_has_content_no_follow(path: Path) -> bool:
+    try:
+        fd = _open_regular_no_symlink(path)
+    except FileNotFoundError:
+        return False
+    try:
+        return os.fstat(fd).st_size > 0
+    finally:
+        os.close(fd)
+
+
+def require_evidence_object(raw: Any, *, owner: str) -> dict[str, Any]:
+    if not isinstance(raw, dict):
+        raise SystemExit(f"{owner} evidence row must be a JSON object")
+    return raw
+
+
+def evidence_non_empty_string(raw: Any, *, field: str, owner: str, required: bool = True) -> str | None:
+    if raw is None:
+        if required:
+            raise SystemExit(f"{owner} {field} must be a non-empty string")
+        return None
+    if not isinstance(raw, str):
+        raise SystemExit(f"{owner} {field} must be a string")
+    text = sanitize_note_text(raw)
+    if not text:
+        if required:
+            raise SystemExit(f"{owner} {field} must be a non-empty string")
+        return None
+    return text
+
+
+def evidence_bool(raw: Any, *, field: str, owner: str, default: bool = False) -> bool:
+    if raw is None:
+        return default
+    if not isinstance(raw, bool):
+        raise SystemExit(f"{owner} {field} must be a boolean")
+    return raw
+
+
+def evidence_nonnegative_int(
+    raw: Any,
+    *,
+    field: str,
+    owner: str,
+    default: int = 0,
+    maximum: int = MAX_USAGE_TOKEN_COUNT,
+) -> int:
+    if raw is None:
+        return default
+    value = normalize_usage_token(raw)
+    if value is None or value > maximum:
+        raise SystemExit(f"{owner} {field} must be a finite non-negative integer")
+    return value
+
+
+def evidence_nonnegative_float(
+    raw: Any,
+    *,
+    field: str,
+    owner: str,
+    default: float = 0.0,
+    maximum: float = MAX_USAGE_COST_USD,
+) -> float:
+    if raw is None:
+        return default
+    if isinstance(raw, bool) or not isinstance(raw, (int, float)):
+        raise SystemExit(f"{owner} {field} must be a finite non-negative number")
+    value = float(raw)
+    if not math.isfinite(value) or value < 0 or value > maximum:
+        raise SystemExit(f"{owner} {field} must be a finite non-negative number")
+    return value
+
+
+def evidence_first(raw: dict[str, Any], *keys: str) -> Any:
+    for key in keys:
+        if key in raw:
+            return raw[key]
+    return None
+
+
+def parse_evidence_provenance(raw: dict[str, Any], *, owner: str) -> dict[str, Any]:
+    provenance = raw.get("provenance")
+    if provenance is not None and not isinstance(provenance, dict):
+        raise SystemExit(f"{owner} provenance must be a JSON object")
+    source_raw = (
+        provenance.get("evidence_source_type")
+        if isinstance(provenance, dict) and "evidence_source_type" in provenance
+        else raw.get("evidence_source_type")
+    )
+    source_type = evidence_non_empty_string(source_raw, field="evidence_source_type", owner=owner)
+    assert source_type is not None
+    if source_type not in EVIDENCE_REPLAY_SOURCE_TYPES:
+        raise SystemExit(
+            f"{owner} evidence_source_type must be one of: {', '.join(sorted(EVIDENCE_REPLAY_SOURCE_TYPES))}"
+        )
+    provider_name = evidence_non_empty_string(
+        provenance.get("provider_name") if isinstance(provenance, dict) else raw.get("provider_name"),
+        field="provider_name",
+        owner=owner,
+        required=False,
+    )
+    capture_id = evidence_non_empty_string(
+        (
+            provenance.get("capture_command_or_export_id")
+            if isinstance(provenance, dict) and "capture_command_or_export_id" in provenance
+            else raw.get("capture_command_or_export_id")
+        ),
+        field="capture_command_or_export_id",
+        owner=owner,
+        required=False,
+    )
+    claim_scope = evidence_non_empty_string(
+        provenance.get("claim_scope") if isinstance(provenance, dict) else raw.get("claim_scope"),
+        field="claim_scope",
+        owner=owner,
+    )
+    assert claim_scope is not None
+    provider_authority = (
+        source_type == "provider_export"
+        and provider_name is not None
+        and capture_id is not None
+        and claim_scope in PROVIDER_EXPORT_PUBLIC_CLAIM_SCOPES
+    )
+    return {
+        "source_type": source_type,
+        "provider_name": provider_name,
+        "capture_command_or_export_id": capture_id,
+        "claim_scope": claim_scope,
+        "provider_public_claim_authority": provider_authority,
+    }
+
+
+def parse_evidence_tokens(raw: dict[str, Any], *, owner: str) -> tuple[dict[str, int], set[str]]:
+    token_block = raw.get("tokens")
+    if token_block is not None and not isinstance(token_block, dict):
+        raise SystemExit(f"{owner} tokens must be a JSON object")
+    tokens: dict[str, int] = {}
+    observed: set[str] = set()
+    source = token_block if isinstance(token_block, dict) else {}
+    for bucket, _keys in USAGE_KEY_GROUPS:
+        value = source.get(bucket) if bucket in source else raw.get(bucket)
+        if value is not None:
+            observed.add(bucket)
+        tokens[bucket] = evidence_nonnegative_int(value, field=bucket, owner=owner)
+    return tokens, observed
+
+
+def parse_evidence_row(raw_value: Any, *, owner: str, line_number: int) -> EvidenceReplayRow:
+    raw = require_evidence_object(raw_value, owner=owner)
+    schema = evidence_non_empty_string(raw.get("schema_version"), field="schema_version", owner=owner)
+    if schema != BENCH_RUN_EVIDENCE_SCHEMA_VERSION:
+        raise SystemExit(
+            f"{owner} schema_version must be {BENCH_RUN_EVIDENCE_SCHEMA_VERSION}"
+        )
+    task_id = evidence_non_empty_string(raw.get("task_id"), field="task_id", owner=owner)
+    variant = evidence_non_empty_string(raw.get("variant"), field="variant", owner=owner)
+    assert task_id is not None and variant is not None
+    provenance = parse_evidence_provenance(raw, owner=owner)
+    provider_authority = bool(provenance["provider_public_claim_authority"])
+    raw_primary_tokens_measured = evidence_bool(
+        raw.get("primary_tokens_measured"),
+        field="primary_tokens_measured",
+        owner=owner,
+    )
+    raw_cost_measured = evidence_bool(
+        evidence_first(raw, "cost_measured", "primary_cost_measured"),
+        field="cost_measured",
+        owner=owner,
+    )
+    if provenance["source_type"] in {"synthetic_fixture", "manual_audit"}:
+        primary_tokens_measured = False
+        cost_measured = False
+    elif provider_authority:
+        primary_tokens_measured = raw_primary_tokens_measured
+        cost_measured = raw_cost_measured
+    else:
+        if raw_primary_tokens_measured or raw_cost_measured:
+            raise SystemExit(
+                f"{owner} provider_export measured flags require provider_name, "
+                "capture_command_or_export_id, and a provider-measured matched-task claim_scope"
+            )
+        primary_tokens_measured = False
+        cost_measured = False
+
+    tokens, observed_token_buckets = parse_evidence_tokens(raw, owner=owner)
+    if primary_tokens_measured and not {"input_tokens", "output_tokens"}.issubset(observed_token_buckets):
+        raise SystemExit(
+            f"{owner} primary_tokens_measured=true requires input_tokens and output_tokens evidence"
+        )
+    cost_usd = evidence_nonnegative_float(
+        evidence_first(raw, "cost_usd", "primary_cost_usd"),
+        field="cost_usd",
+        owner=owner,
+    )
+    if cost_measured and "cost_usd" not in raw and "primary_cost_usd" not in raw:
+        raise SystemExit(f"{owner} cost_measured=true requires cost_usd evidence")
+
+    if "success" not in raw:
+        raise SystemExit(f"{owner} success must be a boolean")
+    success = evidence_bool(raw.get("success"), field="success", owner=owner)
+    notes = evidence_non_empty_string(raw.get("notes"), field="notes", owner=owner, required=False)
+    model = evidence_non_empty_string(raw.get("model"), field="model", owner=owner, required=False) or "evidence-replay"
+    effort = evidence_non_empty_string(raw.get("effort"), field="effort", owner=owner, required=False) or ""
+    self_hosted_metrics = None
+    if SELF_HOSTED_METRICS_KEY in raw:
+        self_hosted_metrics = normalize_self_hosted_metrics(
+            raw.get(SELF_HOSTED_METRICS_KEY),
+            source="evidence_jsonl.self_hosted_metrics",
+        )
+        if self_hosted_metrics is None:
+            raise SystemExit(f"{owner} self_hosted_metrics must be normalized explicit metrics")
+
+    result = RunResult(
+        task_id=task_id,
+        variant=variant,
+        model=model,
+        effort=effort,
+        tokens=tokens,
+        cost_usd=cost_usd,
+        success=success,
+        notes=notes or f"evidence replay ({provenance['source_type']})",
+        corrections=evidence_nonnegative_int(raw.get("corrections"), field="corrections", owner=owner),
+        cost_measured=cost_measured,
+        wall_time_seconds=evidence_nonnegative_float(
+            raw.get("wall_time_seconds"),
+            field="wall_time_seconds",
+            owner=owner,
+            maximum=MAX_SELF_HOSTED_LATENCY_MS / 1000,
+        ),
+        turns=evidence_nonnegative_int(raw.get("turns"), field="turns", owner=owner),
+        hook_triggers=evidence_nonnegative_int(raw.get("hook_triggers"), field="hook_triggers", owner=owner),
+        bytes_before=evidence_nonnegative_int(raw.get("bytes_before"), field="bytes_before", owner=owner),
+        bytes_after=evidence_nonnegative_int(raw.get("bytes_after"), field="bytes_after", owner=owner),
+        artifacts_used=evidence_nonnegative_int(raw.get("artifacts_used"), field="artifacts_used", owner=owner),
+        external_tokens=evidence_nonnegative_int(raw.get("external_tokens"), field="external_tokens", owner=owner),
+        external_tokens_measured=evidence_bool(
+            raw.get("external_tokens_measured"),
+            field="external_tokens_measured",
+            owner=owner,
+        ),
+        external_cost_usd=evidence_nonnegative_float(
+            raw.get("external_cost_usd"),
+            field="external_cost_usd",
+            owner=owner,
+        ),
+        external_cost_measured=evidence_bool(
+            raw.get("external_cost_measured"),
+            field="external_cost_measured",
+            owner=owner,
+        ),
+        provider_cached_tokens=evidence_nonnegative_int(
+            raw.get("provider_cached_tokens"),
+            field="provider_cached_tokens",
+            owner=owner,
+        ),
+        provider_cached_tokens_measured=evidence_bool(
+            raw.get("provider_cached_tokens_measured"),
+            field="provider_cached_tokens_measured",
+            owner=owner,
+        ),
+        primary_tokens_measured=primary_tokens_measured,
+        self_hosted_metrics=self_hosted_metrics,
+    )
+    return EvidenceReplayRow(
+        result=result,
+        source_type=str(provenance["source_type"]),
+        provider_name=provenance["provider_name"],
+        capture_command_or_export_id=provenance["capture_command_or_export_id"],
+        claim_scope=str(provenance["claim_scope"]),
+        provider_export_provenance_complete=provider_authority,
+        public_claim_eligible=False,
+        line_number=line_number,
+    )
+
+
+def read_evidence_jsonl(path: Path) -> list[EvidenceReplayRow]:
+    fd = _open_regular_no_symlink(path)
+    try:
+        size = os.fstat(fd).st_size
+        if size > MAX_EVIDENCE_JSONL_BYTES:
+            raise SystemExit(
+                f"evidence JSONL exceeds {MAX_EVIDENCE_JSONL_BYTES} bytes: {path}"
+            )
+        rows: list[EvidenceReplayRow] = []
+        with os.fdopen(fd, "r", encoding="utf-8") as handle:
+            fd = -1
+            for line_number, line in enumerate(handle, start=1):
+                if line_number > MAX_EVIDENCE_JSONL_LINES:
+                    raise SystemExit(
+                        f"evidence JSONL line limit exceeded for {path}: > {MAX_EVIDENCE_JSONL_LINES}"
+                    )
+                if not line.strip():
+                    continue
+                try:
+                    payload = json.loads(line)
+                except json.JSONDecodeError as exc:
+                    raise SystemExit(
+                        f"{path}:{line_number} evidence row must be JSON: {exc.msg}"
+                    ) from None
+                rows.append(parse_evidence_row(payload, owner=f"{path}:{line_number}", line_number=line_number))
+    finally:
+        if fd != -1:
+            os.close(fd)
+    if not rows:
+        raise SystemExit(f"evidence JSONL contains no rows: {path}")
+    return rows
+
+
+def validate_evidence_coverage(
+    evidence_rows: list[EvidenceReplayRow],
+    runnable_targets: list[tuple[TaskFixture, Variant]],
+) -> dict[tuple[str, str], EvidenceReplayRow]:
+    by_key: dict[tuple[str, str], EvidenceReplayRow] = {}
+    for row in evidence_rows:
+        if row.key in by_key:
+            raise SystemExit(
+                f"duplicate evidence row for {row.key[0]}/{row.key[1]} "
+                f"(lines {by_key[row.key].line_number} and {row.line_number})"
+            )
+        by_key[row.key] = row
+    missing = [
+        f"{task.id}/{variant.name}"
+        for task, variant in runnable_targets
+        if (task.id, variant.name) not in by_key
+    ]
+    if missing:
+        raise SystemExit(f"missing evidence row(s) for selected targets: {', '.join(missing)}")
+    return {
+        (task.id, variant.name): by_key[(task.id, variant.name)]
+        for task, variant in runnable_targets
+    }
+
+
+def run_evidence_fixture(task: TaskFixture, variant: Variant, evidence: EvidenceReplayRow) -> RunResult:
+    result = evidence.result
+    if result.task_id != task.id or result.variant != variant.name:
+        raise SystemExit(
+            f"evidence target mismatch: expected {task.id}/{variant.name}, "
+            f"got {result.task_id}/{result.variant}"
+        )
+    if result.model == "evidence-replay":
+        result.model = task.model
+    if not result.effort:
+        result.effort = task.effort or ""
+    return result
+
+
 def row_int(row: dict[str, str], key: str) -> int:
     try:
         return int(float(row.get(key) or 0))
@@ -2277,18 +2686,230 @@ def summarize_benchmark_rows(rows: list[dict[str, str]], baseline_variant: str)
         ),
     }
 
+def annotate_replay_report(
+    report: dict[str, Any],
+    replay_rows: list[EvidenceReplayRow],
+    *,
+    mixed_csv: bool,
+) -> dict[str, Any]:
+    source_types = sorted({row.source_type for row in replay_rows})
+    provider_names = sorted({row.provider_name for row in replay_rows if row.provider_name})
+    claim_scopes = sorted({row.claim_scope for row in replay_rows})
+    same_run_complete = (not mixed_csv) and len(replay_rows) == int(report.get("row_count") or 0)
+    all_provider_claim_authority = bool(replay_rows) and all(
+        row.provider_export_provenance_complete for row in replay_rows
+    )
+    raw_claim_status = str(report.get("claim_status") or "")
+    matched_pair_evidence = report.get("matched_pair_evidence")
+    matched_claim_gates_allow_public_claim = (
+        isinstance(matched_pair_evidence, list)
+        and bool(matched_pair_evidence)
+        and all(
+            isinstance(item, dict)
+            and isinstance(item.get("claim_boundary"), dict)
+            and bool(item["claim_boundary"].get("token_savings_claim_allowed"))
+            and bool(item["claim_boundary"].get("shifted_cost_claim_allowed"))
+            for item in matched_pair_evidence
+        )
+    )
+    report_claim_gates_allow_public_claim = (
+        raw_claim_status in REPLAY_PUBLIC_CLAIM_ELIGIBLE_RAW_STATUSES
+        and matched_claim_gates_allow_public_claim
+    )
+    if not same_run_complete:
+        public_claim_status = REPLAY_UNKNOWN_MIXED_CSV_STATUS
+        public_claim_eligible = False
+    elif all_provider_claim_authority and report_claim_gates_allow_public_claim:
+        public_claim_status = REPLAY_PUBLIC_CLAIM_CANDIDATE_STATUS
+        public_claim_eligible = True
+    elif all_provider_claim_authority:
+        public_claim_status = REPLAY_PROVIDER_CLAIM_GATES_NOT_MET_STATUS
+        public_claim_eligible = False
+    else:
+        public_claim_status = REPLAY_NOT_PUBLIC_CLAIM_STATUS
+        public_claim_eligible = False
+    report["raw_metric_claim_status"] = raw_claim_status
+    report["public_claim_status"] = public_claim_status
+    report["public_claim_eligible"] = public_claim_eligible
+    if not public_claim_eligible:
+        report["claim_status"] = public_claim_status
+    report["replay_evidence"] = {
+        "schema_version": BENCH_RUN_EVIDENCE_SCHEMA_VERSION,
+        "mode": "evidence_jsonl_replay",
+        "row_count": len(replay_rows),
+        "source_types": source_types,
+        "provider_names": provider_names,
+        "claim_scopes": claim_scopes,
+        "same_run_complete": same_run_complete,
+        "mixed_csv": mixed_csv,
+        "provider_export_provenance_complete": all_provider_claim_authority,
+        "report_claim_gates_allow_public_claim": report_claim_gates_allow_public_claim,
+        "public_claim_status": public_claim_status,
+        "public_claim_eligible": public_claim_eligible,
+        "target_keys": [f"{row.result.task_id}/{row.result.variant}" for row in replay_rows],
+        "claim_boundary": REPLAY_CLAIM_BOUNDARY,
+    }
+    return report
+
+
+def report_public_claim_status(report: dict[str, Any]) -> tuple[str, bool | None]:
+    if "public_claim_status" in report:
+        return str(report.get("public_claim_status")), bool(report.get("public_claim_eligible"))
+    return (
+        "csv_provenance_unknown_requires_original_evidence_or_trusted_ledger",
+        None,
+    )
+
+
+def markdown_value(value: Any) -> str:
+    if value is None:
+        return "n/a"
+    if isinstance(value, bool):
+        return "true" if value else "false"
+    if isinstance(value, float):
+        return f"{value:.6g}"
+    text = sanitize_note_text(value)
+    return text.replace("|", "\\|") or "n/a"
+
+
+def render_dashboard_markdown(report: dict[str, Any]) -> str:
+    public_claim_status, public_claim_eligible = report_public_claim_status(report)
+    metric_claim_status = report.get("raw_metric_claim_status", report.get("claim_status"))
+    lines = [
+        "# ContextGuard Benchmark Dashboard",
+        "",
+        f"- Schema: `{markdown_value(report.get('schema'))}`",
+        f"- Baseline variant: `{markdown_value(report.get('baseline_variant'))}`",
+        f"- Rows: {markdown_value(report.get('row_count'))}",
+        f"- Metric claim status: `{markdown_value(metric_claim_status)}`",
+        f"- Public claim status: `{markdown_value(public_claim_status)}`",
+        f"- Public claim eligible: `{markdown_value(public_claim_eligible)}`",
+        "",
+        "> Claim boundary: this dashboard is not a hosted savings claim unless report claim gates "
+        "allow it and public-claim provenance is complete. Proxy byte reductions are diagnostic "
+        "and are not hosted API token savings.",
+        "",
+        "## Variant summary",
+        "",
+        "| Variant | Runs | Successes | Failure rate | Tokens/success | Bytes saved | Token proxy saved | Quality notes |",
+        "| --- | ---: | ---: | ---: | ---: | ---: | ---: | --- |",
+    ]
+    summaries = report.get("summary_by_variant") if isinstance(report.get("summary_by_variant"), dict) else {}
+    comparison_by_variant = {
+        item.get("variant"): item
+        for item in (report.get("comparisons") if isinstance(report.get("comparisons"), list) else [])
+        if isinstance(item, dict)
+    }
+    for variant, summary in sorted(summaries.items()):
+        if not isinstance(summary, dict):
+            continue
+        comparison = comparison_by_variant.get(variant, {})
+        quality = comparison.get("quality_gate") if isinstance(comparison, dict) else None
+        if quality is None and summary.get("is_baseline_strategy"):
+            quality = "baseline"
+        lines.append(
+            "| "
+            + " | ".join([
+                markdown_value(variant),
+                markdown_value(summary.get("runs")),
+                markdown_value(summary.get("successful_runs")),
+                markdown_value(summary.get("failure_rate")),
+                markdown_value(summary.get("tokens_per_successful_task")),
+                markdown_value(summary.get("bytes_saved_successful")),
+                markdown_value(summary.get("token_proxy_saved_successful")),
+                markdown_value(quality),
+            ])
+            + " |"
+        )
+    lines.extend([
+        "",
+        "## Comparisons",
+        "",
+        "| Variant | Quality gate | Matched tasks | Token paired tasks | Token savings % | Shifted cost savings % |",
+        "| --- | --- | ---: | ---: | ---: | ---: |",
+    ])
+    comparisons = report.get("comparisons") if isinstance(report.get("comparisons"), list) else []
+    if comparisons:
+        for item in comparisons:
+            if not isinstance(item, dict):
+                continue
+            lines.append(
+                "| "
+                + " | ".join([
+                    markdown_value(item.get("variant")),
+                    markdown_value(item.get("quality_gate")),
+                    markdown_value(item.get("matched_successful_task_count")),
+                    markdown_value(item.get("paired_token_task_count")),
+                    markdown_value(item.get("token_savings_pct")),
+                    markdown_value(item.get("cost_savings_pct_with_shift")),
+                ])
+                + " |"
+            )
+    else:
+        lines.append("| n/a | n/a | 0 | 0 | n/a | n/a |")
+    replay = report.get("replay_evidence") if isinstance(report.get("replay_evidence"), dict) else None
+    if replay is not None:
+        lines.extend([
+            "",
+            "## Replay evidence provenance",
+            "",
+            f"- Source types: `{markdown_value(', '.join(replay.get('source_types') or []))}`",
+            f"- Claim scopes: `{markdown_value(', '.join(replay.get('claim_scopes') or []))}`",
+            f"- Same-run complete: `{markdown_value(replay.get('same_run_complete'))}`",
+            f"- Mixed/pre-existing CSV: `{markdown_value(replay.get('mixed_csv'))}`",
+            f"- Boundary: {markdown_value(replay.get('claim_boundary'))}",
+        ])
+    else:
+        lines.extend([
+            "",
+            "## Provenance note",
+            "",
+            "- CSV-only dashboards have unknown public-claim provenance unless regenerated from "
+            "the original evidence JSONL or a future trusted provenance ledger.",
+        ])
+    lines.extend([
+        "",
+        "## Re-run context",
+        "",
+        "- Evidence replay: `context-guard-bench --tasks <tasks.json> --variants <variants.json> "
+        "--evidence-jsonl <evidence.jsonl> --csv <results.csv> --report-json <report.json> "
+        "--dashboard-md <dashboard.md>`",
+    ])
+    return "\n".join(lines) + "\n"
+
+
+def write_report_outputs(
+    csv_path: Path,
+    report_path: Path | None,
+    dashboard_path: Path | None,
+    baseline_variant: str,
+    *,
+    replay_rows: list[EvidenceReplayRow] | None = None,
+    mixed_csv: bool = False,
+) -> dict[str, Any]:
+    # Keep lock order stable across all derived writes: source CSV first, then
+    # report, then dashboard. Do not introduce a derived-output -> CSV path.
+    with csv_file_lock(csv_path, create_parent=True):
+        report = summarize_benchmark_rows(read_csv_rows(csv_path), baseline_variant)
+        if replay_rows is not None:
+            report = annotate_replay_report(report, replay_rows, mixed_csv=mixed_csv)
+        if report_path is not None:
+            with csv_file_lock(report_path, create_parent=True):
+                write_text_no_follow(
+                    report_path,
+                    json.dumps(report, ensure_ascii=False, indent=2, sort_keys=True) + "\n",
+                )
+        if dashboard_path is not None:
+            with csv_file_lock(dashboard_path, create_parent=True):
+                write_text_no_follow(dashboard_path, render_dashboard_markdown(report))
+    return report
+
+
 def write_report_json(csv_path: Path, report_path: Path, baseline_variant: str) -> dict[str, Any]:
     # Keep lock order stable across all report writes: source CSV first, derived
     # report second. Do not introduce a report -> CSV path; that can deadlock
     # concurrent report generation.
-    with csv_file_lock(csv_path, create_parent=True):
-        report = summarize_benchmark_rows(read_csv_rows(csv_path), baseline_variant)
-        with csv_file_lock(report_path, create_parent=True):
-            write_text_no_follow(
-                report_path,
-                json.dumps(report, ensure_ascii=False, indent=2, sort_keys=True) + "\n",
-            )
-    return report
+    return write_report_outputs(csv_path, report_path, None, baseline_variant)
 
 
 def sanitize_note_text(value: Any) -> str:
@@ -2351,8 +2972,18 @@ def existing_file_identity(path: Path) -> tuple[int, int] | None:
         os.close(fd)
 
 
-def validate_distinct_output_paths(csv_path: Path, ledger_path: Path | None, report_path: Path | None) -> None:
-    outputs = [("csv", csv_path), ("ledger-jsonl", ledger_path), ("report-json", report_path)]
+def validate_distinct_output_paths(
+    csv_path: Path,
+    ledger_path: Path | None,
+    report_path: Path | None,
+    dashboard_path: Path | None = None,
+) -> None:
+    outputs = [
+        ("csv", csv_path),
+        ("ledger-jsonl", ledger_path),
+        ("report-json", report_path),
+        ("dashboard-md", dashboard_path),
+    ]
     seen: dict[Path, str] = {}
     seen_identity: dict[tuple[int, int], str] = {}
     for label, path in outputs:
@@ -2391,12 +3022,16 @@ def main() -> int:
                         help="optional JSONL ledger path for cost-shift accounting per run")
     parser.add_argument("--report-json", default=None, type=Path,
                         help="optional A/B summary report JSON path generated from --csv after real runs")
+    parser.add_argument("--dashboard-md", default=None, type=Path,
+                        help="optional Markdown dashboard path generated from the benchmark report")
+    parser.add_argument("--evidence-jsonl", default=None, type=Path,
+                        help="optional validated run-evidence JSONL replay input; skips provider and success_command execution")
     parser.add_argument("--baseline-variant", default="baseline",
                         help="variant name used as the report baseline (default: baseline)")
     args = parser.parse_args()
 
     require_no_follow_file_ops_supported()
-    validate_distinct_output_paths(args.csv, args.ledger_jsonl, args.report_json)
+    validate_distinct_output_paths(args.csv, args.ledger_jsonl, args.report_json, args.dashboard_md)
 
     variants = parse_variants(args.variants)
     tasks = parse_tasks(args.tasks, variants=variants)
@@ -2411,6 +3046,61 @@ def main() -> int:
         for task, variant in targets
         if (task.id, variant.name) not in skip_keys
     ]
+    if args.evidence_jsonl is not None:
+        if args.dry_run:
+            for task, variant in targets:
+                if (task.id, variant.name) in skip_keys:
+                    print(f"skip {task.id}/{variant.name} (already in {args.csv})")
+                    continue
+                print(f"evidence replay dry-run: {task.id}/{variant.name} <- {args.evidence_jsonl}")
+            print("completed 0 run(s); results in (dry-run; no CSV writes)")
+            return 0
+        csv_had_preexisting_content = file_has_content_no_follow(args.csv)
+        evidence_rows = read_evidence_jsonl(args.evidence_jsonl)
+        evidence_by_key = validate_evidence_coverage(evidence_rows, runnable_targets)
+        claude_ver = "evidence-replay"
+        completed = 0
+        replay_rows_written: list[EvidenceReplayRow] = []
+        for task, variant in targets:
+            if (task.id, variant.name) in skip_keys:
+                print(f"skip {task.id}/{variant.name} (already in {args.csv})")
+                continue
+            evidence = evidence_by_key[(task.id, variant.name)]
+            print(f"replay {task.id}/{variant.name} ...", flush=True)
+            result = run_evidence_fixture(task, variant, evidence)
+            wrote = append_csv(args.csv, claude_ver, result, skip_existing=args.resume)
+            if wrote:
+                replay_rows_written.append(evidence)
+                if args.ledger_jsonl is not None:
+                    append_cost_shift_ledger(
+                        args.ledger_jsonl,
+                        claude_ver,
+                        result,
+                        replay_provenance=evidence.provenance_payload(),
+                    )
+            completed += 1
+            status = "ok" if result.success else "FAIL"
+            suffix = "" if wrote else " (CSV not updated; row already present)"
+            print(
+                f"  {status} tokens={sum(result.tokens.values())} cost=${result.cost_usd:.4f} "
+                f"wall_time={result.wall_time_seconds:.3f}s {sanitize_note_text(result.notes)}{suffix}"
+            )
+        if args.report_json is not None or args.dashboard_md is not None:
+            report = write_report_outputs(
+                args.csv,
+                args.report_json,
+                args.dashboard_md,
+                args.baseline_variant,
+                replay_rows=replay_rows_written,
+                mixed_csv=csv_had_preexisting_content or bool(skip_keys) or len(replay_rows_written) != int(completed),
+            )
+            if args.report_json is not None:
+                print(f"report {args.report_json}: {report['claim_status']}")
+            if args.dashboard_md is not None:
+                print(f"dashboard {args.dashboard_md}: {report_public_claim_status(report)[0]}")
+        print(f"completed {completed} run(s); results in {args.csv}")
+        return 0
+
     placeholder_targets = [
         f"{task.id}/{variant.name}"
         for task, variant in runnable_targets
@@ -2463,9 +3153,12 @@ def main() -> int:
             f"wall_time={result.wall_time_seconds:.3f}s {sanitize_note_text(result.notes)}{suffix}"
         )
     target = args.csv if not args.dry_run else "(dry-run; no CSV writes)"
-    if args.report_json is not None and not args.dry_run:
-        report = write_report_json(args.csv, args.report_json, args.baseline_variant)
-        print(f"report {args.report_json}: {report['claim_status']}")
+    if (args.report_json is not None or args.dashboard_md is not None) and not args.dry_run:
+        report = write_report_outputs(args.csv, args.report_json, args.dashboard_md, args.baseline_variant)
+        if args.report_json is not None:
+            print(f"report {args.report_json}: {report['claim_status']}")
+        if args.dashboard_md is not None:
+            print(f"dashboard {args.dashboard_md}: {report_public_claim_status(report)[0]}")
     print(f"completed {completed} run(s); results in {target}")
     return 0
 
diff --git a/scripts/prepublish_check.py b/scripts/prepublish_check.py
index c42db5b..2557322 100755
--- a/scripts/prepublish_check.py
+++ b/scripts/prepublish_check.py
@@ -227,6 +227,7 @@ def load_command_manifest():
     "docs/benchmark-fixtures/visual-ocr-cropped-ocr.prompt.example.md",
     "docs/benchmark-fixtures/token-savings-12task.tasks.example.json",
     "docs/benchmark-fixtures/token-savings-12task.variants.example.json",
+    "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl",
     "docs/benchmark-fixtures/token-savings-12task-baseline.prompt.example.md",
     "docs/benchmark-fixtures/token-savings-12task-contextguard.prompt.example.md",
     "package.json",
diff --git a/tests/test_context_guard_kit.py b/tests/test_context_guard_kit.py
index a40f514..41ea260 100644
--- a/tests/test_context_guard_kit.py
+++ b/tests/test_context_guard_kit.py
@@ -24224,8 +24224,397 @@ def test_benchmark_runner_rejects_overlapping_output_paths(self):
                         root / "results.csv",
                         root / "cost-shift.jsonl",
                         root / "report.json",
+                        root / "dashboard.md",
                     )
 
+                    with self.assertRaises(SystemExit) as dashboard_ctx:
+                        module.validate_distinct_output_paths(
+                            root / "results.csv",
+                            root / "cost-shift.jsonl",
+                            root / "report.json",
+                            root / "bench" / ".." / "results.csv",
+                        )
+                    self.assertIn("--dashboard-md must not point to the same path as --csv", str(dashboard_ctx.exception))
+
+    def test_benchmark_runner_replays_evidence_without_provider_and_writes_dashboard(self):
+        for script in BENCH_SCRIPTS:
+            with self.subTest(script=script):
+                with tempfile.TemporaryDirectory() as tmp:
+                    root = Path(tmp)
+                    placeholder = "python3 -c \"raise SystemExit('fixture-only placeholder: replace success_command before real benchmark runs')\""
+                    tasks_path = root / "tasks.json"
+                    variants_path = root / "variants.json"
+                    evidence_path = root / "evidence.jsonl"
+                    tasks_path.write_text(json.dumps([
+                        {
+                            "id": "t01",
+                            "prompt": "fixture prompt",
+                            "model": "sonnet",
+                            "effort": "medium",
+                            "max_turns": 1,
+                            "success_command": placeholder,
+                            "success_cwd": ".",
+                        }
+                    ]), encoding="utf-8")
+                    variants_path.write_text(json.dumps([
+                        {"name": "baseline", "extra_args": []},
+                        {"name": "optimized", "extra_args": []},
+                    ]), encoding="utf-8")
+
+                    def evidence_row(variant: str, input_tokens: int, output_tokens: int, bytes_after: int) -> dict:
+                        return {
+                            "schema_version": "contextguard.bench.run-evidence.v1",
+                            "task_id": "t01",
+                            "variant": variant,
+                            "model": "sonnet",
+                            "effort": "medium",
+                            "success": True,
+                            "tokens": {"input_tokens": input_tokens, "output_tokens": output_tokens},
+                            "primary_tokens_measured": True,
+                            "cost_usd": 0.123,
+                            "cost_measured": True,
+                            "external_tokens": 0,
+                            "external_tokens_measured": True,
+                            "external_cost_usd": 0,
+                            "external_cost_measured": True,
+                            "bytes_before": 1000,
+                            "bytes_after": bytes_after,
+                            "corrections": 0,
+                            "notes": f"synthetic {variant}",
+                            "provenance": {
+                                "evidence_source_type": "synthetic_fixture",
+                                "capture_command_or_export_id": "unit-test-fixture",
+                                "claim_scope": "local_replay_fixture_not_public_claim",
+                            },
+                        }
+
+                    evidence_path.write_text(
+                        "\n".join([
+                            json.dumps(evidence_row("baseline", 100, 20, 1000)),
+                            json.dumps(evidence_row("optimized", 50, 10, 200)),
+                        ]) + "\n",
+                        encoding="utf-8",
+                    )
+                    dry_csv = root / "dry-results.csv"
+                    dry_proc = subprocess.run(
+                        [
+                            sys.executable,
+                            str(script),
+                            "--tasks",
+                            str(tasks_path),
+                            "--variants",
+                            str(variants_path),
+                            "--csv",
+                            str(dry_csv),
+                            "--evidence-jsonl",
+                            str(evidence_path),
+                            "--dry-run",
+                            "--claude-bin",
+                            str(root / "missing-claude"),
+                        ],
+                        text=True,
+                        capture_output=True,
+                        check=True,
+                    )
+                    self.assertIn("evidence replay dry-run", dry_proc.stdout)
+                    self.assertFalse(dry_csv.exists())
+                    self.assertFalse((root / "dry-results.csv.lock").exists())
+
+                    csv_path = root / "results.csv"
+                    ledger_path = root / "ledger.jsonl"
+                    report_path = root / "report.json"
+                    dashboard_path = root / "dashboard.md"
+                    proc = subprocess.run(
+                        [
+                            sys.executable,
+                            str(script),
+                            "--tasks",
+                            str(tasks_path),
+                            "--variants",
+                            str(variants_path),
+                            "--csv",
+                            str(csv_path),
+                            "--evidence-jsonl",
+                            str(evidence_path),
+                            "--ledger-jsonl",
+                            str(ledger_path),
+                            "--report-json",
+                            str(report_path),
+                            "--dashboard-md",
+                            str(dashboard_path),
+                            "--claude-bin",
+                            str(root / "missing-claude"),
+                        ],
+                        text=True,
+                        capture_output=True,
+                        check=True,
+                    )
+                    self.assertIn("replay t01/baseline", proc.stdout)
+                    self.assertIn("dashboard", proc.stdout)
+                    with csv_path.open(encoding="utf-8", newline="") as f:
+                        rows = list(csv.DictReader(f))
+                    self.assertEqual(len(rows), 2)
+                    self.assertTrue(all(row["claude_version"] == "evidence-replay" for row in rows))
+                    self.assertTrue(all(row["primary_tokens_measured"] == "false" for row in rows))
+                    self.assertTrue(all(row["cost_measured"] == "false" for row in rows))
+
+                    ledger_rows = [
+                        json.loads(line)
+                        for line in ledger_path.read_text(encoding="utf-8").splitlines()
+                        if line.strip()
+                    ]
+                    self.assertEqual(len(ledger_rows), 2)
+                    self.assertEqual(ledger_rows[0]["evidence_source_type"], "synthetic_fixture")
+                    self.assertFalse(ledger_rows[0]["public_claim_eligible"])
+                    self.assertIn("replay_provenance", ledger_rows[0])
+
+                    report = json.loads(report_path.read_text(encoding="utf-8"))
+                    self.assertEqual(report["claim_status"], "replay_only_not_public_claim")
+                    self.assertEqual(report["raw_metric_claim_status"], "insufficient_paired_data")
+                    self.assertEqual(report["public_claim_status"], "replay_only_not_public_claim")
+                    self.assertFalse(report["public_claim_eligible"])
+                    self.assertEqual(report["replay_evidence"]["source_types"], ["synthetic_fixture"])
+
+                    dashboard = dashboard_path.read_text(encoding="utf-8")
+                    self.assertIn("Claim boundary", dashboard)
+                    self.assertIn("Quality gate", dashboard)
+                    self.assertIn("context-guard-bench --tasks", dashboard)
+                    self.assertIn("--evidence-jsonl", dashboard)
+
+                    resumed_report = root / "resumed-report.json"
+                    subprocess.run(
+                        [
+                            sys.executable,
+                            str(script),
+                            "--tasks",
+                            str(tasks_path),
+                            "--variants",
+                            str(variants_path),
+                            "--csv",
+                            str(csv_path),
+                            "--evidence-jsonl",
+                            str(evidence_path),
+                            "--report-json",
+                            str(resumed_report),
+                            "--resume",
+                        ],
+                        text=True,
+                        capture_output=True,
+                        check=True,
+                    )
+                    resumed = json.loads(resumed_report.read_text(encoding="utf-8"))
+                    self.assertEqual(resumed["claim_status"], "unknown_mixed_csv")
+                    self.assertFalse(resumed["public_claim_eligible"])
+
+                    no_evidence_proc = subprocess.run(
+                        [
+                            sys.executable,
+                            str(script),
+                            "--tasks",
+                            str(tasks_path),
+                            "--variants",
+                            str(variants_path),
+                            "--csv",
+                            str(root / "no-evidence.csv"),
+                            "--claude-bin",
+                            str(root / "missing-claude"),
+                        ],
+                        text=True,
+                        capture_output=True,
+                    )
+                    self.assertEqual(no_evidence_proc.returncode, 2)
+                    self.assertIn("fixture-only placeholder", no_evidence_proc.stderr)
+
+    def test_benchmark_runner_evidence_replay_validation_fails_closed(self):
+        for index, script in enumerate(BENCH_SCRIPTS):
+            with self.subTest(script=script):
+                module = load_python_script_module(script, f"_bench_runner_evidence_validation_{index}")
+                with tempfile.TemporaryDirectory() as tmp:
+                    root = Path(tmp)
+                    evidence_path = root / "evidence.jsonl"
+
+                    def good_row(**updates):
+                        row = {
+                            "schema_version": "contextguard.bench.run-evidence.v1",
+                            "task_id": "t01",
+                            "variant": "baseline",
+                            "success": True,
+                            "tokens": {"input_tokens": 100, "output_tokens": 20},
+                            "primary_tokens_measured": False,
+                            "cost_usd": 0.0,
+                            "cost_measured": False,
+                            "provenance": {
+                                "evidence_source_type": "synthetic_fixture",
+                                "claim_scope": "local_replay_fixture_not_public_claim",
+                            },
+                        }
+                        row.update(updates)
+                        return row
+
+                    bad_cases = {
+                        "schema": good_row(schema_version="wrong"),
+                        "missing_provenance": {k: v for k, v in good_row().items() if k != "provenance"},
+                        "negative_metric": good_row(bytes_after=-1),
+                    }
+                    for name, row in bad_cases.items():
+                        with self.subTest(case=name):
+                            evidence_path.write_text(json.dumps(row) + "\n", encoding="utf-8")
+                            with self.assertRaises(SystemExit):
+                                module.read_evidence_jsonl(evidence_path)
+
+                    evidence_path.write_text(
+                        json.dumps(good_row(cost_usd=float("nan"))) + "\n",
+                        encoding="utf-8",
+                    )
+                    with self.assertRaises(SystemExit):
+                        module.read_evidence_jsonl(evidence_path)
+
+                    manual = good_row(
+                        primary_tokens_measured=True,
+                        cost_measured=True,
+                        cost_usd=1.23,
+                        provenance={
+                            "evidence_source_type": "manual_audit",
+                            "claim_scope": "manual_check_not_public_claim",
+                        },
+                    )
+                    evidence_path.write_text(json.dumps(manual) + "\n", encoding="utf-8")
+                    parsed = module.read_evidence_jsonl(evidence_path)[0]
+                    self.assertFalse(parsed.result.primary_tokens_measured)
+                    self.assertFalse(parsed.result.cost_measured)
+                    self.assertFalse(parsed.public_claim_eligible)
+
+                    def provider_row(variant, *, input_tokens, output_tokens, cost_usd, corrections=0,
+                                     measured=True):
+                        return {
+                            "schema_version": "contextguard.bench.run-evidence.v1",
+                            "task_id": "t01",
+                            "variant": variant,
+                            "success": True,
+                            "tokens": {"input_tokens": input_tokens, "output_tokens": output_tokens},
+                            "primary_tokens_measured": measured,
+                            "cost_usd": cost_usd,
+                            "cost_measured": measured,
+                            "external_tokens": 0,
+                            "external_tokens_measured": True,
+                            "external_cost_usd": 0,
+                            "external_cost_measured": True,
+                            "bytes_before": 1000,
+                            "bytes_after": 800 if variant == "optimized" else 1000,
+                            "corrections": corrections,
+                            "provenance": {
+                                "evidence_source_type": "provider_export",
+                                "provider_name": "unit-provider",
+                                "capture_command_or_export_id": "export-123",
+                                "claim_scope": "provider_measured_matched_task_public_claim",
+                            },
+                        }
+
+                    def csv_rows_from_replay(replay_rows):
+                        csv_rows = []
+                        for replay in replay_rows:
+                            result = replay.result
+                            shifted = (
+                                result.cost_measured
+                                and result.external_tokens_measured
+                                and (result.external_tokens == 0 or result.external_cost_measured)
+                            )
+                            csv_rows.append({
+                                "task_id": result.task_id,
+                                "variant": result.variant,
+                                "success": "true" if result.success else "false",
+                                "total_tokens": str(sum(result.tokens.values())),
+                                "primary_tokens_measured": "true" if result.primary_tokens_measured else "false",
+                                "cost_usd": f"{result.cost_usd:.6f}",
+                                "cost_measured": "true" if result.cost_measured else "false",
+                                "external_tokens": str(result.external_tokens),
+                                "external_tokens_measured": "true" if result.external_tokens_measured else "false",
+                                "external_cost_usd": f"{result.external_cost_usd:.6f}",
+                                "external_cost_measured": "true" if result.external_cost_measured else "false",
+                                "total_cost_with_shift_usd": (
+                                    f"{(result.cost_usd + result.external_cost_usd):.6f}" if shifted else ""
+                                ),
+                                "bytes_before": str(result.bytes_before),
+                                "bytes_after": str(result.bytes_after),
+                                "corrections": str(result.corrections),
+                            })
+                        return csv_rows
+
+                    evidence_path.write_text(
+                        "\n".join([
+                            json.dumps(provider_row("baseline", input_tokens=100, output_tokens=20, cost_usd=0.12, measured=False)),
+                            json.dumps(provider_row("optimized", input_tokens=50, output_tokens=10, cost_usd=0.06, measured=False)),
+                        ]) + "\n",
+                        encoding="utf-8",
+                    )
+                    provider_incomplete = module.read_evidence_jsonl(evidence_path)
+                    incomplete_report = module.annotate_replay_report(
+                        module.summarize_benchmark_rows(csv_rows_from_replay(provider_incomplete), "baseline"),
+                        provider_incomplete,
+                        mixed_csv=False,
+                    )
+                    self.assertEqual(incomplete_report["raw_metric_claim_status"], "insufficient_paired_data")
+                    self.assertEqual(incomplete_report["claim_status"], "provider_export_claim_gates_not_met")
+                    self.assertFalse(incomplete_report["public_claim_eligible"])
+
+                    evidence_path.write_text(
+                        "\n".join([
+                            json.dumps(provider_row("baseline", input_tokens=100, output_tokens=20, cost_usd=0.12)),
+                            json.dumps(provider_row("optimized", input_tokens=50, output_tokens=10, cost_usd=0.06, corrections=1)),
+                        ]) + "\n",
+                        encoding="utf-8",
+                    )
+                    provider_quality_regression = module.read_evidence_jsonl(evidence_path)
+                    quality_report = module.annotate_replay_report(
+                        module.summarize_benchmark_rows(csv_rows_from_replay(provider_quality_regression), "baseline"),
+                        provider_quality_regression,
+                        mixed_csv=False,
+                    )
+                    self.assertEqual(quality_report["raw_metric_claim_status"], "quality_gate_watch")
+                    self.assertEqual(quality_report["claim_status"], "provider_export_claim_gates_not_met")
+                    self.assertFalse(quality_report["public_claim_eligible"])
+
+                    evidence_path.write_text(
+                        "\n".join([
+                            json.dumps(provider_row("baseline", input_tokens=100, output_tokens=20, cost_usd=0.12)),
+                            json.dumps(provider_row("optimized", input_tokens=50, output_tokens=10, cost_usd=0.06)),
+                        ]) + "\n",
+                        encoding="utf-8",
+                    )
+                    provider_complete = module.read_evidence_jsonl(evidence_path)
+                    complete_report = module.annotate_replay_report(
+                        module.summarize_benchmark_rows(csv_rows_from_replay(provider_complete), "baseline"),
+                        provider_complete,
+                        mixed_csv=False,
+                    )
+                    self.assertEqual(
+                        complete_report["raw_metric_claim_status"],
+                        "token_and_shifted_cost_savings_observed",
+                    )
+                    self.assertEqual(complete_report["claim_status"], "token_and_shifted_cost_savings_observed")
+                    self.assertEqual(complete_report["public_claim_status"], "provider_export_public_claim_candidate")
+                    self.assertTrue(complete_report["public_claim_eligible"])
+
+                    duplicate = "\n".join([json.dumps(good_row()), json.dumps(good_row())]) + "\n"
+                    evidence_path.write_text(duplicate, encoding="utf-8")
+                    rows = module.read_evidence_jsonl(evidence_path)
+                    with self.assertRaises(SystemExit):
+                        module.validate_evidence_coverage(
+                            rows,
+                            [(module.TaskFixture(id="t01", prompt="x"), module.Variant(name="baseline"))],
+                        )
+
+                    evidence_path.write_text(json.dumps(good_row()) + "\n", encoding="utf-8")
+                    rows = module.read_evidence_jsonl(evidence_path)
+                    with self.assertRaises(SystemExit):
+                        module.validate_evidence_coverage(
+                            rows,
+                            [
+                                (module.TaskFixture(id="t01", prompt="x"), module.Variant(name="baseline")),
+                                (module.TaskFixture(id="t01", prompt="x"), module.Variant(name="optimized")),
+                            ],
+                        )
+
     def test_benchmark_runner_preflight_fails_unsupported_platform_before_file_io(self):
         module = load_module_from_path(KIT_DIR / "benchmark_runner.py", "_bench_runner_unsupported_platform")
         with tempfile.TemporaryDirectory() as tmp:
@@ -25212,6 +25601,8 @@ def _combined_experimental_benchmark_fixture_text(self, guide, fixture_dir, fixt
         for task_path, variant_path in fixture_pairs.values():
             combined += "\n" + task_path.read_text(encoding="utf-8").lower()
             combined += "\n" + variant_path.read_text(encoding="utf-8").lower()
+        for evidence_path in sorted(fixture_dir.glob("*.example.jsonl")):
+            combined += "\n" + evidence_path.read_text(encoding="utf-8").lower()
         for prompt_path in sorted(fixture_dir.glob("*.prompt.example.md")):
             combined += "\n" + prompt_path.read_text(encoding="utf-8").lower()
         return combined
@@ -25232,6 +25623,7 @@ def test_experimental_benchmark_fixtures_are_packaged_and_linked(self):
         package_files = set(json.loads((ROOT / "package.json").read_text(encoding="utf-8"))["files"])
         self.assertIn("docs/experimental-benchmark-fixtures.md", package_files)
         self.assertIn("docs/benchmark-fixtures/*.example.json", package_files)
+        self.assertIn("docs/benchmark-fixtures/*.example.jsonl", package_files)
         self.assertIn("docs/benchmark-fixtures/*.prompt.example.md", package_files)
 
         prepublish = (ROOT / "scripts" / "prepublish_check.py").read_text(encoding="utf-8")
@@ -25251,6 +25643,7 @@ def test_experimental_benchmark_fixtures_are_packaged_and_linked(self):
             "docs/benchmark-fixtures/visual-ocr-cropped-ocr.prompt.example.md",
             "docs/benchmark-fixtures/token-savings-12task.tasks.example.json",
             "docs/benchmark-fixtures/token-savings-12task.variants.example.json",
+            "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl",
             "docs/benchmark-fixtures/token-savings-12task-baseline.prompt.example.md",
             "docs/benchmark-fixtures/token-savings-12task-contextguard.prompt.example.md",
             'ROOT / "docs" / "experimental-benchmark-fixtures.md"',
@@ -25339,8 +25732,17 @@ def test_experimental_benchmark_fixtures_parse_and_bind_prompt_files(self):
     def test_token_savings_12task_fixture_parses_and_generates_claim_safe_report(self):
         fixture_dir, _guide, fixture_pairs = self._experimental_benchmark_fixture_paths()
         task_path, variant_path = fixture_pairs["token_savings"]
+        evidence_path = fixture_dir / "token-savings-12task.evidence.example.jsonl"
         task_raw = json.loads(task_path.read_text(encoding="utf-8"))
         self.assertEqual(len(task_raw), 12)
+        evidence_raw = [
+            json.loads(line)
+            for line in evidence_path.read_text(encoding="utf-8").splitlines()
+            if line.strip()
+        ]
+        self.assertEqual(len(evidence_raw), 24)
+        self.assertTrue(all(row["provenance"]["evidence_source_type"] == "synthetic_fixture" for row in evidence_raw))
+        self.assertTrue(all(row["primary_tokens_measured"] is False for row in evidence_raw))
         expected_categories = {
             "bugfix",
             "exploration",
@@ -25415,6 +25817,42 @@ def test_token_savings_12task_fixture_parses_and_generates_claim_safe_report(sel
                 parsed_tasks = module.parse_tasks(task_path, variants=parsed_variants)
                 self.assertEqual(len(parsed_tasks), 12)
                 self.assertTrue(all(module.is_placeholder_success_command(task.success_command) for task in parsed_tasks))
+                replay_rows = module.read_evidence_jsonl(evidence_path)
+                replay_targets = module.filter_targets(parsed_tasks, parsed_variants, None, None)
+                replay_by_key = module.validate_evidence_coverage(replay_rows, replay_targets)
+                self.assertEqual(len(replay_by_key), 24)
+                self.assertTrue(all(row.source_type == "synthetic_fixture" for row in replay_rows))
+                self.assertTrue(all(not row.result.primary_tokens_measured for row in replay_rows))
+                self.assertTrue(all(not row.result.cost_measured for row in replay_rows))
+                replay_report = module.annotate_replay_report(
+                    module.summarize_benchmark_rows(
+                        [
+                            {
+                                "task_id": row.result.task_id,
+                                "variant": row.result.variant,
+                                "success": "true" if row.result.success else "false",
+                                "total_tokens": str(sum(row.result.tokens.values())),
+                                "primary_tokens_measured": "false",
+                                "cost_usd": f"{row.result.cost_usd:.6f}",
+                                "cost_measured": "false",
+                                "external_tokens": str(row.result.external_tokens),
+                                "external_tokens_measured": "true" if row.result.external_tokens_measured else "false",
+                                "external_cost_usd": f"{row.result.external_cost_usd:.6f}",
+                                "external_cost_measured": "true" if row.result.external_cost_measured else "false",
+                                "total_cost_with_shift_usd": "",
+                                "bytes_before": str(row.result.bytes_before),
+                                "bytes_after": str(row.result.bytes_after),
+                                "corrections": str(row.result.corrections),
+                            }
+                            for row in replay_rows
+                        ],
+                        "baseline_full_context_fixture",
+                    ),
+                    replay_rows,
+                    mixed_csv=False,
+                )
+                self.assertEqual(replay_report["claim_status"], "replay_only_not_public_claim")
+                self.assertEqual(replay_report["public_claim_status"], "replay_only_not_public_claim")
                 report = module.summarize_benchmark_rows(rows, "baseline_full_context_fixture")
                 self.assertEqual(report["schema"], "context-guard-bench-report-v1")
                 self.assertEqual(report["claim_status"], "token_and_shifted_cost_savings_observed")