Skip to content

feat: better cache utilization through prefix stability#85

Open
Dev-iL wants to merge 1 commit into
doodledood:mainfrom
Dev-iL:2604/prompt_cache_claude
Open

feat: better cache utilization through prefix stability#85
Dev-iL wants to merge 1 commit into
doodledood:mainfrom
Dev-iL:2604/prompt_cache_claude

Conversation

@Dev-iL
Copy link
Copy Markdown
Contributor

@Dev-iL Dev-iL commented Apr 16, 2026

Closes: #84

Summary

/verify now composes every verifier prompt with the manifest inlined as a byte-identical static prefix, and orders launches so the first agent in each (agent_type, model) group primes Claude's prompt cache while the rest hit it. On a typical /do → /verify fanout, the manifest is the dominant repeated context — caching it across grouped agents cuts input-token cost without changing what each verifier sees.

Caching is unconditional and invisible: there is no flag, no schema field, no warmup probe. When the manifest is below the model's minimum cacheable token threshold, the same launch order runs without cache reads — the prefix silently does not activate.

What changed

  • Prompt shape — every verifier prompt starts with the full manifest content as a shared static prefix; per-criterion data (ID, description, verification method, prompt/command) follows. The boundary is sharp: everything before the criterion-specific section is identical across all agents in a cache group.
  • Snapshot discipline/verify reads the manifest exactly once per pass and reuses that content verbatim for every verifier in the pass. No mid-pass re-reads, so the prefix stays byte-identical.
  • Grouped launch — the current phase's criteria are partitioned by (agent_type, model). Within a group, the first agent is launched alone to write the cache; remaining agents launch up to the active mode's concurrency cap and hit the cached prefix. Groups targeting different models can run concurrently; groups on the same model run sequentially so they can share system-prompt cache too.
  • Mode interaction — cache grouping reorders launches within the active execution mode's concurrency cap. It does not raise concurrency, does not change verifier eligibility, model routing, fix-loop limits, or escalation rules. thorough launches all remaining group members at once; balanced batches of 4; efficient stays sequential (and accepts reduced cache benefit).
  • Observability — before launching verifiers in each phase, /verify's narrative output lists the cache groups and launch order in a few terse lines, as an audit trail for cache effectiveness.
  • Dropped redundancy — the per-prompt Optional context — manifest: <path> reference line is omitted for the manifest itself, since the manifest content is already inlined as the prefix. Other context paths (e.g., a discovery log) still append normally.

Files changed

File Change
claude-plugins/manifest-dev/skills/verify/references/CACHING.md New — single-shape prompt composition, grouping rules, launch strategy, mode interaction, silent no-op below threshold
claude-plugins/manifest-dev/skills/verify/SKILL.md Manifest-Prefix Caching section, cache-aware prompt composition, grouped launch strategy, cache observability
claude-plugins/manifest-dev/.claude-plugin/plugin.json Version 0.111.10.112.0
README.md (root) /verify row notes the shared cached prefix
claude-plugins/manifest-dev/README.md /verify row notes the shared cached prefix

Design notes

No flag, no schema field, no warmup. Earlier iterations of this PR exposed --cache none|manifest|max, a Cache: field in the manifest Intent schema, and a cache-warmup agent that gauged context size before launching real verifiers. The do-log removal in 3a8ad15 made max mode dead code (no execution log to cache), and the warmup agent earned its keep only when the user could choose a wrong strategy. Once the strategy menu collapsed to "always cache the manifest," the only honest behaviors below threshold are silent no-op or unconditional launch — both are what the cache hardware already does without a probe. The result is one prompt shape, one launch ordering, and no opt-in surface.

Why per-phase grouping. Phases already partition criteria by iteration speed (fast checks first, e2e/deploy later). Computing cache groups within each phase keeps the grouping aligned with the existing launch boundaries and avoids reshuffling launches across phase walls.

Snapshot once per pass. The cache key is a byte-for-byte match on the prefix. Re-reading the manifest between verifier launches — even when the file hasn't changed on disk — risks whitespace or encoding drift breaking the match. Reading once and reusing the buffer eliminates that class of bug.

@doodledood
Copy link
Copy Markdown
Owner

Hey did you test this? I'm really not confident it will actually utilize cache properly

Claude Code already utilizes cache heavily by itself when it makes sense as far as I know

The concern is that it's not deterministic

@Dev-iL
Copy link
Copy Markdown
Contributor Author

Dev-iL commented Apr 22, 2026

Hey did you test this? I'm really not confident it will actually utilize cache properly

Claude Code already utilizes cache heavily by itself when it makes sense as far as I know

The concern is that it's not deterministic

Indeed, this is one of the concerns I mentioned in #84. Let me ask you this - what test would convince you that this works? /do-ing the same manifest in two separate sessions, once with "max" cache and once with no cache and comparing token usage?

@Dev-iL
Copy link
Copy Markdown
Contributor Author

Dev-iL commented Apr 22, 2026

The only objective hint that this might work comes from the leaked Claude code sources.

According to perplexity:

The strongest clue from the leaked-source discussion is that Claude Code appears to be engineered around a stable/dynamic prompt boundary, where the static front portion is reused through prompt caching and the dynamic tail changes per turn 1. That directly supports the PR’s idea of putting shared manifest/context content in a byte-identical prefix and moving per-criterion data to the end 3.

Source patterns that support the PR

  • The leaked-source analysis says Claude Code uses a pattern like SYSTEM_PROMPT_DYNAMIC_BOUNDARY, splitting the system prompt into a cached static portion and a changing portion 1.
  • Another analysis of the leaked code says Claude Code has “aggressive prompt cache reuse via a stable/dynamic prompt boundary,” which is exactly the mechanism this PR is trying to exploit with grouped launches and shared prefixes 2.
  • A separate writeup notes that prompt reuse is extremely high and that warm-up calls can prime the cache before real work begins, which aligns with the PR’s warmup agent design 5.
  • The prompt-caching docs say the cache matches from the beginning of the request, and any byte change in the prefix can invalidate reuse, which supports the PR’s insistence on byte-identical group prefixes 3.

Why that matters for this PR

The PR’s manifest and max modes try to keep the shared context stable while varying only the final per-criterion parts, which is exactly the condition needed for prefix caching to pay off 3. Its warmup step also matches the “prime the cache first” pattern described in leaked-source commentary about Claude Code’s reuse behavior 5. In other words, the PR is trying to encode the same optimization principle the leaked source appears to rely on internally: static first, dynamic last 1.

Caveat

The leaked-source commentary suggests the pattern, but it does not prove that this specific PR will yield the claimed savings in your environment, especially because cross-agent cache sharing may depend on how Claude Code actually partitions conversations 2. So the leaked source is a good mechanistic justification, not a guarantee of impact 1.

@Dev-iL Dev-iL force-pushed the 2604/prompt_cache_claude branch 3 times, most recently from 62376bf to 800c400 Compare April 27, 2026 06:31
@doodledood
Copy link
Copy Markdown
Owner

I'm not sure this can work, maybe im not getting something but they do an aggressive prefix cache already as you are talking in CC it's out of the box

I'm not sure you can influence caching from externally like this

@Dev-iL
Copy link
Copy Markdown
Contributor Author

Dev-iL commented Apr 30, 2026

We might be able to test this empirically by running the same manifest in two separate sessions one with cache and one without, but I can't find a window with spare tokens... Perhaps I'll try to run in on Haiku end to end. The /usage statistics are quite helpful in this regard.

Got any tips on how to create a dummy manifest with a lot of validations so we can stress test this on different agents? I suspect the effectiveness will be different between agents.

@Dev-iL Dev-iL force-pushed the 2604/prompt_cache_claude branch from 800c400 to 6d34afd Compare April 30, 2026 07:17
@Dev-iL
Copy link
Copy Markdown
Contributor Author

Dev-iL commented Apr 30, 2026

Looking at the caching docs of the main providers, there is definitely a benefit in structuring the context in a certain way (static manifest first, append-only /do log).

Claude Code Caching Best Practices:

  • Cache stable, reusable content like system instructions, background information, large contexts, or frequent tool definitions.
  • Place cached content at the prompt's beginning for best performance.
  • Regularly analyze cache hit rates and adjust your strategy as needed.
    ...
  • Coding assistants: Improve autocomplete and codebase Q&A by keeping relevant sections or a summarized version of the codebase in the prompt.
  • Detailed instruction sets: Share extensive lists of instructions, procedures, and examples to fine-tune Claude's responses. Developers often include an example or two in the prompt, but with prompt caching you can get even better performance by including 20+ diverse examples of high quality answers.
  • Talk to books, papers, documentation, podcast transcripts, and other longform content: Bring any knowledge base alive by embedding the entire document(s) into the prompt, and letting users ask it questions.

Gemini Implicit Caching:

To increase the chance of an implicit cache hit:

  • Try putting large and common contents at the beginning of your prompt
  • Try to send requests with similar prefix in a short amount of time

Codex Caching Best Practices:

  • Structure prompts with static or repeated content at the beginning and dynamic, user-specific content at the end.
  • ...
  • Maintain a steady stream of requests with identical prompt prefixes to minimize cache evictions and maximize caching benefits.

I have incorporated the above in the second commit of this PR, and will work on doing that test.

@Dev-iL
Copy link
Copy Markdown
Contributor Author

Dev-iL commented Apr 30, 2026

@doodledood Below is the testing plan. WDYT?

Manifest

# Definition: Cache Stress-Test Harness for /verify Caching Paths

## 1. Intent & Context
- **Goal:** Create a benchmark test harness — fixture files and a session test matrix in `tests/cache-fixtures/` — that deliberately exercises every code path in `/verify`'s caching implementation. The test matrix defines (coding agent × cache strategy × mode) sessions for the user to run manually, comparing `cache_read_input_tokens` across runs (checked via `/usage` on Claude or `/stats` on Codex/Gemini). Three coding agents: Claude Haiku 4.5, Gemini gemini-3.1-flash-lite-preview, Codex gpt-5.4-mini.
- **Mental Model:** The manifest itself is the primary stress instrument. When `/verify` inlines it as the shared prefix snapshot (`Cache: max`), its ~4,000+ token content reliably exceeds both the Sonnet 2,048-token and Haiku 4,096-token thresholds. The 16-agent `(criteria-checker, inherit)` group in phase 1 is the primary cache-read factory — each agent after the first should read rather than write the cached prefix. Five single-agent named-subagent INV-G* entries verify that isolated groups (different system prompts) still launch correctly without cross-group contamination. Two explicit model overrides (`model: haiku`, `model: sonnet`) create separate cache namespaces, testing that cache isolation by model works. A phase-2 set recomputes groups independently from phase 1. A deferred-auto AC triggers the Deferred-Pending Escalation path and anchors R-4 empirical validation.

  Cache groups this manifest produces at runtime (Phase 1):

  | Cache Group (agent_type, model) | Criteria count | Purpose |
  |----------------------------------|----------------|---------|
  | (criteria-checker, inherit) | 16 | Primary stress group — maximum cache reads after first warmup |
  | (change-intent-reviewer, inherit) | 1 | Single-agent named subagent group |
  | (docs-reviewer, inherit) | 1 | Single-agent named subagent group |
  | (prose-value-reviewer, inherit) | 1 | Single-agent named subagent group |
  | (context-file-adherence-reviewer, inherit) | 1 | Single-agent named subagent group |
  | (criteria-checker, haiku) | 1 | Cross-model group — isolated from inherit; collapses into inherit when session model IS Haiku |
  | (criteria-checker, sonnet) | 1 | Cross-model group — isolated from inherit; behavior on Gemini/Codex is a test outcome |

  **When the coding agent is Claude Haiku 4.5:** `model: inherit` resolves to haiku, so `(criteria-checker, haiku)` and `(criteria-checker, inherit)` share the same cache namespace — the "separate group" collapses. This is intentional: it tests the group-collapse path.

  **When the coding agent is Gemini or Codex:** `model: haiku` and `model: sonnet` are Claude-specific overrides. The behavior (error, fallback to inherit, or ignored) is a measured test outcome — not pre-determined.

- **Mode:** thorough
- **Interview:** autonomous
- **Medium:** local
- **Cache:** max

## 2. Approach

*Initial direction, not rigid plan. Provides enough to start confidently; expect adjustment when reality diverges.*

- **Architecture:**
  - New directory: `tests/cache-fixtures/` (existing `tests/` dir, new subdirectory — no production paths touched)
  - D1 (`large-fixture.md`): a large realistic-manifest-format document that reliably exceeds both Sonnet's 2,048 and Haiku's 4,096-token thresholds when used as a cache prefix snapshot
  - D2 (`warmup-scenarios.md`): reference document for all warmup agent decision paths (above threshold, below threshold, API fallback, crash fallback)
  - D3 (`grouping-matrix.md`): a table-based reference showing expected `(agent_type, model)` groups for six representative criterion configurations
  - D4 (`test-matrix.md`): a test matrix document defining 10 sessions — (coding agent × cache strategy) combinations — for the user to run manually in separate sessions, one row per session. Includes expected cache group behavior, amendment needed per session (to change the manifest's `Cache:` field), and `/usage` or `/stats` result column.
  - D5: deferred-auto AC for post-matrix empirical validation (R-4)

- **Execution Order:**
  - D1 → D2, D3 (parallel) → D4 → D5 (deferred)
  - Rationale: D1's large fixture is a dependency for D2 and D3 content accuracy (D2 references the fixture's expected token count; D3 uses it as an example). D4 references all prior deliverables to construct the session matrix. D5 is user-triggered after completing the test sessions.

- **Risk Areas:**
  - [R-1] Fixture too small — `large-fixture.md` terse enough to fall below Haiku's 4,096-token threshold | Detect: AC-1.3 char count < 15,000, AC-1.4 token count API < 4,096
  - [R-2] Wrong threshold values in warmup scenarios doc — thresholds drift if CACHING.md changes | Detect: AC-2.1 codebase check against exact values
  - [R-3] `model: haiku` and `model: sonnet` overrides on Gemini/Codex CLIs produce unknown behavior — they're Claude-specific model IDs | Detect: observed via session results in D4 matrix; non-Claude sessions report whether criteria with model overrides errored, fell back to inherit, or used an equivalent
  - [R-4] Cross-agent cache sharing not supported via Agent tool — the core assumption of the mechanism | Detect: AC-5.1 deferred-auto empirical check via /usage or /stats

- **Trade-offs:**
  - [T-1] Fixture completeness vs. brevity → Prefer completeness; fixture must exceed Haiku threshold (4,096 tokens) reliably; err toward more content
  - [T-2] Test matrix breadth vs. session count → Prefer a focused 10-session matrix (3 agents × 3 cache strategies + 1 efficient-mode variant) over a full 27-combination factorial; 10 sessions is runnable by one person in a day

## 3. Global Invariants (The Constitution)

*Rules that apply to the ENTIRE execution. If these fail, the task fails.*

*Note on `model:` in verify blocks: `/verify` supports `model:` overrides on any agent-spawning criterion type (`bash`, `codebase`, `research`, `subagent`) — not just `subagent`. `bash` and `codebase` criteria spawn criteria-checker agents; `model:` controls which model that agent uses. This manifest uses `model: haiku` and `model: sonnet` on codebase/bash criteria deliberately to create separate cache groups during verification.*

- [INV-G1] Description: All new files created by this manifest live under `tests/` only — no files added to production skill directories (`claude-plugins/`, `.claude/`, `agents/`)
  ```yaml
  verify:
    method: bash
    command: "cd $(git rev-parse --show-toplevel) && git status --short | grep -E '^[?A]' | awk '{print $2}' | grep -Ev '^tests/' | grep -Ev '^$' | grep -v CLEAN || echo CLEAN"
  ```

- [INV-G2] Description: No production skill files (skills, agents, hooks) modified — changes scoped to `tests/` only
  ```yaml
  verify:
    method: bash
    command: "cd $(git rev-parse --show-toplevel) && git diff HEAD --name-only | grep -E '(claude-plugins|\.claude/|hooks/)' || echo CLEAN"
  ```

- [INV-G3] Description: No new Python or Node dependencies introduced — harness uses only shell builtins and curl (already available)
  ```yaml
  verify:
    method: bash
    command: "cd $(git rev-parse --show-toplevel) && git diff HEAD -- pyproject.toml 2>/dev/null | grep -E '^\\+[^+]' || echo CLEAN"
  ```

- [INV-G4] Description: All deliverables pass intent analysis — the benchmark harness does what it claims: exercises caching paths, not something else
  ```yaml
  verify:
    method: subagent
    agent: change-intent-reviewer
    model: inherit
    prompt: "Review all files created under tests/cache-fixtures/ and tests/cache-bench.sh. Stated intent: each file exercises a specific code path in /verify's caching implementation (CACHING.md in claude-plugins/manifest-dev/skills/verify/references/CACHING.md). Verify that the content of each file actually exercises its stated purpose and contains no behavioral divergence from the stated intent. Flag any deliverable whose content does not match its stated purpose."
  ```

- [INV-G5] Description: All documentation files pass docs-reviewer quality check — no stale references, no inaccurate threshold values, no missing paths
  ```yaml
  verify:
    method: subagent
    agent: docs-reviewer
    model: inherit
    prompt: "Review tests/cache-fixtures/warmup-scenarios.md and tests/cache-fixtures/grouping-matrix.md against the source of truth: claude-plugins/manifest-dev/skills/verify/references/CACHING.md and claude-plugins/manifest-dev/agents/cache-warmup.md. Report any inaccuracies, stale values, missing scenarios, or references to sections that don't exist in the source files. Threshold: no MEDIUM+ findings."
  ```

- [INV-G6] Description: The test matrix uses exact model IDs for all three coding agents — `claude-haiku-4-5-20251001`, `gemini-3.1-flash-lite-preview`, `gpt-5.4-mini` — with no typos, shorthand, or outdated names
  ```yaml
  verify:
    method: codebase
    prompt: "Read tests/cache-fixtures/test-matrix.md. Verify it uses these exact model IDs with no variation: 'claude-haiku-4-5-20251001' for Claude, 'gemini-3.1-flash-lite-preview' for Gemini, and 'gpt-5.4-mini' for Codex. Report PASS or FAIL with any incorrect model identifiers found."
  ```

- [INV-G7] Description: Documentation files contain no AI prose sheen — no narrating-the-obvious, no puffery, no em-dash overuse
  ```yaml
  verify:
    method: subagent
    agent: prose-value-reviewer
    model: inherit
    prompt: "Review tests/cache-fixtures/warmup-scenarios.md, tests/cache-fixtures/grouping-matrix.md, and tests/cache-fixtures/large-fixture.md for prose value. Flag: narrating-the-obvious comments, empty buzzwords, AI rhetorical patterns (em-dash overuse, 'It's not just X — it's Y'), and sycophantic fragments. Threshold: no MEDIUM+ findings."
  ```

- [INV-G8] Description: All new files adhere to CLAUDE.md project conventions — kebab-case filenames, no unnecessary doc files, no debug logs
  ```yaml
  verify:
    method: subagent
    agent: context-file-adherence-reviewer
    model: inherit
    prompt: "Review the new files under tests/cache-fixtures/ and tests/cache-bench.sh against CLAUDE.md conventions at the repo root. Check: kebab-case filenames, no unnecessary README files created, no debug logs added, no files placed in wrong directories per CLAUDE.md. Threshold: no MEDIUM+ findings."
  ```

- [INV-G9] Description: `large-fixture.md` uses valid markdown heading structure — at least 6 headings, no broken internal anchors, all code blocks properly closed
  ```yaml
  verify:
    method: codebase
    model: haiku
    prompt: "Read tests/cache-fixtures/large-fixture.md. Verify: (1) file has at least 6 markdown headings (any level), (2) no broken [link](#anchor) references where the target anchor cannot be found in the file, (3) all fenced code blocks (triple-backtick) have matching closing fences. Report PASS or FAIL with specific issues."
  ```

## 4. Process Guidance (Non-Verifiable)

- [PG-1] Run `pytest tests/` before creating any new test artifacts to establish the pre-existing baseline — confirm prior tests pass before touching `tests/`
- [PG-2] Read CLAUDE.md before any file creation to confirm naming conventions and directory policies
- [PG-3] `large-fixture.md` must err toward MORE content, not less — if uncertain whether it exceeds the Haiku 4,096-token threshold, add another section. Under-sized fixtures invalidate the warmup stress path for Haiku-model groups.
- [PG-4] All bash verify commands in this manifest assume execution from the repo root. Commands use `$(git rev-parse --show-toplevel)` prefix or equivalent — do not rely on implicit working directory.
- [PG-5] The test matrix (`test-matrix.md`) is a planning document, not an autonomous executor. For each session row, the user manually amends the manifest's `Cache:` field to the required value, starts a fresh coding-agent session, and runs `/do <manifest-path>`.
- [PG-6] This manifest is the cache prefix instrument. Its large fixture and 22-criterion structure are what make it a meaningful stress test. Run `/do` on this manifest (not a small arbitrary manifest) for each matrix session.

## 5. Known Assumptions

- [ASM-1] Harness test artifacts (markdown fixtures + shell script) require no pytest unit tests — they ARE the test instruments; meta-testing them with pytest would add circular complexity | Default: no pytest coverage for new files under `tests/cache-fixtures/` | Impact if wrong: low — artifacts are simple enough to verify by inspection via the manifest's own ACs
- [ASM-2] CACHING.md terminology and threshold values are stable — if thresholds change (Opus: 4,096, Sonnet: 2,048, Haiku: 4,096), `warmup-scenarios.md` and `grouping-matrix.md` must be updated to match | Default: current values as of 2026-04-30 commit | Impact if wrong: docs become misleading for threshold-boundary test cases; AC-2.1 would catch the discrepancy on next /verify run
- [ASM-3] Claude sessions (S01–S04) use Haiku 4.5 as the coding agent — meaning `model: inherit` resolves to haiku, intentionally collapsing the `(criteria-checker, haiku)` and `(criteria-checker, inherit)` groups into one namespace. This collapse is a measured outcome, not an error. Gemini and Codex sessions test genuine cross-model override behavior. | Impact if assumption is wrong about which model "inherit" resolves to on Gemini/Codex: the model column in the result table will clarify the actual resolution
- [ASM-4] `tests/cache-fixtures/` can be created as a new subdirectory without conflicts — no existing content at that path (verified via repo inspection) | Default: directory does not exist | Impact if wrong: none — directory creation is idempotent
- [ASM-5] Deliverable format — markdown fixtures + shell script (not a pytest test suite) — was auto-decided as most appropriate for a benchmarking/documentation harness that cannot be automatically run in CI | Default: markdown + shell | Impact if wrong: user may have expected pytest; the manifest can be amended to add a `conftest.py` or pytest wrapper if preferred

## 6. Deliverables (The Work)

### Deliverable 1: Large Content Fixture
*New file: `tests/cache-fixtures/large-fixture.md`*

**Purpose:** A large markdown document in realistic manifest format that, when inlined as a `/verify` shared prefix snapshot (`Cache: manifest` or `Cache: max`), reliably exceeds both the Sonnet 2,048-token and Haiku 4,096-token thresholds. Serves as the primary test context for all caching benchmark runs.

**Content spec:** Include the following sections, each substantive (not stubs): Goal and Mental Model (~300 words), a full Approach section with Architecture, Execution Order, Risk Areas, and Trade-offs (~400 words), at least 8 Global Invariant entries with YAML verify blocks (~800 words), a Process Guidance section with at least 5 items (~200 words), a Known Assumptions section with at least 4 items (~300 words), and at least 3 Deliverables with 3 ACs each (~600 words). Total target: **≥15,000 characters** (≈4,285 tokens at 3.5 chars/token, safely above the Haiku 4,096-token threshold).

**Acceptance Criteria:**
- [AC-1.1] Description: File exists at `tests/cache-fixtures/large-fixture.md`
  ```yaml
  verify:
    method: bash
    command: "test -f $(git rev-parse --show-toplevel)/tests/cache-fixtures/large-fixture.md && echo PASS || echo FAIL"
  ```

- [AC-1.2] Description: File follows manifest schema structure — contains sections for Intent/Context, Approach, Global Invariants, Process Guidance, and Deliverables (in that order, any capitalization)
  ```yaml
  verify:
    method: codebase
    prompt: "Read tests/cache-fixtures/large-fixture.md. Verify it contains all five manifest structural sections in order: (1) a section about intent or goal, (2) a section about approach, (3) a section about global invariants, (4) a section about process guidance, (5) a section about deliverables. Report PASS with section names found, or FAIL with missing sections."
  ```

- [AC-1.3] Description: Character count ≥ 15,000 — safely above both the Sonnet (2,048 tokens ≈ 7,168 chars) and Haiku (4,096 tokens ≈ 14,336 chars) thresholds
  ```yaml
  verify:
    method: bash
    command: "CHARS=$(wc -c < $(git rev-parse --show-toplevel)/tests/cache-fixtures/large-fixture.md); [ \"$CHARS\" -ge 15000 ] && echo \"PASS: $CHARS chars\" || echo \"FAIL: $CHARS chars (need >= 15000)\""
  ```

- [AC-1.4] Description: Token count via Anthropic token counting API returns ≥ 4,096 for the fixture content (Haiku threshold) — exact measurement confirming both thresholds are met
  ```yaml
  verify:
    method: bash
    phase: "2"
    command: |
      FIXTURE="$(git rev-parse --show-toplevel)/tests/cache-fixtures/large-fixture.md"
      if [ -z "$ANTHROPIC_API_KEY" ]; then
        CHARS=$(wc -c < "$FIXTURE")
        echo "SKIP: ANTHROPIC_API_KEY not set; cannot verify exact token count"
        echo "Estimated $(( CHARS / 3 )) tokens from $CHARS chars (3 chars/token)"
        echo "Set ANTHROPIC_API_KEY and re-run to get an exact count"
        exit 0
      else
        RESULT=$(curl -s https://api.anthropic.com/v1/messages/count_tokens \
          -H "x-api-key: $ANTHROPIC_API_KEY" \
          -H "content-type: application/json" \
          -H "anthropic-version: 2023-06-01" \
          -d "{\"model\": \"claude-haiku-4-5-20251001\", \"messages\": [{\"role\": \"user\", \"content\": $(jq -Rs . < "$FIXTURE")}]}")
        TOKENS=$(printf '%s' "$RESULT" | jq -r '.input_tokens // empty')
        if [ -z "$TOKENS" ]; then
          echo "FAIL: API call did not return input_tokens. Response: $RESULT"
        elif [ "$TOKENS" -ge 4096 ]; then
          echo "PASS: $TOKENS tokens (>= 4096 Haiku threshold)"
        else
          echo "FAIL: $TOKENS tokens (< 4096 Haiku threshold)"
        fi
      fi
  ```

### Deliverable 2: Warmup Scenario Reference
*New file: `tests/cache-fixtures/warmup-scenarios.md`*

**Purpose:** Reference document covering all four warmup agent decision paths. Developers running the benchmark can use this to predict warmup output before running, then compare actual warmup output to verify the warmup agent is behaving correctly.

**Content spec:** A section per path, each with: (a) trigger condition, (b) expected warmup output fields using the structured output format from `cache-warmup.md`, (c) downstream effect on `/verify`. Paths: above-threshold/proceed, below-threshold/downgrade-to-none, API unavailable/estimated-fallback, warmup crash/fallback-to-none.

**Acceptance Criteria:**
- [AC-2.1] Description: Documents the above-threshold "proceed" path with correct model thresholds (Opus: 4,096, Sonnet: 2,048, Haiku: 4,096) and the 1.25x write / 0.1x read cost multipliers
  ```yaml
  verify:
    method: codebase
    prompt: "Read tests/cache-fixtures/warmup-scenarios.md. Verify it documents the 'proceed' recommendation path and includes all three model thresholds exactly: Opus 4096, Sonnet 2048, Haiku 4096 minimum tokens. Also check it mentions the cost multipliers (1.25x cache write, 0.1x cache read). Report PASS or FAIL with any missing details."
  ```

- [AC-2.2] Description: Documents the below-threshold "downgrade-to-none" path AND the alternative-model recommendation (e.g., suggest Sonnet when context is above 2,048 but below Opus's 4,096)
  ```yaml
  verify:
    method: codebase
    prompt: "Read tests/cache-fixtures/warmup-scenarios.md. Verify it documents: (1) the downgrade-to-none recommendation path when context is below ALL model thresholds, and (2) the alternative-model recommendation when context exceeds one model's threshold but not the requested model's threshold (e.g., above Sonnet 2048 but below Opus 4096). Report PASS or FAIL with missing scenarios."
  ```

- [AC-2.3] Description: Documents warmup crash fallback — all three failure modes (crash, timeout, unparseable output) route to `--cache none` behavior, do not block verification, and produce a log entry (per CACHING.md's "Warmup Failure Fallback" section)
  ```yaml
  verify:
    method: codebase
    prompt: "Read tests/cache-fixtures/warmup-scenarios.md. Verify it documents: (1) all three warmup failure modes named explicitly: crash, timeout, and unparseable output, (2) all three fall back to --cache none and do NOT block verification, (3) the fallback produces a log entry (failure is logged before proceeding). Cross-check against claude-plugins/manifest-dev/skills/verify/references/CACHING.md's 'Warmup Failure Fallback' section. Report PASS or FAIL."
  ```

- [AC-2.4] Description: Documentation passes docs-reviewer — accurate terminology, no stale threshold values, no ambiguous instructions
  ```yaml
  verify:
    method: subagent
    agent: docs-reviewer
    model: inherit
    phase: "2"
    prompt: "Review tests/cache-fixtures/warmup-scenarios.md against the source of truth: claude-plugins/manifest-dev/agents/cache-warmup.md and claude-plugins/manifest-dev/skills/verify/references/CACHING.md. Report any: (1) threshold values that differ from CACHING.md, (2) output format fields that differ from cache-warmup.md's Output Format section, (3) decision logic described differently from CACHING.md's Warmup Failure Fallback section. Threshold: no MEDIUM+ findings."
  ```

### Deliverable 3: Cache Group Reference Matrix
*New file: `tests/cache-fixtures/grouping-matrix.md`*

**Purpose:** A table-based reference showing expected `(agent_type, model)` cache groups for six representative criterion configurations. Enables quick visual validation of grouping logic before and after a verify run. The six configurations must cover all four key grouping scenarios from CACHING.md.

**Content spec:** A markdown table with columns: Scenario Name, Criteria IDs (example), agent_type, model, Cache Group Key `(agent_type, model)`, Agents in Group, Expected Cache Reads (N-1 for N agents in group), Note. Followed by a threshold reference table (Opus/Sonnet/Haiku thresholds + cost multipliers).

**Acceptance Criteria:**
- [AC-3.1] Description: Matrix includes at least 6 rows covering all four key grouping scenarios: (a) same-type same-model multi-agent group, (b) named subagent isolation (separate groups per named agent type), (c) explicit model override creating a cache boundary, (d) efficient-mode Haiku→inherit model upgrade creating a cache boundary
  ```yaml
  verify:
    method: codebase
    prompt: "Read tests/cache-fixtures/grouping-matrix.md. Verify the matrix has at least 6 rows and explicitly covers: (a) multiple criteria-checker agents with inherit model in one group, (b) two or more different named subagent types each in their own separate group, (c) a criterion with explicit model override (e.g., model: haiku or model: sonnet) in a separate group from model: inherit, (d) efficient-mode Haiku-to-inherit model upgrade creating a new cache namespace. Report PASS or FAIL with missing scenarios."
  ```

- [AC-3.2] Description: Each row shows at minimum: scenario name, agent_type, model, cache group key formatted as `(agent_type, model)`, and number of agents in group
  ```yaml
  verify:
    method: codebase
    prompt: "Read tests/cache-fixtures/grouping-matrix.md. For each row in the main matrix table, verify it contains: a scenario name, the agent_type value, the model value, the cache group key formatted as a (agent_type, model) pair, and the number of agents in the group. Report any rows missing these fields."
  ```

- [AC-3.3] Description: Matrix includes a threshold reference section with correct values: Opus 4,096, Sonnet 2,048, Haiku 4,096 tokens, and cost multipliers 1.25x write / 0.1x read
  ```yaml
  verify:
    method: codebase
    prompt: "Read tests/cache-fixtures/grouping-matrix.md. Verify it contains a threshold reference section (table or list) with exactly: Opus 4096, Sonnet 2048, Haiku 4096 minimum cacheable tokens, and cost multipliers of 1.25x for cache write and 0.1x for cache read. Report PASS or FAIL with any incorrect values."
  ```

- [AC-3.4] Description: Matrix content is internally consistent with CACHING.md grouping logic — no rows that describe grouping behavior that contradicts the source
  ```yaml
  verify:
    method: subagent
    agent: change-intent-reviewer
    model: inherit
    phase: "2"
    prompt: "Compare tests/cache-fixtures/grouping-matrix.md against the grouping rules in claude-plugins/manifest-dev/skills/verify/references/CACHING.md (Grouping Strategy section and Example Grouping subsection). Identify any rows in the matrix that describe grouping behavior inconsistent with CACHING.md. Examples of wrong rows: named subagents of different types sharing one group (each type must be its own group), or criteria with different model values sharing one group (per-model cache isolation). Report PASS if all rows are consistent, FAIL with specific inconsistencies."
  ```

### Deliverable 4: Test Session Matrix
*New file: `tests/cache-fixtures/test-matrix.md`*

**Purpose:** A reference document defining 10 test sessions — (coding agent × cache strategy, with one mode variant) — for the user to run in separate sessions. Each row specifies the coding agent, model ID, which `Cache:` value to set in the manifest, the expected cache group behavior, and blank result cells to fill from `/usage` (Claude) or `/stats` (Codex/Gemini).

**Sessions defined:**

| Session | Agent | Model | Cache | Mode | Key behavior |
|---------|-------|-------|-------|------|--------------|
| S01 | Claude | claude-haiku-4-5-20251001 | none | thorough | Baseline — no caching |
| S02 | Claude | claude-haiku-4-5-20251001 | manifest | thorough | 16-agent group; haiku=inherit collapse |
| S03 | Claude | claude-haiku-4-5-20251001 | max | thorough | As S02 + frozen manifest + append-only log |
| S04 | Claude | claude-haiku-4-5-20251001 | manifest | efficient | Low-yield: sequential launch |
| S05 | Gemini | gemini-3.1-flash-lite-preview | none | thorough | Baseline |
| S06 | Gemini | gemini-3.1-flash-lite-preview | manifest | thorough | Cross-CLI cache; model:haiku behavior TBD |
| S07 | Gemini | gemini-3.1-flash-lite-preview | max | thorough | As S06 + frozen manifest |
| S08 | Codex | gpt-5.4-mini | none | thorough | Baseline |
| S09 | Codex | gpt-5.4-mini | manifest | thorough | Cross-CLI cache; model:haiku behavior TBD |
| S10 | Codex | gpt-5.4-mini | max | thorough | As S09 + frozen manifest |

**Content spec:** The matrix table (all 10 rows), an Amendment Instructions section explaining how to change `Cache:` per session before running `/do`, a Result Recording table (Session / cache_read_input_tokens / cache_creation_input_tokens / model:haiku-override-behavior / notes), and a per-agent check-command reference (`/usage` for Claude; `/stats` for Codex/Gemini). Include a note on the haiku-group collapse scenario for Claude sessions.

**Acceptance Criteria:**
- [AC-4.1] Description: File exists at `tests/cache-fixtures/test-matrix.md`
  ```yaml
  verify:
    method: bash
    command: "test -f $(git rev-parse --show-toplevel)/tests/cache-fixtures/test-matrix.md && echo PASS || echo FAIL"
  ```

- [AC-4.2] Description: Matrix includes all three coding agents with their exact model IDs — `claude-haiku-4-5-20251001`, `gemini-3.1-flash-lite-preview`, `gpt-5.4-mini`
  ```yaml
  verify:
    method: codebase
    prompt: "Read tests/cache-fixtures/test-matrix.md. Verify it contains rows for all three coding agents with exactly these model IDs: 'claude-haiku-4-5-20251001' (Claude), 'gemini-3.1-flash-lite-preview' (Gemini), 'gpt-5.4-mini' (Codex). Report PASS or FAIL with any missing or incorrect model IDs."
  ```

- [AC-4.3] Description: Matrix covers all three cache strategies (`none`, `manifest`, `max`) for at least two agents, and includes at least one `efficient` mode row and 10 total session rows
  ```yaml
  verify:
    method: codebase
    prompt: "Read tests/cache-fixtures/test-matrix.md. Verify: (1) session rows exist for cache=none, cache=manifest, and cache=max for at least two different coding agents, (2) at least one row uses mode=efficient, (3) exactly 10 session rows are defined (S01–S10). Report PASS or FAIL with any missing configurations."
  ```

- [AC-4.4] Description: Document references `/usage` for Claude, `/stats` for Codex/Gemini, and names `cache_read_input_tokens` and `cache_creation_input_tokens` as the fields to record per session
  ```yaml
  verify:
    method: codebase
    model: sonnet
    prompt: "Read tests/cache-fixtures/test-matrix.md. Verify it: (1) mentions '/usage' as the Claude result-checking command, (2) mentions '/stats' as the Codex and Gemini result-checking command, (3) names 'cache_read_input_tokens' and 'cache_creation_input_tokens' explicitly as fields to record. Report PASS or FAIL."
  ```

- [AC-4.5] Description: Document explains the haiku-group collapse scenario for Claude sessions: when coding agent is Claude Haiku 4.5, `model: haiku` and `model: inherit` resolve to the same model, merging the two groups into one
  ```yaml
  verify:
    method: codebase
    model: haiku
    phase: "2"
    prompt: "Read tests/cache-fixtures/test-matrix.md. Verify it explicitly explains the haiku-group collapse scenario: when the coding agent is Claude Haiku 4.5, model:haiku and model:inherit resolve to the same model so (criteria-checker, haiku) and (criteria-checker, inherit) share one cache namespace. This should appear in notes for Claude sessions (S02/S03). Report PASS or FAIL."
  ```

### Deliverable 5: Deferred R-4 Empirical Validation
*No new file — a deferred-auto acceptance criterion that gates /done until the user has completed at least one cached session from the test matrix.*

**Purpose:** Validate R-4 from CACHING.md ("Cross-agent cache sharing via Agent tool is speculative"). After completing one or more sessions from `test-matrix.md`, the user runs `/usage` (Claude) or `/stats` (Codex/Gemini) and reports `cache_read_input_tokens`. Non-zero confirms the mechanism works. Zero across multiple cached runs means Agent tool conversations do not share cache.

**Acceptance Criteria:**
- [AC-5.1] Description: User has completed at least one cached session (S02, S03, S06, S07, S09, or S10) from the test matrix and reported the `/usage` or `/stats` output with `cache_read_input_tokens` and `cache_creation_input_tokens` values
  ```yaml
  verify:
    method: deferred-auto
    prompt: "User confirms: completed at least one cached session (cache=manifest or cache=max) from tests/cache-fixtures/test-matrix.md and typed /usage (Claude) or /stats (Codex/Gemini) at end of that session. Report the raw /usage or /stats output, and specifically: cache_read_input_tokens and cache_creation_input_tokens values. If cache_read_input_tokens > 0: PASS — cross-agent cache sharing via Agent tool is operational. If cache_read_input_tokens = 0 across multiple cached sessions: FAIL — cache sharing may not be supported via Agent tool (R-4 from CACHING.md Known Limitations)."
  ```


Test Matrix & Workflow

The 10 Sessions

Session Coding Agent Model Cache Mode What it exercises
S01 Claude claude-haiku-4-5-20251001 none thorough Uncached baseline — establishes token cost floor
S02 Claude claude-haiku-4-5-20251001 manifest thorough 16-agent group; haiku=inherit collapse (both model overrides land in same namespace)
S03 Claude claude-haiku-4-5-20251001 max thorough As S02 + frozen manifest epoch + append-only log
S04 Claude claude-haiku-4-5-20251001 manifest efficient Low-yield path: sequential launch, haiku criteria-checker, haiku=inherit collapse
S05 Gemini gemini-3.1-flash-lite-preview none thorough Uncached baseline on non-Claude CLI
S06 Gemini gemini-3.1-flash-lite-preview manifest thorough Cross-CLI: does Gemini honour --cache manifest? What does model: haiku do?
S07 Gemini gemini-3.1-flash-lite-preview max thorough As S06 + frozen manifest epoch
S08 Codex gpt-5.4-mini none thorough Uncached baseline on Codex
S09 Codex gpt-5.4-mini manifest thorough Cross-CLI: does Codex honour --cache manifest?
S10 Codex gpt-5.4-mini max thorough As S09 + frozen manifest epoch
S11 Claude claude-sonnet-4-6 none thorough Sonnet uncached baseline
S12 Claude claude-sonnet-4-6 manifest thorough Genuine haiku isolation + sonnet-group collapse + 17-agent inherit group

S11 is optional if you just want one Sonnet data point — S12 alone against S02 would show whether Sonnet vs Haiku orchestrator changes the cache picture. But the baseline makes the comparison clean.

How to Run Each Session

  1. Archive the manifest once:

    cp /tmp/manifest-20260430-120000.md .manifest/cache-stress-test-2026-04-30.md
    
  2. For each session, amend the Cache: field in the manifest to match the session's value (none / manifest / max), start a fresh session on the target coding agent, then:

    /do .manifest/cache-stress-test-2026-04-30.md
    
  3. At end of session, before closing:

    • Claude: /usage
    • Codex or Gemini: /stats
  4. Record cache_read_input_tokens and cache_creation_input_tokens in the result table inside tests/cache-fixtures/test-matrix.md.

Result Recording Table (fill in after each session)

Session cache_creation_input_tokens cache_read_input_tokens model:haiku behavior Notes
S01 N/A
S02 collapsed (=inherit)
S03 collapsed (=inherit)
S04 collapsed (=inherit)
S05 N/A
S06 TBD
S07 TBD
S08 N/A
S09 TBD
S10 TBD
S11 N/A
S12 TBD

What to Look For

  • S01 vs S02 vs S03 (Claude, same agent): cache_read_input_tokens should be 0 for S01, non-zero for S02/S03 if R-4 is confirmed. The ratio cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens) approximates how many agents hit vs. wrote the cache.
  • S02 vs S04 (manifest, thorough vs. efficient): S04 should have lower cache_read_input_tokens — efficient mode launches sequentially, fewer concurrent agents, smaller group fanout.
  • S06/S07 and S09/S10 (Gemini/Codex, cached): If cache_read_input_tokens is non-zero, caching works cross-CLI. If zero but S02 was non-zero, the Agent tool's cache is Claude-specific.
  • model: haiku behavior on Gemini/Codex: Observed in the session logs — criteria with model: haiku either error, silently inherit the session model, or produce a distinct failure pattern.

@Dev-iL Dev-iL force-pushed the 2604/prompt_cache_claude branch 5 times, most recently from 83aa1cd to 3104953 Compare May 7, 2026 06:33
@Dev-iL Dev-iL force-pushed the 2604/prompt_cache_claude branch 3 times, most recently from 1fc52c3 to 27a0395 Compare May 13, 2026 14:25
@Dev-iL
Copy link
Copy Markdown
Contributor Author

Dev-iL commented May 14, 2026

@doodledood That's the closest comparison I managed to do. While not perfect, looks like there's a noticeable difference in R/W ratio - 21.5 w/o cache and 27.5 w/ cache.

S11 (cache: none):

Session

Usage by model:
   claude-sonnet-4-6:  1.4k input, 80.5k output, 11.1m cache read, 516.3k cache write ($6.47)
    claude-haiku-4-5:  352 input, 14 output, 0 cache read, 0 cache write ($0.0004)

S12 (cache: manifest):

Session

Usage by model:
   claude-sonnet-4-6:  1.6k input, 58.5k output, 10.8m cache read, 392.5k cache write ($5.59)
    claude-haiku-4-5:  351 input, 13 output, 0 cache read, 0 cache write ($0.0004)

BTW, with the removal of the do log, "max" caching mode should likely change (or be removed), since there's no longer an explicit log to force being append-only.

@Dev-iL Dev-iL force-pushed the 2604/prompt_cache_claude branch 4 times, most recently from c7ae9b4 to 0ddc4a7 Compare May 15, 2026 10:22
@Dev-iL
Copy link
Copy Markdown
Contributor Author

Dev-iL commented May 15, 2026

@doodledood I updated this PR to make caching unconditional. The main change: always spawning the first verifier in a group separately and the rest only after we started seeing a response. This change should improve cache utilization "for free", with the largest benefits expected to workflows with multiple sub-agents, especially when they're using a different model/effort from the orchestrator.

@doodledood
Copy link
Copy Markdown
Owner

@Dev-iL So a couple of things/concerns:

  • First of all, the experimental plugin now is drastically changed, with verify even being removed, the total tokens is down by 80-90% probably after massive simplification, so probably I would say its a good idea to see how it does token spend wise as well
  • Second, I'm still not sold on the idea that it works more than a tiny non-random amount (this is something that can only really be tested at scale) -> prefix works but the problem is world state which is ever changing. You may have a cached manifest but thats really a grain of sand in the sea of consumed tokens. The moment it reads a file or interacts with the external world the cache is invalid if that state is different than before
  • If you are concerned about cost/usage, first of all with the new experimental set of skills its not opinionated about this but you can first try to use sonnet to /do and to take it even further set model = haiku on all ACs when you create your manifest
  • The prefix cache therefore IMO a micro optimization that adds unnecessary constraints & complexity (like append only manifest to name one, which I'd like to avoid); One of the reasons I went for the radical simplification is that newer models are much better at deriving intent & following instructions and the current set of skills became too large & hard to modify even when using the best practices

WDYT?

@Dev-iL
Copy link
Copy Markdown
Contributor Author

Dev-iL commented May 16, 2026

  • First of all, the experimental plugin now is drastically changed, with verify even being removed, the total tokens is down by 80-90% probably after massive simplification, so probably I would say its a good idea to see how it does token spend wise as well
  1. Sure, I'll try it out.
  2. It's quite possible that this proposal becomes largely unnecessary if the plugin's internals change that drastically. Let's not forget that much like "mode", it's an optimization intended for an earlier version of the plugin that sometimes used an unreasonable amount of tokens.

You may have a cached manifest but thats really a grain of sand in the sea of consumed tokens.

Let's think about this for a second - the prefix consists of the SKILLs, MCP tools, and then the files necessary for a given task. Presumably, the manifest is one of the first files in that list, since it's provided at the very first prompt of an interaction, then come ad-hoc files. So any change to the manifest (either append only or in-place) doesn't really matter since it discards everything else afterwards. However, if the manifest can be completely static (like the earlier constraint of updating the do-log instead of the manifest), this opens up room for more substantial cache reuse.

So I think the discussion should be on finding a good tradeoff between tracking intent and cache reuse.

The moment it reads a file or interacts with the external world the cache is invalid if that state is different than before

Then what's the point of the aggressive prefix caching by claude itself then? If this was under our control, the prefix would be append-only with forks at checkpoints before "unique" actions or subpaths.

set model = haiku on all ACs when you create your manifest

Is there a manifest field for this? Or should I just tell it to use haiku to verify?

The prefix cache therefore IMO a micro optimization that adds unnecessary constraints & complexity (like append only manifest to name one, which I'd like to avoid);

I'm still not sold on the idea that it works more than a tiny non-random amount

There has to be some threshold from which the extra complexity is worth it. Is it 10% token savings? 20%? More?

WDYT?

I have access to a local LLM that runs on llama.cpp and provides pretty detailed server-side logging, especially when it comes to caching. If that's an acceptable benchmark to you, I can run Claude code with this backend and monitor the exact queries that are being sent as well as logs mentioning cache events, in addition to the client side usage metrics.

@Dev-iL Dev-iL force-pushed the 2604/prompt_cache_claude branch from 0ddc4a7 to 8ac2371 Compare May 17, 2026 11:54
@doodledood
Copy link
Copy Markdown
Owner

@Dev-iL

  • about the mechanism: maybe I'm not getting the proposal clearly but Claude itself prefix caches the assistant and user turns and the tool calls so that the next message you send will not have to process the full history every time and get more and more expensive and slow as you talk to Claude. So I don't really see how you can influence that to be more aggressive than it already is. It makes sense to me that the improvement you showed is random noise and if you run it on more examples you'll see it flip back and forth
  • the manifest schema supports model so yeah you could just specify in there use haiku and it should spin up the task with haiku when verification happens

@Dev-iL
Copy link
Copy Markdown
Contributor Author

Dev-iL commented May 17, 2026

@doodledood
The point is not to make it more aggressive, the point is to make invalidation less likely to happen. If the manifest is included in the prefix close to the beginning, somewhere around the skills and the first user turn, as the conversation progresses there's more and more context appended after the manifest. If the manifest is modified - everything that follows is thrown out and must be recomputed.

It makes sense to me that the improvement you showed is random noise and if you run it on more examples you'll see it flip back and forth.

I agree that a single data point is fairly weak evidence - so please suggest an experiment that would satisfy you.

@doodledood
Copy link
Copy Markdown
Owner

@Dev-iL so your case is that verification specifically could be cached if the manifest never changes?

In any case there must be changes to the manifest in form of append only or in place and both of them mean it invalidates the cache (for example after a review comment that raised a point you missed in the manifest)

@Dev-iL
Copy link
Copy Markdown
Contributor Author

Dev-iL commented May 17, 2026

@Dev-iL so your case is that verification specifically could be cached if the manifest never changes?

That's right. The motivation for this PR was making the verification stage more token-friendly. Also, if a single verification is triggered ahead of the rest it ensures the best possible prefix is available to the other validations.

In any case there must be changes to the manifest in form of append only or in place and both of them mean it invalidates the cache (for example after a review comment that raised a point you missed in the manifest)

I agree that changes to the manifest are usually unavoidable, and that it doesn't make sense to keep the manifest static. My suggestion is allowing changes to the manifest only at verification loop boundary, to minimize the cache disturbance. I.e.
define -> manifest v1 -> full verification pass with static manifest across N agents -> manifest v2 -> another verification -> ... -> done

@doodledood
Copy link
Copy Markdown
Owner

@Dev-iL I see however the cache they use is not an input output regular cache it's simply a pre computed prefill cache which means only first byte will be faster but the output will still generate different tokens due to sampling at runtime as far as I understand. It means that you may score a small cache hit when a verifier subagent starts (very very small, just a clean context, a user message and that's it) the rest will be different each run regardless

So in a nutshell you cannot influence the cache like this in theory

Again my guess is if you test it many many times you'll see the results converge on same ratio difference

@Dev-iL Dev-iL force-pushed the 2604/prompt_cache_claude branch from 8ac2371 to 6628fdf Compare May 18, 2026 00:24
@Dev-iL Dev-iL changed the title feat: implement prompt caching via --cache feat: better cache utilization through prefix stability May 18, 2026
@Dev-iL Dev-iL force-pushed the 2604/prompt_cache_claude branch 9 times, most recently from 840ff26 to 516b397 Compare May 24, 2026 06:11
/verify now composes every verifier prompt with the manifest inlined as a byte-identical static prefix, then orders launches so the first agent in each (agent_type, model) group primes Claude's prompt cache and the rest hit it. On large manifests this cuts input-token cost across the fanout; when the manifest is below the model's minimum cacheable size, the same launch order runs without cache reads — no probe, no fallback path.

- /verify: inline manifest as shared static prefix; partition the current phase's criteria into (agent_type, model) cache groups; launch the first agent in each group, then the remainder up to the active mode's concurrency cap. Cache grouping reorders launches
  within the mode's cap; it does not raise concurrency.
- Prompt shape: shared prefix first, per-criterion data (ID, description, verification method, prompt/command) last. The "Optional context — manifest: <path>" line is dropped for the manifest itself since it is inlined.
- Snapshot discipline: read the manifest exactly once per pass and reuse that content verbatim for every verifier in the pass — never re-read between launches.
- Observability: /verify's narrative output lists the cache groups and launch order per phase, terse, as an audit trail for cache effectiveness.
- references/CACHING.md: new reference covering prompt shape, grouping rules, launch strategy, mode interaction, and the silent no-op for sub-threshold manifests.
- Root README and plugin README: /verify entries note the shared cached prefix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Dev-iL Dev-iL force-pushed the 2604/prompt_cache_claude branch from 516b397 to 9cabef9 Compare May 24, 2026 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FR: Introduce prompt caching

2 participants