Benchmark cached-prefix replay end to end #74

Parent tracker: #69
Design context: #65

## Summary

Build the benchmark coverage needed to validate cached-prefix replay against realistic long-context Responses and Conversation workloads.

## Scope

- Re-run benchmarks with APC disabled, cold prefixes, non-Harmony Responses models, and real agentic traffic profiles.
- Add server-side render/tokenize timing instrumentation.
- Add an `agentic-api` end-to-end benchmark that sends `prompt_cache_ref` together with `append_token_ids`.
- Measure storage lookup/persistence overhead separately from vLLM TTFT.
- Compare short-context, long-context APC-hot, and APC-unstable regimes.
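
As a rough illustration of the instrumentation items above, the sketch below times each stage separately so that storage lookup and render/tokenize cost can be reported apart from TTFT. `StageTimer`, `build_replay_request`, and the stage names are hypothetical; only the `prompt_cache_ref` and `append_token_ids` field names come from this issue, and the real request shape may differ.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named stage (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the timer can be tested deterministically.
        self._clock = clock
        self.seconds = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record elapsed time even if the timed stage raises.
            self.seconds[name] += self._clock() - start


def build_replay_request(prompt_cache_ref, append_token_ids):
    """Assumed payload shape; only the two field names come from this issue."""
    return {
        "prompt_cache_ref": prompt_cache_ref,
        "append_token_ids": list(append_token_ids),
    }
```

A benchmark loop could wrap the storage call in `with timer.stage("storage_lookup"):` and the wait for the first streamed token in `with timer.stage("ttft"):`, then emit `dict(timer.seconds)` alongside the other metrics.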

## Acceptance criteria

- Benchmark output includes TTFT, total latency, prompt tokens, generated tokens, cached-token counts, and fallback reasons.
- Long APC-hot agentic loops show lower TTFT than the same loops without cached-prefix replay, without changing model-visible token IDs.
- Neutral or negative short-context results are explicitly reported rather than hidden.
- Results can be attached back to ADR-04 or a follow-up report.
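
One possible shape for the benchmark output and the pass/fail check implied by these criteria is sketched below. The record fields mirror the metrics listed above; the class name, the regime labels, and the `compare` helper are assumptions made for illustration, not an agreed schema.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BenchmarkRecord:
    regime: str                  # e.g. "short", "long-apc-hot", "apc-unstable" (labels assumed)
    ttft_s: float
    total_latency_s: float
    prompt_tokens: int
    generated_tokens: int
    cached_tokens: int
    fallback_reason: Optional[str] = None  # None when replay was actually used


def compare(baseline: BenchmarkRecord, replay: BenchmarkRecord,
            baseline_ids: Sequence[int], replay_ids: Sequence[int]) -> dict:
    """Compare one replay run against its baseline run."""
    # Replay must not alter what the model sees.
    if list(baseline_ids) != list(replay_ids):
        raise ValueError("replay changed model-visible token IDs")
    delta = replay.ttft_s - baseline.ttft_s
    return {
        "regime": replay.regime,
        "ttft_delta_s": delta,
        # Neutral or negative results are labeled, not dropped.
        "outcome": "improved" if delta < 0 else "neutral_or_negative",
        "fallback_reason": replay.fallback_reason,
    }
```

Keeping the outcome label in every row means short-context runs where replay does not help still appear in the report.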

