Parent tracker: #69
Design context: #65
Summary
Build the benchmark coverage needed to validate cached-prefix replay against realistic long-context Responses and Conversation workloads.
Scope
- Re-run benchmarks with APC disabled, cold prefixes, non-Harmony Responses models, and real agentic traffic profiles.
- Add server-side render/tokenize timing instrumentation.
- Add an agentic-api end-to-end benchmark that sends prompt_cache_ref + append_token_ids.
- Measure storage lookup/persistence overhead separately from vLLM TTFT.
- Compare short-context, long-context APC-hot, and APC-unstable regimes.
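One way to keep storage overhead separate from vLLM TTFT is to time each server-side phase independently and report them as distinct fields. The sketch below is illustrative only: `PhaseTimer` and the phase names (`storage_lookup`, `render_tokenize`, `engine_ttft`) are hypothetical and not part of vLLM or the agentic-api codebase.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock seconds per named phase for one request."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the timer can be tested deterministically.
        self._clock = clock
        self.seconds = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the phase raises.
            self.seconds[name] += self._clock() - start


timer = PhaseTimer()
with timer.phase("storage_lookup"):
    pass  # hypothetical: resolve prompt_cache_ref to stored token IDs
with timer.phase("render_tokenize"):
    pass  # hypothetical: render and tokenize only the appended turn
with timer.phase("engine_ttft"):
    pass  # hypothetical: wait for the first token from the engine
```

Emitting `timer.seconds` alongside each benchmark row would let a regression in lookup or persistence show up on its own rather than being folded into TTFT.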
Acceptance criteria
- Benchmark output includes TTFT, total latency, prompt tokens, generated tokens, cached-token counts, and fallback reasons.
- Long APC-hot agentic loops show lower TTFT without changing model-visible token IDs.
- Neutral or negative short-context results are explicitly reported rather than hidden.
- Results can be attached back to ADR-04 or a follow-up report.
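A possible shape for the benchmark output, sketched under assumptions: the field names mirror the metrics listed in the acceptance criteria, while the `regime` and `mode` labels and the `ttft_delta_by_regime` helper are hypothetical. The helper returns a signed delta for every regime, so neutral or negative short-context results remain visible in the report.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class BenchmarkRecord:
    regime: str               # e.g. "short", "long_apc_hot", "apc_unstable"
    mode: str                 # "baseline" or "replay" (hypothetical labels)
    ttft_s: float
    total_latency_s: float
    prompt_tokens: int
    generated_tokens: int
    cached_tokens: int
    fallback_reason: Optional[str] = None  # None when replay was used


def ttft_delta_by_regime(records):
    """Return {regime: median replay TTFT - median baseline TTFT}.

    Negative values mean replay was faster; positive values are kept as-is.
    """
    grouped = {}
    for r in records:
        grouped.setdefault(r.regime, {}).setdefault(r.mode, []).append(r.ttft_s)
    deltas = {}
    for regime, modes in grouped.items():
        # Only regimes measured in both modes can be compared.
        if "baseline" in modes and "replay" in modes:
            deltas[regime] = median(modes["replay"]) - median(modes["baseline"])
    return deltas
```

Checking that model-visible token IDs are unchanged would be a separate assertion comparing the token sequences produced by the two modes; it is not covered by this sketch.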