Summary
Track implementation follow-up from ADR-04 for cached prompt-token prefixes in long Responses and Conversation continuations.
The goal is to reduce time to first token (TTFT) for long agentic loops whose prefixes are already hot in automatic prefix caching (APC). To do that, agentic-api persists enough prefix metadata to prove that a prior model-visible prompt prefix is still valid, then asks vLLM to continue from a compact replay request carrying prompt_cache_ref and append_token_ids.
Design context: #65
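As a rough illustration of the replay idea above, the sketch below checks that a newly rendered prompt strictly extends a stored prefix and, if so, builds the compact request body. Only the prompt_cache_ref and append_token_ids field names come from this issue; PrefixRecord, build_replay_request, and the dictionary shape are hypothetical. For clarity it compares full token lists, whereas a real implementation would presumably validate from persisted metadata (for example a digest) so that the full prompt does not have to be re-rendered.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PrefixRecord:
    """Hypothetical persisted metadata for a previously rendered prompt."""
    prompt_cache_ref: str        # opaque handle naming the cached prefix
    token_ids: tuple[int, ...]   # token IDs the model saw for that prefix


def build_replay_request(record: PrefixRecord,
                         rendered_ids: Sequence[int]) -> Optional[dict]:
    """Return a compact replay body, or None if the strict-prefix check fails."""
    n = len(record.token_ids)
    # Strict prefix: the new prompt must be longer than the stored prefix
    # and must begin with exactly the stored token IDs.
    if len(rendered_ids) <= n or tuple(rendered_ids[:n]) != record.token_ids:
        return None
    return {
        "prompt_cache_ref": record.prompt_cache_ref,
        "append_token_ids": list(rendered_ids[n:]),
    }
```

A None result is meant to signal that the caller should send the full prompt instead of attempting replay.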
Why
The ADR measurements showed that automatic prefix caching was already effective on the measured DGX GPT-OSS-20B server, and rendered prompt IDs were stable. The useful production win is therefore not primarily recovering missed GPU prefill; it is avoiding repeated prompt reconstruction, repeated rendering/tokenization, and large request bodies as conversations grow.
In the measured Codex-session fixture, the handle path became clearly useful around 24k prompt tokens, with a fitted TTFT improvement of about 20.4 ms per additional 10k prompt tokens.
Subissues
Acceptance criteria
- Strict-prefix validation exists before replay is enabled.
- Replay only appends at renderer/template-safe boundaries.
- vLLM handle miss and restart fallback behavior is defined.
- Codex WebSocket continuation can reach the same replay path.
- Long-context benchmarks show lower TTFT without changing model-visible token IDs.
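One possible shape for the handle-miss and restart fallback named in the criteria above is sketched below. It is an assumption, not a defined behavior: HandleMiss, generate_with_fallback, and the two sender callables are hypothetical stand-ins, and how vLLM would actually report an unknown handle is not specified in this issue.

```python
from typing import Callable, Optional, Sequence


class HandleMiss(Exception):
    """Hypothetical error for an unknown or expired prompt_cache_ref."""


def generate_with_fallback(
    replay_body: Optional[dict],
    full_token_ids: Sequence[int],
    send_replay: Callable[[dict], str],
    send_full: Callable[[Sequence[int]], str],
) -> tuple[str, str]:
    """Try the replay path first; fall back to the full prompt on a miss.

    Returns (path_taken, result) so callers can record which path served
    the request.
    """
    if replay_body is not None:
        try:
            return "replay", send_replay(replay_body)
        except HandleMiss:
            # The server restarted or evicted the handle: fall through
            # and resend the complete prompt.
            pass
    return "full", send_full(full_token_ids)
```

Returning the path taken alongside the result makes it straightforward to count misses when benchmarking TTFT on the replay path against the full-prompt path.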