Parent tracker: #69
Design context: #65
Summary
Make cached-prefix replay compatible with scaled llm-d deployments, where agentic-api database state and vLLM KV cache residency are separate layers.
Scope
- Make the replay prefix router-visible without sending the full token array.
- Define compact prefix identity: prefix hash, token count, block size, and eventually a block-hash chain compatible with llm-d precise-prefix routing.
- Ensure vLLM KV events are emitted for the Responses replay path.
- Align the replay-plan token stream, block size, and prefix/block hash with llm-d routing.
- Test active-active EPP, tiered KV offload, wrong-pod routing, pod restart, AllBlocksCleared, and shared-storage reload scenarios.
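As a rough illustration of the compact prefix identity described in the scope, here is a minimal sketch. It is an assumption-laden stand-in: `PrefixIdentity` and `compute_prefix_identity` are hypothetical names, and the SHA-256 chaining over packed token IDs is not the hash scheme vLLM or llm-d actually uses. The real implementation would need to reproduce vLLM's block hashing exactly for the hints to match KV events.

```python
import hashlib
import struct
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class PrefixIdentity:
    """Hypothetical compact identity a router could receive instead of tokens."""
    prefix_hash: str               # hash of the last full block in the chain
    token_count: int               # tokens covered by full blocks only
    block_size: int
    block_hashes: Tuple[str, ...]  # chained per-block hashes, in order


def compute_prefix_identity(token_ids: Sequence[int], block_size: int) -> PrefixIdentity:
    """Chain-hash full blocks of token IDs.

    Assumption: SHA-256 over (parent digest + little-endian uint32 tokens).
    This is a placeholder for whatever hash vLLM uses for its KV blocks.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    parent = b""
    hashes = []
    full_blocks = len(token_ids) // block_size  # a trailing partial block is ignored
    for i in range(full_blocks):
        block = token_ids[i * block_size:(i + 1) * block_size]
        payload = parent + struct.pack(f"<{len(block)}I", *block)
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent.hex())

    prefix_hash = hashes[-1] if hashes else hashlib.sha256(b"").hexdigest()
    return PrefixIdentity(
        prefix_hash=prefix_hash,
        token_count=full_blocks * block_size,
        block_size=block_size,
        block_hashes=tuple(hashes),
    )
```

Because each block hash includes its parent's digest, two token streams share leading block hashes only while their prefixes are identical, which is the property prefix-aware routing relies on. A mismatch in block size between the replay plan and vLLM would change every hash, which is why the scope asks for the two to be aligned.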
Acceptance criteria
- Router-visible prefix hints are derived from the same token stream and block size vLLM uses for KV events.
- Wrong-pod, restart, and cache-clear cases fall back safely.
- The design does not treat a process-local vLLM handle as durable global database state.
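The fallback criteria could be exercised with something like the sketch below. Everything here is hypothetical: `PodBlockIndex`, its method names, and `plan_replay` are illustrative and do not mirror the vLLM KV event schema or the llm-d indexer API. It also assumes that "fall back safely" means recomputing the prefix with a full prefill, which the issue does not state explicitly.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple


class PodBlockIndex:
    """Hypothetical router-side view of which block hashes each pod holds.

    Residency is rebuilt from events, never read from durable database state.
    """

    def __init__(self) -> None:
        self._resident: Dict[str, Set[str]] = {}

    def on_blocks_stored(self, pod: str, hashes: Iterable[str]) -> None:
        self._resident.setdefault(pod, set()).update(hashes)

    def on_blocks_removed(self, pod: str, hashes: Iterable[str]) -> None:
        self._resident.get(pod, set()).difference_update(hashes)

    def on_all_blocks_cleared(self, pod: str) -> None:
        # Also the right reaction to a pod restart: forget everything for it.
        self._resident.pop(pod, None)

    def matched_blocks(self, pod: str, chain: Sequence[str]) -> int:
        """Count leading blocks of the chain that are resident on the pod."""
        resident = self._resident.get(pod, set())
        matched = 0
        for block_hash in chain:
            if block_hash not in resident:
                break
            matched += 1
        return matched


def plan_replay(index: PodBlockIndex, pod: str, chain: Sequence[str],
                block_size: int) -> Tuple[str, int]:
    """Return (mode, cached_token_count); fall back when nothing matches."""
    matched = index.matched_blocks(pod, chain)
    if matched == 0:
        return ("full_prefill", 0)
    return ("cached_replay", matched * block_size)
```

In this sketch a wrong-pod route, a restart, and a cache clear all reduce to the same case: the pod has no matching leading blocks, so the request is served by recomputation rather than by trusting a stale handle.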