Parent tracker: #69
Design context: #65
Summary
Harden the vLLM execution path for compact prefix replay using prompt_cache_ref + append_token_ids.
Scope
- Keep vLLM as the authority for rendering, tokenization, prompt-cache handle execution, and generation.
- Harden the
prompt_cache_ref + append_token_ids path.
- Decide where marginal rendering lives: incremental Responses render API in vLLM, cached prefix IDs plus marginal messages in vLLM, or an
agentic-api path that calls the same renderer/tokenizer source of truth.
- Scope
prompt_cache_ref to the selected serving endpoint or cache epoch.
- Add handle miss fallback, reseeding, and restart behavior.
- Avoid returning full
prompt_token_ids on every production turn.
Acceptance criteria
- Handle replay either executes safely or falls back to a full render path.
- vLLM restart or handle miss behavior is deterministic and tested.
- The production path can avoid full prompt-token arrays in steady state.
- Marginal rendering ownership is documented and implemented behind an explicit interface.
Parent tracker: #69
Design context: #65
Summary
Harden the vLLM execution path for compact prefix replay using
prompt_cache_ref + append_token_ids.Scope
prompt_cache_ref + append_token_idspath.agentic-apipath that calls the same renderer/tokenizer source of truth.prompt_cache_refto the selected serving endpoint or cache epoch.prompt_token_idson every production turn.Acceptance criteria