Harden vLLM prompt-cache replay execution #70

Description

@franciscojavierarceo

Parent tracker: #69
Design context: #65

Summary

Harden the vLLM execution path for compact prefix replay using prompt_cache_ref + append_token_ids.

Scope

  • Keep vLLM as the authority for rendering, tokenization, prompt-cache handle execution, and generation.
  • Harden the prompt_cache_ref + append_token_ids path.
  • Decide where marginal rendering lives: (a) an incremental Responses render API in vLLM, (b) cached prefix IDs plus marginal messages in vLLM, or (c) an agentic-api path that calls the same renderer/tokenizer as the source of truth.
  • Scope prompt_cache_ref to the selected serving endpoint or cache epoch.
  • Define and implement handle-miss fallback, reseeding, and restart behavior.
  • Avoid returning full prompt_token_ids on every production turn.
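
To make the handle-miss and restart items above concrete, here is a minimal sketch of one possible control flow. Everything in it is an assumption for illustration: `FakeEngine`, `PromptCacheRef`, `HandleMiss`, and `run_turn` are hypothetical names, not vLLM APIs, and the in-memory engine only models the handle lifecycle (seed, extend, restart). The one idea taken from the scope list is that a handle is scoped to an endpoint and a cache epoch, so a restart turns every outstanding handle into a miss that triggers a full render and reseed.

```python
from dataclasses import dataclass, field


class HandleMiss(Exception):
    """Hypothetical error: the prompt_cache_ref is unknown or stale."""


@dataclass(frozen=True)
class PromptCacheRef:
    # Hypothetical handle shape: scoped to an endpoint and a cache epoch.
    endpoint: str
    cache_epoch: int
    handle_id: str


@dataclass
class FakeEngine:
    """In-memory stand-in for the serving engine. Not the vLLM API."""
    endpoint: str
    cache_epoch: int = 0
    _prefixes: dict = field(default_factory=dict)
    _next_id: int = 0

    def seed(self, prompt_token_ids):
        """Store a full prompt and return a fresh handle for it."""
        self._next_id += 1
        ref = PromptCacheRef(self.endpoint, self.cache_epoch, f"h{self._next_id}")
        self._prefixes[ref] = list(prompt_token_ids)
        return ref

    def extend(self, ref, append_token_ids):
        """Compact replay: cached prefix plus appended tokens, or HandleMiss."""
        stale = (ref.endpoint != self.endpoint
                 or ref.cache_epoch != self.cache_epoch
                 or ref not in self._prefixes)
        if stale:
            raise HandleMiss(ref.handle_id)
        return self.seed(self._prefixes[ref] + list(append_token_ids))

    def restart(self):
        """Simulate a restart: bump the epoch and drop all cached prefixes."""
        self.cache_epoch += 1
        self._prefixes.clear()

    def prompt_len(self, ref):
        return len(self._prefixes[ref])


def run_turn(engine, ref, append_token_ids, full_render):
    """Try compact replay; on a miss, reseed from a full render.

    Returns (new_ref, path), where path is "replay" or "full_render".
    The caller only holds a handle, never the full prompt_token_ids.
    """
    if ref is not None:
        try:
            return engine.extend(ref, append_token_ids), "replay"
        except HandleMiss:
            pass  # fall through to the full render path
    return engine.seed(full_render()), "full_render"
```

In this sketch `full_render` is a callable so the full prompt is only rendered when the replay path misses. Returning the path taken alongside the new handle gives tests and metrics a direct way to assert which branch ran.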

Acceptance criteria

  • Replay through a prompt_cache_ref handle either executes safely or falls back to a full render path.
  • Behavior on vLLM restart or handle miss is deterministic and tested.
  • The production path can avoid full prompt-token arrays in steady state.
  • Marginal rendering ownership is documented and implemented behind an explicit interface.
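
As one way to read "behind an explicit interface" in the last criterion, the sketch below defines a hypothetical `MarginalRenderer` protocol and a consistency check that whichever component owns marginal rendering could be tested against. The names `MarginalRenderer`, `ToyRenderer`, and `marginal_matches_full` are illustrative assumptions, and the toy tokenization (a separator plus one token per character) stands in for a real chat template and tokenizer. Real templates may not be prefix-stable, which is exactly what a check like this is meant to surface.

```python
from typing import Protocol, Sequence


class MarginalRenderer(Protocol):
    """Hypothetical interface for whichever component owns rendering."""

    def render_full(self, messages: Sequence[str]) -> list[int]: ...

    def render_marginal(self, prior: Sequence[str],
                        new: Sequence[str]) -> list[int]: ...


class ToyRenderer:
    """Toy renderer: one separator token, then one token per character."""
    SEP = 0

    def _render_one(self, message: str) -> list[int]:
        return [self.SEP] + [ord(ch) for ch in message]

    def render_full(self, messages: Sequence[str]) -> list[int]:
        ids: list[int] = []
        for message in messages:
            ids.extend(self._render_one(message))
        return ids

    def render_marginal(self, prior: Sequence[str],
                        new: Sequence[str]) -> list[int]:
        # In this toy template earlier messages never change later tokens,
        # so the marginal render is just the render of the new messages.
        return self.render_full(new)


def marginal_matches_full(renderer: MarginalRenderer,
                          prior: Sequence[str],
                          new: Sequence[str]) -> bool:
    """Check that cached prefix IDs plus marginal IDs equal a full render."""
    full = renderer.render_full(list(prior) + list(new))
    prefix = renderer.render_full(prior)
    return full == prefix + renderer.render_marginal(prior, new)
```

If this invariant fails for a given template, appending marginal token IDs to a cached prefix would diverge from a full render, and that turn should take the fallback path instead.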
