Benchmark cached-prefix replay end to end #74

Description

@franciscojavierarceo

Parent tracker: #69
Design context: #65

Summary

Build the benchmark coverage needed to validate cached-prefix replay against realistic long-context Responses and Conversation workloads.

Scope

  • Re-run benchmarks with automatic prefix caching (APC) disabled, cold prefixes, non-Harmony Responses models, and real agentic traffic profiles.
  • Add server-side render/tokenize timing instrumentation.
  • Add an agentic-api end-to-end benchmark that sends prompt_cache_ref + append_token_ids.
  • Measure storage lookup/persistence overhead separately from vLLM time to first token (TTFT).
  • Compare short-context, long-context APC-hot, and APC-unstable regimes.
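One possible shape for the timing instrumentation, as a minimal sketch: a per-stage timer so that render, tokenize, and storage lookup are accounted for separately, plus a helper that derives TTFT and total latency from any streamed response. `StageTimer`, `measure_stream`, and the stage names are hypothetical and are not taken from the existing codebase; the stream is assumed to be a plain iterable of chunks.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple


class StageTimer:
    """Accumulates wall-clock seconds per named stage (hypothetical helper)."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.seconds: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Repeated stages with the same name are summed, not overwritten.
            elapsed = self._clock() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed


def measure_stream(
    chunks: Iterable[object],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Return (ttft_s, total_latency_s, chunk_count) for a streamed response.

    TTFT is the time until the first chunk arrives; it is None when the
    stream yields nothing.
    """
    start = clock()
    ttft: Optional[float] = None
    count = 0
    for _ in chunks:
        if ttft is None:
            ttft = clock() - start
        count += 1
    return ttft, clock() - start, count
```

In a benchmark loop, the storage lookup would be wrapped in `with timer.stage("storage_lookup"):` before the request is sent, and the streamed response passed to `measure_stream`, so the two numbers never get folded into each other. The injectable `clock` is only there to make the helpers deterministic under test.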

Acceptance criteria

  • Benchmark output includes TTFT, total latency, prompt tokens, generated tokens, cached-token counts, and fallback reasons.
  • Long APC-hot agentic loops show lower TTFT without changing model-visible token IDs.
  • Neutral or negative short-context results are explicitly reported rather than hidden.
  • Results can be attached back to ADR-04 or a follow-up report.
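As a rough sketch of how these criteria could be checked mechanically, the record below carries the required output fields, and the summary reports a verdict for every regime, including neutral and regressed ones. `BenchResult`, the `mode` labels ("baseline", "cached_prefix"), the regime names, and the example fallback reason are all assumptions made for illustration, not an agreed schema.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import Dict, List, Optional, Sequence


@dataclass
class BenchResult:
    regime: str                 # e.g. "short", "long-apc-hot", "apc-unstable"
    mode: str                   # "baseline" or "cached_prefix" (hypothetical labels)
    ttft_s: float
    total_latency_s: float
    prompt_tokens: int
    generated_tokens: int
    cached_tokens: int
    fallback_reason: Optional[str] = None
    prompt_token_ids: List[int] = field(default_factory=list)


def token_ids_match(baseline: Sequence[int], replay: Sequence[int]) -> bool:
    """True when replay presents exactly the same token IDs to the model."""
    return list(baseline) == list(replay)


def summarize(results: Sequence[BenchResult]) -> Dict[str, Dict[str, object]]:
    """Compare median TTFT per regime; no regime is dropped from the output."""
    grouped: Dict[str, Dict[str, List[float]]] = {}
    for r in results:
        grouped.setdefault(r.regime, {}).setdefault(r.mode, []).append(r.ttft_s)

    summary: Dict[str, Dict[str, object]] = {}
    for regime, modes in grouped.items():
        if "baseline" not in modes or "cached_prefix" not in modes:
            summary[regime] = {"verdict": "incomplete"}
            continue
        base = median(modes["baseline"])
        cached = median(modes["cached_prefix"])
        delta = cached - base
        if delta < 0:
            verdict = "improved"
        elif delta > 0:
            verdict = "regressed"
        else:
            verdict = "neutral"
        summary[regime] = {
            "baseline_ttft_s": base,
            "cached_ttft_s": cached,
            "delta_s": delta,
            "verdict": verdict,
        }
    return summary
```

A real report would probably want a noise threshold rather than a strict sign test on the median delta; the strict comparison here just keeps the sketch short.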
