
support prefill prefix cache #286

Open
jiayyu wants to merge 17 commits into main from ds_prefix_cache2

Conversation

@jiayyu (Contributor) commented Mar 9, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

jiayyu force-pushed the ds_prefix_cache2 branch 3 times, most recently from bf02a43 to e17b614, on March 13, 2026 07:55
jiayyu marked this pull request as ready for review on March 13, 2026 07:56
Copilot AI review requested due to automatic review settings March 13, 2026 07:56
Copilot AI (Contributor) left a comment:

Pull request overview

This PR extends prefix-caching support through the scheduler, metadata builders, and attention implementations (including DeepSeek-v2/MLA paths), and adds unit tests for cache-aware block management behavior.

Changes:

  • Add prefix-cache metadata plumbing (has_cached, num_cached_tokens, etc.) and use it to gather cached+new KV for attention/indexing.
  • Update BlockManager/scheduler logic to account for cache hits and multi-token decode allocation.
  • Add/extend tests covering prefix-cache allocation behavior and hash-table cleanup.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

Summary per file:

  • tests/test_prefix_cache_accuracy.py — Adds an integration-style prefix-cache accuracy driver script under tests/.
  • tests/test_block_manager.py — Adds new unit tests for cache-aware allocation, hash cleanup, and multi-token append scenarios.
  • atom/utils/forward_context.py — Extends AttentionMetaData with prefix-cache-related fields.
  • atom/models/deepseek_v2.py — Adjusts sparse indexer to consider full KV length when prefix cache is present.
  • atom/model_ops/base_attention.py — Extends cp_mha_gather_cache to support multiple KV cache layouts.
  • atom/model_ops/attentions/backends.py — Builds prefill metadata accounting for cached tokens; adds token-to-batch mapping for cache gather.
  • atom/model_ops/attentions/aiter_mla.py — Updates MLA prefill metadata generation to use full-context lengths when prefix cache is present.
  • atom/model_ops/attention_mla.py — Adds prefix-cache path to gather full KV and run varlen flash-attn prefill.
  • atom/model_ops/attention_mha.py — Adds prefix-cache KV gather+concat path for MHA via cp_mha_gather_cache.
  • atom/model_engine/scheduler.py — Adds cache hit-rate stats logging; updates decode scheduling to reserve multi-token space.
  • atom/model_engine/block_manager.py — Adds free-block tracking set + cache-aware can_allocate; updates can_append to support multi-token appends.


Comment on lines 115 to 118:

```diff
 if cache_miss:
-    block_id = self.free_block_ids[0]
+    block_id = self._pop_free_block()
     block = self._allocate_block(block_id)
 else:
```
Comment on lines 171 to +179:

```python
if 0 < seq_len % self.block_size <= num_new_tokens or self.block_size == 1:
    needed_blocks = (seq_len + self.block_size - 1) // self.block_size
    while len(block_table) < needed_blocks:
        # For block_size == 1, we need to update hash for each new block
        # For block_size > 1, the previous block should have hash != -1
        # (unless it's the first block)
        if self.block_size == 1:
            # Allocate new block and update hash immediately
            # (like allocate does for full blocks)
            block_id = self.free_block_ids[0]
            block = self._allocate_block(block_id)
            block_table.append(block_id)
            token_ids = [seq[-1]]
            prefix = (
                self.blocks[block_table[-2]].hash
                if len(block_table) > 1
                else -1
            )
            h = self.compute_hash(token_ids, prefix)
            block.update(h, token_ids)
            self.hash_to_block_id[h] = block_id
        else:
            # For block_size > 1, we only allocate a new block when needed
            # The hash will be updated when the block becomes full
            block_id = self.free_block_ids[0]
            block = self._allocate_block(block_id)
            block_table.append(block_id)
            last_block = block
elif seq_len % self.block_size == 0:
    # Last block is now full, update its hash (similar to allocate)
    # TODO: fix hash
    token_ids = seq.block(seq.num_blocks - 1)
    if len(token_ids) == self.block_size:
        prefix = (
            self.blocks[block_table[-2]].hash if len(block_table) > 1 else -1
        )
        h = self.compute_hash(token_ids, prefix)
        last_block.update(h, token_ids)
        self.hash_to_block_id[h] = last_block.block_id
else:
    # Last block is not full and not at the boundary.
    # Hash remains -1 until block is full (consistent with allocate logic)
    # assert last_block.hash == -1, last_block.block_id
    pass
# Decode-generated blocks: token not finalized yet (depends on
# sampling / speculative verification), so we cannot compute a
# correct hash here. Just allocate the block without hashing.
block_id = self._pop_free_block()
self._allocate_block(block_id)
block_table.append(block_id)
```
Copilot AI review requested due to automatic review settings March 16, 2026 03:26
jiayyu force-pushed the ds_prefix_cache2 branch from cbcccb6 to fb6a927 on March 16, 2026 03:26
Copilot AI (Contributor) left a comment:

Pull request overview

This PR appears to extend prefix-caching support across the KV block manager, scheduler metadata, and multiple attention paths (MHA/MLA/Deepseek indexer), and adds tests/scripts intended to validate caching behavior.

Changes:

  • Add prefix-cache-aware metadata fields (has_cached, num_cached_tokens, mappings) to forward context and attention metadata builders.
  • Update attention implementations (MHA/MLA/Deepseek sparse indexer) to gather and attend over cached + new KV during prefill.
  • Extend BlockManager logic/tests for prefix caching, hash cleanup, and multi-token decode block allocation.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

Summary per file:

  • atom/model_engine/block_manager.py — Adds free-block set tracking, cache-aware can_allocate, multi-token can_append, and changes allocation behavior.
  • atom/model_engine/scheduler.py — Tracks cache hit stats; updates decode scheduling to request multi-token capacity.
  • atom/utils/forward_context.py — Adds prefix-cache fields to AttentionMetaData.
  • atom/model_ops/attentions/backends.py — Builds prefill metadata supporting cached prefixes (cu_seqlens_q/k, token mappings, block tables).
  • atom/model_ops/base_attention.py — Extends cp_mha_gather_cache to support multiple KV cache layouts.
  • atom/model_ops/attentions/aiter_mla.py — Adjusts sparse indexer metadata and kv-indices generation for cached prefixes.
  • atom/model_ops/attention_mla.py — Adds prefix-cache prefill path using gather_kv_b_proj + varlen flash attention.
  • atom/model_ops/attention_mha.py — Gathers cached KV from paged cache and concatenates with new tokens during prefill.
  • atom/models/deepseek_v2.py — Updates sparse attention indexer to gather full KV when cached prefixes exist.
  • tests/test_block_manager.py — Adds additional prefix-caching tests, hash cleanup tests, and multi-token can_append tests.
  • tests/test_prefix_cache_accuracy.py — Adds a standalone HTTP-based accuracy/caching workload script under tests/.


```diff
 del self.hash_to_block_id[block.hash]
 block.reset()
-self.free_block_ids.remove(block_id)
+self.free_block_ids_set.discard(block_id)
```
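The `free_block_ids` / `free_block_ids_set` pairing in this diff can be sketched as follows. This is a hypothetical `FreeBlockPool` matching the names in the diff, not the project's actual class: the deque keeps FIFO order, the set gives O(1) membership, and stale deque entries are skipped lazily on pop instead of paying O(n) for `deque.remove`.

```python
from collections import deque

class FreeBlockPool:
    def __init__(self, num_blocks: int):
        self.free_block_ids = deque(range(num_blocks))
        self.free_block_ids_set = set(range(num_blocks))

    def discard(self, block_id: int) -> None:
        # O(1): only the set is updated; the matching deque entry
        # becomes stale and is cleaned up lazily at pop time.
        self.free_block_ids_set.discard(block_id)

    def free(self, block_id: int) -> None:
        self.free_block_ids.append(block_id)
        self.free_block_ids_set.add(block_id)

    def _pop_free_block(self) -> int:
        # Lazily drop deque entries that were discarded via the set.
        while self.free_block_ids:
            block_id = self.free_block_ids.popleft()
            if block_id in self.free_block_ids_set:
                self.free_block_ids_set.remove(block_id)
                return block_id
        raise RuntimeError("no free blocks")

pool = FreeBlockPool(4)
pool.discard(0)                      # block 0 claimed by a cache hit
assert pool._pop_free_block() == 1   # stale entry for 0 is skipped
```

The trade-off is that the deque can temporarily hold stale ids, but each stale entry is visited at most once, so amortized cost stays O(1) per operation.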
Comment on lines +159 to +162:

```python
needed_blocks = (
    seq_len + num_new_tokens + self.block_size - 1
) // self.block_size
new_blocks_needed = max(0, needed_blocks - current_blocks)
```
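The ceil-division above can be checked with a small worked example (the parameter values below are illustrative, not taken from the PR):

```python
def new_blocks_needed(seq_len: int, num_new_tokens: int,
                      current_blocks: int, block_size: int) -> int:
    # Ceil-divide the post-append sequence length by the block size ...
    needed_blocks = (seq_len + num_new_tokens + block_size - 1) // block_size
    # ... and only count blocks beyond those the sequence already holds.
    return max(0, needed_blocks - current_blocks)

# 30 tokens in 2 blocks of 16; appending 3 tokens crosses into a 3rd block.
assert new_blocks_needed(30, 3, 2, 16) == 1
# Appending 2 tokens (30 -> 32) still fits in the 2 existing blocks.
assert new_blocks_needed(30, 2, 2, 16) == 0
```

Handling `num_new_tokens > 1` here is what lets the decode path reserve space for multi-token appends (e.g. speculative decoding) instead of assuming one token per step.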
jiayyu force-pushed the ds_prefix_cache2 branch from fb6a927 to 5f0026e on March 16, 2026 11:20
jiayyu changed the title from "Ds prefix cache2" to "support prefill prefix cache" on Mar 17, 2026
```python
)
if attn_metadata.has_cached:
    # Full context (cached + new): use cu_seqlens_k for indexer
    cu_seqlens_k_np = attn_metadata.cu_seqlens_k.cpu().numpy()
```
Collaborator left a comment:

`.cu_seqlens_k.cpu()` — this will introduce a D2H copy.

```python
num_tokens = hidden_states.shape[0]
# When has_cached, gather full KV (cached + new) for indexer top-k
total_kv = (
    prefill_metadata.cu_seqlens_k[-1].item()
```
Collaborator left a comment:

D2H copy here as well (`.item()` on a device tensor forces a device-to-host sync).
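Both reviewer comments flag the same pattern: reading device-resident `cu_seqlens` back on the host inside the forward path. A common mitigation, sketched below with plain NumPy (the builder name and fields are hypothetical), is to build the cumulative lengths on the host once in the metadata builder, upload that array to the device a single time, and have forward-path code read scalars such as `total_kv` from the host copy instead of calling `.cpu()` / `.item()` per step:

```python
import numpy as np

def build_cu_seqlens(context_lens):
    """Host-side exclusive prefix sum over per-sequence KV lengths.

    In a real metadata builder this array would also be uploaded once
    to the device (e.g. torch.from_numpy(cu).to(device)); forward-path
    code then reads scalars from the host copy without a D2H sync.
    """
    cu = np.zeros(len(context_lens) + 1, dtype=np.int32)
    np.cumsum(context_lens, out=cu[1:])
    return cu

cu_seqlens_k_np = build_cu_seqlens([5, 3, 8])
total_kv = int(cu_seqlens_k_np[-1])   # pure host read, no device sync
assert cu_seqlens_k_np.tolist() == [0, 5, 8, 16]
assert total_kv == 16
```

Since the per-sequence lengths are already known on the CPU at scheduling time, keeping this host mirror costs nothing extra per forward pass.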

Copilot AI review requested due to automatic review settings March 18, 2026 04:48
Copilot AI (Contributor) left a comment:

Pull request overview

Adds support for using prefix-cache hits during prefill so attention/indexer paths can operate over “cached + newly computed” KV, while also improving block allocation logic and adding related tests.

Changes:

  • Extend attention metadata to carry prefix-cache-related fields (e.g., has_cached, total_kv, cached-token mappings).
  • Update attention/indexer implementations to gather and use full KV context when prefixes are cached.
  • Enhance block manager allocation/append logic for prefix caching and multi-token decode scheduling; add new unit tests and a standalone accuracy script.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Summary per file:

  • tests/test_prefix_cache_accuracy.py — Adds a standalone script intended to validate high cache-hit behavior against a running server.
  • tests/test_block_manager.py — Adds unit tests covering prefix-cache-aware allocation, hash-table cleanup, multi-token append logic, and edge cases.
  • atom/utils/forward_context.py — Extends AttentionMetaData with prefix-cache fields used by attention/indexer paths.
  • atom/models/deepseek_v2.py — Adjusts sparse indexer KV gathering to support cached + new KV during prefill.
  • atom/model_ops/base_attention.py — Updates cache-gather helper to support multiple KV cache layouts.
  • atom/model_ops/attentions/backends.py — Builds additional prefill metadata for prefix cache (cu-seqlens, token-to-batch mapping, etc.).
  • atom/model_ops/attentions/aiter_mla.py — Updates sparse MLA prefill metadata generation to account for cached prefixes.
  • atom/model_ops/attention_mla.py — Enables prefix-cache-aware prefill KV gathering and uses full-KV cu-seqlens in relevant paths.
  • atom/model_ops/attention_mha.py — Gathers cached KV from paged cache and concatenates with new KV for attention compute when prefixes hit.
  • atom/model_engine/scheduler.py — Adds cache hit stats tracking and updates decode scheduling to reserve blocks for multi-token decode.
  • atom/model_engine/block_manager.py — Introduces set-based free-block tracking + prefix-cache-aware can_allocate and multi-token can_append.


Comment on lines 113 to 118
if block_id == -1 or self.blocks[block_id].token_ids != token_ids:
cache_miss = True
if cache_miss:
block_id = self.free_block_ids[0]
block_id = self._pop_free_block()
block = self._allocate_block(block_id)
else:
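The guard quoted above checks the stored `token_ids`, not just the hash lookup, so a hash collision or stale table entry degrades into a cache miss instead of silently reusing the wrong block's KV. A minimal sketch of that lookup (the structures here are simplified stand-ins for the names in the diff):

```python
def lookup_cached_block(hash_to_block_id, block_tokens, h, token_ids):
    """Return a reusable block id for (h, token_ids), or -1 on a miss.

    Comparing token_ids turns a hash collision into an ordinary cache
    miss rather than a silent correctness bug.
    """
    block_id = hash_to_block_id.get(h, -1)
    if block_id == -1 or block_tokens[block_id] != token_ids:
        return -1
    return block_id

block_tokens = {7: [1, 2, 3, 4]}        # block_id -> tokens it holds
table = {0xABC: 7}                      # hash -> block_id
assert lookup_cached_block(table, block_tokens, 0xABC, [1, 2, 3, 4]) == 7   # hit
assert lookup_cached_block(table, block_tokens, 0xABC, [1, 2, 3, 9]) == -1  # collision
assert lookup_cached_block(table, block_tokens, 0xDEF, [1, 2, 3, 4]) == -1  # miss
```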
Copilot AI review requested due to automatic review settings March 18, 2026 05:07
Copilot AI (Contributor) left a comment:

Pull request overview

This PR adds support for prefill prefix-cache hits by propagating new caching metadata through the attention stack, gathering cached KV for compute where needed, and expanding unit/integration-style tests around prefix caching behavior.

Changes:

  • Extend AttentionMetaData and attention backend builders to carry prefix-cache related metadata (e.g., cached tokens, total KV, token→batch mapping).
  • Update model attention implementations to handle prefill runs where some KV comes from cache (e.g., DeepSeekV2 indexer sizing, MLA/MHA KV gathering paths).
  • Enhance scheduler/block manager behavior and add tests for prefix-caching allocation/append edge cases plus a standalone accuracy script.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

Summary per file:

  • atom/model_engine/block_manager.py — Adds free-block set tracking, prefix-aware can_allocate, multi-token can_append, and hash eviction logic.
  • atom/model_engine/scheduler.py — Adds periodic prefix-cache stats and updates decode append capacity checks for multi-token decode.
  • atom/utils/forward_context.py — Extends AttentionMetaData with prefix-cache fields used by attention implementations/builders.
  • atom/model_ops/attentions/backends.py — Builds prefill metadata with cached tokens and adds buffers needed to gather cached KV.
  • atom/model_ops/attentions/aiter_mla.py — Updates sparse/prefill metadata computation to account for cached prefix.
  • atom/model_ops/base_attention.py — Extends cp_mha_gather_cache to support multiple KV cache layouts.
  • atom/model_ops/attention_mha.py — Adds prefill path to gather cached KV from paged cache and concat with newly written tokens.
  • atom/model_ops/attention_mla.py — Adds prefix-cache-enabled MLA prefill path using gathered full KV; adjusts sparse indexing inputs.
  • atom/models/deepseek_v2.py — Adjusts sparse indexer KV gather sizing and seqlens selection when prefix cache is present.
  • tests/test_block_manager.py — Adds new unit tests around prefix-caching allocation correctness, hash cleanup, and multi-token append.
  • tests/test_prefix_cache_accuracy.py — Adds a standalone HTTP-driven accuracy/workload script for prefix caching behavior.


```python
# Evict stale hash entry before resetting
if block.hash != -1 and self.hash_to_block_id.get(block.hash) == block_id:
    del self.hash_to_block_id[block.hash]
block.reset()
```
jiayyu force-pushed the ds_prefix_cache2 branch from a2190e0 to d601bd6 on March 18, 2026 05:42
valarLip and others added 9 commits March 18, 2026 13:48
Fix GPU memory access fault caused by double conversion of block_tables
in cached prefill path. kv_indices_generate_triton applies block_ratio
internally, but was receiving already-converted block_tables (via
block_tables_converted), causing indices to be multiplied by block_ratio
twice (e.g. block_id*256 instead of block_id*16), exceeding KV cache
bounds.

Key changes:
- Use raw block_tables for kv_indices generation in aiter_mla prefill
- Route cached prefill through paged MLA attention (supports Q≠K)
  instead of flash_attn_varlen_func (requires Q==K)
- Track has_cached flag through AttentionMetaData for path selection
- Fix block_manager: hash table leak, can_allocate cache-hit accounting,
  can_append for multi-token decode, O(1) free block tracking
- Add CacheStats to scheduler for prefix cache hit rate monitoring
- Add comprehensive block_manager tests (119 passing)

Verified: gsm8k 1319 samples, 95.83% accuracy, 0 GPU faults.
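The "gather cached + new KV" path this commit message describes can be illustrated with plain NumPy. The layout and helper name below are assumptions for illustration only (not the `cp_mha_gather_cache` signature): cached tokens are gathered from the paged cache through the block table, then concatenated with the newly computed KV before the attention call.

```python
import numpy as np

def gather_cached_kv(kv_cache, block_table, num_cached, block_size):
    """Gather the first num_cached tokens of one sequence from a paged
    cache laid out as [num_blocks, block_size, head_dim]."""
    out = []
    for i in range(num_cached):
        blk = block_table[i // block_size]     # physical block id
        out.append(kv_cache[blk, i % block_size])
    return np.stack(out)

block_size, head_dim = 4, 2
kv_cache = np.arange(8 * block_size * head_dim, dtype=np.float32).reshape(
    8, block_size, head_dim)
block_table = [5, 2]          # sequence's blocks in logical order
cached = gather_cached_kv(kv_cache, block_table, num_cached=6,
                          block_size=block_size)
new_kv = np.zeros((3, head_dim), dtype=np.float32)  # freshly computed tokens
full_kv = np.concatenate([cached, new_kv])          # cached + new, in order
assert full_kv.shape == (9, head_dim)
assert np.array_equal(cached[4], kv_cache[2, 0])    # 5th token lives in block 2
```

This also makes the double-conversion bug above concrete: the gather indexes physical blocks, so if `block_tables` has already been scaled by the block ratio, every index lands `block_ratio`× past the intended block and reads out of bounds.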
jiayyu force-pushed the ds_prefix_cache2 branch from d601bd6 to 609dbf3 on March 18, 2026 05:49
Copilot AI review requested due to automatic review settings March 18, 2026 07:45
Copilot AI (Contributor) left a comment:

Pull request overview

Adds support for prefix-cache hits during prefill by carrying cached-context metadata through attention prep and (when needed) gathering full KV (cached + new) for attention/indexing. Also tightens block-management behavior and extends tests around prefix caching and multi-token decode.

Changes:

  • Extend AttentionMetaData and attention-metadata builders to track cached-prefill state (has_cached, total_kv, num_cached_tokens, seq_starts) and build correct cu_seqlens_q/k.
  • Update MHA/MLA attention paths (and DeepSeek v2 indexer path) to operate on full KV length when prefix-cache hits exist.
  • Improve BlockManager free-block bookkeeping + add unit tests for prefix-caching allocation, hash cleanup, and multi-token decode capacity checks; add cache hit-rate logging in the scheduler.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Summary per file:

  • tests/test_prefix_cache_accuracy.py — Adds a standalone accuracy/load script for prefix-cache behavior against a running server.
  • tests/test_block_manager.py — Adds tests for can_allocate with cache hits, hash-table cleanup, multi-token can_append, and preemption/cache reuse.
  • atom/utils/forward_context.py — Extends AttentionMetaData with prefix-cache metadata fields.
  • atom/models/deepseek_v2.py — Makes sparse indexer allocate/gather based on full KV length when cached prefixes exist.
  • atom/model_ops/base_attention.py — Extends cp_mha_gather_cache to correctly handle both NHD and SHUFFLE cache layouts.
  • atom/model_ops/attentions/backends.py — Builds correct prefill metadata for cached prefixes (positions/slot mapping, cu_seqlens_q/k, seq_starts, etc.).
  • atom/model_ops/attentions/aiter_mla.py — Fixes sparse prefill metadata (cu_seqlen_ks/ke, token_to_seq_idxs) and ensures KV indices setup works with caching.
  • atom/model_ops/attention_mla.py — Adds a prefix-cache prefill path that gathers full KV via gather_kv_b_proj before attention.
  • atom/model_ops/attention_mha.py — Adds prefix-cache KV gather+concat in server-mode MHA via cp_mha_gather_cache.
  • atom/model_engine/scheduler.py — Adds CacheStats logging and updates decode scheduling to consider multi-token appends (spec decode).
  • atom/model_engine/block_manager.py — Adds free_block_ids_set, lazy deque cleanup, cache-aware can_allocate, and multi-token can_append.

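The CacheStats hit-rate logging mentioned for the scheduler can be sketched as a small running counter. This is a hypothetical shape for the class; the real implementation's fields and logging cadence may differ.

```python
class CacheStats:
    """Running prefix-cache counters for periodic hit-rate logging."""

    def __init__(self):
        self.cached_tokens = 0
        self.total_tokens = 0

    def record(self, num_cached: int, num_total: int) -> None:
        # Called once per scheduled prefill.
        self.cached_tokens += num_cached
        self.total_tokens += num_total

    @property
    def hit_rate(self) -> float:
        if self.total_tokens == 0:
            return 0.0
        return self.cached_tokens / self.total_tokens

stats = CacheStats()
stats.record(num_cached=96, num_total=128)   # prefill with 96 cached tokens
stats.record(num_cached=0, num_total=64)     # a cold prompt
assert abs(stats.hit_rate - 0.5) < 1e-9      # 96 / 192
```

Counting tokens rather than requests gives a truer picture for workloads mixing long cached prompts with short cold ones.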

Comment on lines +17 to +18:

```python
import requests
```
Comment on lines 113 to 117:

```diff
 if block_id == -1 or self.blocks[block_id].token_ids != token_ids:
     cache_miss = True
 if cache_miss:
-    block_id = self.free_block_ids[0]
+    block_id = self._pop_free_block()
     block = self._allocate_block(block_id)
```
Comment on lines +41 to +42:

```python
from atom.plugin import is_plugin_mode
```
Contributor left a comment:

⚠️ [ruff] <F811> reported by reviewdog 🐶
Redefinition of unused is_plugin_mode from line 25: is_plugin_mode redefined here

Suggested change (remove the duplicate import):

```diff
-from atom.plugin import is_plugin_mode
```

Comment on lines +43 to 44:

```python
from atom.plugin.attention_mla import MLAAttentionImplDecoratorForPluginMode
```
Contributor left a comment:

⚠️ [ruff] <F811> reported by reviewdog 🐶
Redefinition of unused MLAAttentionImplDecoratorForPluginMode from line 26: MLAAttentionImplDecoratorForPluginMode redefined here

Suggested change (remove the duplicate import):

```diff
-from atom.plugin.attention_mla import MLAAttentionImplDecoratorForPluginMode
```
