Conversation
bf02a43 to e17b614
Pull request overview
This PR extends prefix-caching support through the scheduler, metadata builders, and attention implementations (including DeepSeek-v2/MLA paths), and adds unit tests for cache-aware block management behavior.
Changes:
- Add prefix-cache metadata plumbing (`has_cached`, `num_cached_tokens`, etc.) and use it to gather cached+new KV for attention/indexing.
- Update BlockManager/scheduler logic to account for cache hits and multi-token decode allocation.
- Add/extend tests covering prefix-cache allocation behavior and hash-table cleanup.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_prefix_cache_accuracy.py | Adds an integration-style prefix-cache accuracy driver script under tests/. |
| tests/test_block_manager.py | Adds new unit tests for cache-aware allocation, hash cleanup, and multi-token append scenarios. |
| atom/utils/forward_context.py | Extends AttentionMetaData with prefix-cache-related fields. |
| atom/models/deepseek_v2.py | Adjusts sparse indexer to consider full KV length when prefix cache is present. |
| atom/model_ops/base_attention.py | Extends cp_mha_gather_cache to support multiple KV cache layouts. |
| atom/model_ops/attentions/backends.py | Builds prefill metadata accounting for cached tokens; adds token-to-batch mapping for cache gather. |
| atom/model_ops/attentions/aiter_mla.py | Updates MLA prefill metadata generation to use full-context lengths when prefix cache is present. |
| atom/model_ops/attention_mla.py | Adds prefix-cache path to gather full KV and run varlen flash-attn prefill. |
| atom/model_ops/attention_mha.py | Adds prefix-cache KV gather+concat path for MHA via cp_mha_gather_cache. |
| atom/model_engine/scheduler.py | Adds cache hit-rate stats logging; updates decode scheduling to reserve multi-token space. |
| atom/model_engine/block_manager.py | Adds free-block tracking set + cache-aware can_allocate; updates can_append to support multi-token appends. |
if cache_miss:
-    block_id = self.free_block_ids[0]
+    block_id = self._pop_free_block()
    block = self._allocate_block(block_id)
else:
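The `_pop_free_block` helper shown in this hunk, together with the `free_block_ids_set` and "lazy deque cleanup" mentioned later in the review summary, points at O(1) free-block tracking: a deque preserves eviction order, a set gives O(1) membership, and stale deque entries are skipped on pop instead of being removed eagerly. A hypothetical sketch (class and method names beyond those two fields are assumptions, not the project's actual code):

```python
from collections import deque


class FreeBlockPool:
    """Sketch of O(1) free-block tracking with lazy deque cleanup."""

    def __init__(self, num_blocks: int):
        self.free_block_ids = deque(range(num_blocks))
        self.free_block_ids_set = set(range(num_blocks))

    def pop_free_block(self) -> int:
        # Lazy cleanup: ids already removed from the set (e.g. re-used via a
        # cache hit) are skipped here instead of paying O(n) deque.remove().
        while self.free_block_ids:
            block_id = self.free_block_ids.popleft()
            if block_id in self.free_block_ids_set:
                self.free_block_ids_set.discard(block_id)
                return block_id
        raise RuntimeError("no free blocks available")

    def mark_allocated(self, block_id: int) -> None:
        # Cache hit on a still-free block: drop it from the set only; the
        # deque copy becomes stale and is skipped by pop_free_block later.
        self.free_block_ids_set.discard(block_id)

    def release(self, block_id: int) -> None:
        self.free_block_ids.append(block_id)
        self.free_block_ids_set.add(block_id)
```

With this shape, a cache hit never pays a linear `deque.remove()`; the cost is amortized into later pops.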
if 0 < seq_len % self.block_size <= num_new_tokens or self.block_size == 1:
    needed_blocks = (seq_len + self.block_size - 1) // self.block_size
    while len(block_table) < needed_blocks:
        # For block_size == 1, we need to update hash for each new block
        # For block_size > 1, the previous block should have hash != -1 (unless it's the first block)
        if self.block_size == 1:
            # Allocate new block and update hash immediately (like allocate does for full blocks)
            block_id = self.free_block_ids[0]
            block = self._allocate_block(block_id)
            block_table.append(block_id)
            token_ids = [seq[-1]]
            prefix = (
                self.blocks[block_table[-2]].hash
                if len(block_table) > 1
                else -1
            )
            h = self.compute_hash(token_ids, prefix)
            block.update(h, token_ids)
            self.hash_to_block_id[h] = block_id
        else:
            # For block_size > 1, we only allocate new block when needed
            # The hash will be updated when the block becomes full
            block_id = self.free_block_ids[0]
            block = self._allocate_block(block_id)
            block_table.append(block_id)
            last_block = block
elif seq_len % self.block_size == 0:
    # Last block is now full, update its hash (similar to allocate)
    # TODO: fix hash
    token_ids = seq.block(seq.num_blocks - 1)
    if len(token_ids) == self.block_size:
        prefix = (
            self.blocks[block_table[-2]].hash if len(block_table) > 1 else -1
        )
        h = self.compute_hash(token_ids, prefix)
        last_block.update(h, token_ids)
        self.hash_to_block_id[h] = last_block.block_id
else:
    # Last block is not full and not at the boundary
    # Hash remains -1 until block is full (consistent with allocate logic)
    # assert last_block.hash == -1, last_block.block_id
    # Decode-generated blocks: token not finalized yet (depends on
    # sampling / speculative verification), so we cannot compute a
    # correct hash here. Just allocate the block without hashing.
    block_id = self._pop_free_block()
    self._allocate_block(block_id)
    block_table.append(block_id)
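The hunk above chains each full block's hash to its predecessor's hash via `compute_hash(token_ids, prefix)`, so two sequences can only share a cached block when their entire prefixes match. A self-contained illustration of that chaining (this `compute_block_hash` is a stand-in, not the project's actual `compute_hash`):

```python
import hashlib
from typing import Sequence


def compute_block_hash(token_ids: Sequence[int], prefix_hash: int = -1) -> int:
    """Hash one block's tokens, chained to the previous block's hash.

    prefix_hash == -1 marks the first block (no predecessor), mirroring the
    sentinel used in the diff above.
    """
    h = hashlib.sha256()
    if prefix_hash != -1:
        # Mix in the predecessor's hash so equal token blocks at different
        # positions (or after different prefixes) get distinct hashes.
        h.update(prefix_hash.to_bytes(32, "little", signed=False))
    for t in token_ids:
        h.update(t.to_bytes(8, "little", signed=True))
    return int.from_bytes(h.digest(), "little")
```

Because the chain includes the full prefix, a `hash_to_block_id` lookup hit implies the whole preceding context matches, which is what makes reusing the block safe.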
cbcccb6 to fb6a927
Pull request overview
This PR appears to extend prefix-caching support across the KV block manager, scheduler metadata, and multiple attention paths (MHA/MLA/Deepseek indexer), and adds tests/scripts intended to validate caching behavior.
Changes:
- Add prefix-cache-aware metadata fields (`has_cached`, `num_cached_tokens`, mappings) to forward context and attention metadata builders.
- Update attention implementations (MHA/MLA/Deepseek sparse indexer) to gather and attend over cached + new KV during prefill.
- Extend BlockManager logic/tests for prefix caching, hash cleanup, and multi-token decode block allocation.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| atom/model_engine/block_manager.py | Adds free-block set tracking, cache-aware can_allocate, multi-token can_append, and changes allocation behavior. |
| atom/model_engine/scheduler.py | Tracks cache hit stats; updates decode scheduling to request multi-token capacity. |
| atom/utils/forward_context.py | Adds prefix-cache fields to AttentionMetaData. |
| atom/model_ops/attentions/backends.py | Builds prefill metadata supporting cached prefixes (cu_seqlens_q/k, token mappings, block tables). |
| atom/model_ops/base_attention.py | Extends cp_mha_gather_cache to support multiple KV cache layouts. |
| atom/model_ops/attentions/aiter_mla.py | Adjusts sparse indexer metadata and kv-indices generation for cached prefixes. |
| atom/model_ops/attention_mla.py | Adds prefix-cache prefill path using gather_kv_b_proj + varlen flash attention. |
| atom/model_ops/attention_mha.py | Gathers cached KV from paged cache and concatenates with new tokens during prefill. |
| atom/models/deepseek_v2.py | Updates sparse attention indexer to gather full KV when cached prefixes exist. |
| tests/test_block_manager.py | Adds additional prefix-caching tests, hash cleanup tests, and multi-token can_append tests. |
| tests/test_prefix_cache_accuracy.py | Adds a standalone HTTP-based accuracy/caching workload script under tests/. |
del self.hash_to_block_id[block.hash]
block.reset()
-self.free_block_ids.remove(block_id)
+self.free_block_ids_set.discard(block_id)
needed_blocks = (
    seq_len + num_new_tokens + self.block_size - 1
) // self.block_size
new_blocks_needed = max(0, needed_blocks - current_blocks)
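The capacity check above is plain ceiling division over the total (existing + appended) token count. A standalone version, with `block_size = 16` chosen only for the example:

```python
def blocks_needed(seq_len: int, num_new_tokens: int, block_size: int) -> int:
    # Ceiling division: how many blocks the sequence spans after appending.
    return (seq_len + num_new_tokens + block_size - 1) // block_size


# 30 existing tokens + 3 new tokens with block_size 16 span 3 blocks,
# so a sequence currently holding 2 blocks needs 1 more.
extra = max(0, blocks_needed(30, 3, 16) - 2)
```

The `max(0, ...)` guard matters for multi-token (speculative) decode: when the appended tokens still fit in the last partially filled block, no new allocation is required.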
fb6a927 to 5f0026e
)
if attn_metadata.has_cached:
    # Full context (cached + new): use cu_seqlens_k for indexer
    cu_seqlens_k_np = attn_metadata.cu_seqlens_k.cpu().numpy()
`.cu_seqlens_k.cpu()` will introduce a D2H copy.
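One common way to address the device-to-host sync the reviewer flags is to build `cu_seqlens` on the host during metadata preparation and keep both copies, so later consumers read the host array instead of calling `.cpu()` on the device tensor. A sketch under that assumption (the function and return shape are illustrative, not the project's API):

```python
import numpy as np
import torch


def build_cu_seqlens(seq_lens: list[int], device: str = "cuda"):
    """Build cumulative sequence lengths once on the host, then upload.

    Keeping the host-side array lets downstream code (e.g. indexer metadata)
    index it directly, avoiding a blocking device-to-host copy later.
    """
    cu_seqlens_np = np.zeros(len(seq_lens) + 1, dtype=np.int32)
    cu_seqlens_np[1:] = np.cumsum(np.asarray(seq_lens, dtype=np.int32))
    # Upload once; non_blocking avoids stalling when the source is pinned.
    cu_seqlens = torch.from_numpy(cu_seqlens_np).to(device, non_blocking=True)
    return cu_seqlens, cu_seqlens_np
```

Note that `.item()` on a device tensor (as in the `total_kv` hunk below) forces the same kind of synchronization, so the host copy helps there as well.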
atom/models/deepseek_v2.py (Outdated)
num_tokens = hidden_states.shape[0]
# When has_cached, gather full KV (cached + new) for indexer top-k
total_kv = (
    prefill_metadata.cu_seqlens_k[-1].item()
Pull request overview
Adds support for using prefix-cache hits during prefill so attention/indexer paths can operate over “cached + newly computed” KV, while also improving block allocation logic and adding related tests.
Changes:
- Extend attention metadata to carry prefix-cache-related fields (e.g., `has_cached`, `total_kv`, cached-token mappings).
- Update attention/indexer implementations to gather and use full KV context when prefixes are cached.
- Enhance block manager allocation/append logic for prefix caching and multi-token decode scheduling; add new unit tests and a standalone accuracy script.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_prefix_cache_accuracy.py | Adds a standalone script intended to validate high cache-hit behavior against a running server. |
| tests/test_block_manager.py | Adds unit tests covering prefix-cache-aware allocation, hash-table cleanup, multi-token append logic, and edge cases. |
| atom/utils/forward_context.py | Extends AttentionMetaData with prefix-cache fields used by attention/indexer paths. |
| atom/models/deepseek_v2.py | Adjusts sparse indexer KV gathering to support cached + new KV during prefill. |
| atom/model_ops/base_attention.py | Updates cache-gather helper to support multiple KV cache layouts. |
| atom/model_ops/attentions/backends.py | Builds additional prefill metadata for prefix cache (cu-seqlens, token-to-batch mapping, etc.). |
| atom/model_ops/attentions/aiter_mla.py | Updates sparse MLA prefill metadata generation to account for cached prefixes. |
| atom/model_ops/attention_mla.py | Enables prefix-cache-aware prefill KV gathering and uses full-KV cu-seqlens in relevant paths. |
| atom/model_ops/attention_mha.py | Gathers cached KV from paged cache and concatenates with new KV for attention compute when prefixes hit. |
| atom/model_engine/scheduler.py | Adds cache hit stats tracking and updates decode scheduling to reserve blocks for multi-token decode. |
| atom/model_engine/block_manager.py | Introduces set-based free-block tracking + prefix-cache-aware can_allocate and multi-token can_append. |
if block_id == -1 or self.blocks[block_id].token_ids != token_ids:
    cache_miss = True
if cache_miss:
-    block_id = self.free_block_ids[0]
+    block_id = self._pop_free_block()
    block = self._allocate_block(block_id)
else:
Pull request overview
This PR adds support for prefill prefix-cache hits by propagating new caching metadata through the attention stack, gathering cached KV for compute where needed, and expanding unit/integration-style tests around prefix caching behavior.
Changes:
- Extend `AttentionMetaData` and attention backend builders to carry prefix-cache related metadata (e.g., cached tokens, total KV, token→batch mapping).
- Update model attention implementations to handle prefill runs where some KV comes from cache (e.g., DeepSeekV2 indexer sizing, MLA/MHA KV gathering paths).
- Enhance scheduler/block manager behavior and add tests for prefix-caching allocation/append edge cases plus a standalone accuracy script.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| atom/model_engine/block_manager.py | Adds free-block set tracking, prefix-aware can_allocate, multi-token can_append, and hash eviction logic. |
| atom/model_engine/scheduler.py | Adds periodic prefix-cache stats and updates decode append capacity checks for multi-token decode. |
| atom/utils/forward_context.py | Extends AttentionMetaData with prefix-cache fields used by attention implementations/builders. |
| atom/model_ops/attentions/backends.py | Builds prefill metadata with cached tokens and adds buffers needed to gather cached KV. |
| atom/model_ops/attentions/aiter_mla.py | Updates sparse/prefill metadata computation to account for cached prefix. |
| atom/model_ops/base_attention.py | Extends cp_mha_gather_cache to support multiple KV cache layouts. |
| atom/model_ops/attention_mha.py | Adds prefill path to gather cached KV from paged cache and concat with newly written tokens. |
| atom/model_ops/attention_mla.py | Adds prefix-cache-enabled MLA prefill path using gathered full KV; adjusts sparse indexing inputs. |
| atom/models/deepseek_v2.py | Adjusts sparse indexer KV gather sizing and seqlens selection when prefix cache is present. |
| tests/test_block_manager.py | Adds new unit tests around prefix-caching allocation correctness, hash cleanup, and multi-token append. |
| tests/test_prefix_cache_accuracy.py | Adds a standalone HTTP-driven accuracy/workload script for prefix caching behavior. |
# Evict stale hash entry before resetting
if block.hash != -1 and self.hash_to_block_id.get(block.hash) == block_id:
    del self.hash_to_block_id[block.hash]
block.reset()
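The ownership check in this hunk matters because `hash_to_block_id` may have already been re-pointed at a newer block holding the same content hash: deleting the entry unconditionally would evict a live mapping (the hash-table leak's mirror-image bug). A minimal illustration, with names mirroring the diff and the dict contents invented for the example:

```python
# Hash 0xABC currently maps to the *newer* block 7; block 3 once held it.
hash_to_block_id = {0xABC: 7}


def evict(block_id: int, block_hash: int) -> None:
    # Delete only if the mapping still points at the block being reset;
    # otherwise block 7's live entry would be lost.
    if block_hash != -1 and hash_to_block_id.get(block_hash) == block_id:
        del hash_to_block_id[block_hash]


evict(3, 0xABC)  # stale block 3: mapping untouched
assert hash_to_block_id == {0xABC: 7}
evict(7, 0xABC)  # owning block 7: mapping removed
```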
a2190e0 to d601bd6
Fix GPU memory access fault caused by double conversion of block_tables in the cached prefill path. kv_indices_generate_triton applies block_ratio internally, but was receiving already-converted block_tables (via block_tables_converted), causing indices to be multiplied by block_ratio twice (e.g. block_id*256 instead of block_id*16), exceeding KV cache bounds.

Key changes:
- Use raw block_tables for kv_indices generation in aiter_mla prefill
- Route cached prefill through paged MLA attention (supports Q≠K) instead of flash_attn_varlen_func (requires Q==K)
- Track has_cached flag through AttentionMetaData for path selection
- Fix block_manager: hash table leak, can_allocate cache-hit accounting, can_append for multi-token decode, O(1) free block tracking
- Add CacheStats to scheduler for prefix cache hit rate monitoring
- Add comprehensive block_manager tests (119 passing)

Verified: gsm8k 1319 samples, 95.83% accuracy, 0 GPU faults.
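The double-conversion fault described in the commit message can be reproduced with plain integers (block_ratio = 16 is taken from the message's own example; the function name is a placeholder for the conversion step, not the real kernel):

```python
def convert_block_table(block_ids: list[int], block_ratio: int) -> list[int]:
    # Converts logical block ids to physical KV-cache base indices.
    return [b * block_ratio for b in block_ids]


BLOCK_RATIO = 16

raw = [0, 1, 2]
# Correct: the Triton kernel applies block_ratio exactly once internally.
once = convert_block_table(raw, BLOCK_RATIO)    # [0, 16, 32]
# Bug: feeding pre-converted tables to a kernel that converts again
# scales by block_ratio twice (16 * 16 = 256), overrunning the cache.
twice = convert_block_table(once, BLOCK_RATIO)  # [0, 256, 512]
```

This is why the fix passes the raw `block_tables` into kv_indices generation: one and only one component owns the logical-to-physical conversion.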
d601bd6 to 609dbf3
Pull request overview
Adds support for prefix-cache hits during prefill by carrying cached-context metadata through attention prep and (when needed) gathering full KV (cached + new) for attention/indexing. Also tightens block-management behavior and extends tests around prefix caching and multi-token decode.
Changes:
- Extend `AttentionMetaData` and attention-metadata builders to track cached-prefill state (`has_cached`, `total_kv`, `num_cached_tokens`, `seq_starts`) and build correct `cu_seqlens_q/k`.
- Update MHA/MLA attention paths (and DeepSeek v2 indexer path) to operate on full KV length when prefix-cache hits exist.
- Improve `BlockManager` free-block bookkeeping and add unit tests for prefix-caching allocation, hash cleanup, and multi-token decode capacity checks; add cache hit-rate logging in the scheduler.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_prefix_cache_accuracy.py | Adds a standalone accuracy/load script for prefix-cache behavior against a running server. |
| tests/test_block_manager.py | Adds tests for can_allocate with cache hits, hash-table cleanup, multi-token can_append, and preemption/cache reuse. |
| atom/utils/forward_context.py | Extends AttentionMetaData with prefix-cache metadata fields. |
| atom/models/deepseek_v2.py | Makes sparse indexer allocate/gather based on full KV length when cached prefixes exist. |
| atom/model_ops/base_attention.py | Extends cp_mha_gather_cache to correctly handle both NHD and SHUFFLE cache layouts. |
| atom/model_ops/attentions/backends.py | Builds correct prefill metadata for cached prefixes (positions/slot mapping, cu_seqlens_q/k, seq_starts, etc.). |
| atom/model_ops/attentions/aiter_mla.py | Fixes sparse prefill metadata (cu_seqlen_ks/ke, token_to_seq_idxs) and ensures KV indices setup works with caching. |
| atom/model_ops/attention_mla.py | Adds a prefix-cache prefill path that gathers full KV via gather_kv_b_proj before attention. |
| atom/model_ops/attention_mha.py | Adds prefix-cache KV gather+concat in server-mode MHA via cp_mha_gather_cache. |
| atom/model_engine/scheduler.py | Adds CacheStats logging and updates decode scheduling to consider multi-token appends (spec decode). |
| atom/model_engine/block_manager.py | Adds free_block_ids_set, lazy deque cleanup, cache-aware can_allocate, and multi-token can_append. |
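The CacheStats hit-rate logging the summary attributes to the scheduler can be sketched as a pair of counters sampled periodically (the class shape and field names are assumptions; the PR does not show its implementation):

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Hypothetical sketch of prefix-cache hit-rate tracking per log interval."""

    queried_tokens: int = 0
    cached_tokens: int = 0

    def record(self, num_prompt_tokens: int, num_cached_tokens: int) -> None:
        # Called once per scheduled prefill request.
        self.queried_tokens += num_prompt_tokens
        self.cached_tokens += num_cached_tokens

    @property
    def hit_rate(self) -> float:
        # Fraction of prompt tokens served from the prefix cache.
        if self.queried_tokens == 0:
            return 0.0
        return self.cached_tokens / self.queried_tokens
```

A scheduler would typically log `hit_rate` every N steps and then reset the counters, keeping the metric windowed rather than cumulative.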
import requests
if block_id == -1 or self.blocks[block_id].token_ids != token_ids:
    cache_miss = True
if cache_miss:
-    block_id = self.free_block_ids[0]
+    block_id = self._pop_free_block()
    block = self._allocate_block(block_id)