
support prefill prefix cache #286

Open
jiayyu wants to merge 17 commits into main from ds_prefix_cache2

Conversation

@jiayyu (Contributor) commented Mar 9, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

jiayyu force-pushed the ds_prefix_cache2 branch 3 times, most recently from bf02a43 to e17b614, on March 13, 2026 07:55
jiayyu marked this pull request as ready for review on March 13, 2026 07:56
Copilot AI review requested due to automatic review settings March 13, 2026 07:56
Copilot AI (Contributor) left a comment:

Pull request overview

This PR extends prefix-caching support through the scheduler, metadata builders, and attention implementations (including DeepSeek-v2/MLA paths), and adds unit tests for cache-aware block management behavior.

Changes:

  • Add prefix-cache metadata plumbing (has_cached, num_cached_tokens, etc.) and use it to gather cached+new KV for attention/indexing.
  • Update BlockManager/scheduler logic to account for cache hits and multi-token decode allocation.
  • Add/extend tests covering prefix-cache allocation behavior and hash-table cleanup.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

Summary per file:

  • tests/test_prefix_cache_accuracy.py — Adds an integration-style prefix-cache accuracy driver script under tests/.
  • tests/test_block_manager.py — Adds new unit tests for cache-aware allocation, hash cleanup, and multi-token append scenarios.
  • atom/utils/forward_context.py — Extends AttentionMetaData with prefix-cache-related fields.
  • atom/models/deepseek_v2.py — Adjusts sparse indexer to consider full KV length when prefix cache is present.
  • atom/model_ops/base_attention.py — Extends cp_mha_gather_cache to support multiple KV cache layouts.
  • atom/model_ops/attentions/backends.py — Builds prefill metadata accounting for cached tokens; adds token-to-batch mapping for cache gather.
  • atom/model_ops/attentions/aiter_mla.py — Updates MLA prefill metadata generation to use full-context lengths when prefix cache is present.
  • atom/model_ops/attention_mla.py — Adds prefix-cache path to gather full KV and run varlen flash-attn prefill.
  • atom/model_ops/attention_mha.py — Adds prefix-cache KV gather+concat path for MHA via cp_mha_gather_cache.
  • atom/model_engine/scheduler.py — Adds cache hit-rate stats logging; updates decode scheduling to reserve multi-token space.
  • atom/model_engine/block_manager.py — Adds free-block tracking set + cache-aware can_allocate; updates can_append to support multi-token appends.


Comment on lines 115 to 118:

```diff
 if cache_miss:
-    block_id = self.free_block_ids[0]
+    block_id = self._pop_free_block()
     block = self._allocate_block(block_id)
 else:
```
Comment on lines 171 to +179:

```python
if 0 < seq_len % self.block_size <= num_new_tokens or self.block_size == 1:
    needed_blocks = (seq_len + self.block_size - 1) // self.block_size
    while len(block_table) < needed_blocks:
        # For block_size == 1, we need to update hash for each new block
        # For block_size > 1, the previous block should have hash != -1
        # (unless it's the first block)
        if self.block_size == 1:
            # Allocate new block and update hash immediately
            # (like allocate does for full blocks)
            block_id = self.free_block_ids[0]
            block = self._allocate_block(block_id)
            block_table.append(block_id)
            token_ids = [seq[-1]]
            prefix = (
                self.blocks[block_table[-2]].hash
                if len(block_table) > 1
                else -1
            )
            h = self.compute_hash(token_ids, prefix)
            block.update(h, token_ids)
            self.hash_to_block_id[h] = block_id
        else:
            # For block_size > 1, we only allocate a new block when needed
            # The hash will be updated when the block becomes full
            block_id = self.free_block_ids[0]
            block = self._allocate_block(block_id)
            block_table.append(block_id)
            last_block = block
elif seq_len % self.block_size == 0:
    # Last block is now full, update its hash (similar to allocate)
    # TODO: fix hash
    token_ids = seq.block(seq.num_blocks - 1)
    if len(token_ids) == self.block_size:
        prefix = (
            self.blocks[block_table[-2]].hash if len(block_table) > 1 else -1
        )
        h = self.compute_hash(token_ids, prefix)
        last_block.update(h, token_ids)
        self.hash_to_block_id[h] = last_block.block_id
else:
    # Last block is not full and not at the boundary.
    # Hash remains -1 until block is full (consistent with allocate logic)
    # assert last_block.hash == -1, last_block.block_id
    pass
# Decode-generated blocks: token not finalized yet (depends on
# sampling / speculative verification), so we cannot compute a
# correct hash here. Just allocate the block without hashing.
block_id = self._pop_free_block()
self._allocate_block(block_id)
block_table.append(block_id)
```
Copilot AI review requested due to automatic review settings March 16, 2026 03:26
jiayyu force-pushed the ds_prefix_cache2 branch from cbcccb6 to fb6a927 on March 16, 2026 03:26
Copilot AI (Contributor) left a comment:

Pull request overview

This PR appears to extend prefix-caching support across the KV block manager, scheduler metadata, and multiple attention paths (MHA/MLA/Deepseek indexer), and adds tests/scripts intended to validate caching behavior.

Changes:

  • Add prefix-cache-aware metadata fields (has_cached, num_cached_tokens, mappings) to forward context and attention metadata builders.
  • Update attention implementations (MHA/MLA/Deepseek sparse indexer) to gather and attend over cached + new KV during prefill.
  • Extend BlockManager logic/tests for prefix caching, hash cleanup, and multi-token decode block allocation.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

Summary per file:

  • atom/model_engine/block_manager.py — Adds free-block set tracking, cache-aware can_allocate, multi-token can_append, and changes allocation behavior.
  • atom/model_engine/scheduler.py — Tracks cache hit stats; updates decode scheduling to request multi-token capacity.
  • atom/utils/forward_context.py — Adds prefix-cache fields to AttentionMetaData.
  • atom/model_ops/attentions/backends.py — Builds prefill metadata supporting cached prefixes (cu_seqlens_q/k, token mappings, block tables).
  • atom/model_ops/base_attention.py — Extends cp_mha_gather_cache to support multiple KV cache layouts.
  • atom/model_ops/attentions/aiter_mla.py — Adjusts sparse indexer metadata and kv-indices generation for cached prefixes.
  • atom/model_ops/attention_mla.py — Adds prefix-cache prefill path using gather_kv_b_proj + varlen flash attention.
  • atom/model_ops/attention_mha.py — Gathers cached KV from paged cache and concatenates with new tokens during prefill.
  • atom/models/deepseek_v2.py — Updates sparse attention indexer to gather full KV when cached prefixes exist.
  • tests/test_block_manager.py — Adds additional prefix-caching tests, hash cleanup tests, and multi-token can_append tests.
  • tests/test_prefix_cache_accuracy.py — Adds a standalone HTTP-based accuracy/caching workload script under tests/.


```diff
 del self.hash_to_block_id[block.hash]
 block.reset()
-self.free_block_ids.remove(block_id)
+self.free_block_ids_set.discard(block_id)
```
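The `free_block_ids` / `free_block_ids_set` pairing in this diff can be sketched as follows. This is a hypothetical `FreeBlockPool` matching the names in the diff, not the project's actual class: the deque keeps FIFO order, the set gives O(1) membership, and stale deque entries are skipped lazily on pop instead of paying O(n) for `deque.remove`.

```python
from collections import deque

class FreeBlockPool:
    def __init__(self, num_blocks: int):
        self.free_block_ids = deque(range(num_blocks))
        self.free_block_ids_set = set(range(num_blocks))

    def discard(self, block_id: int) -> None:
        # O(1): only the set is updated; the matching deque entry
        # becomes stale and is cleaned up lazily at pop time.
        self.free_block_ids_set.discard(block_id)

    def free(self, block_id: int) -> None:
        self.free_block_ids.append(block_id)
        self.free_block_ids_set.add(block_id)

    def _pop_free_block(self) -> int:
        # Lazily drop deque entries that were discarded via the set.
        while self.free_block_ids:
            block_id = self.free_block_ids.popleft()
            if block_id in self.free_block_ids_set:
                self.free_block_ids_set.remove(block_id)
                return block_id
        raise RuntimeError("no free blocks")

pool = FreeBlockPool(4)
pool.discard(0)                      # block 0 claimed by a cache hit
assert pool._pop_free_block() == 1   # stale entry for 0 is skipped
```

The trade-off is that the deque can temporarily hold stale ids, but each stale entry is visited at most once, so amortized cost stays O(1) per operation.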
Comment on lines +159 to +162:

```python
needed_blocks = (
    seq_len + num_new_tokens + self.block_size - 1
) // self.block_size
new_blocks_needed = max(0, needed_blocks - current_blocks)
```
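The ceil-division above can be checked with a small worked example (the parameter values below are illustrative, not taken from the PR):

```python
def new_blocks_needed(seq_len: int, num_new_tokens: int,
                      current_blocks: int, block_size: int) -> int:
    # Ceil-divide the post-append sequence length by the block size ...
    needed_blocks = (seq_len + num_new_tokens + block_size - 1) // block_size
    # ... and only count blocks beyond those the sequence already holds.
    return max(0, needed_blocks - current_blocks)

# 30 tokens in 2 blocks of 16; appending 3 tokens crosses into a 3rd block.
assert new_blocks_needed(30, 3, 2, 16) == 1
# Appending 2 tokens (30 -> 32) still fits in the 2 existing blocks.
assert new_blocks_needed(30, 2, 2, 16) == 0
```

Handling `num_new_tokens > 1` here is what lets the decode path reserve space for multi-token appends (e.g. speculative decoding) instead of assuming one token per step.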
jiayyu force-pushed the ds_prefix_cache2 branch from fb6a927 to 5f0026e on March 16, 2026 11:20
jiayyu changed the title from "Ds prefix cache2" to "support prefill prefix cache" on Mar 17, 2026
```python
)
if attn_metadata.has_cached:
    # Full context (cached + new): use cu_seqlens_k for indexer
    cu_seqlens_k_np = attn_metadata.cu_seqlens_k.cpu().numpy()
```
Collaborator left a comment:

`.cu_seqlens_k.cpu()` — this will introduce a D2H copy.

```python
num_tokens = hidden_states.shape[0]
# When has_cached, gather full KV (cached + new) for indexer top-k
total_kv = (
    prefill_metadata.cu_seqlens_k[-1].item()
```
Collaborator left a comment:

D2H copy here as well (`.item()` on a device tensor forces a device-to-host sync).
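Both reviewer comments flag the same pattern: reading device-resident `cu_seqlens` back on the host inside the forward path. A common mitigation, sketched below with plain NumPy (the builder name and fields are hypothetical), is to build the cumulative lengths on the host once in the metadata builder, upload that array to the device a single time, and have forward-path code read scalars such as `total_kv` from the host copy instead of calling `.cpu()` / `.item()` per step:

```python
import numpy as np

def build_cu_seqlens(context_lens):
    """Host-side exclusive prefix sum over per-sequence KV lengths.

    In a real metadata builder this array would also be uploaded once
    to the device (e.g. torch.from_numpy(cu).to(device)); forward-path
    code then reads scalars from the host copy without a D2H sync.
    """
    cu = np.zeros(len(context_lens) + 1, dtype=np.int32)
    np.cumsum(context_lens, out=cu[1:])
    return cu

cu_seqlens_k_np = build_cu_seqlens([5, 3, 8])
total_kv = int(cu_seqlens_k_np[-1])   # pure host read, no device sync
assert cu_seqlens_k_np.tolist() == [0, 5, 8, 16]
assert total_kv == 16
```

Since the per-sequence lengths are already known on the CPU at scheduling time, keeping this host mirror costs nothing extra per forward pass.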

Copilot AI review requested due to automatic review settings March 18, 2026 04:48
Copilot AI (Contributor) left a comment:

Pull request overview

Adds support for using prefix-cache hits during prefill so attention/indexer paths can operate over “cached + newly computed” KV, while also improving block allocation logic and adding related tests.

Changes:

  • Extend attention metadata to carry prefix-cache-related fields (e.g., has_cached, total_kv, cached-token mappings).
  • Update attention/indexer implementations to gather and use full KV context when prefixes are cached.
  • Enhance block manager allocation/append logic for prefix caching and multi-token decode scheduling; add new unit tests and a standalone accuracy script.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Summary per file:

  • tests/test_prefix_cache_accuracy.py — Adds a standalone script intended to validate high cache-hit behavior against a running server.
  • tests/test_block_manager.py — Adds unit tests covering prefix-cache-aware allocation, hash-table cleanup, multi-token append logic, and edge cases.
  • atom/utils/forward_context.py — Extends AttentionMetaData with prefix-cache fields used by attention/indexer paths.
  • atom/models/deepseek_v2.py — Adjusts sparse indexer KV gathering to support cached + new KV during prefill.
  • atom/model_ops/base_attention.py — Updates cache-gather helper to support multiple KV cache layouts.
  • atom/model_ops/attentions/backends.py — Builds additional prefill metadata for prefix cache (cu-seqlens, token-to-batch mapping, etc.).
  • atom/model_ops/attentions/aiter_mla.py — Updates sparse MLA prefill metadata generation to account for cached prefixes.
  • atom/model_ops/attention_mla.py — Enables prefix-cache-aware prefill KV gathering and uses full-KV cu-seqlens in relevant paths.
  • atom/model_ops/attention_mha.py — Gathers cached KV from paged cache and concatenates with new KV for attention compute when prefixes hit.
  • atom/model_engine/scheduler.py — Adds cache hit stats tracking and updates decode scheduling to reserve blocks for multi-token decode.
  • atom/model_engine/block_manager.py — Introduces set-based free-block tracking + prefix-cache-aware can_allocate and multi-token can_append.


Comment on lines 113 to 118
if block_id == -1 or self.blocks[block_id].token_ids != token_ids:
cache_miss = True
if cache_miss:
block_id = self.free_block_ids[0]
block_id = self._pop_free_block()
block = self._allocate_block(block_id)
else:
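The guard quoted above checks the stored `token_ids`, not just the hash lookup, so a hash collision or stale table entry degrades into a cache miss instead of silently reusing the wrong block's KV. A minimal sketch of that lookup (the structures here are simplified stand-ins for the names in the diff):

```python
def lookup_cached_block(hash_to_block_id, block_tokens, h, token_ids):
    """Return a reusable block id for (h, token_ids), or -1 on a miss.

    Comparing token_ids turns a hash collision into an ordinary cache
    miss rather than a silent correctness bug.
    """
    block_id = hash_to_block_id.get(h, -1)
    if block_id == -1 or block_tokens[block_id] != token_ids:
        return -1
    return block_id

block_tokens = {7: [1, 2, 3, 4]}        # block_id -> tokens it holds
table = {0xABC: 7}                      # hash -> block_id
assert lookup_cached_block(table, block_tokens, 0xABC, [1, 2, 3, 4]) == 7   # hit
assert lookup_cached_block(table, block_tokens, 0xABC, [1, 2, 3, 9]) == -1  # collision
assert lookup_cached_block(table, block_tokens, 0xDEF, [1, 2, 3, 4]) == -1  # miss
```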
Copilot AI review requested due to automatic review settings March 18, 2026 05:07
Copilot AI (Contributor) left a comment:

Pull request overview

This PR adds support for prefill prefix-cache hits by propagating new caching metadata through the attention stack, gathering cached KV for compute where needed, and expanding unit/integration-style tests around prefix caching behavior.

Changes:

  • Extend AttentionMetaData and attention backend builders to carry prefix-cache related metadata (e.g., cached tokens, total KV, token→batch mapping).
  • Update model attention implementations to handle prefill runs where some KV comes from cache (e.g., DeepSeekV2 indexer sizing, MLA/MHA KV gathering paths).
  • Enhance scheduler/block manager behavior and add tests for prefix-caching allocation/append edge cases plus a standalone accuracy script.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

Summary per file:

  • atom/model_engine/block_manager.py — Adds free-block set tracking, prefix-aware can_allocate, multi-token can_append, and hash eviction logic.
  • atom/model_engine/scheduler.py — Adds periodic prefix-cache stats and updates decode append capacity checks for multi-token decode.
  • atom/utils/forward_context.py — Extends AttentionMetaData with prefix-cache fields used by attention implementations/builders.
  • atom/model_ops/attentions/backends.py — Builds prefill metadata with cached tokens and adds buffers needed to gather cached KV.
  • atom/model_ops/attentions/aiter_mla.py — Updates sparse/prefill metadata computation to account for cached prefix.
  • atom/model_ops/base_attention.py — Extends cp_mha_gather_cache to support multiple KV cache layouts.
  • atom/model_ops/attention_mha.py — Adds prefill path to gather cached KV from paged cache and concat with newly written tokens.
  • atom/model_ops/attention_mla.py — Adds prefix-cache-enabled MLA prefill path using gathered full KV; adjusts sparse indexing inputs.
  • atom/models/deepseek_v2.py — Adjusts sparse indexer KV gather sizing and seqlens selection when prefix cache is present.
  • tests/test_block_manager.py — Adds new unit tests around prefix-caching allocation correctness, hash cleanup, and multi-token append.
  • tests/test_prefix_cache_accuracy.py — Adds a standalone HTTP-driven accuracy/workload script for prefix caching behavior.


```python
# Evict stale hash entry before resetting
if block.hash != -1 and self.hash_to_block_id.get(block.hash) == block_id:
    del self.hash_to_block_id[block.hash]
block.reset()
```
jiayyu force-pushed the ds_prefix_cache2 branch from a2190e0 to d601bd6 on March 18, 2026 05:42
valarLip and others added 9 commits March 18, 2026 13:48
Fix GPU memory access fault caused by double conversion of block_tables
in cached prefill path. kv_indices_generate_triton applies block_ratio
internally, but was receiving already-converted block_tables (via
block_tables_converted), causing indices to be multiplied by block_ratio
twice (e.g. block_id*256 instead of block_id*16), exceeding KV cache
bounds.

Key changes:
- Use raw block_tables for kv_indices generation in aiter_mla prefill
- Route cached prefill through paged MLA attention (supports Q≠K)
  instead of flash_attn_varlen_func (requires Q==K)
- Track has_cached flag through AttentionMetaData for path selection
- Fix block_manager: hash table leak, can_allocate cache-hit accounting,
  can_append for multi-token decode, O(1) free block tracking
- Add CacheStats to scheduler for prefix cache hit rate monitoring
- Add comprehensive block_manager tests (119 passing)

Verified: gsm8k 1319 samples, 95.83% accuracy, 0 GPU faults.
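The "gather cached + new KV" path this commit message describes can be illustrated with plain NumPy. The layout and helper name below are assumptions for illustration only (not the `cp_mha_gather_cache` signature): cached tokens are gathered from the paged cache through the block table, then concatenated with the newly computed KV before the attention call.

```python
import numpy as np

def gather_cached_kv(kv_cache, block_table, num_cached, block_size):
    """Gather the first num_cached tokens of one sequence from a paged
    cache laid out as [num_blocks, block_size, head_dim]."""
    out = []
    for i in range(num_cached):
        blk = block_table[i // block_size]     # physical block id
        out.append(kv_cache[blk, i % block_size])
    return np.stack(out)

block_size, head_dim = 4, 2
kv_cache = np.arange(8 * block_size * head_dim, dtype=np.float32).reshape(
    8, block_size, head_dim)
block_table = [5, 2]          # sequence's blocks in logical order
cached = gather_cached_kv(kv_cache, block_table, num_cached=6,
                          block_size=block_size)
new_kv = np.zeros((3, head_dim), dtype=np.float32)  # freshly computed tokens
full_kv = np.concatenate([cached, new_kv])          # cached + new, in order
assert full_kv.shape == (9, head_dim)
assert np.array_equal(cached[4], kv_cache[2, 0])    # 5th token lives in block 2
```

This also makes the double-conversion bug above concrete: the gather indexes physical blocks, so if `block_tables` has already been scaled by the block ratio, every index lands `block_ratio`× past the intended block and reads out of bounds.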
jiayyu force-pushed the ds_prefix_cache2 branch from d601bd6 to 609dbf3 on March 18, 2026 05:49
Copilot AI review requested due to automatic review settings March 18, 2026 07:45
Copilot AI (Contributor) left a comment:

Pull request overview

Adds support for prefix-cache hits during prefill by carrying cached-context metadata through attention prep and (when needed) gathering full KV (cached + new) for attention/indexing. Also tightens block-management behavior and extends tests around prefix caching and multi-token decode.

Changes:

  • Extend AttentionMetaData and attention-metadata builders to track cached-prefill state (has_cached, total_kv, num_cached_tokens, seq_starts) and build correct cu_seqlens_q/k.
  • Update MHA/MLA attention paths (and DeepSeek v2 indexer path) to operate on full KV length when prefix-cache hits exist.
  • Improve BlockManager free-block bookkeeping + add unit tests for prefix-caching allocation, hash cleanup, and multi-token decode capacity checks; add cache hit-rate logging in the scheduler.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Summary per file:

  • tests/test_prefix_cache_accuracy.py — Adds a standalone accuracy/load script for prefix-cache behavior against a running server.
  • tests/test_block_manager.py — Adds tests for can_allocate with cache hits, hash-table cleanup, multi-token can_append, and preemption/cache reuse.
  • atom/utils/forward_context.py — Extends AttentionMetaData with prefix-cache metadata fields.
  • atom/models/deepseek_v2.py — Makes sparse indexer allocate/gather based on full KV length when cached prefixes exist.
  • atom/model_ops/base_attention.py — Extends cp_mha_gather_cache to correctly handle both NHD and SHUFFLE cache layouts.
  • atom/model_ops/attentions/backends.py — Builds correct prefill metadata for cached prefixes (positions/slot mapping, cu_seqlens_q/k, seq_starts, etc.).
  • atom/model_ops/attentions/aiter_mla.py — Fixes sparse prefill metadata (cu_seqlen_ks/ke, token_to_seq_idxs) and ensures KV indices setup works with caching.
  • atom/model_ops/attention_mla.py — Adds a prefix-cache prefill path that gathers full KV via gather_kv_b_proj before attention.
  • atom/model_ops/attention_mha.py — Adds prefix-cache KV gather+concat in server-mode MHA via cp_mha_gather_cache.
  • atom/model_engine/scheduler.py — Adds CacheStats logging and updates decode scheduling to consider multi-token appends (spec decode).
  • atom/model_engine/block_manager.py — Adds free_block_ids_set, lazy deque cleanup, cache-aware can_allocate, and multi-token can_append.

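The CacheStats hit-rate logging mentioned for the scheduler can be sketched as a small running counter. This is a hypothetical shape for the class; the real implementation's fields and logging cadence may differ.

```python
class CacheStats:
    """Running prefix-cache counters for periodic hit-rate logging."""

    def __init__(self):
        self.cached_tokens = 0
        self.total_tokens = 0

    def record(self, num_cached: int, num_total: int) -> None:
        # Called once per scheduled prefill.
        self.cached_tokens += num_cached
        self.total_tokens += num_total

    @property
    def hit_rate(self) -> float:
        if self.total_tokens == 0:
            return 0.0
        return self.cached_tokens / self.total_tokens

stats = CacheStats()
stats.record(num_cached=96, num_total=128)   # prefill with 96 cached tokens
stats.record(num_cached=0, num_total=64)     # a cold prompt
assert abs(stats.hit_rate - 0.5) < 1e-9      # 96 / 192
```

Counting tokens rather than requests gives a truer picture for workloads mixing long cached prompts with short cold ones.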

Comment on lines +17 to +18:

```python
import requests
```
Comment on lines 113 to 117:

```diff
 if block_id == -1 or self.blocks[block_id].token_ids != token_ids:
     cache_miss = True
 if cache_miss:
-    block_id = self.free_block_ids[0]
+    block_id = self._pop_free_block()
     block = self._allocate_block(block_id)
```
Comment on lines +41 to +42:

```python
from atom.plugin import is_plugin_mode
```
Contributor left a comment:

⚠️ [ruff] <F811> reported by reviewdog 🐶
Redefinition of unused is_plugin_mode from line 25: is_plugin_mode redefined here

Suggested change (remove the duplicate import):

```diff
-from atom.plugin import is_plugin_mode
```

Comment on lines +43 to 44:

```python
from atom.plugin.attention_mla import MLAAttentionImplDecoratorForPluginMode
```
Contributor left a comment:

⚠️ [ruff] <F811> reported by reviewdog 🐶
Redefinition of unused MLAAttentionImplDecoratorForPluginMode from line 26: MLAAttentionImplDecoratorForPluginMode redefined here

Suggested change (remove the duplicate import):

```diff
-from atom.plugin.attention_mla import MLAAttentionImplDecoratorForPluginMode
```
