Support per_token KV Cache eviction by l-bat · Pull Request #1 · l-bat/nncf

l-bat · 2025-07-07T14:14:57Z

Features:

Support per-token KV Cache eviction (H2O, SnapKV)
Support per-group KV Cache eviction
Support keys rerotation
Support refined token selection algorithms (KVCrush, CriticalKV)
Evaluate LLMs on LongBench
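
For readers unfamiliar with key re-rotation, here is a minimal pure-Python sketch of the general idea, not the code from this PR. It assumes keys are cached after rotary position embedding (RoPE) has been applied, and that after eviction the surviving keys should look as if they sat at contiguous positions 0, 1, 2, ... The helper names `rope_rotate` and `rerotate_keys` are hypothetical, and the adjacent-pair layout used here is a simplification (Hugging Face models pair dimensions differently).

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate one key vector by `position` steps, pairing adjacent dimensions."""
    dim = len(vec)
    out = [0.0] * dim
    for i in range(0, dim, 2):
        # Pair (i, i + 1) uses frequency base^(-i / dim).
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * cos_t - y * sin_t
        out[i + 1] = x * sin_t + y * cos_t
    return out


def rerotate_keys(rotated_keys, kept_positions, base=10000.0):
    """Move each surviving key from its original position to its new slot.

    Rotations compose additively, so rotating an already-rotated key by
    (new_position - old_position) is equivalent to rotating the raw key
    by new_position.
    """
    return [
        rope_rotate(key, new_pos - old_pos, base)
        for new_pos, (key, old_pos) in enumerate(zip(rotated_keys, kept_positions))
    ]
```

Because the rotation only depends on the position delta, the raw (unrotated) keys never need to be stored.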

Usage example:

from transformers import AutoModelForCausalLM, AutoTokenizer
from nncf.quantization.advanced_parameters import KVCacheCompressionParameters, KVCacheCompressionMode
from nncf.quantization.algorithms.kv_cache_management.torch_backend import KVCacheCompressor

MODEL_NAME = "Qwen/Qwen3-0.6B"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, attn_implementation="eager")
prompt = "Once home, Mira spread out her findings across her bedroom floor . The map was rudimentary, marked with simple symbols: a sun, a tree, and an ominous 'X' at the end . It felt like a treasure map, and Mira's imagination began to race . After her parents went to bed, she gathered supplies: a flashlight, a notebook, and a snack for the journey . With her heart racing at the thought of adventure, she headed out into the cool night . The moon illuminated her path as Mira made her way up the hillside, following the map's directions . The night was quiet, with only the sound of rustling leaves and the distant hoot of an owl . As she climbed higher, she felt a growing sense of purpose . “The heart that beats beneath the stones,” she muttered, trying to decipher what the words could mean . After some time, she arrived at a clearing where the ground was carpeted with moss and dotted with smooth stones . The map indicated that she needed to look closely . Mira knelt down to inspect the area and, just as she was about to give up, she heard a soft thump, like the beat of a drum . Surprised, she looked around and found a particularly large stone slightly displaced from the others . The crystal became her talisman, reminding her of her promise and the magic of storytelling—a bridge between the ordinary and the extraordinary, where dreams take flight and every book waited to be opened ."
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

eviction_parameters = KVCacheCompressionParameters(
    algorithm=KVCacheCompressionMode.SNAPKV,
    window_size=8,
    strategy="per_token",
    start_size=32,
    recent_size=64,
    intermediate_size=128,
    score_aggregation="norm_sum"
)
compress = KVCacheCompressor(eviction_parameters=eviction_parameters)
with compress(model):
    outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256, use_cache=True)

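# Slice off the prompt by token count before decoding; slicing the decoded string by token count would cut at the wrong place.
new_tokens = outputs[0][inputs.input_ids.shape[-1]:]
generate_answer = tokenizer.decode(new_tokens, skip_special_tokens=True)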
print("Generated answer:", generate_answer)
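
The PR text does not spell out what `start_size`, `recent_size`, and `intermediate_size` mean. A plausible reading, sketched below as an assumption rather than the PR's actual logic, is that the first `start_size` tokens and the last `recent_size` tokens are always kept, while only the `intermediate_size` highest-scoring tokens in between survive. The function name `select_kept_indices` is hypothetical.

```python
def select_kept_indices(scores, start_size, recent_size, intermediate_size):
    """Return the sorted cache positions that survive eviction.

    `scores` holds one importance value per cached token (higher = keep).
    """
    seq_len = len(scores)
    budget = start_size + recent_size + intermediate_size
    if seq_len <= budget:
        # The cache still fits in the budget, so nothing is evicted.
        return list(range(seq_len))

    start = list(range(start_size))
    recent = list(range(seq_len - recent_size, seq_len))
    middle = range(start_size, seq_len - recent_size)

    # Keep the highest-scoring middle tokens, then restore positional order.
    top_middle = sorted(middle, key=lambda i: scores[i], reverse=True)
    top_middle = sorted(top_middle[:intermediate_size])
    return start + top_middle + recent
```

Under this reading, the example configuration (32 + 64 + 128) would cap the cache at 224 tokens per layer.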

Copilot

Pull Request Overview

Adds per-token KV cache eviction support (SnapKV, H2O) by defining new compression parameters and a compressor that hooks into model attention layers.

Introduce KVCacheCompressionMode enum and KVCacheCompressionParameters dataclass for eviction configuration
Implement KVCacheCompressor with scoring, index selection, and cache pruning logic in torch_backend.py
Update .ci/cspell_dict.txt to recognize the new “snapkv” term
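
As a rough illustration of the scoring step, the sketch below follows the general SnapKV idea: the last `window_size` queries of the prompt "vote" for key positions through their attention weights, and the votes are smoothed with 1D pooling so that neighbouring tokens are kept together. This is a simplified pure-Python sketch under stated assumptions (a single head, average pooling, a hypothetical `kernel_size` parameter), not the scoring code in `torch_backend.py`.

```python
def snapkv_scores(attn_rows, window_size, kernel_size=3):
    """Score key positions from the attention of the last `window_size` queries.

    `attn_rows[t][j]` is the attention weight of query t on key j; every row
    is expected to have the same length (masked positions hold 0.0).
    """
    window = attn_rows[-window_size:]
    seq_len = len(window[0])

    # Sum the attention each key receives from the observation window.
    votes = [sum(row[j] for row in window) for j in range(seq_len)]

    # Average pooling over a centred neighbourhood, clipped at the edges.
    half = kernel_size // 2
    pooled = []
    for j in range(seq_len):
        lo, hi = max(0, j - half), min(seq_len, j + half + 1)
        pooled.append(sum(votes[lo:hi]) / (hi - lo))
    return pooled
```

The resulting scores would then feed an index-selection step, and the keys and values at the discarded positions would be dropped from the cache.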

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
src/nncf/quantization/algorithms/kv_cache_management/torch_backend.py	Adds `KVCacheCompressor` class, scoring, index selection, and hook logic
src/nncf/quantization/algorithms/kv_cache_management/__init__.py	Add license header
src/nncf/quantization/advanced_parameters.py	Define `KVCacheCompressionMode` enum and `KVCacheCompressionParameters`
.ci/cspell_dict.txt	Add “snapkv” to the custom spell‐check dictionary

Comments suppressed due to low confidence (3)

src/nncf/quantization/advanced_parameters.py:472

The docstring lists supported values as 'sum' and 'max', but the code accepts 'sum' and 'norm_sum'. Please update the documentation to match the implemented options.

        when determining least important tokens for eviction from cache. Supported values are 'sum' and 'max'.
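
The difference between the two accepted values can be illustrated with a small sketch. It assumes `'norm_sum'` means "accumulated attention divided by how long the token has been in the cache", which is how OpenVINO GenAI describes its similarly named aggregation mode; whether this PR uses the same definition is an assumption. The function name `aggregate_scores` is hypothetical.

```python
def aggregate_scores(attn_rows, mode="sum"):
    """Aggregate causal attention into one importance score per key.

    `attn_rows[t]` is the attention row of query t over keys 0..t.
    """
    num_queries = len(attn_rows)
    totals = [0.0] * num_queries
    for row in attn_rows:
        for j, weight in enumerate(row):
            totals[j] += weight

    if mode == "sum":
        return totals
    if mode == "norm_sum":
        # Key j has been visible to queries j..num_queries-1, so dividing by
        # its lifetime removes the bias toward tokens that arrived earlier.
        return [total / (num_queries - j) for j, total in enumerate(totals)]
    raise ValueError(f"Unsupported score aggregation mode: {mode!r}")
```

With plain `'sum'`, early tokens accumulate more attention simply because more queries have seen them; the normalized variate compensates for that.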

src/nncf/quantization/algorithms/kv_cache_management/torch_backend.py:185

The docstring parameters (module, hidden_states, attentions) do not match the method signature (layer_idx, keys, values, kwargs). Please synchronize the docstring with the actual arguments.

        The core logic of the compression method.

src/nncf/quantization/algorithms/kv_cache_management/torch_backend.py:23

[nitpick] The new KVCacheCompressor introduces significant logic for score computation and pruning. Please add unit tests to cover its key behaviors (e.g., scoring, index selection, and compression).

class KVCacheCompressor:

Support per_token KV Cache eviction

0f63970

l-bat requested a review from Copilot July 8, 2025 06:50

Copilot AI reviewed Jul 8, 2025

Repository owner deleted a comment from Copilot AI Jul 8, 2025

l-bat added 15 commits July 8, 2025 10:30

Support per-group eviction

3b0f1d8

Support keys rerotation

f06282c

Support per-group KVCRush and CriticalKV algos

d30ccfc

Add LongBench validation

cd5a7d4

fix bugs

e9b0358

Add Sparse Prefill

72949d1

minor fixes

a1f9109

Add CDPruner, SparsePrefill

1640e16

Fix bug in LongBench with chat template

d665c1f

Add lambda from R-KV paper

c428659

Support R-KV and RPC for Reasoning models

f7c18a5

Add MATH500 and GSM8K sample

3ea3491

refactoring

c374096

CDPruner Refactoring

f3dfbc3

refactoring

b102d2a

l-bat wants to merge 16 commits into develop from lt/token_eviction
