Support per_token KV Cache eviction #1

Open
l-bat wants to merge 16 commits into develop from lt/token_eviction

@l-bat l-bat commented Jul 7, 2025

Features:

  • Support per-token KV Cache eviction (H2O, SnapKV)
  • Support per-group KV Cache eviction
  • Support key re-rotation
  • Support refined token selection algorithms (KVCrush, CriticalKV)
  • Evaluate LLMs on LongBench

Usage example:

from transformers import AutoModelForCausalLM, AutoTokenizer
from nncf.quantization.advanced_parameters import KVCacheCompressionParameters, KVCacheCompressionMode
from nncf.quantization.algorithms.kv_cache_management.torch_backend import KVCacheCompressor

MODEL_NAME = "Qwen/Qwen3-0.6B"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, attn_implementation="eager")
prompt = "Once home, Mira spread out her findings across her bedroom floor . The map was rudimentary, marked with simple symbols: a sun, a tree, and an ominous 'X' at the end . It felt like a treasure map, and Mira's imagination began to race . After her parents went to bed, she gathered supplies: a flashlight, a notebook, and a snack for the journey . With her heart racing at the thought of adventure, she headed out into the cool night . The moon illuminated her path as Mira made her way up the hillside, following the map's directions . The night was quiet, with only the sound of rustling leaves and the distant hoot of an owl . As she climbed higher, she felt a growing sense of purpose . “The heart that beats beneath the stones,” she muttered, trying to decipher what the words could mean . After some time, she arrived at a clearing where the ground was carpeted with moss and dotted with smooth stones . The map indicated that she needed to look closely . Mira knelt down to inspect the area and, just as she was about to give up, she heard a soft thump, like the beat of a drum . Surprised, she looked around and found a particularly large stone slightly displaced from the others . The crystal became her talisman, reminding her of her promise and the magic of storytelling—a bridge between the ordinary and the extraordinary, where dreams take flight and every book waited to be opened ."
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

eviction_parameters = KVCacheCompressionParameters(
    algorithm=KVCacheCompressionMode.SNAPKV,
    window_size=8,
    strategy="per_token",
    start_size=32,
    recent_size=64,
    intermediate_size=128,
    score_aggregation="norm_sum"
)
compress = KVCacheCompressor(eviction_parameters=eviction_parameters)
with compress(model):
    outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256, use_cache=True)

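# Decode only the newly generated tokens. Slice the token ids before decoding:
# slicing the decoded string by the prompt's token count would cut at the wrong place.
prompt_length = inputs.input_ids.shape[-1]
generated_answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print("Generated answer:", generated_answer)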
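
The thread does not document what `start_size`, `recent_size`, and `intermediate_size` mean. One plausible reading, sketched below as an assumption rather than as the PR's implementation, is that the first `start_size` and the last `recent_size` cache positions are always kept, and only the `intermediate_size` highest-scoring positions in between survive eviction. `select_kept_indices` is a hypothetical helper that does not exist in NNCF; it uses plain Python lists instead of tensors to stay self-contained.

```python
def select_kept_indices(scores, start_size, recent_size, intermediate_size):
    """Return the sorted cache positions to keep (hypothetical helper, not the PR's code).

    scores[i] is an importance score for cache position i; higher means more important.
    """
    n = len(scores)
    budget = start_size + intermediate_size + recent_size
    if n <= budget:
        # The cache already fits the assumed budget, so nothing is evicted.
        return list(range(n))

    start = list(range(start_size))            # always-kept leading tokens
    recent = list(range(n - recent_size, n))   # always-kept trailing tokens
    middle = range(start_size, n - recent_size)

    # Keep the highest-scoring middle positions; ties go to the earlier position.
    top_middle = sorted(middle, key=lambda i: (-scores[i], i))[:intermediate_size]
    return start + sorted(top_middle) + recent


scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]
kept = select_kept_indices(scores, start_size=1, recent_size=2, intermediate_size=2)
print(kept)
```

Under this reading, the settings in the usage example would cap the cache at 32 + 128 + 64 = 224 positions; whether the PR applies the budget per layer, per head, or per group is not stated here.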

@l-bat l-bat requested a review from Copilot July 8, 2025 06:50

Copilot AI left a comment

Pull Request Overview

Adds per-token KV cache eviction support (SnapKV, H2O) by defining new compression parameters and a compressor that hooks into model attention layers.

  • Introduce KVCacheCompressionMode enum and KVCacheCompressionParameters dataclass for eviction configuration
  • Implement KVCacheCompressor with scoring, index selection, and cache pruning logic in torch_backend.py
  • Update .ci/cspell_dict.txt to recognize the new “snapkv” term
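
The scoring step mentioned above can be illustrated with a SnapKV-style sketch: the attention that the last `window_size` query positions pay to each earlier key position is summed, and that sum serves as the key's importance score. This is a simplified, single-head illustration in plain Python, not the code in `torch_backend.py`; `snapkv_scores` is a hypothetical name, and the pooling step that SnapKV applies to the scores is omitted.

```python
def snapkv_scores(attn, window_size):
    """Score the key positions that precede the observation window (hypothetical helper).

    attn[q][k] is the attention weight of query position q on key position k
    for a single head (causal, so each row sums to 1).
    """
    n = len(attn)
    prefix_len = n - window_size
    # Only the last `window_size` queries vote on the importance of earlier keys.
    window_rows = attn[prefix_len:]
    return [sum(row[k] for row in window_rows) for k in range(prefix_len)]


attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0],
    [0.1, 0.6, 0.1, 0.2],
]
prefix_scores = snapkv_scores(attn, window_size=2)
print(prefix_scores)  # roughly [0.3, 0.9]
```

The positions inside the observation window receive no score in this sketch because SnapKV keeps them unconditionally.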

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `src/nncf/quantization/algorithms/kv_cache_management/torch_backend.py` | Adds KVCacheCompressor class, scoring, index selection, and hook logic |
| `src/nncf/quantization/algorithms/kv_cache_management/__init__.py` | Add license header |
| `src/nncf/quantization/advanced_parameters.py` | Define KVCacheCompressionMode enum and KVCacheCompressionParameters |
| `.ci/cspell_dict.txt` | Add “snapkv” to the custom spell-check dictionary |

Comments suppressed due to low confidence (3)

src/nncf/quantization/advanced_parameters.py:472

  • The docstring lists supported values as 'sum' and 'max', but the code accepts 'sum' and 'norm_sum'. Please update the documentation to match the implemented options.
        when determining least important tokens for eviction from cache. Supported values are 'sum' and 'max'.
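
The thread names the two accepted aggregation values but does not define them. A minimal sketch of one possible interpretation, assuming that 'sum' accumulates the attention each token receives over all steps and that 'norm_sum' divides that total by the number of steps the token has spent in the cache; the PR does not confirm this, and `aggregate_scores` is a hypothetical helper.

```python
def aggregate_scores(step_scores, mode="sum"):
    """Aggregate per-step attention scores per cached token (hypothetical helper).

    step_scores[t] holds one score per token present in the cache at step t,
    so later rows are at least as long as earlier ones.
    """
    n = len(step_scores[-1])
    totals = [0.0] * n
    lifetimes = [0] * n
    for row in step_scores:
        for i, score in enumerate(row):
            totals[i] += score
            lifetimes[i] += 1  # number of steps token i has been in the cache

    if mode == "sum":
        return totals
    if mode == "norm_sum":
        # Assumed meaning: normalize by lifetime so older tokens are not
        # favored merely for having accumulated scores over more steps.
        return [total / life for total, life in zip(totals, lifetimes)]
    raise ValueError(f"Unsupported score aggregation mode: {mode!r}")


steps = [
    [0.5, 0.5],
    [0.4, 0.4, 0.2],
    [0.3, 0.3, 0.2, 0.2],
]
summed = aggregate_scores(steps, mode="sum")
normalized = aggregate_scores(steps, mode="norm_sum")
print(summed, normalized)
```

In this example the plain sum ranks the newest token last, while the normalized scores put it level with the token added one step earlier.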

src/nncf/quantization/algorithms/kv_cache_management/torch_backend.py:185

  • The docstring parameters (module, hidden_states, attentions) do not match the method signature (layer_idx, keys, values, kwargs). Please synchronize the docstring with the actual arguments.
        The core logic of the compression method.

src/nncf/quantization/algorithms/kv_cache_management/torch_backend.py:23

  • [nitpick] The new KVCacheCompressor introduces significant logic for score computation and pruning. Please add unit tests to cover its key behaviors (e.g., scoring, index selection, and compression).
class KVCacheCompressor:

Repository owner deleted a comment from Copilot AI Jul 8, 2025