Support for DeepseekV32ForCausalLM with DeepSeek Sparse Attention (DSA) #21149
fairydreaming wants to merge 52 commits into ggml-org:master from …
Conversation
- …e attention). Needs manual change of add_bos_token to true in tokenizer_config.json before conversion.
- …I think it's best not to quantize them.
- …er implementation
- …indexer implementation, since the former fails for large tensors even when using CCCL.
- … of llama_kv_cache and new llama_ik_cache (lightning indexer key cache). model : used new llama_kv_cache_dsa instead of modified llama_kv_cache with indexer keys in DeepseekV32ForCausalLM. model : removed non-MLA path in DeepseekV32ForCausalLM.
- …lar to torch scatter_ operation.
- …e can get rid of ggml_cast() calls in sparse attention implementation
- …rm implementations
- …orCausalLM-based models.
- …lues in test-llama-archs to prevent crashes due to unhandled values in lightning indexer CUDA kernel
- …_set_rows() by using 1-element rows. ggml, tests : remove GGML_OP_SCATTER
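One of the commits above mentions an op that works like torch's `scatter_`, used to set tensor elements at specified indices to a scalar. A minimal NumPy sketch of those semantics, applied to the kind of attention-mask update such an op would serve — the helper name and shapes here are mine, not the actual ggml kernel:

```python
import numpy as np

def scatter_set(dst: np.ndarray, index: np.ndarray, value: float) -> np.ndarray:
    """Set dst elements at the given per-row indices to a scalar value
    along the last dimension (similar in spirit to torch scatter_ with
    a scalar source). Hypothetical helper, not the real implementation."""
    out = dst.copy()
    rows = np.arange(dst.shape[0])[:, None]        # broadcast row ids
    out[rows, index] = value                       # write scalar at indices
    return out

# e.g. unmask the selected token positions in an attention mask
mask = np.full((2, 5), -np.inf, dtype=np.float32)  # all positions masked
topk = np.array([[0, 2], [1, 4]])                  # selected indices per row
unmasked = scatter_set(mask, topk, 0.0)
```

Here `unmasked` has `0.0` at columns 0 and 2 of row 0 and columns 1 and 4 of row 1, with `-inf` everywhere else.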
I managed to get rid of …
I did some experiments trying to optimize long context inference on a small 4-layer DeepSeek V3.2 model:

*(results table comparing "No optimization" against "Using …" not preserved)*
Vulkanologists needed! @jeffbolznv @0cc4m
Some more info - looks like a problem in SET_ROWS.

Edit: I added this case (same type and dimensions) to test-backend-ops and it works, so it may be a different problem.
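For readers unfamiliar with the op being debugged here: a SET_ROWS-style operation writes whole rows from a source tensor into a destination tensor at arbitrary row indices (e.g. inserting new entries into a KV cache at specific slots). A rough NumPy sketch of that semantics, under my own assumed shapes:

```python
import numpy as np

def set_rows(dst: np.ndarray, src: np.ndarray, row_idx: np.ndarray) -> np.ndarray:
    """Write each row of src into dst at the row index given by row_idx.
    Illustrative sketch of SET_ROWS-like behavior, not the ggml kernel."""
    out = dst.copy()
    out[row_idx] = src
    return out

cache = np.zeros((4, 3), dtype=np.float32)             # 4 cache slots, dim 3
new = np.array([[1, 1, 1], [2, 2, 2]], dtype=np.float32)
updated = set_rows(cache, new, np.array([3, 0]))       # write into slots 3 and 0
```

Rows 1 and 2 of `updated` stay zero; slots 3 and 0 receive the new rows.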
I'll take a look.
For me it crashes inside of …
Sure, I'll look.
I got the same crash as in #21149 (comment). Looks like the destination tensor is in a host buffer, which I think is unexpected. But I'm not sure why that's happening.
I guess what's happening is the model building code is taking the kq_mask input and trying to set_rows in a view of it, but we shouldn't be writing to input tensors. This diff works around it, but I don't know the model building code well enough to say what the right fix really is. I don't think the vulkan backend is doing anything wrong, though maybe we should sanity-check that we're not trying to write to host buffers?
@jeffbolznv Hmm, but in the DSA attention implementation the first argument to …
Hmm, I see this is an f16 fill, which the vulkan backend doesn't currently support, so I guess that makes it still be an input tensor after the graph is split. I don't know how this is supposed to be handled - a tensor ends up being an input to the graph split and we're supposed to write to it? But ignoring this more general issue, I'll add f16 support for fill and see if it helps. BTW, I had asked Claude 4.7 about this and it was grinding away when I realized what was happening, but it came to the same conclusion about the missing f16 fill support. Nice!
#22177 should fix this. |
Overview
This PR adds support for DeepseekV32ForCausalLM (DeepSeek V3.2 Exp, DeepSeek V3.2, DeepSeek V3.2 Speciale) models. It contains an implementation of the lightning indexer and of DeepSeek Sparse Attention (DSA), both implemented in the simplest possible way as a proof of concept. So far only the CPU and CUDA backends are supported.
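The core idea described above (a lightning indexer scores keys per query, and attention is then restricted to the top-k highest-scoring keys) can be sketched in dense NumPy math. This is a toy illustration under my own assumed shapes and names, not the actual llama.cpp implementation:

```python
import numpy as np

def dsa_sketch(q, k, v, idx_scores, top_k):
    """Toy dense-math sketch of DeepSeek Sparse Attention: idx_scores
    plays the role of the lightning indexer's per-(query, key) relevance
    scores; each query attends only to its top_k best-scoring keys."""
    n_q, n_kv = idx_scores.shape
    # per query, indices of the top_k highest-scoring keys
    sel = np.argpartition(-idx_scores, top_k - 1, axis=-1)[:, :top_k]
    mask = np.full((n_q, n_kv), -np.inf, dtype=np.float32)
    np.put_along_axis(mask, sel, 0.0, axis=-1)      # unmask selected keys
    logits = q @ k.T / np.sqrt(q.shape[-1]) + mask  # masked attention scores
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # softmax over kept keys
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8)).astype(np.float32)
k = rng.standard_normal((16, 8)).astype(np.float32)
v = rng.standard_normal((16, 8)).astype(np.float32)
scores = rng.standard_normal((4, 16)).astype(np.float32)
out = dsa_sketch(q, k, v, scores, top_k=4)          # each query sees 4 of 16 keys
```

A real implementation would of course never materialize the dense mask for long contexts; this only shows the selection semantics.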
Due to the way it's currently implemented, it doesn't improve long-context performance yet; more work is needed for this.
Some GGUFs for testing are available here (-light models). I uploaded Q8_0/Q4_K_M quants, so you need over 700GB/400GB of RAM/VRAM to run them.
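As a rough sanity check of those sizes: assuming DeepSeek V3.2 has on the order of 671B parameters and the commonly cited average bit widths of ~8.5 bits/weight for Q8_0 and ~4.85 bits/weight for Q4_K_M (both figures are my assumptions, not measured file sizes):

```python
# Back-of-the-envelope GGUF size estimate (assumed values, not measured):
# ~671e9 parameters; Q8_0 ~8.5 bits/weight, Q4_K_M ~4.85 bits/weight.
params = 671e9
for name, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    gb = params * bpw / 8 / 1e9
    print(f"{name}: ~{gb:.0f} GB")
# -> roughly 713 GB and 407 GB, consistent with "over 700GB/400GB"
```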
I also created a 16GB baby DeepSeek V3.2 GGUF for VRAM-deprived people. It outputs incoherent gibberish, but should be useful for testing and optimizing this implementation even with limited resources.
I could really use some help with verifying the implementation's correctness. If you have a large GPU cluster and can run some benchmarks to compare results against the officially reported benchmark results for DeepSeek V3.2 models, go for it. More details in #21183.
Fixes #16331, #20363
Additional information
Decisions I made when implementing this:
- a new architecture `DEEPSEEK32` was added (mostly a copy of the existing `GLM_DSA` arch),
- `GGML_OP_SCATTER` was added as another new GGML op; it works similarly to the torch `scatter_` operation but is currently limited to setting tensor elements at specified indices to a given scalar value,
- the `GGML_OP_HADAMARD` implementation from llama : rotate activations for better quantization #21038 (with implementation borrowed from ik_llama.cpp, thx @ikawrakow) was used in the lightning indexer,
- a new `llama_kv_cache_dsa` class was added; it aggregates two caches: the usual `llama_kv_cache` that caches MLA latent representations (same as before for DeepSeek V3) and another new `llama_ik_cache` class (basically a copy of `llama_kv_cache` stripped of the code related to the V vector) that caches lightning indexer keys,
- since there are no official jinja templates for V3.2 and V3.2 Speciale, I simply decided to ignore this problem for now. You have to explicitly set the chat template for these models (using the jinja template from V3.2 Exp with these models will allow you to chat, but tool calls won't work correctly). PR chat: dedicated DeepSeek v3.2 parser + "official" template #21785 added a DeepSeek V3.2 chat template that you can use with `--chat-template-file models/templates/deepseek-ai-DeepSeek-V3.2.jinja`.

Requirements
Due to limitations of the current CUDA `ggml_top_k()` implementation, the NVIDIA CUDA CCCL library (version >3.2) and enabling `GGML_CUDA_USE_CUB` during CUDA backend compilation are needed; otherwise the CUDA implementation will crash for context sizes larger than (I think) 1024 tokens. I use it with CUDA 13.2 and CCCL 13.2.27.

The bug in `ggml_top_k()` is now fixed and the fix is merged, so it should work even on CUDA 12.[89] without CCCL.

Also, if you want to convert the model yourself, set `add_bos_token` to true in `tokenizer_config.json` before the model conversion - this is needed for DeepSeek V3.2 and DeepSeek V3.2 Speciale. The conversion script has an assert that checks this.

Next Steps