Changes from all commits (52 commits)
a337ebd
model : Initial support for DeepseekV32ForCausalLM (for now with dens…
sszymczy Mar 11, 2026
e467684
model : added indexer q and k calculation in DeepseekV32ForCausalLM.
sszymczy Mar 12, 2026
723f0ce
ggml : add Hadamard transform GGML OP and implementation
sszymczy Mar 12, 2026
72b7214
kv-cache : add cache for indexer keys (temporary solution)
sszymczy Mar 13, 2026
961bc95
convert : DSA indexer weights are bf16 in the original fp8 model, so …
sszymczy Mar 14, 2026
9a63e7a
model : crude proof-of-concept implementation of the DSA indexer for …
sszymczy Mar 14, 2026
3eb340e
ggml : add CUDA Hadamard transformation implementation (borrowed from…
sszymczy Mar 15, 2026
08dc7fd
ggml : add new GGML_OP_WHERE_ID (akin to torch where but using indices)
sszymczy Mar 15, 2026
998f496
model : used new GGML_OP_WHERE_ID op in DeepSeek V3.2 lightning index…
sszymczy Mar 15, 2026
6c9d773
model : handle multiple streams in DeepSeek V3.2 lightning indexer
sszymczy Mar 16, 2026
cb94b56
ggml : handle multiple streams in CUDA GGML_OP_WHERE_ID implementation
sszymczy Mar 16, 2026
02c2159
kv-cache : fix crashes for models without indexer
sszymczy Mar 16, 2026
e7aa89a
model : replaced ggml_argsort_top_k with ggml_top_k in DeepSeek V3.2 …
sszymczy Mar 22, 2026
1874ac9
model : added comments in DeepSeek V3.2 lightning indexer implementat…
sszymczy Mar 23, 2026
4309c84
kv-cache : added llama_kv_cache_dsa KV cache specific to DSA composed…
sszymczy Mar 24, 2026
9b0a4ee
ggml : replaced GGML_OP_WHERE_ID with GGML_OP_SCATTER that works simi…
sszymczy Mar 24, 2026
0ee5d80
ggml : added inplace version of GGML_OP_SCATTER and tests for this OP
sszymczy Mar 24, 2026
7f5578f
gguf-py : removed obsolete KV_B tensor from DEEPSEEK32 arch
sszymczy Mar 24, 2026
54945c7
convert : make pyright happy
sszymczy Mar 24, 2026
5677f08
ggml : added f16 version of GGML_OP_SCATTER
sszymczy Mar 25, 2026
1c830a1
ggml : added f16 version of GGML_OP_FILL
sszymczy Mar 25, 2026
83a0313
model : GGML_OP_SCATTER AND GGML_OP_FILL now work with f16 data, so w…
sszymczy Mar 25, 2026
6011bdd
ggml : fix bug in CUDA Hadamard transform implementation
sszymczy Mar 27, 2026
4aec6a8
ggml : simplified testing for nh being power of 2 in Hadamard transfo…
sszymczy Mar 27, 2026
a74d83a
ggml : added test for GGML_OP_HADAMARD
sszymczy Mar 27, 2026
5b9ce6c
convert : check if add_bos_token is true when converting DeepseekV32F…
sszymczy Mar 27, 2026
57a8def
Merge remote-tracking branch 'upstream/master' into deepseek-dsa
sszymczy Mar 31, 2026
6959bcf
graph : replaced llama_ik_cache with llama_kv_cache instance created …
sszymczy Apr 1, 2026
f443d0c
graph : implemented llm_graph_input_attn_k_dsa
sszymczy Apr 1, 2026
d3236d8
graph : renamed DSA-related suffixes, since in DSA-related classes _b…
sszymczy Apr 1, 2026
346c2b4
Merge remote-tracking branch 'upstream/master' into deepseek-dsa
sszymczy Apr 2, 2026
5086217
llama : handle LLM_ARCH_DEEPSEEK32 in test-llama-archs
sszymczy Apr 2, 2026
a7820f6
model : replace ggml_hadamard() in DEEPSEEK32 with Hadamard rotation …
sszymczy Apr 2, 2026
014e63c
ggml : added new GGML_OP_LIGHTNING_INDEXER that merges ggml_mul_mat()…
sszymczy Apr 8, 2026
3d61d0b
ggml : support lightning indexer key quantization
sszymczy Apr 12, 2026
ec083ce
Merge remote-tracking branch 'upstream/master' into deepseek-dsa
sszymczy Apr 13, 2026
65c3557
chore : fix trailing whitespaces
sszymczy Apr 13, 2026
5715c36
ggml : bump RPC version
sszymczy Apr 13, 2026
e0fcb22
chore : silence Python Type-Check CI error
sszymczy Apr 13, 2026
45294c5
ggml : more assertions in GGML_OP_SCATTER since there is no broadcasting
sszymczy Apr 13, 2026
66ec7be
model : corrected number of layers for 685B.A37B DeepseekV32ForCausal…
sszymczy Apr 13, 2026
039c88d
Merge remote-tracking branch 'upstream/master' into deepseek-dsa
sszymczy Apr 14, 2026
eea1a6e
llama : set lightning indexer head count and key dimension to real va…
sszymczy Apr 14, 2026
5dc8a87
chore : whitespace formatting, comments
sszymczy Apr 14, 2026
d9a1703
chore : comments
sszymczy Apr 15, 2026
4054f0d
tests : f16 GGML_OP_FILL tests
sszymczy Apr 15, 2026
81209f9
llama : replaced ggml_scatter() usage in DSA implementation with ggml…
sszymczy Apr 15, 2026
24215b2
chore : whitespaces
sszymczy Apr 15, 2026
e0c767e
ggml : remove unused file
sszymczy Apr 15, 2026
9695fc8
chore : whitespaces
sszymczy Apr 20, 2026
938ef66
Merge remote-tracking branch 'upstream/master' into deepseek-dsa
sszymczy Apr 20, 2026
789ec5a
Merge remote-tracking branch 'upstream/master' into deepseek-dsa
sszymczy Apr 20, 2026
143 changes: 143 additions & 0 deletions convert_hf_to_gguf.py
@@ -831,6 +831,8 @@ def prepare_tensors(self):
gguf.MODEL_TENSOR.SSM_CONV1D_Q,
gguf.MODEL_TENSOR.SSM_CONV1D_K,
gguf.MODEL_TENSOR.SSM_CONV1D_V,
# DSA indexer weights should be F32
gguf.MODEL_TENSOR.INDEXER_PROJ,
)
)
or new_name[-7:] not in (".weight", ".lora_a", ".lora_b")
@@ -9186,6 +9188,147 @@ def prepare_tensors(self):
raise ValueError(f"Unprocessed experts: {experts}")


@ModelBase.register(
"DeepseekV32ForCausalLM",
)
class DeepseekV32Model(TextModel):
model_arch = gguf.MODEL_ARCH.DEEPSEEK32

# TODO @ngxson : remove this when we support MTP for deepseek models
skip_mtp = True

def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.block_count = self.hparams["num_hidden_layers"] + self.hparams.get("num_nextn_predict_layers", 0)
self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)

def set_vocab(self):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(self.dir_model)
assert getattr(tokenizer, "add_bos_token", False), "Change value of add_bos_token to true in tokenizer_config.json file."
self._set_vocab_gpt2()

def set_gguf_parameters(self):

# note: deepseek32 using MLA converts into MQA (ie: GQA with 1 group)
self.hparams["num_key_value_heads"] = 1

super().set_gguf_parameters()
hparams = self.hparams

# first_k_dense_replace: number of leading layers using dense FFN instead of MoE
self.gguf_writer.add_leading_dense_block_count(hparams["first_k_dense_replace"])
self.gguf_writer.add_vocab_size(hparams["vocab_size"])
self.gguf_writer.add_q_lora_rank(hparams["q_lora_rank"])
self.gguf_writer.add_kv_lora_rank(hparams["kv_lora_rank"])

# note: deepseek32 using MLA converts into MQA with larger heads, then decompresses to MHA
self.gguf_writer.add_key_length(hparams["kv_lora_rank"] + hparams["qk_rope_head_dim"])
self.gguf_writer.add_value_length(hparams["kv_lora_rank"])
self.gguf_writer.add_key_length_mla(hparams["qk_nope_head_dim"] + hparams["qk_rope_head_dim"])
self.gguf_writer.add_value_length_mla(hparams["v_head_dim"])

# MoE parameters (required by C++ code for DEEPSEEK32 arch)
self.gguf_writer.add_expert_feed_forward_length(hparams["moe_intermediate_size"])
self.gguf_writer.add_expert_count(hparams["n_routed_experts"])
self.gguf_writer.add_expert_shared_count(hparams["n_shared_experts"])
self.gguf_writer.add_expert_weights_scale(self.hparams["routed_scaling_factor"])
self.gguf_writer.add_expert_weights_norm(self.hparams["norm_topk_prob"])

self.gguf_writer.add_rope_dimension_count(hparams["qk_rope_head_dim"])

if (rope_mscale_all := self.rope_parameters.get("mscale_all_dim")) is not None:
# [TAG_DEEPSEEK2_YARN_LOG_MUL_FIX]
# note: for legacy reasons, this is not consistent with the other usages of self.gguf_writer.add_rope_scaling_yarn_log_mul
# ref https://github.com/ggml-org/llama.cpp/pull/17945
self.gguf_writer.add_rope_scaling_yarn_log_mul(0.1 * rope_mscale_all)

# NextN/MTP prediction layers
if (num_nextn_predict_layers := self.hparams.get("num_nextn_predict_layers")) is not None:
self.gguf_writer.add_nextn_predict_layers(num_nextn_predict_layers)

# DSA indexer parameters
self.gguf_writer.add_indexer_head_count(self.hparams["index_n_heads"])
self.gguf_writer.add_indexer_key_length(self.hparams["index_head_dim"])
self.gguf_writer.add_indexer_top_k(self.hparams["index_topk"])
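For context (an illustrative sketch, not code from this PR): `index_topk` is the number of KV positions the DSA indexer keeps per query, i.e. a plain top-k over the per-position scores:

```python
import heapq

scores = [0.1, 2.5, -0.3, 1.7, 0.9]   # hypothetical indexer scores per KV position
index_topk = 2
# indices of the index_topk highest-scoring positions
top = heapq.nlargest(index_topk, range(len(scores)), key=scores.__getitem__)
print(sorted(top))  # [1, 3]: attention is restricted to these positions
```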

_experts: list[dict[str, Tensor]] | None = None

def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
if name.startswith("language_model."):
name = name.replace("language_model.", "")

# rename e_score_correction_bias tensors
if name.endswith("e_score_correction_bias"):
name = name.replace("e_score_correction_bias", "e_score_correction.bias")

# skip Multi-Token Prediction (MTP) layers
if self.skip_mtp:
block_count = self.hparams["num_hidden_layers"]
match = re.match(r"model.layers.(\d+)", name)
if match and int(match.group(1)) >= block_count:
return

# process the experts separately
if name.find("mlp.experts") != -1:
n_experts = self.hparams["n_routed_experts"]
assert bid is not None

if self._experts is None:
self._experts = [{} for _ in range(self.block_count)]

self._experts[bid][name] = data_torch

if len(self._experts[bid]) >= n_experts * 3:
# merge the experts into a single 3d tensor
for w_name in ["down_proj", "gate_proj", "up_proj"]:
datas: list[Tensor] = []

for xid in range(n_experts):
ename = f"model.layers.{bid}.mlp.experts.{xid}.{w_name}.weight"
datas.append(self._experts[bid][ename])
del self._experts[bid][ename]

data_torch = torch.stack(datas, dim=0)

merged_name = f"model.layers.{bid}.mlp.experts.{w_name}.weight"

yield from super().modify_tensors(data_torch, merged_name, bid)
return
else:
return
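The buffering scheme above holds expert tensors until all `n_experts * 3` projections of a block have arrived, then merges each projection across experts. A pure-Python sketch of that bookkeeping (plain nested lists stand in for the `torch.stack(datas, dim=0)` call; names and sizes are illustrative):

```python
n_experts = 2
buffered = {}  # tensors arrive one at a time, keyed by their HF name
for xid in range(n_experts):
    for w_name in ("down_proj", "gate_proj", "up_proj"):
        buffered[f"model.layers.0.mlp.experts.{xid}.{w_name}.weight"] = [[float(xid)]]

assert len(buffered) >= n_experts * 3  # all experts seen: safe to merge

merged = {}
for w_name in ("down_proj", "gate_proj", "up_proj"):
    datas = [buffered.pop(f"model.layers.0.mlp.experts.{xid}.{w_name}.weight")
             for xid in range(n_experts)]
    # stacking along a new leading dim gives one 3D tensor per projection
    merged[f"model.layers.0.mlp.experts.{w_name}.weight"] = datas

assert not buffered  # every buffered tensor was consumed exactly once
```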

# note: MLA with the absorption optimization, needs these two split and k_b_proj transposed
if name.endswith("kv_b_proj.weight"):
name_kb = name.replace("kv_b_proj", "k_b_proj")
name_vb = name.replace("kv_b_proj", "v_b_proj")

n_head_kv = self.hparams["num_key_value_heads"]
v_head_dim = self.hparams["v_head_dim"]
qk_nope_head_dim = self.hparams["qk_nope_head_dim"]

assert data_torch.shape[0] == n_head_kv * (v_head_dim + qk_nope_head_dim)

kv_b = data_torch.view(n_head_kv, v_head_dim + qk_nope_head_dim, data_torch.shape[-1])
k_b, v_b = torch.split(kv_b, [qk_nope_head_dim, v_head_dim], dim=1)
k_b = k_b.transpose(1, 2)

yield from super().modify_tensors(k_b, name_kb, bid)
yield from super().modify_tensors(v_b, name_vb, bid)
return
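As a shape sanity check for the split above (hypothetical dimensions, not the real DeepSeek V3.2 hyperparameters):

```python
# kv_b_proj packs the K (nope part) and V projections for every KV head.
n_head_kv, qk_nope_head_dim, v_head_dim, kv_lora_rank = 1, 128, 128, 512

rows = n_head_kv * (v_head_dim + qk_nope_head_dim)         # checked by the assert
viewed = (n_head_kv, v_head_dim + qk_nope_head_dim, kv_lora_rank)

# torch.split on dim 1 yields:
k_b = (n_head_kv, qk_nope_head_dim, kv_lora_rank)
v_b = (n_head_kv, v_head_dim, kv_lora_rank)
# k_b.transpose(1, 2) gives the layout used by the MLA absorption trick:
k_b_t = (n_head_kv, kv_lora_rank, qk_nope_head_dim)

assert rows == viewed[0] * viewed[1] == 256
```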

yield from super().modify_tensors(data_torch, name, bid)

def prepare_tensors(self):
super().prepare_tensors()

if self._experts is not None:
# flatten `list[dict[str, Tensor]]` into `list[str]`
experts = [k for d in self._experts for k in d.keys()]
if len(experts) > 0:
raise ValueError(f"Unprocessed experts: {experts}")


@ModelBase.register(
"Mistral3ForConditionalGeneration",
"Ministral3ForCausalLM",
4 changes: 2 additions & 2 deletions ggml/include/ggml-rpc.h
@@ -8,10 +8,10 @@ extern "C" {

#define RPC_PROTO_MAJOR_VERSION 4
#define RPC_PROTO_MINOR_VERSION 0
-#define RPC_PROTO_PATCH_VERSION 0
+#define RPC_PROTO_PATCH_VERSION 1

#ifdef __cplusplus
-static_assert(GGML_OP_COUNT == 96, "GGML_OP_COUNT has changed - update RPC_PROTO_PATCH_VERSION");
+static_assert(GGML_OP_COUNT == 97, "GGML_OP_COUNT has changed - update RPC_PROTO_PATCH_VERSION");
#endif

#define GGML_RPC_MAX_SERVERS 16
9 changes: 9 additions & 0 deletions ggml/include/ggml.h
@@ -561,6 +561,7 @@ extern "C" {
GGML_OP_RWKV_WKV7,
GGML_OP_SOLVE_TRI,
GGML_OP_GATED_DELTA_NET,
GGML_OP_LIGHTNING_INDEXER,

GGML_OP_UNARY,

Expand Down Expand Up @@ -2539,6 +2540,14 @@ extern "C" {
struct ggml_tensor * beta,
struct ggml_tensor * state);

GGML_API struct ggml_tensor * ggml_lightning_indexer(
struct ggml_context * ctx,
struct ggml_tensor * q,
struct ggml_tensor * k,
struct ggml_tensor * weights,
float scale_embd,
float scale_heads);

// custom operators

typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
11 changes: 11 additions & 0 deletions ggml/src/ggml-cpu/ggml-cpu.c
@@ -2037,6 +2037,10 @@ static void ggml_compute_forward(struct ggml_compute_params * params, struct ggm
{
ggml_compute_forward_gated_delta_net(params, tensor);
} break;
case GGML_OP_LIGHTNING_INDEXER:
{
ggml_compute_forward_lightning_indexer(params, tensor);
} break;
case GGML_OP_MAP_CUSTOM1:
{
ggml_compute_forward_map_custom1(params, tensor);
@@ -2356,6 +2360,7 @@ static int ggml_get_n_tasks(struct ggml_tensor * node, int n_threads) {
case GGML_OP_FLASH_ATTN_BACK:
case GGML_OP_SSM_CONV:
case GGML_OP_SSM_SCAN:
case GGML_OP_LIGHTNING_INDEXER:
{
n_tasks = n_threads;
} break;
@@ -2939,6 +2944,12 @@ struct ggml_cplan ggml_graph_plan(
{
GGML_ABORT("fatal error");
}
case GGML_OP_LIGHTNING_INDEXER:
{
// temp buffer for dequantizing lightning indexer keys
const int64_t ne10 = node->src[1]->ne[0];
cur += sizeof(float)*ne10*n_tasks;
} break;
default:
break;
}
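The sizing rule above reserves one dequantization row per task. With hypothetical values (indexer key dimension `ne10 = 128`, `n_tasks = 8` threads), the scratch requirement works out as:

```python
SIZEOF_FLOAT = 4                 # sizeof(float) on the CPU backend
ne10, n_tasks = 128, 8           # hypothetical: key dim, thread count
cur = SIZEOF_FLOAT * ne10 * n_tasks
print(cur)                       # 4096 bytes: one f32 key row per thread
```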
109 changes: 108 additions & 1 deletion ggml/src/ggml-cpu/ops.cpp
@@ -2235,8 +2235,42 @@ static void ggml_compute_forward_fill_f32(const ggml_compute_params * params, gg
}
}

static void ggml_compute_forward_fill_f16(const ggml_compute_params * params, ggml_tensor * dst) {
const ggml_fp16_t c = GGML_CPU_FP32_TO_FP16(ggml_get_op_params_f32(dst, 0));

GGML_TENSOR_LOCALS(int64_t, ne, dst, ne);
GGML_TENSOR_LOCALS(size_t, nb, dst, nb);

const auto [ir0, ir1] = get_thread_range(params, dst);

for (int64_t ir = ir0; ir < ir1; ++ir) {
const int64_t i03 = ir/(ne2*ne1);
const int64_t i02 = (ir - i03*ne2*ne1)/ne1;
const int64_t i01 = (ir - i03*ne2*ne1 - i02*ne1);

ggml_fp16_t * dst_ptr = (ggml_fp16_t *) ((char *) dst->data + i03*nb3 + i02*nb2 + i01*nb1);

ggml_vec_set_f16(ne0, dst_ptr, c);
}
}

void ggml_compute_forward_fill(const ggml_compute_params * params, ggml_tensor * dst) {
-    ggml_compute_forward_fill_f32(params, dst);
const ggml_tensor * src0 = dst->src[0];

switch (src0->type) {
case GGML_TYPE_F32:
{
ggml_compute_forward_fill_f32(params, dst);
} break;
case GGML_TYPE_F16:
{
ggml_compute_forward_fill_f16(params, dst);
} break;
default:
{
GGML_ABORT("unsupported type for ggml_compute_forward_fill: %s", ggml_type_name(src0->type));
}
}
}
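The f16 path converts the fill constant to half precision once and then sets whole rows to it, rather than converting per element. A small stand-alone sketch of that conversion step (pure Python via `struct`; the helper name is illustrative, not ggml's):

```python
import struct

def fp32_to_fp16_bits(x: float) -> int:
    # analogue of GGML_CPU_FP32_TO_FP16: round once to half precision
    return struct.unpack('<H', struct.pack('<e', x))[0]

c = fp32_to_fp16_bits(1.0)   # convert the op-param constant a single time
row = [c] * 8                # then every row is set to the same bit pattern
assert c == 0x3C00           # 1.0 encodes as 0x3C00 in IEEE-754 binary16
```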

// ggml_compute_tri
@@ -11212,3 +11246,76 @@ void ggml_compute_forward_opt_step_sgd(const ggml_compute_params * params, ggml_
}
}
}

// ggml_compute_forward_lightning_indexer

void ggml_compute_forward_lightning_indexer(
const ggml_compute_params * params,
ggml_tensor * dst) {

const ggml_tensor * src0 = dst->src[0]; // q
const ggml_tensor * src1 = dst->src[1]; // k
const ggml_tensor * src2 = dst->src[2]; // weights

const float scale_embd = ggml_get_op_params_f32(dst, 0);
const float scale_heads = ggml_get_op_params_f32(dst, 1);

GGML_ASSERT(dst->type == GGML_TYPE_F32);
GGML_ASSERT(src0->type == GGML_TYPE_F32);
GGML_ASSERT(src2->type == GGML_TYPE_F32);

GGML_TENSOR_TERNARY_OP_LOCALS

GGML_ASSERT( nb0 == sizeof(float));
GGML_ASSERT(nb00 == sizeof(float));

int n_embd = src0->ne[0];
int n_head = src0->ne[1];
int n_batch = src0->ne[2];
int n_stream = src0->ne[3];
int n_kv = src1->ne[2];

ggml_to_float_t const k_to_float = ggml_get_type_traits(src1->type)->to_float;
GGML_ASSERT((src1->type == GGML_TYPE_F32 || k_to_float) && "lightning indexer: unsupported K-type");

const int nr = n_kv;
const int ith = params->ith;
const int nth = params->nth;

// (temporary) buffer for K converted to float
float * src1_row_f32 = (float *) params->wdata + ith*(1*n_embd + CACHE_LINE_SIZE_F32);

// rows per thread
const int dr = (nr + nth - 1)/nth;

// row range for this thread
const int ir0 = dr*ith;
const int ir1 = MIN(ir0 + dr, nr);

for (int i_stream = 0; i_stream < n_stream; ++i_stream) {
for (int i_batch = 0; i_batch < n_batch; ++i_batch) {
for (int i_kv = ir0; i_kv < ir1; ++i_kv) {
char * src1_row = (char *) src1->data + i_kv*nb12 + i_stream*nb13;
if (k_to_float) {
k_to_float(src1_row, src1_row_f32, n_embd);
} else {
src1_row_f32 = (float *) src1_row;
}
float * src2_row = (float *) ((char *) src2->data + i_batch*nb21 + i_stream*nb23);
float * dst_row = (float *) ((char *) dst->data + i_batch*nb1 + i_stream*nb3);
float score = 0.0f;
for (int i_head = 0; i_head < n_head; ++i_head) {
// dot product of q and k for head i_head
float qk = 0.0f;
float * src0_row = (float *) ((char *) src0->data + i_head*nb01 + i_batch*nb02 + i_stream*nb03);
ggml_vec_dot_f32(n_embd, &qk, 0, src0_row, 0, src1_row_f32, 0, 1);
qk *= scale_embd;
// ReLU and weights
score += MAX(qk, 0.0f) * src2_row[i_head];
}
score *= scale_heads;
dst_row[i_kv] = score;
}
}
}
}
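The per-thread loop above splits KV rows with a ceil-div (`dr = (nr + nth - 1)/nth`) and computes, for each KV position, the sum over heads of `weight_h * relu(scale_embd * dot(q_h, k))`, scaled by `scale_heads`. A hedged pure-Python reference of that scoring (one batch/stream slice, no threading or quantization; plain lists stand in for tensors):

```python
def lightning_indexer_score(q, k, weights, scale_embd, scale_heads):
    # q:       [n_head][n_embd]  indexer queries for one token
    # k:       [n_kv][n_embd]    (dequantized) indexer keys
    # weights: [n_head]          per-head mixing weights
    # Returns one relevance score per cached KV position.
    scores = []
    for k_row in k:
        score = 0.0
        for q_row, w in zip(q, weights):
            # dot(q_h, k) scaled, passed through ReLU, then head-weighted
            qk = sum(qi * ki for qi, ki in zip(q_row, k_row)) * scale_embd
            score += max(qk, 0.0) * w
        scores.append(score * scale_heads)
    return scores

# head 0 contributes relu(2.0) * 0.5; head 1 is clipped away by the ReLU
print(lightning_indexer_score([[1.0, 0.0], [0.0, -1.0]],
                              [[2.0, 3.0]], [0.5, 1.0], 1.0, 2.0))  # [2.0]
```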
1 change: 1 addition & 0 deletions ggml/src/ggml-cpu/ops.h
@@ -103,6 +103,7 @@ void ggml_compute_forward_rwkv_wkv7(const struct ggml_compute_params * params, s
void ggml_compute_forward_solve_tri(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_gla(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_gated_delta_net(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_lightning_indexer(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_map_custom1(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_map_custom2(const struct ggml_compute_params * params, struct ggml_tensor * dst);
void ggml_compute_forward_map_custom3(const struct ggml_compute_params * params, struct ggml_tensor * dst);
5 changes: 5 additions & 0 deletions ggml/src/ggml-cuda/ggml-cuda.cu
@@ -61,6 +61,7 @@
#include "ggml-cuda/tri.cuh"
#include "ggml-cuda/cumsum.cuh"
#include "ggml-cuda/fill.cuh"
#include "ggml-cuda/lightning_indexer.cuh"
#include "ggml.h"

#include <algorithm>
@@ -2922,6 +2923,9 @@ static bool ggml_cuda_compute_forward(ggml_backend_cuda_context & ctx, struct gg
case GGML_OP_FILL:
ggml_cuda_op_fill(ctx, dst);
break;
case GGML_OP_LIGHTNING_INDEXER:
ggml_cuda_op_lightning_indexer(ctx, dst);
break;
default:
return false;
}
@@ -5112,6 +5116,7 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
case GGML_OP_TRI:
case GGML_OP_DIAG:
case GGML_OP_SOLVE_TRI:
case GGML_OP_LIGHTNING_INDEXER:
return true;

default: