
Add GGML fused quantized dense and gather ops#364

Open
ajroetker wants to merge 17 commits into gomlx:main from ajroetker:gguf-fused-quantized-dense

Conversation

@ajroetker
Contributor

@ajroetker ajroetker commented Mar 11, 2026

Summary

  • Add FusedQuantizedGather for quantized embedding lookups (analogous to ggml_get_rows)
  • Add FusedQuantizedDense for fused dequant + matmul + bias + activation on GGML-quantized weights
  • Add GGML dequantization support for Q4_0, Q8_0, IQ4_NL, Q4_K, Q6_K types
  • Add shift operations (LogicalShiftLeft/Right) needed for sub-byte unpacking
  • Add QuantizedGather high-level layer in pkg/ml/nn
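
The shift operations exist because sub-byte formats store two 4-bit values per byte, which must be split apart before use. A minimal sketch of that unpacking, with illustrative names that are not the PR's actual API:

```go
package main

import "fmt"

// unpackNibbles splits each packed byte into two 4-bit values (low
// nibble first), the kind of sub-byte unpacking the new shift ops
// enable. The function name and low-nibble-first order are assumptions
// for illustration, not taken from the PR.
func unpackNibbles(packed []uint8) []uint8 {
	out := make([]uint8, 0, len(packed)*2)
	for _, b := range packed {
		out = append(out, b&0x0F) // mask keeps the low nibble
		out = append(out, b>>4)   // logical shift right extracts the high nibble
	}
	return out
}

func main() {
	fmt.Println(unpackNibbles([]uint8{0xAB, 0x3C})) // [11 10 12 3]
}
```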

Review feedback addressed

  • Reverted buffer pool bucketing changes to a future PR
  • Reverted execBitcast fix to separate PR: Fix execBitcast buffer reuse for cross-bit-width types #374
  • Split quantized executors into exec_fused_quantized.go and exec_fused_quantized_ggml.go
  • Added GGML block format documentation with references to ggml/llama.cpp
  • Added detailed deriveGGMLK docs explaining N/K dimensions
  • Renamed table/tableQuantizationdata/dataQuantization in FusedQuantizedGather
  • Used errors.Wrapf(backends.ErrNotImplemented, ...) with feature-request guidance

Test plan

  • go build ./... passes
  • Existing tests pass
  • go test ./backends/simplego/... — includes new benchmark for fused ops
  • go test ./pkg/ml/nn/... — includes new quantized dense test

- Add QuantGGML quantization scheme with GGMLQuantType enum supporting
  Q4_0, Q8_0, IQ4_NL, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K block formats
- Add FusedQuantizedGather op for quantized embedding lookups with
  graph builder, shape inference, and simplego executor
- Implement GGML dequantization executors (Q4_0, Q8_0, Q4_K, Q6_K)
  for both FusedQuantizedDense and FusedQuantizedGather
- Add QuantizedGather nn layer for quantized embedding tables
- Fix packedLen to only pack sub-byte integer types (Int4/Uint4),
  not Bool
- Add benchmarks for GGML dequantization and quantized dense tests
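
For reference, the Q4_0 format listed above packs 32 weights per block: one scale plus 16 bytes of 4-bit quants, where each weight decodes as (nibble - 8) * scale. A minimal sketch of one block's dequantization (the real format stores the scale as fp16; it is taken as float32 here for brevity, and this is not the PR's executor code):

```go
package main

import "fmt"

// dequantQ4_0Block decodes one GGML Q4_0 block: low nibbles fill the
// first 16 outputs, high nibbles the last 16, each shifted by -8 and
// scaled by d.
func dequantQ4_0Block(d float32, qs [16]uint8) [32]float32 {
	var out [32]float32
	for i, b := range qs {
		out[i] = float32(int(b&0x0F)-8) * d  // low nibble -> first half
		out[i+16] = float32(int(b>>4)-8) * d // high nibble -> second half
	}
	return out
}

func main() {
	var qs [16]uint8
	qs[0] = 0x9F // low nibble 15 -> +7, high nibble 9 -> +1
	y := dequantQ4_0Block(0.5, qs)
	fmt.Println(y[0], y[16]) // 3.5 0.5
}
```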
…y bench dep

- exec_shift_ops.go: use unexported binaryOperandsAndOutput/newBroadcastIterator
- Remove BenchmarkQuantizedDenseGGML that depends on go-highway/gguf
Resolve conflict in buffers.go: keep bucketSize/subSliceFlat/fullCapFlat
from gguf branch, drop isPackedSubByteDType/packedLen (replaced by
dtype.IsPacked() and dtype.SizeForDimensions()). Fix subSliceFlat call
in getBuffer to use element count for non-packed types and byte count
for packed types.
The parameter describes the quantization of the table (embedding matrix),
not generic weights. This makes the API more precise and consistent with
the table/indices naming used elsewhere in the function signature.
- Fix ggmlFp16LE subnormal path: use int32 for exponent to avoid
  fragile uint32 underflow/cast chain
- Add allow-list check to FusedQuantizedGather so unsupported GGML
  types (e.g. IQ4_NL) return ErrNotImplemented at build time,
  enabling transparent fallback via InternalFusedOpCaller
- Extract deriveGGMLK helper to deduplicate K derivation+validation
  in fusedQuantizedDenseGGML and FusedQuantizedGather
- Extract processColumn closure in quantizedDenseGGML to deduplicate
  serial/parallel branches
- Extract flatToIntSlice helper to deduplicate 3-way type switch in
  execFusedQuantizedGather
- Extract extractNibbleBlock helper to deduplicate shared Q4_0/IQ4_NL
  nibble unpacking in dequant.go
- Remove dead _ = vpb assignment in Dequant
- Remove duplicated comment in local.go
- Merge identical BytesPerBlock cases (GGMLQ4_0, GGMLIQ4NL)
…xecutor, harden gather

- Hand-write BackendFusedQuantizedGather in fused_ops.go using toBackend(),
  matching the BackendFusedQuantizedDense pattern. Remove auto-generated version.
- Add dequantIQ4NLRow fused executor and register IQ4_NL in builder switches
  for both Dense and Gather, so IQ4_NL no longer always falls back to decomposed.
- Add bounds check on rowIdx in execFusedQuantizedGather to produce a meaningful
  error instead of an opaque Go panic on out-of-range indices.
- Refactor quantizedDenseGGML to use quantizedDenseParallel for consistent
  parallelism with NF4/Linear paths, including M=1 column tiling.
- Add QuantizedGather tests for Q8_0, Q4_0, and IQ4_NL covering both fused
  (simplego) and decomposed (xla:cpu) paths.
- Clarify IQ4NLLookupTable comment and K-quant backend requirements in docs.

@janpfeifer janpfeifer left a comment


Not fully reviewed, but some important change requests (sry), so let's start with this.

- Merge upstream/main (DTypeMap error returns, pad support, etc.)
- Revert buffer pool bucketing changes to separate PR per review
- Revert exec_bitcast changes to separate PR per review
- Split quantized code into exec_fused_quantized.go and
  exec_fused_quantized_ggml.go per review
- Add GGML block format documentation with references to ggml/llama.cpp
- Add detailed deriveGGMLK documentation explaining N/K dimensions
- Rename table/tableQuantization -> data/dataQuantization in
  FusedQuantizedGather (parameter names, not just for embeddings)
- Use errors.Wrapf(backends.ErrNotImplemented, ...) for unsupported
  quantization schemes in FusedQuantizedGather
- Add feature-request guidance in error messages

@janpfeifer janpfeifer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay. This took some time to review.

…acked sub-byte types

- Rename FusedQuantizedGather → QuantizedEmbeddingLookup across all
  backends, graph layer, and ggml package (it doesn't fuse operations)
- Rename Dequant → Dequantize, GatherDecomposed → EmbeddingLookupDecomposed
- Rename flatToIntSlice → quantGatherIntSliceOfFlat (too generic for specialized use)
- Extract validateGGMLTypeSupported helper to deduplicate GGML type validation
- Fix ValueSafe for packed sub-byte tensors (Int4, Uint4, Int2, Uint2):
  unpack before scalar check and multi-dimensional slice building
- Fix Summary to show actual dtype name for packed types instead of Go storage type
- Add comprehensive tests for packed sub-byte ValueSafe and Summary
- Improve Dequantize doc comments explaining N, K, and why N is explicit
- Sort shift ops alphabetically in capabilities.go
- Regenerate enumerators and backend ops wrappers
…s, fix dead branch

- Replace hand-rolled ggmlFp16LE with float16.Frombits from existing
  x448/float16 dependency (already used elsewhere in simplego)
- Pre-allocate per-worker scratch buffers in quantizedDenseGGML to avoid
  heap allocation per tile invocation on the inference hot path
- Eliminate intermediate dequantRow + copy in execQuantizedEmbeddingLookup
  by dequantizing directly into the output slice
- Remove dead branch in execBitcast canReuse (targetDType == dtypes.Uint8
  is unreachable when !sameBitWidth and srcIsUint8)
@ajroetker ajroetker requested a review from janpfeifer March 23, 2026 00:25
@ajroetker
Contributor Author

Tried to address everything, found a couple of bugs while refactoring after the fixups too!

Extract parallelTileCount helper and add workerIdx callback parameter
to quantizedDenseParallel, eliminating duplicated tileSize/numWorkers
computation in quantizedDenseGGML.

@janpfeifer janpfeifer left a comment


I still haven't reviewed everything... Question: in many cases the code isn't making use of the infra that is already there. I assume most of the code was written by the AI, correct?

If yes, I wonder if there is a context prompt we can add to the AGENTS.md to try to make it more attentive to following the patterns in the package already.

func quantizedDenseParallel(backend *Backend, M, K, N int, rowFn func(m, nStart, nEnd int)) {
// parallelTileCount returns the number of parallel work units that
// quantizedDenseParallel will dispatch for the given dimensions.
func parallelTileCount(backend *Backend, M, K, N int) int {

nit: parallelTileCount -> quantizedDenseParallelTileCount ?

(Since simplego is this giant package ...)

_, srcIsUint8 := src.flat.([]uint8)
dstIsUint8Storage := targetDType == dtypes.Uint8 || targetDType.Bits() < 8
canReuse = srcIsUint8 && dstIsUint8Storage
canReuse = srcIsUint8 && targetDType.Bits() < 8

Maybe:

tgtIsUint8 := targetDType.GoType().Kind() == reflect.Uint8 
canReuse = srcIsUint8 && tgtIsUint8

Just in case there eventually is a packed data type that doesn't use uint8 as the storage dtype.

// unpackWeightsToInt8 unpacks sub-byte weight data (Int4, Uint4) from packed
// []byte storage into []int8 (one value per element) for the matmul kernel.
// For non-sub-byte types, returns the flat data as-is.
func unpackWeightsToInt8(wBuf *Buffer) any {

We probably should return a buffer here: the buffer pool was created exactly for these quick temporary allocations that will be the same size at every execution of the graph -- which we expect to happen often.

And, if that makes sense, then we may as well use the already existing execConvertDType from Uint4/Int4 to Int8.

return output, nil
}

// quantGatherIntSliceOfFlat converts a flat index slice ([]int32, []int64, or []int) to []int.

Same as above, we likely want to use the buffer pool to allocate the converted ints. In which case, you may just use the corresponding execConvertDType from whatever is the indices to int64 (better use int64 than int in this case, with explicit number of bits -- for all our platforms int=int64 anyway).

return nil, err
}

numIndices := indicesBuf.shape.Size() / indicesBuf.shape.Dimensions[indicesBuf.shape.Rank()-1]

The logic here is very odd.

If the last dimension of indices != 1, you would be simply truncating the indices and throwing the rest away !?

But we know the last dimension is 1 (it's pre-checked) and so numIndices := indicesBuf.shape.Size().


numIndices := indicesBuf.shape.Size() / indicesBuf.shape.Dimensions[indicesBuf.shape.Rank()-1]

indices, err := quantGatherIntSliceOfFlat(indicesBuf.flat, numIndices)

Optional: instead of converting to int, what about having a generic function for all types of ints and use the usual registration system to map the index dtype to the corresponding generic instantiation ? You save one temporary allocation, and it will probably be a tiny bit faster for smaller integers (less memory to scan).

Very optional, since likely this won't make much of a difference, I can't imagine the embedding lookup being the bottleneck :)

}
}

// convertToIntSlice converts the first n elements of an integer slice to []int.

Not needed, if you use the execConvertDType... per comment above.

- Use buffer pool + ConvertDType for weight unpacking and index
  conversion instead of ad-hoc allocations (unpackWeightsToBuffer,
  convertIndicesToInt64)
- Inline shift operations following binary ops pattern to eliminate
  per-element closure overhead (shiftLeftOp, shiftRightArithmeticOp,
  shiftRightLogicalUnsignedOp, shiftRightLogicalSignedOp)
- Rename parallelTileCount → quantizedDenseParallelTileCount
- Simplify numIndices calculation (last dim pre-validated as 1)
- Use tgtIsUint8 GoType check in exec_bitcast instead of Bits() < 8
- Add GGML format references and doc links to fused_ops.go
- Add "Follow Existing Patterns" guidance to AGENTS.md
@ajroetker ajroetker force-pushed the gguf-fused-quantized-dense branch from 5250386 to 18d38b4 Compare March 25, 2026 19:40

@janpfeifer janpfeifer left a comment


Very neat, thanks @ajroetker !!

A few more minor comments, all reviewed now!

func unpackWeightsToBuffer(backend *Backend, wBuf *Buffer) (*Buffer, bool, error) {
var targetDType dtypes.DType
switch wBuf.shape.DType {
case dtypes.Int4:

Question: what about Int2 and Uint2 ?


// For packed sub-byte weights (from Bitcast), unpack nibbles via the buffer pool
// and ConvertDType infrastructure. Non-sub-byte types pass through unchanged.
unpackedBuf, unpackedPooled, err := unpackWeightsToBuffer(backend, wBuf)

nit: maybe unpackedPooled -> isUnpackedPooled or isUnpackedOwned since it comes from the pool anyway (wBuf is pooled), the question is whether we need to release it.

numIndices := indicesBuf.shape.Size()

// Convert indices to int64 via the buffer pool and ConvertDType infrastructure.
idxBuf, idxPooled, err := convertIndicesToInt64(backend, indicesBuf)

nit: when reading I keep thinking idxPooled is another buffer. I was going to suggest as above, prefix is "is" or "are" or "has" for booleans, and maybe call it "owned" since all buffers are pooled. So idxPooled -> isIdxOwned.

Which reminds me that I should have named inputsOwned to areInputsOwned 😄

}

// execShiftLeft executes lhs << rhs for integer types.
func execShiftLeft(backend *Backend, node *Node, inputs []*Buffer, inputsOwned []bool) (*Buffer, error) {

Very optional: you could use the DTypeMap to register functions per dtype (and get rid of the various laundry-list switches). Well, ShiftLogicalRight would have to be manually registered because it takes two type parameters.

But ... when we have SIMD, it will be easy to simply register the supported SIMD version detected in runtime.

But it can also wait ...

w("[%d]", dim)
}
w("%s", values.Type().Elem())
if t.shape.DType.IsPacked() {

Hmm ... this is delicate: so far I used Summary() to print a sample of even very large tensors, because it doesn't create a copy of it.

If we use unpackFlatValues() here this becomes a very costly operation, potentially requiring a giant allocation.

Can I suggest (I'm hoping the AI can easily handle this):

  • Refactor wValue() above to take, instead of the value itself, just the index of the value in the flat vector.
  • Write separately a small extractPackedElement(flatPacked []uint8, packedDType dtypes.DType, index int) int, that takes the packed bytes (in flatPacked), the packed dtype (Int2, Int4, Uint2, or Uint4) and the index, and returns the corresponding value unpacked to an int.

Then in wValue(ii) you can check dtype.IsPacked() and, if true, call this extractPackedElement() instead.

Wdyt ?
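
A minimal sketch of the suggested helper, under the assumption that packed elements are stored low-bits-first within each byte; it takes an explicit bit width and signedness flag in place of the proposed dtypes.DType parameter:

```go
package main

import "fmt"

// extractPackedElement pulls a single sub-byte value out of packed
// bytes without unpacking the whole tensor. bits is the element width
// (2 or 4); signed selects sign extension (Int2/Int4 vs Uint2/Uint4).
// The low-bits-first packing order is an assumption for illustration.
func extractPackedElement(flatPacked []uint8, bits int, signed bool, index int) int {
	perByte := 8 / bits
	b := flatPacked[index/perByte]
	shift := uint((index % perByte) * bits)
	mask := uint8(1<<bits - 1)
	v := int((b >> shift) & mask)
	if signed && v >= 1<<(bits-1) {
		v -= 1 << bits // sign-extend: e.g. 4-bit pattern 0xF decodes to -1
	}
	return v
}

func main() {
	packed := []uint8{0xF2} // low nibble 2, high nibble 15
	fmt.Println(extractPackedElement(packed, 4, false, 1)) // 15
	fmt.Println(extractPackedElement(packed, 4, true, 1))  // -1
}
```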

dequantW := Dequantize(weights, ggmlType, N)

// Transpose to [K, N] for matmul.
dequantW = Transpose(dequantW, 0, 1) // [K, N]

Instead of transpose here, just change the below Dot.Product() to the corresponding Dot.General() with the appropriate contraction axes. Let the dotgeneral algorithm decide what it wants to do (if it wants to transpose or not).

- Register Int2/Uint2 → {Int8,Uint8,Int32,Int64,Float32,Float64}
  converters via execConvertPackedSubByte with valuesPerByte=4
- Add unpackInt2Bits and unpackUint2Bits for 2-bit packed data
- Handle Int2/Uint2 in unpackWeightsToBuffer alongside Int4/Uint4
- Register mutableBytes and fillBuffer for Int2/Uint2
- Rename unpackedPooled → isUnpackedOwned, idxPooled → isIdxOwned
  for clarity (all buffers are pooled; the bool tracks ownership)
@ajroetker ajroetker force-pushed the gguf-fused-quantized-dense branch from c93f94a to a1494db Compare March 26, 2026 14:28