Right now we have Q4_K GGUF, but FP4 is the format the model actually ships in, and it is numerically different from Q4_K, which is uniform quantization. Given that llama.cpp can (I believe) run inference of GPT120B OSS fast enough, it should be possible to support the FP4 weights, at least as an optional format, in order to really run inference of the Real Thing that DeepSeek shipped. I doubt the differences are large, since even the 2-bit quants work well, but... still. Not a priority, but worth remembering to investigate.
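For reference, a minimal sketch of what decoding FP4 could look like, assuming "FP4" here means the E2M1 encoding from the MXFP4 microscaling format (blocks of 32 values sharing an 8-bit power-of-two scale). The block layout and function name are illustrative, not an existing llama.cpp API:

```c
#include <stdint.h>
#include <math.h>

/* E2M1 lookup table: sign(1) exp(2) mantissa(1). The 8 non-negative
 * magnitudes are fixed by the format; the high bit mirrors the sign. */
static const float fp4_e2m1[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f
};

/* Decode one MXFP4-style block: 32 FP4 nibbles sharing one E8M0
 * scale (an 8-bit exponent with bias 127, i.e. a power of two). */
void decode_fp4_block(const uint8_t nibbles[16], uint8_t e8m0_scale,
                      float out[32]) {
    float scale = ldexpf(1.0f, (int)e8m0_scale - 127); /* 2^(e-127) */
    for (int i = 0; i < 16; i++) {
        out[2*i + 0] = fp4_e2m1[nibbles[i] & 0x0F] * scale;
        out[2*i + 1] = fp4_e2m1[nibbles[i] >> 4]   * scale;
    }
}
```

The E2M1 code points {0, 0.5, 1, 1.5, 2, 3, 4, 6} are non-uniformly spaced (denser near zero), which is exactly why this is numerically different from Q4_K's linear scale-and-min mapping of the 4-bit codes.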