MatGPTQ: Accurate and Efficient Post-Training Matryoshka Quantization
Official implementation of MatGPTQ (Matryoshka GPTQ), a new PTQ pipeline that produces a single parent model jointly optimized for multiple target precisions in one shot, based on a small calibration set.
Matryoshka Quantization (MatQuant) is a recent quantization approach showing that a single integer-quantized model can be served at multiple precisions by slicing off the most significant bits (MSBs) at inference time. This lets a single checkpoint cover a wide range of memory and latency budgets, but makes quantization considerably more challenging. In particular, the original MatQuant relies on expensive quantization-aware training (QAT) variants rather than fast one-shot post-training quantization (PTQ), and lacks open-source and kernel support. We address all of these limitations by introducing Post-Training Matryoshka Quantization (MatGPTQ), a new PTQ pipeline that produces a single parent model jointly optimized for multiple target precisions in one shot, based on a small calibration set. MatGPTQ casts Matryoshka quantization as a multi-precision objective with bit-slicing and cross-bit error compensation, resulting in an algorithm that produces a multi-bit-width, "sliceable" model in a single pass. We also incorporate a new budget-aware search for heterogeneous per-layer bit-widths and provide efficient kernels that implement slicing and mixed-precision execution. Across standard LLMs and benchmarks, MatGPTQ preserves high-bit accuracy while substantially improving performance at low bit-widths. Overall, we establish a new state of the art for Matryoshka-style post-training quantization and make single-checkpoint, multi-precision deployment open and practical.
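For intuition, the toy sketch below (in PyTorch) illustrates the general bit-slicing idea behind Matryoshka quantization: keeping only the most significant bits of stored 8-bit codes yields nested 4-bit or 2-bit weights from the same storage. This is an illustrative simplification with made-up tensors and scales, not the MatGPTQ algorithm itself.

```python
import torch

def slice_msb(codes: torch.Tensor, scale: torch.Tensor, target_bits: int) -> torch.Tensor:
    """Toy illustration of Matryoshka-style bit-slicing.

    `codes` holds unsigned 8-bit quantization codes (values in [0, 255]) in an
    integer tensor. Keeping only the top `target_bits` most significant bits gives
    a coarser code that reuses the same stored integers. The rescaling below is one
    simple choice, not the exact dequantization scheme used by MatGPTQ.
    """
    shift = 8 - target_bits
    sliced = codes >> shift                              # e.g. 4-bit codes in [0, 15]
    return (sliced.float() * (2 ** shift) - 128.0) * scale

# Hypothetical example: one block of 8-bit codes served at 8, 4, and 2 bits.
codes = torch.randint(0, 256, (4, 128), dtype=torch.int32)
scale = torch.full((4, 1), 0.01)
w8 = (codes.float() - 128.0) * scale   # full 8-bit dequantization
w4 = slice_msb(codes, scale, 4)        # same storage, read at 4 bits
w2 = slice_msb(codes, scale, 2)        # same storage, read at 2 bits
```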
- `scripts/`: bash scripts with the required arguments to run the method
- `src/`: helper methods and utility functions
- `inference_lib/`: CUDA kernels, deployment, and benchmarking scripts
- `evo_quant_search.py`: evolutionary quantization bit-width allocation
- `quant.py`: MatGPTQ/GPTQ quantization
- `lmeval.py`: LM Eval Harness evaluation script
- `eval_ppl.py`: perplexity evaluation script
- `pack_quantized_model.py`: packing script for vLLM deployment
- `setup.py`: installation script for the CUDA kernels
- `requirements.txt`: requirements file
Create a virtual environment and install dependencies (we recommend Python 3.12):
```bash
uv venv --python 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
```

Note: The code has been tested with CUDA 12.4 and PyTorch 2.7.1.
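Optionally, you can sanity-check that the expected PyTorch and CUDA versions are visible from the new environment. This quick check is not part of the repository scripts:

```python
import torch

# Optional environment sanity check (not part of the repository scripts).
print("torch:", torch.__version__)            # tested with 2.7.1
print("cuda:", torch.version.cuda)            # tested with CUDA 12.4
print("gpu available:", torch.cuda.is_available())
```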
We provide quant.py for producing the MatGPTQ/GPTQ models. To produce the respective model, see either scripts/run_gptq.sh or scripts/run_matgptq.sh for examples of how to run the quantization:
```bash
bash scripts/run_matgptq.sh
```

We provide evo_quant_search.py for producing the Mix'n'Match MatGPTQ models. To produce the respective model, see scripts/run_quant_search.sh for an example of how to run EvoPress for MatGPTQ:
```bash
bash scripts/run_quant_search.sh
```

We provide lmeval.py and eval_ppl.py for evaluation on the Language Model Evaluation Harness benchmarks and for perplexity measurements. The interface of lmeval.py mostly follows the instructions of the original harness. In addition, one should specify the path to the quantized weights via the quant_weights_path argument, together with either the default uniform quantization bit-width quant_uniform_bitwidth and the master bit-width --quant_master_bitwidth, or a path to a .txt file with the chosen compression levels via the --quant_non_uniform_config_path argument. Furthermore, --method defines whether to evaluate MatGPTQ or GPTQ.
For deployment, install our custom kernels and use our vLLM plugin to run them. Further information can be found in ./inference_lib.
You can start by using our already quantized models at ISTA-DASLab/MatGPTQ. Simply update the inference_bitwidth parameter in the config.json file to your desired value, and you're all set.
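For instance, a small helper like the one below can switch the served precision by rewriting config.json. The checkpoint path is hypothetical; only the inference_bitwidth key comes from the instructions above:

```python
import json
from pathlib import Path

# Hypothetical local path to a downloaded MatGPTQ checkpoint; adjust to your setup.
config_path = Path("MatGPTQ-model/config.json")

config = json.loads(config_path.read_text())
config["inference_bitwidth"] = 4              # serve the same checkpoint at 4 bits
config_path.write_text(json.dumps(config, indent=2))
```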
To install the kernels, run the following command:

```bash
uv pip install --no-build-isolation -e .
```

To deploy MatGPTQ models in various environments, you can use MatGPTQ with our vLLM plugin as follows (more information here). We recommend using a clean, new environment (tested with vllm>=0.14.0):
```python
import vllm_matgptq
from vllm import LLM

llm = LLM(model="your-model", quantization="my_quant")
```

See inference_lib/inference_demo_vllm.py for an example of how to implement this. To run the model, see inference_lib/scripts/run_inference_vllm.sh for an example.
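A minimal generation call with the resulting LLM object uses the standard vLLM API; the prompt and sampling settings below are purely illustrative:

```python
from vllm import SamplingParams

# Standard vLLM generation call; prompt and sampling settings are illustrative only.
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain Matryoshka quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```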
Models can be packed using pack_quantized_model.py as follows:
```bash
python pack_quantized_model.py \
    --model_name_or_path /path/to/model \
    --quantized_weights_path /path/to/quantized/weights \
    --packed_output_path /output/path \
    --inference_bitwidth 4 \
    --master_bitwidth 8 \
    --group_size 128 \
    --quant_dtype float16
```

If you use MatGPTQ in your research, please cite:
```bibtex
@misc{kleinegger2026matgptqaccurateefficientposttraining,
      title={MatGPTQ: Accurate and Efficient Post-Training Matryoshka Quantization},
      author={Maximilian Kleinegger and Elvir Crnčević and Dan Alistarh},
      year={2026},
      eprint={2602.03537},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.03537},
}
```