MatGPTQ: Accurate and Efficient Post-Training Matryoshka Quantization
Official implementation of MatGPTQ (Matryoshka GPTQ), a new PTQ pipeline that produces a single parent model jointly optimized for multiple target precisions in one shot, based on a small calibration set.
Matryoshka Quantization (MatQuant) is a recent quantization approach showing that a single integer-quantized model can be served at multiple precisions by slicing off the most significant bits (MSBs) at inference time. This lets a single checkpoint cover a wide range of memory and latency budgets, but makes quantization considerably more challenging. In particular, the original MatQuant relies on expensive quantization-aware training (QAT) variants rather than fast one-shot post-training quantization (PTQ), and lacks open-source and kernel support. We address all of these limitations by introducing Post-Training Matryoshka Quantization (MatGPTQ), a new PTQ pipeline that produces a single parent model jointly optimized for multiple target precisions in one shot, based on a small calibration set. MatGPTQ casts Matryoshka quantization as a multi-precision objective with bit-slicing and cross-bit error compensation, resulting in an algorithm that produces a multi-bit-width, "sliceable" model in a single pass. We also incorporate a new budget-aware search for heterogeneous per-layer bit-widths and provide efficient kernels that implement slicing and mixed-precision execution. Across standard LLMs and benchmarks, MatGPTQ preserves high-bit accuracy while substantially improving performance at low bit-widths. Overall, we establish a new state of the art for Matryoshka-style post-training quantization and make single-checkpoint, multi-precision deployment open and practical.
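For intuition, the toy sketch below (in PyTorch) illustrates the general bit-slicing idea behind Matryoshka quantization: keeping only the most significant bits of stored 8-bit codes yields nested 4-bit or 2-bit weights from the same storage. This is an illustrative simplification with made-up tensors and scales, not the MatGPTQ algorithm itself.

```python
import torch

def slice_msb(codes: torch.Tensor, scale: torch.Tensor, target_bits: int) -> torch.Tensor:
    """Toy illustration of Matryoshka-style bit-slicing.

    `codes` holds unsigned 8-bit quantization codes (values in [0, 255]) in an
    integer tensor. Keeping only the top `target_bits` most significant bits gives
    a coarser code that reuses the same stored integers. The rescaling below is one
    simple choice, not the exact dequantization scheme used by MatGPTQ.
    """
    shift = 8 - target_bits
    sliced = codes >> shift                              # e.g. 4-bit codes in [0, 15]
    return (sliced.float() * (2 ** shift) - 128.0) * scale

# Hypothetical example: one block of 8-bit codes served at 8, 4, and 2 bits.
codes = torch.randint(0, 256, (4, 128), dtype=torch.int32)
scale = torch.full((4, 1), 0.01)
w8 = (codes.float() - 128.0) * scale   # full 8-bit dequantization
w4 = slice_msb(codes, scale, 4)        # same storage, read at 4 bits
w2 = slice_msb(codes, scale, 2)        # same storage, read at 2 bits
```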
- `scripts/`: bash scripts with the required arguments to run the method
- `src/`: helper methods and utility functions
- `inference_lib/`: CUDA kernels, deployment, and benchmarking scripts
- `evo_quant_search.py`: evolutionary quantization bit-width allocation
- `quant.py`: MatGPTQ/GPTQ quantization
- `lmeval.py`: LM Eval Harness evaluation script
- `eval_ppl.py`: perplexity evaluation script
- `pack_quantized_model.py`: packing script for vLLM deployment
- `setup.py`: installation script for the CUDA kernels
- `requirements.txt`: requirements file
Create a virtual environment and install dependencies (we recommend Python 3.12):
```bash
uv venv --python 3.12
source .venv/bin/activate
uv pip install -r requirements.txt
```

Note: The code has been tested with CUDA 12.4 and PyTorch 2.7.1.
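Optionally, you can sanity-check that the expected PyTorch and CUDA versions are visible from the new environment. This quick check is not part of the repository scripts:

```python
import torch

# Optional environment sanity check (not part of the repository scripts).
print("torch:", torch.__version__)            # tested with 2.7.1
print("cuda:", torch.version.cuda)            # tested with CUDA 12.4
print("gpu available:", torch.cuda.is_available())
```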
We provide quant.py for producing the MatGPTQ/GPTQ models. To produce the respective model, see either scripts/run_gptq.sh or scripts/run_matgptq.sh for examples of how to run the quantization:
```bash
bash scripts/run_matgptq.sh
```

We provide evo_quant_search.py for producing the Mix'n'Match MatGPTQ models. To produce the respective model, see scripts/run_quant_search.sh for an example of how to run EvoPress for MatGPTQ:
```bash
bash scripts/run_quant_search.sh
```

We provide lmeval.py and eval_ppl.py for evaluation on the Language Model Evaluation Harness benchmarks and for perplexity measurements. The interface of lmeval.py mostly follows the instructions of the original harness. In addition, one should specify the path to the quantized weights via the quant_weights_path argument, together with either the default uniform quantization bit-width quant_uniform_bitwidth and the master bit-width --quant_master_bitwidth, or a path to a .txt file with the chosen compression levels via the --quant_non_uniform_config_path argument. Furthermore, --method defines whether to evaluate MatGPTQ or GPTQ.
For deployment, install our custom kernels and use our vLLM plugin to run them. Further information can be found in ./inference_lib.
You can start by using our already quantized models at ISTA-DASLab/MatGPTQ. Simply update the inference_bitwidth parameter in the config.json file to your desired value, and you're all set.
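For instance, a small helper like the one below can switch the served precision by rewriting config.json. The checkpoint path is hypothetical; only the inference_bitwidth key comes from the instructions above:

```python
import json
from pathlib import Path

# Hypothetical local path to a downloaded MatGPTQ checkpoint; adjust to your setup.
config_path = Path("MatGPTQ-model/config.json")

config = json.loads(config_path.read_text())
config["inference_bitwidth"] = 4              # serve the same checkpoint at 4 bits
config_path.write_text(json.dumps(config, indent=2))
```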
To install the kernels, run the following command:

```bash
uv pip install --no-build-isolation -e .
```

To deploy MatGPTQ models in various environments, you can use MatGPTQ with our vLLM plugin as follows (more information here). We recommend using a clean, new environment (tested with vllm>=0.14.0):
```python
import vllm_matgptq
from vllm import LLM

llm = LLM(model="your-model", quantization="my_quant")
```

See inference_lib/inference_demo_vllm.py for an example of how to implement this. To run the model, see inference_lib/scripts/run_inference_vllm.sh for an example.
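A minimal generation call with the resulting LLM object uses the standard vLLM API; the prompt and sampling settings below are purely illustrative:

```python
from vllm import SamplingParams

# Standard vLLM generation call; prompt and sampling settings are illustrative only.
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain Matryoshka quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```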
Models can be packed using pack_quantized_model.py as follows:
```bash
python pack_quantized_model.py \
    --model_name_or_path /path/to/model \
    --quantized_weights_path /path/to/quantized/weights \
    --packed_output_path /output/path \
    --inference_bitwidth 4 \
    --master_bitwidth 8 \
    --group_size 128 \
    --quant_dtype float16
```

If you use MatGPTQ in your research, please cite:
```bibtex
@misc{kleinegger2026matgptqaccurateefficientposttraining,
      title={MatGPTQ: Accurate and Efficient Post-Training Matryoshka Quantization},
      author={Maximilian Kleinegger and Elvir Crnčević and Dan Alistarh},
      year={2026},
      eprint={2602.03537},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.03537},
}
```