JacktheLander/Quantization

Quantization - Reduce memory usage by >40% during inference!

QAI Hub - Qualcomm's open-source library used to quantize models for a specific chipset

Linear Quantization - Simple and easy to use
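As a rough illustration of the linear mapping, the sketch below uses the common formulation q = clamp(round(r / scale + zero_point)) and r ≈ scale * (q - zero_point). The function names, the int8 range, and the sample scale and zero point are assumptions chosen for the example, not code taken from this repository.

```python
def linear_quantize(values, scale, zero_point, q_min=-128, q_max=127):
    """Map floats to integers: q = clamp(round(r / scale + zero_point))."""
    quantized = []
    for r in values:
        q = round(r / scale + zero_point)
        quantized.append(max(q_min, min(q_max, q)))  # clamp to the integer range
    return quantized


def linear_dequantize(quantized, scale, zero_point):
    """Recover approximate floats: r ~ scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in quantized]


# Hypothetical values, scale, and zero point, picked so nothing is clamped.
values = [-1.0, -0.25, 0.0, 0.5, 2.0]
scale, zero_point = 0.02, 10
q_values = linear_quantize(values, scale, zero_point)
restored = linear_dequantize(q_values, scale, zero_point)
```

When no value is clamped, each restored number differs from the original by at most half a quantization step (scale / 2).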

Asymmetric Quantization - Good for activations that aren't centered at zero, because the scale and zero point are computed from the tensor's actual minimum and maximum, so the full integer range is used
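A minimal sketch of how asymmetric parameters are usually derived, assuming an unsigned 8-bit range and the standard min/max formulas. The helper names and the sample activations are hypothetical, and the fallback for a constant tensor is an arbitrary choice made for the example.

```python
def asymmetric_params(values, q_min=0, q_max=255):
    """Derive scale and zero point from the observed min and max."""
    r_min, r_max = min(values), max(values)
    if r_max == r_min:
        return 1.0, q_min  # arbitrary fallback for a constant tensor
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = round(q_min - r_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))  # keep it representable
    return scale, zero_point


def quantize(values, scale, zero_point, q_min=0, q_max=255):
    return [max(q_min, min(q_max, round(r / scale + zero_point))) for r in values]


def dequantize(quantized, scale, zero_point):
    return [scale * (q - zero_point) for q in quantized]


# Hypothetical activations skewed toward positive values.
activations = [-0.8, 0.0, 0.4, 2.5, 5.6]
scale, zero_point = asymmetric_params(activations)
q_act = quantize(activations, scale, zero_point)
restored = dequantize(q_act, scale, zero_point)
```

Because the range is fitted to the data, the smallest activation lands on the lowest integer and the largest on the highest, so no integer codes are wasted on values that never occur.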

Symmetric Quantization - The range is centered at zero, so no zero point is needed; extremely fast due to the simpler math and lower memory usage
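For comparison, a symmetric scheme can be sketched as below, assuming a signed range of [-127, 127] and a scale taken from the largest absolute value. The function names and sample weights are hypothetical.

```python
def symmetric_scale(values, q_max=127):
    """Scale from the largest magnitude; the zero point is implicitly 0."""
    max_abs = max(abs(v) for v in values)
    return max_abs / q_max if max_abs > 0 else 1.0


def symmetric_quantize(values, scale, q_max=127):
    return [max(-q_max, min(q_max, round(v / scale))) for v in values]


def symmetric_dequantize(quantized, scale):
    return [q * scale for q in quantized]  # no zero point to subtract


# Hypothetical weights roughly centered at zero.
weights = [-1.5, -0.3, 0.0, 0.7, 1.2]
scale = symmetric_scale(weights)
q_weights = symmetric_quantize(weights, scale)
restored = symmetric_dequantize(q_weights, scale)
```

Only one number (the scale) has to be stored per tensor, and a float zero always maps to integer zero, which is where the simpler math comes from.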

Per-Channel Quantization - Rather than quantizing the whole tensor with one scale, we can quantize each output channel with its own scale for the highest accuracy with minimal performance drop
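The accuracy difference can be seen on a small example. This sketch assumes symmetric int8 quantization and treats each row of the weight matrix as one output channel; the helper names and the weights are hypothetical.

```python
def quantize_row(row, scale, q_max=127):
    return [max(-q_max, min(q_max, round(v / scale))) for v in row]


def total_error(weights, scales):
    """Sum of absolute round-trip errors, using one scale per row."""
    error = 0.0
    for row, scale in zip(weights, scales):
        for v, q in zip(row, quantize_row(row, scale)):
            error += abs(v - q * scale)
    return error


# Two channels with very different magnitudes (hypothetical values).
weights = [[0.01, -0.02, 0.03], [1.0, -2.0, 3.0]]

# Per-tensor: one scale shared by every row.
tensor_scale = max(abs(v) for row in weights for v in row) / 127
per_tensor_scales = [tensor_scale] * len(weights)

# Per-channel: each row gets a scale fitted to its own range.
per_channel_scales = [max(abs(v) for v in row) / 127 for row in weights]

per_tensor_error = total_error(weights, per_tensor_scales)
per_channel_error = total_error(weights, per_channel_scales)
```

With a shared scale, the small-magnitude row collapses onto only a few integer values; with its own scale it keeps most of its resolution, so the per-channel error comes out lower.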

Per-Group Quantization - Each group of elements within a row gets its own scale. Best for maintaining precision, improves memory slightly, and is mostly relevant for very large models such as LLMs
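One way to picture per-group quantization is to split each row into fixed-size groups and give every group its own symmetric scale. The sketch below assumes that layout; the function names, the group size of 4, and the sample row are hypothetical.

```python
def quantize_per_group(row, group_size, q_max=127):
    """Quantize a row in chunks of group_size, one scale per chunk."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be divisible by group_size")
    quantized, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / q_max if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.extend(round(v / scale) for v in group)
    return quantized, scales


def dequantize_per_group(quantized, scales, group_size):
    """Multiply each integer by the scale of the group it belongs to."""
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]


# Hypothetical row: a small-magnitude group followed by a large one.
row = [0.01, -0.02, 0.015, 0.005, 4.0, -3.0, 2.5, 1.0]
group_size = 4
q_row, group_scales = quantize_per_group(row, group_size)
restored = dequantize_per_group(q_row, group_scales, group_size)
```

The trade-off is that every group adds one stored scale, so smaller groups give finer precision at the cost of more metadata.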

Inference Quantization - Running inference with quantized weights while the activations stay in floating point
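A minimal sketch of a linear layer that keeps weights as 8-bit integers and activations as floats, assuming symmetric per-row weight quantization. The names `quantize_weights` and `w8a32_forward` and all numeric values are hypothetical, not identifiers from this repository.

```python
def quantize_weights(weights, q_max=127):
    """Symmetric per-row int8 quantization of a weight matrix."""
    q_weights, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / q_max if max_abs > 0 else 1.0
        scales.append(scale)
        q_weights.append([round(w / scale) for w in row])
    return q_weights, scales


def w8a32_forward(x, q_weights, scales, bias):
    """Float activations times integer weights, rescaled once per output row."""
    output = []
    for q_row, scale, b in zip(q_weights, scales, bias):
        acc = sum(xi * qi for xi, qi in zip(x, q_row))
        output.append(acc * scale + b)
    return output


# Hypothetical layer with 3 inputs and 2 outputs.
weights = [[0.2, -0.5, 0.1], [1.5, 0.3, -0.9]]
bias = [0.05, -0.1]
x = [1.0, 2.0, -1.0]

q_weights, scales = quantize_weights(weights)
y_quant = w8a32_forward(x, q_weights, scales, bias)

# Float reference for comparison.
y_float = [sum(xi * wi for xi, wi in zip(x, row)) + b for row, b in zip(weights, bias)]
```

Only the weights are stored in reduced precision here, so the memory saving comes from the weight matrix while the arithmetic on activations stays in floating point.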

8-bit Quantizer - Performs an accelerated forward pass in the neural network

Quantize Layers - Replaces the linear layers in a model with quantized layers
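The replacement step can be sketched without any framework as a recursive walk that swaps each float linear layer for a quantized one. Every class here (`Linear`, `QuantizedLinear`, `Block`) is a hypothetical stand-in, as is the `skip` parameter and the `lm_head` name; in PyTorch the same pattern is typically written with `named_children()` and `setattr`.

```python
class Linear:
    """Hypothetical stand-in for a framework's float linear layer."""
    def __init__(self, weights):
        self.weights = weights  # one row per output feature


class QuantizedLinear:
    """Hypothetical replacement holding int8 weights and per-row scales."""
    def __init__(self, float_layer):
        self.int8_weights, self.scales = [], []
        for row in float_layer.weights:
            max_abs = max(abs(w) for w in row)
            scale = max_abs / 127 if max_abs > 0 else 1.0
            self.scales.append(scale)
            self.int8_weights.append([round(w / scale) for w in row])


class Block:
    """Hypothetical container whose attributes are layers or nested blocks."""
    def __init__(self, **children):
        self.__dict__.update(children)


def replace_linear_layers(module, skip=()):
    """Recursively swap Linear children for QuantizedLinear, except skipped names."""
    for name, child in list(vars(module).items()):
        if isinstance(child, Linear) and name not in skip:
            setattr(module, name, QuantizedLinear(child))
        elif isinstance(child, Block):
            replace_linear_layers(child, skip)


model = Block(
    encoder=Block(fc1=Linear([[0.5, -1.0]]), fc2=Linear([[0.1, 0.2]])),
    lm_head=Linear([[1.0, 2.0]]),
)
replace_linear_layers(model, skip=("lm_head",))
```

Leaving some layers in full precision, such as a final output head, is a common choice when those layers are especially sensitive to rounding error.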

Quantizing Models - Testing the Quantizer on an open-source LLM and Object Detection model, we see significant memory reduction with the same performance

About

Exploring quantization techniques to make compressed ML models that are optimized for inference
