TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.
Near-optimal vector quantization from Google's ICLR 2026 paper — 95% recall, 5x compression, zero preprocessing, pure Python FAISS replacement
First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.
Minimal, zero-dependency LLM inference in pure C11. CPU-first with NEON/AVX2 SIMD. Flash MoE (pread + LRU expert cache). TurboQuant 3-bit KV compression (8.9x less memory per session). 20+ GGUF quant formats. Compiles to WASM.
No BS theatrics. Real automated pentesting. Mac only.
Hardware-agnostic machine learning infrastructure for .NET. Implements high-performance neural network layers in C# that are transpiled to run on WebGPU, CUDA, OpenCL, WebGL, CPU, and Wasm via SpawnDev.ILGPU. Optimized for Blazor WebAssembly and native GPU execution.
TurboQuant-style embedding compression for RAG: an SDK using fixed rotations, PolarQuant, and QJL residual sketches for compact storage and fast similarity search
TurboQuant: Native 3-Bit Quantization for Ollama - Achieve 25-28% better compression than Q4_0 while maintaining high-speed CPU inference. Experimentally integrated into Ollama with custom GGML kernels for LLM efficiency.
KV Cache with PagedAttention vs PagedAttention + TurboQuant - experiments across token sizes comparing memory, latency, and accuracy.
TurboQuant (ICLR 2026) ported to Apple Silicon — KV cache compression with MLX Metal kernels + PyTorch CPU
Interactive benchmarking tool for TurboQuant KV cache compression. Supports 2-4 bit quantization with real-time metrics
AI Code Review Memory - learns from your team's bug history and warns when similar patterns appear
Turbo Index
ChatMind: Semantic search for Discord & KakaoTalk chat messages. Search by meaning, not keywords. Powered by TurboQuant compression (ICLR 2026).
AI agent skill implementing Google's TurboQuant compression algorithm (ICLR 2026) — 6x KV cache memory reduction, 8x speedup, zero accuracy loss. Compatible with Claude Code, Codex CLI, and all Agent Skills-compatible tools.
Near-optimal vector quantization for LLM KV cache compression. Python implementation of TurboQuant (ICLR 2026) — PolarQuant + QJL for 3-bit quantization with minimal accuracy loss and up to 8x memory reduction.
CommitMind: Semantic search for Git commit history powered by TurboQuant vector compression (ICLR 2026). Search commits by meaning, not just keywords.
TurboQuant (ICLR 2026) vector quantization for memory/RAG embedding compression | 5-8x compression, 98%+ recall | NumPy only, no GPU
AI-powered log anomaly detection CLI — learns normal patterns, detects anomalies with semantic embeddings, matches past incidents. Powered by TurboQuant 3-bit compression (ICLR 2026).
Near-optimal vector quantization with zero metadata overhead — PyTorch SDK based on Google Research ICLR 2026
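Several of the repositories above advertise 3-bit KV cache quantization with roughly 5-8x memory reduction. As a rough illustration of where that figure comes from (FP16 is 16 bits per value, so 3-bit codes give about 16/3 ≈ 5.3x before scale/zero-point overhead), here is a minimal NumPy sketch of per-row asymmetric 3-bit uniform quantization. This is a hedged toy example, not the actual TurboQuant algorithm (which the papers describe as using random rotations, PolarQuant, and QJL sketches); the function names are invented for illustration.

```python
import numpy as np

def quantize_3bit(x, axis=-1):
    """Per-row asymmetric 3-bit uniform quantization (illustrative only;
    TurboQuant itself adds rotations and residual sketches on top)."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / 7.0                       # 2**3 - 1 = 7 quantization steps
    scale = np.where(scale == 0, 1.0, scale)      # guard against constant rows
    codes = np.round((x - lo) / scale).astype(np.uint8)  # integer codes in [0, 7]
    return codes, scale, lo

def dequantize_3bit(codes, scale, lo):
    """Reconstruct an approximation of the original tensor."""
    return codes.astype(np.float32) * scale + lo

# Toy stand-in for one K/V cache block: 4 heads x 128 channels.
kv = np.random.randn(4, 128).astype(np.float32)
codes, scale, lo = quantize_3bit(kv)
kv_hat = dequantize_3bit(codes, scale, lo)

# Rounding error is bounded by half a quantization step per row.
max_err = np.abs(kv - kv_hat).max()
assert max_err <= scale.max() / 2 + 1e-6
```

In a real kernel the 3-bit codes would also be bit-packed (e.g. ten codes per 32-bit word) rather than stored one per `uint8`, which is where the remaining memory savings come from.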