TurboQuant — AI Glossary

TurboQuant is a vector quantization algorithm from Google Research, presented at ICLR 2026. It was developed by Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni, and targets one of the most pressing practical bottlenecks in large language model inference: the KV cache.

The Problem: KV Cache Memory

During transformer inference, every token’s attention keys and values must be stored in memory so that subsequent tokens can attend back to them. This KV cache grows linearly with sequence length and batch size, quickly becoming the dominant consumer of GPU VRAM for long-context and high-throughput deployments. At 32-bit precision, a single large model serving 1M-token contexts can require hundreds of gigabytes of KV cache alone, far more than the 80 GB of memory on a single H100.
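
The linear growth is easy to check with back-of-the-envelope arithmetic. The sketch below is illustrative only: the model shape (80 layers, 8 KV heads, head dimension 128) is a hypothetical configuration, not one taken from the TurboQuant work, and the 3-bit figure counts raw payload bits without any extra correction bits or metadata.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bits_per_value):
    """Bytes needed to hold keys and values for every layer, head, and token."""
    values_per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = one key + one value
    total_bits = values_per_token * seq_len * batch_size * bits_per_value
    return total_bits // 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=1_000_000, batch_size=1)

for bits in (32, 16, 3):
    gib = kv_cache_bytes(bits_per_value=bits, **shape) / 2**30
    print(f"{bits:>2}-bit KV cache: {gib:,.1f} GiB")
```

For this assumed shape, a single 1M-token sequence needs roughly 610 GiB at 32 bits, 305 GiB at 16 bits, and 57 GiB at 3 bits.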

Prior solutions either required fine-tuning the model on compressed representations, or used learned quantization that demanded dataset-specific calibration runs. TurboQuant eliminates both requirements.

Two-Stage Architecture

TurboQuant compresses vectors to 3–4 bits per coordinate through two mathematically complementary stages:

Stage 1 — PolarQuant (Geometric Rotation + Polar Quantization)

Standard quantization in Cartesian space degrades at extreme compression because vectors have unequal magnitudes and non-uniform distributions, so quantization error is spread unevenly across coordinates. PolarQuant addresses this by:

Applying a random orthogonal rotation to the key/value vectors, which spreads energy uniformly across dimensions (a technique related to the Johnson-Lindenstrauss transform).
Converting the rotated vectors to polar coordinates — separating magnitude (radius) from direction (angles). In high-dimensional space, the angular distribution of vectors is known, concentrated, and predictable, so direction can be quantized very aggressively.
Quantizing angles using Lloyd-Max centroids, the scalar quantization levels that minimize mean squared error for a known distribution.
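
The first step in the list above can be sketched in a few lines. This is a minimal pure-Python illustration, not TurboQuant's implementation: the helper names `random_orthogonal` and `rotate` are hypothetical, the matrix is built by Gram-Schmidt on Gaussian vectors (one of several ways to draw a random rotation), and a production kernel would use a fast structured transform on the GPU. It shows the two properties that matter here: the rotation preserves the vector's norm, and it spreads a "spiky" vector's energy across all coordinates.

```python
import math
import random


def random_orthogonal(dim, rng):
    """Build a random orthogonal matrix by Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:  # remove the components along rows already chosen
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip the vanishingly rare degenerate draw
            rows.append([a / norm for a in v])
    return rows


def rotate(matrix, vec):
    """Matrix-vector product: apply the rotation to one vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


rng = random.Random(0)
dim = 64
Q = random_orthogonal(dim, rng)

# A deliberately "spiky" vector: all of its energy sits in one coordinate.
spiky = [0.0] * dim
spiky[3] = 10.0
rotated = rotate(Q, spiky)

norm_before = math.sqrt(sum(x * x for x in spiky))
norm_after = math.sqrt(sum(x * x for x in rotated))
largest_share = max(x * x for x in rotated) / norm_after**2

print(f"norm before {norm_before:.4f}, after {norm_after:.4f}")
print(f"largest coordinate holds {largest_share:.1%} of the energy after rotation")
```

Before the rotation one coordinate holds 100% of the energy. Afterwards no single coordinate dominates, which is what lets the same small set of quantization levels serve every coordinate.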

The critical advantage: because the rotation is random (not learned), no per-dataset calibration is needed. And because polar coordinates separate magnitude from direction, no extra per-vector scale constants need to be stored — eliminating the memory overhead that plagues most quantization schemes.
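
Because the distribution being quantized is known in advance, the quantization levels can be computed once, offline, rather than fitted to each dataset. The sketch below runs the Lloyd-Max iteration on samples from a standard Gaussian, which is only a stand-in for the actual post-rotation distribution; the function names and the sample-based fitting are assumptions made for illustration.

```python
import random


def lloyd_max(samples, num_levels, iterations=50):
    """Alternate nearest-centroid assignment and centroid (mean) updates in 1-D."""
    samples = sorted(samples)
    # Start from evenly spaced quantiles of the sample.
    centroids = [samples[(2 * i + 1) * len(samples) // (2 * num_levels)]
                 for i in range(num_levels)]
    for _ in range(iterations):
        buckets = [[] for _ in centroids]
        for x in samples:
            nearest = min(range(num_levels), key=lambda i: abs(x - centroids[i]))
            buckets[nearest].append(x)
        # Each centroid moves to the mean of its bucket; empty buckets stay put.
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
    return centroids


def quantize(x, centroids):
    """Map a value to its nearest centroid."""
    return min(centroids, key=lambda c: abs(x - c))


def mse(samples, centroids):
    """Mean squared quantization error over the samples."""
    return sum((x - quantize(x, centroids)) ** 2 for x in samples) / len(samples)


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]

levels = 8  # 8 levels = 3 bits per value
fitted = lloyd_max(data, levels)
lo, hi = min(data), max(data)
uniform = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]

print("Lloyd-Max centroids:", [round(c, 3) for c in fitted])
print(f"MSE Lloyd-Max {mse(data, fitted):.4f} vs uniform grid {mse(data, uniform):.4f}")
```

The fitted levels crowd toward the center, where most of the probability mass lies, and give a lower mean squared error than an evenly spaced grid with the same number of levels.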

Stage 2 — QJL (Quantized Johnson-Lindenstrauss Error Correction)

Even with optimal quantization, some residual error accumulates in attention score computation at 3–4 bits. QJL spends a single extra bit per coordinate on a binary sketch derived from the Johnson-Lindenstrauss transform: it keeps only the signs of a random projection of the residual, and uses them to correct the systematic component of this error.

The Johnson-Lindenstrauss lemma guarantees that random projections approximately preserve pairwise distances between vectors. QJL exploits this: the 1-bit sketch captures enough geometric structure to detect and correct the dominant error pattern in attention dot products, restoring accuracy to match the 32-bit baseline without storing additional full-precision data.
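
The idea of recovering a dot product from sign bits can be sketched as follows. This is a simplified illustration under stated assumptions, not the TurboQuant kernel: it sketches a whole key vector rather than the residual left by the first stage, it keeps the vector's norm next to the sign bits, and the function names are hypothetical. The estimator relies on a standard identity for a Gaussian vector s: E[sign(s·k)(s·q)] = sqrt(2/π) · (q·k) / ‖k‖.

```python
import math
import random


def make_projection(num_rows, dim, rng):
    """Gaussian Johnson-Lindenstrauss matrix, shared by keys and queries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sketch_key(projection, key):
    """Keep only the sign of each projection (1 bit each) plus the key's norm."""
    signs = [1 if dot(row, key) >= 0 else -1 for row in projection]
    return signs, math.sqrt(dot(key, key))


def estimate_dot(projection, query, signs, key_norm):
    """Estimate <query, key> from the sign sketch; the query stays unquantized."""
    total = sum(s * dot(row, query) for s, row in zip(signs, projection))
    return math.sqrt(math.pi / 2) * key_norm * total / len(projection)


rng = random.Random(0)
dim, num_rows = 32, 4096  # many rows, only to make the demo estimate tight
S = make_projection(num_rows, dim, rng)
key = [rng.gauss(0.0, 1.0) for _ in range(dim)]
query = [rng.gauss(0.0, 1.0) for _ in range(dim)]

signs, key_norm = sketch_key(S, key)
print(f"exact    {dot(query, key):+.3f}")
print(f"estimate {estimate_dot(S, query, signs, key_norm):+.3f}")
```

The estimate is unbiased, and its error shrinks as the number of projection rows grows. Only the key side is reduced to sign bits; the query is projected at full precision when the attention score is computed.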

Together, PolarQuant handles the bulk compression and QJL provides a lightweight accuracy safety net. The combination comes within a constant factor of about 2.7 of the Shannon information-theoretic lower bound on distortion.

Performance

Benchmarked on H100 GPUs against 32-bit key-value storage:

6× reduction in KV cache memory usage
Up to 8× speedup in attention computation
No measured accuracy loss across the reported benchmark suite
Works on any transformer model without retraining or calibration
Compresses to as few as 3 bits per coordinate (down from 16 or 32)

The memory reduction holds consistently regardless of model architecture or domain, because TurboQuant’s random rotation is data-oblivious — it makes no assumptions about the distribution of your specific model’s activations.

Why It Matters

The KV cache is the primary constraint on LLM serving economics: it determines how many concurrent users you can serve per GPU, how long a context window you can sustain, and how much each inference call costs. A 6× reduction in KV cache memory translates directly to:

6× more concurrent requests per GPU at the same context length, or
6× longer context windows at the same batch size, or
Significantly reduced hardware spend for the same serving capacity
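
The first of these trade-offs can be made concrete with a short calculation. Every number below is hypothetical and chosen only for illustration: the 40 GiB budget, the 320 KiB of KV cache per token, and the 8,192-token context are assumptions, not measurements.

```python
def max_concurrent_requests(kv_budget_bytes, kv_bytes_per_token, context_len):
    """How many requests of a given context length fit in the KV-cache budget."""
    return kv_budget_bytes // (kv_bytes_per_token * context_len)


# Hypothetical numbers for illustration only.
kv_budget = 40 * 2**30            # 40 GiB of VRAM left for the KV cache after weights
baseline_per_token = 320 * 1024   # bytes of KV cache per token before compression
compressed_per_token = baseline_per_token // 6  # the 6x reduction quoted above
context_len = 8192

before = max_concurrent_requests(kv_budget, baseline_per_token, context_len)
after = max_concurrent_requests(kv_budget, compressed_per_token, context_len)
print(f"requests per GPU: {before} -> {after}")
```

With these assumed numbers the count rises from 16 to 96 concurrent requests. The same function shows the second trade-off if the batch size is held fixed and `context_len` is scaled instead.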

Native TurboQuant model integrations were scheduled for production deployment in Q3 2026, with a Triton kernel implementation already available for developers running custom inference stacks.
