Google TurboQuant: 6x Memory Reduction for LLMs

Google TurboQuant reduces LLM memory usage by up to 6x without quality loss.


Google Research unveiled TurboQuant in March 2026, a data-oblivious quantization framework that achieves more than 6x KV cache compression and an 8x attention speedup with near-zero accuracy loss. The work is slated for presentation at ICLR 2026 and AISTATS 2026.

The method compresses in two stages. First, PolarQuant applies a random orthogonal rotation and re-encodes vectors in polar coordinates, separating magnitude from direction so that no per-block quantization constants need to be stored. Second, QJL applies a 1-bit Johnson-Lindenstrauss transform that eliminates residual bias, yielding unbiased inner-product estimates, which is critical for attention accuracy.

Core advantages: KV cache stored at 3-4 bits per element (roughly 6x compression); up to 8x attention speedup on H100 GPUs; 100% recall on needle-in-a-haystack retrieval up to 104K tokens; and training-free, plug-and-play integration with any model.

For a 70B-parameter model at 128K context, the uncompressed KV cache needs roughly 30-50GB of VRAM; TurboQuant reduces that to about 5-8GB.
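The VRAM figures above can be sanity-checked with back-of-envelope arithmetic. The model shape below (80 layers, 8 KV heads under grouped-query attention, head dimension 128) is an assumed Llama-70B-like configuration, not a detail from the article:

```python
# Assumed 70B-class model shape (illustrative, Llama-70B-like):
layers, kv_heads, head_dim = 80, 8, 128
context = 128 * 1024                  # 128K tokens
bytes_fp16 = 2

# K and V each store layers * kv_heads * head_dim values per token.
elements = 2 * layers * kv_heads * head_dim * context

fp16_gb = elements * bytes_fp16 / 1e9        # uncompressed fp16 cache
compressed_gb = fp16_gb / 6                  # at the article's 6x ratio
```

This gives an uncompressed cache of about 43GB, inside the article's 30-50GB range, and about 7GB at 6x compression, matching the quoted 5-8GB.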

Practical impact: lower inference costs, deployment on consumer GPUs, and vector-database optimization. Unlike GPTQ or AWQ, TurboQuant is data-oblivious: it needs no calibration dataset and remains robust across data distributions.
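To make "data-oblivious" concrete, here is a generic sketch of calibration-free quantization: a random orthogonal rotation drawn without looking at any data, followed by a uniform scalar quantizer. This illustrates the data-oblivious idea only; it is not TurboQuant's actual kernel (in particular, it stores a per-vector scale, which PolarQuant's polar encoding is designed to avoid).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Random orthogonal rotation, drawn once with no calibration data.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def quantize(x, bits=4):
    """Rotate, then uniformly quantize each coordinate to `bits` bits."""
    z = Q @ x
    scale = np.abs(z).max() / (2 ** (bits - 1) - 1)
    codes = np.round(z / scale).astype(np.int8)   # 4-bit codes held in int8
    return codes, scale

def dequantize(codes, scale):
    # Undo the rotation; orthogonality preserves the quantization error norm.
    return Q.T @ (codes * scale)

x = rng.standard_normal(d)
codes, scale = quantize(x)
x_hat = dequantize(codes, scale)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

Because the rotation is data-independent, the same setup works for any input distribution, which is the property that lets methods like this skip GPTQ/AWQ-style calibration sets.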

Potentially the most impactful single algorithmic advance for LLM infrastructure in 2026: it addresses deployment, not capability.