TurboQuant: Google's AI Compression That Now Runs on CPU

Google has introduced TurboQuant, a new quantization technique designed for large language models and vector search. The research targets one of the most persistent bottlenecks in AI deployment: the KV cache, which grows proportionally with context length and has historically forced teams toward expensive GPU clusters. TurboQuant changes the equation. By compressing KV cache entries to around 3 bits with no fine-tuning and no accuracy loss, it makes AI inference viable on ordinary CPU hardware - the kind that powers standard cloud servers today.

How TurboQuant Works

TurboQuant is a two-part system. PolarQuant performs the bulk of the compression, and QJL follows with a 1-bit error-correction pass on what remains. Together they achieve roughly 3-bit KV cache quantization. No fine-tuning is required, and accuracy on standard benchmarks is preserved. The key insight is that the two methods are complementary - each compensates for the other's residual error, bringing the combined scheme close to theoretical compression limits.
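The division of labor can be illustrated with a toy two-stage residual quantizer - uniform scalar quantization standing in for the coarse stage, a single sign bit per coordinate standing in for the 1-bit correction. This is a simplified sketch of the general "coarse pass plus 1-bit residual" idea, not the actual PolarQuant or QJL math; all function names and parameters here are made up for illustration.

```python
import numpy as np

def coarse_quantize(v, bits=2):
    """Stage 1 stand-in: uniform scalar quantization over v's value range."""
    lo, hi = v.min(), v.max()
    levels = 2 ** bits - 1
    codes = np.round((v - lo) / (hi - lo) * levels)   # integer codes 0..levels
    return codes / levels * (hi - lo) + lo            # dequantized approximation

def one_bit_correction(v, approx):
    """Stage 2 stand-in: one sign bit per coordinate of the residual,
    scaled by the residual's mean magnitude."""
    resid = v - approx
    return approx + np.sign(resid) * np.abs(resid).mean()

rng = np.random.default_rng(2)
v = rng.standard_normal(256)
stage1 = coarse_quantize(v)
stage2 = one_bit_correction(v, stage1)
# the 1-bit pass shrinks the error left behind by the coarse stage
```

The point of the sketch is the structure: the second stage spends only one extra bit per coordinate, yet it provably reduces the reconstruction error left by the first stage, which is the same complementarity the article describes.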

QJL: The Zero-Overhead 1-Bit Trick

QJL applies the Johnson-Lindenstrauss transform to high-dimensional key and value vectors. This transform is known for shrinking data while preserving relative distances between points. QJL pushes it further by keeping only a single sign bit - +1 or -1 - per projected dimension. The result is an extreme reduction in memory footprint with no additional overhead. Attention score computation remains accurate because the sign-bit projection preserves the geometric relationships that matter most during inference.
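A minimal sketch of the sign-bit idea, assuming a random Gaussian projection and the standard identity that for Gaussian s, E[sign(⟨s,k⟩)·⟨s,q⟩] is proportional to the cosine between q and k. The dimensions, projection matrix, and function names are assumptions for illustration, not TurboQuant's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 512                     # head dimension, projected dimension (assumed)

S = rng.standard_normal((m, d))     # random Gaussian JL projection, shared by all keys

def qjl_encode(k):
    """Keep only the sign of each projected coordinate, plus the key's norm."""
    return np.sign(S @ k).astype(np.int8), np.linalg.norm(k)

def qjl_score(q, bits, k_norm):
    """Estimate <q, k> from the 1-bit code.
    Uses E[sign(<s,k>) * <s,q>] = sqrt(2/pi) * <q, k/||k||> for Gaussian s."""
    return k_norm * np.sqrt(np.pi / 2) * ((S @ q) @ bits) / m

k = rng.standard_normal(d)
q = rng.standard_normal(d)
bits, k_norm = qjl_encode(k)
est = qjl_score(q, bits, k_norm)    # tracks the exact dot product q @ k
```

Each key is stored as m sign bits plus one float for its norm, yet the attention score q·k can still be estimated directly from that code - which is why the memory savings come with essentially no extra compute at query time.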

PolarQuant: A New Angle on Compression

PolarQuant reframes the compression problem geometrically. Rather than working in standard Cartesian coordinates, it converts vectors into polar form - a radius representing magnitude and angles representing direction. This eliminates the expensive normalization step that most quantization methods require. The polar representation maps naturally onto a predictable circular grid, which quantizes cleanly. Recursive polar transforms can distill a full high-dimensional vector down to a single radius combined with a compact set of angles, achieving aggressive compression without distorting the underlying data.
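The polar idea can be sketched in its simplest pairwise form: split a vector into 2-D pairs, store each pair as a radius plus a quantized angle, and reconstruct from the circular grid. The bit budget, function names, and pairwise (rather than recursive) decomposition are simplifying assumptions for illustration, not PolarQuant's exact scheme.

```python
import numpy as np

ANGLE_BITS = 4                      # assumed bit budget per angle
LEVELS = 2 ** ANGLE_BITS

def polar_encode(v):
    """Split v into 2-D pairs; store each pair as a radius and a quantized angle."""
    pairs = v.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])          # angle in (-pi, pi]
    codes = np.round((theta + np.pi) / (2 * np.pi) * (LEVELS - 1)).astype(np.uint8)
    return r, codes

def polar_decode(r, codes):
    """Rebuild an approximation of v from the radii and angle codes."""
    theta = codes / (LEVELS - 1) * 2 * np.pi - np.pi
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return pairs.reshape(-1)

rng = np.random.default_rng(1)
v = rng.standard_normal(64)
v_hat = polar_decode(*polar_encode(v))   # error shrinks as ANGLE_BITS grows
```

Note what the angle codes buy: the angle lives on a fixed circular grid regardless of the vector's scale, so no separate normalization step is needed - the radius carries all the magnitude information, which mirrors the article's point about avoiding normalization.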

Experiments and Results

The Google team evaluated TurboQuant across a range of long-context benchmarks: LongBench, Needle-in-Haystack, ZeroSCROLLS, RULER, and L-Eval. The models tested include Gemma, Mistral, and Llama-3.1-8B-Instruct. KV cache memory was reduced by 6x or more. At 4-bit quantization, TurboQuant achieves an 8x speedup over standard 32-bit precision on H100 GPUs. For vector search tasks, TurboQuant outperforms both Product Quantization (PQ) and RaBitQ baselines on recall and search quality metrics.

CPU Inference Is Now Production-Ready

This is the practical takeaway. TurboQuant compresses models so aggressively that CPU inference becomes viable for real production workloads - not just research demos. The llama.cpp community recognized this quickly and has already begun shipping working implementation branches.

Cloud servers - like those available at DCXV - are now more than capable of running AI inference without any GPU hardware at all. If you have been waiting for a reason to move AI workloads off expensive GPU instances and onto standard cloud VMs, TurboQuant is that reason. See https://dcxv.com/data-center#cloud for current cloud server options.

Looking Ahead

TurboQuant addresses the KV cache bottleneck that has constrained Gemini-scale models since their release. It also enables high-quality semantic vector search at Google’s own operational scale. The benchmarks suggest the method is approaching near-theoretical lower bounds for this class of compression. As AI capabilities integrate deeper into software products, efficient quantization becomes foundational infrastructure - not a research curiosity. TurboQuant points toward a future where capable AI runs on commodity hardware, available to anyone with a standard server.
