TurboQuant: Google’s AI Compression That Now Runs on CPU
Google has introduced TurboQuant, a new quantization technique designed for large language models and vector search. The research targets one of the most persistent bottlenecks in AI deployment: the KV cache, which grows proportionally with context length and has historically forced teams toward expensive GPU clusters. TurboQuant changes the equation. By compressing KV cache entries to around 3 bits with no fine-tuning and no accuracy loss, it makes AI inference viable on ordinary CPU hardware - the kind that powers standard cloud servers today.

How TurboQuant Works
TurboQuant is a two-stage system. PolarQuant performs the coarse quantization and accounts for most of the compression; QJL then runs a 1-bit pass that encodes the residual error left behind. Together they achieve 3-bit KV cache quantization. No fine-tuning is required, and accuracy on standard benchmarks is preserved. The key insight is that the two methods are complementary - the cheap 1-bit correction soaks up exactly the error the coarse stage leaves, bringing the combination close to theoretical compression limits.
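As a rough illustration of this two-stage idea, the sketch below pairs a coarse 2-bit uniform quantizer with a 1-bit residual-sign correction, for 3 bits per value in total. The uniform grid, the mean-magnitude scale factor, and the sizes are illustrative assumptions, not the paper's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)
v = rng.standard_normal(256)

# Stage 1 (stand-in for the coarse quantizer): 2-bit uniform grid.
lo, hi = v.min(), v.max()
step = (hi - lo) / 3                      # 4 levels -> 2 bits per value
coarse = lo + np.round((v - lo) / step) * step

# Stage 2 (stand-in for the 1-bit pass): one extra bit per value stores
# the sign of the residual; a single scalar per vector scales it back.
residual = v - coarse
scale = np.abs(residual).mean()
refined = coarse + np.sign(residual) * scale

err_coarse = np.linalg.norm(v - coarse)
err_refined = np.linalg.norm(v - refined)
```

The sign correction always helps here: a short calculation shows the squared error drops by exactly `n * scale**2`, which is the sense in which the two stages are complementary.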
QJL: The Zero-Overhead 1-Bit Trick
QJL applies the Johnson-Lindenstrauss transform to high-dimensional key and value vectors. This mathematical transform is known for shrinking data while preserving relative distances between points. QJL pushes it further by keeping only the sign of each projected coordinate - a single +1/-1 bit per dimension. The result is an extreme reduction in memory footprint with zero additional overhead, since no quantization constants need to be stored. Attention score computation remains accurate because the sign-bit projection preserves the geometric relationships that matter most during inference.
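A minimal sketch of the sign-bit idea, assuming a dense Gaussian JL projection: the cached key is reduced to its sign bits plus one norm scalar, and its inner product with a full-precision query is recovered using the standard sqrt(pi/2) de-biasing factor from the Gaussian sign analysis. The dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 8192                 # embedding dim, projection dim (illustrative)

k = rng.standard_normal(d)      # a "key" vector to be cached
q = rng.standard_normal(d)      # an incoming "query" vector

S = rng.standard_normal((m, d)) # Gaussian JL projection matrix

# Compress the key: keep only the sign of each projected coordinate,
# plus a single scalar (the key's norm).
k_bits = np.sign(S @ k)
k_norm = np.linalg.norm(k)

# Estimate <q, k> from the sign bits; the query stays full precision.
est = k_norm * np.sqrt(np.pi / 2) / m * (k_bits @ (S @ q))
exact = q @ k
```

The estimate concentrates around the exact inner product as `m` grows, which is why attention scores survive the 1-bit treatment.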
PolarQuant: A New Angle on Compression
PolarQuant reframes the compression problem geometrically. Rather than working in standard Cartesian coordinates, it converts vectors into polar form - a radius representing magnitude and angles representing direction. This eliminates the expensive normalization step that most quantization methods require. The polar representation maps naturally onto a predictable circular grid, which quantizes cleanly. Recursive polar transforms can distill a full high-dimensional vector down to a single radius plus a compact set of angles, achieving aggressive compression with minimal distortion of the underlying data.
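The angle-quantization step can be sketched as follows: pair up adjacent dimensions into 2-D polar coordinates and snap each angle to a uniform 5-bit grid. This is a simplified single-level version - the radii are left unquantized and the recursive transform is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.standard_normal(64)

# Move each (x, y) pair of dimensions into polar form.
x, y = v[0::2], v[1::2]
r = np.hypot(x, y)                  # radii (magnitudes)
theta = np.arctan2(y, x)            # angles in (-pi, pi]

# Quantize angles to a uniform circular grid: 5 bits -> 32 bins.
bins = 32
idx = np.round((theta + np.pi) / (2 * np.pi) * bins) % bins
theta_q = idx / bins * 2 * np.pi - np.pi

# Reconstruct and measure relative distortion.
v_hat = np.empty_like(v)
v_hat[0::2] = r * np.cos(theta_q)
v_hat[1::2] = r * np.sin(theta_q)
err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

Because the maximum angle error is half a grid cell (pi/32 radians), the relative reconstruction error is bounded below 10% regardless of the input - the "predictable circular grid" at work.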
Experiments and Results
The Google team evaluated TurboQuant across a range of long-context benchmarks: LongBench, Needle-in-Haystack, ZeroSCROLLS, RULER, and L-Eval. Models tested include Gemma, Mistral, and Llama-3.1-8B-Instruct. KV cache memory was reduced by 6x or more. At 4-bit quantization, TurboQuant achieves an 8x speedup over standard 32-bit on H100 GPUs. For vector search tasks, TurboQuant outperforms both Product Quantization (PQ) and RaBitQ baselines on recall and search quality metrics.
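A back-of-envelope sizing check makes the memory numbers concrete. This uses Llama-3.1-8B's public configuration (32 transformer layers, 8 KV heads under grouped-query attention, head dimension 128) with an assumed 128k-token context, and counts only the raw cache entries (per-vector scalars add a little on top):

```python
# KV-cache sizing for Llama-3.1-8B at a 128k-token context.
layers, kv_heads, head_dim, ctx = 32, 8, 128, 131072
entries = 2 * layers * kv_heads * head_dim * ctx   # keys + values
fp16_gib = entries * 2 / 2**30       # 16-bit baseline, in GiB
tbq_gib = entries * 3 / 8 / 2**30    # ~3 bits per entry, in GiB
print(fp16_gib, tbq_gib)             # -> 16.0 3.0
```

So a cache that needs 16 GiB at 16-bit precision fits in about 3 GiB at 3 bits per entry; measured against a 32-bit baseline the raw reduction is over 10x.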
CPU Inference Is Now Production-Ready
This is the practical takeaway. TurboQuant shrinks the memory footprint so aggressively that CPU inference becomes viable for real production workloads - not just research demos. The llama.cpp community picked this up quickly, and working implementation branches are already available:
- https://github.com/elusznik/llama.cpp/tree/turboquant-cpu-tbq-pr
- https://github.com/Aaryan-Kapoor/llama.cpp/tree/turboquant-tq3_0
Cloud servers - like those available at DCXV - are now more than capable of running AI inference without any GPU hardware at all. If you have been waiting for a reason to move AI workloads off expensive GPU instances and onto standard cloud VMs, TurboQuant is that reason. See https://dcxv.com/data-center#cloud for current cloud server options.
Looking Ahead
TurboQuant addresses the KV cache bottleneck that has constrained Gemini-scale models since their release. It also enables high-quality semantic vector search at Google’s own operational scale. The benchmarks suggest the method is approaching near-theoretical lower bounds for this class of compression. As AI capabilities integrate deeper into software products, efficient quantization becomes foundational infrastructure - not a research curiosity. TurboQuant points toward a future where capable AI runs on commodity hardware, available to anyone with a standard server.



