TurboQuant: Google's AI Compression That Now Runs on CPU

Google has introduced TurboQuant, a quantization technique for large language models and vector search. The research targets one of the most persistent bottlenecks in AI deployment: the KV cache, which grows linearly with context length and has historically forced teams toward expensive GPU clusters. TurboQuant changes the equation. By compressing KV cache entries to around 3 bits with no fine-tuning and no measurable accuracy loss on the reported benchmarks, it makes AI inference viable on ordinary CPU hardware - the kind that powers standard cloud servers today.
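To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch of KV cache size. The model shape (32 layers, 8 KV heads, head dimension 128) is an assumed Llama-3.1-8B-style configuration, and the helper `kv_cache_bytes` is a hypothetical name used only for illustration. It ignores any per-block metadata a real quantized format might add.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bits=16):
    """Approximate KV cache size in bytes for a single sequence.

    The default shape is an assumption (a Llama-3.1-8B-style config),
    not something taken from the TurboQuant paper.
    """
    # Factor 2: one key vector and one value vector per token, layer, and head.
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elements * bits / 8


GIB = 1024 ** 3
for tokens in (8_192, 131_072):
    full = kv_cache_bytes(tokens, bits=16) / GIB
    compact = kv_cache_bytes(tokens, bits=3) / GIB
    print(f"{tokens:>7} tokens: {full:6.2f} GiB at 16-bit, {compact:6.2f} GiB at 3-bit")
```

Under these assumptions, a 131,072-token context needs 16 GiB of cache at 16-bit precision and 3 GiB at 3 bits per value, which is the difference between needing accelerator memory and fitting comfortably in ordinary server RAM.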

How TurboQuant Works

TurboQuant is a two-part system. PolarQuant does the bulk of the compression work, encoding most of the information in each vector. QJL then applies a 1-bit error-correction pass to the residual that PolarQuant leaves behind. Together they achieve 3-bit KV cache quantization. No fine-tuning is required, and accuracy on standard benchmarks is preserved. The key insight is that the two methods are complementary - QJL cleans up the error PolarQuant leaves, which brings the combined scheme close to the theoretical limits of compression.

QJL: The Zero-Overhead 1-Bit Trick

QJL applies the Johnson-Lindenstrauss transform to high-dimensional key and value vectors. This random projection is known for shrinking data while approximately preserving distances and inner products between points. QJL takes this further by keeping only one sign bit per projected coordinate - either +1 or -1. The result is an extreme reduction in memory footprint with zero additional memory overhead. Attention score computation remains accurate because the sign-bit projection preserves the geometric relationships that matter most during inference.
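The sign-bit idea can be sketched in a few lines of plain Python. This is a minimal illustration, not Google's implementation: the function names, the Gaussian projection matrix, the choice to keep the key's norm as one extra scalar, and the sizes `d` and `m` are all assumptions. The number of projections is deliberately generous so the estimate is visibly tight. The estimator relies on the standard identity that, for a Gaussian vector g, E[(g·q)·sign(g·k)] = sqrt(2/π)·(q·k)/‖k‖.

```python
import math
import random


def qjl_encode(key, proj):
    """Keep one sign bit per projected coordinate, plus the key's norm."""
    signs = [1 if sum(r * x for r, x in zip(row, key)) >= 0 else -1 for row in proj]
    norm = math.sqrt(sum(x * x for x in key))
    return signs, norm


def qjl_inner_product(query, signs, norm, proj):
    """Estimate <query, key> from the sign bits; the query is not quantized."""
    total = 0.0
    for sign, row in zip(signs, proj):
        total += sign * sum(r * x for r, x in zip(row, query))
    # The sqrt(pi / 2) factor undoes the shrinkage caused by taking signs.
    return math.sqrt(math.pi / 2) * norm * total / len(proj)


rng = random.Random(0)
d, m = 64, 1024  # illustrative sizes, not values from the paper
proj = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
key = [rng.gauss(0, 1) for _ in range(d)]
query = [k + 0.5 * rng.gauss(0, 1) for k in key]  # correlated with the key

signs, norm = qjl_encode(key, proj)
exact = sum(q * k for q, k in zip(query, key))
approx = qjl_inner_product(query, signs, norm, proj)
print(f"exact={exact:.2f} approx={approx:.2f}")
```

Only the key side is reduced to bits here; the query is projected at full precision at lookup time, which is what keeps the inner-product estimate unbiased in this sketch.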

PolarQuant: A New Angle on Compression

PolarQuant reframes the compression problem geometrically. Rather than working in standard Cartesian coordinates, it converts vectors into polar form - a radius representing magnitude and angles representing direction. This eliminates the expensive normalization step that most quantization methods require. The polar representation maps naturally onto a predictable circular grid, which quantizes cleanly. Recursive polar transforms can distill a full high-dimensional vector down to a single radius combined with a compact set of angles, achieving aggressive compression without distorting the underlying data.
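The recursive polar transform can be sketched as follows, assuming a vector whose length is a power of two. Coordinates are paired into (radius, angle), and the resulting radii are paired again until a single radius remains. The function names, the uniform angle grid, and the 4-bit width are illustrative assumptions; the actual method may use different codebooks and a preconditioning step that is not shown here.

```python
import math


def polar_encode(vec):
    """Recursively turn a length-2^k vector into one radius plus angle levels."""
    levels = []
    current = list(vec)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        levels.append([math.atan2(y, x) for x, y in pairs])
        current = [math.hypot(x, y) for x, y in pairs]
    return current[0], levels


def polar_decode(radius, levels):
    """Invert polar_encode by expanding radii back into coordinate pairs."""
    current = [radius]
    for angles in reversed(levels):
        expanded = []
        for r, a in zip(current, angles):
            expanded.extend((r * math.cos(a), r * math.sin(a)))
        current = expanded
    return current


def quantize_angle(angle, bits, lo, hi):
    """Snap an angle to the centre of one of 2**bits uniform bins on [lo, hi]."""
    n = 2 ** bits
    step = (hi - lo) / n
    idx = min(n - 1, max(0, int((angle - lo) / step)))
    return lo + (idx + 0.5) * step


vec = [0.3, -1.2, 0.8, 0.5, -0.7, 0.1, 1.5, -0.4]
radius, levels = polar_encode(vec)

# First-level angles span [-pi, pi]; deeper levels pair non-negative radii,
# so their angles fall in [0, pi/2] and need a narrower grid.
quantized = [[quantize_angle(a, 4, -math.pi, math.pi) for a in levels[0]]]
quantized += [[quantize_angle(a, 4, 0.0, math.pi / 2) for a in lvl] for lvl in levels[1:]]

restored = polar_decode(radius, quantized)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(restored, vec)))
print(f"radius={radius:.3f} reconstruction error={error:.3f}")
```

One property worth noticing in this sketch: because only the angles are quantized, the reconstructed vector always has exactly the stored radius as its length, so no separate normalization constant is needed.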

Experiments and Results

The Google team evaluated TurboQuant across a range of long-context benchmarks: LongBench, Needle-in-Haystack, ZeroSCROLLS, RULER, and L-Eval. Models tested include Gemma, Mistral, and Llama-3.1-8B-Instruct. KV cache memory was reduced by 6x or more. At 4-bit quantization, TurboQuant achieves up to an 8x speedup over an unquantized 32-bit baseline on H100 GPUs. For vector search tasks, TurboQuant outperforms both Product Quantization (PQ) and RaBitQ baselines on recall and search quality metrics.

CPU Inference Is Now Production-Ready

This is the practical takeaway. TurboQuant compresses the KV cache so aggressively that CPU inference becomes viable for real production workloads - not just research demos. The llama.cpp community recognized this quickly and has already shipped working implementation branches.

Cloud servers - like those available at DCXV - are now more than capable of running AI inference without any GPU hardware at all. If you have been waiting for a reason to move AI workloads off expensive GPU instances and onto standard cloud VMs, TurboQuant is that reason. See https://dcxv.com/data-center#cloud for current cloud server options.

Looking Ahead

TurboQuant addresses the KV cache bottleneck that has constrained Gemini-scale models since their release. It also enables high-quality semantic vector search at Google's own operational scale. The benchmarks suggest the method is close to the theoretical lower bounds for this class of compression. As AI capabilities integrate deeper into software products, efficient quantization becomes foundational infrastructure - not a research curiosity. TurboQuant points toward a future where capable AI runs on commodity hardware, available to anyone with a standard server.
