TurboQuant: Google's AI Compression That Now Runs on CPU

Google has introduced TurboQuant, a new quantization technique designed for large language models and vector search. The research targets one of the most persistent bottlenecks in AI deployment: the KV cache, which grows proportionally with context length and has historically forced teams toward expensive GPU clusters. TurboQuant changes the equation. By compressing KV cache entries to around 3 bits with no fine-tuning and no accuracy loss, it makes AI inference viable on ordinary CPU hardware - the kind that powers standard cloud servers today.

How TurboQuant Works

TurboQuant is a two-part system. PolarQuant performs the bulk of the compression, and QJL then applies a 1-bit pass that corrects the residual error. Together they achieve roughly 3-bit KV cache quantization. No fine-tuning is required, and accuracy on standard benchmarks is preserved. The key insight is that the two methods are complementary: each compensates for the other's residual error, and the combination approaches theoretical compression limits.
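To make the two-stage shape concrete, here is a minimal NumPy sketch. It is not the paper's algorithm: `coarse_quantize` is a plain uniform scalar quantizer standing in as a hypothetical placeholder for PolarQuant, and `one_bit_residual` stands in for QJL's correction pass. Only the structure matters: a coarse pass, then one extra bit per dimension spent on the sign of the residual.

```python
# Conceptual sketch of a coarse-pass-plus-1-bit-correction pipeline.
# Both quantizers are simplified stand-ins, not TurboQuant's actual methods.
import numpy as np

def coarse_quantize(v, bits=2):
    # Hypothetical stand-in for PolarQuant: uniform scalar quantization.
    levels = 2 ** bits
    lo, hi = v.min(), v.max()
    step = (hi - lo) / (levels - 1)
    codes = np.round((v - lo) / step)
    return codes * step + lo            # dequantized approximation

def one_bit_residual(v, v_hat):
    # Hypothetical stand-in for the QJL pass: keep only the sign of the
    # residual plus one shared scale, costing ~1 extra bit per dimension.
    residual = v - v_hat
    scale = np.abs(residual).mean()     # optimal 1-bit scale under L2
    return v_hat + scale * np.sign(residual)

rng = np.random.default_rng(0)
v = rng.standard_normal(128)            # one KV-cache vector
v2 = coarse_quantize(v, bits=2)         # ~2-bit coarse pass
v3 = one_bit_residual(v, v2)            # +1 bit of correction -> ~3 bits
print(np.linalg.norm(v - v2), np.linalg.norm(v - v3))
```

Running the sketch shows the second error norm is noticeably smaller than the first: the 1-bit pass recovers much of what the coarse quantizer threw away.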

QJL: The Zero-Overhead 1-Bit Trick

QJL applies a Johnson-Lindenstrauss transform to high-dimensional key and value vectors. This mathematical transform is known for shrinking data while preserving relative distances between points. QJL takes this further by keeping only the sign of each projected coordinate: a single bit per dimension, either +1 or -1. The result is an extreme reduction in memory footprint with zero additional overhead. Attention score computation remains accurate because the sign-bit projection preserves the geometric relationships that matter most during inference.
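A minimal sketch of the sign-bit idea, under simplifying assumptions: a shared Gaussian projection matrix and the standard sign-JL inner-product estimator, which also keeps each key's norm as a scalar. The paper's exact construction and bias corrections may differ.

```python
# Sign-bit JL sketch: store 1 bit per projected dimension for each key,
# then estimate attention logits <q, k> from those bits.
import numpy as np

d, m = 128, 256                         # original dim, projection dim
rng = np.random.default_rng(0)
S = rng.standard_normal((m, d))         # shared Gaussian JL projection

def encode(k):
    # Keep only the sign of each projected coordinate: 1 bit per dim.
    return np.sign(S @ k)

def score(q, k_bits, k_norm):
    # For Gaussian rows s_i: E[<s_i, q> * sign(<s_i, k>)]
    #   = sqrt(2/pi) * <q, k> / ||k||,
    # so rescaling the empirical mean recovers an unbiased estimate.
    return k_norm * np.sqrt(np.pi / 2) / m * (S @ q) @ k_bits

q = rng.standard_normal(d)
k = rng.standard_normal(d)
est = score(q, encode(k), np.linalg.norm(k))
print(est, q @ k)                       # estimated vs exact logit
```

The query stays in full precision; only the cached keys are reduced to bits, which is why the memory savings land exactly where the KV cache hurts.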

PolarQuant: A New Angle on Compression

PolarQuant reframes the compression problem geometrically. Rather than working in standard Cartesian coordinates, it converts vectors into polar form - a radius representing magnitude and angles representing direction. This eliminates the expensive normalization step that most quantization methods require. The polar representation maps naturally onto a predictable circular grid, which quantizes cleanly. Recursive polar transforms can distill a full high-dimensional vector down to a single radius combined with a compact set of angles, achieving aggressive compression without distorting the underlying data.
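Here is an assumption-level sketch of that geometry, not the paper's recursive transform: split a vector into 2-D pairs, represent each pair as a radius and an angle, and quantize the angle on a fixed uniform grid. Because angles always live in [-π, π), the grid needs no per-vector normalization.

```python
# Pairwise polar quantization sketch: radii kept exact, angles quantized
# on a fixed circular grid. Illustrative only; PolarQuant's recursion
# compresses further, down to a single radius plus a set of angles.
import numpy as np

def to_polar_pairs(v):
    x, y = v[0::2], v[1::2]
    return np.hypot(x, y), np.arctan2(y, x)   # radii, angles

def quantize_angles(theta, bits=3):
    # The angle domain is fixed, so a uniform grid works without the
    # per-vector scaling step that scalar quantizers need.
    step = 2 * np.pi / 2 ** bits
    return np.round(theta / step) * step

def from_polar_pairs(r, theta):
    out = np.empty(2 * r.size)
    out[0::2], out[1::2] = r * np.cos(theta), r * np.sin(theta)
    return out

rng = np.random.default_rng(0)
v = rng.standard_normal(128)
r, theta = to_polar_pairs(v)
v_hat = from_polar_pairs(r, quantize_angles(theta, bits=3))
print(np.linalg.norm(v - v_hat) / np.linalg.norm(v))   # relative error
```

The design choice to quantize angles rather than raw coordinates is what makes the grid predictable: every pair lands on the same circle regardless of the vector's scale.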

Experiments and Results

The Google team evaluated TurboQuant across a range of long-context benchmarks: LongBench, Needle-in-Haystack, ZeroSCROLLS, RULER, and L-Eval. Models tested include Gemma, Mistral, and Llama-3.1-8B-Instruct. KV cache memory was reduced by 6x or more. At 4-bit quantization, TurboQuant achieves an 8x speedup over standard 32-bit on H100 GPUs. For vector search tasks, TurboQuant outperforms both Product Quantization (PQ) and RaBitQ baselines on recall and search quality metrics.

CPU Inference Is Now Production-Ready

This is the practical takeaway. TurboQuant compresses models so aggressively that CPU inference becomes viable for real production workloads, not just research demos. The llama.cpp community recognized this quickly and has already shipped working implementation branches.

Cloud servers - like those available at DCXV - are now more than capable of running AI inference without any GPU hardware at all. If you have been waiting for a reason to move AI workloads off expensive GPU instances and onto standard cloud VMs, TurboQuant is that reason. See https://dcxv.com/data-center#cloud for current cloud server options.

Looking Ahead

TurboQuant addresses the KV cache bottleneck that has constrained Gemini-scale models since their release. It also enables high-quality semantic vector search at Google's own operational scale. The benchmarks suggest the method is approaching near-theoretical lower bounds for this class of compression. As AI capabilities integrate deeper into software products, efficient quantization becomes foundational infrastructure - not a research curiosity. TurboQuant points toward a future where capable AI runs on commodity hardware, available to anyone with a standard server.
