Google reduced AI model memory consumption sixfold while preserving accuracy thanks to the TurboQuant algorithm

Google Research introduced a new method for compressing the KV cache of large language models—TurboQuant. The algorithm reduces the cache precision to 3 bits (4 bits with error correction), without degrading answer accuracy or requiring additional training. On Nvidia H100 accelerators, TurboQuant increased attention logit computation throughput eightfold and reduced the KV cache size sixfold.

What is a KV cache and why it matters
* The KV cache stores keys (K) and values (V) generated during the attention mechanism calculation.

This allows the model to avoid recomputing them at each token generation step.

* As the context window grows, the cache grows linearly with the number of tokens, which for long contexts leads to very high memory usage.

* Traditional quantization methods shrink the cache but must store quantization constants (codebooks), much like the dictionaries in ZIP/RAR archives. These codebooks impose significant overhead.
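To make the memory pressure concrete, here is a back-of-envelope estimate of KV cache size. The model dimensions below are illustrative placeholders, not measured Gemma or Mistral configurations:

```python
# KV cache size = 2 tensors (K and V) x layers x KV heads x head dim
# x context length x bits per value. All figures here are hypothetical.
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bits_per_value):
    return 2 * layers * kv_heads * head_dim * context_len * bits_per_value // 8

fp16 = kv_cache_bytes(32, 8, 128, 32_768, 16)  # 16-bit baseline cache
q3 = kv_cache_bytes(32, 8, 128, 32_768, 3)     # 3-bit quantized cache
print(f"{fp16 / 2**30:.2f} GiB -> {q3 / 2**30:.2f} GiB")
```

Going from 16 bits to 3 bits per value cuts the cache by 16/3 ≈ 5.3×; the article's sixfold figure additionally reflects TurboQuant eliminating codebook overhead.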

How TurboQuant works
TurboQuant consists of two stages and eliminates dictionaries entirely.

Stage | What is done | Why it matters
---|---|---
1. PolarQuant | Vectors are converted from Cartesian to polar coordinates (radius + angle). Angular distributions are predictable and concentrated, so a costly per-block normalization step is unnecessary. | High-quality compression with no codebooks at all.
2. 1-bit error-correction layer | A quantized Johnson–Lindenstrauss (QJL) transform reduces the residual error to a single bit. | Removes systematic bias in attention computations at minimal extra cost.
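The polar-coordinate idea behind the first stage can be illustrated with a toy sketch: pair up coordinates, store each pair as a radius plus a coarsely quantized angle, and reconstruct. The pairing scheme and bit width here are hypothetical simplifications; the real TurboQuant kernel is more involved and includes the error-correction stage:

```python
import numpy as np

def polar_quantize(v, angle_bits=3):
    """Toy PolarQuant-style encoder: (radius, 3-bit angle) per coordinate pair.
    Radii are kept at full precision here for simplicity."""
    pairs = v.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])   # angle in (-pi, pi]
    levels = 2 ** angle_bits
    # Map angle onto a uniform grid of `levels` codes -- no codebook needed.
    code = np.round((theta + np.pi) / (2 * np.pi) * (levels - 1)).astype(int)
    return r, code

def polar_dequantize(r, code, angle_bits=3):
    levels = 2 ** angle_bits
    theta = code / (levels - 1) * 2 * np.pi - np.pi
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return pairs.reshape(-1)

rng = np.random.default_rng(0)
v = rng.standard_normal(8)
r, code = polar_quantize(v)
v_hat = polar_dequantize(r, code)
```

Because angles are quantized on a fixed uniform grid, nothing data-dependent has to be stored alongside the codes, which is the property that lets TurboQuant drop codebooks entirely.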

Practical results
Test | Algorithms | Results
---|---|---
LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, L‑Eval (Gemma & Mistral) | TurboQuant vs KIVI | TurboQuant achieves at least 6× KV cache compression with no accuracy loss on needle‑in‑a‑haystack tasks; on LongBench it matches or sometimes beats KIVI.
Vector search (GloVe) | TurboQuant vs Product Quantization, RaBitQ | Even without training, TurboQuant outperformed trained competitors in result quality and memory consumption.

Conclusions
* TurboQuant achieves strong KV cache compression to 3–4 bits without accuracy loss or extra training.
* Attention logit throughput on Nvidia H100 increased eightfold, and cache size shrank sixfold.
* The algorithm works for large language models and vector search tasks without fine‑tuning.

Thus, TurboQuant is ready for practical deployment even under heavy load and opens new possibilities for efficient operation with large models.
