Google reduced AI model memory consumption sixfold while preserving accuracy thanks to the TurboQuant algorithm

Google Research introduced a new method for compressing the KV cache of large language models—TurboQuant. The algorithm reduces the cache precision to 3 bits (4 bits with error correction), without degrading answer accuracy or requiring additional training. On Nvidia H100 accelerators, TurboQuant increased attention logit computation throughput eightfold and reduced the KV cache size sixfold.

What is a KV cache and why it matters
* The KV cache stores keys (K) and values (V) generated during the attention mechanism calculation.

This allows the model to avoid recomputing them at each token generation step.

* As the context window grows, the cache grows linearly with the number of tokens, which for long contexts leads to very high memory usage.

* Traditional quantization methods shrink the cache but must store quantization constants (codebooks), much like the dictionaries in ZIP/RAR archives. These codebooks impose significant overhead.
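To make the memory pressure concrete, here is a back-of-envelope estimate of KV cache size. The model dimensions below are illustrative placeholders, not measured Gemma or Mistral configurations:

```python
# KV cache size = 2 tensors (K and V) x layers x KV heads x head dim
# x context length x bits per value. All figures here are hypothetical.
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bits_per_value):
    return 2 * layers * kv_heads * head_dim * context_len * bits_per_value // 8

fp16 = kv_cache_bytes(32, 8, 128, 32_768, 16)  # 16-bit baseline cache
q3 = kv_cache_bytes(32, 8, 128, 32_768, 3)     # 3-bit quantized cache
print(f"{fp16 / 2**30:.2f} GiB -> {q3 / 2**30:.2f} GiB")
```

Going from 16 bits to 3 bits per value cuts the cache by 16/3 ≈ 5.3×; the article's sixfold figure additionally reflects TurboQuant eliminating codebook overhead.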

How TurboQuant works
TurboQuant consists of two stages and eliminates dictionaries entirely.

Stage | What is done | Why it matters
---|---|---
1. PolarQuant | Vectors are converted from Cartesian to polar coordinates (radius + angle). Angular distributions are predictable and concentrated, so a costly per-block normalization step is unnecessary. | High-quality compression with no codebooks at all.
2. 1-bit error-correction layer | A quantized Johnson–Lindenstrauss (QJL) transform reduces the residual error to a single bit. | Removes systematic bias in attention computations at minimal extra cost.
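The polar-coordinate idea behind the first stage can be illustrated with a toy sketch: pair up coordinates, store each pair as a radius plus a coarsely quantized angle, and reconstruct. The pairing scheme and bit width here are hypothetical simplifications; the real TurboQuant kernel is more involved and includes the error-correction stage:

```python
import numpy as np

def polar_quantize(v, angle_bits=3):
    """Toy PolarQuant-style encoder: (radius, 3-bit angle) per coordinate pair.
    Radii are kept at full precision here for simplicity."""
    pairs = v.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])   # angle in (-pi, pi]
    levels = 2 ** angle_bits
    # Map angle onto a uniform grid of `levels` codes -- no codebook needed.
    code = np.round((theta + np.pi) / (2 * np.pi) * (levels - 1)).astype(int)
    return r, code

def polar_dequantize(r, code, angle_bits=3):
    levels = 2 ** angle_bits
    theta = code / (levels - 1) * 2 * np.pi - np.pi
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return pairs.reshape(-1)

rng = np.random.default_rng(0)
v = rng.standard_normal(8)
r, code = polar_quantize(v)
v_hat = polar_dequantize(r, code)
```

Because angles are quantized on a fixed uniform grid, nothing data-dependent has to be stored alongside the codes, which is the property that lets TurboQuant drop codebooks entirely.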

Practical results
Test | Algorithms | Results
---|---|---
LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, L‑Eval (Gemma & Mistral) | TurboQuant vs KIVI | TurboQuant achieves at least 6× KV cache compression with no accuracy loss on needle‑in‑a‑haystack tasks; on LongBench it matches or sometimes beats KIVI.
Vector search (GloVe) | TurboQuant vs Product Quantization, RaBitQ | Even without training, TurboQuant outperformed trained competitors in result quality and memory consumption.

Conclusions
* TurboQuant achieves strong KV cache compression to 3–4 bits without accuracy loss or extra training.
* Attention logit throughput on Nvidia H100 increased eightfold, and cache size shrank sixfold.
* The algorithm works for large language models and vector search tasks without fine‑tuning.

Thus, TurboQuant is ready for practical deployment even under heavy load and opens new possibilities for efficient operation with large models.
