Nvidia has unveiled the Groq 3 LPU chip, an inference accelerator that sharply raises the speed at which AI models deliver tokens.

Nvidia Reveals New Capabilities for the Vera Rubin Platform

At this year’s GTC conference, Nvidia CEO Jensen Huang announced an expansion of the Vera Rubin platform. The new capabilities are built on intellectual property acquired from Groq, and the platform now includes the *Groq 3 LPU* chip, an inference accelerator designed to deliver tokens at high speed with low latency.

What Already Exists in Vera Rubin
The platform consists of six key components that Nvidia assembles into rack‑mounted systems and scales up to large AI factories:

| Component | Description |
| --- | --- |
| Rubin GPU | Graphics processor with 288 GB of HBM4 |
| Vera CPU | Central processor |
| NVLink 6 | Intra‑system (scale‑up) interconnect |
| ConnectX‑9 | Smart network adapter |
| BlueField‑4 | Data processing unit |
| Spectrum‑X | Inter‑system (scale‑out) switch with integrated optics |

The Groq 3 LPU joins this lineup as a new building block for large-scale deployments.

Why the Groq 3 LPU Stands Out
The main difference is memory architecture. While most accelerators use HBM as working memory, each Groq 3 LPU contains 500 MB of SRAM. Comparison:

| Parameter | Rubin GPU (HBM4) | Groq 3 LPU (SRAM) |
| --- | --- | --- |
| Capacity | 288 GB | 0.5 GB |
| Bandwidth | ~22 TB/s | up to 150 TB/s |

For bandwidth‑sensitive inference tasks, the advantage of SRAM is clear: during autoregressive decoding, speed is limited by how fast weights can be streamed from memory rather than by raw compute. That is why Nvidia included Groq 3 in Rubin: to increase token delivery speed.
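As a rough illustration of that point (not from the article), a simple roofline estimate shows how memory bandwidth caps decode speed. The model size and byte-per-weight figure below are illustrative assumptions; the bandwidth numbers come from the comparison table above.

```python
# Back-of-envelope roofline estimate: in bandwidth-bound decoding, every
# generated token must stream (roughly) all model weights from memory,
# so the token rate is capped by bandwidth / bytes-per-token.

def max_tokens_per_s(bandwidth_tb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed for a bandwidth-bound model."""
    bytes_per_token = model_size_gb * 1e9    # weights read once per token
    bandwidth_bytes = bandwidth_tb_s * 1e12  # TB/s -> bytes/s
    return bandwidth_bytes / bytes_per_token

MODEL_GB = 70  # assumption: a 70B-parameter model at 1 byte per weight

print(f"HBM4 (~22 TB/s):  {max_tokens_per_s(22, MODEL_GB):,.0f} tokens/s ceiling")
print(f"SRAM (150 TB/s): {max_tokens_per_s(150, MODEL_GB):,.0f} tokens/s ceiling")
# -> ~314 vs ~2,143 tokens/s: for single-stream decoding, the bandwidth
#    gap translates almost directly into the token-rate gap.
```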

Groq 3 LPX Rack
The rack contains 256 Groq 3 LPU chips, providing (see the sanity check after this list):

- 128 GB of SRAM
- 40 PB/s total bandwidth
- 640 TB/s intra‑system interface
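The first two rack-level figures follow directly from the per-chip specs quoted earlier; a quick arithmetic check (per-chip numbers from the article, the 640 TB/s interface figure is quoted as-is):

```python
# Sanity-check the rack totals against the per-chip specs in the article.
CHIPS_PER_RACK = 256
SRAM_GB_PER_CHIP = 0.5   # 500 MB of SRAM per Groq 3 LPU
BW_TB_S_PER_CHIP = 150   # up to 150 TB/s per chip

total_sram_gb = CHIPS_PER_RACK * SRAM_GB_PER_CHIP
total_bw_pb_s = CHIPS_PER_RACK * BW_TB_S_PER_CHIP / 1000  # TB/s -> PB/s

print(f"Total SRAM:      {total_sram_gb:.0f} GB")    # 128 GB, as stated
print(f"Total bandwidth: {total_bw_pb_s:.1f} PB/s")  # 38.4 PB/s, i.e. ~40 PB/s
```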

Ian Buck, Nvidia’s Vice President of Hyperscale and HPC, called this rack a coprocessor for Rubin, emphasizing its role in boosting decoding performance at every model layer and for every token.

Impact on Multi‑Agent Systems
Buck noted that the Groq 3 LPX will be a key element for the coming wave of the AI market: multi‑agent systems. When agents exchange data directly with one another rather than with a human through a chatbot, response requirements change from roughly 100 tokens/s to 1,500 tokens/s and beyond.
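A hypothetical illustration of why that threshold jumps (the pipeline depth and token budget below are assumptions, not from the article):

```python
# Hypothetical agent pipeline: each agent must consume the previous agent's
# full output before producing its own, so end-to-end latency scales with
# (pipeline depth * tokens per step) / (tokens per second). Machine-to-machine
# loops therefore need far higher token rates than a human chat session.
AGENTS = 5             # assumed pipeline depth
TOKENS_PER_STEP = 300  # assumed output length per agent

for rate in (100, 1500):  # tokens/s: human-chat pace vs agent-to-agent target
    latency_s = AGENTS * TOKENS_PER_STEP / rate
    print(f"{rate:>5} tokens/s -> {latency_s:5.1f} s end-to-end")
# ->   100 tokens/s -> 15.0 s (too slow for an interactive agent loop)
#     1500 tokens/s ->  1.0 s
```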

Competitors and Outlook
Nvidia is not alone here: competitor Cerebras uses a Wafer‑Scale Engine (WSE) with massive on‑chip SRAM for low‑latency inference, and OpenAI has already deployed Cerebras hardware for its cutting‑edge models thanks to that favorable latency.

Buck also noted that the introduction of the Groq 3 LPU could reduce dependence on the Rubin CPX accelerator. While Nvidia focuses on integrating the Groq 3 LPX rack into the platform, both chips are aimed at boosting inference: the CPX leans on large amounts of GDDR7 memory, while the LPU does without it, relying on fast on‑chip SRAM instead.

Conclusion

The new Groq 3 LPU chip and the LPX rack built around it strengthen Vera Rubin in the low‑latency inference segment, paving the way for faster multi‑agent AI systems and positioning Nvidia against players such as Cerebras.
