Nvidia notes that, thanks to improvements in the Blackwell architecture, the cost of neural network inference has dropped by as much as tenfold, and it attributes this success not to hardware alone.
Reducing inference cost on the Nvidia Blackwell architecture
The new Nvidia Blackwell accelerators can cut the cost of running trained AI models by 4–10×, according to figures published by Nvidia itself. However, such a gain is unattainable without accompanying software and infrastructure improvements.
How significant cost reductions were achieved
| Factor | What helped |
| --- | --- |
| Architecture | Blackwell accelerators |
| Models | Open source (MoE, NVFP4, etc.) |
| Platforms | Baseten, DeepInfra, Fireworks AI, Together AI |
| Software stacks | Optimized pipelines for low precision |
* Porting to Blackwell doubles efficiency compared with the previous generation of accelerators.
* Using low-precision formats (e.g., NVFP4) cuts costs further; a rough cost model is sketched below.
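To see why precision matters so much, consider a back-of-the-envelope model. The minimal Python sketch below assumes decode is memory-bandwidth bound, so per-token cost scales with the bytes of weights streamed per token; the byte sizes are approximate and everything else (activations, KV cache, compute) is ignored, so the numbers are illustrative assumptions rather than Nvidia's methodology.

```python
# A minimal sketch, assuming decode cost scales with the weight bytes streamed
# per token. Byte sizes are approximate; all figures are illustrative.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "nvfp4": 0.5}

def relative_decode_cost(precision: str, baseline: str = "fp16") -> float:
    """Per-token decode cost relative to the baseline precision."""
    return BYTES_PER_PARAM[precision] / BYTES_PER_PARAM[baseline]

if __name__ == "__main__":
    for p in ("fp16", "fp8", "nvfp4"):
        print(f"{p}: ~{relative_decode_cost(p):.2f}x the fp16 cost per token")
    # nvfp4 -> ~0.25x: roughly a 4x cut from the format alone, before other gains
```

Under this crude model, moving from FP16 to NVFP4 alone accounts for roughly a 4× cut, which is why low-precision formats feature in every case below.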
Practical examples
| Company | Task | Result |
| --- | --- | --- |
| Sully.ai | Healthcare, open models on Baseten | 90% inference savings (a 10× reduction) and a 65% cut in response time; code and medical-record automation saved 30 million work minutes. |
| Latitude (AI Dungeon) | Games, MoE models on DeepInfra | Inference cost per 1 million tokens fell from $0.20 to $0.05: first on MoE (to $0.10), then on NVFP4. |
| Sentient Foundation | Agent chat, Fireworks AI | Economic efficiency rose by 25–50%; the platform handled 5.6 million requests per week without increased latency. |
| Decagon | Customer voice support, Together AI | Per-request cost dropped sixfold thanks to a multi-model stack on Blackwell; response time under 400 ms even with several thousand tokens. |
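The headline numbers above are internally consistent: a 90% saving means paying a tenth of the old price, and Latitude's two successive halvings compound to a 4× reduction overall. A quick arithmetic check in Python, using only the figures from the table:

```python
# Quick check of the case figures above (values taken from the table).

def reduction_factor(saving_fraction: float) -> float:
    """A saving of s means paying (1 - s) of the old price: 1 / (1 - s) times cheaper."""
    return 1.0 / (1.0 - saving_fraction)

print(reduction_factor(0.90))  # 10.0 -> Sully.ai's 90% saving is a 10x reduction

# Latitude: $0.20 -> $0.10 on MoE, then $0.10 -> $0.05 on NVFP4
steps = [0.20, 0.10, 0.05]
for before, after in zip(steps, steps[1:]):
    print(f"${before:.2f} -> ${after:.2f}: {before / after:.0f}x cheaper")
print(f"overall: {steps[0] / steps[-1]:.0f}x")  # 4x
```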
Why workload characteristics matter
* Reasoning models generate more tokens, requiring more powerful accelerators.
* Platforms use *disaggregated serving*: prefill (prompt processing) and token generation run on separate resources, so long sequences are handled efficiently (see the sketch after this list).
* With large generation volumes, efficiency gains of up to 10× are possible; with small volumes, only up to 4×.
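The sketch below is a toy illustration of the disaggregated pattern, not any platform's actual implementation: prefill and decode run as separate worker pools connected by a queue, so a long prompt being processed does not block token generation for other requests. The names, timings, and single-worker-per-pool setup are all simplifying assumptions.

```python
# A toy sketch of disaggregated serving: prefill (prompt processing) and decode
# (token generation) run in separate worker pools joined by a queue, so long
# prompts do not stall token streaming. Names and timings are illustrative.

import asyncio

async def prefill_worker(prefill_q: asyncio.Queue, decode_q: asyncio.Queue):
    while True:
        req = await prefill_q.get()
        await asyncio.sleep(0.001 * len(req["prompt"]))      # stand-in for prompt compute
        req["kv_cache"] = f"kv({len(req['prompt'])} chars)"  # handed off to decode
        await decode_q.put(req)
        prefill_q.task_done()

async def decode_worker(decode_q: asyncio.Queue):
    while True:
        req = await decode_q.get()
        for _ in range(req["max_new_tokens"]):
            await asyncio.sleep(0.001)                       # stand-in for one decode step
        print(f"request {req['id']}: generated {req['max_new_tokens']} tokens")
        decode_q.task_done()

async def main():
    prefill_q, decode_q = asyncio.Queue(), asyncio.Queue()
    # One worker per pool for brevity; real systems scale the two pools independently.
    workers = [asyncio.create_task(prefill_worker(prefill_q, decode_q)),
               asyncio.create_task(decode_worker(decode_q))]
    for i, prompt in enumerate(["short prompt", "a very long prompt " * 50]):
        await prefill_q.put({"id": i, "prompt": prompt, "max_new_tokens": 16})
    await prefill_q.join()
    await decode_q.join()
    for w in workers:
        w.cancel()

asyncio.run(main())
```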
Alternatives to Blackwell
Porting to AMD Instinct MI300, Google TPU, Groq or Cerebras also reduces costs. The key is to match hardware, software and models to the specific workload rather than defaulting to Blackwell; the toy selector below illustrates the idea.
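As a rough illustration only: the thresholds and the hardware mapping in this sketch are invented assumptions, not vendor guidance, but they show the kind of workload-aware decision the article argues for.

```python
# Illustrative workload-aware backend selection. The thresholds and hardware
# mapping below are invented assumptions for demonstration, not vendor guidance.

from dataclasses import dataclass

@dataclass
class Workload:
    avg_prompt_tokens: int
    avg_output_tokens: int
    latency_slo_ms: int

def pick_backend(w: Workload) -> str:
    if w.latency_slo_ms < 300:
        return "low-latency accelerator (Groq/Cerebras class)"
    if w.avg_output_tokens > 2000:  # generation-heavy, e.g. reasoning models
        return "Blackwell-class GPU with NVFP4 and disaggregated serving"
    return "previous-gen GPU, AMD Instinct MI300 or TPU, whichever is cheaper per token"

print(pick_backend(Workload(4000, 3000, 800)))  # generation-heavy
print(pick_backend(Workload(500, 100, 200)))    # tight latency SLO
```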
Conclusion
Inference cost reduction comes from a comprehensive approach: hardware power (Blackwell), open models, optimized software stacks and proper task distribution. This lets companies cut costs by up to tenfold in healthcare, gaming, agent AI and voice support without compromising quality or speed.