
It’s a big cost play, he pointed out, and it “has to happen everywhere, all the time, for all users.”
The next phase of inferencing
The new Groq 3 language processing units (LPUs) are based on intellectual property (IP) from Groq, which signed a $20 billion licensing agreement with Nvidia late last year. According to the chip company, a fleet of LPUs can function as a “giant single processor.”
While Rubin GPUs will continue to handle prefill (prompt processing), Groq’s LPX will now handle the latency-sensitive portions of decode (response generation). Together, the two can deliver a “new class of inference performance,” Nvidia says.
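Conceptually, that split works like a producer/consumer pipeline: a throughput-oriented tier processes the whole prompt in one pass, then hands the request off to a latency-oriented tier that emits one token at a time. The sketch below is a toy simulation of that hand-off in Python; the worker names, timings, and queue mechanism are illustrative assumptions, not Nvidia’s or Groq’s actual software stack.

```python
# Toy simulation of disaggregated inference: prefill on one device tier,
# decode on another. All timings and names are illustrative assumptions.
from dataclasses import dataclass
import queue
import threading
import time


@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int


def prefill_worker(in_q: queue.Queue, out_q: queue.Queue) -> None:
    """Simulate a throughput-oriented GPU tier processing whole prompts in one pass."""
    while True:
        req = in_q.get()
        if req is None:          # shutdown signal
            out_q.put(None)
            break
        # Prefill cost grows with prompt length (compute-bound, batch-friendly).
        time.sleep(req.prompt_tokens * 1e-5)
        out_q.put(req)           # hand the request off to the decode tier


def decode_worker(in_q: queue.Queue) -> None:
    """Simulate a latency-oriented accelerator emitting tokens one at a time."""
    while True:
        req = in_q.get()
        if req is None:
            break
        start = time.perf_counter()
        for _ in range(req.max_new_tokens):
            # Decode is dominated by memory movement per step, so per-token
            # latency (not batch throughput) is what the user feels.
            time.sleep(1e-4)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"decoded {req.max_new_tokens} tokens in {elapsed_ms:.1f} ms")


if __name__ == "__main__":
    gpu_q, decode_q = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=prefill_worker, args=(gpu_q, decode_q)),
        threading.Thread(target=decode_worker, args=(decode_q,)),
    ]
    for w in workers:
        w.start()
    for prompt_len in (512, 2048, 8192):
        gpu_q.put(Request(prompt_tokens=prompt_len, max_new_tokens=64))
    gpu_q.put(None)
    for w in workers:
        w.join()
```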
Each LPX rack features 256 LPUs with 128 GB of on-chip static random-access memory (SRAM), 150 terabytes per second (TB/s) of bandwidth, chip-to-chip links, and high-speed connections to NVL72, Nvidia’s liquid-cooled AI supercomputer. Combined, these can reduce latency to “near zero,” Nvidia claims.
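As a rough back-of-envelope check (assuming the 128 GB figure is the aggregate SRAM per rack, which the announcement does not spell out), sweeping that entire memory once at the stated bandwidth takes well under a millisecond:

```python
# Back-of-envelope: time to stream the stated SRAM capacity once at the stated
# bandwidth. Figures are from the announcement; treating 128 GB as the per-rack
# aggregate is an assumption.
sram_bytes = 128e9                  # 128 GB of on-chip SRAM
bandwidth_bytes_per_s = 150e12      # 150 TB/s
sweep_time_ms = sram_bytes / bandwidth_bytes_per_s * 1e3
print(f"One full sweep of SRAM: {sweep_time_ms:.2f} ms")  # ~0.85 ms
```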
The LPX integration with Vera Rubin AI factories will be available in the second half of this year.
Training versus inferencing
Training and inference stress infrastructure in very different ways, noted Sanchit Vir Gogia, chief analyst at Greyhound Research. While training rewards “massive parallelism and brute-force scale,” inferencing (especially for long context and interactive reasoning) is far more sensitive to latency, memory movement, cache behavior, concurrency, and cost per delivered token.
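To make that distinction concrete, here is an illustrative calculation of cost per delivered token and per-user token latency on a decode path. All figures are hypothetical, chosen only to show the arithmetic, and are not Greyhound’s, Nvidia’s, or Groq’s numbers.

```python
# Illustrative comparison of two decode paths. Every figure below is a
# hypothetical assumption made for the sake of the arithmetic.
def cost_per_million_tokens(tokens_per_sec: float, dollars_per_hour: float) -> float:
    """Cost of generating one million output tokens on hardware billed hourly."""
    seconds_needed = 1e6 / tokens_per_sec
    return dollars_per_hour * seconds_needed / 3600


scenarios = {
    # name: (sustained decode tokens/s, $/hour for the hardware) -- assumptions
    "batch-optimized decode": (20_000, 90.0),
    "latency-optimized decode": (60_000, 90.0),
}

CONCURRENT_STREAMS = 1_000  # assumed number of simultaneous users

for name, (tps, price) in scenarios.items():
    per_user_token_latency_ms = 1e6 / tps  # ms between tokens for each user
    print(f"{name}: {cost_per_million_tokens(tps, price):.2f} $/M tokens, "
          f"~{per_user_token_latency_ms:.0f} ms per token per user "
          f"at {CONCURRENT_STREAMS} concurrent streams")
```

The point of the toy numbers: with identical hourly pricing, tripling sustained decode throughput cuts both cost per delivered token and the per-token latency each user experiences, which is why inference economics hinge on the decode path rather than raw training-style FLOPS.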
