
Marquee names like Anthropic and Uber are “putting AWS’s efficiency claims to the test,” he noted; on the other hand, customers like Cohere and Stability AI prefer Nvidia’s mature tooling and “superior chip designs,” citing AWS service and availability issues.
Moor’s Kimball pointed out that another factor to consider is AWS’s partnership with Cerebras. Trainium is optimized for prefill (the compute-bound pass over the input prompt), while the Cerebras CS-3 is optimized for decode (the memory-bandwidth-bound, token-by-token generation of output), allowing the pair to deliver what they claim is the best inference performance with no user intervention required. “This is the kind of ‘point-and-click’ simplicity enterprise users are looking for,” he said.
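To make that division of labor concrete, here is a minimal sketch of disaggregated prefill/decode serving. Every name in it is invented for illustration; it does not reflect any actual AWS, Trainium, or Cerebras API, only the general pattern of routing the two inference phases to different hardware pools behind a single endpoint.

```python
# Hypothetical sketch of disaggregated LLM inference: prefill on one
# accelerator pool, decode on another. All names are invented; this is
# not an AWS or Cerebras API.

from dataclasses import dataclass


@dataclass
class KVCache:
    """Attention key/value state produced by prefill, consumed by decode."""
    tokens: list[int]


class PrefillPool:
    """Stands in for compute-optimized hardware (e.g., Trainium)."""

    def prefill(self, prompt_tokens: list[int]) -> KVCache:
        # The whole prompt is processed in one compute-bound pass.
        return KVCache(tokens=list(prompt_tokens))


class DecodePool:
    """Stands in for bandwidth-optimized hardware (e.g., CS-3)."""

    def decode(self, cache: KVCache, max_new_tokens: int) -> list[int]:
        # Tokens are generated one at a time; each step re-reads the
        # cache, which is why decode is memory-bandwidth-bound.
        out = []
        for _ in range(max_new_tokens):
            next_token = cache.tokens[-1] + 1  # placeholder for a model step
            cache.tokens.append(next_token)
            out.append(next_token)
        return out


def serve(prompt_tokens: list[int]) -> list[int]:
    """Single endpoint: the prefill/decode handoff is invisible to users."""
    cache = PrefillPool().prefill(prompt_tokens)          # stage 1: prefill
    return DecodePool().decode(cache, max_new_tokens=4)   # stage 2: decode


print(serve([1, 2, 3]))  # -> [4, 5, 6, 7]
```

The “no user intervention” claim amounts to hiding that handoff behind one endpoint, as the serve() router does here.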
Ultimately, he said, Jassy is drawing a direct line from what Graviton did to x86 to what Trainium is doing to Nvidia: inference is the “fastest-growing and most cost-sensitive workload in enterprise AI, and that’s exactly where Trainium is gaining the most ground.”
Learning from the Mantle scale-up
Jassy also emphasized the importance of being able to go back to the starting line and “redirect the trajectory.” For instance, Amazon Bedrock was built rapidly and scaled “faster than expected,” and the team realized it needed an entirely different type of inference engine, not just a tweak to the existing one.
The Bedrock team quickly assembled a group of six “very skilled engineers” who used AWS’s agentic coding service, Kiro, to deliver a new engine, Mantle, in 76 days. Mantle has since become the backbone of Bedrock, which, Jassy claimed, processed more tokens in Q1 2026 than in all prior years combined.
The ability of a small team to accomplish such a large rebuild in so short a time frame, while also adding features such as stateful conversation management, asynchronous inference, and higher default quotas, is “impressive at first blush,” noted Info-Tech’s Bickley.
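For readers unfamiliar with the term, “asynchronous inference” generally means a submit-and-poll pattern: the client gets back a job ID immediately rather than holding a connection open for the whole generation. The sketch below illustrates only that generic pattern; the function names are hypothetical and do not describe Bedrock’s actual API.

```python
# Generic submit-and-poll pattern behind "asynchronous inference".
# All names are hypothetical; this is not the Bedrock API.

import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)
_jobs: dict[str, Future] = {}


def run_model(prompt: str) -> str:
    time.sleep(1)  # stand-in for a long-running generation
    return f"completion for: {prompt}"


def submit(prompt: str) -> str:
    """Enqueue the request and return a job ID without blocking."""
    job_id = str(uuid.uuid4())
    _jobs[job_id] = _executor.submit(run_model, prompt)
    return job_id


def poll(job_id: str) -> str | None:
    """Return the result if the job has finished, else None."""
    future = _jobs[job_id]
    return future.result() if future.done() else None


# Client side: fire off the request, then check back later.
job = submit("summarize this document")
while (result := poll(job)) is None:
    time.sleep(0.2)
print(result)
```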
