Because the GPUs operate in parallel, link errors have different implications than they would in a classical network.
“If we were on video and an error burst occurs, TCP/IP does a pretty good job of bridging that and retransmitting,” Gartner said. “But in AI infrastructure, because the GPUs are operating in parallel, it’s very sensitive to issues that might occur on one link. All those GPUs are exchanging information and synchronized, and so basically, you have to kind of stop the workload and back up to a checkpoint and restart the workload. And that can result in a 40% reduction in the performance of the cluster when you have these link errors occurring.”
“It really suggests to our customers that they need to be focused much more on reliability of the optic,” Gartner said.
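The cost of backing up to a checkpoint can be sketched with some simple arithmetic. The following is a rough model under stated assumptions, not Cisco's methodology: the function name, the parameters, and the example numbers are all hypothetical, and the model assumes that each link error forces one rollback, that the error strikes on average halfway through a checkpoint interval, and that writing checkpoints costs nothing.

```python
def useful_fraction(hours_between_errors, checkpoint_interval_hours, restart_hours):
    """Approximate share of wall-clock time spent on work that is kept.

    Assumption: every link error discards, on average, half a checkpoint
    interval of work and adds a fixed restart delay.
    """
    lost_per_error = checkpoint_interval_hours / 2 + restart_hours
    lost_fraction = min(1.0, lost_per_error / hours_between_errors)
    return 1.0 - lost_fraction


# Hypothetical inputs: an error every 10 hours, checkpoints every 2 hours,
# and 1 hour to restart the workload.
fraction = useful_fraction(10, 2, 1)
print(f"{fraction:.0%} of cluster time is useful work")
```

In this simplified model, more frequent link errors or longer checkpoint intervals both shrink the share of time the cluster spends on work it keeps.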
Reliability testing reveals weaknesses
Cisco previously ran a reliability test on 20 optics acquired from different suppliers, Gartner recalled. “These were 100G and 400G optics at the time,” and all complied with industry standards, yet “none of those optics passed our stress test,” he said.
Cisco’s test environments vary conditions such as temperature, humidity, the voltage the optic sees from the host, and the skew between the signals coming from the host. “We do all of those things in various combinations,” Gartner said.
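Testing conditions "in various combinations" can be pictured as a test matrix. The sketch below is only an illustration of that idea: the condition names, the test points, and the `stress_matrix` helper are hypothetical and do not represent Cisco's actual test plan.

```python
from itertools import product

# Hypothetical test points for each condition; illustrative values only.
CONDITIONS = {
    "temperature_c": [0, 25, 70],
    "humidity_pct": [20, 85],
    "host_voltage_v": [3.135, 3.3, 3.465],
    "host_skew_ps": [0, 50],
}


def stress_matrix(conditions):
    """Return one dict per combination of the listed test points."""
    names = list(conditions)
    return [dict(zip(names, combo)) for combo in product(*conditions.values())]


matrix = stress_matrix(CONDITIONS)
print(len(matrix))  # 3 * 2 * 3 * 2 = 36 combinations
```

Even a handful of test points per condition multiplies quickly, which is one reason an optic can pass each check in isolation and still fail under combined stress.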
While optics might technically comply with industry standards, “what we know is that if they were put into a stressful environment … they wouldn’t perform,” he said, “and so that’s the thing that we’re trying to raise awareness of for our customers.”