
“Network congestion, link, and device failures are the most common sources of delay and jitter in transfers,” OpenAI wrote in a blog post announcing the project. “These problems get more frequent, and harder to solve, as the size of the cluster increases.”
It went on to note that a single failure could crash a training job, forcing a restart from a saved checkpoint, or stall progress for many seconds while the network recomputed routes. Such interruptions are costly in both GPU cycles and time.
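The restart-from-checkpoint pattern the post refers to works roughly like this: the training loop periodically saves its state, and after a crash the job resumes from the most recent save, discarding everything computed since then. A minimal sketch in Python, assuming a PyTorch-style model and optimizer; the checkpoint path and save interval are illustrative, not taken from OpenAI's post:

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"   # illustrative path, not from the post
SAVE_EVERY = 500              # illustrative save interval, in steps

def save_checkpoint(step, model, optimizer):
    # Write state to a temp file, then rename, so a crash mid-write
    # cannot corrupt the last good checkpoint.
    tmp = CKPT_PATH + ".tmp"
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, tmp)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint(model, optimizer):
    # Resume from the last saved step; any steps run since that save
    # are lost, which is the wasted GPU time the article describes.
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1

def train(model, optimizer, batches, total_steps):
    start = load_checkpoint(model, optimizer)
    for step in range(start, total_steps):
        loss = model(next(batches)).mean()   # placeholder loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % SAVE_EVERY == 0:
            save_checkpoint(step, model, optimizer)
```

The longer the gap between saves, the more work a single network or device failure throws away, which is why these interruptions become so expensive at large cluster scale.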
“The larger the job we run, the greater the impact of any single link flap or failure. These workloads act as a form of ‘failure amplifier,’ so preventing this has become critical,” the company said.
OpenAI led the development of the protocol and worked with AMD, Broadcom, Intel, Microsoft, and Nvidia, all of whom made significant technical contributions. The project is hosted and coordinated by the Open Compute Project (OCP) consortium.
