- Nvidia and xAI collaborate on the development of Colossus
- xAI has noticeably reduced “flow collisions” during AI model training
- Spectrum-X has been crucial in training Grok's family of AI models
Nvidia has shed light on how xAI's 'Colossus' supercomputer cluster can control 100,000 Hopper GPUs, and it all comes down to using the chipmaker's Spectrum-X Ethernet networking platform.
Spectrum-X, the company revealed, is designed to provide massive performance capabilities to hyperscale and multi-tenant AI factories using its remote direct memory access (RDMA) network.
The platform has been deployed on Colossus, the world's largest AI supercomputer, since its inception. Elon Musk-owned xAI has been using the cluster to train its Grok series of large language models (LLMs), which power the chatbots offered to X users.
The facility was built in collaboration with Nvidia in just 122 days and xAI is currently in the process of expanding it, with plans to deploy a total of 200,000 Nvidia Hopper GPUs.
Training Grok requires a lot of firepower
The Grok AI models are extremely large: Grok-1 has 314 billion parameters, and Grok-2 was reported to outperform Claude 3.5 Sonnet and GPT-4 Turbo at the time of its launch in August.
Naturally, training these models requires significant network performance. Using Nvidia's Spectrum-X platform, xAI saw no application latency degradation or packet loss as a result of “flow collisions” or bottlenecks within AI network paths.
xAI revealed that it has been able to maintain 95% of data throughput thanks to Spectrum-X's congestion control capabilities. The company added that this level of performance cannot be delivered at this scale over standard Ethernet.
According to Nvidia, traditional Ethernet at this scale typically creates thousands of flow collisions and delivers only 60% data throughput.
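To put the two throughput figures side by side, here is a minimal Python sketch of the arithmetic. The 95% and 60% figures come from the article; the 400 Gb/s link speed and the function name `effective_throughput` are hypothetical, chosen only for illustration and not taken from xAI or Nvidia.

```python
def effective_throughput(link_gbps: float, utilization: float) -> float:
    """Return usable bandwidth in Gb/s for a link at a given utilization (0 to 1)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return link_gbps * utilization


# Assumption: a hypothetical 400 Gb/s link, used only to make the numbers concrete.
LINK_GBPS = 400.0

# Figures quoted in the article.
SPECTRUM_X_UTILIZATION = 0.95
STANDARD_ETHERNET_UTILIZATION = 0.60

spectrum_x_gbps = effective_throughput(LINK_GBPS, SPECTRUM_X_UTILIZATION)
standard_gbps = effective_throughput(LINK_GBPS, STANDARD_ETHERNET_UTILIZATION)

# Relative gain is independent of the assumed link speed.
relative_gain = spectrum_x_gbps / standard_gbps

print(f"Spectrum-X: {spectrum_x_gbps:.0f} Gb/s usable")
print(f"Standard Ethernet: {standard_gbps:.0f} Gb/s usable")
print(f"Relative gain: {relative_gain:.2f}x")
```

Whatever link speed is assumed, the ratio of the two quoted figures works out to roughly 1.58, meaning about 58% more usable bandwidth on the same link.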
An xAI spokesperson said the combination of Hopper GPUs and Spectrum-X has allowed the company to “push the boundaries of AI model training” and created a “super-accelerated and optimized AI factory.”
“AI is becoming mission-critical and requires greater performance, security, scalability and cost-effectiveness,” said Gilad Shainer, senior vice president of networking at Nvidia.
“The Nvidia Spectrum-X Ethernet networking platform is designed to provide innovators like xAI with faster processing, analysis and execution of AI workloads while accelerating solution development, deployment and time to market.”
The Spectrum-X platform includes the Spectrum SN5600 Ethernet switch, which supports port speeds of up to 800 Gb/s and is based on the Spectrum-4 switch ASIC, according to Nvidia.
xAI chose to combine the Spectrum SN5600 switch with Nvidia BlueField-3 SuperNICs for higher performance.