AI workloads are expected to put unprecedented performance and capacity demands on networks, and a handful of networking vendors have teamed up to enhance today’s Ethernet technology in order to handle the scale and speed required by AI.
AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta and Microsoft announced the Ultra Ethernet Consortium (UEC), a group hosted by the Linux Foundation that’s working to develop physical, link, transport and software layer Ethernet advances.
The industry celebrated Ethernet’s 50th anniversary this year. The hallmark of Ethernet has been its flexibility and adaptability, and the venerable technology will undoubtedly play a critical role when it comes to supporting AI infrastructures. But there are concerns that today’s traditional network interconnects cannot provide the required performance, scale and bandwidth to keep up with AI demands, and the consortium aims to address those concerns.
“AI workloads are demanding on networks as they are both data- and compute-intensive. The workloads are so large that the parameters are distributed across thousands of processors,” wrote Arista CEO Jayshree Ullal in a blog about the new consortium.
“Large Language Models (LLMs) such as GPT-3, Chinchilla, and PALM, as well as recommendation systems like DLRM [deep learning recommendation] and DHEN [Deep and Hierarchical Ensemble Network] are trained on clusters of many 1000s of GPUs sharing the ‘parameters’ with other processors involved in the computation.”
“In this compute-exchange-reduce cycle, the volume of data exchanged is so significant that any slowdown due to a poor/congested network can critically impact the AI application performance.”
Historically, the only options for connecting processor cores and memory have been interconnects such as InfiniBand, PCI Express and Remote Direct Memory Access over Ethernet, along with other protocols that connect compute clusters with offloads. All of them have limitations when it comes to AI workload requirements.
“Arista and Ultra Ethernet Consortium's founding members believe it is time to reconsider and replace RDMA limitations. Traditional RDMA, as defined by InfiniBand Trade Association (IBTA) decades ago, is showing its age in highly demanding AI/ML network traffic. RDMA transmits data in chunks of large flows, and these large flows can cause unbalanced and over-burdened links,” Ullal wrote.
“It is time to begin with a clean slate to build a modern transport protocol supporting RDMA for emerging applications,” Ullal wrote. “The [consortium’s] UET (Ultra Ethernet Transport) protocol will incorporate the advantages of Ethernet/IP while addressing AI network scale for applications, endpoints and processes, and maintaining the goal of open standards and multi-vendor interoperability.”
The UEC wrote in a white paper that it will advance an Ethernet specification featuring a number of core technologies and capabilities, including:
- Multi-pathing and packet spraying to ensure AI workflows can use all available paths to a destination simultaneously.
- Flexible delivery order to make sure Ethernet links are optimally balanced; ordering is only enforced when the AI workload requires it in bandwidth-intensive operations.
- Modern congestion-control mechanisms to ensure AI workloads avoid hotspots and evenly spread the load across multipaths. They can be designed to work in conjunction with multipath packet spraying, enabling a reliable transport of AI traffic.
- End-to-end telemetry to manage congestion. Information originating from the network can advise the participants of the location and cause of the congestion. Shortening the congestion signaling path and providing more information to the endpoints allows more responsive congestion control.
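To illustrate why packet spraying matters for the large flows described above, here is a minimal sketch in Python. It is not the UEC's mechanism, which the white paper does not detail here. It assumes a toy model in which a few large flows share four equal-cost links, the function names are hypothetical, a simple modulo stands in for a real flow hash, and plain round-robin stands in for a congestion-aware spraying policy.

```python
def flow_hash_loads(flows, n_links):
    """Pin every packet of a flow to one link chosen from the flow's ID."""
    loads = [0] * n_links
    for flow_id, n_packets in flows:
        # Assumption: modulo stands in for a hash over the flow's header fields.
        loads[flow_id % n_links] += n_packets
    return loads


def spray_loads(flows, n_links):
    """Spread the packets of every flow across all links, round-robin."""
    loads = [0] * n_links
    next_link = 0
    for _flow_id, n_packets in flows:
        for _ in range(n_packets):
            loads[next_link] += 1
            next_link = (next_link + 1) % n_links
    return loads


def imbalance(loads):
    """Ratio of the busiest link to the mean load; 1.0 is perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


if __name__ == "__main__":
    # Four large flows of 1,000 packets each; three of them collide on one link
    # under per-flow placement (hypothetical IDs chosen to show the collision).
    flows = [(0, 1000), (4, 1000), (8, 1000), (1, 1000)]
    for name, fn in (("per-flow", flow_hash_loads), ("sprayed", spray_loads)):
        loads = fn(flows, 4)
        print(name, loads, "imbalance:", imbalance(loads))
```

In this toy setup, per-flow placement leaves two links idle while one carries three flows, whereas spraying spreads packets evenly. The trade-off is that sprayed packets can arrive out of order, which is why the flexible delivery order item appears alongside it in the list.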
The UEC said it will increase the scale, stability, and reliability of Ethernet networks while also improving security.
“The UEC transport incorporates network security by design and can encrypt and authenticate all network traffic sent between computation endpoints in an AI training or inference job. The UEC will develop a transport protocol that leverages the proven core techniques for efficient session management, authentication, and confidentiality from modern encryption methods like IPSec and PSP,” the UEC wrote.
“As jobs grow, it is necessary to support encryption without ballooning the session state in hosts and network interfaces. In service of this, UET incorporates new key management mechanisms that allow efficient sharing of keys among tens of thousands of compute nodes participating in a job. It is designed to be efficiently implemented at the high speeds and scales required by AI training and inference,” the UEC stated.
“This isn’t about overhauling Ethernet,” said Dr. J Metz, chair of the Ultra Ethernet Consortium, in a statement. “It’s about tuning Ethernet to improve efficiency for workloads with specific performance requirements. We’re looking at every layer – from the physical all the way through the software layers – to find the best way to improve efficiency and performance at scale.”
The need for improved AI connectivity technology is beginning to emerge. For example, in its most recent “Data Center 5-Year July 2023 Forecast Report,” the Dell’Oro Group stated that 20% of Ethernet data center switch ports will be connected to accelerated servers to support AI workloads by 2027.
The rise of new generative AI applications will help fuel more growth in an already robust data center switch market, which is projected to exceed $100 billion in cumulative sales over the next five years, said Sameh Boujelbene, vice president at Dell’Oro.
In another recently released report, the 650 Group stated that AI/ML places tremendous bandwidth and performance requirements on the network, and that AI/ML is one of the major growth drivers for data center switching over the next five years.
“With bandwidth in AI growing, the portion of Ethernet switching attached to AI/ML and accelerated computing will migrate from a niche today to a significant portion of the market by 2027. We are about to see record shipments in 800Gbps based switches and optics as soon as products can reach scale in production to address AI/ML,” said Alan Weckel, founder and technology analyst at 650 Group.