SI
SI
discoversearch

We've detected that you're using an ad content blocking browser plug-in or feature. Ads provide a critical source of revenue to the continued operation of Silicon Investor.  We ask that you disable ad blocking while on Silicon Investor in the best interests of our community.  If you are not using an ad blocker but are still receiving this message, make sure your browser's tracking protection is set to the 'standard' level.
Technology Stocks : The end of Moore's law - Poet Technologies
POET 1.690-4.5%Jun 14 4:00 PM EDT

 Public ReplyPrvt ReplyMark as Last ReadFilePrevious 10Next 10PreviousNext  
From: toccodolce5/14/2024 7:23:28 AM
  Read Replies (1) of 867
 
GREASING THE SKIDS TO MOVE AI FROM INFINIBAND TO ETHERNET

Just about everybody, including Nvidia, thinks that in the long run, most people running most AI training and inference workloads at any appreciable scale – hundreds to millions of datacenter devices – will want a cheaper alternative for networking AI accelerators than InfiniBand.

While Nvidia has argued that InfiniBand only represents 20 percent of the cluster cost and it boosts performance of AI training by 20 percent – and therefore pays for itself – you still have to come up with that 20 percent of the cluster cost, which is considerably higher than the 10 percent or lower that is the normal for clusters based on Ethernet. The latter of which has feeds and speeds that, on paper and often in practice, make it a slightly inferior technical choice.

But, thanks in large part to the Ultra Ethernet Consortium, the several issues with Ethernet running AI workloads are going to be fixed, and we think will also help foment greater adoption of Ethernet for traditional HPC workloads. Well above and far beyond the adoption of the Cray-designed “Rosetta” Ethernet switch and “Cassini” network interface cards that comprised the Slingshot interconnect from Hewlett Packard Enterprise and not including middle of the bi-annual Top500 rankings of “supercomuters” that do not really do either HPC or AI as their day jobs and are a publicity stunt by vendors and nations.

The discussion of how Ethernet is evolving was the most important thing discussed during the most recent call with Wall Street by Arista Networks, which was going over its financial results for the first quarter of 2024 ended in March.



As we previously reported, Meta Platforms is in the process of building two clusters with 24,576 GPUs each, one based on Nvidia’s 400 Gb/sec Quantum 2 InfiniBand (we presume) and one built with Arista Network’s flagship 400 Gb/sec 7800R3 AI Spine (we know), which is a multi-ASIC modular switch with 460 Tb/sec of aggregate bandwidth that supports packet spraying (a key technology to make Ethernet better at the collective network operations that are central to both AI and HPC). The 7830R3 spine switch is based on Broadcom’s Jericho 2c+ ASIC, not the more AI-tuned Jericho 3AI chip that Broadcom is aiming more directly at Nvidia’s InfiniBand and that is still not shipping in volume in products as far as we know.

The interconnect being built by Arista Networks for the Ethernet cluster at Meta Platforms also includes Wedge 400C and Minipack2 network enclosures that adhere to the Open Compute Projects favored by Meta Platforms. (The original Wedge 400 was based on Broadcom’s 3.2 Tb/sec “Tomahawk 3” StrataXGS ASIC and the Wedge 400C used as the top of racker in the AI cluster is based on a 12.8 Tb/sec Silicon One ASIC from Cisco Systems. The Minipack2 is based on Broadcom’s 25.6 Tb/sec “Tomahawk 4” ASIC. It looks like Wedge 400C and Minipack2 are being used to cluster server hosts and the 7800R AI Spine is being used to cluster the GPUs, but Meta Platforms is not yet divulging the details. ( We detailed the Meta Platforms fabric and switches that create it back in November 2021.)

Meta Platforms is the flagship customer for Ethernet in AI, and Microsoft will be as well. But others are also leading the charge. Arista Networks revealed in February that it has design wins for fairly large AI clusters. Jayshree Ullal, co-founder and chief executive officer at the company, provided some insight into how these wins are progressing towards money and how it sets Arista Networks up to reach its stated goal of $750 million in AI networking revenue by 2025.

“This cluster,” Ullal said on the call referring to the Meta Platforms cluster, “tackles complex AI training tasks that involve a mix of model and data parallelization across thousands of processors, and Ethernet is proving to offer at least 10 percent improvement of job completion performance across all packet sizes versus InfiniBand. We are witnessing an inflection of AI networking and expect this to continue throughout the year and decade. Ethernet is emerging as a critical infrastructure across both front-end and back-end AI data centers. AI applications simply cannot work in isolation and demand seamless communication among the compute nodes consisting of back-end GPUs and AI accelerators, as well as the front-end nodes like the CPUs alongside storage.”

That 10 percent improvement in completion time is something that is being done with the current Jericho 2c+ ASIC as the spine in the network, not the Jericho 3AI.

Later in the call, Ullal went into a little more detail about the landscape between InfiniBand and Ethernet, which is useful perspective.

“Historically, as you know, when you look at InfiniBand and Ethernet in isolation, there are a lot of advantages of each technology,” she continued. “Traditionally, InfiniBand has been considered lossless. And Ethernet is considered to have some loss properties. However, when you actually put a full GPU cluster together along with the optics and everything, and you look at the coherence of the job completion time across all packet sizes, data has shown – and this is data that we have gotten from third parties, including Broadcom – that just about in every packet size in a real-world environment, comparing those technologies, the job completion time of Ethernet was approximately 10 percent faster. So, you can look at this thing in a silo, and you can look at it in a practical cluster. And in a practical cluster, we are already seeing improvements on Ethernet. Now, don’t forget, this is just Ethernet as we know it today. Once we have the Ultra Ethernet Consortium and some of the improvements you are going to see on packet spraying and dynamic load balancing and congestion control, I believe those numbers will get even better.”

And then Ullal talked about the four AI cluster deals that Arista Networks won versus InfiniBand out of five major deals it was participating in. (Presumably InfiniBand won the other deal.)

“In all four cases, we are now migrating from trials to pilots, connecting thousands of GPUs this year, and we expect production in the range of 10K to 100K GPUs in 2025,” Ullal continued. “Ethernet at scale is becoming the de facto network and premier choice for scale-out AI training workloads. A good AI network needs a good data strategy delivered by a highly differentiated EOS and network data lake architecture. We are therefore becoming increasingly constructive about achieving our AI target of $750 million in 2025.”

If Ethernet costs one half to one third as much, end to end including optics and cables and switches and network interfaces – and can do the work faster and, in the long run, with more resilience and at a larger scale for a given number of network layers, InfiniBand come under pressure. It already has, if the ratio of four wins out of five on fairly large GPU clusters, as Arista Networks has, is representative. Clearly the intent of citing these numbers is to convince us that it is representative, but the market will ultimately decide.

We said this back in February and we will say it again: We think Arista Networks is low-balling its expectations, and Wall Street seems to agree. The company did raise its guidance for revenue growth in 2024 by two points, to between 12 percent and 14 percent, and we think optimism about the uptake of Ethernet for AI clusters – and eventually maybe HPC clusters – is playing a part here.

But here is the fun bit of math: For every $750 million that Arista Networks makes in AI cluster interconnect sales, Nvidia might be losing $1.5 billion to $2.25 billion. In the trailing twelve months, we estimate that Nvidia had $6.47 billion in InfiniBand networking sales against $39.78 billion in GPU compute sales in the datacenter. At a four to one take out ratio and a steady state market, Nvidia gets to keep about $1.3 billion and the UEC collective gets to keep $1.7 to $2.6 billion, depending on how the Ethernet costs shake out. Multiply by around 1.8X to get the $86 billion or so we expect Nvidia to book in datacenter revenues in 2008 and you see the target for InfiniBand sales is more like $12 billion if everything stays the same.

There is plenty of market share for UEC members to steal, but they will steal it by removing revenue from the system, like Linux did to Unix, not converting that revenue from one type of technology to another. That savings will be plowed back into GPUs.



In the meantime, Arista turned in a pretty decent quarter with no real surprises. Product sales were up 13.4 percent to $1.33 billion, and services revenues rose by 35.3 percent to $242.5 million. Software subscriptions, which are within products, were $23 million, so total annuity-like services accounted for $265.6 million, up 45.6 percent year on year. Total revenues were up 16.3 percent to $1.57 billion. Net income rose by 46.1 percent to $638 million, and Arista Networks existed the quarter with $5.45 billion in cash and we estimate somewhere around 10,000 customers. We think Arista had about $1.48 billion in datacenter revenues, and an operating income of around $623 million for this business. Which is what we care about. Campus and edge are interesting, of course, and we hope they will grow and be profitable, too, for Arista Networks and others.
Report TOU ViolationShare This Post
 Public ReplyPrvt ReplyMark as Last ReadFilePrevious 10Next 10PreviousNext