I’m sure it’s no surprise to you that AI has been steadily changing the world, but did you know that optics is a key part of its hardware infrastructure? To explain it, fortunately we have a seasoned product manager who knows both the switching side and the optics side. Lucky for us, he sits next to me at the office and agreed to chat about it.
In Episode 60, we continue our conversation with Paymon Mogharabi, Senior Product Manager at Cisco’s Optics team, also known as the Transceiver Modules Group. We get into data center hardware architectures and traffic patterns for AI applications.
Paymon Mogharabi is a networking industry and Cisco veteran of nearly three decades with Electrical Engineering degrees from UC Irvine and USC. After starting at Cisco as a Technical Assistance Center engineer, he became a Technical Marketing Engineer for Cisco's Catalyst switches. He then took product management positions for Cisco's Edge Services Router, Nexus data center switches, and UCS server products. He is now a Senior Product Manager in Cisco's Transceiver Modules Group and has sat next to me for the past 7 years, focusing on data center applications.
Related links
Cisco Optics-to-Device Compatibility Matrix: https://tmgmatrix.cisco.com/
Cisco Optics-to-Optics Interoperability Matrix: https://tmgmatrix.cisco.com/iop
Cisco Optics Product Information: https://copi.cisco.com/
Additional resources
Cisco Optics Podcast: https://optics.podcastpage.io/
Blog: https://blogs.cisco.com/tag/ciscoopticsblog
Cisco Optics YouTube playlist: http://cs.co/9008BlQen
Cisco Optics landing page: cisco.com/go/optics
Music credits
Sunny Morning by FSM Team | https://www.free-stock-music.com/artist.fsm-team.html
Upbeat by Mixaund | https://mixaund.bandcamp.com
[00:00:08] Hello everyone and welcome back to the Cisco Optics Podcast where we talk about pluggable optics for networks. I'm sure it's no surprise to you that AI has been steadily changing the world, but did you know that optics is a key part of its hardware infrastructure? To explain it, fortunately we have a seasoned product manager who knows both the switching side and the optics side. And lucky for us, he sits right next to me at the office and agreed to chat about it.
[00:00:34] In Episode 60, we continue our conversation with Paymon Mogharabi, Senior Product Manager at Cisco's Optics team, also known as the Transceiver Modules Group. We get into data center hardware architectures and traffic patterns for AI applications. Paymon Mogharabi is a networking industry and Cisco veteran of nearly three decades with electrical engineering degrees from UC Irvine and USC. After starting at Cisco as a Technical Assistance Center engineer, he became a technical marketing engineer for Cisco's Catalyst switches.
[00:01:03] He then took product management positions for Cisco's Edge Services Router, Nexus data center switches, and UCS server products. He is now a Senior Product Manager in Cisco's Transceiver Modules Group and has sat next to me for the past seven years, focusing on data center applications. And now join me as I talk with Paymon Mogharabi.
[00:01:47] This is a hot industry. Yep. Okay, so, great segue. The reason I asked to speak to you was because AI is such a buzzword. And what I think a lot of people don't know is that optics plays a really big role. But before we get there, can you just give us a broad overview of AI from the data center hardware perspective? Yeah.
[00:02:17] AI from the data center hardware perspective. What's your take on that? Sure. So an AI-enhanced or AI-specialized data center is very different from your traditional compute environment. I say that because in a traditional compute environment, you have, of course, your compute nodes, the CPUs, and your NICs connecting upstream to your switches.
[00:02:45] And those are either 1 gig, 10 gig, maybe 25 gig, in some cases transitioning to 100 gig. But the design for all the components going into your rack is pretty straightforward, because the power usage of these components is not astronomical. Whereas when you get into an AI-optimized data center, the compute nodes completely change.
[00:03:15] Your traffic pattern is now not just north-south, but also east-west. And within that east-west traffic, it's all GPUs, all servers. But even traditional data centers have a ton of east-west traffic, right? Like with virtualization of a bunch of stuff? There may be. There may be.
[00:03:41] But in some ways you can say it's a combination of north-south and east-west; you don't have specialized infrastructure that just handles east-west traffic. Ah, okay. So you're building out a compute infrastructure. You have these servers, and you plug in a number of GPUs. And these GPUs are extremely power-hungry, drawing two to three times as much as a CPU.
[00:04:09] So maybe an AI-enhanced server might consume up to 10,000 watts. Whereas in the past... 10,000 watts? 10,000 watts, right. So if you take a simple AI server and plug in eight GPUs, each GPU is a little over 1,000 watts, so you're looking at about 9,000. And then you add all these other components.
[00:04:35] So you go up to over 10,000 watts of power within a single server. Whereas in a traditional compute environment, the entire rack itself may be anywhere from 7,000 to 10,000 watts. So from a power perspective, things have changed a lot.
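To put rough numbers on that power comparison, here's a back-of-the-envelope sketch in Python. It just replays the figures from the conversation; the per-GPU draw and the allowance for other components are illustrative assumptions, not specs for any particular server.

```python
# Back-of-the-envelope power math using the rough figures above.
# All values are illustrative assumptions, not specs for a real server.

GPU_POWER_W = 1_050         # "a little over 1,000" watts per GPU
GPUS_PER_SERVER = 8         # a typical AI server configuration
OTHER_COMPONENTS_W = 1_600  # assumed: CPUs, NICs, fans, drives, etc.

ai_server_w = GPUS_PER_SERVER * GPU_POWER_W + OTHER_COMPONENTS_W
print(f"AI server draw: ~{ai_server_w:,} W ({ai_server_w / 1000:.1f} kW)")
# -> ~10,000 W for one server, versus roughly 7,000 to 10,000 W
#    for an entire rack in a traditional compute environment.
```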
[00:05:04] From a bandwidth perspective, historically you would build out your compute nodes with lower speeds at the compute layer. And then as you go up the layers, from the server to the leaf to the spine, you would add bandwidth, because the spine is aggregating all the traffic below it. With AI, you're kind of reversing this: the bulk of your traffic may actually be in the compute. It's the compute that is doing all the number crunching with the GPUs, all the parallel-processing traffic, the workloads.
[00:05:33] And all these GPUs are talking to each other, and that's the east-west traffic you're referring to. Correct. Correct. So you have to have networking and optics that do not become a bottleneck. And if you look at the decision factors around what kind of optics you need, you have to look at a couple of things.
[00:06:03] You have to look at the bandwidth. Bandwidth is important, very important, because you don't want the network to be the bottleneck. You already have a tremendous amount of traffic among all the GPUs within a node. If that has to traverse the rack, the network becomes the component that has to move the traffic from one cluster to another,
[00:06:32] and you don't want the network to be the bottleneck. So the GPU tends to be the bottleneck, so you use the latest GPU, probably? No, the GPU is generally not the bottleneck. It's the network that could potentially become the bottleneck. Okay. So the GPUs are talking to each other. What is the bottleneck at the end of the day? The bottleneck is, let's say, from a compute node
[00:06:57] or from a rack that has, let's say, maybe 16 GPUs or more. You have to traverse the rack and go through a top-of-rack switch to another rack. Okay. So at that point, you have to traverse a switch, and that switch, the networking, the optics, cannot be the bottleneck. You have to be able to traverse the network at high speeds. Okay.
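One way to make the "network can't be the bottleneck" point concrete is to compute a leaf switch's oversubscription ratio, the traffic it can accept from servers versus what it can forward upstream. This is a generic sketch, not something from the episode; the port counts and speeds are assumed, and the 1:1 non-blocking target for AI fabrics is a common industry design goal rather than a quote from Paymon.

```python
def oversubscription(down_ports: int, down_gbps: int,
                     up_ports: int, up_gbps: int) -> float:
    """Ratio of server-facing bandwidth to upstream bandwidth on a leaf
    switch; 1.0 means non-blocking."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# Assumed traditional leaf: 48 x 25G server ports, 6 x 100G uplinks.
print(oversubscription(48, 25, 6, 100))    # 2.0 -> 2:1, often acceptable

# Assumed AI leaf: 32 x 400G GPU-facing ports, 32 x 400G uplinks.
print(oversubscription(32, 400, 32, 400))  # 1.0 -> non-blocking fabric
```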
[00:07:27] Another issue is losslessness. Having a lossy network is catastrophic for AI, because your job completion time gets higher and higher as you have to redo calculations. So lossless is critical. In AI, whatever standards they use, they enforce: hey, you've got to finish the job. And if a job doesn't finish, you have to redo those calculations, which of course ends up costing money.
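To build some intuition for why loss inflates job completion time, here's a toy model, my illustration rather than anything from the episode: assume a compute step exchanges many packets and any single drop forces the whole step to be redone. The packet count and loss rates below are made up for illustration.

```python
# Toy model: a compute step exchanges n packets; any loss forces a redo.
# Expected attempts follow a geometric distribution: 1 / P(clean step).

def expected_attempts(packets_per_step: int, loss_rate: float) -> float:
    p_clean = (1.0 - loss_rate) ** packets_per_step  # every packet arrives
    return 1.0 / p_clean

for loss in (0.0, 1e-6, 1e-5, 1e-4):
    print(f"loss={loss:.0e}: ~{expected_attempts(100_000, loss):.2f}x step time")
# Even a 1-in-100,000 loss rate nearly triples the time per step here,
# which is why a lossless fabric is critical for AI job completion time.
```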
[00:07:56] And then the third one, which is probably one of the most important, is power. Okay. So with power, you know, there are trade-offs. You can potentially build a cluster in a single rack. So your GPU placement,
[00:08:24] maybe you're able to do it all within a single rack. In that case, your better option may be to just go with cable, with DAC, with copper. Because of the reach at higher speeds, when you get up to 400 gig and beyond, you're down to about two to two and a half meters. So if you can populate that entire rack with GPUs and keep it all there, then perhaps DAC is your solution. But in most cases,
[00:08:54] as you start to grow your GPU clusters, you cannot have all your GPUs within one rack. Whether it's for real estate reasons or because of power constraints, you need to start populating other racks. Okay. So within the rack, you can still use copper DAC. Correct. By the way, DAC is D-A-C, direct-attach. Direct-attach copper, yeah. Yeah. And so that's still a thing, as long as it's within the rack, and that's the lowest power,
[00:09:23] also lowest cost, right? Lowest power and lowest cost. But you're saying you're not going to get all the GPUs into one rack, so at some point you're going to have that east-west traffic traverse to a different rack. Correct. Which means? In that case, the two-meter copper... I mean, there are maybe ways of placing the switch in the middle of the rack. But in general, once you go beyond two racks,
[00:09:53] copper is no longer an option. So now you have to look at pluggables. Or... Pluggable optics or some optical-based... Optical solution. Okay. So now it could be either multimode or single-mode. Depending on how far the reach is, if it's 100 meters or below, you probably want to go with multimode; it's cheaper in a sense. But at the same time, we know customers
[00:10:22] who would like to standardize on one optic; they don't want to deal with multiple optics with different reaches. So they'll go with the optic that can do the longest reach, even up to two kilometers. So the design changes based on how you're actually populating your GPUs within your data center, across your different racks.
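The reach-driven choice Paymon just walked through can be boiled down to a simple rule of thumb. Here's a minimal sketch; the thresholds come from the rough figures in the conversation and are assumptions, since real breakpoints depend on speed, optic type, and fiber plant. As he notes, some operators skip the shorter tiers entirely and standardize on the longest-reach optic to simplify operations.

```python
def pick_interconnect(reach_m: float) -> str:
    """Rule-of-thumb media choice at 400G, per the rough reaches above.
    Thresholds are illustrative, not a design guide."""
    if reach_m <= 2.5:       # within the rack: passive copper still works
        return "DAC (direct-attach copper): lowest power, lowest cost"
    if reach_m <= 100:       # nearby racks: multimode is cheaper
        return "multimode pluggable optic"
    if reach_m <= 2_000:     # across the data center: single-mode
        return "single-mode pluggable optic (up to ~2 km)"
    return "longer-reach single-mode (beyond this discussion)"

for reach_m in (1, 30, 500):
    print(f"{reach_m} m -> {pick_interconnect(reach_m)}")
```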
[00:10:51] And by the way, some people choose just one optic because they go through so many of these, right? Like, what kind of scale are we talking? How many optics? Correct. So that's a good question. At a high level, when you consider a GPU cluster, the estimates are that
[00:11:17] you use maybe up to two times more optics than GPUs. So if you have a GPU cluster of, let's say, 16, you probably use at least around 32 optics. Of course, that number will increase as you get into bigger and bigger cluster sizes. Does that mean for each GPU box, you're going to have a NIC with two ports?
[00:11:47] So in a high-performance AI server, the NIC-to-GPU ratio is one-to-one. Every GPU has a dedicated NIC. Okay. And that NIC communicates up to the ToR, the top-of-rack switch. So if you have an eight-GPU system, you have eight NICs. Those eight NICs could be, let's say, a single-port 400 gig, or they could be a two-by-200.
[00:12:15] But in any case, you're probably transmitting out of both ports. So is that where the two optics per GPU comes from? Yes. And we've seen estimates of even more, anywhere from 1.5 to maybe three optics per GPU. Okay. So this is not a hard rule, just a typical usage practice. Typical usage practice.
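Those ratios make first-pass capacity planning straightforward to sketch. The snippet below just replays Paymon's estimates of 1.5 to 3 optics per GPU; it's not a sizing rule for a real deployment.

```python
# First-pass optics-count estimate from the ratios quoted above.

gpus = 16                         # example cluster size from the episode
optics_per_gpu = (1.5, 2.0, 3.0)  # low / typical / high estimates

for ratio in optics_per_gpu:
    print(f"{gpus} GPUs x {ratio} -> ~{round(gpus * ratio)} optics")
# 16 GPUs x 2.0 -> ~32 optics, matching the example above. With one
# dedicated NIC per GPU (1 x 400G or 2 x 200G ports), the optic count
# scales linearly as clusters grow.
```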
[00:12:45] And by the way, when you look at the whole infrastructure of an AI data center, the major expenditure is really tied to three different areas. One is the compute side: for every dollar, about 60 cents goes to compute. On the memory side, storage and high-bandwidth memory, you're looking at about 15 to 20 cents, or 15 to 20%. And the networking is about 20% of your expenditure. So that's the high-level split.
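As a quick illustration of that split, here's the per-dollar math. The total budget is a made-up number, and the shares are the rough figures Paymon quotes, so they intentionally don't sum to exactly 100%.

```python
# Rough AI data center capex split, per the figures above.
budget = 10_000_000  # assumed total spend, for illustration only

shares = {
    "compute (servers and GPUs)": 0.60,
    "memory and storage (incl. HBM)": 0.175,  # midpoint of 15-20%
    "networking (switches and optics)": 0.20,
}

for area, share in shares.items():
    print(f"{area}: ${budget * share:,.0f} ({share:.0%})")
# Optics are a meaningful slice of the ~20% that goes to networking.
```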
[00:13:14] Okay. So when you're looking at where you're going to spend your money, the majority is going to be in the compute, but also a good percentage in the network. And a good percentage of that will be optics. Can we back up a step? Because we're starting to talk about these things. That was the second part of my conversation with Paymon Mogharabi. Next time, we'll get into more detail about AI data center hardware architectures and Ethernet.
[00:13:44] We have a new website. It's called optics.podcastpage.io. You can either listen there or use the same podcast platform you've been using all along. Please subscribe. Better yet, leave a review, especially if you use Apple Podcasts. Remember, we're part of the Cisco Podcast Network, where you can find other great Cisco podcasts too. We also have educational videos on YouTube. Just go to youtube.com and search for Cisco Optics. Thank you for listening. This is Pat Chou in technical marketing at Cisco Optics.
[00:14:14] The next episode is part three of my conversation with Paymon Mogharabi. Until next time.
