Cisco Optics Podcast Ep 63. Why some optics are good for AI and some aren’t (5/5)
Cisco Optics Podcast | March 17, 2025 | 00:17:30 | 226.32 MB


I’m sure it’s no surprise to you that AI has been steadily changing the world, but did you know that optics is a key part of its hardware infrastructure? To explain it, fortunately we have a seasoned product manager who knows both the switching side and the optics side. Lucky for us, he sits next to me at the office and agreed to chat about it.

In Episode 63, we conclude our conversation with Paymon Mogharabi, Senior Product Manager at Cisco’s Optics team, also known as the Transceiver Modules Group. We continue with the constraints to consider when designing AI cluster racks and optics choices for them.

Paymon Mogharabi is a networking industry and Cisco veteran of nearly three decades with Electrical Engineering degrees from UC Irvine and USC. After starting at Cisco as a Technical Assistance Center engineer, he became a Technical Marketing Engineer for Cisco's Catalyst switches. He then took product management positions for Cisco's Edge Services Router, Nexus data center switches, and UCS server products. He is now a Senior Product Manager in Cisco's Transceiver Modules Group and has sat next to me for the past 7 years, focusing on data center applications.

Related links
Cisco Optics-to-Device Compatibility Matrix: https://tmgmatrix.cisco.com/
Cisco Optics-to-Optics Interoperability Matrix: https://tmgmatrix.cisco.com/iop
Cisco Optics Product Information: https://copi.cisco.com/

Additional resources
Cisco Optics Podcast: https://optics.podcastpage.io/
Blog: https://blogs.cisco.com/tag/ciscoopticsblog
Cisco Optics YouTube playlist: http://cs.co/9008BlQen
Cisco Optics landing page: cisco.com/go/optics

Music credits
Sunny Morning by FSM Team | https://www.free-stock-music.com/artist.fsm-team.html
Upbeat by Mixaund | https://mixaund.bandcamp.com

[00:00:08] Hello everyone and welcome back to the Cisco Optics Podcast where we talk about pluggable optics for networks. I'm sure it's no surprise to you that AI has been steadily changing the world, but did you know that optics is a key part of its hardware infrastructure? To explain it, fortunately we have a seasoned product manager who knows both the switching side and the optics side. And lucky for us, he sits right next to me at the office and agreed to chat about it.

[00:00:34] In episode 63, we conclude our conversation with Paymon Mogharabi, Senior Product Manager at Cisco's Optics team, also known as the Transceiver Modules Group. We continue with the constraints to consider when designing AI cluster racks and optics choices for them. Paymon Mogharabi is a networking industry and Cisco veteran of nearly three decades with electrical engineering degrees from UC Irvine and USC. After starting at Cisco as a Technical Assistance Center engineer, he became a technical marketing engineer for Cisco's Catalyst switches.

[00:01:03] He then took product management positions for Cisco's Edge Services Router, Nexus data center switches, and UCS server products. He is now a Senior Product Manager in Cisco's Transceiver Modules Group and has sat next to me for the past seven years, focusing on data center applications. And now join me as I talk with Paymon Mogharabi.

[00:01:46] Later generations of GPUs versus earlier generations. So anyway, sorry, you were saying you've got to know the cluster size. Correct. So the cluster size, at a minimum, will tell you how many servers you need. Okay. So let's say I have a cluster size of 128, and I know I have servers that can hold eight GPUs.

[00:02:16] So if you divide 128 by eight, that tells you how many servers you need, which is 16. Well, at one server per rack, that may be too many racks. So what I can do is say, okay, maybe I can have 16 of these GPUs within a rack. So wait, sorry, when you say server, do you mean? These AI servers. You mean a box or an entire rack? No, no. When I say server, I mean a server with GPUs within the server. Server meaning, like, physically?

[00:02:44] A physical server, like a UCS server. Okay. Yeah. So it's not the entire rack. No. A rack contains multiple servers. And a server can contain multiple GPUs. Correct. Yeah. So if you look at a rack, you have a top-of-rack switch, generally one switch per rack. And then depending on how many RUs, rack units, your servers are, you can start to populate your rack. Now, in a traditional compute environment, you can probably fully populate that rack

[00:03:14] because power is not as much of a constraint. But in an AI world, you don't necessarily have the luxury of populating it with six or eight or ten AI servers, mainly because of power constraints. Power or thermal? It's both.

[00:03:40] So you're bound not necessarily by the real estate in that rack, but by the power constraints. Okay. So now your architecture is changing from having, let's say, 32 servers down to maybe two servers. Two servers? Maybe. Right, two servers per rack, each having eight GPUs. Yeah. Okay, so eight GPUs each. And you said there's one NIC per GPU. Correct. Yeah.

[00:04:08] So now, if you have 16 GPUs per rack, you're looking at two servers. Mm-hmm. The overall number is 16 GPUs. Mm-hmm. So 16 NICs. So effectively you have 16 ports going upstream to the top-of-rack switch. 16 ports. Or in some cases it could be twice as many ports, right? It could be. But at the end of the day, it has to be non-blocking.

[00:04:36] So whatever bandwidth is going into the top-of-rack switch, generally you want the bandwidth leaving it to be the same, so it's non-blocking. Okay. So if you have 16 ports going up at 400 gig, you also have to have 16 going out at 400 gig to make sure it's non-blocking. Okay. So you have a cluster size, and now your plan is to populate these racks.
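
To put rough numbers on the non-blocking rule just described, here is a minimal sketch in Python using the example figures from the conversation (16 NIC-facing ports at 400 gig); the variable names are illustrative, not a prescribed design:

# Back-of-envelope non-blocking check for a top-of-rack switch
# (example numbers from the conversation).
server_ports = 16          # one NIC per GPU, 16 GPUs per rack
uplink_ports = 16          # ports leaving the rack toward the spine
port_speed_gbps = 400

downlink_bw = server_ports * port_speed_gbps   # 16 x 400G = 6,400 Gb/s
uplink_bw = uplink_ports * port_speed_gbps     # must match to stay non-blocking

print("non-blocking" if uplink_bw >= downlink_bw else "oversubscribed")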

[00:05:02] Again, in the past, you would have just looked at the real estate and said, okay, I have a two-RU switch, I have a four-RU switch, this is how many I can put into these racks because of real estate constraints. Now it's a different story. You may have racks that are really sparsely populated. Why? Because of power constraints, you cannot fully populate these racks.

[00:05:30] So now we have a 128-GPU cluster, and you have to spread it among a number of racks. I don't want to do the math in my head, but if you have 128 divided by 16, that gives you what? Eight, I think. Yeah, eight racks. Did you get an even number? Sorry. Yeah. You take 128.

[00:05:57] Let's say it's 16 GPUs per rack. Then you're looking at eight racks needed to cover this entire 128-GPU cluster. Okay. So that determines how you actually populate these racks: you put two servers per rack. Mm-hmm. Now the question becomes, how many switches do I put at the top of the rack?
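
For reference, the rack-count arithmetic walked through here (and extended to the 1,024-GPU example a bit later) is just a ceiling division. A minimal sketch, assuming the conversation's example numbers of eight GPUs per server and two servers per rack:

import math

def cluster_layout(cluster_gpus, gpus_per_server=8, servers_per_rack=2):
    # Servers needed for the cluster, then racks under the power-limited
    # servers-per-rack assumption (example values from the conversation).
    servers = math.ceil(cluster_gpus / gpus_per_server)
    gpus_per_rack = gpus_per_server * servers_per_rack
    racks = math.ceil(cluster_gpus / gpus_per_rack)
    return servers, racks

print(cluster_layout(128))    # -> (16, 8): 16 servers spread across 8 racks
print(cluster_layout(1024))   # -> (128, 64): 64 racks for a 1,024-GPU cluster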

[00:06:26] Is one sufficient? Can I have one top-of-rack switch manage two racks, let's say? Or do I have to stick with one top-of-rack switch with two servers in the rack? In that case, if it's one switch per rack, then yes, copper would be more than ideal. You have enough reach, you have the efficiencies in power, and you go with

[00:06:55] the DAC solution. But if you have, let's say, a switch that's managing maybe three racks. Mm-hmm. In that case, copper is no longer a solution, so you have to look at pluggable optics. So you could have one switch as a top-of-rack switch, but it's on top of three, well, not physically on top of three racks, but it's effectively managing three different racks. Effectively managing three different racks. Yeah. Okay.

[00:07:25] So even if it's a switch sitting on a rack with the other two side by side, it's still technically challenging for a two-meter DAC from a top-of-rack switch to cover three different racks. Right. So now your architecture changes, your cabling and connectivity model change. Now you have to look at an optics solution, maybe multimode.
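
To make that reach trade-off concrete, here is a small, hypothetical decision sketch. The thresholds are only the rough figures mentioned in this conversation (about two meters for passive DAC, around five meters for active copper, tens of meters for multimode, hundreds of meters and beyond for single-mode), not product specifications:

def cabling_choice(link_length_m):
    # Rough rule of thumb; real limits depend on the specific cables,
    # optics, speeds, and platforms involved.
    if link_length_m <= 2:
        return "passive DAC (copper)"
    if link_length_m <= 5:
        return "active copper cable"
    if link_length_m <= 50:
        return "multimode optics (e.g., VR4-class)"
    return "single-mode optics (e.g., DR4 or FR4)"

print(cabling_choice(1.5))   # within one rack: copper is fine
print(cabling_choice(10))    # one switch serving several racks: optics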

[00:07:54] And then that grows. So when I mentioned 128, that's eight racks, but then you have models that go up to maybe a thousand GPUs, and absolutely more than a thousand. And it'll just extend. A thousand what? Let's say a 1,024-GPU cluster. So if we have a model that has 1,024 GPUs, now you're looking at maybe 64 racks.

[00:08:23] So as you can see, what's really driving your topology is not really the leaf-spine or the connectivity. You have to go to the actual model you're presenting to the outside world, and that's the GPU cluster size. So your GPU cluster size dictates how you're actually going to arrange these racks.

[00:08:53] It does come down to what kind of servers you're plugging in. It does come down to the density of your switches. Are these 800-gig ports, which allow you to break out into many more ports? You can definitely reach many, many ports if you have 800-gig ports. Or is it 400 gig, which limits you to maybe 16 ports per switch? So that again goes back to looking at the GPU cluster size and then working backwards

[00:09:23] and saying, okay, I have this GPU cluster size, I need this number of servers, and power says I can only put X number of servers within this rack. And that's going to determine how many switches I have and the length between the switch and all the different compute elements. Are you seeing any particular optics commonly chosen, like becoming popular? So, yes.
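
The port-density point a moment ago can also be worked through with example numbers. The sketch below assumes a hypothetical 32-port top-of-rack switch, 400-gig NICs, 800-gig ports breaking out two-to-one, and half the links reserved as uplinks to stay non-blocking; none of these figures describe a specific product:

switch_ports = 32
nic_speed_gbps = 400

for port_speed_gbps in (400, 800):
    # An 800G port can break out into two 400G server links; a 400G port gives one.
    links = switch_ports * (port_speed_gbps // nic_speed_gbps)
    downlinks = links // 2          # reserve half as uplinks (non-blocking)
    racks_served = downlinks // 16  # 16 NICs per rack in the running example
    print(f"{port_speed_gbps}G ports: {downlinks} server links, "
          f"{racks_served} rack(s) of 16 NICs")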

[00:09:49] Depending on which customers we talk to, in, I guess I want to say, the on-premises AI models, these are still smaller cluster sizes. So you have maybe DAC, maybe multimode optics. But as you start to talk to the hyperscalers, even the tier-two hyperscalers,

[00:10:17] now we're talking about much larger cluster sizes. So now you're looking at single-mode, or single-mode and multimode, or technologies like active copper, which extends the reach of the copper but is still limited to, I think, five meters or so. So that varies. Now, what we are recently hearing a lot about are the 400-gig QSFP optics

[00:10:47] that plug into the NICs and have a QSFP112 form factor. Okay. So in that scenario, you have a QSFP112 optic at 400 gig; from a form-factor perspective, it will plug into that NIC, and it will connect upstream to a switch that could potentially have a 400-gig QSFP-DD port. It doesn't matter. Right. Optically it's still, let's say, in this case, something like a DR4.

[00:11:15] So you have that connectivity model, and that's really picked up, actually. Many of the switches out there are still double-density, QSFP-DD. And so that interoperability is important. Optical interoperability is. Optical interoperability is important. Correct. Yeah.

[00:11:37] So we see a lot of demand for, let's say, the 500-meter DR4. We see demand for the 50-meter VR4 and even for the FR4 up to two kilometers. These options will plug into the NIC only. And then you would have the same selection, from an optical perspective, at the switch,

[00:12:05] but with the form factor being QSFP-DD. There are other form factors, and it's not an area of my expertise, but you're looking at OSFP. Of course, there are vendors that offer OSFP not just at the switch, but also at the NIC.

[00:12:30] And there are trade-offs, pros and cons, when it comes to OSFP versus QSFP. A notable one is the lack of backward compatibility with OSFP. But it is definitely a desirable option when it comes to connectivity at the NIC and also at the switch.

[00:12:56] So there are various models, and as a customer, of course, you're looking at an option not just based on its physical nature, but also at cost. You're looking at how many vendors offer it, right? Do you want to be locked into one vendor? Those are key factors. Mm-hmm.

[00:13:24] And also, what do you do with all the legacy infrastructure that you purchased when you go with the- Infrastructure like cable infrastructure? Cable, when we talk about, let's say, OSFP versus QSFP. Oh, okay. What do you do about that? Do you just throw everything away? Well, I don't think that's a viable option. Yeah. So those all come into play.

[00:13:50] So it's not just about making purchasing decisions for a new deployment. Mm-hmm. Realistically, you also have to consider customers that have already purchased much of this existing hardware, and what happens to all that hardware. So, yeah. Multiple factors before you decide whether you go with OSFP or QSFP. Sounds complicated.

[00:14:22] And it is. Great. Anything else you feel like sharing on this topic? I think just in general, today, when you look

[00:14:42] at where these GPUs are and where the deployments are going, it's more toward the hyperscalers, and their requirements are very different. For them, it truly may be that you need the highest-performance GPU, regardless of what the cost is.

[00:15:10] But I think eventually, as you look at the market, just like what we saw back in, let's say, the nineties, you'll start to see other vendors come into the picture. Sure. They may not have products or GPUs that are as leading-edge as the leading vendors, but they may be lower in cost. They may be just enough. Because their application is very different.

[00:15:38] Applications may not require deploying systems that are extremely high-end. Yeah. As a customer, I may just need a mid-range or, I guess, a smaller model, and in that sense I probably want to look at other vendors. And then there's also the idea of being able to approach multiple suppliers, multiple vendors.

[00:16:08] As a customer, I also have more leverage when it comes to paying for the infrastructure. All right. Well, thanks so much, Paymon. Really appreciate it. My pleasure. That was the fifth and final part of my conversation with Paymon Mogharabi. Join us again when we'll have a new guest.

[00:16:37] We have a new website. It's called optics.podcastpage.io. You can either listen there or use the same podcast platform you've been using all along. Please subscribe. Better yet, leave a review, especially if you use Apple Podcasts. Remember, we're part of the Cisco Podcast Network, where you can find other great Cisco podcasts too. We also have educational videos on YouTube. Just go to youtube.com and search for Cisco Optics. Thank you for listening. This is Pat Chow in technical marketing at Cisco Optics.

[00:17:07] Until next time.