I’m sure it’s no surprise to you that AI has been steadily changing the world, but did you know that optics is a key part of its hardware infrastructure? To explain it, fortunately we have a seasoned product manager who knows both the switching side and the optics side. Lucky for us, he sits next to me at the office and agreed to chat about it.
In Episode 62, we continue our conversation with Paymon Mogharabi, Senior Product Manager at Cisco’s Optics team, also known as the Transceiver Modules Group. We get into smart NICs and future growth affecting optics.
Paymon Mogharabi is a networking industry and Cisco veteran of nearly three decades with Electrical Engineering degrees from UC Irvine and USC. After starting at Cisco as a Technical Assistance Center engineer, he became a Technical Marketing Engineer for Cisco's Catalyst switches. He then took product management positions for Cisco's Edge Services Router, Nexus data center switches, and UCS server products. He is now a Senior Product Manager in Cisco's Transceiver Modules Group and has sat next to me for the past 7 years, focusing on data center applications.
Related links
Cisco Optics-to-Device Compatibility Matrix: https://tmgmatrix.cisco.com/
Cisco Optics-to-Optics Interoperability Matrix: https://tmgmatrix.cisco.com/iop
Cisco Optics Product Information: https://copi.cisco.com/
Additional resources
Cisco Optics Podcast: https://optics.podcastpage.io/
Blog: https://blogs.cisco.com/tag/ciscoopticsblog
Cisco Optics YouTube playlist: http://cs.co/9008BlQen
Cisco Optics landing page: cisco.com/go/optics
Music credits
Sunny Morning by FSM Team | https://www.free-stock-music.com/artist.fsm-team.html
Upbeat by Mixaund | https://mixaund.bandcamp.com
[00:00:08] Hello everyone and welcome back to the Cisco Optics Podcast where we talk about pluggable optics for networks. I'm sure it's no surprise to you that AI has been steadily changing the world, but did you know that optics is a key part of its hardware infrastructure? To explain it, fortunately we have a seasoned product manager who knows both the switching side and the optics side. And lucky for us, he sits right next to me at the office and agreed to chat about it.
[00:00:34] In episode 62, we continue our conversation with Paymon Mogharabi, Senior Product Manager at Cisco's Optics Team, also known as the Transceiver Modules Group. We get into smart NICs and future growth affecting optics. Paymon Mogharabi is a networking industry and Cisco veteran of nearly three decades with electrical engineering degrees from UC Irvine and USC. After starting at Cisco as a Technical Assistance Center engineer, he became a Technical Marketing Engineer for Cisco's Catalyst switches.
[00:01:01] He then took product management positions for Cisco's Edge Services Router, Nexus data center switches, and UCS server products. He is now a Senior Product Manager in Cisco's Transceiver Modules Group and has sat next to me for the past seven years, focusing on data center applications. And now join me as I talk with Paymon Mogharabi.
[00:01:21] You mentioned something about, was it overview of the AI, was it market? Or was there like...
[00:02:00] You said you found some numbers. Well, I can say that, as far as the transition that's happening, I'll take InfiniBand and Ethernet as an example. Like I said, right now it's primarily Ethernet, sorry, InfiniBand in the back end.
[00:02:20] But the projections are that within the next two to three years, the transition will happen fast enough that maybe close to 30 to 40% of the installations in the back end will actually become Ethernet based. And the front end is all Ethernet. So that transition is definitely clear. So the front end, they don't need that lossless, low latency property. They don't. That's more like a regular data center?
[00:02:50] Regular data center. Okay. Yeah. It is still, you know, higher speeds, but it's not the same as what you have, what the requirements are in the back end. Okay. Yeah. For, as far as the optics speeds are concerned, I think that is critical.
[00:03:08] What we are seeing today versus maybe one or two years ago: customers were talking about 100 gig or even 200 gig. Now it's strictly 400 gig, with 200 gig as an option, but it's mainly 400 gig and even beyond 400 gig.
[00:03:31] So you were talking about how each NIC could have either two ports or maybe a single 400 gig port, but it could be a 2 by 200. Correct.
[00:03:43] So right now in the industry, the network interface cards that are mainly used within AI type deployments are either 1 by 200, 2 by 200, or 1 by 400. There are NICs out there that can do 2 by 400, but those are mainly for the front end.
[00:04:11] For the back end, what is currently positioned is either 1 by 400 or 2 by 200. And so when you look at the requirements for a NIC, whether it goes into the back end or the front end: in the back end, it's all about speeds and feeds. You're sending traffic over. It's fast.
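The back-end NIC options mentioned here can be sketched in a few lines of Python. The 1x400 and 2x200 configurations come from the conversation; the class and names are purely illustrative:

```python
# Illustrative sketch of the back-end NIC configurations discussed above.
# The configurations (1x400G, 2x200G) are from the conversation; everything
# else here is an assumption for illustration, not a Cisco specification.

from dataclasses import dataclass

@dataclass
class NicConfig:
    ports: int          # number of optical ports on the NIC
    gbps_per_port: int  # per-port speed in Gb/s

    @property
    def total_gbps(self) -> int:
        # Aggregate bandwidth is ports times per-port speed
        return self.ports * self.gbps_per_port

one_by_400 = NicConfig(ports=1, gbps_per_port=400)
two_by_200 = NicConfig(ports=2, gbps_per_port=200)

# Both options deliver the same aggregate bandwidth...
assert one_by_400.total_gbps == two_by_200.total_gbps == 400
# ...but 2x200 needs twice as many transceivers and cables per NIC.
assert two_by_200.ports == 2 * one_by_400.ports
```

The point of the comparison: equal aggregate bandwidth does not mean equal optics count, which matters when tallying transceivers for a deployment.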
[00:04:36] Whereas when you're looking at front end, a lot of these NICs are more of like a smart NIC type adapter where they are programmable. Okay. And they add some intelligence and then they offload some of the networking functions that have to hit the CPU. They offload that and it's... Sorry, did you say this is the back end or the front end? This is for the... So for the back end, it's speeds and feeds. Okay. Your NICs are high speed NICs. There's no need for smartness in the back end NICs.
[00:05:05] No need for like a more like a... There's no need for like a programmability capability on the back end. But in the front end, these are generally NICs that are called DPUs or smart NICs. And these are programmable and these are... There's more intelligence. And of course, they offload some of the network functions that happen either at the server or the switch. They alleviate some of the load off of the server.
[00:05:35] So the idea is you do it closer to the port, and then you don't have to burn power by sending signals through the traces all the way to the CPU. And the CPU doesn't have to do as much. Right, removing some of the load, some of the cycles, off the CPU. So it's not getting bombarded by traffic that could be handled by the DPU. Okay. Yeah. Is that taking off? Because I remember hearing about smart NICs even before the pandemic.
[00:06:03] In this AI model, it has picked up. Or not just picked up, it has become a focal point. Okay. In the traditional model, maybe five years ago, it was difficult to justify a smart NIC. But in the AI model, you absolutely need something.
[00:06:30] You absolutely need a front-end infrastructure that takes advantage of these smart NICs or DPUs. Is that because... Why is that? Why is the demand so high for these front-end NICs? Because the amount of traffic, the amount of networking packets that go back and forth, is now much, much higher. And it's much higher than what traditional data centers needed. Correct. Yes.
[00:06:59] Any other thoughts you wanted to bring up when it comes to AI? I mean, I think we will definitely see a trend towards higher speed optics. There's talk about... There's 800 gig, of course. We have switches that can do 800 gig or have 800 gig ports.
[00:07:28] You have infrastructures that are actually dependent on having 800 gig connectivity down to the server. So that's an area, I think, that we'll see growth. In other areas... Is there a point where something just tops out? I mean, I'm thinking maybe power consumption.
[00:07:58] At some point... Even with the traditional data centers, before AI was really taking off, power consumption was kind of like the bottleneck. Right. Right? For a lot of things. So, I mean, customers can certainly build out smaller AI models within their existing data centers. But if they're looking to build out AI specialized data centers,
[00:08:26] then at this point, these are new data centers. These are data centers built from the ground up to handle much larger power requirements. And then you have more efficient cooling systems. Yeah. Actually, sorry. When I said power consumption, in my head, I was thinking of the thermals.
[00:08:55] Because my impression was that the bottleneck wasn't so much... I mean, it's partially just like, hey, you don't want to draw so much power. You don't want to pay for the power. You don't want to damage the earth, right? But the real... The more difficult limitation was you just can't get rid of that heat fast enough. It's certainly a factor. It's a factor. When it comes to power, like you said, thermal. Real estate is important.
[00:09:24] Because if you can't populate these racks fully, then you need more racks. So real estate becomes a problem. It's many factors. It's many factors. Yeah. Yeah. Have you seen any trends in... Because you mentioned you'll fill up a rack, but you'll still have to have the data traverse over to another rack.
[00:09:51] So have you seen like a big impact on just the cabling infrastructure architectures? Yes. Well, on the cabling infrastructure, it really comes down to your GPU placement. Your GPU size, your cluster size, and your GPU placement.
[00:10:16] If your infrastructure can be placed into one or two racks, then DAC will be more than enough. So, yeah. I mean, there is... The design will vary, but it ultimately comes down to how large your GPU clusters are.
[00:10:35] So, in fact, one of the ways that we're looking here, when we look at how to factor in what the optics selections should be, you know, what the architecture should be, is that we start with the GPU cluster size.
[00:10:56] So, a cluster size will determine how many servers are within a rack, how many ports are connected from the rack to the top of rack, or from the server to the top of rack switch, and on and on. And the cluster size is defined by how many GPUs? How many GPUs? So, how do you rate a GPU? Like speed or capacity or processing power? I mean, it's all.
[00:11:22] There's various models, various models that, you know, the known vendors provide. Yeah. And, yeah, it is the... So, there's no single metric. Like with CPUs, back in the day, it used to be just clock rate, right? Right. With GPUs, there's no single metric. You just kind of have to know what the latest model from a vendor is. I mean, there's...
[00:11:50] The GPUs themselves become more and more powerful. What the metrics are, I'm not fully... I don't fully know, but the processors become faster or become more powerful. But then the communication between these GPUs also increases, also becomes faster. So, you have more GPUs. You have higher speed bandwidth between the GPUs. So, what does that mean ultimately?
[00:12:17] It means that if you look at an entire cluster, the amount of traffic that's going in parallel, back and forth is going to be much higher with maybe a higher generation of GPUs versus an earlier generation. Okay. So, anyway, sorry. You were saying you got to know the cluster size. Correct. So, the cluster size...
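The sizing logic described here, starting from the GPU cluster size and deriving server, rack, and server-to-ToR port counts, can be sketched as a back-of-the-envelope calculation. The GPUs-per-server, NIC-ports-per-GPU, and servers-per-rack values below are illustrative assumptions, not figures from the episode:

```python
# Hypothetical cluster-sizing sketch. Defaults (8 GPUs per server, one 400G
# back-end NIC port per GPU, 4 servers per rack) are illustrative assumptions.

import math

def backend_counts(num_gpus: int,
                   gpus_per_server: int = 8,
                   nic_ports_per_gpu: int = 1,
                   servers_per_rack: int = 4) -> tuple[int, int, int]:
    """Derive server, rack, and server-to-ToR port counts from cluster size."""
    servers = math.ceil(num_gpus / gpus_per_server)
    racks = math.ceil(servers / servers_per_rack)
    # Each GPU drives its own back-end NIC port up to the top-of-rack switch
    ports_to_tor = servers * gpus_per_server * nic_ports_per_gpu
    return servers, racks, ports_to_tor

# Example: a 256-GPU cluster under the assumed defaults
servers, racks, ports = backend_counts(256)
print(servers, racks, ports)  # 32 servers, 8 racks, 256 ToR-facing ports
```

However rough, this is the shape of the exercise the conversation describes: the cluster size fixes the server count, which fixes the rack count and the number of optics-facing ports, and so on up the fabric.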
[00:12:47] That was the fourth part of my conversation with Paymon Mogharabi. Next time, we'll continue with constraints to consider when designing AI cluster racks and optics choices for them. We have a new website. It's called optics.podcastpage.io. You can either listen there or use the same podcast platform you've been using all along. Please subscribe. Better yet, leave a review, especially if you use Apple Podcasts. Remember, we're part of the Cisco Podcast Network, where you can find other great Cisco podcasts, too.
[00:13:17] We also have educational videos on YouTube. Just go to youtube.com and search for Cisco Optics. Thank you for listening. This is Pat Chow in technical marketing at Cisco Optics. The next episode is part five of my conversation with Paymon Mogharabi. Until next time.
