Lowering the Cost of Intelligence With NVIDIA's Ian Buck - Ep. 284

38m 15s

The transcription discusses the concept of Mixture of Experts (MOE) in AI models. MOE involves splitting a model into smaller experts to activate only the necessary neurons, reducing costs while maintaining performance. This architecture has become a standard in developing more intelligent AI models. The transcription highlights the evolution of AI models from traditional neural networks to MOE-based models, showcasing how MOE has led to smarter and more cost-effective AI solutions. The discussion also touches upon the importance of hardware advancements, like NVIDIA's GPUs and technologies such as NVLink, in supporting the training and deployment of MOE models. Overall, MOE architecture is instrumental in advancing AI by increasing intelligence scores while lowering costs, shaping the future of AI development and deployment.

Transcription


[MUSIC] >> Hello and welcome to the NVIDIA AI Podcast. I'm your host, Noah Kravitz. Ian Buck is here with us today. Ian is vice president of Hyperscale and High-Performance Computing here at NVIDIA. He's here to discuss mixture of experts, the architecture powering the world's leading frontier models, and how extreme co-design can both drive down the cost of generating intelligence today and future-proof your AI platform for whatever advances come tomorrow. Ian, welcome. Thanks so much for taking the time to join the podcast. >> Thanks, Noah. Glad to be here. >> So let's jump right into it. What is mixture of experts, MOE, as we call it? Why does it matter? If you look at the top 10 open models on Artificial Analysis right now, on their leaderboard, they all share the MOE architecture. So can you explain in lay terms what MOE is, and why it has suddenly become the standard for frontier AI? >> Yeah, it's a great question. It's a term that is used in industry and amongst AI researchers, but what mixture of experts actually means is not really understood. >> Yeah. >> We've all heard of neural networks, and that's what these models are. They're neurons, they're parameters, they're the components of an AI model. When AI got started and really entered the zeitgeist, the neural network was simple: each parameter represented a neuron of the model. We heard about 1-billion-parameter models, then 10 billion, 100 billion, and now trillion-parameter models. Those are basically the neurons of the AI brain that you activate when you ask ChatGPT a question. But something happened along the way. As these models got smarter and smarter, they naturally got bigger and bigger. In fact, two years ago when Llama first came on the scene, there was a 7B Llama, and then there was a 70B Llama, and now we have a 405B, a 405-billion-parameter model. That makes them smarter.
They have more information, they understand more things, and they give you better answers. But there was a problem. As they got smarter and smarter, to get the answer you actually had to ask and activate every neuron in that brain. As a result, while the models were getting more and more intelligent, they were also getting slower and slower, because you had to ask every neuron, calculate every neuron, and perform all the math on every neuron on a GPU; first it was one GPU, then lots of GPUs, and even more. Along the way, researchers realized that, just like a human brain, we probably don't need all of these neurons to answer every question. Simple questions probably need just a few neurons, and different parts of the brain encode different information. Let's just activate those. To make the AI cheaper, or to make the tokens cheaper (a token is the piece of data flying through the model that eventually becomes a word on the screen), let's only activate the neurons we need to activate. That's what mixture of experts is. Instead of having one big model, we split the model up into smaller experts. It has the same number of total parameters, but we train the model to only ask the experts that probably know that information. That's part of the training process to build the model, but once you do that, you can have a model which has maybe 100 billion parameters, 100 billion neurons, but we only ask or activate about 10 billion. That's a compression mechanism. It's a way of making AI cheaper while still being able to encode all the possible information and answer all the questions. So most models today are achieving higher and higher intelligence scores by having lots of experts and, as the model works toward the answer, asking only the right experts in order to get the right answer.
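[Editor's sketch, not part of the episode.] The idea described above, asking only the experts that probably know the answer, is commonly implemented as top-k routing: a small scoring function picks a few experts per token and blends their outputs. The toy below assumes random linear maps as experts and a dot-product router; the names `make_expert`, `moe_layer`, and all sizes are hypothetical and chosen only for illustration. It is not the routing code of any model mentioned in the conversation.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(seed, dim):
    """Build a toy 'expert': a fixed random linear map from dim to dim."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return expert


def moe_layer(x, router_weights, experts, top_k):
    """Score every expert, run only the top_k, and blend their outputs."""
    # Router: one score per expert (dot product of the token with a router row).
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Keep the indices of the top_k highest-scoring experts.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Combiner: softmax over the chosen experts' scores gives the blend weights.
    gates = softmax([scores[i] for i in chosen])
    out = [0.0] * len(x)
    for gate, i in zip(gates, chosen):
        y = experts[i](x)  # only the chosen experts do any math
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen


DIM, NUM_EXPERTS, TOP_K = 4, 8, 2  # illustrative sizes only
rng = random.Random(0)
router_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [make_expert(seed=i + 1, dim=DIM) for i in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen = moe_layer(token, router_weights, experts, TOP_K)
print("experts used:", chosen, "out of", NUM_EXPERTS)
```

With 8 experts and `TOP_K = 2`, only a quarter of the expert weights are touched for this token, which is the cost saving the conversation is about. Real models learn the router and the experts jointly during training.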
To put some numbers behind it, we have that Llama 405B, 405 billion parameters. That's one big model. On leaderboards like Artificial Analysis, which you mentioned, it gets an intelligence score of about 28. That 28 is just a weighted score of the benchmarks they tested, but all 405 billion parameters are going to be active. Now, fast forward to a modern open model like OpenAI's GPT-OSS. It has 120 billion parameters, actually a little bit smaller in total parameters. But when you ask it a question, it only activates on the order of 5 billion parameters. So instead of 405 billion parameters, and all that math and all that cost, it only needs to activate about 5 billion. That's like a 10-to-1 compression or beyond, making it cheaper. And it gets an intelligence score of 61. So it is going from 28 to 61 while going from 405 billion active parameters to 5 billion. Way cheaper. It's not 10x cheaper; it's still complicated, and we can talk about why these MOEs are complicated to run. But Artificial Analysis does measure the cost to run the benchmarks, what it costs to calculate the intelligence score. For Llama 405B, I think it currently costs them about $200 to ask a cloud service for all the answers just to create that score. They ask GPT-OSS the same thing. Its tokens are cheaper, and it only costs about 75 bucks. So MOEs are allowing models to get bigger and smarter, and they're allowing them to get cheaper. As a result, they're advancing AI. Now, across the board, on all the leaderboards, they're all mixture-of-experts models. >> Correct me, bring me back on track if I get off track here with the questions. But from a layperson's standpoint, to use that word, if I'm trying to wrap my head around this idea of mixture of experts, are the experts divided up in ways that I might think about knowledge?
This expert handles math, and this one handles science, and this one handles, I don't know, visual understanding? >> Yeah, it's a great question. That is the art of training these things. In fact, it's not hard-coded in there. They don't train a separate model for answering math questions and a separate model for telling you how to make a pizza. The beauty of AI, and of the algorithms that these researchers and scientists and companies like Anthropic and OpenAI and everybody else have figured out, is that they can just give it the data. They encourage the model to sort of clump, to identify and create these little pockets of knowledge. It's not prescriptive. It's driven by the data the model is seeing, which naturally clumps the activity for these different questions onto different experts. Then, in front of those experts, there's this thing called a router. The router is able to look at the stream of the question, how the answer is forming, how the model is thinking, and then predict, you know what, this one probably goes to that guy or this other guy. In fact, today's models may have on the order of dozens of experts on every layer, with a little router in front of them, and they may ask not just one expert: at every layer, they may ask two experts or eight experts. And then there's another unit in the model which listens to all the experts. This guy says, I'm pretty sure I got the right answer. Maybe I got the right answer. I don't know, I don't know, I don't know. It combines the answers and then goes on to the next one. So that's the architecture of it. You could train one person, one brilliant scientist, you could train an Einstein, to be able to answer any question. That's really hard. It takes a lot of energy. This is a very expensive person to hire and have on staff.
Instead, maybe I can hire a couple of domain experts, or teach a couple of different people some stuff. I just give them all that question. They can all answer it very quickly in parallel, and I combine their knowledge. That's actually how we work today. One person is not a company. Companies exist because we have all this expertise around. The MOE method is basically applying that to AI. So the models are all trained that way. There are all sorts of training methods to create the conditions where information and activations can start grouping and gathering together, and you can train these little routers and combiners. Then you just do that across multiple, multiple layers. And sure enough, at the end of it, you've got a chat model like GPT-OSS or Kimi K2. >> Yeah. Now, MOE isn't new to 2025. The idea, the architecture, has been around for a few years. So has it been in use all along and we just weren't so aware of it? And why has it come to prominence now? >> Yeah. The idea of experts is not new in machine learning. Before AI, there was the idea of combining multiple machine learning models together, and how to do that statistically to improve accuracy. There's all sorts of history and math around that. >> Yeah. >> Applying it to AI, though, is relatively new. We now know there were early versions of ChatGPT that were mixture-of-experts models, but that was not publicly known. It really wasn't until the DeepSeek moment, which was about a year ago, that the doors were blown open. The DeepSeek researchers were the first to really build a world-class MOE-based model. People had written papers about it, but this was one that actually competed and demonstrated intelligence scores that could be level with the closed-source models. And it was a beast. It was awesome. It had 256 experts in every layer.
I mean, it did every single optimization. And as a result, it was extremely cheap to run. It was incredibly complicated, but cheap to run, because it took MOE all the way to the extreme. And many people think that's kind of where OpenAI was with the original GPT. All right, so now, once we had that moment... The first time DeepSeek was run on NVIDIA GPU systems, it actually didn't run that well, because we didn't have the infrastructure or even the software to run it well. The DeepSeek engineers had written all this custom code to make it run awesome. But at that point, every researcher realized, hey, this thing's real. We can now see how to do it. I mean, the whole thing was open. They published the paper. It's a brilliant paper, and it shows the opportunity for MOE. Since that moment, you can see that every model maker has shifted to building MOEs. DeepSeek sort of shined the light on how to do it, how to train it, how to do inference and deploy it, and kicked off the MOE revolution that we've been enjoying. >> Right. So we know the DeepSeek moment was huge, as you just said, for many reasons. Is that what we're going to look back on and say, hey, the lights went on then, and, you know, new things will come? But for the moment, is everything MOE? And if not, why? What's the decision-making process? When would you train a model to be MOE, and when would you not? >> I think for all the models that are really focused on providing an intelligent response, it makes a lot of sense to be MOE. >> Yeah. >> You want to do your best to encode as much knowledge as possible into the neural network, so it just knows things. You don't need to write two plus two on pencil and paper to work out that it's four. You just know two plus two is four. So the more neurons you can throw into a holistic model, the more innate knowledge it has.
It doesn't have to work that out in a reasoning chain or the like. So there's a huge advantage to having models be bigger, as long as we don't increase the cost. That's why, with MOEs, we want to push the limits of only activating 10%, 5%, 3% of the neurons, with more and more experts. You can see that in the research and the way the models are evolving. They're really pushing the limits. Some of the modern models will have 300 or 400 experts that they're trying to combine. Now, getting all those experts and all that communication to work is complicated. We'll talk about that. >> Yeah. >> But having that foundation model with all those experts allows them to then apply all the other techniques of inference, of reasoning. It allows smaller models to be distilled and fine-tuned for specific tasks. It creates a foundation for the rest of the AI models around the world. So it's only some of the smallest models, for the more dedicated individual use cases, that aren't MOE. I've got to put a box around a stop sign, or I've got a Ring doorbell that uses AI to detect if it's a squirrel or not a squirrel. Those small models need to do one specific thing. I can probably get that squeezed down; I don't need to go to the complexity of an expert system. But anything that wants to be agentic, any kind of agent, and pretty much most of the AIs that we interact with purposefully, they're all MOEs, because they can be thrown anything, and they need to know about and be able to reason about a wide variety of different stuff. And it makes AI cheaper. >> Yeah. >> It lowers the cost per token. So there's always a drive on cost, a continuous push: let's increase intelligence and let's lower cost. We can do both with MOEs. >> I was going to ask you about that because it seems like there's this focus happening now.
You know, generative AI has progressed far enough, and certainly it's everywhere, including the news and the business section, if you will. And there's this shift from the biggest models, raw speed, the highest scores, to, as you said, how much does this cost, and can we get it to be cheaper while being just as smart, if not more intelligent? So we're calling it tokenomics, right? Not in the sense of blockchain or crypto tokens, but, as you mentioned, AI systems generating tokens: reasoning tokens, output tokens, what have you. So if we're focused on bringing the cost down, how does a more complex system, and I'm inferring here a little bit, but I would imagine it's more expensive to architect and to train, perhaps not to run, but in total cost... How does a more expensive, premium system actually drive the total cost down? >> Yeah, there's a wonderful, symbiotic relationship in the market between the AI hardware and the models that are being created to serve AI. They inherently have to make sense together. >> Yep. >> If the hardware offers a certain level of connectivity, a certain GPU performance, a certain memory size, then building an AI model that's even bigger is going to be hard to take to market, or not even possible to train efficiently. So from the original Kepler GPUs that we used for those first cat AIs to today's modern GB200 and GB300 NVL72 racks, you can see a pattern where, with every new platform, we advance the state of the art of what NVIDIA is able to offer: the compute performance, the memory performance, the connectivity and I/O. We'll talk about NVLink. Those things enable the next wave, to train the next model, but also to do inference. And they add complexity.
When we started, we were doing PCIe cards, in our case graphics cards that plugged into the server equivalent of your PC, and we used the floating-point calculations and the graphics memory to do the computation, and they were great. When the AI revolution took off, we saw that by adding more floating-point calculations and building a bigger GPU, adding things like HBM memory, and increasing the power beyond what a typical PCIe slot would provide, we could often increase what was possible in AI not by just a percentage more flops or memory bandwidth, but by X factors. And that's really because the AI models they were able to build were bigger and smarter, could run more efficiently, and could do more things. People talk about TCO as the cost. But TCO is not a goal in and of itself. It's just the lowest cost. If you want the lowest cost, buy one GPU. >> Sure. >> The goal is actually to improve intelligence and intelligence per dollar, the cost of intelligence. Or, for the same level of intelligence, say this 60 score from Artificial Analysis, to keep reducing the cost of that intelligence over time: the tokens that people need to buy, or the cost to run it. That's really the goal in every generation of NVIDIA architecture. We're looking to figure out what technologies we can incorporate, expand, double down on, invest in, or pull from the community or from our partners, in order to deliver X factors of performance improvement, where even the existing models, like the current MOEs, could get an X factor of performance improvement. We're not afraid to add more cost and more technology on a per-GPU basis. HBM memory is a lot more expensive than the old-school graphics memory.
But it only increases the cost in percentages, whereas because you now have HBM, and because of the bandwidth it offers to feed that much floating point, you can deliver an X factor in total end-to-end performance. >> Yeah, yeah. >> And we saw that when DeepSeek-R1 came out. The GPU of the time was the Hopper H200 system. Hopper had 8 GPUs in the server. They were all connected with NVLink through an NVLink switch, so we could effectively build one giant GPU out of 8 GPUs working as one. >> Right. >> That was really important. The model was so large that it couldn't really fit on a single GPU. It had to run multi-GPU, and the researchers that built DeepSeek took great advantage of that. It also had NVLink capability, so we could put the experts on different GPUs. You could parallelize the work, so things would run even more efficiently and even faster. And because those experts all had to talk to each other, they would do that over NVLink. So that was really important. Before we had NVLink, you would have to send things over a PCIe bus, only one could talk at a time, and it was much slower. Because we have NVLink, all those GPUs can talk to every other GPU at full speed. It's totally non-blocking, literally gigabytes and terabytes per second of bandwidth without any concern for collisions. It was critical for those DeepSeek researchers to get good performance. It also happened at a time when, as we can now say, we were in the heart of bringing up and building what is now the GB200 NVL72, where we scaled up the number of GPUs we can connect from just eight GPUs in a server to 72 GPUs in an entire rack, a 9x multiple. >> Yeah. >> Now, that's a lot more GPUs. So does the cost go up? Certainly. Obviously, that many GPUs, an entire rack of GPUs versus a server, is a lot more money. >> Sure.
>> In fact, we even had to add more technology, because we needed to take all those NVLink switches and build a separate NVLink switch plane. It does cost more. But because we did that, we can parallelize and improve the performance of DeepSeek-R1 even more. We can take all those experts and, instead of having to make it all fit and work within only eight GPUs, get all 72 GPUs working as one. Going generation to generation, being able to further parallelize and run all those experts across the rack increased performance so much that we got a 15x improvement on running DeepSeek-R1, versus only adding a percentage, about 50%, more total cost on a per-GPU basis. >> Wow, okay. >> That generated a 10x reduction in the cost per token. >> Right, right. >> So we do have to add more technology. We want to keep adding more technology; NVIDIA is a technology company. But we turn that technology back into performance, which on net reduces the cost per token, because those 72 GPUs are that much faster. As a result, they can get more out of that rack, more on a per-GPU basis. With Hopper, it cost about $1 to generate around a million tokens, roughly a million words. It's now down to about 10 cents. So people look at the rack and the system. >> Yeah, right. >> But the way you get there is you put all that investment into NVLink and all the connectivity, and you also do all the next-generation software work to make it all work really well. And generation to generation, you get that multiple, the 10x reduction in cost. That's just one model. That same story is playing out for GPT-OSS and everything else. And those are models that were built and trained and designed for Hopper. >> Right.
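[Editor's sketch, not part of the episode.] The arithmetic behind the figures quoted above can be written out directly: if throughput rises 15x while per-GPU cost rises about 1.5x, cost per token falls by the ratio of the two. The function name `cost_per_token_reduction` is hypothetical, and the inputs are the rounded numbers from the conversation, not measured data.

```python
def cost_per_token_reduction(perf_gain, cost_gain):
    """How many times cheaper each token gets when performance grows by
    perf_gain while system cost grows by cost_gain (both as multipliers)."""
    return perf_gain / cost_gain


# Rounded figures as quoted in the conversation (assumptions, not measurements).
PERF_GAIN = 15.0                 # "15x improvement" running DeepSeek-R1
COST_GAIN = 1.5                  # "about 50% more total cost" per GPU
HOPPER_USD_PER_MILLION = 1.00    # "about $1" per million tokens on Hopper

reduction = cost_per_token_reduction(PERF_GAIN, COST_GAIN)
new_usd_per_million = HOPPER_USD_PER_MILLION / reduction

print(f"{reduction:.0f}x cheaper per token")
print(f"${new_usd_per_million:.2f} per million tokens")
```

The result matches the "10x reduction" and "about 10 cents" mentioned in the discussion.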
>> We're starting to see models come out that are trained on Blackwell, and you're going to see that raise the bar and go even further. So this is the virtuous cycle that we've been working so fiercely to help make happen. We might add percentages in terms of cost and complexity on a per-GPU basis, but we aim at every generation to deliver X factors of performance and, as a result, dramatically lower the cost per token, by 10x. >> As I'm listening to you describe NVLink and the advances in getting the experts and the GPUs to communicate and act as one, I can't help but think we need NVLink for team meetings, so that instead of talking over each other, everybody can just communicate as one, at the speed of light. >> That's right. >> We're speaking with Ian Buck. Ian is vice president of Hyperscale and High-Performance Computing at NVIDIA. We're discussing mixture of experts and why it's become the architecture, well, as it has been for a while, but now getting public prominence, if you will, behind so many leading frontier models, and what goes into not only architecting and training the models, but the infrastructure that really makes them hum. Ian, you talked about this a little bit with NVLink and the technologies you alluded to as you were describing the MOE architecture. But what is it specifically about these NVIDIA systems that makes them such a good and such a unique fit for these complex MOE models, able to achieve, as you just described, this lowering of the cost of intelligence measured per token? >> Yeah, it's interesting and understandable. It goes back to the original idea about having experts. We're reducing the cost per token by not turning on every neuron, only the ones we need. It's a cost savings.
And we talked about Llama, the 405-billion-parameter Llama model. In order to use it, you've got to activate all 405 billion of those neurons, even though they're not all needed. Look at GPT-OSS. It's 120 billion parameters, still a lot, but you only need about 5 billion of them. It is smart, and as a cost-saving measure it uses only those five. You should also notice, though: that's like 10x less, actually more than 10x, about one percent of the number of neurons we're actually doing math on. Unfortunately, the cost on GPT-OSS is not one percent. It is an X factor lower, about 3x less cost, but it's not one percent of the cost. >> Sure, yeah. >> There's a hidden tax to MOE, and it's all about how those experts need to communicate with each other. In order to get MOEs to run efficiently, those experts are all doing their math very, very fast, and they all need to communicate with each other very, very quickly. One of the challenges with MOEs, as we get sparser and sparser, which makes the models more and more valuable and saves more and more cost, is: can we make sure that all that math is happening and all those experts can talk to each other without ever going idle, without ever waiting for a message? You're buying those GPUs, you're paying for them, so they can do the math they need to do, not sit around and wait for someone else to send them something. Or worse, the network that connects all these GPUs gets gummed up, and now everybody's sitting idle, and that's going to go straight to the bottom line of the cost. >> Yeah. >> So that's the key part: the hidden cost in MOE is communication. We've looked at whether we can make it work with just point-to-point, like maybe I can just connect this GPU with this GPU, and this GPU with that GPU. It would be a much lower cost to just directly wire them up.
But there's a limit to how much I can do that. If I take one GPU and connect it to four, this GPU's I/O is now split four ways, and I can only take that so far. Even with our Hopper systems, where we had eight, there was an NVSwitch chip. We built another chip specifically for this. But we couldn't scale beyond eight, because that was the limit of that chip. So if you have a point-to-point or torus-like network, you're fundamentally limited in how far you can push MOE and how cheap you can make those tokens, because the hidden cost in MOE is communication. And if you try to go bigger than a neighboring or point-to-point connection, or use some kind of loop or message-passing scheme, or a fabric like Ethernet, well, they weren't designed for this. The best answer is no compromises. I want this expert, this GPU, to be able to talk to every other expert at full speed, with no limitations and no worry about congestion. I want a network connecting these things that is non-blocking. >> Yeah. >> And that's what NVLink is. In fact, the chip that we built is specifically designed to make sure that every GPU, with all of its terabytes of bandwidth, can talk to every other chip at full speed and never compromise on the maximum I/O bandwidth we can get out of our GPU. We did that with Hopper, 8-way. One of the big innovations, and obviously it took a lot of engineering, was to make that 72 in a rack: every one of the 72 can talk to every one of the other GPUs at full speed, no constraints. You can see that taking off. You can see the benefit, which allows people to go even further and build even bigger models. The Kimi K2 model is even bigger than GPT-OSS. We now have an open-source trillion-parameter model, Kimi K2, yet it only uses 32 billion parameters when you ask it a question. That's like a 3% activation of the brain. But it's incredibly complicated. It's 61 layers, over 340 experts. They all want to talk to each other.
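[Editor's sketch, not part of the episode.] The "3% activation" figure is just active parameters divided by total parameters. The snippet below computes that fraction for the three models discussed, using the rounded parameter counts quoted in the conversation; the `MODELS` table and the `active_fraction` helper are hypothetical names, and the counts are approximations rather than official specifications.

```python
# (total parameters, parameters active per token), as quoted in the conversation.
MODELS = {
    "Llama 405B (dense)": (405e9, 405e9),
    "GPT-OSS": (120e9, 5e9),
    "Kimi K2": (1e12, 32e9),
}


def active_fraction(total_params, active_params):
    """Share of the model's parameters that do math for a single token."""
    return active_params / total_params


for name, (total, active) in MODELS.items():
    print(f"{name}: {active_fraction(total, active):.1%} of parameters active")
```

As the discussion points out, the cost per token does not fall in proportion to this fraction, because communication between experts adds its own overhead.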
And as a result, we now have open models with trillion-parameter-scale levels of intelligence, and the cost is comparable to, and even lower than, what we could ever possibly have with a fully dense model. It's possible because of that NVLink connectivity. And NVIDIA is committed to this: keep going down that path and keep building. We have some of the world's best SerDes engineers, signal processing engineers, wire engineers, and mechanical engineers to make all that work without having costs explode, and to make it all connected. Every one of those GPUs, by the way, is connected with a copper wire to one switch, to another switch. There's a reason why it all sits in the rack: we're running at 200 gigabits per second on every one of those wires. It's PAM4 signaling, so there are four signal levels on the wire, a 0, 1, 2, or 3, not just a 0 or 1. We've gone past binary at this point. And it's going so fast that its wavelength is about a millimeter, I think. So we're pushing the limits of physics. >> Yeah. >> We're keeping it all nice and tight, and also doing everything in copper for low cost. We're super happy with GB200 and what it's been able to do for inference, driving the cost of tokens down, down, down while intelligence goes up, up, up. >> So is this getting into what we call extreme co-design? >> Yeah. One of the joys of working at NVIDIA is that we're the one company that works with every company in AI. >> Right. >> And we work with them on building their data centers, getting the latest GPUs to them, explaining the NVL72 architecture, and helping build a lot of the software that they use. We have teams working on PyTorch, on JAX, on SGLang, on vLLM, and all the other software that's out there. And as these model makers are building new models and pushing the limits, some inside NVIDIA now, actually, but all around the world, we can co-design with them.
How to get the maximum utility out of those 72 GPUs, to manage that hidden cost of communication, to make sure every GPU is running at 110%, computing on the fewest possible neurons, and doing that seamlessly and incredibly fast. All the while, we're thinking about the next model. What's the next GPT, the next vision model, the next video model, the next Sora? And we're making smart decisions about how to add more bandwidth, more communication, more NVLink, and the right kind of floating point, all without blowing out cost or power, and while leveraging all the work that they've done to date so that it can be applied going forward. This is the extreme co-design that we do at NVIDIA. Some of the folks that I get to work with, who are probably watching this, get to enjoy it, and we work really, really hard to continuously improve performance, not just to have the fastest and be the fastest, but also to reduce the cost, because, as you said, tokenomics. If our software alone could increase performance by 2x, you've now reduced the cost per token by 2x, directly for the user, the customer, or whoever is going to deploy this AI. I was on a call this morning. We got a model from a customer, and they wanted some help. We applied the latest NVFP4 techniques, the latest kernel fusion, the latest NVLink communication and I/O overlap. Within two weeks, we hit 2x on their model and gave them the code back, and we're not done. There are so many places where we can optimize. I think a lot of people get confused. They see a GPU with a certain number of flops and they say, yeah, that's better, that's faster. But this stuff's pretty complicated. You have to manage and run 72 GPUs with 348 experts and all the different kernels and all the different AI and all the different math. We'd need to talk about KV cache and reasoning models and all the tricks and techniques. That's an end-to-end problem.
It requires extreme co-design between the hardware, the art of the possible, the model builders themselves, and the dense and deep software stack that runs on it. NVIDIA actually has more software engineers than hardware engineers, specifically for that purpose. - Right, yep. So to kind of zoom out for a second, and getting back to what you just said about thinking about what's next: we've been talking about MOE in the context of language models, predominantly. And the GB200 NVL72 is really well suited to that architecture. But is there a risk of focusing too narrowly on the single model trend of MOE? What happens when we get, you know, sort of beyond MOE? Is the architecture still well suited? Is the cost of tokens still going down? How do you think about that going forward, and how is the design that NVIDIA has today ready for whatever the next trend might be? - Well, the one clear trend in AI is that intelligence creates opportunity. As the models get smarter, as they start to learn new things or specialize in certain areas, they create opportunities to advance that industry, that science, that application, or just make computers more productive for you and I every day. - Yeah. - And in order to do that, we need to make the models smarter themselves. We need to use techniques like reasoning, which is really a way to generate more tokens. And there isn't only one way to advance the state of the art in AI; there are lots of ways. One way NVIDIA can help is just to reduce the cost of tokens. And in doing that, MOE is just an optimization technique. If you don't need all of the neurons, don't spend computing on them. That's an idea that's not unique to LLMs and chatbots. That's just a good idea. 
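The idea "if you don't need all of the neurons, don't spend computing on them" can be sketched as top-k expert routing. This is a minimal illustration under stated assumptions, not the design of any model discussed in the episode: the softmax router, the value `top_k=2`, the eight toy experts, and the names `moe_forward` and `make_expert` are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical toy expert: multiplies every input element by `scale`."""
    return lambda x: [scale * v for v in x]


def moe_forward(x, router_weights, experts, top_k=2):
    """Run only the top_k highest-scoring experts and mix their outputs."""
    # Router: one score per expert (dot product of its weight row with x).
    scores = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    probs = softmax(scores)
    # Keep the indices of the top_k experts; all others are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        weight = probs[i] / norm  # renormalize over the chosen experts
        y = experts[i](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out, chosen


random.seed(0)
dim, num_experts = 4, 8
experts = [make_expert(i + 1) for i in range(num_experts)]
router_weights = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]
x = [0.5, -1.0, 2.0, 0.25]

out, chosen = moe_forward(x, router_weights, experts, top_k=2)
print("experts evaluated:", chosen, "of", num_experts)
print("output:", out)
```

In this toy setup only two of the eight experts run for a given input, so the compute spent per input scales with `top_k` rather than with the total number of experts.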
So we see it materialize in different ways, and how these networks and experts want to communicate, or the shape of the models, is actually diversifying in lots of ways. There's lots of different techniques. Mixture of experts is certainly one of them that will stick around for a while. There's lots of other hybrid approaches and other things that people are talking about. There are trade-offs that you can make in order to reduce cost. But we see MOE happening not just in chatbots; similar sparsity and MOE applications are being done in vision models and video models. The models are expanding into science, not just generating tokens that turn into words that you and I speak, but working on proteins, or material properties, or things like robotics, path planning, logic, or business applications. All of those will benefit from having a large intelligent model that can be sparsely optimized to only use and leverage the part that's needed for that particular question and that particular use case. You can always go back down to the squirrel detector in a doorbell, but there's usually a benefit to having a model that's actually able to reason, or has some multimodal aspects. Maybe listen to what's going on and see the things around it and be able to make intelligent decisions smartly. That is going to continue to grow. And NVIDIA's not just working on MOEs. We've got lots of different irons in the fire. There's lots of different models. The models are diverse. I get to work on HPC as well. The whole supercomputing community is now embracing AI, building all sorts of models for simulating physics and simulating weather and things that look nothing like chatbots, but they're going to use MOEs. They're going to use every trick in the book. Because the opportunity is huge. 
The ability to revolutionize biology, to do drug discovery for cancer research alone, is an investment that the whole world is making right now. And they can take these ideas and take our platform and apply them to their domain, their problem, to take an open source model or general model and fine-tune it to be a science model or an application-specific model or business model. That is possible because they're starting from a really intelligent model that can learn, or be used in turn to teach another model, to make things possible. So I'm super excited about MOEs. I'm super excited, and we'll continue to work on reducing the cost of every token. And while that may make our technology bigger, smarter, more complicated at times, and will make it more expensive, it is going to deliver X-factors in capability and intelligence improvement and, as a result, dramatically lower the cost per token. - Ian, for listeners who want to dive in further, we could talk about this all day, but you have things to go build and customers to take care of and all that good stuff. Where can listeners go online? Which is the best place to start to dive into MOEs, to the infrastructure you've been talking about, to any and all of it? - I'd check out GTC. You know, we started this conference a few years ago, well, over a decade ago, I guess. I was there for the first one. It's called the GPU Technology Conference. - Right. - It's not a business conference, although obviously many business people show up. It's not a demo conference, it's a developer conference. - Yeah. - And if you want to learn more, go check out GTC. We put all the presentations online. Jensen's keynote is wonderful. He'll explain it even better than I can. And we actually do a few a year now. I encourage you to check out GTC. Go see the old ones. And if you're going to be in San Jose in March, please come and check it out and attend. 
There's tons of sessions at every level from beginner to deep dive. If you want to go down to the hardware, all the NVIDIA experts will be there. All of the different developers are going to be there. It is kind of the go-to place to go learn and also present your work on what you can do with GPUs and the state of the art of AI. Check it out. - Perfect. - Ian Buck, again, thank you. And you know, for what it's worth, Jensen's an amazing presenter, but you did a great job explaining all this. So we appreciate you taking the time. And as always, all the best to you and your teams on continued progress. - Thank you. (upbeat music) (dramatic music) (upbeat music)

Podcast Summary

Key Points:

Mixture of Experts (MOE) is an architecture used in AI models to split them into smaller experts for more efficient activation.
MOE models activate only the necessary neurons, reducing costs and improving performance.
MOE architecture has become a standard in developing more intelligent AI models with lower costs.

Summary:

The transcription highlights the evolution of AI models from traditional neural networks to MOE-based models, showcasing how MOE has led to smarter and more cost-effective AI solutions. The discussion also touches upon the importance of hardware advancements, like NVIDIA's GPUs and technologies such as NVLink, in supporting the training and deployment of MOE models. Overall, MOE architecture is instrumental in advancing AI by increasing intelligence scores while lowering costs, shaping the future of AI development and deployment.

FAQs

What is mixture of experts and why does it matter?

Mixture of experts is an architecture where a model is split into smaller experts, allowing only relevant experts to be activated for specific tasks. It makes AI cheaper while maintaining high intelligence.

How does mixture of experts differ from traditional neural networks?

Traditional neural networks require activating all neurons for each task, leading to slower performance as models grow. Mixture of experts selectively activates smaller groups of experts, making AI more efficient.

Why has mixture of experts become the standard for frontier AI?

Mixture of experts has become popular due to its ability to reduce costs while maintaining or improving intelligence. It allows models to achieve higher scores by activating only necessary experts.

Is mixture of experts being used across all AI models?

Mixture of experts is mainly used in models focused on providing intelligent responses. While smaller models for specific tasks may not require this complexity, most interactive AI models benefit from the approach.

How does a more complex system like mixture of experts drive down total costs?

Despite being more expensive to train and architect, complex systems like mixture of experts drive down total costs by improving the efficiency of AI models. By optimizing hardware and architecture, overall costs can be reduced.

When would you choose to train a model to be a mixture of experts?

Training a model as a mixture of experts is beneficial for AI systems that need to reason about a wide range of topics and achieve high intelligence scores. It is preferred for models that aim to be agentic and interact purposefully.
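
The contrast drawn in the FAQs, between a dense network that activates every neuron and a mixture of experts that activates only a few experts, can be put in numbers. The sketch below is a rough illustration only: the function name `moe_param_counts` and every parameter count in it are hypothetical, and real models split their weights between shared layers and experts in more varied ways.

```python
def moe_param_counts(shared: int, per_expert: int, num_experts: int, top_k: int):
    """Return (total, active) parameter counts for a simplified MOE model.

    shared     : parameters every token uses (e.g. attention, embeddings)
    per_expert : parameters in one expert
    num_experts: experts stored in the model
    top_k      : experts actually activated for each token
    """
    total = shared + per_expert * num_experts
    active = shared + per_expert * top_k
    return total, active


# Hypothetical configuration, chosen only for illustration.
total, active = moe_param_counts(
    shared=10_000_000_000,
    per_expert=2_000_000_000,
    num_experts=64,
    top_k=4,
)

print(f"total parameters:  {total / 1e9:.0f}B")
print(f"active per token:  {active / 1e9:.0f}B")
print(f"fraction active:   {active / total:.1%}")
```

A dense model with the same total size would use all of its parameters for every token; in this simplified accounting the MOE model stores the same knowledge but pays compute for only the active fraction.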
