Cerebras After IPO: OpenAI, AWS and Inference

41m 46s

Cerebras Systems, led by CEO Andrew Feldman, recently went public in a highly successful IPO, raising over $6 billion and reaching a valuation in the high $60 billion range on its first trading day. The company chose to go public rather than be acquired, aiming to gather fuel for growth and meet market demand. Cerebras is known for its wafer-scale engine, the largest AI chip ever built, which Feldman says runs AI inference 15-20 times faster than GPUs. This speed advantage has driven key partnerships, including with OpenAI, where Cerebras became one of only two hardware vendors in production, deploying within weeks of signing. Another major deal, with AWS, involves a disaggregated inference solution that splits prompt processing (prefill) from response generation (decode) to optimize performance, though this approach may be less efficient for non-hyperscalers because hardware can sit idle when the workload shifts. Feldman challenges the notion of a CUDA moat, noting that frontier models from Anthropic and Google (Gemini) were trained without CUDA and that inference is now API-driven. Cerebras distributes its compute through on-premise systems, the AWS cloud, and its own cloud, focusing on delivering the fastest AI compute. Feldman credits the company’s culture of tackling hard problems, such as building a chip 58 times larger than a GPU, for its rapid innovation and deployment.

Transcription

5990 Words, 33222 Characters

English

[MUSIC] Hello and welcome to the Tech Disruptors podcast hosted by Bloomberg Intelligence. In this podcast series, we speak with company executives and management teams about their views on disruption and how it is driving their decision-making and strategy. Bloomberg Intelligence is Bloomberg's research arm and covers over 2,000 companies globally across multiple asset classes, backed by Bloomberg and third-party data and supported by nearly 500 research professionals. My name is Kunjan Sobhani. I'm a senior technology analyst at Bloomberg Intelligence, and I'm pleased to have Andrew Feldman, co-founder and CEO of Cerebras Systems, as our guest today. Andrew leads the team behind the Wafer-Scale Engine, the industry's first and only wafer-scale processor and the largest AI chip ever built. Prior to Cerebras, Andrew was the co-founder and CEO of SeaMicro, a pioneer in low-power server technology, which was acquired by AMD in 2012. He then served as a corporate vice president and GM at AMD, running the server business. Andrew holds both a BA and an MBA from Stanford. Andrew, welcome to the show.
Thank you so much for having me. It's good to talk to you again.
First of all, I want to congratulate you on taking Cerebras public in what has been a very successful IPO from a valuation perspective. Based on the news flow, if these numbers are correct, it seems you started the roadshow at a base valuation of about 35 billion and got priced at around 40 billion. And on the first trading day, based on the closing price, Cerebras was valued in the high 60s of billions. That sounds like a fantastic win-win outcome for any company going public. So maybe let's start from there. How does going public change the trajectory of your company? How does it change the role for you as CEO? And also talk to us about the motivation behind going public versus exiting as part of a merger, because we have heard rumors that there was quite a lot of interest in acquiring Cerebras.
Well, why don't I answer that one first? I mean, we didn't want to sell the company. I think there are always conversations where people are trying to see if you have interest. I think it was our time. There was market demand and we could feel that. And so we weren't interested in that approach. I think an IPO is a fundraising event. It's an opportunity to approach a different type of investor with your story. It's an opportunity to gather fuel for the continued extraordinary growth of your company. And what we raised, when all is said and done and the greenshoe is exercised, will be in excess of $6 billion. This will enable us to continue our rapid growth, to make some investments, to continue to build out our engineering organizations. We thought it was the right time. We would be the first and only AI pure play. With the large deals we had with OpenAI and the contract and partnership we have with AWS, we thought that we were well positioned to provide Wall Street with what they wanted. And in return, we gather the funds and continue to invest and invent extraordinary things. The only thing that matters is your ability to deliver extraordinary product to customers. We're delivering inference 15 times faster than the GPU can. And there's enormous demand for our product. And that's a really fun place to be.
Awesome. And look, you have been on our podcast quite a few times now. We have had deep discussions around the technology and the moat that you have. There is now quite a bit of information out there and awareness of how Cerebras really differentiates, how the Wafer-Scale Engine differs from a GPU chip. For the benefit of first-time listeners, maybe we can touch upon that a little bit. But beyond that, what I want to address is this: as the company stands today as a public company, what are the biggest misconceptions investors or audiences might have about Cerebras?
Well, I think the fundamental differentiator that most people see is that we built the largest chip in the history of the computer industry. We built a chip 58 times larger than a GPU. And for AI work, big chips process information more quickly and deliver results in less time. And as AI has moved from being a novelty to being useful, the speed with which answers are delivered is central. Everybody wants faster answers. And we are the fastest, bar none. And so that's sort of what we're known for. I think that's sort of the golden egg. The goose behind the golden egg is one of the world's leading engineering teams, and a fearless culture where we chose to attack problems that others were afraid of. In fact, one large company's lab, the very group that was designed to do innovative work, wrote in 2014 that what we're doing was impossible. And so what delivers a stream of innovation is a fearless engineering culture, a culture that seeks out the hardest problems with the highest returns and is willing to apply years of extraordinary effort to solve problems. For 70 years, different teams had tried to build a wafer-scale product and they all failed. And we succeeded. And even after we succeeded, another company tried to do it and they failed. And so there is something we're doing in our culture and the way we attack problems that has allowed us to solve problems other people can't solve. And that's what we seek to hire. We seek to nurture. We seek to imbue in our team.
It's amazing. I mean, I've had semiconductor engineering experience for more than a decade, and I still think it's impossible to build what you're building and selling today. Most companies would still find it impossible. But that's great. You know, you originally filed your S-1 sort of a year ago, right? And your product has existed now for many years. Talk to us about how, as you stand today, we should think about Cerebras.
Should we still think about it as a chip company, an AI systems integrator, or a cloud service provider? Because it seems your business model and go-to-market have evolved over time. So help us sort of create an identity for Cerebras today.
Sure. I think you should evaluate us on our ability to build AI compute that's better than others. That's faster. That's easier to use. And we distribute that in one of three ways. We build these computers based on our chip and our system. And we deliver them on premise to large enterprises, to governments, to sovereigns. We deliver them through hyperscale clouds via AWS. And we deliver them through our own cloud. Those are delivery vehicles. But what we are is a maker of the fastest AI compute in the world. We start at the chip, and we invented the packaging and the full system and the rack. And it can be delivered to your premises, or you can benefit from it by the month, by the token, from hyperscalers or from our cloud. Those are just how you get to market. The fundamental thing we're doing, and how we wish to be evaluated, is our ability to build compute that's faster than others for AI.
And a big highlight of the IPO story in the roadshow was of course the deal with OpenAI. So it would be helpful to talk to us about how this deal came about. What enabled you to win in this crowded space? I mean, it was not until very recently that AMD, the second sort of GPU competitor, won a deal with OpenAI, and very soon after, a private name like yours comes up and wins a significant, sizable deal. So talk to us a little bit about that deal, what it entails in terms of roadmap, timing, etc.
Not only did we win the deal, but we were able to deploy extremely quickly. OpenAI today has only two hardware vendors in production, Cerebras and Nvidia. And we signed a deal with them on the 24th of December and we were in production on February 1.
So we were able to integrate an extremely new model, a model we'd never seen before, and deliver it to market very rapidly. I think what happened is that OpenAI, like the rest of the industry, realized that there was a fundamental change afoot: starting in early 2025, AI had become sufficiently smart that it was useful. And when people start using the technology, if it's slow, it's uncomfortable. It's painful. And it's no fun. And what happened was you had this sort of tidal wave of demand for inference, because the AI was good enough to do interesting things, to solve hard problems. And suddenly, the fact that running on GPUs is painfully slow was a constraint for everybody. And that created this opportunity where there we were, the fastest, not just by an order of magnitude, but by 15, 18, 20X over Nvidia GPUs. And it was an obvious choice. Others tried to sell OpenAI roadmaps. They tried to sell them all sorts of things, but OpenAI could test our product. They saw that it was the fastest in the world. And by Thanksgiving, we had a term sheet signed, and then just before Christmas, we had a master purchase agreement for 750 megawatts, a huge amount of compute, and in addition, an option for another 1.25 gigawatts. So there was just a tremendous appetite for fast inference, and we were able to demonstrate that we were the fastest, not by a little bit, but by a lot.
And I just want to repeat that in case people missed it, because this is really important. So today, OpenAI only has their compute deployed in production on two hardware vendors, Nvidia and Cerebras. Correct?
That's right.
That's amazing. I mean, you know, when we talk about barriers to entry, often the moat that Nvidia has is its CUDA ecosystem, right? If that was the hardware stack your developers originally used to train your model, it's natural that they'll use that for inference.
I don't believe that. I don't think CUDA is a moat. I think the CUDA moat is gone.
It's dead. There's no CUDA in inference at all. Everybody goes through an API. And so there's not one line of CUDA in inference. And number two, a year ago, 100% of the large US state-of-the-art frontier models had been trained with a CUDA flow. And they have since lost 70% of the market share. Anthropic, trained on Trainium, no CUDA. Gemini 3, trained on TPUs, no CUDA. OpenAI uses a flow that does include CUDA. So only one of the top three models has anything to do with CUDA. That's an extraordinary market share loss. And I think this is a bit of a myth, that CUDA remains this extraordinary moat. In inference, there's no moat. And in training, if it were so central to training, why did two of the three most important models use zero CUDA? So I think it's something to think about.
Definitely. And how were you able to enable deployment this fast? I get the hardware perspective. You have had your product running for other customers for a while, so you have the hardware supply chain stack figured out. But when OpenAI, with a very new developer system and a new architecture, adopts a different hardware stack altogether, how are you able to get them up and running for their needs this fast?
We received the model on January 7th and by January 11th, it was running full speed. So our compiler is now mature and exceptional. Even though we'd never seen the model, and it had all sorts of innovative components that we'd never seen before, we were able in just a few days to get it ready, to prepare it and begin testing. It's a result of having a compiler that is ready for prime time and a result of years of careful thought about how one can quickly bring up brand new models.
And now, how does the Cerebras stack fit into OpenAI's roadmap going forward? Any color you can share there? Are they going to use the Cerebras stack for a specific use case, for a specific workload?
How do the other stack and your stack compare at the customer, and where do they go from here?
They identified dozens of use cases for us. I think the truth is that nobody wants slow inference. And I mean, we can just ask ourselves, how big is the market for slow search? It's zero. How big is the market for dial-up internet, for slow internet? How much would I have to pay you to take out broadband from your house and put in dial-up? There is just no market. And there will be no market for slow inference. As these technologies become woven into our daily life, the willingness to wait evaporates. How long will you, your listeners, wait for a website to resolve? Imagine a website takes seven seconds to resolve. Will you wait for it? Or will you click away? In a heartbeat, everybody clicks away. And the same thing's going to happen with inference. If it can't provide you an answer quickly, you're gone. And this is true in an agentic flow. This is true with a human in the loop. In an agentic flow, if it's fast, you can do more work per unit time. That benefit compounds. And you will crush your competitors who can do less per unit time. I think the ability to do more work in less time is sort of fundamental. And it's the benefit of fast inference.
Let me switch gears to the second key partnership that you announced, which was with AWS. This is very interesting to me, not just from the perspective of Cerebras winning this major hyperscaler validation, but also for the future of disaggregated inference, if you will, and how this will take shape. So maybe we start again with how this deal came about. Who else were you competing with for this? Because this is an innovative approach. It's not just like a lab choosing another hardware stack. They have their own ASIC compute that they have been successfully using.
They do. Trainium. It's a fine part. We've been working with them for many, many months.
I think they saw an opportunity after Nvidia saw the need to augment its GPU by buying Groq. I think everybody came to realize that the GPU simply can't do fast inference. The memory architecture, the use of HBM, cripples a GPU's ability to generate tokens quickly. And so just as Nvidia saw this, I think the leadership at AWS saw this. And they came to us about a disaggregated solution. A disaggregated solution is a way to break up the inference work into two pieces. The first part is processing the prompt. And the second part is generating a response. And because we like to complicate things, we give them obscure names. We call the first part prefill and the second part decode. Now, you break up a problem such that each part has some different characteristics. Processing the prompt is a problem that can be parallelized. And generating the output is a process that can't. It's what's called strictly sequential. And so one can use different hardware, one that's good at parallelizing problems and one that's good at sequential problems, for these two stages in the inference work. And that can be a very good solution. It has some real strengths, but it also has some important weaknesses. When you deploy a disaggregated solution, you make an assumption about how much prefill compute and how much decode compute you deploy. And if you get that assumption right, if the workload needs that amount of prefill and that amount of decode, you're going to go great guns. You're going to be able to generate more tokens at a fast speed. But if you get that wrong, or if the workload changes, then you have the wrong amount of prefill and the wrong amount of decode. And in a rapidly moving environment, that's very possible. And so in that environment, you'll have idle hardware. That's very expensive.
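The sizing trade-off described above can be sketched with a toy model. This is an illustrative sketch only, not a description of the Cerebras or AWS design: the function name, the unit counts, and the per-request costs are all hypothetical. It treats prefill and decode as two fixed hardware pools and shows how much of each pool stays busy when the workload mix shifts.

```python
def pool_utilization(prefill_units, decode_units, prefill_cost, decode_cost):
    """Toy model of a disaggregated deployment (all inputs hypothetical).

    prefill_units / decode_units: hardware units provisioned for each stage.
    prefill_cost / decode_cost: unit-seconds one request needs in each stage.
    Returns (requests_per_second, prefill_utilization, decode_utilization).
    """
    # Each pool caps the request rate it could sustain on its own.
    prefill_cap = prefill_units / prefill_cost
    decode_cap = decode_units / decode_cost
    # The slower stage is the bottleneck for the whole pipeline.
    rate = min(prefill_cap, decode_cap)
    return rate, rate / prefill_cap, rate / decode_cap


# Provisioned for a workload that needs 1 part prefill to 3 parts decode.
matched = pool_utilization(4, 12, prefill_cost=1.0, decode_cost=3.0)
# Same hardware, but the workload shifts to long prompts and short answers.
shifted = pool_utilization(4, 12, prefill_cost=3.0, decode_cost=1.0)

for name, (rate, p_util, d_util) in [("matched", matched), ("shifted", shifted)]:
    print(f"{name}: {rate:.2f} req/s, prefill {p_util:.0%} busy, decode {d_util:.0%} busy")
```

With these made-up numbers, the matched case keeps both pools fully busy, while the shifted case leaves the decode pool mostly idle and cuts throughput to a third, which is the expensive outcome described above.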
Now, this problem, that as you get more specialized, as you divide up a problem, you get less flexible, is fundamental to engineering. I mean, you know this, you've designed hardware. This is fundamental. You can always go faster if you give up flexibility. So this sort of begs the question, who wins? I think the hyperscalers win, because they can deploy some of the solution in a disaggregated form, but they have a whole fleet of other compute. And if the workload shifts, they can move some of that workload to other compute. Who has a challenge? Well, I think the neoclouds are challenged, because they build for one customer huge amounts of compute, assuming one workload. And if the workload moves, disaggregated compute is going to be extremely inefficient. So that's sort of the big picture. They saw this opportunity at AWS. We leapt on it because we know the Trainium guys. We know that team and have a great deal of respect for them. And we saw the opportunity to build a really interesting solution. And that's where we are. We signed a binding term sheet in March. We're moving forward. It's very exciting.
And just so that this is clear, because this deal seems a little bit different from your other deals, like with OpenAI or Meta, which seem much more straightforward, where you are either the cloud or the token provider to the customer: what is something, from a business or strategic perspective, we need to understand about the AWS deal?
They're buying or leasing hardware that will be deployed in AWS data centers.
For the end customers of the cloud, though?
To the end customer, delivered through Bedrock. So this is no different than if they were to buy AMD GPUs or Nvidia GPUs or Arm CPUs. They are bought, deployed at an AWS data center, used to build an AWS solution, and then sold to the customer as a solution.
Look, we saw Nvidia going that route.
We're now seeing AWS going that route with you guys. Do you see this as the trend that's going to be more prevalent as inference adoption rises from here, where a lot of the compute deployment will take a disaggregated inference approach? And the reason I ask is because it seems that could create a landscape with multi-vendor, multi-hardware-stack deployment.
Well, we certainly think that the disaggregated solution will take a part of the market, especially the more mature part of the market, where you can predict in advance what the workload will be, the characteristics of the inference. In environments where the characteristics of the inference are still fluctuating and changing, I think disaggregated solutions are extremely expensive if you're not a hyperscaler. If they are your only way to generate inference responses, they will frequently have idle time and low utilization, which is sort of the enemy of a profitable infrastructure. And so I think the hyperscalers will do well, because it will be one part of many parts of a solution, whereas those who can't afford that, I think, will really struggle with disaggregation.
Maybe switching gears, I want to pivot to a few questions I like to call hot seat questions. They're not designed to challenge the speaker, but these are the difficult questions that I, as an analyst, always get from investors and other corporations. And you are at a unique vantage point to address them, whether it's from a Cerebras perspective or an industry perspective. So, to start: your mix of different go-to-market strategies, whether selling hardware for on-prem, integrating and running your own data centers yourself, or just a price-per-token cloud service, I think gives you a lot of flexibility and nimbleness and opens a lot more avenues, which has resulted in very fast success and adoption for you.
On the flip side, though, a question I get quite a bit is whether that comes at a cost to margins, right? When we compare you with the AI chip companies, one of the reasons they get very rich valuation premiums is that they have fairly higher operating margins than a cloud provider, a neocloud provider, or a systems integrator, right? So how do you balance this mix when it comes to profitability?
Well, I think neoclouds are in a very challenging spot. They buy extremely high-margin product from Nvidia. And they have to borrow money to do it. They need Nvidia backstops. And then Nvidia makes it difficult, I think, for them to choose and work with other vendors. That's a difficult spot to be in. I think our view is a little different. Our margins are the same, whether we sell through our cloud, through the hyperscaler cloud, or on-premise. I think the idea is that there are different market segments. If you're a large enterprise, you like to buy through AWS, right? Your procurement team will deliver to us a master purchase agreement the size of a Bible, right? Hundreds of pages of this and that, which takes months and months with lawyers to get through. Or you can say it's available through AWS and you get credit for your annual commitments. So you should think about it as a way to distribute to large and medium-sized enterprises. It is an extremely cost-efficient way to distribute. Very low cost of sales. Now, your own cloud is a different thing. What we have found is that there is a demographic group of engineers under 30 that are an extremely fertile part of the AI world. An enormous amount of the most interesting thinking is being done by young teams. Teams newly out of college, teams newly out of their doctorate or having just finished their postdoc. This is where some of the biggest AI ideas are coming from. And they have grown up in a world where compute was always on demand. It was never something you bought in metal.
It was something you turned on like water. And they don't need some of the things the hyperscalers offer. They don't need the security. They don't need some of the software layers that large enterprises need. What they want is fast inference, and they want it quickly. And for them, you want to deliver in your own cloud. And then finally, you have a group, pharma, oil and gas, sovereigns, government, military, intelligence, where the data is proprietary and extremely important. And these companies and organizations are unwilling to put that data even in the cloud. And so for them, if you want to build a business, you need to do it on their premises. And so that's sort of the way the world is unfolding. And if you don't deliver the way your customers want to buy, they don't buy.
You've covered in pretty good detail how you were able to push beyond the initial hurdles, or beyond the so-called CUDA moat. Today it makes sense for these very high-scale, sophisticated customers, whether it's a frontier lab or a hyperscaler, which have massive engineering teams, to undertake a new hardware stack adoption, especially a solution as complicated from an engineering perspective as yours. How do you scale from here in terms of diversification of customers, to more of a medium-sized, or not even medium-sized from a wallet perspective, but less sophisticated customer base, which is used to deploying compute in a ready-made software ecosystem with a turn-on-the-box, ready-to-run approach?
I think the answer is different for inference and training. For inference, they can move from Nvidia GPUs to Cerebras with 10 keystrokes. 10 keystrokes, less than 30 seconds. It is unbelievably simple because there's no CUDA. And that's what they do. They sign a contract with us. You can jump on and try it for free at cerebras.ai. You can see how fast it is.
You can sign up for our by-the-token offerings, you know, a shared service. Or you can sign up for dedicated compute through our cloud, and you can move in literally 30 seconds. And so it takes you no work whatsoever. Training is a more complicated problem. And many people are still accustomed to using the CUDA framework. There, more pressure is put on our compiler. But even there, it only takes a few days, and we should be able to set you up. And once you're set up, the workflow inside of a Cerebras training environment is not only faster, but much, much easier and more intuitive. We never run what's called model parallel. We never have to break up the large matrix multiplies and distribute them over GPUs. That's where this sort of training work spills over into supercomputing, into distributed compute. And that's where the number of people who can do it and do it well falls off a cliff. We never need to do it. Our chip is so big that we can run strictly in a mode called data parallel. And that is dead easy. If you want to change the number of machines working on the problem, you can do it in a few keystrokes. If you want to change the model, you can do it in a few keystrokes. That's a real advantage of being in our framework. You can begin with models that were trained on GPUs and you can fine-tune them on Cerebras. You can do inference on them on Cerebras, and vice versa. If you'd like to train or do slow inference, you can take models that were trained on Cerebras and do inference on them with GPUs. 100% compatibility, dead simple to do.
One of the big risks facing merchant, I don't want to call it GPU, but merchant accelerator hardware or semi providers is that most of the customers have built such massive semiconductor programs in-house that there's this risk of them out-innovating their supplier. Today, of course, you are in a position that technologically none of your customers can match.
But do you still see that strategically as a risk for you going forward, where a Google or an Amazon or an OpenAI would replace you with their own solution?
OK. I think if you can't build better hardware than your customers, your competitors, you're going to lose. I don't think there's any grand wisdom in recognizing that this will be a hard-fought battle. But we're not just starting here. Some are just starting. We're a decade in. We are vastly faster. And we achieve that by solving problems that others wish they could solve. By solving problems that allow us to use faster memory, and that allow us to do things on chip that other people have to do off chip. And on chip is 10,000 times faster than off chip. And so these aren't easy things to overcome. I think that we have seen mixed results from software companies developing hardware. Google is obviously the success case. They have built systems. They first did it with switches and networking devices. They continued that into servers. And they started their TPU program. Now remember, they started that a long time ago. What are they on? TPU 9 or 10 or 11? I mean, this doesn't happen quickly. And the first one wasn't a great product. And so you can expect the same from others. And, you know, Trainium is on V3. It's a good product. Let's see where the others fall. It's a hard problem. And we are focused on continuing to improve what we deliver. And we're very confident putting our roadmap up against others.
You brought up memory. Right now, it seems memory is the current bottleneck, right? Especially when you consider that the majority of the ASIC and GPU solutions rely on high-bandwidth memory, which is in limited supply because it's relatively new and is the most expensive type of memory.
Now, I think where your advantage is, is that your memory is on your wafer, on your massive chip, so you're able to use a less expensive memory, I should say, which is not as supply constrained today. How do you see these memory dynamics evolving for the industry, but also for Cerebras itself?
Look, there are two types of memory. There's memory that can store a lot, but is slow. And we call that HBM or DRAM. There's memory that can't store very much per square millimeter, but is blisteringly fast. And we call that SRAM. Those are the two memories. Now, Nvidia chose HBM because it was perfect for graphics. Remember, it's a graphics processing unit. That's where its entire architecture was originally pointed. Now, in graphics, you move a lot of data once, and then you do a lot of compute on it. And the amount of time that it took to move the data is overwhelmed by the amount of time it takes to do compute on it. And so it was OK to use a memory that could store a lot but was slow at moving data. Unfortunately, that's not how inference works. Inference moves a huge amount of data and then does a little bit of compute. And the bottleneck is the moving of data. And that's why we built a big chip and stuffed it to the gills with SRAM. SRAM can't store very much. And we couldn't make it store more per square millimeter, but by building a chip 58 times larger, we could use more square millimeters. We use 46,000 square millimeters in our chip, compared to the GPU's 800. So we had all this extra space to put SRAM. Now, SRAM is etched into the chip by TSMC. It's not a separate chip. It's etched into your wafer through the same process by which your logic is made. On the other hand, HBM is built at a different foundry by one of three companies, Micron, Samsung, SK Hynix. Right now, it's in extremely short supply. There are massive, massive lead times and it's extraordinarily expensive to get HBM, whereas we can make as much SRAM as we want through TSMC.
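The two area figures quoted above give the size ratio directly, and the "move a lot of data, do a little compute" argument can be sketched with a simple roofline-style estimate. The area figures are the ones stated in the conversation; the bandwidth, compute, and workload numbers in the second part are hypothetical values chosen only to show the shape of the argument, not measurements of any real chip.

```python
# Chip areas as quoted in the conversation (square millimeters).
WAFER_SCALE_AREA_MM2 = 46_000
GPU_AREA_MM2 = 800

area_ratio = WAFER_SCALE_AREA_MM2 / GPU_AREA_MM2
print(f"area ratio: {area_ratio:.1f}x")  # 57.5x, which rounds to the quoted "58 times"


def step_time(bytes_moved, flops, bandwidth, compute):
    """Roofline-style lower bound: the slower of moving data and computing on it.

    Returns (seconds, limiting_resource). All inputs here are hypothetical.
    """
    move_s = bytes_moved / bandwidth
    calc_s = flops / compute
    return max(move_s, calc_s), ("memory" if move_s > calc_s else "compute")


BANDWIDTH = 1e12  # bytes per second, hypothetical
COMPUTE = 1e15    # floating-point operations per second, hypothetical

# Graphics-like: move data once, then do a lot of compute on it.
graphics = step_time(bytes_moved=1e9, flops=1e14, bandwidth=BANDWIDTH, compute=COMPUTE)
# Inference-like: move a lot of data, then do a little compute.
inference = step_time(bytes_moved=1e11, flops=1e12, bandwidth=BANDWIDTH, compute=COMPUTE)
print("graphics-like limited by:", graphics[1])
print("inference-like limited by:", inference[1])
```

In this sketch the graphics-like step is limited by compute and the inference-like step by memory, so only the second one gets faster when memory bandwidth goes up.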
And so by going big, we were able to use a different type of memory. That type of memory is extremely fast. It moves information to and from compute, you know, more than two and a half thousand times faster than can be done via HBM. And that's one of the sources of our advantage. Now, not only do we avoid this limitation, the supply issues around HBM, but TSMC is also constrained on their three-nanometer line, and they're constrained on what's called CoWoS, a process that Nvidia uses. We don't use CoWoS and we don't use three nanometer. So we avoid some of the most binding, some of the most problematic supply chain issues through our innovation.
You've covered so many advantages of your wafer-scale engine, and the current memory headwinds sort of put you in a position of benefit. So let me ask you this. What are the challenges? You're growing, of course, very fast, so I don't want to take that away. But what is stopping you from growing faster, from your adoption increasing multiple-fold over what it is already? What is the current barrier?
I think we're going to be one of the fastest-growing companies in semiconductor history. So we're growing pretty fast. In '22, we did 25 million in revenue, in '23, we did 79, in '24, we did 290, and last year we did 510. That's pretty fast growth in the hardware world. That's one observation. The second observation is that right now data centers are constrained. And we are out in the market acquiring data center capacity as fast as we can.
And what specifically? Is it just power? Is it just the shell?
Well, it's everything that turns bare land plus power into a working facility. It's the shell, the transformers, the generators, the chillers, the CDUs. All of those are increasingly long-lead-time items. And I think that we've made huge leaps and bounds there, and are pleased with our pipeline of data centers, but that is a binding constraint right now.
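The revenue figures quoted above imply the following year-over-year growth. The dollar amounts are as stated in the conversation, in millions; labeling "last year" as 2025 is an assumption based on the sequence '22, '23, '24 that precedes it.

```python
# Revenue in millions of dollars, as quoted; the 2025 label is an assumption.
revenue = {2022: 25, 2023: 79, 2024: 290, 2025: 510}

years = sorted(revenue)
# Growth of each year relative to the year before it.
growth = {
    year: revenue[year] / revenue[prev] - 1
    for prev, year in zip(years, years[1:])
}
for year, rate in growth.items():
    print(f"{year}: {rate:+.0%} year over year")

# Compound annual growth rate across the whole span.
span = years[-1] - years[0]
cagr = (revenue[years[-1]] / revenue[years[0]]) ** (1 / span) - 1
print(f"CAGR {years[0]}-{years[-1]}: {cagr:.0%}")
```

On these figures, growth was roughly 216%, then 267%, then 76% in the most recent year.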
And before we wrap up the episode: from a technology perspective, you have made leaps and bounds of progress, getting an entire wafer to be a single chip, right? So I think it will take years for anyone to catch up. But how do you continue out-innovating yourself from here? Like, what's next after a wafer scale engine chip? Look, I think there is a pattern that sometimes happens. As you get larger, you get more conservative. And you get more afraid of doing extremely hard engineering work. And more and more time is spent on incremental improvements. I think you have to put a stake through the heart of that thinking. I think that is why so many companies have trouble making their third or fourth or fifth generation great products: when they were small, they were fearless, and as they've gotten larger, they've become tentative. And I think we are interested only in doing fearless engineering. We are interested in solving hard problems that other people are afraid of. And I think that is the mentality that drives sort of continued innovation. We're only interested in hiring people who want to work on really hard problems. I mean, there are lots of good engineers who don't want to work on really hard problems. If you'd like to work for us, and we have hundreds and hundreds of openings, have a love for hard problems, have a love for solving problems that other people are afraid to solve. Love building things. And I think with our ability to keep this passion and this drive and this sort of engineering esprit de corps, right, this engineering culture that knows what customers want and is unafraid to try and do it differently, do it better, we'll continue to stay ahead by orders of magnitude, as we are today. Andrew, thank you again for joining us today. We are glad and honored to have you so soon after the IPO just last week. I'm sure you have a busy schedule.
And thank you for sharing your detailed perspective, not only on Cerebras, but on the state of the infrastructure and what we can expect going forward.

Podcast Summary

Key Points:

Cerebras went public via a successful IPO, raising over $6 billion and achieving a valuation in the high $60 billion range, driven by market demand and a desire to avoid acquisition.
The company’s core differentiator is its wafer-scale engine, the largest AI chip ever built, which delivers inference 15-20 times faster than GPUs due to its size and architecture.
Cerebras has secured key partnerships with OpenAI and AWS, deploying its technology rapidly—e.g., with OpenAI from deal signing in December to production in February.
The CEO argues that the CUDA moat is dead for inference, as top models such as Anthropic's and Google's Gemini use no CUDA, and inference relies on APIs, not CUDA code.
Cerebras uses three distribution models: on-premise systems, AWS cloud, and its own cloud.
Disaggregated inference (splitting prompt processing and response generation) is an emerging trend, with hyperscalers like AWS benefiting, but it poses efficiency risks for smaller players.
The CEO emphasizes a fearless engineering culture that solved problems deemed impossible (e.g., wafer-scale chips), enabling continuous innovation.

Summary:

Cerebras' inference speed advantage has driven key partnerships, including with OpenAI, where Cerebras became one of only two hardware vendors in production, deploying within weeks of signing. Another major deal with AWS involves a disaggregated inference solution, splitting prompt processing and response generation to optimize performance, though this approach may be less efficient for non-hyperscalers due to potential hardware idleness. Feldman challenges the notion of a CUDA moat, noting that top AI models such as Anthropic's and Google's Gemini use no CUDA, and that inference is now API-driven.

Cerebras distributes its compute through on-premise systems, AWS cloud, and its own cloud, focusing on delivering the fastest AI compute. The company’s culture of tackling hard problems, such as building a chip 58 times larger than a GPU, has been key to its success, enabling rapid innovation and deployment.

FAQs

What is Cerebras Systems known for?

Cerebras Systems is known for building the largest AI chip ever, the wafer-scale engine, which is 58 times larger than a GPU and delivers inference up to 15-20 times faster than GPUs.

Why did Cerebras go public instead of being acquired?

Cerebras treated the IPO as a fundraising event: it raised over $6 billion to bring in new investors and fuel continued growth, rather than selling the company.

How does Cerebras' wafer-scale engine differ from GPUs for AI inference?

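The wafer-scale chip is large enough to hold far more on-chip SRAM, which moves data to and from compute much faster than the HBM used by GPUs. Because inference is bottlenecked on moving data rather than on compute, this yields quicker AI answers: in inference, Cerebras is up to 15-20 times faster than Nvidia GPUs.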
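
The interview's argument that inference is limited by data movement rather than compute can be illustrated with a toy timing model. This is a minimal sketch, not a description of any real chip: the workload size, bandwidth, and peak compute figures are hypothetical placeholders, all names are invented for illustration, and only the 2,500x bandwidth ratio is taken from the conversation.

```python
def step_times(bytes_moved, flops, bandwidth_bytes_per_s, peak_flops_per_s):
    """Return (memory_time, compute_time) in seconds for one step of work."""
    memory_time = bytes_moved / bandwidth_bytes_per_s
    compute_time = flops / peak_flops_per_s
    return memory_time, compute_time


def limiting_factor(bytes_moved, flops, bandwidth_bytes_per_s, peak_flops_per_s):
    """Name whichever resource takes longer and therefore sets the step time."""
    memory_time, compute_time = step_times(
        bytes_moved, flops, bandwidth_bytes_per_s, peak_flops_per_s
    )
    return "memory" if memory_time > compute_time else "compute"


# HYPOTHETICAL numbers, chosen only to show the shape of the argument.
BYTES_MOVED = 1e11     # data moved per step (lots of data)
FLOPS = 2e11           # arithmetic per step (comparatively little compute)
SLOW_MEMORY_BW = 1e12  # bytes per second, stand-in for off-chip memory
PEAK_FLOPS = 1e15      # floating-point operations per second

# The interview quotes "more than 2,500 times faster" for on-chip memory.
FAST_MEMORY_BW = SLOW_MEMORY_BW * 2_500

slow_limit = limiting_factor(BYTES_MOVED, FLOPS, SLOW_MEMORY_BW, PEAK_FLOPS)
fast_limit = limiting_factor(BYTES_MOVED, FLOPS, FAST_MEMORY_BW, PEAK_FLOPS)
print(f"slow memory: limited by {slow_limit}")
print(f"fast memory: limited by {fast_limit}")
```

With these made-up inputs the step is limited by memory on the slow configuration and by compute on the fast one, which is the qualitative point made in the interview. Once compute becomes the limit, further bandwidth no longer helps, so the end-to-end gain in such a model is smaller than the raw bandwidth ratio.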

What is the biggest misconception about Cerebras for investors?

A key misconception is that CUDA is a moat; however, Cerebras' CEO states CUDA is irrelevant for inference, as 70% of top frontier models now use non-CUDA flows.

How did Cerebras win the OpenAI deal and deploy so quickly?

OpenAI chose Cerebras due to its superior speed. Cerebras received the model on January 7th and had it running by January 11th, thanks to a mature compiler and quick integration.

What is the AWS disaggregated inference deal with Cerebras?

AWS is deploying Cerebras hardware in its data centers for a disaggregated solution, splitting inference into pre-fill and decode stages, to be sold via Bedrock to end customers.
