
Intelligent Robots in 2026: Are We There Yet? with Nikita Rudin

66m 37s


Today, we're joined by Nikita Rudin, co-founder and CEO of Flexion Robotics to discuss the gap between current robotic capabilities and what’s required to deploy fully autonomous robots in the real world. Nikita explains how reinforcement learning and simulation have driven rapid progress in robot locomotion—and why locomotion is still far from “solved.” We dig into the sim2real gap, and how adding visual inputs introduces noise and significantly complicates sim-to-real transfer. We also explore the debate between end-to-end models and modular approaches, and why separating locomotion, planning, and semantics remains a pragmatic approach today. Nikita also introduc...

Transcription

10465 Words, 57360 Characters

my hot take on that, and I'll be happy to be proven wrong, but I think there is not a single humanoid robot today that actually generates value, meaning there might be a robot that does something fairly close to what it's supposed to do in a factory or in a warehouse, but it's not exactly that, so in the end, it's not generating value because it's not doing the actual thing. All right, everyone. Welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today, I'm joined by Nikita Rudin. Nikita is co-founder and CEO of Flexion Robotics. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Nikita, welcome to the podcast. Thank you. I'm excited to be here. I'm excited to have you on the show, and I'm looking forward to digging into our topic for the conversation, which is really the gap between where we are today with robotics and where we need to be to fulfill the vision of the technology. You've been working in this space for quite a while. You did your PhD at ETH Zurich and spent some time at NVIDIA. I want you to actually share a little bit about your PhD and the focus of your research. When I started, we were trying to use simulation with reinforcement learning to teach legged robots very simple things, just walking on flat ground, and when the robot could take a few steps, that was already a big success. The core focus was to reduce the training time needed to achieve that. And when you say legged robot, is that like a quadruped? It's like a quadruped, a big robot dog. We're not using Boston Dynamics' Spot. We're using ANYbotics' ANYmal. ANYbotics is a Swiss startup that was a spin-off from our lab. Very similar to a Spot, but it's made in Switzerland. We were really trying to reduce the training time needed to achieve that.
Before I started, there were some results of reinforcement learning for such quadrupeds, but it would take weeks of computation to achieve anything. Using GPUs and massively parallel simulators, we managed to reduce that to just a few minutes. Actually, we had a demo on stage at some conference where we were running training live on a laptop while I was holding the robot, and every 15 seconds the laptop would send the latest policy to the robot. You could really see how it went from just falling over to taking a first step, and then after three or four minutes, it would be able to walk around the stage. That was a pretty cool visual demo for everyone to see exactly how the learning process happens. From there, my PhD was pushing the agility of that robot. Using similar techniques, it was still training neural networks in simulation and then transferring them to the real world. But the inputs got more complicated, the tasks got more complicated. By the end, we could go to a search and rescue facility here in Switzerland. You have to imagine collapsed buildings, a lot of mud, moss, gravel, big rocks. Terrain that is very hard to navigate even for a human. We would just tell the robot to go from point A to point B, and it would use its whole body, so it would use its knees to climb on top of big rocks and then jump over gaps. Again, all autonomously, all end-to-end, using images and the state of the robot to plan its next actions. In telling that story about deploying this robot in a search and rescue context, I'm envisioning the demo. I've seen similar things. The robot dog is going along, maybe opening some doors, maybe that wasn't part of your demo, but I've seen similar demos of the robot dog climbing hills and crossing rubble. I think those demos attempt, in many cases, to land the idea that, plant a flag in the ground, we're done here.
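The "massively parallel simulators" Nikita credits for cutting training from weeks to minutes come down to stepping thousands of robot instances as one batched array operation instead of one physics call at a time. Below is a minimal, hedged sketch of that idea with toy dynamics; the environment count, observation/action sizes, and the `step_batched` function are illustrative assumptions, not Flexion's or Isaac Gym's actual code.

```python
import numpy as np

NUM_ENVS = 4096      # typical scale for GPU-parallel RL training
OBS_DIM = 48         # e.g. joint positions/velocities plus base state
ACT_DIM = 12         # e.g. 3 actuated joints per leg on a quadruped

rng = np.random.default_rng(0)

def step_batched(obs: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Advance every environment at once: one vectorized operation
    stands in for NUM_ENVS sequential physics steps (toy dynamics)."""
    mix = np.tanh(actions) @ np.ones((ACT_DIM, OBS_DIM)) / ACT_DIM
    return 0.99 * obs + 0.01 * mix

# A short rollout: all 4096 simulated robots progress in lockstep.
obs = np.zeros((NUM_ENVS, OBS_DIM))
for _ in range(10):
    actions = rng.standard_normal((NUM_ENVS, ACT_DIM))
    obs = step_batched(obs, actions)
```

On a GPU the same batched structure is what lets a policy gradient method collect millions of transitions per minute, which is why a fresh policy could be shipped to the robot every 15 seconds in the stage demo.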
Talk a little bit about the distance between what you were able to accomplish with that demo and what you think needs to be done to deploy one of these robot dogs in a real search and rescue scenario, for example. I had this debate so many times, even with my colleagues at NVIDIA and ETH. The general question is, is locomotion solved or not? We already had this debate five years ago, when the robot could just barely walk on flat ground and people were saying, "Yeah, locomotion is solved. We can focus on other things." My take was always no: until the robot can really go anywhere a human can go, and you don't even need to think about whether it can do it or how reliable it is, locomotion is not solved. So where we are today is that anything that is blind can be very robust. Blind meaning that the robot does not perceive its terrain. It's just reacting. That makes the training much easier because you just have to throw a lot of things at it. It always tries to be stable, and especially for a quadruped it's fairly easy to remain stable. For example, it can even walk upstairs: it will just hit the first step, realize there is something, and then climb up. Doing it with perception is different. Now you want it to actually anticipate. You don't want it to hit the first step. You want it to place the feet much more carefully. That's harder. Today, at the end of 2025, I would say we can do that. We can have fairly good policies that plan the sequence of actions based on the perceptive inputs. Because we're training everything in simulation, this means that we need to be much more careful in how we simulate those sensors as well. We need to put a lot of effort into creating complicated terrains that can be seen by the cameras, and also all sorts of noise models to simulate all the disturbances and defects of those images.
So what I hear you saying is that the addition of information you would think would help actually makes things more difficult, because it introduces a lot of noise. Whereas previously the robot could kind of stumble through the terrain, now it's trying to incorporate this visual input to plan, but that ends up making it harder. What actually happens when it is trying to do this? Do you see it stuttering, or does it just not work? Is it just hard to train from a model perspective? What happens? So, in the end, what happens is that the final behavior is better if you do everything right, but that's a big if. The typical thing we refer to is the so-called sim-to-real gap. You train things in simulation and then deploy them in real life, and it's not the same. Things that work pretty well in simulation might not work at all in real life. That sim-to-real gap is much larger once you have perception in the loop, once you're simulating depth images or, even worse, RGB images. So that makes the job of the researcher or the engineer harder. You need to cross that sim-to-real gap for both the physics of the robot and the perceptive inputs. You have to simulate the sensors carefully. But if you do it right, then the final behavior is actually much better, because you can see that the robot is not simply reacting to whatever is happening under its feet, but actually planning accordingly in advance. Is this problem of locomotion solved then with vision, at least for, let's take quadrupeds as an example, or are there still kind of outstanding issues? Is there a generality gap, or how do you think about it? It's interesting. It really depends where you draw the line on locomotion. Once the robot can cross complicated terrain, the next step goes more into navigation, which means: where should it go? Should it climb that thing or should it avoid it?
For now, everything I've described so far was mostly using geometry, so there are no semantics. But now if you imagine the robot- Meaning you give it a point A and a point B, and it's going to take the straight line and kind of plow through whatever is between here and there, as opposed to thinking about which way to go. More or less. If there is a huge wall it might avoid it, but otherwise it will mostly try to climb on whatever is in front of it. Now if you imagine a robot walking behind me in the office, you don't really want it to climb on every single desk. You want it to avoid some things, but also walk on others, right? So if there are stairs, you want it to take the stairs, but you don't want it to hit plants or whatever else, which means that suddenly you have to add semantics to the policy. And what does it mean to add semantics to the policy? Once again, if you go from simulation to reality, it makes the sim-to-real gap bigger, because suddenly you have to simulate all these offices, all these different objects. Plus you probably need to give it RGB images, not just depth images, which means you need to simulate them in a photorealistic way. Or the other option, which is probably the more correct one in the short term, is to split the problem. So you train one thing to be very good at walking on anything, but you don't give it semantic information, and you train another thing on top, which will be, you can call it the planner, or a higher-level policy, that will steer it around. Now, historically in the conversations I've had with roboticists, this has been a big debate: whether we should be using end-to-end deep learning models that can figure all this stuff out, or using a more modular approach. It sounds like what you're saying is that a more modular approach can still be a pragmatic way to overcome the challenges of end-to-end training.
Yes, it's interesting, because my whole PhD was about going more and more end-to-end, specifically for locomotion. And still, I'm here arguing that we should not do everything end-to-end. I think at some point we'll get there. In the short to medium term, as you said, the more pragmatic approach is to split the problem and use different techniques for different parts of the problem. We're splitting the problem into kind of the locomotion model and a planner model. What's the objective for the planner? The English-language objective is, I want this thing to intelligently choose the best path. But what is the best path? How do you define that? How do you create an objective around that? Maybe it's the path that's the most power efficient if you're a robot, or maybe it's the path that gets you there fastest, or in the least distance. How do you balance all that? There are different ways to do this. If you still choose the RL route, and we're doing that, you can train these planners with reinforcement learning. Typically, you have to define a reward function. It would include things like: don't hit anything, avoid objects, and don't move too fast, because that's one thing that makes robots seem very dangerous. The reinforcement learning policy will try to optimize everything, so it will go very, very quickly to the goal. This is not really what you want with a robot that operates around humans. Typically, we're actually trying to slow them down as much as possible. And yeah, that's mostly it. And then it depends really on what is in front of the robot; there might be things that it's okay to walk on and others that it's not. There is also another approach. Once you split the problem in half, you could train the locomotion with pure reinforcement learning in simulation, but you could train the planner with other data.
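The reward shaping Nikita describes for the planner, reach the goal, don't hit anything, don't move too fast, can be sketched as a small scalar function. The term names, weights, and speed limit below are illustrative assumptions, not the actual reward used at Flexion.

```python
def planner_reward(dist_to_goal: float, prev_dist: float, speed: float,
                   collision: bool,
                   w_progress: float = 1.0, w_speed: float = 0.5,
                   w_collision: float = 10.0,
                   speed_limit: float = 0.8) -> float:
    """Toy planner reward: reward progress toward the goal, penalize
    exceeding a comfortable speed, heavily penalize any collision."""
    r = w_progress * (prev_dist - dist_to_goal)    # progress toward goal
    r -= w_speed * max(0.0, speed - speed_limit)   # discourage moving too fast
    r -= w_collision * float(collision)            # hard penalty on contact
    return r

# Moving 0.5 m closer at a calm speed with no contact is rewarded;
# the same progress with a collision is strongly penalized.
safe = planner_reward(dist_to_goal=1.0, prev_dist=1.5, speed=0.5, collision=False)
crash = planner_reward(dist_to_goal=1.0, prev_dist=1.5, speed=0.5, collision=True)
```

Note the asymmetry Nikita points out: left unshaped, an RL planner maximizes progress by sprinting, so the speed penalty deliberately trades optimality for behavior that feels safe around humans.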
For example, videos of humans walking around the office, and you can extract the trajectories from that. And then you don't really need reinforcement learning anymore. You train a behavioral cloning, imitation learning policy that will steer the robot just like a human would walk. Speaking of how a human would walk: thus far, we've been talking primarily about quadrupeds. How much of all of this translates from quadrupeds to humanoid robots? All of it transfers. This is the magic of reinforcement learning: these policies don't really care if they're controlling a quadruped or a humanoid. And this was the big switch from my PhD to Flexion, where we're working mostly on humanoids. We've seen that the exact same techniques transfer. There is one interesting thing that happens with humanoids. That makes sense to me for a planner, but it's less intuitive for a locomotion model. And maybe the locomotion model itself, as I'm thinking of it, is split into multiple components. But let me be more specific. I'm including in locomotion the outputs that control stepper motors and all that kind of stuff. Is that part of what is trained in the locomotion model? I would think that you would need to at least tune it, or do something else, if you change the form factor of your robot. No, this is a very good point. When I say it transfers, the general techniques transfer; the models themselves don't. So for sure, you need to retrain a new policy, a new controller, for the new robot. But if all your simulation pipelines are general enough, it can be as easy as changing the input file, the URDF that describes the robot, retraining, and then you're ready to deploy. The tuning part is an interesting one, because it is a little bit harder for humanoids compared to quadrupeds. And I don't think it's really related to the fact that it has two or four legs or anything like that.
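The "new robot is just a new input file" workflow can be sketched as a training configuration where only the URDF path changes and everything downstream stays robot-agnostic. The field names, file names, and `make_training_run` helper below are hypothetical illustrations, not a real framework's API.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    urdf_path: str            # robot description: links, joints, limits
    num_envs: int = 4096      # parallel simulation instances
    episode_len_s: float = 20.0

def make_training_run(cfg: TrainConfig) -> str:
    """Stand-in for launching training: the simulation, rewards, and RL
    algorithm are shared; only the robot description differs."""
    return f"training on {cfg.urdf_path} with {cfg.num_envs} envs"

# Swapping quadruped for humanoid is one line of config in this sketch
# (both file names are hypothetical examples).
quadruped = TrainConfig(urdf_path="anymal_d.urdf")
humanoid = TrainConfig(urdf_path="humanoid_v1.urdf")
```

The design point is that the URDF carries all robot-specific structure (kinematics, joint limits, masses), so a sufficiently general pipeline never hard-codes a morphology.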
Personally, I think it's mostly related to the fact that we have very specific expectations of how a humanoid robot should walk. Whereas with a quadruped, if it walks slightly differently from a dog, it's completely fine. But with a humanoid, if it doesn't move the arms in the right way, if it bends the knees too much or walks a little bit sideways, humans have a very strong reaction to that. I saw a tweet touching on the same idea just the other day. It was essentially this humanoid robot that was locomoting like a quadruped, like it was on its back with the arms like this, moving really quickly. And the main thrust of the tweet was that humanoid robots move like humanoid robots because that's our expectation, but that even with that form factor, there are potentially other, more efficient ways for these things to move; they just seem wrong to us. Yeah, that's true. But if we want robots operating around humans, we have to create some trust. So we have to take the less optimal route if it makes humans feel a bit more comfortable. We've been working on robots for a really long time, but it seems like over the past year we're seeing advancements and the video demos coming very quickly. I kind of asked this question before, but I want to have you parse through how to think about these videos. We were seeing humanoid robots walking at the beginning of the year, and now they're running, now they're doing dishes and all these things. What goes into creating a demo like that? And what are the limitations of what it says about what the robot is capable of? First of all, I would say that it's really exciting that the whole ecosystem is moving so quickly. It feels like every single day there is a new video of a new robot doing something, and that's really, really good. We see a lot of progress both in hardware, but also on the AI side, in what these robots are capable of doing.
Having said that, typically when I personally see a demo of a robot doing something, my approach is to think about what is the easiest way to achieve that, and that is typically how it's done. So as an ecosystem, we're showing a vision of what robots should be doing, but behind the scenes it's sometimes a little bit different. For example, if you have a robot standing and doing some sort of manipulation on a table, folding sheets or something, then typically one of two things is true. Either there is someone hiding behind the robot, behind a curtain, somewhere in another room, teleoperating the robot, so it's not really autonomous. That's, let's say, one third of the cases. And the other two thirds are where the robot is actually autonomous, but to get there, 100 people had to teleoperate 100 robots to collect a lot of data, hours and hours, in some cases thousands of hours, of robots doing very, very similar things in probably the same environment. They collect data, train a policy that can imitate that data, and then they're able to deploy robots autonomously. Which is not exactly what you would see when you watch a video, because it feels like robots can just adapt to anything and come to your home and do everything you're doing. On that front, we're just not there yet, but we'll get there soon. Remind me the name of the company that started taking pre-orders for a humanoid robot that is, quote unquote, ready for the home. You're referring to 1X? 1X, yes. I had looked at that, and I've had a lot of conversations along these lines about where robots are, and looking at that, I question, okay, are we really a lot further along than I think we are, or is something else happening here, and the early buyers are going to be beta testers, and it may or may not work once it gets to the house.
Do you have any takes, not on that specific company necessarily, but on the readiness of humanoid robots for the home? Also, 1X is not the only one. There are a few more who announced similar things. I mean, in a way, it's good. It's very, very ambitious to sell robots into people's homes next year. I think they have a big challenge ahead of them. So let's see how far they can get next year, but usually these companies are already honest that it is a beta of a beta, an alpha program. It will be just for early adopters. It will take a few more years before you can really buy these robots and send them into homes. That's partially why, as a company, we're focusing more on industrial use cases. There are a lot of other challenges with industrial use cases, because now suddenly performance is really important. You need to be very fast. You cannot slow everything down, at least in most cases. But you have a little bit more control of what's happening. So, for example, if we want to deploy 10 robots in a new warehouse, it's easier for us to send an engineer for one or two days to check that everything is in order, that the robots operate as they should, fine-tune a few things if they don't, and then let the robots work. Which is not something you can do in everyone's home. And are the tasks in the industrial setting, I'm imagining, more repetitive, more consistent, less variation than "run to the fridge and grab me a Coke"? Yeah, that's right. And we get to decide which tasks we take on and which ones we leave for the future. So we can start with simpler things. A lot of it is moving objects around, moving objects from point A to point B, moving boxes, opening boxes, taking items out, or putting objects into boxes, putting the box in a truck, sending it further. And this seems really within reach for next year, or maybe a year after that.
But even though you might have seen a video of a robot doing that, that doesn't mean that it's ready now and that people are doing it now, outside of a development phase. Is that fair? Yeah, that's fair. My hot take on that, and I'll be happy to be proven wrong, but I think there is not a single humanoid robot today that actually generates value. Meaning there might be a robot that does something fairly close to what it's supposed to do in a factory or in a warehouse, but it's not the exact task. In the end, it's not generating value because it's not doing the actual thing it's supposed to do. Meaning it's doing some variant of the thing, or there's a handler that's fixing up, you know, cleaning up after the robot as it makes a mess. Exactly. And typically you would have more handlers than you had people before. You could argue the value is negative. Once again, we'll fix that. We talked about how you create these demos, and the idea that there's either real-time teleoperation or many, many people doing teleoperation to collect training data. Talk a little bit about what happens after that training data is collected via teleoperation. What is the approach for training? Is that data then used as part of RL, or is it more of a supervised learning type of approach? Typically it is supervised. So you record the data, the images from the cameras and the commands that the teleoperator sends to the robot, which typically are things like how you should move your hands in space and how you should move your fingers. That is recorded, and then a big transformer is trained to produce the same actions from the same images. Now, what's interesting is that the whole field shifted a little bit from just training these transformers from scratch to using vision and language encoders that were pre-trained on internet-scale data. VLMs, off-the-shelf VLMs?
You take a VLM, you remove the output head, and you train a new part of the network on top, and then you call that a VLA, a vision-language-action model, where the vision-language part was pre-trained before and the action part is trained from scratch. Got it, got it. So as opposed to predicting the next language token, you're now predicting an action token, which is then translated into a separate motor motion or something like that. Yeah, exactly. Kind of compare and contrast that approach with what folks were doing before. Are we doing that because it's cool, or because it works? How much does having a pre-trained model to start with save us compared to a generic transformer? It's a very good question. The general thought is that it helps with generalization. Since the language and vision encoders were trained on internet-scale data, they're supposed to generalize. A typical case was that if you don't do that, you would train a robot during the day, and then if the lights go down at night, it won't be able to perform anymore. Even though there's still light and everything should just work, a human wouldn't even see the difference, but because the image embedding changes a bit, the policy loses performance. I believe that this gets better with the pre-trained encoders. To be completely honest, though, the generalization capabilities overall still need to be proven. I think Russ Tedrake had an amazing talk at Stanford, where he was talking about their efforts at Toyota Research Institute, where they were comparing training policies on a very specific task with little data versus training more generalist policies with a lot of data. They were seeing some signs of generalization, but I don't want to quote him directly; it seemed like it's not fully understood yet how much of the generalization is coming from that.
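The VLA recipe described here, keep a pre-trained vision-language backbone, swap its token head for a freshly initialized action head, can be sketched structurally as below. These classes are pure-Python stand-ins under stated assumptions (a 512-dim embedding, a 7-dim action such as a 6-DoF end-effector pose plus gripper); no real VLM weights or library APIs are involved.

```python
class PretrainedVLMBackbone:
    """Stand-in for an off-the-shelf vision-language encoder, kept
    (and often frozen) from internet-scale pre-training."""
    def encode(self, image, instruction):
        # Would fuse image patches and text tokens; here a dummy vector.
        return [0.0] * 512

class ActionHead:
    """Trained from scratch on robot data to map embeddings to actions."""
    def __init__(self, action_dim: int = 7):
        self.action_dim = action_dim
    def predict(self, embedding):
        # Would apply learned weights; here a zero action placeholder.
        return [0.0] * self.action_dim

class VLA:
    """Vision-Language-Action model: pre-trained front, new action back."""
    def __init__(self):
        self.backbone = PretrainedVLMBackbone()  # reused from pre-training
        self.head = ActionHead()                 # replaces the token head
    def act(self, image, instruction):
        return self.head.predict(self.backbone.encode(image, instruction))

policy = VLA()
action = policy.act(image=None, instruction="pick up the toy")
```

The split makes the generalization bet explicit: the backbone carries broad visual and language knowledge, while only the comparatively small head must be learned from scarce teleoperation data.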
Now map those two, little data and lots of data, to the transformer-versus-VLM discussion. Would the VLM be the little data, and the transformer from scratch the lot of data, because we're assuming the VLM was pre-trained? Or is it reversed? It's actually reversed. If you include the pre-training as data that you get for free, then you have a massive amount of data for pre-training, and then you can add less data for fine-tuning the action head. I guess it doesn't matter which one is which, because the results were somewhat inconclusive, is what I'm hearing. We would need to go into more detail, and I'm quoting other people here, so it's a bit hard to say. I think it makes a lot of sense to have pre-trained vision encoders and language encoders, because you don't want to relearn language every single time you want to do something; language is language. By the way, we have amazing VLMs now, so we might as well use them. There's more of a question about the action head. Should you train it on a lot of random data, or just on the thing that you want to do in the end? This is still an open question. One question that raises for me: I just had a conversation where we were talking about how VLMs generally ignore a lot of the visual information and rely more heavily on the language information. It seems like in a robotics scenario that would be even more harmful to what you're trying to accomplish. Do you run into that as a challenge? I've heard the same thing. I haven't seen it in the VLA case. I would guess it's because the robot cannot ignore the visual input; it's the main source of information. What tends to happen, however, is that they ignore the language inputs. If you train the robot to always do the same thing, I don't know, if you have a box with an object inside it and it always has to take the object out, it will completely ignore the language. It will just do the same thing, and try to guess from the image what it's supposed to do.
You mentioned the sim-to-real gap. This has been a known issue for many, many years, and we have been making good progress on closing that gap. Talk a little bit about your experience. What is required today to create a model in sim and have it run in the real world? Are you doing things explicitly, specifically to address the real world, or are the models and the process just better, so you don't really think about that anymore and it just kind of works? You need to do a lot of things very explicitly. The challenge is that to cross the sim-to-real gap, you need to have a very deep understanding of both worlds: of the simulation and how it works, and of the real world. If you want to have a robot that walks around as it should, in sim you need to go very deep. You need to know exactly what's happening between a command that the policy outputs and, all the way down, the torque in the motors. There are typically 10 different layers of transformations, even just in software, of how we go from a high-level command to actual current in the motor. It's very tempting to ignore that, but by understanding every single layer and knowing all the different transformations, you can properly simulate it. This really unlocks better performance. So that suggests the level of simulation that you're doing isn't like you pull up your sim environment, get a generic humanoid robot, train some model, and deploy it. It's like you have a digital twin of your humanoid robot in a high-fidelity simulation environment, and you're training to a very fine level of detail, which sounds very computationally expensive. We're actually much closer to what you described first. We have a very generic simulation environment, but there are some very specific things that are important. One clear example is: what are the torque and velocity limits of a motor?
You cannot expect the robot to do something that is not possible on the real hardware. You need to add those limits, and there are a few more things like that, like what kind of delay you can expect between a command and its execution. So you need to identify a few of those parameters, and we actually do that in what we call a real-to-sim process. We take the real robot, hang it in the air, let it shake a little bit, and collect data from all the different motors. Then we know which important effects we need to identify; we identify them and add them to the simulator. But simulation speed is the most important thing, so you cannot afford to simulate all those different effects, currents, magnetic fields, et cetera. You need to abstract all of it away. That sounded very hard and expensive. My personal take is that you still need to understand those effects even though you're not simulating them. Got it. And so it sounds like the result of that process is not a general model that you could deploy to any humanoid robot, but one that is specific to the humanoid robot for which you did the real-to-sim identification of those key parameters. But because you're able to abstract it out to a handful, or several handfuls, of key parameters, it's relatively easy to onboard new robots. Yeah, and this was also the surprising part, one of the key learnings of this year: switching robots is fairly easy, as long as the hardware performs reasonably well. So now as a company we work with a few different suppliers of robots, and a few different partners as well with whom we're working closely. We've deployed controllers on, let's say, between five and ten different robots. And we see now that making a new robot walk is a few days of work, and it should be less than one day of work once we optimize some of our processes.
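Two of the identified effects mentioned above, torque limits and command delay, are cheap to model without simulating currents or magnetic fields: clamp the applied torque and push commands through a short delay line. A hedged sketch follows; the limit of 80 N·m and the two-tick delay are illustrative values standing in for whatever the real-to-sim identification would produce.

```python
from collections import deque

TORQUE_LIMIT = 80.0   # N*m, would come from hanging-robot identification data
DELAY_STEPS = 2       # control ticks between a command and its effect

class ActuatorModel:
    """Abstracted actuator: a delay line plus a saturation, instead of a
    full electromechanical simulation."""
    def __init__(self):
        # Pre-fill the delay line with zero-torque commands.
        self.buffer = deque([0.0] * DELAY_STEPS, maxlen=DELAY_STEPS)

    def step(self, commanded_torque: float) -> float:
        """Return the torque actually applied this tick: the command from
        DELAY_STEPS ago, clamped to the identified limit."""
        self.buffer.append(commanded_torque)
        delayed = self.buffer[0]
        return max(-TORQUE_LIMIT, min(TORQUE_LIMIT, delayed))

# A 100 N*m request only shows up two ticks later, saturated at 80 N*m.
motor = ActuatorModel()
applied = [motor.step(t) for t in (100.0, 50.0, 10.0)]
```

Because each step is a couple of array-friendly operations, this abstraction keeps the massively parallel simulator fast while still exposing the policy to the effects that matter on hardware.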
Bringing them to a new task is a bit more challenging. This requires more engineering today, and this is what we're focusing on. One of our key metrics is how much human effort is involved in bringing a new robot to a new task. A new robot is very easy today; a new task is something we're working on. And in this context, we've kind of talked a little bit about this, but how specific is a task? Meaning, is a task pick and place, or is a task this robot in this warehouse picking off of this line and placing into these bins? That's a great question. More like pick and place. But there is an interesting concept there, which is that we are trying to leverage the information contained in large VLMs to orchestrate and break down complex tasks into clear sub-tasks. Even though that's not what we're focusing on today, cooking is a great example and a great metaphor. If you wanted to train the robot to cook every single meal on the planet and you said each meal is its own task, you would never finish; the set of tasks is huge. But what you could do is give the recipe to a VLM. You also give it images of what the robot sees. If the recipe says cut a cucumber, the VLM would say: grab the knife, grab the cucumber, and do this sort of motion to cut it. And then you can break it down into much simpler primitives, like cutting things, holding a pan, putting it down somewhere, filling a glass of water, pouring it, things like that. And suddenly the set of these primitives is not infinite anymore. The challenge is that now we need a higher-level intelligence that will orchestrate all these primitives. But what's interesting is that that part is basically solved with the VLM. It's not 100% there, but it's moving much faster than the actual physical interaction of doing all these motions. So can you elaborate on that, the orchestration being solved? If so, how is it solved?
And what's the relationship between that orchestration and what we talk about as the reasoning capabilities of these large models? Is it the same thing, or related? It's similar. Maybe one way to describe this: on our website, we have two videos. We have a video of a robot walking in a forest and picking up trash. This is mostly there to showcase what's possible, and also to play a little bit on our Swiss angle, using our nature. We have another video where the robot is doing the same thing in our office. And in that second video, it's 100% autonomous. So you give it a text prompt. I think we're saying something like: pick up the toys in front of you and drop them in the basket at the end. And we're using an off-the-shelf VLM for that. The way this works is we're giving the images to the VLM, and we're allowing it to do tool use to call specific skills of the robot. So the VLM would say, "Oh, I see a toy there on the ground. Let's walk to the toy." And this "let's walk to the toy" is a skill that is actually triggered and executed by the robot. Once we're there, it will trigger "pick up the toy," then "go to the basket" and "drop it off." And so by having a few of those primitives, which are walking to things, and the walking is locomotion, as we discussed, itself very complicated, you can walk on stairs, you can walk on a bunch of different complex terrains, plus picking things up from the ground and then dropping them somewhere else, we can recombine these in many, many different ways without any retraining, just by prompting an off-the-shelf VLM. When I hear tool use, I hear, like, a separate process or module or model. Is that the case? And if you've got a bunch of tools, a bunch of these separate models or modules, does this architecture imply that they are in fact separate and trained separately, or are they more universal somehow?
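The tool-use pattern described here, an off-the-shelf VLM allowed to call a small registry of robot skills, can be sketched as a skill registry plus an executor for the call sequence the model emits. The skill names, their string return values, and the scripted call list below are illustrative assumptions; in the real system each call would trigger a trained policy on the robot rather than return a string.

```python
SKILLS = {}

def skill(fn):
    """Register a robot skill so the VLM can invoke it by name as a tool."""
    SKILLS[fn.__name__] = fn
    return fn

@skill
def walk_to(target: str) -> str:
    return f"walked to {target}"        # trained locomotion policy underneath

@skill
def pick_up(obj: str) -> str:
    return f"picked up {obj}"           # trained manipulation primitive

@skill
def drop(container: str) -> str:
    return f"dropped into {container}"  # trained manipulation primitive

def execute_plan(tool_calls):
    """Execute the sequence of (skill, argument) calls a VLM would emit
    for a prompt like 'pick up the toys and drop them in the basket'."""
    return [SKILLS[name](arg) for name, arg in tool_calls]

log = execute_plan([("walk_to", "toy"), ("pick_up", "toy"),
                    ("walk_to", "basket"), ("drop", "basket")])
```

The recombination claim falls out of the structure: new behaviors need only a new prompt producing a new call sequence over the same fixed skills, with no retraining of the underlying policies.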
That's another great question. In our case, specifically today, they are separate. But we are actively working on merging them together into one single, more general model. And the hope with that, again, is that you see some generalization, some interpolation between those different models. And the way we would do that is actually by still training those different primitives separately, and then using them as data generators to collect a massive amount of data in simulation, to then train one of those larger VLAs across the whole dataset, so that it can perform everything. We're seeing early results on that in our company. So things are going in that direction, but we still need to prove that this actually leads to the generalization we were talking about before. And here you describe that; often when I hear these kinds of student-teacher approaches, I think of distillation and trying to get to smaller models. You're not necessarily trying to do that, but talk a little bit about the hardware capabilities from a model-inference perspective, and where we are in terms of model size, that kind of thing. In our plan, if we go through that whole process, we train, let's say, 50 of those primitives and we distill everything. We are still developing a hierarchical pipeline where you have three models interacting with each other. It would start with a relatively large, let's say, VLM that would be allowed to reason, and that VLM would go from a very abstract task to clear sub-tasks, such that if you have the robot here and I tell it, go pick something up from the fridge, it would say: turn around, go through the door, open the fridge, grab the thing, close the fridge, etc. Clear instructions.
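The primitives-as-data-generators scheme can be sketched as a simple pipeline. This is a toy illustration under stated assumptions: the "trained" primitives are trivial stub policies, and the final distillation step is represented only by pooling their rollouts into one dataset.

```python
# Sketch of distilling separately trained primitives into one generalist model:
# each primitive policy is rolled out in simulation to generate (task, obs,
# action) tuples, and the pooled dataset would then train a single VLA.
# The policies below are trivial stand-ins, not real trained skills.

import random

random.seed(0)  # deterministic rollouts for this sketch

def make_primitive(offset: float):
    # A "trained" primitive: maps an observation to an action (stub).
    return lambda obs: obs + offset

primitives = {
    "walk": make_primitive(0.1),
    "pick": make_primitive(0.2),
    "place": make_primitive(0.3),
}

# Roll out every primitive in simulation and pool the data.
dataset = []
for name, policy in primitives.items():
    for _ in range(100):  # 100 simulated rollout steps per primitive
        obs = random.random()
        dataset.append((name, obs, policy(obs)))

print(len(dataset))  # one pooled dataset for training the generalist model
```

In the real pipeline the student would be a large VLA trained across this pooled dataset, with the hope of interpolating between the teachers' behaviors.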
Then those clear instructions would go to that VLA we were describing before, where if it receives the instruction "open the fridge," it will plan a kinematic motion for the arm to grab the fridge handle and open it, just a few seconds into the future. And finally, we have what we call a whole-body tracker, which will receive this plan of how the hand should move and how the whole body should move, and will then control the motors of that specific robot to execute that motion. Okay. Now, the sizes and frequencies of these models are very different, and that's why we think it makes sense to have three of them. The final one, the whole-body tracker, is a very simple, very small model. Typically it's a very small transformer; honestly, it doesn't even need to be a transformer, but today everything has to be a transformer. And that can run very easily: you can run it at 50 hertz, 50 times per second, even on the CPU of the onboard computer of the robot, not even the GPU, because it takes more time to send it to the GPU and get it back. Then, I'll skip the VLA for now and go to the VLM. That one is typically fairly hard to run onboard. For now, it's running off-board, either in our office, in a server rack, or even in the cloud, which creates some challenges. Once you want to deploy 100 robots in a warehouse, either you have an amazingly good internet connection, or you have to install server racks in that warehouse. We are hoping that robot compute keeps progressing, so we can finally fit those things onboard the robots themselves, typically on a Jetson. And then the VLA: this is where compute is the most limiting today, because we cannot really put it off-board. It still needs to run fairly fast, let's say 10 times per second, with minimal delay, so it needs to be onboard. And they also typically use diffusion, which means that you don't infer it just once; this part of the network is inferred over multiple steps.
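The three-level hierarchy and its rates can be sketched as nested loops. This is a schematic only: the model calls are stubs, and the exact timing (one second of VLA steps per sub-task) is an assumption for illustration; only the layering and the roughly 10 Hz / 50 Hz rates come from the discussion.

```python
# Sketch of the three-level control hierarchy: a slow, off-board VLM plans
# sub-tasks; a mid-rate VLA (~10 Hz, onboard GPU) turns each sub-task into a
# short kinematic plan; a fast whole-body tracker (~50 Hz, onboard CPU) turns
# that plan into motor commands. All three "models" here are stubs.

VLA_HZ, TRACKER_HZ = 10, 50

def vlm_plan(task: str):
    # Off-board VLM: abstract task -> clear sub-task instructions (stub).
    return ["open_fridge", "grab_item", "close_fridge"]

def vla_step(subtask: str, t: int) -> str:
    # Onboard VLA: sub-task -> short motion plan a few seconds ahead (stub).
    return f"{subtask}@{t}"

def tracker_step(plan: str, k: int):
    # Whole-body tracker: motion plan -> motor targets for this tick (stub).
    return (plan, k)

commands = []
for subtask in vlm_plan("get something from the fridge"):
    for t in range(VLA_HZ):                      # ~one second of VLA steps
        plan = vla_step(subtask, t)
        for k in range(TRACKER_HZ // VLA_HZ):    # 5 tracker ticks per VLA step
            commands.append(tracker_step(plan, k))

print(len(commands))  # 3 subtasks * 10 VLA steps * 5 tracker ticks = 150
```

The rate mismatch is the design point: the tracker must never wait on the VLA, and the VLA must never wait on the VLM, so each level consumes the most recent output of the level above.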
So this is where compute is the most critical: the onboard computer of the robot. Yeah, I hadn't thought about the case of those robots in the home that essentially have their brains in the cloud. I assumed that somehow we were able to get these models small enough to run locally, which seemed like a lot, but just from a latency perspective it's hard to imagine that being particularly tenable, and consistent as well. So some of these models, just to clarify, some of these models can fit on the robot. What we're seeing today is, if you want to do some of this more abstract reasoning, so you give it a very abstract task and then it has to orchestrate something for multiple minutes, there you would really benefit from larger models. Yeah, and I would imagine that you would want even more abstract models in the home, for consumer tasks, than you would require in an industrial setting. Is that true? Yeah, I would say that's true, because in an industrial setting, if the task is repetitive, you can more or less pre-compute those very abstract instructions, or the step from very abstract instructions to clear instructions. In a home, where a human is just telling the robot something, there for sure you need the scale of large models. Are you using off-the-shelf RL environments, or is building the simulation environment for creating these models part of what you're creating? It is a big part of what we're creating. We are not building simulators ourselves; we're using existing simulators, including ones from Nvidia, and we also test and experiment with many others. But one of our key pieces of know-how is how to properly build those simulation environments, and we have our own custom RL algorithms on top to benefit from that as much as possible. Got it. So the simulator and the simulation environment are distinct.
Is the simulation environment, when you say that, the configuration of the environment in the simulator? Like, the simulator is the platform and the simulation environment is the thing that you create for your scenario? Yeah, exactly. Are there just those two levels of abstraction, or three levels of abstraction, I guess? I guess you would add the RL algorithm as a third component that interacts with both. The simulator itself is basically a physics engine and a renderer. And then you have to put a robot in there. If you want it to walk on stairs, you have to create stairs, but you cannot just ask the robot to randomly figure out how to walk on very complex stairs, so you have to create a whole curriculum of difficulties. You would start with very small stairs and progressively make them harder. And the same is true for all sorts of tasks. If we're training a robot to open a door, well, we have to create a simulated version of the door, and then we have to help the robot; we have to figure out this whole training process on top of the scenario itself. And so, as we've talked about this, you've kind of positioned RL and imitation as these two alternatives, but is it also possible to use imitation in conjunction with RL, to kind of bootstrap learning and help the robot figure out stairs more quickly? Is that still a research problem, or is that something that we're able to do in practice now? That's a good question. It's still a research problem, but we are seeing good signs of life, I would say. I can talk about two different ways to combine them. One way is to use a few demonstrations to help the RL process, and this is something we're doing very actively.
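The stair curriculum described above can be sketched as a small update rule: raise the terrain difficulty when the policy succeeds often enough, lower it when it fails too much. The thresholds, step size, and cap below are made up for illustration, not Flexion's actual values.

```python
# Sketch of a terrain curriculum: start with low stairs and promote the
# policy to taller ones only once its success rate is high enough; demote
# it if performance collapses. All numeric values are illustrative.

def update_curriculum(stair_height: float, success_rate: float,
                      promote_at: float = 0.8, demote_at: float = 0.4,
                      step: float = 0.05, max_height: float = 0.30) -> float:
    if success_rate >= promote_at:
        stair_height = min(stair_height + step, max_height)  # harder terrain
    elif success_rate <= demote_at:
        stair_height = max(stair_height - step, 0.0)         # easier terrain
    return stair_height

h = 0.05  # start with very small stairs (meters)
for sr in [0.9, 0.9, 0.3, 0.85]:  # made-up per-iteration success rates
    h = update_curriculum(h, sr)
print(round(h, 2))
```

In practice such a rule usually runs per environment instance in a massively parallel simulator, so different copies of the robot sit at different difficulty levels at the same time.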
So if you have a human showing, like, doing the task, you can extract just a little bit of information from that to help the exploration process of the reinforcement learning, such that the robot is not just randomly shaking and trying to figure out everything from scratch, but you're guiding it a little bit towards the right solution. And a completely different way to approach imitation learning plus RL is what we're seeing a bit more in other companies and in academia, which is doing imitation learning for pre-training, and then adding some flavor of RL on top to try to improve the behavior after the fact. A big question there is: if the imitation learning pre-training was done without any simulator, do you suddenly need to add a simulator again to do the fine-tuning or not? And I think this is still a very open question. You know, when I think back to some of the earliest conversations I had in robotics, with folks like Pieter Abbeel, I remember, I think this was pre-VLM and maybe even pre-transformer, I don't remember, but some of that earliest work involved hundreds of real robots at Google, when they were starting to experiment with RL. The robots were RL-ing in the real world, and it was very expensive; you needed time on a hundred robots. But they didn't have to deal with this sim-to-real gap. You're more focused on simulation and ways to close that gap, but there are still proponents of RL in real life. How do you think about comparing and contrasting those approaches? There are definitely people who are big proponents of RL in real life, or, to put it another way, against simulation at all. In some cases, there are good reasons for that, because some things are fundamentally hard to simulate.
I can talk about a specific task we're focusing on, which involves the robot manipulating cardboard boxes. The robot has to walk, pick up the box, bring it somewhere, put it on a table, open it, take what is inside out of the box, and put it on a shelf, for example. In that case, if you break it down into sub-tasks, most of them are very well simulatable, except one very specific piece of it, which is opening the box. Because, you can imagine, there is maybe tape on the box, and you take a knife and cut through the tape to be able to open it. It's possible, but it's still a lot of effort to simulate the interaction of the tape and the cardboard and exactly how the knife cuts through it. So what we are trying to do there is identify those specific cases where simulation is still limited and use real data, but only for those very specific cases, and then mix it with simulated data of everything else. And we think this is how we basically get the best of both worlds, or get as much simulation as possible. And as simulators develop, they'll take over more and more of the whole set of tasks. But while there is a gap, we'll use real data for those specific cases. You know, for the folks that say that RL in real is better, in what ways would it be better for that particular scenario? It sounds like you would just go through a lot of boxes, but unless you're giving your robot a utility knife, it seems like it's the same problem in real, right? I agree with you. I think it is much harder to do it in real life compared to simulation, especially with reinforcement learning. What you could do is a little bit of imitation learning for that specific case. And that's much easier than letting the robot learn everything from scratch.
I would guess the only argument for pure real life for RL would be that you don't need to deal with simulation, which can be very, very hard, especially if you don't have expertise in-house in how to create those simulated environments and how to tune the simulators to behave nicely. When everything is in real life, well, you already have the perfect simulator in a way. But on the other hand, you have very expensive hardware, and then any failure, and any reset, is way more expensive than in simulation. Yeah, I think it's clear why it's compelling, and also why it is aspirational: the idea that as humans, we don't simulate the world to learn things; we explore in the world, and we learn that way. And so I'd want my robot to be able to do that. But we're nowhere near the sample efficiency in robots as we are in humans, so you would end up breaking a lot of robots and boxes to get there. And then you've only solved one task. I think another interesting point is that the human reward signal is extremely complicated. If you're doing some task with your hands, the amount of information you're getting from all the nerves in your skin, and also from your muscles that are tired, et cetera, is extremely complicated. And we don't have that information at all with the robot; typically, tactile sensing is very primitive. So if the robot is slowly damaging itself, you wouldn't know until a motor breaks. So if you're doing reinforcement learning in real life, getting rid of behaviors that would damage the motors is really hard, because you'll get maybe one event per week; the reward signal is not there. And this is where simulation helps once again, because in simulation, we have perfect information about everything. We can design reward functions that will avoid breaking motors and damaging the mechanics of the robot.
And you know, based on your earlier point about incorporating vision reducing performance, or making it more difficult to converge on a model: one approach to that is, let's just add skin, or let's add additional sensors. But each additional sensor increases the computational burden on these models. Absolutely. Plus, the mechanics make everything more brittle. With cameras, I would say we are there today. We can add cameras to our robots; they're very cheap, they're very reliable. The tactile sensing is just not there today. And thinking about crafting that reward function, talk a little bit about that process. What is key in any type of RL is figuring out what that objective function, that reward function, is. How standardized are they for a given task, or do they vary very widely and require a lot of hand tuning? And maybe as a secondary question: on the language side, or in coding agents, there's a lot of talk now about trying to incorporate value functions that provide signal for positive behavior before the end objective. And I'm curious if value functions are a practical thing in robotics today as well, or just a conversation. Is it research, or is that something that we're using today? Certainly, value functions are part of the RL algorithm itself. So, for example, we're mostly using some variant of PPO, which is an actor-critic algorithm, which means that you're training both an actor and a critic. The critic is basically a value function. It's not used in deployment; it's only used to help the training of the actor during the training process. And now we're seeing some research into how it could actually be used even in deployment. I think this is more on the research side; it's not proven yet. And then about reward tuning, it is a big topic.
We have 35 people in the company, and quite a few of these people still spend hours tuning rewards. And typically this is referred to in a negative way: you don't want people tuning rewards. But I think you have to make a distinction between two types of tuning. There are general rewards that simply come from the task itself. So if we think again about locomotion, how robots walk, you would say the task is just, you know, go from point A to point B. But in reality, it's a bit more complicated. You want the robot to go from point A to point B. You don't want it to use too much energy to do that. You don't want it to hit the ground too hard. You probably don't want it to slip everywhere. You don't want its arms doing completely crazy motions. By the time you describe in text what you actually want, you already have, I don't know, maybe 15 lines. And so that translates to 15 different reward functions that you have to come up with and tune. And my personal take is that that part of tuning is fine. There is another kind of tuning that tends to happen a lot in RL, which is related to exploration. Once you've described the perfect task that you want, how do you guide the policy, the training process, towards that? For example, if you want a robot that opens a door, you might need to tell it: okay, put your hand close to the handle, then close your fingers, then pull on the door. And these are really things that don't scale across tasks. This is something we're trying to avoid as much as possible, and this is why we're working on other techniques where we can use one or two demonstrations from a human to help the learning process instead of all these manually tuned reward functions. You know, we're at the end of the year, and this is kind of a natural time for folks to make predictions.
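The "15 lines of what you actually want" style of reward can be sketched as a weighted sum of per-behavior terms. The term names and weights below are illustrative, not the actual reward used; the structure (one task term plus several shaping penalties) is what the passage describes.

```python
# Sketch of a multi-term locomotion reward: the task term (track the goal
# velocity) plus penalties for energy use, hard ground impacts, foot slip,
# and wild arm motion. Every weight here is a made-up illustrative value;
# tuning these weights is the "fine" kind of reward tuning described above.

def locomotion_reward(obs: dict):
    terms = {
        "track_velocity": 1.0  * -abs(obs["vel_error"]),      # the task itself
        "energy":         0.01 * -obs["torque_sq_sum"],       # don't waste energy
        "impact":         0.1  * -obs["contact_impulse"],     # don't slam the ground
        "slip":           0.2  * -obs["foot_slip"],           # don't slip everywhere
        "arm_motion":     0.05 * -obs["arm_accel"],           # keep arms calm
    }
    return sum(terms.values()), terms  # total plus per-term breakdown

# One step's (made-up) measurements from the simulator:
obs = {"vel_error": 0.2, "torque_sq_sum": 10.0,
       "contact_impulse": 1.0, "foot_slip": 0.5, "arm_accel": 2.0}
total, terms = locomotion_reward(obs)
print(round(total, 3))
```

Returning the per-term breakdown alongside the total is a common practice: it lets you log each term separately during training and see which behavior a weight change is actually trading off.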
Do you have any predictions for the upcoming year, or for whatever horizon you'd like to offer, in terms of robotics? How do you think about the future? You know, I said before that I think there isn't a single humanoid robot providing value in the world today. My prediction is that will change around the end of next year, maybe the beginning of 2027. Again, it's a prediction, and it's hard to say exactly what's going to happen, but I don't think we'll get this, you know, ChatGPT moment where we suddenly have a billion robots everywhere, just because you need to build the hardware. It doesn't scale like getting access to ChatGPT, right? But I would predict that around the end of next year we'll start seeing robots doing actual work. It will be just a few here and there. And then in 2027, 2028, it'll scale, both in the number of robots per task and in the set of different tasks that these robots can do, which means that in the coming years we'll go from, I don't know, hundreds of robots to thousands and then very quickly to millions, tens of millions, etc. And presumably you see that happening first in industrial settings and then consumer? That is my prediction, yes, that we'll see it first in industrial, then with consumers at home. And I hope that after that we can go to crazier applications, like we should really send humanoid robots to Mars to build colonies before humans land there. When you think about the currently available robots that you have seen and/or worked with, are they all kind of the same? Are all the dogs the same, all the humanoids roughly the same, or do you see big differences between them from a hardware perspective? And if so, are there ones that are particularly exciting for you now? There are probably three or four different strategies you can take in terms of how you're designing your humanoid robot, specifically what kind of actuators are used, what kind of gearboxes.
And since there are many companies exploring that space, everything is happening in parallel. No matter which of those strategies you take, there are a few companies in the US, maybe one in Europe, and probably 15 in China building that exact thing. The competition is very fierce. One part where the hardware is not there yet today is on the end-effectors, on the hands. It's still debatable whether you actually need very dexterous hands. I think one of the big reasons why many companies develop hands with high dexterity, so more than 20 degrees of freedom in a hand, is because they're using imitation learning: they're imitating humans, which means that you need to be able to imitate everything a human does. Once you go another route, you can learn other kinds of behaviors with much simpler grippers, but the debate is still open on that. Again, we've been talking about dogs and humanoids, but there's this broader question, which is: is humanoid the best form factor for a robot? Should we be making robots with two arms, two legs? Do you think that's the way to go? Or are we kind of anchored on this because it's our form, but there are better forms that you've seen or think about? Some more great questions. I've worked on more than 25 robots by now, with any number of legs and arms you can imagine, from zero to probably four or five, six legs. There is room for all sorts of robots in the world. Honestly, as a company, we mostly use the word humanoid for lack of a better word. What we mean by that is not the human form factor, but human capabilities. So, very basically, we want robots that can go where humans go, and can manipulate the environment in a similar way to how humans do. So, probably, you would need at least two arms with some sort of end-effector to interact with the environment. And then whether you have legs or wheels, both are fine; both have their own applications.
So, you can go a long way, especially in industry, with a wheeled platform, and we're working with robots like that as well. But I would say it's surprising how quickly wheeled platforms get stuck. A very common thing that is easy to imagine is if the floor is not perfectly flat, if you have some cables or, of course, stairs, your wheeled platform is stuck. But another very important part is also the footprint. With those wheeled platforms, you have two choices. Either you make them very large, and then they're stable by default because they have a very large footprint, but they don't fit through tight spaces anymore. And very quickly, especially in slightly older industrial settings, you have tight spaces. The alternative is maybe some gyroscopic, Segway-like thing. You just have a small platform, which means that you have to be very, very careful how you move the torso on top, because if you lean too far, it just falls over. I haven't seen the gyroscopic platform work yet. I still think a robot like that is probably the coolest robot I've seen so far, but it's maybe not the most applicable for industrial tasks on Earth. And then maybe one more question. There are quite a few robotics kits now, and robots are getting more accessible for folks that are interested in the space and want to play, but don't have access to a humanoid robot, or don't have any robots. What are some cool things that someone can order now, maybe get by the holidays or soon thereafter, meaning not pre-order for 2027, and start playing around with? If you were advising someone who was excited about getting their hands dirty, what would you tell them to start doing? There's an amazing community around Hugging Face and their LeRobot project, where they have very cheap robot arms that can help you learn about the whole teleoperation, data collection, training, and deployment pipeline. That's a really good way to learn about that.
For the more locomotion and reinforcement learning aspect, it's a little bit harder, because you probably want a robot with legs, which also means the robot should be able to fall and stand up without completely breaking. I think the best bet there is the Chinese quadrupeds, which are getting fairly cheap. It's still multiple thousands of dollars, but it's affordable for a university or a school, or if you really want to go much deeper into that on your own as well. You get your quadruped and you unbox it. What can you do with it? Where do you start with trying to do some experiments? When you unbox it, typically it can already do quite a lot, so it will be able to walk. In some cases, they even have things like SLAM pipelines that can do some navigation and avoid obstacles, things like that. But then the challenge is that you want to get rid of all that software and basically recreate it from scratch. There are many communities online, and there are many GitHub repos that help you get started. The Unitree Go2 is probably the most standard platform, so I'd start there. And there are people who open-source everything, from training code to deploying these policies on those robots. Okay, cool, awesome. Well, Nikita, thanks so much for jumping on and sharing a bit about what you're up to. Very cool stuff. Thank you so much. I really enjoyed this. Thank you.
