You're one of the godfathers of AI. What's your view of the path of progress here?
>> Five years, complete world domination. The best way to get breakthrough research is you hire the best people and you get the [ __ ] out of the way.
>> Pardon my French.
>> You shared the Turing Award with two others. When did your views start diverging?
>> In 2023.
>> How do you know it was time to leave Meta? It sounds like you were thinking through some of these things over a period of time.
>> There is a big misconception about my role, my relation to AI and how AI was run at Meta.
>> What's one thing you've changed your mind on in the last year?
>> I mean, the whole idea of...
>> Yann LeCun is one of the godfathers of AI. He's an absolute legend in the field, someone I've admired for a long time, and so it was such a treat to get him on Unsupervised Learning. He's been a noted skeptic of LLMs in many ways, and so we dug into what LLMs can do, what they can't do, some of the limitations he sees, and why he ultimately decided to pursue a different architecture. We also talked about his time at Meta: the things he's proud of in setting up FAIR, how the last few years proceeded, and what ultimately led him to spin out and start his own company, AMI. I think it's just fascinating to get Yann's thoughts on everything happening in the AI ecosystem today, this tension between basic research and pushing LLMs forward, and how that's playing out in a bunch of organizations today, as well as his thoughts on where the whole space is headed. He's an absolute giant in the field, and when I started this podcast, I hoped we'd get guests like him, so it is just such a treat. I think folks will really enjoy hearing the conversation we had. Without further ado, here's Yann.
>> Yann, this is such a pleasure. You're one of the godfathers of AI.
I feel like when I started doing this podcast years ago, I was really hoping we might one day get someone like you on.
>> You know, I don't like that term because I live in New Jersey, and when you're a godfather in New Jersey, it [laughter] doesn't mean the same thing.
>> Very fair, very fair. Obviously, your bet on neural nets when everyone doubted them is legendary, and I feel like today you're making a similar bet in many ways against LLMs and the predominant generative architectures that so many believe in.
You've recently started a new company behind this theme. And so our goal today in the conversation is to leave our listeners with a lot more information about AMI, what you're doing there, some of your work at Tapestry, why you think the rest of the field is pointed in the wrong direction around some of these generative models, and then also just get your reflections on the way the field's unfolded, your time at Meta, and all that. So, modest goals for a single podcast episode. I feel it'd be great to start with AMI because the company feels like the clearest statement of your technical thesis going forward. You recently launched the company, which is focused on world models and scaling the JEPA architecture, which you obviously pioneered over at Meta. And so, I'm wondering if you could talk a little bit about the origins of that architecture and the extent to which you drew inspiration from the human brain and the way that works.
>> So, first of all, I want to say there's nothing wrong with LLMs, in the sense that LLMs are the basis for a lot of very useful AI products that all of us use, including me. They're great, okay, for what they do. They're just not a path towards human-level or human-like intelligence, or even animal-like intelligence. So, that's my claim, okay? I'm not saying LLMs are useless, right? I'm just saying they're not a path towards...
>> I mean, you helped build some of the first major open-source ones.
>> Right. Right.
>> [laughter]
>> Right, absolutely. So, what is AMI? AMI really stands for advanced machine intelligence. And the subtitle, the motto, if you want, is AI for the real world. So, basically, a lot of the AI techniques that people know about today are good for language manipulation, either human language or computer code or mathematics or legalese, which barely qualifies as human language.
>> [laughter]
>> Unfortunately, a lot of human language is for it. Right. Right, sadly.
You know, language is very special in a way, and it's particularly well suited for the type of architectures that have been so successful recently: large language models, GPT-style architectures. But what about the real world? What about understanding the physical world? Turns out reality is way more complicated than language, because it's high-dimensional, it's continuous, it's noisy, it's messy. And training a system to understand the real world is much, much harder. So that's really what we're after. That's what I've been after for most of my career, working on it in an accelerated fashion over the last 5 or 6 years or so and making significant progress over the last 2 years. And so, it made sense to really do a startup around it and go into high gear in pushing that. And it became clear by the end of last year that Meta was really not the right place for that. So, that is why I left and started AMI Labs.
>> I think it's an interesting trend that we're seeing across the board, right? It feels like there are many folks spinning out of either some of the large companies or research labs who have a particular direction of research they're excited about. And you have an interesting vantage point on this from your time at FAIR, on this almost tension that exists in these companies between "go pursue as many different research directions as possible" versus "hey, something's really working, this is the thing that we're going to sell for the next 6 to 12 months, go focus on that." I'm curious about your thoughts on that and what you've seen in the industry at large.
>> Well, it's a strange trade-off. There are really two modes of operating, right? There's a lot of exploratory research, a lot of research directions, right?
And sometimes something seems to work and you need to push it further. And it's not research anymore. I mean, the people working on it are still researchers, or they're called researchers, at least in the press, but really it's becoming more engineering and pushing for products, right? So, that happened a number of times at Meta because of things that were started at FAIR. One such thing happened in early 2023, essentially, when Llama, which was developed at FAIR, Llama 1, was very promising. And Meta created a whole organization, GenAI, to turn it into something real and a series of products, and produce Llama 2, Llama 3, and Llama 4, which was a bit of a disappointment. And because Mark Zuckerberg was disappointed by it, he kind of rebooted the entire organization, reorganized it, hired new people, etc. But what also happened over the last year is that the company, Meta, realized that they'd fallen behind a little bit, and so that refocused the strategy on trying to catch up with the industry. And the sad side effect of it is that a lot of the exploratory research was basically not given high priority anymore. I mean, it didn't concern the stuff I was working on, all the JEPA and world models, because Mark himself and Andrew Bosworth, the CTO, and a bunch of other people in the company were really interested in that project and really believed in its long-term impact. But the rest of the company was just entirely focused on LLMs, which made it clear to me that Meta was really not the right place to push on that project anymore. And then we started to have good results, and so it was clear that we had to make that transition between research and actually developing the technology, scaling it up, and building products out of it.
And we realized also that most of the applications were probably for things that Meta was not particularly interested in. A lot of the applications of the kind of stuff that we've been working on are in industry, like the manufacturing industry and stuff like that.
>> Obviously, you're pursuing world models in that broader world. And I think there are other people that have come at the world model space from a more generative approach. You've got the Google folks and Genie in the video models. You've got folks building VLAs on the robotics side. You've got Fei-Fei and the 3D spatial models. As you think about the body of evidence that got you excited about the JEPA models and how you compare them to what the generative folks have done, where do you think we are today in terms of comparing these architectures and approaches?
>> Okay, so world model is quickly becoming a buzzword right now, right? Certainly in research, but also in industry to some extent. And then there are two factions, if you want. I'm not going to talk about VLA because VLA is clearly now being seen as not going anywhere. Like, it's really not working. So VLA is vision language action models, right? So basically you use the LLM technology to train a system to produce actions for controlling a robot or something like this, right? So you have vision in, language in, action out. Maybe language out, too. And that's pretty much now seen as a failure.
>> [laughter]
>> Not being reliable enough, requiring too much training data, things like that. Okay, then there is world models. Okay, so what is a world model? A world model, at a general level, is something that allows an agentic system to anticipate the consequences of its own actions. Okay, predict the consequences of its own actions. From my point of view, I cannot imagine how you can even think of building an agentic system without that system having the ability to predict the consequences of its actions. That's pretty essential, right? When we act in the world, we have this ability. And when we take an action without thinking about the consequences, we're taking a big risk. And very often, other people think we're an idiot. We have plenty of examples on the international political scene at the moment of people who have no ability to predict the consequences of their actions. So, that's the world model. That's what it is, right? Ability to predict the consequences of your own actions. If you have this ability, then you can plan a sequence of actions to accomplish a task, to satisfy a goal. And you do this by planning, reasoning, by a process of search and optimization. You don't do this by predicting one action after the other autoregressively, like an LLM would do. You do this by searching for a sequence of actions that will accomplish the task you set for yourself. So, the blueprint for this is completely different from what LLMs can do at the moment. LLMs do not have the ability to predict the consequences of their actions, and they do not have any planning abilities, because inference is by predicting the next token, right? It's not by search. Okay, so right there, you have the two characteristics that I think are essential for intelligent behavior. First, ability to predict the consequences of your actions.
And second, ability to plan by optimization, by search: find a good sequence of actions that will produce the correct outcome. And then there is a third characteristic, which is how do you predict the consequences of your actions? Okay, so, say I have a water bottle in front of me. I realize some people will just listen to this and not have the picture. So, I have an open, uncapped water bottle in front of me. If I push at the bottom, it's going to slide on the table. If I push near the top, it's probably going to flip. We can't predict exactly how the bottle will fall, in which direction. We can't exactly predict how it's going to slide, how the water will spill, whether the table is tilted one way and the water will flow in one direction or another. There's no way we can predict this at the pixel level. So, our mental model of the world predicts that at an abstract level of representation.
>> So, as you were working on this architecture, was a lot of it inspired by the human brain? I mean, obviously, the way you're articulating things is exactly how we do things.
>> Right, or at least by cognitive science, right? Whether you can translate this into a neural architecture and things like this, there's a big gap there. Okay, so certainly cognitive science was a bit of a motivation, or what psychologists call System 2, which is this idea that the way you behave in deliberate, reflective behavior is that you imagine, you predict the consequences of your actions, and you plan accordingly. Contrary to System 1, where you just act reactively and instinctively. So, yeah, there is an inspiration, but also there is a lot of empirical evidence that you don't want to generate pixels.
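The planning-by-search idea described above can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not anything described in the conversation: the state is an integer position on a line, `predict` is a hypothetical stand-in for a learned world model, and `plan` does a brute-force search over action sequences rather than the optimization a real system would likely use.

```python
from itertools import product

# Hypothetical toy setup: the state is a position on a line, and each action
# moves it by -1, 0, or +1. None of these names come from the conversation.
ACTIONS = (-1, 0, 1)


def predict(state: int, action: int) -> int:
    """Stand-in world model: predict the next state given an action."""
    return state + action


def rollout(state: int, actions: tuple[int, ...]) -> int:
    """Imagine a whole action sequence by chaining one-step predictions."""
    for action in actions:
        state = predict(state, action)
    return state


def plan(state: int, goal: int, horizon: int) -> tuple[int, ...]:
    """Search every action sequence of length `horizon` and return the one
    whose predicted outcome is closest to the goal (ties: least movement)."""
    def cost(actions: tuple[int, ...]) -> tuple[int, int]:
        distance = abs(rollout(state, actions) - goal)
        effort = sum(abs(a) for a in actions)
        return (distance, effort)

    return min(product(ACTIONS, repeat=horizon), key=cost)


best = plan(state=0, goal=3, horizon=4)
print(best, "-> predicted final state:", rollout(0, best))
```

The point of the sketch is the control flow: the action sequence is chosen by evaluating predicted outcomes against a goal, not by emitting one action after another. Exhaustive search only works here because the toy action space is tiny.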
Okay, I've been really interested in that problem of learning models of the world by prediction for a very long time. And then I had an epiphany about 5 years ago, realizing that all of the architectures that have been successful at learning representations of images and videos are non-generative architectures. And all the generative ones basically have been failures, right? So, VAE, right? Variational autoencoders, or autoencoders more generally, are kind of a natural way to think about learning abstract representations of inputs, right? So, you put an image at the input of a neural net, and then you train it to just reproduce the input on its output. Now, with a big neural net, if you just do it this way, your neural net will not do anything interesting. It will just learn the identity function. Completely uninteresting. It doesn't work. Now, if you train a VAE to learn representations of images, you get something, but it's really not that great. Same with sparse autoencoders.
Then, you have another set of techniques, and it's kind of derivative of something called the denoising autoencoder. Masked autoencoder is a version of this. BERT is a version of this for NLP. So, you take the image, you corrupt it in some way, and then you train this big neural net to recover the original image. There was a huge project at FAIR on this called MAE, masked autoencoder. It was very disappointing. A lot of computation, and not really a great, satisfying result. Simultaneously, some of the same people working on MAE, and some other people in Paris and in New York, were working on other techniques using a non-generative architecture, a joint embedding architecture. So, take an image, corrupt it in some way, and then run the two images through encoders, and then try to predict the representation of the original image from the representation of the corrupted one. That's JEPA. Okay. So, JEPA means joint embedding predictive architecture, right? So, you have one encoder that makes an observation, another encoder that makes a different observation. You try to predict the representation of the first one from the second one with a predictor. And those techniques turned out to work much better for representing images and video. So, things like DINO: DINO v1, v2, v3, a project that is still going on at FAIR in Paris. Projects like I-JEPA and then V-JEPA, and before that there were SimSiam and MoCo and a bunch of different techniques, mostly from Meta. There were a bunch of others from other groups. But that turned out to be a much better way of learning representations of images than predicting pixels. And so it just clicked in my mind, and not just mine, that this was the way to go and predicting pixels was kind of a losing proposition.
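The contrast drawn above, reconstructing pixels versus predicting a representation, can be made concrete with a small sketch. This is only an illustration under loud assumptions: the linear "encoders", the weight values, and all function names are hypothetical, no training happens, and practical joint-embedding methods also need some mechanism to avoid collapsed (constant) representations, which is omitted here.

```python
# All names, sizes, and weights below are hypothetical, chosen for illustration.

def linear(x, W):
    """Apply a linear map: each row of W produces one output value."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def corrupt(x, mask):
    """Hide the masked entries of the input by zeroing them out."""
    return [0.0 if hidden else xi for xi, hidden in zip(x, mask)]


def sq_error(a, b):
    """Sum of squared differences between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def generative_loss(x, mask, W_enc, W_dec):
    """Masked-autoencoder style: reconstruct the original input itself."""
    code = linear(corrupt(x, mask), W_enc)
    reconstruction = linear(code, W_dec)
    return sq_error(reconstruction, x)  # error measured in input ("pixel") space


def jepa_loss(x, mask, W_enc, W_pred):
    """Joint-embedding predictive style: predict the representation of the
    original input from the representation of the corrupted one."""
    target = linear(x, W_enc)                  # representation of the original
    context = linear(corrupt(x, mask), W_enc)  # representation of the corrupted view
    predicted = linear(context, W_pred)        # predictor acts on representations
    return sq_error(predicted, target)         # error measured in representation space


x = [0.5, -1.0, 2.0, 0.25]
mask = [False, True, False, False]
W_enc = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]      # 4 inputs -> 2 features
W_dec = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]  # 2 features -> 4 outputs
W_pred = [[1.0, 0.0], [0.0, 1.0]]                         # identity predictor

print("pixel-space loss:", generative_loss(x, mask, W_enc, W_dec))
print("representation-space loss:", jepa_loss(x, mask, W_enc, W_pred))
```

The only structural difference between the two functions is where the error is measured: the generative loss compares against the raw input, while the joint-embedding loss compares against an encoded version of it, so details the encoder discards never have to be predicted.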
>> You know, it feels like there are all these robotics demos released by some of the model companies that feel increasingly impressive and maybe seem to resemble things like planning and reasoning, when they maybe haven't seen a room or a specific version of a task before and are still able to execute that task.
What would you say to our listeners, I guess, who observe that stuff and feel like we're trending toward some real progress with some of the general approaches?
>> Well, there is real progress, and some of those demos are really impressive. But [laughter] they are trained with enormous amounts of data collected either from teleoperation or from human action with things you hold in your hand that look like grippers, and you collect the data for that. Or just tracking the hands and fingers of a person.
And then translating this into commands for a robot. And so those things are trained with imitation learning mostly, right? And a little bit with reinforcement learning to fine-tune, mostly in simulation. So the issue with this is that you need a lot of data to train the systems by imitation. And it becomes expensive, and it's a little brittle in the sense that you need to collect lots of data for every task you want the robot to solve. Whereas, if the system had a world model that allowed it to predict the outcome of an action, it would just plan an action to solve a new task without actually having to be trained to accomplish this task. So, the degree of generalization you would get with a world-model-based system is much, much larger, a wider spectrum of tasks with less training data required, than a system trained with imitation learning and fine-tuning.
>> No doubt those approaches require more data. And I guess this question of generalization really is the big question, right? I think some folks have shown some results where getting better at task A helps with task B, but that obviously feels like it's still the big unanswered question around those architectures.
>> I mean, you get this synergy between tasks. So, the more tasks you train the system to solve, the more tasks it's going to be able to acquire with a small amount of data, regardless of what technique you use. But the hope with world models is that the system can solve new tasks zero-shot, which humans are completely capable of doing, right? And many animals as well. So, that's really the hope.
Like, solving a lot more problems with either a small amount of training data or no training data at all. And just a little bit of maybe RL-style fine-tuning. Like, how is it that a 17-year-old can learn to drive in a dozen hours or maybe 20 hours? We have millions of hours of training data of people driving cars. We still don't have level five self-driving cars, right? So, imitation learning obviously does not work even for just the task of autonomous driving.
>> Yeah, I guess it'll be a race between the ability to develop some of those capabilities, which may take time and lots of data, versus this kind of architecture. I feel like there's this dream of using video models to just generate tons of synthetic data for simulation, and even if these video models aren't perfect from a physics perspective, it's helpful enough to improve robotics in the underlying physical world. What have you made of some of those approaches? Obviously, I think Nvidia's been focused there. Google seems to be going down that road.
>> I'm sort of asking you the question again: why can a 17-year-old learn to drive in 20 hours? You don't need millions of hours of demonstration. And you don't need synthetic data. You don't need any of that. So, I want a system that can learn as fast as that. If we crack that, then we don't need generated data, right? I mean, we might need to train the system in simulation, but not with the same amount of time or trials as current systems require. It's really a question of data efficiency.
>> You know, I was interviewing Jerry Tworek on the podcast.
He was at OpenAI and spun out to start his own lab, and you could sense a similar tension, where I think he actually might even agree that if you continued scaling RL the way we're scaling, you continue getting very impressive results. But I think he felt, "God, there's just got to be some way more efficient way to do this." And it's an interesting tension, because you could imagine if you're OpenAI and you know something is going to continue, you could continue scaling it and it will keep getting better. There's not a ton of incentive, necessarily, from a business perspective to do something more data-efficient.
>> Right. And there's no incentive for the other companies to do anything different either, because they're all chasing the same thing. They can't afford to fall behind the others, right? So, they all work on the same thing. And there's a bit of this herd behavior, mostly in Silicon Valley, where everybody is digging the same trench. And so I purposely set up the headquarters of AMI Labs in Paris. [laughter] The American office being in New York, not Silicon Valley. [laughter]
>> It's really interesting because I think it points to a tension that exists in the broader ecosystem today, where you could imagine the other side being: sure, maybe there are more data-efficient methods out there, but almost who cares, because we can keep scaling what we have to better and better results. And then obviously, I think both from the new things you can accomplish with these models as well as just the joy of being a researcher and finding these new things, I get why there's such an attraction to these other architectures as well.
>> And it's a bet. But we're pretty confident because we have results already, actually.
>> And as you think about the initial spaces you're most excited about for the AMI technology, where do you think the technology goes and what are you most excited about?
>> Well, I mean, AI for the real world. Like, where is your domestic robot? Where is your level five self-driving car? Where is...
>> When am I going to get a domestic robot? I'm excited about this.
>> Well, so this is several years down the line. Okay? Despite the fact that there is a huge number of companies building robots, none of those companies actually has any idea how to make them smart enough to be useful, right?
>> Or trusted around a baby in the house or something.
>> Certainly not that. But even for relatively narrow manufacturing tasks, right? None of them really knows how to do this reliably other than by imitation learning for a small number of tasks. So, how do we make those things useful? That's a relatively long-term objective. Shorter term, there is a huge number of applications in industry where you need an intelligent system that has the ability to predict "what's going to happen if I change this control variable on this complex system?", be it a jet engine, a chemical plant, a power plant, some manufacturing line, a patient, a human cell, right? Those are systems that are sufficiently complex that you can't model their behavior with a small number of equations. Right? So, the traditional way of modeling does not work. And what you need to do is train a neural net, a deep learning system, to model the dynamics of that system from data. And what you get at the end is a phenomenological model of that process, of that system.
And if it's action-conditioned, then you get basically a world model of that system that allows you to control it optimally for whatever purpose you have. And I think the number of applications of this in industry is mind-boggling.
>> Where do you think we'll be with general models over the next couple of years? Are there milestones you'd point to, or what's your view of the path of progress here?
>> Okay, a couple of years is a little short. Like, 5 years, complete world domination, essentially. [laughter]
>> Okay. So, somewhere on the path to world domination in 5 years.
>> I mean, this is kind of a joke, obviously, but this is a quote from Linus Torvalds, right? You know, when people asked him, "What's your goal with Linux?" he said, "Total world domination." [laughter] He actually managed to do that.
>> Yeah, very fair.
>> To a first approximation, every computer in the world runs Linux, right? So, that's kind of a joke.
But in the end, I think this is the blueprint for intelligent systems of the future. There will still be a small place for LLMs, for a language interface, basically. But what we're designing are systems that are capable of thinking. They may not be capable of talking or listening initially, but they'll do the thinking. And then you can add the talking and listening on top of that.
>> I'm sure you and the team are eagerly working to get the early proof points of this. And obviously, you've already had some in the work you've done. How do you think about the interim steps of what you'll be able to show on that path to 5-year world domination?
>> Well, so I think within a year or so, we'll have a general methodology to train hierarchical world models on a very wide variety of modalities. We know we can do a good job on video with some techniques that we're not completely happy with because they have some shortcomings, and we have a small-scale demonstration of a methodology that we think is really what we want. So, we need to scale that one up and get it to the same level of performance as the other techniques that are not as satisfying, if you want, on things like video, but also on other types of data sets that we would get from industry partners. Okay, so we'll have demonstrations that we can train world models, perhaps action-conditioned world models, that allow us to plan for a number of different use cases. Some of them will be robotics, some of them will be industrial process control of various types, maybe some of them in health care as well, because we have partners in that domain. And that should be within a year or two, 18 months.
And then we'll push this methodology and those models into those use cases with partners, some of which are investors in our company already, and gain experience on how to essentially build a somewhat universal world model, if you want.
>> I mean, you've obviously had this experience before of making this really contrarian bet on neural nets and being proven abundantly right in the history books. I guess as you think about this bet, which I think the majority of people at the cutting edge of various parts of AI would maybe say is contrarian today, in what time frame do you think it will become apparent that this was right?
>> I think it'll happen faster than expected, perhaps, because you can see that world model is already becoming a buzzword, right? At least at the research level. And it's starting to permeate into the industry. And a lot of people are realizing that VLAs suck and LLMs don't work for real-world data. Industry has realized this already, certainly on the user side. And I think because of the importance of the robotics industry, a lot of people are trying to figure out: how do we get there? How do you make those robots useful? So I think the realization that you need a change of paradigm is happening as we speak and will become completely obvious to people by early 2027, I think. Now, that doesn't mean we'll have a solution by then. We hope we will, but we'll see.
>> I guess, switching gears to the LLM side, you mentioned some of this work you're doing with Tapestry, which I think would be really interesting for our listeners. And so maybe speak to that a little bit.
>> Okay, so this is a little bit orthogonal to AMI Labs.
>> Yeah, as if that wasn't enough to keep you busy. [laughter]
>> Well, it's an idea I've been forming over the last three years or so: the fact that people increasingly use AI assistants for various things, right? I mean, you see a decrease in the use of traditional search engines, and you just ask a question to your favorite AI assistant. And if the plan that Meta and others are developing, of having smart devices like smart glasses and stuff like that, is realized, basically you'd just be talking to your AI assistant by voice, through your smart glasses or maybe some other smart device.







