I think we're probably looking at AGI around 2030, about the time that we're going to be releasing maybe ARC 6 or ARC 7. You're not going to stop AI progress. I think it's too late for that. And so the next question is: okay, AI progress is here. It's actually going to keep accelerating. How do you make use of it? How do you leverage it? How do you ride the wave? That's the question to ask.
>> Today we're lucky to be joined by Francois Chollet, founder of the ARC Prize, a global competition to solve the ARC-AGI benchmark. His latest project is Ndea, a lab exploring a new paradigm in frontier AI research. Francois is one of the best people in the world to help us understand the current AI moment and where all of this is going. Francois, thank you so much for joining us today, and congrats on the launch of ARC-AGI V3.
>> Thanks so much for having me. I'm super excited to be here. These are super exciting times to talk about AI.
>> So Francois, tell us a little bit about Ndea. What exactly is it, and what are you guys trying to achieve?
>> Right. So Ndea is this new AGI research lab, and we are trying some very different ideas. Our goal is basically to build this new branch of machine learning that will be much closer to optimal, unlike deep learning.
>> All of us right now are sort of taken by what's going on with code. I have sort of this viral moment right now where I got to 40,000 stars this morning on gstack. So it's like, oh, this is an open source project that is now one of the biggest ones, and I have more than 100 PRs from contributors to deal with. I guess you're one of the best people to talk to about this, because you're literally coming up with something that is a totally different pathway.
>> That's right. So what we're doing at Ndea is program synthesis research. And when I talk about program synthesis, often people ask me, "Oh, so are you doing like code gen? 
Are you building an alternative to coding agents?" And that's actually not at all what we are doing. We are working at a much, much lower level than that. What we are actually trying to do is build a new branch of machine learning, an alternative to deep learning itself, rather than to coding agents. Coding agents are this very, very high-level last layer of the stack, and we are trying to rebuild the whole stack on different foundations. So we are building a new learning substrate that's very different from parametric learning, from deep learning. If you go back to the problem of machine learning, you have some input data, some target data, and you're trying to find a function that maps the inputs to the targets and will hopefully generalize to new inputs. If you're doing deep learning, you have this parametric curve that serves as your function, as your model, and you're trying to fit the parameters of the curve via gradient descent. And this is basically what we are doing, except we are replacing the parametric curve with a symbolic model that is meant to be as small as possible. It's the simplest possible model to explain the data, to model what's going on. And of course, if you're doing that, you cannot apply gradient descent anymore. So we are building something that we call symbolic descent, which is the symbolic-space equivalent of gradient descent. The idea is to build this new machine learning engine that gives you extremely concise symbolic models of the data you feed into it, and then we are going to make it scale. Everything you're doing with machine learning today with parametric curves, we should be able to do with symbolic models in the future, in a way that will be much, much closer to optimality. Closer to optimality in the sense that you're going to need much less data to obtain the models. 
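The contrast between fitting a parametric curve and searching for a small symbolic model can be sketched in a few lines of Python. This is only a toy under stated assumptions: the data, the three-symbol grammar (`x`, `1`, `2`, `+`, `*`), and the function names are all invented for illustration, and the symbolic side is plain brute-force enumeration by expression size. It is not symbolic descent, whose details are not described in this conversation.

```python
import itertools

# Toy data from the hidden rule y = 2*x + 1 (an assumption made for this example).
xs = [0, 1, 2, 3, 4]
ys = [2 * x + 1 for x in xs]


def fit_line_by_gradient_descent(xs, ys, lr=0.01, steps=5000):
    """Parametric route: fit y = w*x + b by minimizing mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def expressions_of_size(size):
    """Yield (text, function) pairs for every expression with `size` nodes."""
    if size == 1:
        yield "x", lambda x: x
        yield "1", lambda x: 1
        yield "2", lambda x: 2
        return
    # A binary operator uses one node; the rest is split between two children.
    for left_size in range(1, size - 1):
        right_size = size - 1 - left_size
        for (lt, lf), (rt, rf) in itertools.product(
            list(expressions_of_size(left_size)),
            list(expressions_of_size(right_size)),
        ):
            yield f"({lt} + {rt})", lambda x, lf=lf, rf=rf: lf(x) + rf(x)
            yield f"({lt} * {rt})", lambda x, lf=lf, rf=rf: lf(x) * rf(x)


def shortest_symbolic_model(xs, ys, max_size=7):
    """Symbolic route (brute force): smallest expression that fits every point."""
    for size in range(1, max_size + 1):
        for text, fn in expressions_of_size(size):
            if all(fn(x) == y for x, y in zip(xs, ys)):
                return text, fn, size
    return None


if __name__ == "__main__":
    print("parametric fit (w, b):", fit_line_by_gradient_descent(xs, ys))
    text, fn, size = shortest_symbolic_model(xs, ys)
    print("symbolic model:", text, "with", size, "nodes")
```

The parametric fit returns two floating-point numbers that approximate the rule, while the symbolic search returns an exact, human-readable expression. Enumeration like this blows up combinatorially as expressions grow, which is the scaling problem any practical symbolic learner has to address.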
The models are going to run much more efficiently at inference time because they're going to be so small. And because they're so small, they will also generalize much better and compose much better. You know the minimum description length principle: the model of the data that is most likely to generalize is the shortest. And I think you cannot find a model like this if you're doing parametric training. You need to try something else.
>> That's fascinating. So the rest of the industry is just pouring more and more billions of dollars into an approach that was set years ago. Can you help make the case for why you think it's the right thing to explore alternative approaches instead of just continuing to put more money into the current approach?
>> I mean, everybody is building on top of the LLM stack these days, which makes sense because the returns are there. It's actually working. So it would seem very sensible for everybody to just be doing what seems to be the currently most productive path. But I think it's actually counterproductive to have everybody working on the same thing. I personally don't think that machine learning or AI in 50 years is still going to be built on this stack. I think this is a stack that is very nice. Maybe it even gets us to AGI. But it's not as efficient as it should be. I think it's inevitable that the world of AI will trend over time towards optimality. And so I'm trying to sort of leapfrog directly to optimality, to build the foundations of optimal AI today. But in general, our vision is very ambitious, and I'm not saying that we're going to be successful. We have maybe a 10 or 15% chance of success. But that is enough that it's worth trying, right? 
And I think in general, among listeners, if you have a big idea that has a very low chance of success, but if it works it's going to be big, and no one else is going to be working on it, right? It's not something popular. If you don't do it, no one else will. And this is basically our situation. If you're in this situation, then you should take the chance. You should go and work on it.
>> I mean, that's almost like the mission statement of Y Combinator, the thing that you just said.
>> Yeah, the reason it's important is that, again, if we don't do it, no one else will do it, right? So it's worth trying even if we don't succeed.
>> It's worth trying.
>> Has the success, very specifically, of the coding agents built on top of the LLM stack surprised you at all, in particular over the last 6 months or so?
>> Yeah, absolutely. I think they surprised many people, and they definitely surprised me. If you look at why everything is starting to work so well with coding agents, it's really because code provides you with a verifiable reward signal. And I think right now we're in this situation where any problem where the solutions you propose can be formally verified, where you can actually trust the reward signal and it's not just some guess made by a model, any domain like this can be fully automated with current technology, with the LLM-based stack. Code is sort of the first domain to fall, but there will be many others in the future. I think mathematics is also primed to see a revolution in the next few years for the same reason, again because the domain just gives you verifiable rewards.
>> I guess the challenge for a formally verified domain is that you have to somehow take a domain and make it verifiable, which is the trick. I mean, code is very natural. You can test, there are bugs, it compiles, etc. 
And mathematics as well, where all the theorems and proofs work out. I guess it becomes more nebulous when you go a couple of degrees off, to fields that are not naturally formally verifiable. You need to come up with some sort of function to produce that reward and make the domain verifiable. With very fuzzy things like, let's say, English language and composing the perfect essay, how do you make that formally verifiable?
>> Yeah, absolutely. I mean, writing essays is the typical example of a domain that's not verifiable. And so what you're going to see is that the progress of reasoning models and base LLMs on this type of domain is going to be very slow, because the stack we're using, the LLM stack, is very, very reliant on its training data. It's basically just operationalizing the training data. And for writing essays, the training data comes from human experts annotating answers. That's costly. So you're going to see this very, very slow progress. Maybe it's even going to stall. But for any verifiable domain, take code for instance, the big unlock was when people started creating these code-based training environments for post-training, where the reward signal, the verification signal, is provided by things like unit tests. That means the model was not just working from human-provided annotations. It was actually trying its own things, verifying the answers, and generating a lot more training data in the process. So you get a much denser coverage of the problem space. And not just coverage in terms of whether the answer is right or wrong, but also starting to build models of the execution traces, right? So the models could start incorporating an execution model, very much the way human programmers, when they look at code, are sort of executing the code in their minds. 
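A verification signal "provided by things like unit tests" can be illustrated with a very small Python sketch. Everything here is an assumption made for the example: the function name `verifiable_reward`, the binary 0/1 reward, and the toy `add` task. Real post-training environments are sandboxed and far more elaborate.

```python
def verifiable_reward(candidate_source, function_name, test_cases):
    """Return 1.0 only if the candidate passes every test case, else 0.0."""
    namespace = {}
    try:
        # Toy only: never exec untrusted code outside a sandbox.
        exec(candidate_source, namespace)
        fn = namespace[function_name]
        for args, expected in test_cases:
            if fn(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions, and crashes all earn zero reward.
        return 0.0
    return 1.0


# Hypothetical task: implement add(a, b).
TESTS = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a * b\n"
broken = "def add(a, b) return a + b"

if __name__ == "__main__":
    for label, source in [("good", good), ("bad", bad), ("broken", broken)]:
        print(label, verifiable_reward(source, "add", TESTS))
```

The point of the sketch is that the reward comes from running the code, not from a human annotation or another model's guess, so a system can generate and score as many attempts as it can afford.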
They keep track of the values of variables and so on. That's also what the models are trying to do now, and this is why it's working so well. And it's possible because you're working with a fully verifiable environment. You cannot do that with essays. You cannot do that with, you know, law or many other problems.
>> I really like how you define intelligence and how to measure it, which brings us to having you share the history of ARC-AGI.
>> Yeah, so my definition of general intelligence. Many people around the industry these days say AGI is going to be a system that can automate most economically valuable tasks. To me, that definition is about automation. It's not about intelligence. It's not about general intelligence. My definition is that AGI is going to be a system that can approach any new problem, any new task, any new domain, and make sense of it, model it, and become competent at it with the same degree of efficiency as a human could. Meaning it's going to need basically the same amount of training data and training compute as a human would, which is very little. Humans are really, really data efficient. So general intelligence is human-level skill-acquisition efficiency over the same scope of tasks that humans could potentially learn to do.
>> Do you think it's possible that we will accomplish the first definition of AGI, automating most economically useful work, before we accomplish your definition?
>> Absolutely. I think that's the trajectory that we're on right now. And I think it's already true that, in principle, current technology can fully automate at human level or beyond any domain where you have verifiable rewards, right? Code being the first one. 
And I think figuring out AGI, figuring out human-level learning efficiency over arbitrary tasks, is probably going to take a different sort of technology, a different mindset, a different approach.
>> Do you think that LLMs can be bent to have the same sample efficiency as humans, or do you think that's fundamentally impossible and we need a new approach, and that's the thing you're hoping to solve?
>> With enough compute, everything starts looking like everything else. Every approach starts looking the same. And I think it's possible in principle to build something that looks a lot like AGI on top of the LLM stack. But it's not going to be LLMs per se. It's going to be a new layer, perhaps even a few layers above, not just one. But you can build it on top of LLMs, because LLMs are a kind of computer, right?
>> Exactly.
>> I do believe, however, that this would be the wrong thing to do, because it would be very inefficient. I think AI research will have to trend towards not just efficiency but in fact optimality over time. And for this reason, future AI in a few decades is not going to be this harness on top of a reasoning model on top of a base LLM. It's going to be much, much lower level than that.
>> To Diane's question, do you want to talk about how you actually designed ARC-AGI and why it's a good barometer of that?
>> I mean, I've been doing deep learning for a very long time, and initially my take, my mindset, was that deep learning was going to be able to do everything.
>> You were the creator of Keras, before all the other frameworks became very popular.
>> That's right. I was training deep learning models for natural language processing, in fact, in 2014, and from that work I started developing this open source library, which I released exactly 11 years ago, in March 2015. So that was Keras. Then it got popular, and I ended up doing less of the research that I had started Keras for and more work on the framework itself, just because it had really, really good product-market fit. And so my take around that time, around 2015, 2016, was that deep learning was extremely general, that you could do everything with deep learning, that you didn't need anything else. It was Turing complete. So my take was basically that deep learning was differentiable programming: anything you would do with software, you could in principle train a deep learning model on the right inputs and outputs to do the same thing. And in 2016, I was doing research at Google Brain on trying to train deep learning models to help with reasoning problems, in particular first-order logic problems, theorem proving, and so on. And I started finding that you could not really get gradient descent to encode reasoning-style algorithms. It was not because the models could not represent these algorithms. It was because gradient descent could not find them, right? So it wasn't about deep learning not being Turing complete or anything like that. That was not the problem. The problem was gradient descent. Gradient descent would not find generalizable programs. 
It would instead end up doing overfit pattern matching over sequences of input tokens.
>> So I guess people could argue that's what's happening.
>> I mean, this is still what's happening today, in a slightly...
>> It's just a slightly higher-level version of that.
>> With a lot of data. So it doesn't feel like overfitting, because the data covers a lot more of the distribution.
>> Yeah. With a lot more data, and also I think models today are a lot more compressive. That's why they generalize better.
>> All models are wrong, but some models are useful. And I guess what I'm hearing is that your method might find the right model.
>> That's right. That's where the idea came from. At the time, back in 2016, 2017, I was like, "Okay, we're going to need a benchmark to capture these ideas. We're going to need a program synthesis benchmark." And my mental model for that was ImageNet.
>> Mhm.
>> I was like, "Oh, I'm going to make the ImageNet of reasoning." So I started brainstorming a few ideas around 2016, 2017. I explored many different things. In particular, I tried working with cellular automata, a setup where you show a model cellular automata outputs and it must recreate the program that generated them, that sort of thing. And eventually I settled on the ARC format around early 2018. I was doing this on the side. It was a side project. My main project was developing Keras at Google, so I wasn't moving very fast on that. So in summer 2018, I wrote the ARC task editor.
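The cellular automata setup mentioned above, where a solver sees outputs and must recover the generating program, can be made concrete with a minimal Python sketch. It assumes one-dimensional elementary cellular automata with wraparound and the standard Wolfram rule numbering; the actual prototype tasks from that period are not described here, so the function names and the choice of rule 110 are purely illustrative.

```python
def step(cells, rule):
    """Apply one step of an elementary cellular automaton with wraparound."""
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        # The three-cell neighborhood selects one bit of the rule number.
        index = (left << 2) | (center << 1) | right
        out.append((rule >> index) & 1)
    return out


def run(initial, rule, steps):
    """Return the list of rows produced by iterating `rule` from `initial`."""
    rows = [list(initial)]
    for _ in range(steps):
        rows.append(step(rows[-1], rule))
    return rows


def consistent_rules(observed):
    """Return every rule number (0-255) that reproduces the observed rows."""
    return [
        rule
        for rule in range(256)
        if all(step(prev, rule) == nxt for prev, nxt in zip(observed, observed[1:]))
    ]


if __name__ == "__main__":
    initial = [0] * 5 + [1] + [0] * 5
    observed = run(initial, 110, 8)  # the "hidden program" is rule 110
    for row in observed:
        print("".join("#" if c else "." for c in row))
    print("rules consistent with the observation:", consistent_rules(observed))
```

With only 256 candidate programs, exhaustive search is trivial. More than one rule can fit if some neighborhoods never occur in the observed rows, which is a small-scale version of the question of which consistent program will generalize.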
And then I started just making lots of tasks by hand. And I wrote up the paper explaining what this was about, what the big idea was: intelligence as skill-acquisition efficiency. And I published all of that in 2019.
>> In parallel, GPT-3 was coming out in 2020 and starting to show signs, until the ChatGPT moment around the end of 2022. And the industry took off with that. And this was one of the benchmarks where models were performing really badly. And it was very obscure. I don't think many people knew about it. It was mostly niche research communities that had maybe read your paper.
>> Yeah, people who worked on program synthesis knew about it. But a lot of people who worked on deep learning, on scaling up LLMs, didn't really care for it. And part of the reason is that LLMs did not work well, or at all, on the benchmark. For a benchmark to capture the attention of the research community, approaches need to start working on it a little. If it's too hard, people are just going to dismiss it.
>> You were clearly just ahead of your time, because we're not on ARC-AGI 1 anymore, and 2 is reaching saturation, and 3 is out now.
>> Yes.
>> And I think the cool thing about ARC-AGI is that it has been a very good barometer for the industry of the big changes that happened, because 1 was not working at all for a long time, until 2025 when reasoning models came out, right?
>> Yeah, absolutely. If you look at frontier AI performance on ARC V1 first and then V2: basically, LLMs were scoring extremely low on V1, sub-10%. And that was true of the original GPT-3, which was scoring zero. But that's even true of the latest base LLMs today, as of March.
>> Without reasoning.
>> Without reasoning.
>> Without reasoning.
>> Yeah, so it's the base models. Performance of base LLMs on V1 stayed very, very low, even though in the meantime we had scaled up these models by 50,000x, right? So it was really telling you that scaling up pre-training alone was not going to crack the benchmark. This was not enough to demonstrate that the model had true intelligence. And then the moment models started performing well on ARC-1 was with the first reasoning models, in particular the OpenAI o1 and then o3 models, which by the way were demonstrated by OpenAI on ARC because it was the one unsaturated reasoning benchmark that was really showing that this model was different. It had new capabilities that we had not seen before. So with reasoning models, you start seeing this sudden step-function change on ARC-1. And so ARC-1 was really the benchmark that signaled that at this moment in time something was happening.
And so...
>> Something big.
>> Yeah, something big. New capabilities were emerging. Reasoning was new and different. And it was actually not obvious at the time. I don't know if you remember when the o3 preview was announced by OpenAI.
>> That was end of 2024, actually.
>> Yeah, December 2024. And sure, it was huge step-function progress on ARC. But it was very expensive. It did not really have product-market fit, effectively. But if you looked at the ARC results, you knew that this was big and important.
And then we released ARC-2, which was the same format but more difficult, with more composition at the level of the reasoning chains. And what happened is that the earliest reasoning models started very, very low on ARC-2. And then around the same time as coding agents started working, you saw this...
>> Yeah. So very recently, just a few months ago, you saw this very fast saturation of ARC-2. And so again, ARC-2 signaled that yes, there was this new set of capabilities emerging. So I think the benchmark did a really good job at capturing the advent of reasoning models and then the advent of agentic coding, this new paradigm where if you have verifiable rewards, then you can basically fully automate the domain. Which by the way is true of ARC. ARC does provide a verifiable reward.
>> I guess for V2, what caused the... So one was clearly reasoning. For two, a benchmark doesn't care how you solve it. I guess embedded in what you said: were people using code gen to then solve it?
>> That's right. So not necessarily code gen per se, but frontier labs have been targeting ARC V2. And the progress you saw on ARC V2 is actually the result of this very large-scale targeting. What you can do to solve ARC V2 is ask your reasoning model to make more tasks like those in the benchmark. Then you try to solve them using, let's say, program induction, still using your reasoning model. Then you verify the solution. Again, it's verifiable, so you can trust the answer. And then you fine-tune the model on the successful reasoning chains. And then you keep repeating: you generate new tasks, you solve them, you verify the solutions, you fine-tune the model on the reasoning chains. And you can keep doing this millions of times, right? You just need to spend more money.
>> This is the RL loop that is happening, yeah.
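The generate, solve, verify, fine-tune cycle described above can be mimicked with a toy loop in Python. This is a sketch under heavy assumptions, not any lab's pipeline: the three-primitive grid DSL, the random "proposer" standing in for a reasoning model, and the weight bump standing in for fine-tuning are all hypothetical. The one faithful part is the structure, where only exactly verified solutions are kept as training signal.

```python
import random

# A tiny hypothetical DSL of grid transformations (not ARC's actual task space).
PRIMITIVES = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: g[::-1],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}


def apply_program(program, grid):
    """Run a list of primitive names over a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def random_grid(rng, size=3):
    return [[rng.randint(0, 3) for _ in range(size)] for _ in range(size)]


def generate_task(rng):
    """Stand-in for asking the model to make more tasks like the benchmark's."""
    hidden = [rng.choice(list(PRIMITIVES)) for _ in range(rng.randint(1, 2))]
    train = [(g, apply_program(hidden, g)) for g in (random_grid(rng) for _ in range(3))]
    test_input = random_grid(rng)
    return {"train": train, "test": (test_input, apply_program(hidden, test_input))}


def propose_program(rng, weights, max_len=2):
    """Stand-in for the reasoning model: sample primitives by current weight."""
    names = list(PRIMITIVES)
    length = rng.randint(1, max_len)
    return [rng.choices(names, weights=[weights[n] for n in names])[0] for _ in range(length)]


def verify(program, task):
    """Exact-match check on every pair: the verifiable reward."""
    pairs = task["train"] + [task["test"]]
    return all(apply_program(program, x) == y for x, y in pairs)


def self_improvement_loop(rounds=200, attempts=20, seed=0):
    rng = random.Random(seed)
    weights = {name: 1.0 for name in PRIMITIVES}
    verified_traces = []
    for _ in range(rounds):
        task = generate_task(rng)
        for _ in range(attempts):
            program = propose_program(rng, weights)
            if verify(program, task):
                verified_traces.append((task, program))
                for name in program:  # crude stand-in for fine-tuning
                    weights[name] += 0.1
                break
    return verified_traces, weights


if __name__ == "__main__":
    traces, weights = self_improvement_loop()
    print("verified solutions collected:", len(traces))
    print("proposal weights after the loop:", weights)
```

Because every stored trace has passed an exact check, the loop can be run as many times as the budget allows without the training data being polluted by wrong answers, which is what makes this kind of targeting possible on a verifiable benchmark.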
>> And the new paradigm in AI is basically that in any domain where this is true, where you have the ability to generate these true verification signals, you can run this kind of loop, right? And if you can run this kind of loop, you can effectively brute-force mine the entire space and get extremely high performance. This is basically the process through which ARC 2 was saturated. So what it tells you is that it's not so much that the models have higher fluid intelligence than they did with the first reasoning models. It's just that you have this new paradigm of post-training. And this is exactly what led to agentic coding. So it does matter. It is valuable. It is useful.
>> It's not that the models are smarter, it's that they're suddenly more useful. And it's possible to be more useful in particular domains without being smarter. Yeah, clearly, because that means good things for me. I'm not getting any smarter right now, at age 45, but I can learn how to do things. And that's sort of what's happening with the models as of late.
>> Yeah, absolutely. When it comes to competency, there's always a trade-off between intelligence and knowledge. If you have more knowledge, if you have better training, you need less intelligence to be competent. And that's exactly what happened with the rise of coding agents, right? The models don't have higher fluid intelligence per se. They don't have a higher IQ, so to speak. It's just that they're way better trained. And they're way better trained in two ways. They're not just trained to complete code anymore. They're actually trained via trial and error in these RL post-training environments with true reward signals. And they're also trained to embed this model of code execution, right? 
Where they learn to keep track of the values of variables over an execution cycle. And that's what's leading to this extremely strong product-market fit of agentic coding today. And it's completely changing software engineering.
>> The saturation happened not too long ago. We actually had the founder of Poetiq come and speak about their approach, and it really sounds like this new way of getting LLMs to perform is building this agent harness, right? And the harness is basically structuring a problem domain into something that can be formally verified. And they did that basically for ARC V2.
When they released it, they were at the top of the benchmark. But the crazy thing is that I actually worked with a company in the winter 26 batch not too long ago, called Confluence Lab, which ended up saturating the V2 results with 97%, and I think their task cost was a lot more efficient, too. And the approach they took is basically similar to this. I think they built harnesses on top of the LLMs to get them to go and build different tasks and program through them.







