what is intelligence?
Moontower Munchies #136
Friends,
Follow-up to the SLV calendar from last Monday
Last week, I published a video talking about calendar spreads in general and the fat one I saw last Monday in SLV. These are the slides from the presentation.
On Monday, I made this short video explaining how you’d track the performance of the spread assuming you delta-hedged daily.
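For anyone who wants the mechanics without watching the video, here's a minimal Python sketch of the bookkeeping: reprice the spread each day, short shares against yesterday's net delta, and accumulate the mark-to-market change plus the hedge P&L. It assumes Black-Scholes repricing, and the strike, vols, and prices below are made-up placeholders, not the actual SLV numbers or the exact method from the video.

```python
# Minimal sketch: daily P&L of a delta-hedged long calendar spread.
# Black-Scholes repricing with placeholder inputs, purely for illustration.
from math import log, sqrt, exp
from statistics import NormalDist

N = NormalDist().cdf

def bs_call(S, K, T, sigma, r=0.0):
    """Black-Scholes call price and delta."""
    if T <= 0:
        return max(S - K, 0.0), 1.0 if S > K else 0.0
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * N(d1) - K * exp(-r * T) * N(d2), N(d1)

def spread_value_and_delta(S, K, t_near, t_far, vol_near, vol_far):
    """Long the far-dated call, short the near-dated call (long calendar)."""
    near_px, near_d = bs_call(S, K, t_near, vol_near)
    far_px, far_d = bs_call(S, K, t_far, vol_far)
    return far_px - near_px, far_d - near_d

# Hypothetical daily marks (spot, years to near expiry, years to far expiry,
# near vol, far vol) -- placeholders, not real SLV data.
K = 30.0
days = [
    (30.0, 30/365, 90/365, 0.35, 0.30),
    (29.5, 29/365, 89/365, 0.36, 0.30),
    (30.4, 28/365, 88/365, 0.34, 0.29),
]

pnl = 0.0
prev_val, prev_delta = spread_value_and_delta(days[0][0], K, *days[0][1:])
prev_spot = days[0][0]
for spot, t_near, t_far, v_near, v_far in days[1:]:
    val, delta = spread_value_and_delta(spot, K, t_near, t_far, v_near, v_far)
    # Mark-to-market change in the spread, plus P&L on yesterday's hedge
    # (short prev_delta shares against the long spread).
    pnl += (val - prev_val) - prev_delta * (spot - prev_spot)
    prev_val, prev_delta, prev_spot = val, delta, spot

print(f"cumulative delta-hedged P&L per spread: {pnl:.4f}")
```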
Learning by doing
A couple months ago, Dwarkesh had Richard Sutton on the podcast. It was an interesting interview, especially since it was contentious in a healthy and enlightening way. The disagreements strike at the heart of what learning is. It’s also a reminder that the fertile crescent on the topic of learning is the intersection of AI, computer science, and cog sci.
[You should also check out Dwarkesh's outstanding interview with Scott H. Young if you are interested in learning science.]
I listened to the interview with Sutton twice and I’ll share the excerpts that linger (emphasis mine).
Dwarkesh intro
Richard Sutton is the father of reinforcement learning, winner of the 2024 Turing Award, and author of The Bitter Lesson. And he thinks LLMs are a dead end.
After interviewing him, my steel man of Richard’s position is this: LLMs aren’t capable of learning on-the-job, so no matter how much we scale, we’ll need some new architecture to enable continual learning.
And once we have it, we won’t need a special training phase — the agent will just learn on-the-fly, like all humans, and indeed, like all animals.
This new paradigm will render our current approach with LLMs obsolete.
In our interview, I did my best to represent the view that LLMs might function as the foundation on which experiential learning can happen… Some sparks flew.
What Is Intelligence?
What is intelligence? The problem is to understand your world. Reinforcement learning is about understanding your world, whereas large language models are about mimicking people, doing what people say you should do. They’re not about figuring out what to do.
Do Large Language Models Have a World Model?
Dwarkesh Patel 00:01:19
You would think that to emulate the trillions of tokens in the corpus of Internet text, you would have to build a world model. In fact, these models do seem to have very robust world models. They’re the best world models we’ve made to date in AI, right? What do you think is missing?
Sutton:
I would disagree with most of the things you just said. To mimic what people say is not really to build a model of the world at all. You’re mimicking things that have a model of the world: people. I don’t want to approach the question in an adversarial way, but I would question the idea that they have a world model. A world model would enable you to predict what would happen. They have the ability to predict what a person would say. They don’t have the ability to predict what will happen.
What we want, to quote Alan Turing, is a machine that can learn from experience, where experience is the things that actually happen in your life. You do things, you see what happens, and that’s what you learn from. The large language models learn from something else. They learn from “here’s a situation, and here’s what a person did.” Implicitly, the suggestion is you should do what the person did.
The Limits of Imitation
[Imitation is the] large language model perspective. I don’t think it’s a good perspective. To be a prior for something, there has to be a real thing. A prior bit of knowledge should be the basis for actual knowledge. What is actual knowledge? There’s no definition of actual knowledge in that large-language framework. What makes an action a good action to take?
The Need for Continual Learning and Feedback
You recognize the need for continual learning. If you need to learn continually, continually means learning during the normal interaction with the world. There must be some way during the normal interaction to tell what’s right. Is there any way to tell in the large language model setup what’s the right thing to say? You will say something and you will not get feedback about what the right thing to say is, because there’s no definition of what the right thing to say is. There’s no goal. If there’s no goal, then there’s one thing to say, another thing to say. There’s no right thing to say.
There’s no ground truth. You can’t have prior knowledge if you don’t have ground truth, because the prior knowledge is supposed to be a hint or an initial belief about what the truth is. There isn’t any truth. There’s no right thing to say. In reinforcement learning, there is a right thing to say, a right thing to do, because the right thing to do is the thing that gets you reward.
We have a definition of what’s the right thing to do, so we can have prior knowledge or knowledge provided by people about what the right thing to do is. Then we can check it to see, because we have a definition of what the actual right thing to do is.
An even simpler case is when you’re trying to make a model of the world. When you predict what will happen, you predict and then you see what happens. There’s ground truth. There’s no ground truth in large language models because you don’t have a prediction about what will happen next. The next token is what they should say, what the actions should be. It’s not what the world will give them in response to what they do.
The Importance of Goals
Let’s go back to their lack of a goal. For me, having a goal is the essence of intelligence. Something is intelligent if it can achieve goals. I like John McCarthy’s definition that intelligence is the computational part of the ability to achieve goals. You have to have goals or you’re just a behaving system. You’re not anything special, you’re not intelligent. You agree that large language models don’t have goals?
Dwarkesh Patel 00:07:25
No, they have a goal.
Richard Sutton 00:07:26
What’s the goal?
Dwarkesh Patel 00:07:27
Next token prediction.
Richard Sutton 00:07:29
That’s not a goal. It doesn’t change the world. Tokens come at you, and if you predict them, you don’t influence them.
Dwarkesh Patel 00:07:39
Oh yeah. It’s not a goal about the external world.
Richard Sutton 00:07:43
It’s not a goal. It’s not a substantive goal. You can’t look at a system and say it has a goal if it’s just sitting there predicting and being happy with itself that it’s predicting accurately.
RL on Top of LLMs?
Dwarkesh Patel 00:07:55
The bigger question I want to understand is why you don’t think doing RL on top of LLMs is a productive direction. We seem to be able to give these models the goal of solving difficult math problems. They are in many ways at the very peak of human-level capacity to solve math Olympiad-type problems. They got gold at the IMO. So it seems like the model which got gold at the International Math Olympiad does have the goal of getting math problems right. Why can’t we extend this to different domains?
Richard Sutton 00:08:27
The math problems are different. Making a model of the physical world and carrying out the consequences of mathematical assumptions or operations, those are very different things. The empirical world has to be learned. You have to learn the consequences. Whereas the math is more computational, it’s more like standard planning. There they can have a goal to find the proof, and they are in some way given that goal to find the proof.
The Experiential Paradigm
The scalable method is you learn from experience. You try things, you see what works. No one has to tell you. First of all, you have a goal. Without a goal, there’s no sense of right or wrong or better or worse. Large language models are trying to get by without having a goal or a sense of better or worse. That’s just exactly starting in the wrong place.
Learning vs. Training
I don’t think learning is really about training. I think learning is about learning, it’s about an active process. The child tries things and sees what happens. We don’t think about training when we think of an infant growing up.
Animal Learning and the Absence of Supervised Learning
These things are actually rather well understood. If you look at how psychologists think about learning, there’s nothing like imitation. Maybe there are some extreme cases where humans might do that or appear to do that, but there’s no basic animal learning process called imitation. There are basic animal learning processes for prediction and for trial-and-error control.
It’s really interesting how sometimes the hardest things to see are the obvious ones. It’s obvious—if you look at animals and how they learn, and you look at psychology and our theories of them—that supervised learning is not part of the way animals learn. We don’t have examples of desired behavior. What we have are examples of things that happen, one thing that followed another. We have examples of, “We did something and there were consequences.” But there are no examples of supervised learning.
Supervised learning is not something that happens in nature. Even if that were the case with school, we should forget about it because that’s some special thing that happens in people. It doesn’t happen broadly in nature. Squirrels don’t go to school. Squirrels can learn all about the world. It’s absolutely obvious, I would say, that supervised learning doesn’t happen in animals.
I like the way you consider that obvious, because I consider the opposite obvious. We have to understand how we are animals. If we understood a squirrel, I think we’d be almost all the way there to understanding human intelligence. The language part is just a small veneer on the surface.
The Experiential Stream
This is great. We’re finding out the very different ways that we’re thinking. We’re not arguing. We’re trying to share our different ways of thinking with each other.
[Imitation] is a small thing on top of basic trial-and-error learning, prediction learning. It’s what distinguishes us, perhaps, from many animals. But we’re an animal first. We were an animal before we had language and all those other things.
Sensation, Action, Reward
The experiential paradigm. Let’s lay it out a little bit. It says that experience, action, sensation—well, sensation, action, reward—this happens on and on and on for your life. It says that this is the foundation and the focus of intelligence. Intelligence is about taking that stream and altering the actions to increase the rewards in the stream.
Learning then is from the stream, and learning is about the stream. That second part is particularly telling. What you learn, your knowledge, is about the stream. Your knowledge is about if you do some action, what will happen. Or it’s about which events will follow other events. It’s about the stream. The content of the knowledge is statements about the stream. Because it’s a statement about the stream, you can test it by comparing it to the stream, and you can learn it continually.
Reward and Intrinsic Motivation
The reward function is arbitrary. If you’re playing chess, it’s to win the game of chess. If you’re a squirrel, maybe the reward has to do with getting nuts. In general, for an animal, you would say the reward is to avoid pain and to acquire pleasure. I think there also should be a component having to do with your increasing understanding of your environment. That would be sort of an intrinsic motivation.
Even when they have extremely sparse rewards, they can still take intermediate steps, having an understanding of how the next thing they’re doing leads to this grander goal. This is something we know very well. The basis of it is temporal difference learning, where the same thing happens on a less grandiose scale. When you learn to play chess, you have the long-term goal of winning the game. Yet you want to be able to learn from shorter-term things like taking your opponent’s pieces.
You do that by having a value function which predicts the long-term outcome. Then if you take the guy’s pieces, your prediction about the long-term outcome is changed. It goes up, you think you’re going to win. Then that increase in your belief immediately reinforces the move that led to taking the piece.
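If the value-function description feels abstract, here’s a tiny tabular TD(0) sketch in Python. It’s my own toy illustration, not code from Sutton or the episode: a chain world where the only reward arrives at the end, and one-step updates let intermediate states learn how close they are to the goal. It shows just the prediction side; in a full RL agent the same TD error would also reinforce the move that produced it.

```python
# Toy temporal-difference (TD(0)) learning on a chain where the only reward
# comes at the end. Chain length, step size, and dynamics are made up.
import random

N_STATES = 5          # states 0..4; stepping past state 4 "wins" (reward 1)
ALPHA, GAMMA = 0.1, 0.9
V = [0.0] * N_STATES  # value estimates for non-terminal states

for episode in range(2000):
    s = 0
    while s < N_STATES:
        # Move forward with 80% probability, otherwise slip back one state.
        s_next = s + 1 if random.random() < 0.8 else max(s - 1, 0)
        r = 1.0 if s_next == N_STATES else 0.0
        v_next = 0.0 if s_next == N_STATES else V[s_next]
        # TD error: how much did this one step change our long-term estimate?
        # A positive td_error ("I now think I'm closer to winning") is the
        # signal that would reinforce the move that produced it.
        td_error = r + GAMMA * v_next - V[s]
        V[s] += ALPHA * td_error
        s = s_next

# States nearer the goal end up with higher values: short-term transitions
# have filled in progress toward the sparse, long-term reward.
print([round(v, 2) for v in V])
```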
Sign of the times
I was taking my 9-year-old to school and as we were putting our shoes on he noticed this device:
He looks at me and asks “Can I talk to it?”
I literally laughed out loud for a few seconds before saying no. I mean, I can’t blame him: 20 ft away is the Alexa he asks about the weather or for spelling help. It also reminded me of when his older brother was a toddler and tried to “swipe” on the TV.
Speaking of Sign of the Times, that’s also the title of a Harry Styles tune that gets mucho spins on the playlist. Quite dreary for a pop song if you look up what it’s about.
Stay groovy
☮️
Need help analyzing a business, investment or career decision?
Book a call with me.
It's $500 for 60 minutes. Let's work through your problem together. If you're not satisfied, you get a refund.
Let me know what you want to discuss and I’ll give you a straight answer on whether I can be helpful before we chat.
I started doing these in early 2022 by accident via inbound inquiries from readers. So I hung out a shingle through the Substack Meetings beta. You can see how I’ve helped others:
Moontower On The Web
📡All Moontower Meta Blog Posts
👤About Me
Specific Moontower Projects
🧀MoontowerMoney
👽MoontowerQuant
🌟Affirmations and North Stars
🧠Moontower Brain-Plug In
Curations
✒️Moontower’s Favorite Posts By Others
🔖Guides To Reading I Enjoyed
🤖Resources to Get More Out of AI
🛋️Investment Blogs I Read
📚Book Ideas for Kids
Fun
🎙️Moontower Music
🍸Moontower Cocktails
🎲Moontower Boardgaming