In exchange, he offers the idea of an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try to minimize that.
Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and whether there's any engineering being done around it. I can't find much after the release of I-JEPA from his group.
LeCun's argument is this:
1) You can't learn an accurate world model just from text.
2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.
He and people like Hinton and Bengio have been saying for a while that there are tasks that mice can understand that an AI can't, and that even achieving mouse-level intelligence would be a breakthrough, but we cannot get there through language learning alone.
A simple example from "How Large Are Lions? Inducing Distributions over Quantitative Attributes" (https://arxiv.org/abs/1906.01327) is this: Learning the size of objects using pure text analysis requires significant gymnastics, while vision demonstrates physical size more easily. To determine the size of a lion you'll need to read thousands of sentences about lions, or you could look at two or three pictures.
LeCun isn't saying that LLMs aren't useful. He's just concerned with bigger problems, like AGI, which he believes cannot be solved purely through linguistic analysis.
The energy minimization architecture is more about joint multimodal learning.
(Energy minimization is a very old idea. LeCun has been on about it for a while and it's less controversial these days. Back when everyone tried to have a probabilistic interpretation of neural models, it was expensive to compute the normalization term / partition function. Energy minimization basically said: Set up a sensible loss and minimize it.)
My mental model of AI advancements is that of a step function with s-curves in each step [1]. Each time there is an algorithmic advancement, people quickly rush to apply it to both existing and new problems, demonstrating quick advancements. Then we tend to hit a kind of plateau for a number of years until the next algorithmic solution is found. Examples of steps include AlexNet demonstrating superior image labeling, LeCun demonstrating deep learning, and now OpenAI demonstrating large transformer models.
I think in the past, at each stage, people tended to think that the recent progress was a linear or exponential process that would continue forward. This led to people thinking self-driving cars were right around the corner after the introduction of DL in the 2010s, and that super-intelligence is right around the corner now. I think at each stage, the cusp of the S-curve comes as we find where the model is good enough to be deployed, and where it isn't. Then companies tend to enter a holding pattern for a number of years getting diminishing returns from small improvements on their models, until the next algorithmic breakthrough is made.
Right now I would guess that we are around 0.9 on the S-curve: we can still improve the LLMs (as DeepSeek has shown with wide MoE and o1/o3 have shown with CoT), and it will take a few years for the best uses to be brought to market and popularized. As you mentioned, LeCun points out that LLMs have a hallucination problem built into their architecture; others have pointed out that LLMs have had shockingly few revelations and breakthroughs for something that has ingested more knowledge than any living human. I think future work on LLMs is likely to make some improvement on these things, but not much.
I don't know what it will be, but a new algorithm will be needed to induce the next step on the curve of AI advancement.
[1]: https://www.open.edu/openlearn/nature-environment/organisati...
The problem with LLMs is that the output is inherently stochastic - i.e. there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant lookup maps with interpolation.
Energy minimization is more of an abstract approach where you can use architectures that don't rely on things like differentiability. True AI won't be solely feedforward architectures like current LLMs. To give an answer, they will basically determine an algorithm on the fly that includes computation and search. To learn that algorithm (or its parameters) at training time, you need something that doesn't rely on continuous values but still converges to the right answer. So instead you assign a fitness score, like memory use or compute cycles, and optimize based on that. This is basically how search works with genetic algorithms or PSO.
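As a rough illustration of that kind of gradient-free search, here is a minimal genetic-algorithm sketch; the fitness function is a made-up stand-in for something like memory use or compute cycles, not anything LeCun has proposed:

    import random

    def fitness(params):
        # Hypothetical stand-in for "memory use or compute cycles":
        # reward parameter vectors close to an arbitrary target.
        target = [3, 1, 4, 1, 5]
        return -sum((p - t) ** 2 for p, t in zip(params, target))

    def mutate(params, rate=0.3):
        return [p + random.choice([-1, 0, 1]) if random.random() < rate else p
                for p in params]

    # Random initial population of discrete "algorithm parameters".
    population = [[random.randint(0, 9) for _ in range(5)] for _ in range(20)]

    for generation in range(100):
        # Rank by fitness; no gradients or differentiability required.
        population.sort(key=fitness, reverse=True)
        survivors = population[:5]
        # Refill the population with mutated copies of the survivors.
        population = survivors + [mutate(random.choice(survivors)) for _ in range(15)]

    best = max(population, key=fitness)
    print(best, fitness(best))

Selection pressure on the fitness score does the work that backprop would normally do, which is the point: nothing here needs to be differentiable.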
To answer your question, think about how we train LLMs: We have them learn the statistical distribution of all written human language, such that given a chunk of text (a prompt, etc.) it then samples its output distribution to produce the next most likely token (word, sub-word, etc.) that should be produced and keeps doing that. It never learns how to judge what is true or false and during training it never needs to learn "Do I already know this?" It is just spoon-fed information that it has to memorize and has no ability to acquire metacognition, which is something that it would need to be trained to attain. As humans, we know what we don't know (to an extent) and can identify when we already know something or don't already know something, such that we can say "I don't know." During training, an LLM is never taught to do this sort of introspection, so it never will know what it doesn't know.
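A toy sketch of that sampling loop; the probability table below is hand-written and conditions only on the previous token, whereas a real LLM conditions on the whole context, but the mechanics are the point:

    import random

    # Toy stand-in for a trained model: hand-written next-token probabilities
    # keyed on the previous token only (a real LLM conditions on the whole context).
    next_token_probs = {
        "<s>": {"the": 0.6, "a": 0.4},
        "the": {"cat": 0.5, "dog": 0.5},
        "a":   {"cat": 0.5, "dog": 0.5},
        "cat": {"sat": 0.7, "ran": 0.3},
        "dog": {"sat": 0.3, "ran": 0.7},
        "sat": {"<end>": 1.0},
        "ran": {"<end>": 1.0},
    }

    tokens = ["<s>"]
    while tokens[-1] != "<end>":
        dist = next_token_probs[tokens[-1]]
        # Sample the next token from the distribution; note there is no
        # "I don't know" option, only more or less probable continuations.
        tokens.append(random.choices(list(dist), weights=dist.values())[0])

    print(" ".join(tokens[1:-1]))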
I have a bunch of ideas about how to address this with a new architecture and a lifelong learning training paradigm, but it has been hard to execute. I'm an AI professor, but really pushing the envelope in that direction requires I think a small team (10-20) of strong AI scientists and engineers working collaboratively and significant computational resources. It just can't be done efficiently in academia where we have PhD student trainees who all need to be first author and work largely in isolation. By the time AI PhD students get good, they graduate.
I've been trying to find the time to focus on getting a start-up going focused on this. With Terry Sejnowski, I pitched my ideas to a group affiliated with Schmidt Sciences that funds science non-profits at around $20M per year for 5 years. They claimed to love my ideas, but didn't go for it....
Will Titans be sufficiently "neuroplastic" to escape that? Maybe, I'm not sure.
Ultimately, I think what will be required is an architecture built around "looping", where the model's outputs are both some form of "self update" and "optional actionality", such that interacting with the model is more like "sampling from a thought space".
These long-horizon (AGI) problems have been there since the very beginning. We have never had a solution to them. RL assumes we know the future, which is a poor proxy. These energy-based methods fundamentally do very little that an RNN didn't do long ago.
I worked on higher-dimensionality methods, which is a very different angle. My take is that it's about the way we scale dependencies between connections. The human brain makes and breaks a massive number of neuron connections daily. Scaling the dimensionality would imply that a single connection could be scaled to encompass significantly more "thoughts" over time.
Additionally, the true solution to these problems is as likely to be found by a kid with a laptop as by a top researcher. If you find the solution to CL on a small AI model (MNIST), you solve it at all scales.
https://arxiv.org/abs/2502.09992
https://www.inceptionlabs.ai/news
(these are results from two different teams/orgs)
It sounds kind of like what you're describing, and nobody else has mentioned it yet, so take a look and see whether it's relevant.
I think what LeCun is probably getting at is that there's currently no way for a model to say "I don't know". Instead, it'll just do its best. For esoteric topics, this can result in hallucinations; for topics where you push just past the edge of well-known and easy-to-Google, you might get a vacuously correct response (i.e. repetition of correct but otherwise known or useless information). The models are trained to output a response that meets the criteria of quality as judged by a human, but there's no decent measure (that I'm aware of) of the accuracy of the knowledge content, or the model's own limitations. I actually think this is why programming and mathematical tasks have such a large impact on model performance: because they encode information about correctness directly into the task.
So Yann is probably right, though I don't know that energy minimization is a special distinction that needs to be added. Any technique that we use for this task could almost certainly be framed as energy minimization of some energy function.
I say schedule because the “static data once through” approach is, in my mind, one of the root problems.
Think about what happens when you read something like a book. You’re not “just” reading it, you’re also comparing it to other books, other books by the same author, while critically considering the book recommendations made by your friend. Any events in the book get compared to your life experience, etc…
LLM training does none of this! It’s a once-through text prediction training regime.
What this means in practice is that an LLM can’t write a review of a book unless it has read many reviews already. They have, of course, but the problem doesn’t go away. Ask an AI to critique book reviews and it’ll run out of steam because it hasn’t seen many of those. Critiques of critiques is where they start falling flat on their face.
This kind of meta-knowledge is precisely what experts accumulate.
As a programmer I don’t just regurgitate code I’ve seen before with slight variations; instead I know that mainstream criticisms of microservices miss their key benefit of extreme team scalability!
This is the crux of it: when humans read their training material they are generating an “n+1” level in their mind that they also learn. The current AI training setup trains the AI only the “n”th level.
This can be solved by running the training in a loop for several iterations after base training. The challenge of course is to develop a meaningful loss function.
IMHO the “thinking” model training is a step in the right direction but nowhere near enough to produce AGI all by itself.
The paper would be a strong argument against your point: if neural architectures are already constraining the amount of information that a text generation system delivers the same way a human (allegedly) does, then I don't see which "energy" measure one could take that could perform any better.
Then again, perhaps they have one in mind and I just haven't read it.
If two nodes are on, but the connection between them is negative, this causes energy to be higher.
If one of those nodes switches off, energy is reduced.
With two nodes this is trivial. With 10 nodes it's more difficult to solve, and with billions of nodes it is impossible to "solve".
All you can do then is try to get the energy as low as possible.
This way neural networks can also find "new" information that they have not learned directly, but that is consistent with the constraints they have learned about the world so far.
I learned it from: https://youtube.com/playlist?list=PLLHTzKZzVU9eaEyErdV26ikyo...
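A minimal sketch of that energy bookkeeping for a tiny network; the weights and states are made up purely for illustration:

    import itertools

    # States are 0/1 ("off"/"on"); weights[(i, j)] is the connection between nodes i and j.
    # A negative weight between two "on" nodes raises the energy, as described above.
    weights = {(0, 1): -2.0, (0, 2): 1.0, (1, 2): 0.5}

    def energy(states):
        return -sum(w * states[i] * states[j] for (i, j), w in weights.items())

    print(energy((1, 1, 1)))  # both ends of the negative connection are on -> higher energy
    print(energy((1, 0, 1)))  # switch node 1 off -> lower energy

    # With 3 nodes we can still brute-force the minimum; with billions we cannot,
    # so all we can do is try to push the energy as low as possible.
    best = min(itertools.product([0, 1], repeat=3), key=energy)
    print(best, energy(best))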
Yann LeCun, and Michael Bronstein and his colleagues have some similarities in trying to properly Sciencify Deep Learning.
Yann LeCun's approach, at least for vision, has one core tenet: energy minimization, just like in physics. In his course, he also shows some current architectures/algorithms to be special cases of EBMs.
Yann believes that understanding the whys of the behavior of DL algorithms is going to be beneficial in the long term, rather than playing around with hyper-params.
There is also a case for language being too low-dimensional to lead to AGI even if it is solved. Like, in a recent video, he said that the total amount of data in all digitized books and on the internet is about the same as what a human child takes in during the first 4 or 5 years of life. He considers this low.
There are also epistemological arguments that language alone cannot lead to AGI, but I haven't heard him talk about them.
He also believes that vision is a more important aspect of intelligence, one reason being that it is very high-dimensional. (Edit) Consider an example. Take 4 monochrome pixels. All pixels can range from 0 to 255, so 4 pixels can create 256^4 = 2^32 combinations. 4 words can create 4! = 24 combinations. Solving language is easier and therefore low-stakes. Remember the monkey producing a Shakespeare play by randomly punching typewriter keys? If that was an astronomically big number, think how obscenely long it would take a monkey to paint the Mona Lisa by randomly assigning pixel values. Left as an exercise to the reader.
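The arithmetic in that example, for anyone who wants to check it (note the word count is orderings of four fixed words, as in the comment):

    import math

    pixels = 256 ** 4                  # 4 monochrome pixels, 256 intensity levels each
    print(pixels, pixels == 2 ** 32)   # 4294967296 True

    orderings = math.factorial(4)      # orderings of 4 fixed words
    print(orderings)                   # 24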
Juergen Schmidhuber has gone rather quiet now. But he has also said that a world model, explicitly included in training and reasoning, is better than only text or images or whatever. He has a good paper with Lucas Beyer.
I may have an actual opinion on his viewpoint, however, I have a nitpick even before that.
How exactly is 'LLM' defined here? Even if some energy-based thing is done, would some not call even that an LLM? If/when someone finds a way to fix it within the 'token choice' method, could some people not just start calling it something different from 'LLM'?
I think Yann needs to rephrase what exactly he wants to say.
This is obviously an extremely high level simplification, but that's the core of it.
The physics of human consciousness are not implemented in a leaky symbolic abstraction but in the raw physics of existence.
The sort of autonomous system we imagine when thinking AGI must be built directly into substrate and exhibit autonomous behavior out of the box. Our computers are blackboxes made in a lab without centuries of evolving in the analog world, finding a balance to build on. They either can do a task or cannot. Obviously from just looking at one we know how few real world tasks it can just get up and do.
Code isn’t magic, it’s instruction to create a machine state. There’s no inherent intelligence to our symbolic logic. It’s an artifact of intelligence. It cannot imbue intelligence into a machine.
For example, if a prompt is: “what is the Statue of Liberty”, the LLM’s first output token is going to be “the”, but it kinda already “knows” that the next ones are going to be “statue of liberty”.
So to me LLMs already “choose” a response path from the first token.
Conversely, an LLM that would try to find a minimum energy for the whole response wouldn’t necessarily stop hallucinating. There is nothing in the training of a model that says that “I don’t know” has a lower “energy” than a wrong answer…
As a result, you'll never be able to get 100% consistent outputs or behavior (like you hypothetically can with a traditional algorithm/business logic). And that has proven out in usage across every model I've worked with.
There's also an upper-bound problem in terms of context where every LLM hits some arbitrary amount of context that causes it to "lose focus" and develop a sort of LLM ADD. This is when hallucinations and random, unrequested changes get made and a previously productive chat spirals to the point where you have to start over.
Assuming that text-only models will hit a bottleneck, then to have next-generation models, in addition to a new architecture, do we also have to find a dataset that is even more generic and much richer in modalities, and an architecture able to natively ingest it?
However, something that is not predictable is how much further the emergent properties can scale with model size. Maybe a few more unlocks, like the model being able to retain information well in spite of really large context lengths, or the ability to SFT on super complex reasoning tasks without disrupting the weights enough to lose the unsupervised learning, might take us much further?
The short answer should be that it's obvious LLM training and inference are both ridiculously inefficient and biologically implausible, and therefore there have to be some big optimization wins still on the table.
I don't know about you, but I certainly don't generate text autoregressively, token by token. Also, pretty sure I don't learn by global updates based on taking the derivative of some objective function of my behavior with respect to every parameter defining my brain. So there's good biological reason to think we can go beyond the capabilities of current architectures.
I think probably an example of the kind of new architectures he supports is FB's Large Concept Models [1]. It's still a self-attention, autoregressive architecture, but the unit of regression is a sentence rather than a token. It maps sentences into a latent space via an autoencoder architecture, then has a transformer architecture in which the tokens are elements in that latent space.
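For a concrete feel of the sentence-level idea, here is my own rough PyTorch sketch of the general shape (this is not FB's actual LCM code; the 300-dim "sentence features", the layer sizes, and the simple linear head are all invented for illustration):

    import torch
    import torch.nn as nn

    d_latent = 64

    # Autoencoder that maps per-sentence features into a latent space and back.
    sentence_encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, d_latent))
    sentence_decoder = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(), nn.Linear(128, 300))

    # Autoregressive transformer whose "tokens" are sentence latents, not words.
    layer = nn.TransformerEncoderLayer(d_model=d_latent, nhead=4, batch_first=True)
    latent_lm = nn.TransformerEncoder(layer, num_layers=2)
    next_latent_head = nn.Linear(d_latent, d_latent)

    # Fake "sentence features" (e.g. bag-of-words vectors) for an 8-sentence document.
    sentences = torch.randn(1, 8, 300)

    latents = sentence_encoder(sentences)                          # (1, 8, d_latent)
    causal_mask = torch.triu(torch.full((8, 8), float("-inf")), diagonal=1)
    contextual = latent_lm(latents, mask=causal_mask)              # causal over sentences
    predicted_next = next_latent_head(contextual[:, -1])           # latent for sentence 9
    print(sentence_decoder(predicted_next).shape)                  # back to sentence-feature space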
Disclosure: I am the author of this paper.
Reference: (PDF) Hydra: Enhancing Machine Learning with a Multi-head Predictions Architecture. Available from: https://www.researchgate.net/publication/381009719_Hydra_Enh... [accessed Mar 14, 2025].
Attention works, yes. But it is not biologically plausible at all. We don't do quadratic comparisons across a whole book or need to see thousands of samples to understand.
Personally I think that in the future recursive architectures and test time training will have a better chance long term than current full attention.
Also, I think that OpenAI's biggest contribution is demonstrating that reasoning-like behaviors can emerge from really good language modelling.
The token approach is inherently flawed because the tokens pre-suppose unique meaning when in fact they may not be unique.
Said another way, it lacks properties that would be able to differentiate true from false because the differentiating input isn't included and cannot be derived from the inputs given. This goes to decidability.
Which is just to say, it feels to me like there's a danger that the stochastic nature of outputs is fundamental to true creative intelligence and all attempts to stamp it out will result in lower accuracy overall. Rather we should be treating it more like we do actual humans and expect errors and put layers of process around things where it matters to make them safe.
Top-end LLMs write better and faster than most humans.
Top-end stable diffusion models can draw and render video much faster and with much more precision than the best human artists.
The main problem with the current approach is that to grow abilities you need to add more neurons, and this is not just energy consuming but also knowledge consuming: at the GPT-4 level, all the text sources of humanity are already exhausted and the model becomes essentially overfitted. So it looks like multimodal models appear not because they are so good, but because they can learn from additional sources (audio/video).
I have seen a few approaches to overcome the problem of overfitting, but as I understand it no universal solution exists.
For example, one approach tried to create synthetic training data from current texts, but this idea is limited by definition.
So current LLMs appear to have hit a dead end, and researchers are now trying to find an exit from it. I believe that in the next few years somebody will invent some universal solution (probably a complex of approaches) or suggest another architecture, and the progress of AI will continue.
There’s no real rule worthy of any respect imho that LLMs can’t be configured to get additional input data from images, audio, proprioception sensors, and any other modes. I can easily write a script to convert such data into tokens in any number of ways that would allow them to be fed in as tokens of a “language.” Convolutions for example. A real expert could do it even more easily or do a better job. And then don’t LeCun’s objections just evaporate? I don’t see why he thinks he has some profound point. For god’s sake, our own senses are heavily attenuated and mediated and it’s not like we actually experience raw reality ourselves, ever; we just feel like we do. LLMs can be extended to be situated. So much can be done. It’s like he’s seeing HTTP in 1993 and saying it won’t be enough for the full web… well duh, but it’s a great start. Now go build on it.
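To make the "just tokenize it" point concrete, here is a crude sketch that bins raw sensor readings into a small discrete vocabulary; the bin-token scheme is invented for illustration, and real systems use learned codebooks or patch embeddings rather than fixed bins:

    import random

    def sensor_to_tokens(samples, n_bins=256, lo=-1.0, hi=1.0):
        tokens = []
        for x in samples:
            x = min(max(x, lo), hi)                        # clamp to the expected range
            bin_id = int((x - lo) / (hi - lo) * (n_bins - 1))
            tokens.append(f"<sensor_{bin_id}>")            # one "word" per quantization bin
        return tokens

    waveform = [random.uniform(-1, 1) for _ in range(10)]  # stand-in for audio/proprioception
    print(sensor_to_tokens(waveform)[:5])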
If anything the flaw in LLMs is how they maintain only one primary thread of prediction. But this is changing; having a bunch of threads working on the same problem and checking each other from different angles of the problem will be an obvious fix for a lot of issues.
If he was Hinton's age then maybe he would also want to retire and be happy with transformers and LLMs. He is still an ambitious researcher that wants to do foundational research to get to the next paradigm.
Having said all of that, it is a misjudgement for him to be disparaging the incredible capabilities of LLMs to the degree he has.
Fire up Emacs and open a text file containing a lot of human-readable text. Something off Project Gutenberg, say. Then say M-x dissociated-press and watch it spew hilarious, quasi-linguistic garbage into a buffer for as long as you like.
Dissociated Press is a language model. A primitive, stone-knives-and-bearskins language model, but a language model nevertheless. When you feed it input text, it builds up a statistical model based on a Markov chain, assigning probabilities to each character that might occur next, given a few characters of input. If it sees 't' and 'h' as input, the most likely next character is probably going to be 'e', followed by maybe 'a', 'i', and 'o'. 'r' might find its way in there, but 'z' is right out. And so forth. It then uses that model to generate output text by picking characters at random given the past n input characters, resulting in a firehose of things that might be words or fragments of words, but don't make much sense overall.
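Here is roughly that model in code: a minimal character-level Markov chain in the spirit of Dissociated Press (it assumes a local "book.txt", e.g. something saved from Project Gutenberg):

    import random
    from collections import defaultdict

    def build_model(text, n=3):
        # Count which character follows each n-character context.
        counts = defaultdict(lambda: defaultdict(int))
        for i in range(len(text) - n):
            counts[text[i:i + n]][text[i + n]] += 1
        return counts

    def generate(counts, seed, length=200):
        # Sample the next character from the counts, over and over.
        n = len(seed)
        out = seed
        for _ in range(length):
            dist = counts.get(out[-n:])
            if not dist:
                break
            chars, weights = zip(*dist.items())
            out += random.choices(chars, weights=weights)[0]
        return out

    corpus = open("book.txt").read()   # e.g. something off Project Gutenberg
    model = build_model(corpus, n=3)
    print(generate(model, corpus[:3]))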
LLMs are doing the same thing. They're picking the next token (word or word fragment) given a certain number of previous tokens. And that's ALL they're doing. The only differences are really matters of scale: the tokens are larger than single characters, the model considers many, many more tokens of input, and the model is a huge deep-learning model with oodles more parameters than a simple Markov chain. So while Dissociated Press churns out obvious nonsensical slop, ChatGPT churns out much, much more plausible sounding nonsensical slop. But it's still just rolling them dice over and over and choosing from among the top candidates of "most plausible sounding next token" according to its actuarial tables. It doesn't think. Any thinking it appears to do has been pre-done by humans, whose thoughts are then harvested off the internet and used to perform macrodata refinement on the statistical model. Accordingly, if you ask ChatGPT a question, it may well be right a lot of the time. But when it's wrong, it doesn't know it's wrong, and it doesn't know what to do to make things right. Because it's just reaching into a bag of refrigerator magnet poetry tiles, weighted by probability of sounding good given the current context, and slapping whatever it finds onto the refrigerator. Over and over.
What I think Yann LeCun means by "energy" above is "implausibility". That is, the LLM would instead grab a fistful of tiles -- enough to form many different responses -- and from those start with a single response and then through gradient descent or something optimize it to minimize some statistical "bullshit function" for the entire response, rather than just choosing one of the most plausible single tiles each go. Even that may not fix the hallucination issue, but it may produce results with fewer obvious howlers.
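A toy sketch of that "score the whole response" idea; the candidates and the scoring function below are hand-made placeholders, since a real energy / "bullshit" function would have to be learned rather than written as a list of red-flag phrases:

    candidates = [
        "Lions are about 1.2 metres tall at the shoulder.",
        "Lions are roughly the size of a house cat.",
        "Lions are about 250 metres long.",
    ]

    def response_energy(text):
        # Placeholder "implausibility" score for the whole response.
        implausible_phrases = ["size of a house cat", "250 metres"]
        return sum(phrase in text for phrase in implausible_phrases)

    # Instead of committing to the most plausible token one step at a time,
    # score entire candidate responses and keep the lowest-energy one.
    best = min(candidates, key=response_energy)
    print(best)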
With a little bit of engineering and fine tuning, you could imagine a model producing a sequence of statements, then reflecting on the sequence and emitting updates like "statement 7, modify: xzy to xyz".
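A toy sketch of applying edits in that format; the "statement N, modify: X to Y" syntax is just the comment's example, parsed naively:

    import re

    statements = ["Paris is teh capital of France.", "Lions weigh about 190 kg."]
    edits = ["statement 1, modify: teh to the"]

    for edit in edits:
        # Parse "statement N, modify: OLD to NEW" and apply it to the numbered statement.
        m = re.match(r"statement (\d+), modify: (.+) to (.+)", edit)
        if m:
            idx, old, new = int(m.group(1)) - 1, m.group(2), m.group(3)
            statements[idx] = statements[idx].replace(old, new)

    print(statements[0])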