In exchange, he offers the idea of an 'energy minimization' architecture; as I understand it, this would have a concept of the 'energy' of an entire response, and training would try to minimize that.
Which is to say, I don't fully understand this. That said, I'm curious to hear what ML researchers think about LeCun's take, and whether there's any engineering being done around it. I can't find much after the release of I-JEPA from his group.
LeCun's argument is this:
1) You can't learn an accurate world model just from text.
2) Multimodal learning (vision, language, etc) and interaction with the environment is crucial for true learning.
He and people like Hinton and Bengio have been saying for a while that there are tasks that mice can understand that an AI can't, and that even achieving mouse-level intelligence would be a breakthrough, but we cannot get there through language learning alone.
A simple example from "How Large Are Lions? Inducing Distributions over Quantitative Attributes" (https://arxiv.org/abs/1906.01327) is this: Learning the size of objects using pure text analysis requires significant gymnastics, while vision demonstrates physical size more easily. To determine the size of a lion you'll need to read thousands of sentences about lions, or you could look at two or three pictures.
LeCun isn't saying that LLMs aren't useful. He's just concerned with bigger problems, like AGI, which he believes cannot be solved purely through linguistic analysis.
The energy minimization architecture is more about joint multimodal learning.
(Energy minimization is a very old idea. LeCun has been on about it for a while and it's less controversial these days. Back when everyone tried to have a probabilistic interpretation of neural models, it was expensive to compute the normalization term / partition function. Energy minimization basically said: Set up a sensible loss and minimize it.)
My mental model of AI advancements is that of a step function with s-curves in each step [1]. Each time there is an algorithmic advancement, people quickly rush to apply it to both existing and new problems, demonstrating quick advancements. Then we tend to hit a kind of plateau for a number of years until the next algorithmic solution is found. Examples of steps include AlexNet demonstrating superior image labeling, LeCun demonstrating deep learning, and now OpenAI demonstrating large transformer models.
I think in the past, at each stage, people tended to think that the recent progress was a linear or exponential process that would continue forward. This led to people thinking self-driving cars were right around the corner after the introduction of DL in the 2010s, and that super-intelligence is right around the corner now. I think at each stage, the cusp of the S-curve comes as we find where the model is good enough to be deployed, and where it isn't. Then companies tend to enter a holding pattern for a number of years getting diminishing returns from small improvements on their models, until the next algorithmic breakthrough is made.
Right now I would guess that we are around 0.9 on the S-curve: we can still improve the LLMs (as DeepSeek has shown with wide MoE and o1/o3 have shown with CoT), and it will take a few years for the best uses to be brought to market and popularized. As you mentioned, LeCun points out that LLMs have a hallucination problem built into their architecture; others have pointed out that LLMs have had shockingly few revelations and breakthroughs for something that has ingested more knowledge than any living human. I think future work on LLMs is likely to make some improvement on these things, but not much.
I don't know what it will be, but a new algorithm will be needed to induce the next step on the curve of AI advancement.
[1]: https://www.open.edu/openlearn/nature-environment/organisati...
The problem with LLMs is that the output is inherently stochastic - i.e. there isn't an "I don't have enough information" option. This is due to the fact that LLMs are basically just giant lookup maps with interpolation.
Energy minimization is more of an abstract approach where you can use architectures that don't rely on things like differentiability. True AI won't be solely feedforward architectures like current LLMs. To give an answer, they will basically determine an algorithm on the fly that includes computation and search. To learn that algorithm (or its parameters) at training time, you need something that doesn't rely on continuous values but still converges to the right answer. So instead you assign a fitness score, like memory use or compute cycles, and optimize based on that. This is basically how search works with genetic algorithms or PSO.
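As a rough illustration of that kind of gradient-free search, here is a minimal genetic-algorithm sketch; the fitness function is a made-up stand-in for something like memory use or compute cycles, not anything LeCun has proposed:

    import random

    def fitness(params):
        # Hypothetical stand-in for "memory use or compute cycles":
        # reward parameter vectors close to an arbitrary target.
        target = [3, 1, 4, 1, 5]
        return -sum((p - t) ** 2 for p, t in zip(params, target))

    def mutate(params, rate=0.3):
        return [p + random.choice([-1, 0, 1]) if random.random() < rate else p
                for p in params]

    # Random initial population of discrete "algorithm parameters".
    population = [[random.randint(0, 9) for _ in range(5)] for _ in range(20)]

    for generation in range(100):
        # Rank by fitness; no gradients or differentiability required.
        population.sort(key=fitness, reverse=True)
        survivors = population[:5]
        # Refill the population with mutated copies of the survivors.
        population = survivors + [mutate(random.choice(survivors)) for _ in range(15)]

    best = max(population, key=fitness)
    print(best, fitness(best))

Selection pressure on the fitness score does the work that backprop would normally do, which is the point: nothing here needs to be differentiable.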
To answer your question, think about how we train LLMs: We have them learn the statistical distribution of all written human language, such that given a chunk of text (a prompt, etc.) it then samples its output distribution to produce the next most likely token (word, sub-word, etc.) that should be produced and keeps doing that. It never learns how to judge what is true or false and during training it never needs to learn "Do I already know this?" It is just spoon-fed information that it has to memorize and has no ability to acquire metacognition, which is something that it would need to be trained to attain. As humans, we know what we don't know (to an extent) and can identify when we already know something or don't already know something, such that we can say "I don't know." During training, an LLM is never taught to do this sort of introspection, so it never will know what it doesn't know.
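A toy sketch of that sampling loop; the probability table below is hand-written and conditions only on the previous token, whereas a real LLM conditions on the whole context, but the mechanics are the point:

    import random

    # Toy stand-in for a trained model: hand-written next-token probabilities
    # keyed on the previous token only (a real LLM conditions on the whole context).
    next_token_probs = {
        "<s>": {"the": 0.6, "a": 0.4},
        "the": {"cat": 0.5, "dog": 0.5},
        "a":   {"cat": 0.5, "dog": 0.5},
        "cat": {"sat": 0.7, "ran": 0.3},
        "dog": {"sat": 0.3, "ran": 0.7},
        "sat": {"<end>": 1.0},
        "ran": {"<end>": 1.0},
    }

    tokens = ["<s>"]
    while tokens[-1] != "<end>":
        dist = next_token_probs[tokens[-1]]
        # Sample the next token from the distribution; note there is no
        # "I don't know" option, only more or less probable continuations.
        tokens.append(random.choices(list(dist), weights=dist.values())[0])

    print(" ".join(tokens[1:-1]))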
I have a bunch of ideas about how to address this with a new architecture and a lifelong learning training paradigm, but it has been hard to execute. I'm an AI professor, but really pushing the envelope in that direction requires I think a small team (10-20) of strong AI scientists and engineers working collaboratively and significant computational resources. It just can't be done efficiently in academia where we have PhD student trainees who all need to be first author and work largely in isolation. By the time AI PhD students get good, they graduate.
I've been trying to find the time to focus on getting a start-up going focused on this. With Terry Sejnowski, I pitched my ideas to a group affiliated with Schmidt Sciences that funds science non-profits at around $20M per year for 5 years. They claimed to love my ideas, but didn't go for it....
Will Titans be sufficiently "neuroplastic" to escape that? Maybe, I'm not sure.
Ultimately, I think what will be required is an architecture built around "looping", where the model's outputs are both some form of "self update" and "optional actionality", such that interacting with the model is more like "sampling from a thought space".
These long-horizon (AGI) problems have been there since the very beginning. We have never had a solution to them. RL assumes we know the future, which is a poor proxy. These energy-based methods fundamentally do very little that an RNN didn't do long ago.
I worked on higher-dimensionality methods, which is a very different angle. My take is that it's about the way we scale dependencies between connections. The human brain makes and breaks a massive number of neuron connections daily. Scaling the dimensionality would imply that a single connection could be scaled to encompass significantly more "thoughts" over time.
Additionally, the true solution to these problems is as likely to be found by a kid with a laptop as by a top researcher. If you find the solution to CL on a small AI model (MNIST), you solve it at all scales.
https://arxiv.org/abs/2502.09992
https://www.inceptionlabs.ai/news
(these are results from two different teams/orgs)
It sounds kind of like what you're describing, and nobody else has mentioned it yet, so take a look and see whether it's relevant.
I think what LeCun is probably getting at is that there's currently no way for a model to say "I don't know". Instead, it'll just do its best. For esoteric topics, this can result in hallucinations; for topics where you push just past the edge of well-known and easy-to-Google, you might get a vacuously correct response (i.e. repetition of correct but otherwise known or useless information). The models are trained to output a response that meets the criteria of quality as judged by a human, but there's no decent measure (that I'm aware of) of the accuracy of the knowledge content, or the model's own limitations. I actually think this is why programming and mathematical tasks have such a large impact on model performance: because they encode information about correctness directly into the task.
So Yann is probably right, though I don't know that energy minimization is a special distinction that needs to be added. Any technique that we use for this task could almost certainly be framed as energy minimization of some energy function.
I say schedule because the “static data once through” approach is, in my mind, one of the root problems.
Think about what happens when you read something like a book. You’re not “just” reading it, you’re also comparing it to other books, other books by the same author, while critically considering the book recommendations made by your friend. Any events in the book get compared to your life experience, etc…
LLM training does none of this! It’s a once-through text prediction training regime.
What this means in practice is that an LLM can’t write a review of a book unless it has read many reviews already. They have, of course, but the problem doesn’t go away. Ask an AI to critique book reviews and it’ll run out of steam because it hasn’t seen many of those. Critiques of critiques is where they start falling flat on their face.
This kind of meta-knowledge is precisely what experts accumulate.
As a programmer I don’t just regurgitate code I’ve seen before with slight variations; instead I know that mainstream criticisms of microservices miss their key benefit of extreme team scalability!
This is the crux of it: when humans read their training material they are generating an “n+1” level in their mind that they also learn. The current AI training setup trains the AI only the “n”th level.
This can be solved by running the training in a loop for several iterations after base training. The challenge of course is to develop a meaningful loss function.
IMHO the “thinking” model training is a step in the right direction but nowhere near enough to produce AGI all by itself.
The paper would be a strong argument against your point: if neural architectures are already constraining the amount of information that a text generation system delivers the same way a human (allegedly) does, then I don't see which "energy" measure one could take that could perform any better.
Then again, perhaps they have one in mind and I just haven't read it.
If two nodes are on, but the connection between them is negative, this causes energy to be higher.
If one of those nodes switches off, energy is reduced.
With two nodes this is trivial. With 10 nodes it's more difficult to solve, and with billions of nodes it is impossible to "solve".
All you can do then is try to get the energy as low as possible.
This way neural networks can also find "new" information that they have not learned directly, but that is consistent with the constraints they have learned about the world so far.
I learned it from: https://youtube.com/playlist?list=PLLHTzKZzVU9eaEyErdV26ikyo...
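A minimal sketch of that energy bookkeeping for a tiny network; the weights and states are made up purely for illustration:

    import itertools

    # States are 0/1 ("off"/"on"); weights[(i, j)] is the connection between nodes i and j.
    # A negative weight between two "on" nodes raises the energy, as described above.
    weights = {(0, 1): -2.0, (0, 2): 1.0, (1, 2): 0.5}

    def energy(states):
        return -sum(w * states[i] * states[j] for (i, j), w in weights.items())

    print(energy((1, 1, 1)))  # both ends of the negative connection are on -> higher energy
    print(energy((1, 0, 1)))  # switch node 1 off -> lower energy

    # With 3 nodes we can still brute-force the minimum; with billions we cannot,
    # so all we can do is try to push the energy as low as possible.
    best = min(itertools.product([0, 1], repeat=3), key=energy)
    print(best, energy(best))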
Yann LeCun, and Michael Bronstein and his colleagues have some similarities in trying to properly Sciencify Deep Learning.
Yann LeCun's approach, at least for vision, has one core tenet: energy minimization, just like in physics. In his course, he also shows some current architectures/algorithms to be special cases of EBMs.
Yann believes that understanding the whys of the behavior of DL algorithms is going to be beneficial in the long term, rather than playing around with hyper-params.
There is also a case for language being too low-dimensional to lead to AGI even if it is solved. Like, in a recent video, he said that the total amount of data in all digitized books and on the internet is about the same as what a human child takes in during the first 4 or 5 years of life. He considers this low.
There are also epistemological arguments that language alone cannot lead to AGI, but I haven't heard him talk about them.
He also believes that vision is a more important aspect of intelligence, one reason being that it is very high-dimensional. (Edit) Consider an example. Take 4 monochrome pixels. All pixels can range from 0 to 255, so 4 pixels can create 256^4 = 2^32 combinations. 4 words can create 4! = 24 combinations. Solving language is easier and therefore low-stakes. Remember the monkey producing a Shakespeare play by randomly punching typewriter keys? If that was an astronomically big number, think how obscenely long it would take a monkey to paint the Mona Lisa by randomly assigning pixel values. Left as an exercise to the reader.
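The arithmetic in that example, for anyone who wants to check it (note the word count is orderings of four fixed words, as in the comment):

    import math

    pixels = 256 ** 4                  # 4 monochrome pixels, 256 intensity levels each
    print(pixels, pixels == 2 ** 32)   # 4294967296 True

    orderings = math.factorial(4)      # orderings of 4 fixed words
    print(orderings)                   # 24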
Juergen Schmidhuber has gone rather quiet now. But he has also said that a world model, explicitly included in training and reasoning, is better than only text or images or whatever. He has a good paper with Lucas Beyer.
I may have an actual opinion on his viewpoint, however, I have a nitpick even before that.
How exactly is 'LLM' defined here? Even if some energy-based thing is done, would some not call even that an LLM? If/when someone finds a way to fix it within the 'token choice' method, could some people not just start calling it something different from 'LLM'?
I think Yann needs to rephrase what exactly he wants to say.
This is obviously an extremely high level simplification, but that's the core of it.
The physics of human consciousness are not implemented in a leaky symbolic abstraction but in the raw physics of existence.
The sort of autonomous system we imagine when thinking AGI must be built directly into substrate and exhibit autonomous behavior out of the box. Our computers are blackboxes made in a lab without centuries of evolving in the analog world, finding a balance to build on. They either can do a task or cannot. Obviously from just looking at one we know how few real world tasks it can just get up and do.
Code isn’t magic, it’s instruction to create a machine state. There’s no inherent intelligence to our symbolic logic. It’s an artifact of intelligence. It cannot imbue intelligence into a machine.
For example, if a prompt is: “what is the Statue of Liberty”, the LLM’s first output token is going to be “the”, but it kinda already “knows” that the next ones are going to be “statue of liberty”.
So to me LLMs already “choose” a response path from the first token.
Conversely, an LLM that would try to find a minimum energy for the whole response wouldn’t necessarily stop hallucinating. There is nothing in the training of a model that says that “I don’t know” has a lower “energy” than a wrong answer…
As a result, you'll never be able to get 100% consistent outputs or behavior (like you hypothetically can with a traditional algorithm/business logic). And that has proven out in usage across every model I've worked with.
There's also an upper-bound problem in terms of context where every LLM hits some arbitrary amount of context that causes it to "lose focus" and develop a sort of LLM ADD. This is when hallucinations and random, unrequested changes get made and a previously productive chat spirals to the point where you have to start over.
Assuming that text-only models will hit a bottleneck, then to have next-generation models, in addition to a new architecture, do we also have to find a dataset that is even more generic and much richer in modalities, and an architecture able to natively ingest it?
However, something that is not predictable is how much further the emergent properties can scale with model size. Maybe a few more unlocks, like the model being able to retain information well in spite of really large context lengths, or the ability to SFT on super complex reasoning tasks without disrupting the weights enough to lose the unsupervised learning, might take us much further?
The short answer should be that it's obvious LLM training and inference are both ridiculously inefficient and biologically implausible, and therefore there have to be some big optimization wins still on the table.
I don't know about you, but I certainly don't generate text autoregressively, token by token. Also, pretty sure I don't learn by global updates based on taking the derivative of some objective function of my behavior with respect to every parameter defining my brain. So there's good biological reason to think we can go beyond the capabilities of current architectures.
I think probably an example of the kind of new architectures he supports is FB's Large Concept Models [1]. It's still a self-attention, autoregressive architecture, but the unit of regression is a sentence rather than a token. It maps sentences into a latent space via an autoencoder architecture, then has a transformer architecture in which the tokens are elements in that latent space.
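For a concrete feel of the sentence-level idea, here is my own rough PyTorch sketch of the general shape (this is not FB's actual LCM code; the 300-dim "sentence features", the layer sizes, and the simple linear head are all invented for illustration):

    import torch
    import torch.nn as nn

    d_latent = 64

    # Autoencoder that maps per-sentence features into a latent space and back.
    sentence_encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, d_latent))
    sentence_decoder = nn.Sequential(nn.Linear(d_latent, 128), nn.ReLU(), nn.Linear(128, 300))

    # Autoregressive transformer whose "tokens" are sentence latents, not words.
    layer = nn.TransformerEncoderLayer(d_model=d_latent, nhead=4, batch_first=True)
    latent_lm = nn.TransformerEncoder(layer, num_layers=2)
    next_latent_head = nn.Linear(d_latent, d_latent)

    # Fake "sentence features" (e.g. bag-of-words vectors) for an 8-sentence document.
    sentences = torch.randn(1, 8, 300)

    latents = sentence_encoder(sentences)                          # (1, 8, d_latent)
    causal_mask = torch.triu(torch.full((8, 8), float("-inf")), diagonal=1)
    contextual = latent_lm(latents, mask=causal_mask)              # causal over sentences
    predicted_next = next_latent_head(contextual[:, -1])           # latent for sentence 9
    print(sentence_decoder(predicted_next).shape)                  # back to sentence-feature space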
Disclosure: I am the author of this paper.
Reference: (PDF) Hydra: Enhancing Machine Learning with a Multi-head Predictions Architecture. Available from: https://www.researchgate.net/publication/381009719_Hydra_Enh... [accessed Mar 14, 2025].
Attention works, yes. But it is not biologically plausible at all. We don't do quadratic comparisons across a whole book or need to see thousands of samples to understand.
Personally I think that in the future recursive architectures and test time training will have a better chance long term than current full attention.
Also, I think that OpenAI's biggest contribution is demonstrating that reasoning-like behaviors can emerge from really good language modelling.
The token approach is inherently flawed because the tokens pre-suppose unique meaning when in fact they may not be unique.
Said another way, it lacks properties that would be able to differentiate true from false because the differentiating input isn't included and cannot be derived from the inputs given. This goes to decidability.
Which is just to say, it feels to me like there's a danger that the stochastic nature of outputs is fundamental to true creative intelligence and all attempts to stamp it out will result in lower accuracy overall. Rather we should be treating it more like we do actual humans and expect errors and put layers of process around things where it matters to make them safe.
Top-end LLMs write better and faster than most humans.
Top-end stable diffusion models can draw and render video much faster and with much more precision than the best human artists.
The main problem with the current approach is that to grow abilities you need to add more neurons, and this is not just energy consuming but also knowledge consuming: at the GPT-4 level, all the text sources of humanity are already exhausted and the model becomes essentially overfitted. So it looks like multimodal models appear not because they are so good, but because they can learn from additional sources (audio/video).
I have seen a few approaches to overcome the problem of overfitting, but as I understand it no universal solution exists.
For example, one approach tried to create synthetic training data from current texts, but this idea is limited by definition.
So current LLMs appear to have hit a dead end, and researchers are now trying to find an exit from it. I believe that in the next few years somebody will invent some universal solution (probably a complex of approaches) or suggest another architecture, and the progress of AI will continue.
There’s no real rule worthy of any respect imho that LLMs can’t be configured to get additional input data from images, audio, proprioception sensors, and any other modes. I can easily write a script to convert such data into tokens in any number of ways that would allow them to be fed in as tokens of a “language.” Convolutions for example. A real expert could do it even more easily or do a better job. And then don’t LeCun’s objections just evaporate? I don’t see why he thinks he has some profound point. For god’s sake, our own senses are heavily attenuated and mediated and it’s not like we actually experience raw reality ourselves, ever; we just feel like we do. LLMs can be extended to be situated. So much can be done. It’s like he’s seeing HTTP in 1993 and saying it won’t be enough for the full web… well duh, but it’s a great start. Now go build on it.
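To make the "just tokenize it" point concrete, here is a crude sketch that bins raw sensor readings into a small discrete vocabulary; the bin-token scheme is invented for illustration, and real systems use learned codebooks or patch embeddings rather than fixed bins:

    import random

    def sensor_to_tokens(samples, n_bins=256, lo=-1.0, hi=1.0):
        tokens = []
        for x in samples:
            x = min(max(x, lo), hi)                        # clamp to the expected range
            bin_id = int((x - lo) / (hi - lo) * (n_bins - 1))
            tokens.append(f"<sensor_{bin_id}>")            # one "word" per quantization bin
        return tokens

    waveform = [random.uniform(-1, 1) for _ in range(10)]  # stand-in for audio/proprioception
    print(sensor_to_tokens(waveform)[:5])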
If anything the flaw in LLMs is how they maintain only one primary thread of prediction. But this is changing; having a bunch of threads working on the same problem and checking each other from different angles of the problem will be an obvious fix for a lot of issues.
If he was Hinton's age then maybe he would also want to retire and be happy with transformers and LLMs. He is still an ambitious researcher that wants to do foundational research to get to the next paradigm.
Having said all of that, it is a misjudgement for him to be disparaging the incredible capabilities of LLMs to the degree he has.
Fire up Emacs and open a text file containing a lot of human-readable text. Something off Project Gutenberg, say. Then say M-x dissociated-press and watch it spew hilarious, quasi-linguistic garbage into a buffer for as long as you like.
Dissociated Press is a language model. A primitive, stone-knives-and-bearskins language model, but a language model nevertheless. When you feed it input text, it builds up a statistical model based on a Markov chain, assigning probabilities to each character that might occur next, given a few characters of input. If it sees 't' and 'h' as input, the most likely next character is probably going to be 'e', followed by maybe 'a', 'i', and 'o'. 'r' might find its way in there, but 'z' is right out. And so forth. It then uses that model to generate output text by picking characters at random given the past n input characters, resulting in a firehose of things that might be words or fragments of words, but don't make much sense overall.
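Here is roughly that model in code: a minimal character-level Markov chain in the spirit of Dissociated Press (it assumes a local "book.txt", e.g. something saved from Project Gutenberg):

    import random
    from collections import defaultdict

    def build_model(text, n=3):
        # Count which character follows each n-character context.
        counts = defaultdict(lambda: defaultdict(int))
        for i in range(len(text) - n):
            counts[text[i:i + n]][text[i + n]] += 1
        return counts

    def generate(counts, seed, length=200):
        # Sample the next character from the counts, over and over.
        n = len(seed)
        out = seed
        for _ in range(length):
            dist = counts.get(out[-n:])
            if not dist:
                break
            chars, weights = zip(*dist.items())
            out += random.choices(chars, weights=weights)[0]
        return out

    corpus = open("book.txt").read()   # e.g. something off Project Gutenberg
    model = build_model(corpus, n=3)
    print(generate(model, corpus[:3]))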
LLMs are doing the same thing. They're picking the next token (word or word fragment) given a certain number of previous tokens. And that's ALL they're doing. The only differences are really matters of scale: the tokens are larger than single characters, the model considers many, many more tokens of input, and the model is a huge deep-learning model with oodles more parameters than a simple Markov chain. So while Dissociated Press churns out obvious nonsensical slop, ChatGPT churns out much, much more plausible sounding nonsensical slop. But it's still just rolling them dice over and over and choosing from among the top candidates of "most plausible sounding next token" according to its actuarial tables. It doesn't think. Any thinking it appears to do has been pre-done by humans, whose thoughts are then harvested off the internet and used to perform macrodata refinement on the statistical model. Accordingly, if you ask ChatGPT a question, it may well be right a lot of the time. But when it's wrong, it doesn't know it's wrong, and it doesn't know what to do to make things right. Because it's just reaching into a bag of refrigerator magnet poetry tiles, weighted by probability of sounding good given the current context, and slapping whatever it finds onto the refrigerator. Over and over.
What I think Yann LeCun means by "energy" above is "implausibility". That is, the LLM would instead grab a fistful of tiles -- enough to form many different responses -- and from those start with a single response and then through gradient descent or something optimize it to minimize some statistical "bullshit function" for the entire response, rather than just choosing one of the most plausible single tiles each go. Even that may not fix the hallucination issue, but it may produce results with fewer obvious howlers.
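A toy sketch of that "score the whole response" idea; the candidates and the scoring function below are hand-made placeholders, since a real energy / "bullshit" function would have to be learned rather than written as a list of red-flag phrases:

    candidates = [
        "Lions are about 1.2 metres tall at the shoulder.",
        "Lions are roughly the size of a house cat.",
        "Lions are about 250 metres long.",
    ]

    def response_energy(text):
        # Placeholder "implausibility" score for the whole response.
        implausible_phrases = ["size of a house cat", "250 metres"]
        return sum(phrase in text for phrase in implausible_phrases)

    # Instead of committing to the most plausible token one step at a time,
    # score entire candidate responses and keep the lowest-energy one.
    best = min(candidates, key=response_energy)
    print(best)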
With a little bit of engineering and fine tuning, you could imagine a model producing a sequence of statements, then reflecting on the sequence and emitting updates like "statement 7, modify: xzy to xyz".
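A toy sketch of applying edits in that format; the "statement N, modify: X to Y" syntax is just the comment's example, parsed naively:

    import re

    statements = ["Paris is teh capital of France.", "Lions weigh about 190 kg."]
    edits = ["statement 1, modify: teh to the"]

    for edit in edits:
        # Parse "statement N, modify: OLD to NEW" and apply it to the numbered statement.
        m = re.match(r"statement (\d+), modify: (.+) to (.+)", edit)
        if m:
            idx, old, new = int(m.group(1)) - 1, m.group(2), m.group(3)
            statements[idx] = statements[idx].replace(old, new)

    print(statements[0])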