Yann LeCun | Self-Supervised Learning, JEPA, World Models, and the future of AI

www.youtube.com/watch?v=yUmDRxV0krg

Summary

In this presentation at the Harvard Center of Mathematical Sciences and Applications (CMSA), Yann LeCun, Chief AI Scientist at Meta and Turing Award laureate, outlines a roadmap for the next generation of Artificial Intelligence. He argues that current Large Language Model (LLM) architectures are fundamentally limited and proposes a shift toward "World Models" and Joint Embedding Predictive Architectures (JEPA).

The Limitations of Current AI Architectures

LeCun begins by highlighting the stark contrast between human/animal learning and current machine learning. Despite the success of LLMs, he identifies several critical flaws:

Data Inefficiency: LLMs require trillions of tokens—equivalent to hundreds of thousands of years of reading—to reach their current level. In contrast, a four-year-old child has processed a similar amount of data (roughly $10^{14}$ bytes) through visual observation, yet possesses a far superior understanding of the physical world.
Autoregressive Failures: Current models predict the next token in a sequence and feed their own predictions back in as input. This process is inherently divergent: the probability that a sequence remains correct decreases exponentially with its length, leading to "hallucinations" and a lack of logical consistency.
Lack of Physical Grounding: LLMs lack a "mental model" of reality. They cannot reason about gravity, inertia, or the outcomes of physical actions, which are concepts human infants grasp within the first months of life.
Fixed Computation: Standard neural networks use the same amount of computation for every token, whereas complex problems should require more "thinking time"—a distinction between "System 1" (instinctive) and "System 2" (deliberative) cognition.

The World Model and JEPA

The core of LeCun’s proposal is the Joint Embedding Predictive Architecture (JEPA). He argues against generative models that attempt to predict every pixel in a video, noting that most details (like the movement of leaves on a tree) are irrelevant and unpredictable.

Representation over Generation: Instead of reconstructing pixels, JEPA predicts the representation of the next state in an abstract space. This allows the system to ignore unpredictable noise while capturing essential structures.
Hierarchical Abstraction: Just as science uses different levels of abstraction (from quantum mechanics to cells to ecosystems), AI must learn a hierarchy of representations. This is essential for Hierarchical Planning—the ability to break a long-term goal (e.g., traveling from New York to Paris) into a series of sub-goals and specific muscle movements.

Energy-Based Models and Optimization

LeCun advocates for moving beyond simple feed-forward propagation toward Inference by Optimization. He describes an "Energy-Based Model" (EBM) where an energy function measures the incompatibility between an input and a potential output.

Inference as Search: Under this framework, the system does not just "blurt out" an answer; it searches for an output that minimizes energy (maximizes compatibility with the world model and task objectives).
Preventing Collapse: A major technical challenge in non-generative models is "collapse," where the system learns a trivial constant representation. LeCun discusses regularized methods (like VICReg and Dino) that prevent collapse by maximizing the information content in the representation space.

Practical Applications and Results

The talk highlights recent breakthroughs from Meta’s Fundamental AI Research (FAIR) lab:

Dino-v2: A self-supervised vision model that matches or surpasses supervised systems in image understanding using far less labeled data.
V-JEPA: A video-based model that learns intuitive physics and common sense by observing unlabelled video. It can detect "impossible" events, such as an object disappearing, by noting spikes in prediction error.
Robotic Planning: Demonstrations show how these world models allow robots to plan complex tasks (like navigation or object manipulation) "zero-shot," without specific reinforcement learning for every new task.

Future Directions: A Shift in Paradigm

LeCun concludes with several strategic recommendations for the AI research community:

Abandon Generative Models: Focus on JEPA for non-discrete signals like video and sensory data.
Use Regularized Methods: Move away from contrastive learning (which requires too many negative samples) toward methods that regularize representation volume.
Minimize Reinforcement Learning: RL is highly inefficient; instead, utilize world models to plan actions through optimization.
Objective-Driven AI: Build systems where behavior is dictated by hard-coded guardrails and task-specific cost functions, ensuring safety and controllability.

Transcript

Introduction

Mike Friedman: Welcome everyone. Can you hear me? I'm Mike Friedman, representing the Center of Mathematical Sciences and Applications at Harvard, and it's my great pleasure to be introducing Yann LeCun, Chief Scientist at Meta. We're running a conference at CMSA on the geometry of machine learning, and this is actually a lecture within that conference, but it's outside the CMSA building because we knew too many people would show up to hear Yann. So we were able to move it to the Science Center where it's appropriate.

As soon as we got Yann to agree to give this talk, all the other speakers accepted immediately. So thank you, Yann. It's the easiest conference to organize. Yann is one of these scientists that it would anesthetize the audience if I tried to go through his awards, and also I would need a script. So I'll just mention that he won the Turing Award with Bengio and Hinton a few years ago. I think of him interchangeably with the idea of convolutional neural nets. I'm a geometer, as a mathematician—you know, topologist and geometer—and I think that's something we share: a confidence in the geometric imagination. I know it's something that Yann has always tried to figure out how to weave into artificial intelligence, and it's a vein of exploration that I've greatly admired. So, I think we're all very much looking forward to this talk. So am I. And without further ado, let me turn the stage over to Yann.

Yann LeCun: Thank you so much. Well, I have a terrible confession to make, which is that I'm not a mathematician. I'm not really a computer scientist either. I never actually studied computer science. So I'm not exactly sure what I am, but I'm going to talk about machine learning. I was told this was a bit of a more general audience than the one at the workshop, so I made this a bit more of a wide-audience talk—still technical, but a little lightweight on the theory, that's for sure.

The Current State of AI and the Need for Better Learning

I want to talk about the future of AI and how we can make significant progress towards more intelligent machines beyond what they are currently capable of doing. And I tell you right now, there is a lot of work to do. We're nowhere near matching human intelligence or even animal intelligence with the type of techniques that we have access to at the moment.

So one big question we can ask ourselves is: do we actually need AI systems with human-level intelligence? And the answer is probably yes, because the future in which each of us walks around with AI assistants helping us in our daily lives at all times—perhaps in wearable devices like smart glasses like the ones I'm wearing at the moment—is coming. We'll be their boss. It's kind of like we'd be running around with a team of virtual people helping us at all times. And of course, for this, we need AI systems that have intelligence that is in some way similar to humans, because that's the kind of entity that we are most familiar with interacting with.

But the technology is nowhere near where it needs to be at the moment for that. The main issue is that current AI architectures and machine learning techniques suck compared to what we can observe in humans and animals. The type of efficiency in learning that we see in animals and humans is just astonishing, and we're not matching this at the moment in many instances.

Early on in machine learning, the main technique was supervised learning, and then there was a big fashion around reinforcement learning for a while. Now it's used a lot, of course, to fine-tune large models, but in themselves, those two techniques are really insufficient. The type of learning that we observe in humans and animals is very different. It's neither supervised nor reinforced for that matter. It's more like self-supervised learning, something that has really revolutionized AI and machine learning over the last few years. The underlying principles are very similar to supervised learning, but there is no clear difference between input and output.

This works astonishingly well for training a system to understand the structure of sequences of discrete symbols such as language, code, and mathematics. But the problem is that it only works for sequences of discrete symbols. It doesn't really work for natural signals yet. Self-supervised learning is starting to work there, but the techniques are very different, and that'll be the main topic of this talk.

There are other limitations with current AI architectures. The type of inference that they perform is basically feed-forward propagation through a fixed number of layers. That's computationally limited. There's a lot of functions you cannot represent efficiently by just stacking a fixed number of layers. Also, current architectures use autoregressive prediction. They use their own predictions as input to make further predictions, and that leads to divergence or "hallucination," as people call it.

The World Model Concept

Humans and animals have mental models of the world. Their behavior is driven by objectives, tasks, and goals. They can reason and plan complex action sequences—all things that chatbots and LLMs are essentially incapable of, or at least not at the level we'd like. We need systems that understand the physical world, have persistent memory, can plan complex actions, can reason (spending more time on difficult problems), and are controllable and safe.

Let's start with this idea of a "World Model." We have mental models of reality that allow us to predict what's going to happen, particularly as a consequence of our actions. This allows us to plan. This chart indicates at what age infants learn basic concepts, like object permanence—knowing objects don't just disappear—and category recognition. By nine months, infants learn basic intuitive physics like gravity, inertia, and conservation of momentum. If you show a six-month-old a cart pushed off a platform that appears to float, they won't pay much attention. A ten-month-old will be extremely surprised, because by then they've learned that objects are supposed to fall.

How do we get machines to learn like babies? We haven't solved that problem. We don't have domestic robots. We don't have level-five self-driving cars. We have systems that can pass the bar exam or solve math problems, but we don't have robots that can do what a cat can do or what a ten-year-old can do the first time they are told to clear a table. A 17-year-old can learn to drive in 20 hours without causing accidents, while we have millions of hours of training data and still don't have fully autonomous cars without specialized sensors and mapping. This is Moravec’s paradox: things that are intellectually challenging for humans (chess, integrals) are algorithmically simple, while things that are easy for humans (dexterity, common sense) are incredibly difficult for AI.

The Data Efficiency Gap

A typical large language model is trained on something like 30 trillion tokens (Llama 3). That's about $10^{14}$ bytes. It would take a human half a million years to read that. Compare this to a four-year-old child. A child has seen about 16,000 hours of "video" through their eyes. The optic nerve carries about 2 megabytes per second. Over 16,000 hours, that’s also about $10^{14}$ bytes.

A four-year-old has seen as much data as the biggest LLMs have read. Visual data is redundant, and that's exactly what you want for self-supervised learning. You need redundancy to learn structure. This tells us two things: first, we're never going to get to human-level AI by just training on text. It’s just not going to happen. Second, we need serious progress if we want useful robots. Current humanoid robots are impressive in videos, but they aren't smart enough to be useful except in narrow, carefully trained tasks.
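
The comparison above is a back-of-the-envelope calculation, and it can be reproduced in a few lines. The sketch below is not from the talk; the token count, waking hours, and optic-nerve rate are the figures quoted above, while the value of about 3 bytes per token is an assumption added here to convert tokens into bytes.

```python
# Back-of-the-envelope check of the figures quoted in the talk.
TOKENS = 30e12            # ~30 trillion training tokens (quoted above)
BYTES_PER_TOKEN = 3       # ASSUMPTION: rough average, not stated in the talk
HOURS_AWAKE = 16_000      # waking hours of a four-year-old (quoted above)
OPTIC_NERVE_BPS = 2e6     # ~2 megabytes per second (quoted above)

llm_bytes = TOKENS * BYTES_PER_TOKEN
child_bytes = HOURS_AWAKE * 3600 * OPTIC_NERVE_BPS

print(f"LLM text:     {llm_bytes:.2e} bytes")
print(f"Child vision: {child_bytes:.2e} bytes")
```

Under these assumptions both quantities come out on the order of $10^{14}$ bytes, which is the point of the comparison.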

Inference by Optimization

I mentioned the limitations of feed-forward propagation. A more powerful way to perform inference is through optimization. Instead of a net just propagating through layers to produce an output, imagine a system that extracts a representation and then has another machine with a single scalar output—an "Energy"—that measures the degree of incompatibility between the input and a proposed output.

If I put an image of an elephant and the label "elephant," I want the energy to be zero. If I put the label "table," I want the energy to be high. Inference, then, is a search: you search for an output that minimizes the energy. This is classical in AI for path planning, logic inference, and SAT problems. This allows for "zero-shot" problem-solving. It's a good model for "System 2" thinking—deliberate, slow reasoning.
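
To make "inference as search" concrete, here is a minimal sketch that is not from the talk. The prototype vectors, the labels, and the squared-distance energy are all hypothetical choices made for illustration; a real system would learn the energy function, and for continuous outputs the search would typically be gradient-based rather than an exhaustive scan.

```python
# Toy energy-based inference: pick the output y that minimizes E(x, y).
# The feature vectors and labels below are made up for illustration.
PROTOTYPES = {
    "elephant": (1.0, 0.0),
    "table": (0.0, 1.0),
    "cat": (0.7, 0.7),
}

def energy(x, label):
    """Squared distance between input features and the label's prototype.
    Low energy means compatible, high energy means incompatible."""
    prototype = PROTOTYPES[label]
    return sum((xi - pi) ** 2 for xi, pi in zip(x, prototype))

def infer(x):
    """Inference as search: return the label with the lowest energy."""
    return min(PROTOTYPES, key=lambda label: energy(x, label))

x = (0.9, 0.1)  # hypothetical features extracted from an elephant image
for label in PROTOTYPES:
    print(label, round(energy(x, label), 3))
print("inferred:", infer(x))
```

The network never emits a label directly here; the answer is whatever the search over candidate outputs finds.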

LLMs, by contrast, spend a fixed amount of computation per token. To make them "think" more, you have to trick them into producing more tokens (Chain of Thought). Also, autoregressive generation is a divergent process. The set of all possible sequences is a tree. Once a token takes you outside the sub-tree of correct answers, there is no way back. The probability of a sequence being correct decreases exponentially with length. This is why LLMs hallucinate. We don't produce answers by blurting one word after another; we have an abstract thought and then turn it into text.
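
The exponential decay can be illustrated numerically. The sketch below is not from the talk; it assumes that every token independently has the same probability of leaving the sub-tree of correct answers, which is a simplification, and the 1% error rate is a made-up value.

```python
def p_correct(per_token_error, length):
    """Probability that all `length` tokens stay inside the correct sub-tree,
    ASSUMING each token independently goes wrong with `per_token_error`."""
    return (1.0 - per_token_error) ** length

# Hypothetical 1% per-token error rate at increasing sequence lengths.
for n in (10, 100, 1000):
    print(n, p_correct(0.01, n))
```

Even a small per-token error rate leaves little chance of a fully correct long sequence under this independence assumption.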

Joint Embedding Predictive Architecture (JEPA)

One idea is to train a generative model to predict what happens next in a video. However, predicting at the pixel level is an impossible task. If you train a neural net to make a single prediction of a video, the best it can do is predict a blurry average of all possible futures. To handle natural video, you'd need to parameterize a distribution over high-dimensional continuous space, which is mathematically intractable.
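
The "blurry average" effect follows from the fact that the single prediction minimizing expected squared error is the mean of the possible outcomes. The sketch below is not from the talk; the two 5-pixel "futures" are hypothetical and stand in for a ball that ends up on the left or on the right with equal probability.

```python
# Two equally likely futures for a 5-pixel frame (pixel intensities).
futures = [
    [1.0, 0.0, 0.0, 0.0, 0.0],  # ball goes left
    [0.0, 0.0, 0.0, 0.0, 1.0],  # ball goes right
]

def expected_mse(prediction):
    """Average squared error of one fixed prediction over all futures."""
    total = 0.0
    for frame in futures:
        total += sum((p - f) ** 2 for p, f in zip(prediction, frame)) / len(frame)
    return total / len(futures)

# The minimizer of expected squared error is the pixel-wise mean.
mean_frame = [sum(column) / len(futures) for column in zip(*futures)]

print("mean frame:", mean_frame)
print("MSE of mean frame:", expected_mse(mean_frame))
print("MSE of committing to 'left':", expected_mse(futures[0]))
```

The mean frame scores better than committing to either outcome, yet it shows a half-intensity ball in both places, a frame that never actually occurs.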

The proposal is: don't predict at the pixel level; predict at the representation level. This is the JEPA (Joint Embedding Predictive Architecture). Instead of predicting all the pixels, we predict a representation of the pixels. We run the video through an encoder and train a predictor in that representation space. This abstract representation can eliminate details that are not predictable, making the task simpler.
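
The difference between measuring error in representation space and in input space can be shown with a toy example. This sketch is not from the talk and is not a JEPA implementation: in a real JEPA the encoder and predictor are learned, whereas here `encoder`, `predictor`, and `make_frame` are hypothetical hand-written functions, chosen so that each frame has a predictable part (position, velocity) and an unpredictable noise channel.

```python
import random

def encoder(frame):
    """Hypothetical encoder: keep the predictable part of the frame
    (position, velocity) and discard the noise channel."""
    position, velocity, _noise = frame
    return (position, velocity)

def predictor(rep):
    """Predict the next representation, assuming constant velocity."""
    position, velocity = rep
    return (position + velocity, velocity)

def sq_error(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def make_frame(t, rng):
    # Position and velocity are deterministic; the third entry is pure noise.
    return (2.0 * t, 2.0, rng.gauss(0.0, 1.0))

rng = random.Random(0)
x, y = make_frame(0, rng), make_frame(1, rng)

# JEPA-style loss: compare predicted and actual *representations*.
jepa_loss = sq_error(predictor(encoder(x)), encoder(y))

# Input-space loss for contrast: predict every entry of the next frame,
# including the noise, which no predictor can know in advance.
naive_next = (x[0] + x[1], x[1], x[2])
input_loss = sq_error(naive_next, y)

print("representation-space loss:", jepa_loss)
print("input-space loss:", input_loss)
```

Because the encoder drops the noise channel, the representation-space error reflects only what is predictable, while the input-space error is dominated by noise.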

This is how we apprehend the world. Science is the quest for representations that allow us to make predictions while ignoring details. To predict Jupiter's trajectory, you don't need to know the details of its surface; you only need six numbers: three positions and three velocities. Everything in this room could be described by quantum field theory, but that's impossible to compute. So we invent abstractions: atoms, molecules, cells, organisms, societies. Every level of science is defined by the abstraction level we choose to make predictions.

Hierarchical Planning and Cognitive Architectures

If we have a world model, how do we use it? The agent observes the world, combines perception with memory, and feeds it to the world model. The model takes an imagined sequence of actions and predicts the resulting states. These predicted states are fed to a "task objective" and "guardrails." The robot searches for an action sequence that satisfies those objectives. It cannot escape the guardrails because, by construction, it only takes actions that minimize the cost function.
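
The planning loop described above can be sketched on a toy problem. The example is not from the talk: the one-dimensional world, the action set, the cost weights, and the exhaustive search are all assumptions made for illustration (the talk describes search by optimization, not enumeration), and the world model here is known exactly instead of learned.

```python
from itertools import product

GOAL = 3
FORBIDDEN = {2}           # guardrail: states the agent must never enter
ACTIONS = (-1, 0, 1, 2)   # hypothetical action set: step sizes on a line

def world_model(state, action):
    """Predict the next state; in this toy the dynamics are known exactly."""
    return state + action

def rollout(state, actions):
    """Imagine the states that a sequence of actions would produce."""
    states = []
    for action in actions:
        state = world_model(state, action)
        states.append(state)
    return states

def cost(states, actions):
    """Task cost plus guardrail cost for an imagined trajectory."""
    if any(s in FORBIDDEN for s in states):
        return float("inf")                   # guardrail violated
    task = abs(states[-1] - GOAL)             # end close to the goal
    effort = 0.1 * sum(abs(a) for a in actions)
    return task + effort

def plan(state, horizon=3):
    """Exhaustive search over action sequences of length `horizon`."""
    return min(product(ACTIONS, repeat=horizon),
               key=lambda seq: cost(rollout(state, seq), seq))

best = plan(state=0)
print("plan:", best, "states:", rollout(0, best))
```

Any trajectory that touches a forbidden state has infinite cost, so the planner can never select it while a finite-cost alternative exists; this mirrors the "by construction" argument about guardrails.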

Ultimately, we need hierarchical world models. If I want to go from New York to Paris, I don't plan millisecond-by-millisecond muscle controls. I plan at a high level: go to the airport, catch a plane. Each high-level action becomes a sub-goal at a lower level (get a taxi, go to the elevator, stand up from the chair). There is a point where I don't need to plan; I just act (System 1). How we learn these appropriate levels of abstraction and plan hierarchically is completely unsolved. It is a wide-open problem for the next generation of researchers.

Self-Supervised Learning and Preventing Collapse

To train these models, we need a way to ensure the energy is low for observed data and high for unobserved data. If you only minimize the energy of training samples, the system might "collapse"—learning a flat energy surface where everything has zero energy.

There are two main ways to prevent this:

Contrastive Methods: You generate "negative" samples and push their energy up. This is hard to scale in high-dimensional spaces.
Regularized Methods: You use a term that minimizes the volume of space that can have low energy. When you push down on the training samples, the rest must go up.
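
One concrete regularizer of this kind is the variance term used in VICReg, which penalizes any embedding dimension whose standard deviation over a batch falls below a threshold. The sketch below is not from the talk and covers only that one term (the full method also has invariance and covariance terms); the batches and the values of `gamma` and `eps` are illustrative.

```python
import math

def variance_penalty(batch, gamma=1.0, eps=1e-4):
    """Hinge penalty on the per-dimension standard deviation of a batch of
    embeddings, in the style of the VICReg variance term."""
    n, dim = len(batch), len(batch[0])
    penalty = 0.0
    for j in range(dim):
        column = [z[j] for z in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / (n - 1)
        std = math.sqrt(var + eps)
        penalty += max(0.0, gamma - std)   # zero once the spread is large enough
    return penalty / dim

collapsed = [[0.5, 0.5]] * 4   # every input maps to the same point
spread = [[-1.5, 1.0], [1.5, -1.0], [0.5, 1.5], [-0.5, -1.5]]

print("collapsed:", variance_penalty(collapsed))
print("spread:   ", variance_penalty(spread))
```

A collapsed encoder pays close to the maximum penalty, while a batch with enough spread in every dimension pays none, so minimizing this term pushes the system away from the constant solution.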

I’ve become a fan of regularized methods. One example is the "Dino" model. It uses two encoders where one is a running average of the other (distillation). Somehow, this doesn't collapse, even though we don't fully understand why yet. Dino-v2 is a major success; it shows that self-supervised learning now matches or surpasses supervised learning in image understanding using less labeled data.
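
The "running average" relationship between the two encoders can be written as an exponential moving average of the weights. This is a minimal sketch that is not from the talk: the weights are plain lists of floats, and the momentum value is an illustrative assumption.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step toward the student weight."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]

teacher = [0.0, 0.0]     # hypothetical teacher weights
student = [1.0, -2.0]    # hypothetical student weights (held fixed here)
for step in range(3):
    teacher = ema_update(teacher, student)
    print(step, teacher)
```

In training, only the student would be updated by gradient descent; the teacher just trails it slowly through this averaging step.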

We can use these representations for world models. We've shown experiments where a robot uses a Dino encoder and a predictor to plan trajectories to move chips on a table or navigate to a trash can. These systems work "zero-shot" because they have a good world model.

Another recent model, V-JEPA, trains on video. It learns a representation by predicting masked parts of a video in representation space. It learns a level of "common sense" or intuitive physics. If shown a video where a ball disappears, the prediction error shoots up because the model knows that is impossible.
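
The idea of flagging impossible events through prediction error can be illustrated with a toy stand-in. This sketch is not from the talk and is not V-JEPA: the predictor is a hand-written constant-velocity rule acting on raw positions instead of a learned predictor in representation space, and the track and threshold are hypothetical.

```python
def prediction_errors(positions):
    """Constant-velocity predictor; returns the absolute error at each step
    from the third observation onward."""
    errors = []
    for t in range(2, len(positions)):
        predicted = 2 * positions[t - 1] - positions[t - 2]
        errors.append(abs(predicted - positions[t]))
    return errors

# Hypothetical 1-D track of a rolling ball; at step 5 it "vanishes"
# (encoded here as a jump back to position 0).
track = [0.0, 1.0, 2.0, 3.0, 4.0, 0.0, 0.0]
errors = prediction_errors(track)

# Flag steps whose error exceeds a made-up threshold of 1.0.
surprising = [t + 2 for t, e in enumerate(errors) if e > 1.0]
print("errors:", errors)
print("surprising steps:", surprising)
```

The error is zero while the ball moves as expected and jumps at the moment it vanishes; the step right after is flagged too, because the simple predictor's velocity estimate is thrown off by the jump.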

Summary of Recommendations

To get AI to the next level—human or even cat level—I recommend the following:

Abandon generative models in favor of Joint Embedding Predictive Architectures. Don't predict in input space; predict in representation space.
Use the energy-based framework to understand these systems. Probabilistic modeling is often unnecessary and leads to intractability.
Abandon contrastive methods in favor of regularized methods like VICReg or Dino.
Minimize the use of reinforcement learning. It is extremely inefficient. Use it only as a last resort.

These recommendations go against the most popular concepts in machine learning today. It doesn't make me very popular in some circles—I'm joking—but if you want to solve the big problems of AI, don't just work on LLMs. Work on JEPA.

Thank you very much.
