V-JEPA: AI Model Learning Physical Intuition from Videos

Published Date: 4/10/2025

Researchers at Meta have developed V-JEPA, an AI system that learns about the world through videos and demonstrates a sense of 'surprise' when presented with physically implausible scenarios, much like infants.

Here’s a test for infants: Show them a glass of water on a desk. Hide it behind a wooden board. Now move the board toward the glass. If the board keeps going past the glass, as if it weren’t there, are they surprised? Many 6-month-olds are, and by a year, almost all children have an intuitive notion of an object’s permanence, learned through observation. Now some artificial intelligence models do too.

Researchers have developed an AI system that learns about the world via videos and demonstrates a notion of “surprise” when presented with information that goes against the knowledge it has gleaned. The model, created by Meta and called Video Joint Embedding Predictive Architecture (V-JEPA), does not make any assumptions about the physics of the world contained in the videos. Nonetheless, it can begin to make sense of how the world works.

“Their claims are, a priori, very plausible, and the results are super interesting,” says Micha Heilbron, a cognitive scientist at the University of Amsterdam who studies how brains and artificial systems make sense of the world.

As the engineers who build self-driving cars know, it can be hard to get an AI system to reliably make sense of what it sees. Most systems designed to “understand” videos in order to either classify their content (“a person playing tennis,” for example) or identify the contours of an object — say, a car up ahead — work in what’s called “pixel space.” The model essentially treats every pixel in a video as equal in importance.

But these pixel-space models come with limitations. Imagine trying to make sense of a suburban street. If the scene has cars, traffic lights, and trees, the model might focus too much on irrelevant details such as the motion of the leaves. It might miss the color of the traffic light, or the positions of nearby cars. “When you go to images or video, you don’t want to work in [pixel] space because there are too many details you don’t want to model,” said Randall Balestriero, a computer scientist at Brown University.

The V-JEPA architecture, released in 2024, is designed to avoid these problems. While the specifics of the various artificial neural networks that make up V-JEPA are complex, the basic concept is simple. Ordinary pixel-space systems go through a training process that involves masking some pixels in the frames of a video and training neural networks to predict the values of those masked pixels. V-JEPA also masks portions of video frames. But it doesn’t predict what’s behind the masked regions at the level of individual pixels. Rather, it uses higher levels of abstraction, or “latent” representations, to model the content.

Latent representations capture only essential details about data. For example, given line drawings of various cylinders, a neural network called an encoder can learn to convert each image into numbers representing fundamental aspects of each cylinder, such as its height, width, orientation, and location. By doing so, the information contained in hundreds or thousands of pixels is converted into a handful of numbers — the latent representations. A separate neural network called a decoder then learns to convert the cylinder’s essential details into an image of the cylinder.
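
To make the cylinder example concrete, here is a minimal sketch in Python. It is hand-written rather than learned, it simplifies each cylinder drawing to an upright filled rectangle on a small grid, and it leaves out orientation. The function names `encode` and `decode`, the 16-by-16 grid, and the four-number latent are assumptions made for illustration; none of this comes from V-JEPA itself.

```python
# Toy illustration only: a hand-written "encoder" and "decoder" for upright
# rectangles standing in for cylinder drawings. Real encoders and decoders
# are learned neural networks; nothing here comes from the V-JEPA code.

def decode(latent, size=16):
    """Turn (top, left, height, width) into a size x size binary image."""
    top, left, height, width = latent
    return [
        [1 if top <= r < top + height and left <= c < left + width else 0
         for c in range(size)]
        for r in range(size)
    ]

def encode(image):
    """Summarize a binary image of one rectangle as four numbers."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    top, left = rows[0], cols[0]
    return (top, left, rows[-1] - top + 1, cols[-1] - left + 1)

image = decode((2, 5, 9, 4))      # 256 pixel values
latent = encode(image)            # 4 numbers
print(latent)                     # (2, 5, 9, 4)
print(decode(latent) == image)    # True
```

The point of the sketch is the compression: 256 pixel values collapse to four numbers, and those four numbers are enough to redraw the image.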

V-JEPA focuses on creating and reproducing latent representations. At a high level, the architecture is split into three parts: encoder 1, encoder 2, and a predictor. First, the training algorithm takes a set of video frames, masks the same set of pixels in all frames, and feeds the frames into encoder 1. Sometimes, the final few frames of the video are fully masked. Encoder 1 converts the masked frames into latent representations. The algorithm also feeds the unmasked frames in their entirety into encoder 2, which converts them into another set of latent representations.

Now the predictor gets into the act. It uses the latent representations produced by encoder 1 to predict the output of encoder 2. In essence, it takes latent representations generated from masked frames and predicts the latent representations generated from the unmasked frames. By re-creating the relevant latent representations, and not the missing pixels of earlier systems, the model learns to see the cars on the road and not fuss about the leaves on the trees.
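
The flow of data through the three parts can be sketched in a few lines of Python. This is a toy under loud assumptions: the "networks" are fixed stand-in functions rather than trained models, each frame is a short list of patch values, one stand-in function plays both encoders, and the loss is a mean absolute error chosen for simplicity. The names `mask_frames`, `predictor`, and `latent_loss` are hypothetical.

```python
# Toy data flow for one training step. The "networks" are fixed stand-in
# functions (assumptions for illustration), not V-JEPA's actual models.

def mask_frames(frames, masked_idx, fill=0.0):
    """Hide the same patch positions in every frame."""
    return [[fill if i in masked_idx else v for i, v in enumerate(f)]
            for f in frames]

def encoder(frames):
    """Stand-in encoder: one latent pair (mean, range) per frame."""
    return [(sum(f) / len(f), max(f) - min(f)) for f in frames]

# The sketch reuses one stand-in for both encoders; in the article they
# are two separate networks.
encoder_1 = encoder
encoder_2 = encoder

def predictor(context_latents, scale=1.0):
    """Stand-in predictor: guess target latents from context latents."""
    return [(scale * m, scale * r) for m, r in context_latents]

def latent_loss(predicted, target):
    """Mean absolute error, computed on latents rather than pixels."""
    diffs = [abs(p - t) for pl, tl in zip(predicted, target)
             for p, t in zip(pl, tl)]
    return sum(diffs) / len(diffs)

video = [[0.1, 0.4, 0.8, 0.3], [0.2, 0.5, 0.7, 0.3], [0.3, 0.6, 0.6, 0.3]]
masked = mask_frames(video, masked_idx={1, 2})

context = encoder_1(masked)    # encoder 1 sees the masked frames
target = encoder_2(video)      # encoder 2 sees the full frames
loss = latent_loss(predictor(context), target)
print(f"latent prediction loss: {loss:.3f}")
```

Training would adjust the networks to shrink this loss. How each network is updated is outside the scope of the sketch.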

“This enables the model to discard unnecessary … information and focus on more important aspects of the video,” said Quentin Garrido, a research scientist at Meta. “Discarding unnecessary information is very important and something that V-JEPA aims at doing efficiently.”

Once this pretraining stage is complete, the next step is to tailor V-JEPA to accomplish specific tasks such as classifying images or identifying actions depicted in videos. This adaptation phase requires some human-labeled data. For example, videos have to be tagged with information about the actions contained in them. The adaptation for the final tasks requires much less labeled data than if the whole system had been trained end to end for specific downstream tasks. In addition, the same encoder and predictor networks can be adapted for different tasks.
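
The article does not say how the adaptation is carried out. One common approach, assumed here purely for illustration, is to keep the pretrained encoder frozen and fit a small classifier on its outputs using a handful of labeled clips. In the Python sketch below, `frozen_encoder` is a hand-written stand-in for a pretrained encoder, and the nearest-centroid classifier and the two labels are hypothetical.

```python
# Assumed illustration: a frozen stand-in encoder plus a tiny classifier
# fitted on a handful of labeled clips. Not V-JEPA's actual adaptation code.

def frozen_encoder(clip):
    """Stand-in latent: (mean brightness, mean frame-to-frame change)."""
    means = [sum(f) / len(f) for f in clip]
    change = sum(abs(b - a) for a, b in zip(means, means[1:])) / (len(means) - 1)
    return (sum(means) / len(means), change)

def fit_centroids(labeled_clips):
    """Average the latents of each class; only this step uses labels."""
    sums = {}
    for clip, label in labeled_clips:
        z = frozen_encoder(clip)
        total, count = sums.get(label, ((0.0, 0.0), 0))
        sums[label] = ((total[0] + z[0], total[1] + z[1]), count + 1)
    return {lab: (t[0] / n, t[1] / n) for lab, (t, n) in sums.items()}

def classify(clip, centroids):
    """Assign the label whose centroid is nearest in latent space."""
    z = frozen_encoder(clip)

    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(z, centroids[label]))

    return min(centroids, key=sq_dist)

still = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
moving = [[0.1, 0.1], [0.5, 0.5], [0.9, 0.9]]
centroids = fit_centroids([(still, "static"), (moving, "motion")])

new_clip = [[0.2, 0.2], [0.6, 0.6], [1.0, 1.0]]
print(classify(new_clip, centroids))    # motion
```

Only `fit_centroids` touches the labels, which is why two labeled clips are enough in this toy: the encoder has already done the work of turning pixels into useful numbers.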

In February, the V-JEPA team reported how well their systems understood the intuitive physical properties of the real world, such as object permanence, the constancy of shape and color, and the effects of gravity and collisions. On a test called IntPhys, which requires AI models to identify if the actions happening in a video are physically plausible or implausible, V-JEPA was nearly 98% accurate. A well-known model that predicts in pixel space was only a little better than chance.

The V-JEPA team also explicitly quantified the “surprise” exhibited by their model when its prediction did not match observations. They took a V-JEPA model pretrained on natural videos, fed it new videos, then mathematically calculated the difference between what V-JEPA expected to see in future frames of the video and what actually happened. The team found that the prediction error shot up when the future frames contained physically impossible events. For example, if a ball rolled behind some occluding object and temporarily disappeared from view, the model generated an error when the ball didn’t reappear from behind the object in future frames. The reaction was akin to the intuitive response seen in infants. V-JEPA, one could say, was surprised.
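
A surprise score of this kind can be sketched as the gap between predicted and observed latent representations. The Python below is a toy: the latents are hand-picked numbers rather than model outputs, the two-number latent (is the ball visible, and where) is hypothetical, and mean absolute error is an assumed choice of distance, since the article does not name the exact measure the team used.

```python
# Toy "surprise" score: the gap between predicted and observed latents.
# The latents are hand-picked numbers (an assumption for illustration);
# in the study they come from the pretrained model.

def surprise(predicted_latents, observed_latents):
    """Mean absolute prediction error for each future frame."""
    return [
        sum(abs(p - o) for p, o in zip(pred, obs)) / len(pred)
        for pred, obs in zip(predicted_latents, observed_latents)
    ]

# Hypothetical latent: (ball visible?, ball x-position). The model expects
# the ball to reappear from behind the occluder in the third future frame.
predicted = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.75)]
plausible = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.75)]    # ball reappears
impossible = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]    # ball never reappears

print(surprise(predicted, plausible))     # [0.0, 0.0, 0.0]
print(surprise(predicted, impossible))    # [0.0, 0.0, 0.875]
```

The error stays at zero while the ball is hidden in both videos and jumps only at the frame where the impossible video departs from what was expected.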

Heilbron is impressed by V-JEPA’s ability. “We know from developmental literature that babies don’t need a lot of exposure to learn these types of intuitive physics,” he said. “It’s compelling that they show that it’s learnable in the first place, and you don’t have to come with all these innate priors.”

Karl Friston, a computational neuroscientist at University College London, thinks that V-JEPA is on the right track in terms of mimicking the “way our brains learn and model the world.” However, it still lacks some fundamental elements. “What is missing from [the] current proposal is a proper encoding of uncertainty,” he said. For example, if the information in the past frames isn’t enough to accurately predict the future frames, the prediction is uncertain, and V-JEPA doesn’t quantify this uncertainty.

In June, the V-JEPA team at Meta released their next-generation 1.2-billion-parameter model, V-JEPA 2, which was pretrained on 22 million videos. They also applied the model to robotics: They showed how to further fine-tune a new predictor network using only about 60 hours of robot data (including videos of the robot and information about its actions), then used the fine-tuned model to plan the robot’s next action. “Such a model can be used to solve simple robotic manipulation tasks and paves the way to future work in this direction,” Garrido said.
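
The article does not describe the planning procedure. One simple way to plan with a predictor that takes actions into account, assumed here for illustration only, is to predict the outcome of each candidate action and pick the one that lands closest to a goal. In this Python sketch the latent is just a gripper position, and `predict_next`, `plan_next_action`, and the four candidate moves are all hypothetical.

```python
# Assumed illustration of planning with an action-conditioned predictor.
# The dynamics and candidate actions are made up for the example.

def predict_next(latent, action):
    """Stand-in predictor: the latent is the gripper's (x, y) position."""
    return (latent[0] + action[0], latent[1] + action[1])

def plan_next_action(current, goal, candidate_actions):
    """Pick the action whose predicted outcome lands closest to the goal."""
    def distance_to_goal(action):
        nxt = predict_next(current, action)
        return abs(nxt[0] - goal[0]) + abs(nxt[1] - goal[1])

    return min(candidate_actions, key=distance_to_goal)

actions = [(1, 0), (-1, 0), (0, 1), (0, -1)]
best = plan_next_action(current=(0, 0), goal=(3, 0), candidate_actions=actions)
print(best)    # (1, 0)
```

In a real system the distance would be measured between latent representations of the predicted and desired scenes rather than between raw coordinates.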

To push V-JEPA 2, the team designed a more difficult benchmark for intuitive physics understanding, called IntPhys 2. V-JEPA 2 and other models did only slightly better than chance on these tougher tests. One reason, Garrido said, is that V-JEPA 2 can handle only a few seconds of video as input and predict a few seconds into the future. Anything longer is forgotten. You could make the comparison again to infants, but Garrido had a different creature in mind. “In a sense, the model’s memory is reminiscent of a goldfish,” he said.

Frequently Asked Questions (FAQs):

Q: What is V-JEPA?

A: V-JEPA, or Video Joint Embedding Predictive Architecture, is an AI model developed by Meta that learns about the world through videos and demonstrates a notion of 'surprise' when presented with physically implausible scenarios.

Q: How does V-JEPA differ from traditional AI models?

A: Unlike traditional AI models that work in 'pixel space' and treat every pixel equally, V-JEPA uses higher levels of abstraction, or 'latent representations,' to focus on essential details and discard unnecessary information.

Q: What tasks can V-JEPA perform?

A: V-JEPA can be adapted to perform tasks such as classifying images or identifying actions depicted in videos. Adaptation requires much less labeled data than training the whole system end to end for a specific task, and the same encoder and predictor networks can be adapted for different tasks.

Q: How accurate is V-JEPA in understanding intuitive physics?

A: V-JEPA achieved nearly 98% accuracy on the IntPhys test, which requires AI models to identify physically plausible or implausible scenarios in videos.

Q: What are the limitations of V-JEPA?

A: One limitation is that V-JEPA does not quantify the uncertainty of its predictions. Additionally, V-JEPA 2 can handle only a few seconds of video as input and predict a few seconds into the future, a memory span that Garrido likened to a goldfish's.
