

LeCun's world model arrives: Meta releases I-JEPA, the first "human-like" AI model that understands the world by completing partial images through self-supervised learning



Shulou (Shulou.com), 11/24 report --

LeCun's long-awaited world model has finally arrived. Now that large models are learning to understand the world and reason like humans, is AGI not far off?

The AI LeCun has long envisioned is one that leads to human-level intelligence, and to that end he proposed the idea of a "world model."

Recently, LeCun again criticized GPT-style large models in a public speech: large models based on autoregressive probabilistic generation cannot solve the hallucination problem at all. He even asserted that GPT models will not survive another five years.

Today, LeCun is finally one step closer to his dream!

Meta has released I-JEPA, a "human-like" AI model that can analyze and complete missing parts of images more accurately than existing models.

Paper address: arxiv.org/abs/2301.08243

Highlight: I-JEPA fills in missing fragments using background knowledge about the world, rather than just looking at nearby pixels as other models do.

More than a year after proposing the concept of the "world model," LeCun is one step closer to realizing his grand vision.

The training code and models are now open source, and the paper will be presented at CVPR 2023 next week.

LeCun's world model arrives at a moment when even today's most advanced AI systems have been unable to overcome some key limitations.

To break through this barrier, Meta's chief AI scientist Yann LeCun has proposed a new architecture.

His vision is to create a machine that can learn internal models of how the world works, so that it can learn faster, plan for complex tasks, and respond to unfamiliar new situations.

Meta's I-JEPA, an Image Joint Embedding Predictive Architecture, is the first AI model based on a key part of LeCun's world-model vision.

I-JEPA learns by building an internal model of the external world: when completing an image, it compares abstract representations of the image rather than the pixels themselves.

I-JEPA demonstrated robust performance on multiple computer vision tasks and was computationally much more efficient than other widely used CV models.

ImageNet linear evaluation: I-JEPA learns semantic image representations without using any hand-crafted data augmentations during pre-training, and it does so with less compute than other methods. The learned representations can also be used for many different applications without extensive fine-tuning.

For example, the researchers trained a 632M-parameter Vision Transformer model using 16 A100 GPUs in 72 hours.

On the low-shot classification task on ImageNet, it reached SOTA with only about 12 labeled examples per class.

Other methods typically require 2 to 10 times as many GPU-hours and reach higher error rates when trained on the same amount of data.
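As a rough, hypothetical illustration of this kind of evaluation protocol (not the actual I-JEPA evaluation code), the sketch below trains a linear classifier on top of frozen pre-trained features; the encoder, data loader, and feature dimension are placeholder assumptions.

import torch
import torch.nn as nn

# Hypothetical linear-probe sketch: the encoder, loader, and dimensions
# are placeholders, not the real I-JEPA evaluation pipeline.
def linear_probe(encoder, loader, feat_dim=1280, num_classes=1000, epochs=10):
    encoder.eval()                            # the pre-trained backbone stays frozen
    head = nn.Linear(feat_dim, num_classes)   # only this linear head is trained
    opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = encoder(images)       # frozen representations
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head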

Acquiring common sense through self-supervised learning

Often, humans learn a lot of background knowledge about the world simply by passive observation.

This common-sense information is presumed to be key to intelligent behavior, enabling things like sample-efficient acquisition of new concepts, grounding, and planning.

Meta's work on I-JEPA (and on Joint Embedding Predictive Architecture, or JEPA, models more generally) is grounded in this idea.

What the researchers are trying to do is design a learning algorithm that captures common-sense background knowledge about the world and then encodes it into numerical representations that the algorithm can access.

To be efficient enough, the system must learn these representations in a self-supervised manner, that is, directly from unlabeled data such as images or sounds, rather than from manually assembled labeled datasets.

At a high level, JEPA aims to predict the representation of one part of an input (such as an image or a piece of text) from the representations of other parts of the same input.

Because it does not involve collapsing representations of multiple views or augmentations of an image to a single point, JEPA promises to avoid the biases and problems that arise in widely used invariance-based pre-training methods.

Joint embedding approaches can avoid representation collapse. And by predicting representations at a high level of abstraction rather than predicting pixel values directly, JEPA promises to learn useful representations while avoiding the limitations of generative approaches, the basis of the large language models that have produced so much recent excitement.

In contrast, generative models learn by removing or distorting parts of the input.

For example, they erase part of a photo or hide certain words in a passage of text, and then try to predict the corrupted or missing pixels or words.

But a significant shortcoming of this approach is that the world itself is inherently unpredictable, yet the model tries to fill in every missing piece of information.

As a result, such approaches can make mistakes a person would never make, because they focus too much on irrelevant details instead of capturing higher-level, predictable concepts.

A well-known example is the difficulty of generative models in generating the right hands.

In the general architecture of self-supervised learning, the system learns to capture relationships between different inputs.

Its goal is to assign high energy to incompatible inputs and low energy to compatible inputs.

Common architectures for self-supervised learning

The differences between the three architectures are as follows:

(a) The joint embedding (invariance-based) architecture learns to output similar embeddings for compatible inputs x, y and dissimilar embeddings for incompatible inputs.

(b) The generative architecture learns to reconstruct a signal y directly from a compatible signal x, using a decoder network conditioned on an additional variable z (possibly a latent variable) to facilitate the reconstruction.

(c) The joint embedding predictive architecture learns to predict the embedding of a signal y from a compatible signal x, using a predictor network conditioned on an additional variable z (possibly a latent variable) to facilitate the prediction.
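To make the distinction concrete, here is a minimal, hypothetical PyTorch-style sketch of the three training objectives. The encoder, decoder, and predictor modules and the specific distance functions are illustrative assumptions, not the implementation used in the paper.

import torch
import torch.nn.functional as F

# (a) Joint embedding (invariance-based): pull embeddings of compatible x, y together.
def joint_embedding_loss(enc_x, enc_y, x, y):
    sx, sy = enc_x(x), enc_y(y)
    return 1 - F.cosine_similarity(sx, sy, dim=-1).mean()

# (b) Generative: reconstruct y directly in input (pixel) space from x and a variable z.
def generative_loss(enc_x, decoder, x, y, z):
    return F.mse_loss(decoder(enc_x(x), z), y)

# (c) JEPA: predict the *embedding* of y from the embedding of x, conditioned on z;
#     the loss lives in representation space, not pixel space.
def jepa_loss(enc_x, enc_y, predictor, x, y, z):
    with torch.no_grad():
        target = enc_y(y)                 # no gradient through the target branch
    return F.mse_loss(predictor(enc_x(x), z), target)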

The principle behind I-JEPA, an image-based Joint Embedding Predictive Architecture, is to predict missing information in an abstract representation that is closer to how humans understand images.

To guide I-JEPA to generate semantic representations, one of the core designs is the multi-block masking strategy.

Specifically, the team demonstrated the importance of predicting large blocks that carry semantic information; these blocks are big enough to cover meaningful semantic features.

The advantage of this strategy is that it reduces unnecessary detail and provides a higher level of semantic understanding.

By focusing on large chunks of semantic information, models can better grasp important concepts in images or text, resulting in stronger predictive power.

The image-based Joint Embedding Predictive Architecture (I-JEPA) uses a single context block to predict the representations of various target blocks from the same image.

The context encoder is a Vision Transformer (ViT) that processes only the visible context patches.

The predictor is a narrow ViT that receives the output of the context encoder and predicts a representation of the target block based on the position token of the target.

The target representations correspond to the outputs of the target encoder, whose weights are updated at every iteration as an exponential moving average of the context-encoder weights.
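A minimal sketch of how such an exponential-moving-average (EMA) update is commonly implemented; the momentum value and function name are illustrative assumptions rather than the paper's exact schedule.

import torch

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum=0.996):
    # Target-encoder weights follow a slow exponential moving average of the
    # context-encoder weights; gradients never flow into the target encoder.
    for t_param, c_param in zip(target_encoder.parameters(),
                                context_encoder.parameters()):
        t_param.mul_(momentum).add_(c_param, alpha=1.0 - momentum)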

In I-JEPA, a predictor can be thought of as a primitive (and constrained) model of the world that is able to use known context information to infer the contents of unknown regions.

This ability enables the model to reason about static images, thereby establishing an understanding of spatial uncertainty in images.

Unlike methods that focus only on pixel-level details, I-JEPA is able to predict high-level semantic information in unseen regions, thus better capturing the semantic content of images.

The predictor learns the process of modeling the semantics of the world. For each image, the part outside the blue box is encoded and provided to the predictor as context. The predictor outputs a representation of what is expected in the blue box.

To understand what the model captures, the team trained a stochastic decoder that maps the predicted I-JEPA representations back into pixel space, showing the model's output when it predicts the contents of the blue box.

Clearly, the predictor is able to identify the semantic information that should be filled in (the top of a dog's head, a bird's leg, a wolf's legs, the other side of a building).
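A hedged sketch of that visualization idea: I-JEPA stays frozen, and a separate decoder is trained to map predicted target representations back to pixels. The plain MSE objective and the argument names are simplifying assumptions; the authors describe a stochastic, generative decoder.

import torch.nn.functional as F

def train_decoder_step(predicted_repr, target_pixels, decoder, opt):
    # predicted_repr: frozen I-JEPA predictor output for a masked target block
    # target_pixels:  ground-truth pixels of that block
    recon = decoder(predicted_repr.detach())   # only the decoder receives gradients
    loss = F.mse_loss(recon, target_pixels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()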

Given an image, four target blocks are randomly sampled, a context block is sampled at a random scale, and any regions overlapping the target blocks are removed from it. In short, I-JEPA can learn high-level representations of object parts without discarding their local positional information in the image.
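A simplified, hypothetical sketch of that sampling procedure on a patch grid; the grid resolution, block sizes, and square block shapes are illustrative placeholders rather than the paper's exact hyperparameters, which sample blocks with varying scales and aspect ratios.

import random

def sample_masks(grid=14, num_targets=4, target_size=4, context_size=10):
    """Return (context_patches, target_blocks) as sets of (row, col) patch indices."""
    def sample_block(size):
        top = random.randint(0, grid - size)
        left = random.randint(0, grid - size)
        return {(r, c) for r in range(top, top + size)
                       for c in range(left, left + size)}
    # 1) Randomly sample the target blocks to be predicted.
    targets = [sample_block(target_size) for _ in range(num_targets)]
    # 2) Sample a larger context block, then remove any overlap with the targets,
    #    so the context never "sees" the regions it must predict.
    context = sample_block(context_size)
    for block in targets:
        context -= block
    return context, targets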

Higher efficiency, stronger performance

In pre-training, I-JEPA is also more computationally efficient.

First, it does not need to apply computationally intensive data augmentations to produce multiple views, so it incurs no extra overhead there.

Secondly, the target encoder only needs to process one view of the image, and the context encoder only needs to process the context block.
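A small sketch of why this saves compute: only the visible context patches are gathered and fed to the transformer, so the sequence length (and therefore the attention cost) shrinks with the mask. The tensor shapes and function names are illustrative assumptions.

import torch

def encode_context(patch_tokens, context_idx, context_encoder):
    # patch_tokens: (B, N, D) tokens for all N patches of an image
    # context_idx:  (B, K) long indices of the visible context patches, with K < N
    B, N, D = patch_tokens.shape
    idx = context_idx.unsqueeze(-1).expand(-1, -1, D)        # (B, K, D)
    visible = torch.gather(patch_tokens, dim=1, index=idx)   # keep only context tokens
    return context_encoder(visible)                          # attention over K tokens only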

Experiments show that I-JEPA can learn powerful off-the-shelf semantic representations without hand-crafted view augmentations.

Furthermore, I-JEPA outperforms pixel-reconstruction and token-reconstruction methods on ImageNet-1K linear probing and semi-supervised evaluation.

Linear-evaluation performance on ImageNet-1K as a function of GPU hours spent in pre-training

On semantic tasks, I-JEPA performs better than previous pre-training methods that rely on hand-crafted data augmentations.

Compared to these methods, I-JEPA achieves better performance on low-level visual tasks such as object counting and depth prediction.

By using a simpler model with a less rigid inductive bias, I-JEPA can be applied to a wider range of tasks.

Low-shot classification accuracy: semi-supervised evaluation on ImageNet-1K using 1% of the labels (approximately 12 labeled images per class)

AI takes another step toward human intelligence

I-JEPA demonstrates the potential of the architecture to learn competitive off-the-shelf image representations without additional hand-crafted knowledge.

Advancing JEPA to learn more general world models from richer modalities would be particularly exciting work.

For example, making long-range spatial and temporal predictions about video from a short context, and conditioning those predictions on audio or textual cues.

Visualization of I-JEPA predictor representations: the first column contains the original image, the second column contains the context image, and the green bounding boxes contain samples from a generative model decoded from the predictor's outputs. The predictor correctly captures positional uncertainty and produces high-level object parts in the correct poses, while discarding precise low-level details and background information.

The team looks forward to extending the JEPA approach to other domains, such as image-text paired data and video data.

In the future, the JEPA model could have exciting applications for tasks such as video understanding. This will also be an important step in applying and extending self-supervised methods to learn world models.


Single-GPU training

In a single-GPU setup, the implementation starts from main.py.

For example, to run I-JEPA pre-training on GPUs 0, 1, and 2 of your local machine using the configuration configs/in1k_vith14_ep300.yaml, enter the following command:

python main.py \
  --fname configs/in1k_vith14_ep300.yaml \
  --devices cuda:0 cuda:1 cuda:2

Note: to reproduce the results, the ViT-H/14 configuration should be run on 16 A100 80 GB GPUs with an effective batch size of 2048.

Multi-GPU training

In a multi-GPU setup, the implementation starts from main_distributed.py, which, in addition to parsing the configuration file, also allows details of the distributed training setup to be specified.

For distributed training, the popular open-source submitit tool is used, and an example is provided for a SLURM cluster.

For example, to pre-train on 16 A100 80 GB GPUs using the pre-training experiment configuration specified in configs/in1k_vith14_ep300.yaml, enter the following command:

python main_distributed.py \
  --fname configs/in1k_vith14_ep300.yaml \
  --folder $path_to_save_submitit_logs \
  --partition $slurm_partition \
  --nodes 2 --tasks-per-node 8 \
  --time 1000

Comments

Netizens expressed their appreciation for LeCun's new work.

This is groundbreaking work. Mind blown. The successor to the autoregressive model is here!

I believe joint embedding architectures are the future of AI, not generative ones. But I'm just curious: why not go further into multimodality (like ImageBind, not just text-image pairs) and replace the ViT encoder with a Perceiver-like encoder?

Neat work. As I understand it, it's similar to a masked autoencoder, but with the loss defined in latent space rather than in input/pixel space. I'd need more details to be sure I understand it, though.

My brain can only understand 10% of the paper, but it would be amazing if I-JEPA could actually create the target images in Figure 3, and most importantly: this is relevant to AI-generated MMORPGs!

The project is being open-sourced, and netizens also appreciated Meta's contribution to the open-source community.

References:

https://ai.facebook.com/blog/yann-lecun-ai-model-i-jepa/

This article comes from Weixin Official Accounts: Xinzhiyuan (ID: AI_era)
