One-click prompt cutout: Meta releases the first foundation model for image segmentation in history, creating a new CV paradigm


Today, Meta released SAM, the first foundation model for image segmentation in history. It brings the prompt paradigm of NLP into CV, letting the model cut out an object from a single prompt. Netizens exclaimed: CV no longer exists!

Just now, Meta AI released the Segment Anything Model (SAM), the first foundation model for image segmentation.

SAM can segment any object in a photo or video with one click, and can transfer zero-shot to other tasks.

Overall, SAM follows the idea of a foundation model:

1. A very simple but extensible architecture that can handle multimodal prompts: text, points, bounding boxes (a minimal loading sketch follows this list).

2. An intuitive annotation process that is tightly coupled with the model design.

3. A data flywheel that lets the model bootstrap itself onto a large number of unlabeled images.
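To make the prompt-driven workflow concrete, here is a minimal sketch of loading SAM through Meta's open-source segment-anything package and wrapping it in a predictor. The checkpoint file name, image path, and device handling are assumptions for illustration, not details taken from this article.

```python
import cv2
import torch
from segment_anything import sam_model_registry, SamPredictor

# Load the released ViT-H checkpoint (file name is an assumption).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda" if torch.cuda.is_available() else "cpu")

predictor = SamPredictor(sam)

# The predictor expects an RGB image (OpenCV loads BGR, so convert).
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the heavy image encoder once per image
```

Later sketches in this article reuse this hypothetical `sam`, `predictor`, and `image`.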

Moreover, it is no exaggeration to say that SAM has learned a general concept of "objects", covering even unknown objects, unfamiliar scenes (such as underwater or microscope imagery), and ambiguous cases.

In addition, SAM can be extended to new tasks and domains, and practitioners do not need to fine-tune the model themselves.

Paper: https://ai.facebook.com/research/publications/segment-anything/

The most powerful part is that Meta implements a completely different CV paradigm: within a unified framework and prompt encoder, you can specify a point, a bounding box, or a sentence, and segment the object with one click.

In this regard, Tencent AI algorithm expert Jin Tian said: "The prompt paradigm from NLP has been extended to CV. This may completely change the traditional prediction mindset of CV. This time you can really use a single model to segment any object, and dynamically!"

Nvidia AI scientist Jim Fan went even further in his praise: we have arrived at the "GPT-3 moment" of computer vision!

So, does CV really no longer exist?

SAM: one-click "cut-out" of any object in any image

Segment Anything is the first foundation model dedicated to image segmentation.

Segmentation, identifying which image pixels belong to an object, has always been a core task of computer vision.

However, creating an accurate segmentation model for a particular task usually requires highly specialized work by experts, access to AI training infrastructure, and large amounts of carefully annotated in-domain data, so the bar is extremely high.

To solve this problem, Meta proposed a foundation model for image segmentation, SAM. This promptable model, trained on diverse data, not only adapts to a variety of tasks but also operates much like prompting in NLP models.

The SAM model grasps the concept of "what is an object" and can generate masks for any object in any image or video, even objects it has never seen in training.

SAM is versatile enough to cover a wide range of use cases and can be used out of the box in new imaging domains, whether underwater photography or cell microscopy, without additional training. In other words, SAM already has zero-shot transfer capability.

Meta said excitedly on its blog: it can be expected that in the future, any application that needs to find and segment objects in an image will be an opportunity for SAM to show its talents.

SAM can be part of a larger AI system for a more general multimodal understanding of the world, such as understanding the visual and textual content of web pages.

In AR/VR, SAM can select objects based on the user's gaze and then "lift" them into 3D.

For content creators, SAM can extract image regions for collages or video editing.

SAM can also locate and track animals or objects in video, which is helpful to natural science and astronomy research.

General segmentation methods

In the past, there were two approaches to the segmentation problem.

The first is interactive segmentation, which can segment objects of any class but requires a person to iteratively refine the mask.

The second is automatic segmentation, which can segment specific, pre-defined object categories, but training requires a large number of manually annotated objects (for example, thousands of examples to segment cats).

In short, neither approach provides a general, fully automatic segmentation method.

SAM can be regarded as a generalization of these two approaches: it can easily perform both interactive and automatic segmentation.

Through the model's prompting interface, a wide range of segmentation tasks can be completed simply by designing the right prompt (clicks, boxes, text, and so on), as in the sketch below.
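As an illustration of click-style prompting, here is a minimal sketch that feeds a single foreground point into the hypothetical predictor set up earlier; the click coordinates are invented for illustration.

```python
import numpy as np

# A single foreground click at (x, y); coordinates are hypothetical.
point_coords = np.array([[500, 375]])
point_labels = np.array([1])  # 1 = foreground point, 0 = background point

masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=False,  # a single best-guess mask for this prompt
)
print(masks.shape)  # (1, H, W) boolean mask
```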

In addition, SAM is trained on a diverse, high-quality dataset containing more than 1 billion masks, allowing the model to generalize to new objects and images beyond what it saw during training. As a result, practitioners no longer need to collect their own segmentation data and fine-tune the model for their use case.

This kind of flexibility, extending to new tasks and new domains, is a first in the field of image segmentation.

(1) SAM lets users segment an object with a single click, or by interactively clicking multiple points; the model can also be prompted with a bounding box.

(2) When the object to segment is ambiguous, SAM can output multiple valid masks, an ability that is necessary for solving segmentation problems in the real world (see the sketch after this list).

(3) SAM can automatically find and mask all objects in an image.

(4) After the image embedding is precomputed, SAM can generate a segmentation mask for any prompt in real time, allowing users to interact with the model in real time.
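To illustrate point (2), the sketch below asks for multiple candidate masks for one ambiguous click and keeps the highest-scoring one. It continues from the hypothetical predictor and point prompt above.

```python
# Ask for several candidate masks when a single click is ambiguous
# (e.g. a point on a shirt could mean the shirt or the whole person).
masks, scores, logits = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,  # typically returns three candidate masks
)

best = scores.argmax()   # index of the mask SAM is most confident in
best_mask = masks[best]  # (H, W) boolean array
print(f"{len(masks)} candidates, best score {scores[best]:.3f}")
```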

How SAM works: the researchers trained SAM to return a valid segmentation mask for any prompt. A prompt can be foreground/background points, a rough box or mask, free-form text, or, in general, any information indicating what to segment in the image.

The requirement of a valid mask simply means that even if the prompt is ambiguous and may refer to multiple objects (for example, a point on a shirt may indicate either the shirt or the person wearing it), the output should be a reasonable mask for one of those objects.

The researchers observed that pre-training tasks and interactive data collection imposed specific constraints on model design.

In particular, the model needs to run in real time on a CPU in a web browser, so that annotators can interact with SAM efficiently and in real time.

Although this runtime constraint implies a tradeoff between quality and latency, the researchers found that in practice a simple design achieves good results.

SAM's image encoder produces a one-time embedding of the image, while a lightweight prompt encoder converts any prompt into an embedding vector in real time. These two sources of information are then combined in a lightweight decoder that predicts the segmentation mask.

Once the image embedding has been computed, SAM can produce a segmentation for any prompt in as little as 50 milliseconds, in a web browser.
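A rough sketch of this split between the heavy image encoder and the light decoder, continuing with the same hypothetical predictor: set_image runs once per image, after which many prompts can be answered cheaply.

```python
import time
import numpy as np

# Heavy step: run the image encoder once per image.
predictor.set_image(image)

# Light steps: each prompt reuses the cached image embedding.
for x, y in [(100, 200), (400, 300), (640, 360)]:  # hypothetical clicks
    t0 = time.perf_counter()
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),
        multimask_output=False,
    )
    print(f"click ({x}, {y}): {1000 * (time.perf_counter() - t0):.1f} ms")
```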

The latest SAM model was trained for 68 hours (nearly three days) on 256 A100 GPUs.

The project demonstrates a variety of input prompts specifying what to segment in an image, enabling many different segmentation tasks without additional training.

Using interactive points and boxes as prompts

Automatically segmenting all objects in the image
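For fully automatic segmentation of everything in an image, the open-source segment-anything package also exposes an automatic mask generator; the sketch below is an assumption based on that package, reusing the hypothetical sam model and image loaded earlier.

```python
from segment_anything import SamAutomaticMaskGenerator

# Samples a grid of point prompts over the image and keeps high-quality masks.
mask_generator = SamAutomaticMaskGenerator(sam)
annotations = mask_generator.generate(image)

# Each entry is a dict holding the binary mask plus metadata.
for ann in annotations[:3]:
    print(ann["area"], ann["bbox"], ann["predicted_iou"])
```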

SAM, designed to generate multiple valid masks for ambiguous prompts, can also accept input prompts from other systems.

For example, the corresponding object can be selected from the user's gaze information transmitted by an AR/VR headset. By developing AI that can understand the real world, Meta is paving the way for its future metaverse.

Alternatively, bounding-box prompts coming from an object detector can enable text-to-object segmentation.
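A minimal sketch of feeding a detector-style bounding box into the same hypothetical predictor; the box coordinates (x1, y1, x2, y2) are invented for illustration.

```python
import numpy as np

# Bounding box in (x1, y1, x2, y2) pixel coordinates, e.g. from an object detector.
box = np.array([80, 60, 520, 430])  # hypothetical detection

masks, scores, _ = predictor.predict(
    box=box,
    multimask_output=False,
)
print(int(masks[0].sum()), "pixels inside the predicted mask")
```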

Extensible outputs: the output masks can be used as inputs to other AI systems.

For example, the mask of an object can be tracked in a video, turned into 3D through an image editing application, or used for creative tasks such as collage.

Zero-shot generalization

SAM has learned a general concept of what objects are, and this understanding enables it to generalize zero-shot to unfamiliar objects and images without additional training.

Hands-on evaluation

Choose Hover & Click: clicking Add Mask places a green dot, clicking Remove Area places a red dot, and the apple blossoms are immediately outlined.

With the Box function, simply drag out a box and the object is recognized immediately.

After clicking Everything, all objects recognized by the system are extracted at once.

After choosing Cut-Outs, you get the triangle cut-out within a second.

SA-1B dataset: 11 million images, 1.1 billion masks

In addition to the new model, Meta also released SA-1B, the largest segmentation dataset to date.

The dataset consists of 11 million diverse, high-resolution, privacy-protecting images and 1.1 billion high-quality segmentation masks.

The overall characteristics of the dataset are as follows:

Total number of images: 11 million

Total masks: 1.1 billion

Average masks per image: 100

Average image resolution: 1500 × 2250 pixels

Note: neither the images nor the masks carry class labels
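Given those statistics, here is a rough sketch of decoding the masks for one SA-1B image. The per-image JSON layout, file name, and RLE-encoded "segmentation" field are assumptions about the released dataset format, with pycocotools used for RLE decoding.

```python
import json
from pycocotools import mask as mask_utils  # COCO run-length decoding

# Assumed layout: each image ships with a JSON file listing its masks.
with open("sa_223750.json") as f:  # hypothetical file name
    record = json.load(f)

print(record["image"]["width"], record["image"]["height"])
print(len(record["annotations"]), "masks for this image")

# Assumption: each annotation stores its mask in COCO RLE format.
first = record["annotations"][0]
binary_mask = mask_utils.decode(first["segmentation"])  # (H, W) uint8 array
print("mask pixels:", int(binary_mask.sum()))
```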

Meta specifically emphasizes that this data was collected through its data engine and that all masks were generated fully automatically by SAM.

With the SAM model, collecting new segmentation masks is faster than ever: interactively annotating a mask takes only about 14 seconds.

Annotating a mask this way is only about 2x slower than annotating a bounding box, which takes about 7 seconds with the fastest labeling interfaces.

Compared with previous large-scale segmentation data collection efforts, mask annotation with SAM is 6.5x faster than the fully manual, polygon-based mask annotation used for COCO, and 2x faster than the previous largest data annotation effort (which was also model-assisted).

However, interactively annotated masks alone are not enough to build a dataset of more than 1 billion masks. Meta therefore built a data engine for creating the SA-1B dataset.

This data engine has three gears:

1. Model-assisted annotation

2. A mix of fully automatic annotation and assisted annotation, which helps increase the diversity of the collected masks

3. Fully automatic mask creation, allowing the dataset to scale

The final dataset includes more than 1.1 billion segmentation masks collected from approximately 11 million licensed, privacy-preserving images.

SA-1B has 400 times more masks than any existing segmentation dataset. Human evaluation studies have confirmed that the masks are high in quality and diversity, in some cases even comparable in quality to the masks of previous, much smaller, fully manually annotated datasets.

SA-1B's images were obtained through photo providers from multiple countries spanning different geographic regions and income levels.

Although some geographic regions are still underrepresented, SA-1B has more images and better overall representation across all regions than previous segmentation datasets.

Finally, Meta says it wants this data to be the basis for new datasets that contain additional annotations, such as text descriptions associated with each mask.

The team is led by RBG, Ross Girshick

Ross Girshick (often known as RBG) is a research scientist at Facebook AI Research (FAIR), specializing in computer vision and machine learning.

In 2012, Ross Girshick received a doctorate in computer science from the University of Chicago under the guidance of Pedro Felzenszwalb.

Before joining FAIR, Ross was a researcher at Microsoft Research and a postdoctoral fellow at the University of California, Berkeley, where his mentors were Jitendra Malik and Trevor Darrell.

He won the PAMI Young Researcher Award in 2017, and the PAMI Mark Everingham Prize in 2017 and 2021 in recognition of his contributions to open-source software.

As is well known, Ross and Kaiming He jointly developed the R-CNN approach to object detection. Their Mask R-CNN paper won the Best Paper Award at ICCV 2017.

Netizens: CV really no longer exists

The segmentation foundation model Meta has created for CV has many netizens exclaiming, "Now, CV really no longer exists."

"to me, Segment Anything's data engine and ChatGPT's RLHF represent a new era of large-scale artificial intelligence," said Justin Johnson, a Meta scientist. Instead of learning everything from noisy network data, it is better to skillfully apply the combination of human tagging and big data to release new capabilities. Supervise the strong return of learning! "

The only pity is that the SAM release was led mainly by Ross Girshick, with Kaiming He absent.

Zhihu user "matrix Ming Tsai" said this work further proves that multimodality is the future of CV, and that pure CV has no tomorrow.

Reference:

https://ai.facebook.com/blog/segment-anything-foundation-model-image-segmentation/

https://www.zhihu.com/question/593914819

This article comes from the WeChat official account: Xin Zhiyuan (ID: AI_era)
