Chinese team upends CV: SEEM goes viral for perfectly segmenting everything, splitting the "Everything Everywhere All at Once" universe with one click


Following SAM, researchers from UW-Madison, Microsoft, HKUST, and other institutions have proposed SEEM, a model that can segment images and videos in one click using a variety of visual and language prompts.

The release of Meta's "Segment Anything" had many people exclaiming that CV is dead.

Building on that model, many netizens have done follow-up work, such as Grounded SAM.

Combined with Stable Diffusion, Whisper, and ChatGPT, it can even turn a dog into a monkey via a voice command.

Now it is not just voice: with multimodal prompts, you can segment everything, everywhere, all at once.

How exactly does it work?

Click the mouse to directly select the content to segment.

Or just say a sentence out loud.

A quick scribble, and the complete meme cutout comes out.

It can even segment video.

The latest research, SEEM, comes from scholars at the University of Wisconsin-Madison, Microsoft Research, and other institutions.

With SEEM, images can easily be segmented using different kinds of prompts: visual prompts (points, marks, boxes, scribbles, and image crops) as well as language prompts (text and audio).

Paper address: https://arxiv.org/pdf/2304.06718.pdf. Interestingly, the paper's title closely echoes that of the 2022 American sci-fi film Everything Everywhere All at Once.

Nvidia scientist Jim Fan quipped that the Oscar for best paper title goes to "Segment Everything Everywhere All at Once".

A unified, versatile task-specification interface is key to scaling up large foundation models, and multimodal prompts are the direction of the future.

After reading the paper, netizens remarked that CV is now starting to embrace large models too, and asked where that leaves graduate students.

The "Oscar for best paper title": inspired by the development of prompt-based universal interfaces for LLMs, the researchers proposed SEEM.

As shown in the figure, the SEEM model can perform any segmentation task, such as semantic, instance, and panoptic segmentation, in an open-vocabulary setting without any prompts.

In addition, it supports arbitrary combinations of visual, text, and referring-region prompts, enabling versatile and interactive referring segmentation.

In terms of model architecture, SEEM adopts a common encoder-decoder design; its distinctive feature lies in the sophisticated interaction between queries and prompts.

Image features and prompts are encoded into a joint visual-semantic space by their corresponding encoders or samplers.

Learnable queries are randomly initialized. The SEEM decoder takes the learnable queries, image features, and prompts as input, and outputs class and mask embeddings for mask and semantic prediction.
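To make the query-prompt interaction concrete, here is a minimal PyTorch sketch, not the authors' implementation: randomly initialized learnable queries attend jointly to image features and prompt embeddings, and are projected into mask and class embeddings. All module names, shapes, and hyperparameters below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEEMStyleDecoder(nn.Module):
    """Toy sketch of a SEEM-like decoder (names and sizes are illustrative)."""
    def __init__(self, dim=256, num_queries=100, num_layers=3):
        super().__init__()
        # Randomly initialized learnable queries.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mask_head = nn.Linear(dim, dim)   # projects queries to mask embeddings
        self.class_head = nn.Linear(dim, dim)  # projects queries to class embeddings

    def forward(self, image_feats, prompt_embeds):
        # image_feats:   (B, HW, dim) from the image encoder
        # prompt_embeds: (B, P, dim) visual/text prompts mapped into the joint space
        B = image_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Queries cross-attend to image features and prompts together.
        memory = torch.cat([image_feats, prompt_embeds], dim=1)
        out = self.decoder(tgt=queries, memory=memory)           # (B, Q, dim)
        mask_embed = self.mask_head(out)
        class_embed = self.class_head(out)
        # Dot product of mask embeddings with per-pixel features gives mask logits.
        mask_logits = torch.einsum("bqd,bpd->bqp", mask_embed, image_feats)
        return mask_logits, class_embed
```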

It is worth mentioning that SEEM supports multiple rounds of interaction, each consisting of a human loop and a model loop.

In the human loop, the user receives the mask output of the previous iteration and gives feedback for the next round of decoding via visual prompts. In the model loop, the model receives and updates memory prompts for future predictions.
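A rough sketch of how this two-loop interaction could be wired around the toy decoder above; `get_user_prompt` and `update_memory` are hypothetical stand-ins for real user input and for the mask-guided cross-attention that maintains memory prompts.

```python
import torch

def get_user_prompt(prev_mask_logits, dim=256):
    # Stand-in for real interaction: clicks/scribbles encoded as prompt vectors.
    return torch.randn(1, 2, dim)

def update_memory(memory_prompts, mask_logits, prompts):
    # Stand-in for mask-guided cross-attention that refreshes memory prompts.
    return prompts.detach()

def interactive_segmentation(model, image_feats, rounds=3):
    """Illustrative multi-round loop: a human loop adds visual prompts,
    a model loop decodes and carries memory prompts into the next round."""
    memory_prompts, mask_logits = None, None
    for _ in range(rounds):
        # Human loop: inspect the previous mask, supply new visual prompts.
        user_prompts = get_user_prompt(mask_logits)
        prompts = user_prompts if memory_prompts is None else torch.cat(
            [user_prompts, memory_prompts], dim=1)
        # Model loop: decode, then update memory prompts for future predictions.
        mask_logits, _class_embed = model(image_feats, prompts)
        memory_prompts = update_memory(memory_prompts, mask_logits, prompts)
    return mask_logits
```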

With SEEM, given a picture of the Optimus Prime truck, you can segment Optimus Prime in any target image.

SEEM can also generate a mask from user-entered text for one-click segmentation.

In addition, SEEM can segment semantically similar objects in the target image from just a click or a scribble on a reference image.

SEEM also understands spatial relationships well: after the zebra on the upper left is scribbled on, the leftmost zebra in the target image is segmented as well.

SEEM can also transfer a referring image to video masks, segmenting video accurately without having been trained on any video data.

Datasets and settings: SEEM is trained on three types of datasets, covering panoptic segmentation, referring segmentation, and interactive segmentation.

Interactive segmentation: the researchers compared SEEM with state-of-the-art interactive segmentation models.

As a general model, SEEM achieves performance comparable to specialized models such as RITM and SimpleClick, and its performance is very close to that of SAM, which was trained with 50 times more segmentation data.

It is worth noting that, unlike existing interactive models, SEEM is the first to support not only classic segmentation tasks but also a wide range of multimodal inputs, including text, scribbles, bounding boxes, and images, providing powerful compositional capabilities.

Generic segmentation: with a single set of parameters pretrained for all segmentation tasks, the researchers could directly evaluate its performance on generic segmentation datasets.

SEEM achieves strong panoptic, instance, and semantic segmentation performance.

The researchers designed SEEM with four desired properties:

1. Versatility: handle different types of prompts, including points, boxes, scribbles, masks, text, and referred regions of another image, by introducing a versatile prompt engine

2. Compositionality: handle on-the-fly compositions of visual and text prompts by learning a joint visual-semantic space

3. Interactivity: retain dialogue-history information through mask-guided cross-attention by incorporating learnable memory prompts

4. Semantic awareness: achieve open-vocabulary segmentation by using a text encoder to encode text queries and mask labels (see the sketch below)
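The sketch below illustrates the semantic-awareness idea under the same assumptions as the earlier decoder sketch: per-query class embeddings are matched against text embeddings of candidate label names, so the label set can be swapped at inference time. The `text_encoder` callable is a hypothetical CLIP-style encoder, not SEEM's actual one.

```python
import torch
import torch.nn.functional as F

def open_vocab_labels(class_embed, label_texts, text_encoder):
    """Assign each mask query an open-vocabulary label by cosine similarity."""
    # text_encoder: callable mapping a list of strings to (L, dim) embeddings.
    text_embeds = F.normalize(text_encoder(label_texts), dim=-1)   # (L, dim)
    query_embeds = F.normalize(class_embed, dim=-1)                # (B, Q, dim)
    logits = torch.einsum("bqd,ld->bql", query_embeds, text_embeds)
    return logits.argmax(dim=-1)  # label index for every mask query
```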

Differences from SAM: the SAM model proposed by Meta uses a prompt encoder within a unified framework to take a point, a bounding box, or a sentence and segment the object with one click.

SAM is highly versatile: it has zero-shot transfer ability that covers a wide range of use cases, and without additional training it can be applied to new image domains, whether underwater photos or cell microscopy images.
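For comparison, here is a minimal example of SAM's point-prompt interface using Meta's open-source segment-anything package, assuming the package is installed and the ViT-H checkpoint has been downloaded; the image path and click coordinates are illustrative.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the ViT-H SAM checkpoint (file name as released by Meta).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Embed the image once, then prompt with a single foreground click.
image = cv2.cvtColor(cv2.imread("underwater_photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (x, y) click location
    point_labels=np.array([1]),           # 1 = foreground point
    multimask_output=True,                # return several candidate masks
)
best_mask = masks[int(scores.argmax())]   # note: no semantic label is returned
```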

The researchers compared SEEM and SAM in terms of the interactivity and semantic levels of three segmentation tasks (edge detection, open-set segmentation, and interactive segmentation).

Open-set segmentation also requires a high level of semantics but needs no interaction.

Compared with SAM, SEEM covers a wider range of interaction and semantic levels.

SAM supports only limited interaction types, such as points and bounding boxes, and does not address high-semantics tasks, because it does not itself output semantic labels.

The researchers highlighted two strengths of SEEM:

First, SEEM has a unified prompt encoder that encodes all visual and language prompts into a joint representation space. As a result, SEEM supports more general usage and can potentially be extended to custom prompts.

Second, SEEM performs well at text-to-mask segmentation and produces semantic-aware predictions.

About the authors: the first author of the paper is Xueyan Zou.

She is currently a Ph.D. student in computer science at the University of Wisconsin-Madison, advised by Professor Yong Jae Lee.

Before that, Zou spent three years at the University of California, Davis, under the same advisor, working closely with Dr. Fanyi Xiao.

She received her bachelor's degree from Hong Kong Baptist University under the guidance of Professors PC Yuen and Chu Xiaowen.

Jianwei Yang

Yang is a senior researcher in the Deep Learning Group at Microsoft Research Redmond, working with Dr. Jianfeng Gao.

Yang's research focuses mainly on computer vision, vision-and-language, and machine learning. He studies structured visual understanding at different levels and how to further leverage it for intelligent interaction with humans through language and embodiment in the environment.

Before joining Microsoft in March 2020, Yang earned his Ph.D. in computer science from the School of Interactive Computing at the Georgia Institute of Technology, advised by Professor Devi Parikh and working closely with Professor Dhruv Batra.

Jianfeng Gao

Jianfeng Gao is a Distinguished Scientist and Vice President at Microsoft Research, an IEEE Fellow, and an ACM Distinguished Member.

He currently leads the Deep Learning Group, whose mission is to advance the state of the art in deep learning and its applications to natural language and image understanding, and to make progress on conversational models and methods.

His research covers neural language models for natural language understanding and generation, neuro-symbolic computing, vision-language grounding and understanding, conversational AI, and related areas.

From 2014 to 2018, Gao served as Partner Research Manager in the Deep Learning Technology Center (DLTC) at Microsoft Research Redmond and in Business AI within Microsoft's AI and Research division.

From 2006 to 2014, he was a Principal Researcher in the Natural Language Processing Group.

Yong Jae Lee

Lee is an associate professor in the Department of Computer Sciences at the University of Wisconsin-Madison.

Before joining the University of Wisconsin-Madison in the fall of 2021, he spent a year as a visiting AI faculty member at Cruise, and before that he was an assistant and then associate professor at the University of California, Davis for six years.

He also spent a year as a postdoctoral researcher at the Robotics Institute at Carnegie Mellon University.

He received his Ph.D. from the University of Texas at Austin in May 2012, advised by Kristen Grauman, and his bachelor's degree from the University of Illinois at Urbana-Champaign in May 2006.

He also worked with Larry Zitnick and Michael Cohen as a summer intern at Microsoft Research.

Lee's current research focuses on computer vision and machine learning; he is particularly interested in building robust visual recognition systems that can understand visual data with minimal human supervision.

SEEM has now released an online demo:

https://huggingface.co/spaces/xdecoder/SEEM

Let's get started.
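Since the demo is a Gradio Space, it can also be queried from code. Below is a hedged sketch using the gradio_client package; the Space's endpoint names and argument order are not documented here, so `view_api()` is used to discover them before calling `predict`.

```python
from gradio_client import Client

client = Client("xdecoder/SEEM")  # the Hugging Face Space linked above
client.view_api()                 # prints the callable endpoints and their parameters
# result = client.predict(..., api_name="/<endpoint shown by view_api>")
```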

References:

https://twitter.com/DrJimFan/status/1649835393163091969

https://www.reddit.com/r/MachineLearning/comments/12lf2l3/r_seem_segment_everything_everywhere_all_at_once/

https://t.co/U6so7iuxpv

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).
