2025-03-26 Update From: SLTechnology News & Howtos
Shulou (Shulou.com) 11/24 Report --
As soon as Meta's SAM ("Segment Anything") model was released, a domestic team built on it to create Grounded-SAM, one of the strongest zero-shot vision applications, which can not only segment anything, but also detect anything and generate anything.

Since the emergence of Meta's "Segment Anything" model, insiders have exclaimed that CV no longer exists.

Just one day after SAM's release, a domestic team developed "Grounded-SAM", an evolved version built on top of it.

Note: the project's logo was made by the team in an hour with Midjourney. Grounded-SAM integrates SAM, BLIP and Stable Diffusion, combining image "segmentation", "detection" and "generation" capabilities into the strongest zero-shot vision application.
Netizens exclaimed that the field is getting too competitive!

"It's too fast," said Wenhu Chen, a research scientist at Google Brain and an assistant professor of computer science at the University of Waterloo.

AI heavyweight Shen Xiangyang also recommended this latest project:

Grounded-Segment-Anything: automatically detect, segment, and generate anything with image and text inputs. Edge segmentation can be further improved.

So far, the project has earned 2k stars on GitHub.
Detect anything, segment anything, generate anything

Last week, the release of SAM ushered in the "GPT-3 moment" for CV. Meta AI even claims that it is the first foundation model for image segmentation in history.
Within a unified framework, the model's prompt encoder can take a point, a bounding box, or a sentence as a prompt, and directly segment any specified object with one click.

SAM is broadly general, that is, it has zero-shot transfer ability: it covers a wide variety of use cases and can be applied to new image domains without additional training, whether underwater photos or cell microscopy images.

In short, SAM is strong enough to segment whatever you point at.
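To make the idea of "promptable" segmentation concrete, here is a minimal sketch that mimics the shape of a SAM-style predictor interface. The class and its behavior are illustrative stand-ins, not the real segment_anything API: a real model runs an image encoder, prompt encoder and mask decoder instead of the trivial geometry below.

```python
import numpy as np

class ToyPromptablePredictor:
    """Illustrative stand-in for a SAM-style promptable segmenter.

    Accepts a point or a box prompt and returns a binary mask over the image.
    """

    def set_image(self, image):
        # The real model would encode the image once here and reuse the features.
        self.h, self.w = image.shape[:2]

    def predict(self, point=None, box=None):
        mask = np.zeros((self.h, self.w), dtype=bool)
        if box is not None:
            x0, y0, x1, y1 = box
            mask[y0:y1, x0:x1] = True  # "segment" the boxed region
        elif point is not None:
            x, y = point
            # Small blob around the clicked point.
            mask[max(0, y - 5):y + 5, max(0, x - 5):x + 5] = True
        return mask

predictor = ToyPromptablePredictor()
predictor.set_image(np.zeros((64, 64, 3), dtype=np.uint8))
mask = predictor.predict(box=(8, 8, 24, 24))
print(mask.sum())  # 16 * 16 = 256 pixels
```

The point of the interface is that the same predictor handles different prompt types uniformly, which is what lets a detector's output boxes be fed straight in as prompts.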
Now, based on this model, domestic researchers have come up with a new idea: by combining it with Grounding DINO, a powerful zero-shot object detector, anything can be detected and segmented through text input.

With Grounding DINO's powerful zero-shot detection ability, Grounded-SAM can find any object in an image from a text description; then, with SAM's strong segmentation ability, it can produce fine-grained masks.

Finally, Stable Diffusion can perform controllable text-to-image generation on the segmented regions.

In the concrete implementation of Grounded-SAM, the researchers combined Segment-Anything with three other powerful zero-shot models to build an automatic annotation pipeline, and showed very impressive results!
This project combines the following models:
BLIP: a powerful image captioning model
Grounding DINO: a state-of-the-art zero-shot detector
Segment-Anything: a powerful zero-shot segmentation model
Stable-Diffusion: an excellent generative model
All of these models can be used in combination or independently to build a powerful visual workflow, one with the ability to detect anything, segment anything, and generate anything.
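The detect → segment → generate composition can be sketched as follows. The three model calls below are hypothetical stubs standing in for Grounding DINO, SAM and Stable Diffusion; the real project loads actual checkpoints and tensors instead.

```python
# Hypothetical sketch of the Grounded-SAM data flow: text -> boxes -> masks -> new image.
# Each "model" is a trivial stand-in for the real checkpoint.

def detect(image, text_prompt):
    """Stand-in for Grounding DINO: return (box, phrase) pairs for the prompt."""
    h, w = image["height"], image["width"]
    # Pretend the detector found one object covering the centre of the image.
    return [((w // 4, h // 4, 3 * w // 4, 3 * h // 4), text_prompt)]

def segment(image, box):
    """Stand-in for SAM: turn a box prompt into a mask (here, just the box region)."""
    x0, y0, x1, y1 = box
    return {"area": (x1 - x0) * (y1 - y0), "box": box}

def inpaint(image, mask, gen_prompt):
    """Stand-in for Stable Diffusion inpainting on the masked region."""
    return {**image, "edited_region": mask["box"], "prompt": gen_prompt}

def grounded_sam(image, det_prompt, gen_prompt=None):
    results = []
    for box, phrase in detect(image, det_prompt):
        mask = segment(image, box)
        out = inpaint(image, mask, gen_prompt) if gen_prompt else None
        results.append({"phrase": phrase, "mask": mask, "generated": out})
    return results

image = {"height": 512, "width": 512}
out = grounded_sam(image, det_prompt="bear", gen_prompt="a sofa, high quality")
print(out[0]["phrase"], out[0]["mask"]["area"])  # bear 65536
```

Because each stage only consumes the previous stage's output (boxes, then masks), any of the three models can be swapped out or run on its own, which is exactly the modularity the project advertises.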
The functions of the system include:
BLIP + Grounded-SAM = automatic annotator: use the BLIP model to generate a caption, extract tags from it, and use Grounded-SAM to generate boxes and masks.

Semi-automatic annotation system: detect objects from input text and provide precise box and mask annotations.

Fully automatic annotation system: first use the BLIP model to generate a reliable caption for the input image, then let Grounding DINO detect the entities mentioned in the caption, and finally let SAM segment instances from the resulting box prompts.
Stable Diffusion + Grounded-SAM = data factory: the diffusion inpainting model can generate new data from the masks, so the pipeline can serve as a data factory.
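The fully automatic annotation loop described above can be sketched like this. BLIP, the ChatGPT tag-extraction step, and Grounding DINO + SAM are all replaced with hypothetical stubs, and the output field names only loosely mimic the COCO annotation format:

```python
# Hypothetical sketch of the caption -> tags -> boxes/masks auto-labeling loop.
# Each model call is a trivial stand-in for the real checkpoint.

def caption(image):
    """Stand-in for BLIP image captioning."""
    return "a brown bear sitting on the grass"

def extract_tags(text):
    """Stand-in for the ChatGPT step that pulls object nouns out of the caption."""
    vocabulary = {"bear", "grass", "dog", "cat"}
    return [w for w in text.replace(",", " ").split() if w in vocabulary]

def ground_and_segment(image, tag):
    """Stand-in for Grounding DINO + SAM: one box and mask per detected tag."""
    return {"category": tag, "bbox": [10, 10, 100, 100], "segmentation": "<mask>"}

def auto_label(image):
    anns = [ground_and_segment(image, t) for t in extract_tags(caption(image))]
    return {"caption": caption(image), "annotations": anns}

labels = auto_label(image={})
print(labels["caption"])
print([a["category"] for a in labels["annotations"]])  # ['bear', 'grass']
```

No human input appears anywhere in the loop: the caption seeds the tags, the tags seed the boxes, and the boxes seed the masks, which is what makes the system "fully automatic".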
Segment Anything + Human Editing: in this branch, the author uses Segment Anything to edit a person's hair or face.

SAM + hair editing

SAM + fashion editing
The authors also put forward some possible future research directions for Grounded-SAM: automatically generating images to build new datasets; stronger segmentation foundation models for pre-training; cooperation with (Chat-)GPT models; and a complete pipeline for automatically annotating images (including bounding boxes and masks) and generating new images.
One of the researchers on the Grounded-SAM project is Liu Shilong, a third-year PhD student in the Department of Computer Science at Tsinghua University.

He recently introduced this latest project of his team on GitHub and said it is still being improved.
Liu Shilong is currently an intern at the Computer Vision and Robotics Research Center of the Guangdong-Hong Kong-Macao Greater Bay Area Digital Economy Research Institute (IDEA Research Institute), advised by Professor Zhang Lei. His main research interests are object detection and multimodal learning.

Before that, he received a bachelor's degree in industrial engineering from Tsinghua University in 2020 and spent some time as an intern at Megvii in 2019.
Personal homepage: http://www.lsl.zone/

Incidentally, Liu Shilong is also the first author of the object detection model Grounding DINO, released in March this year.

In addition, his four papers were accepted by CVPR 2023, ICLR 2023, and AAAI 2023.
Paper: https://arxiv.org/pdf/2303.05499.pdf

Ren Tianhe, the "boss" Liu Shilong mentioned, currently works as a computer vision algorithm engineer at the IDEA Research Institute, also advised by Professor Zhang Lei. His main research interests are object detection and multimodal learning.
In addition, the project collaborators include Li Kunchang, a third-year PhD student at the University of the Chinese Academy of Sciences, whose main research interests are video understanding and multimodal learning; Cao he, an intern at the computer Vision and Robotics Research Center of IDEA Research Institute, whose main research interests are generating models; and Chen Jiayu, a senior algorithm engineer at Aliyun.
Ren Tianhe and Liu Shilong

Installing and running the project requires Python 3.8 or above, PyTorch 1.7 or above, and TorchVision 0.8 or above. In addition, the authors strongly recommend installing PyTorch and TorchVision with CUDA support.
Install Segment Anything:
python -m pip install -e segment_anything
Install GroundingDINO:
python -m pip install -e GroundingDINO
Install diffusers:
pip install --upgrade diffusers[torch]

Optional dependencies are required for mask post-processing, saving masks in COCO format, running the example notebooks, and exporting the model in ONNX format. The project also needs jupyter to run the example notebooks.
pip install opencv-python pycocotools matplotlib onnxruntime onnx ipykernel

Grounding DINO demo
Download the groundingdino checkpoint:
cd Grounded-Segment-Anything
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth

Run the demo:
export CUDA_VISIBLE_DEVICES=0
python grounding_dino_demo.py \
  --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
  --grounded_checkpoint groundingdino_swint_ogc.pth \
  --input_image assets/demo1.jpg \
  --output_dir "outputs" \
  --box_threshold 0.3 \
  --text_threshold 0.25 \
  --text_prompt "bear" \
  --device "cuda"

The model prediction visualizations will be saved in output_dir, as follows:
Grounded-Segment-Anything + BLIP demo

Automatically generating pseudo labels is simple:
1. Use BLIP (or other captioning models) to generate a caption.
2. Extract tags from the caption, using ChatGPT to handle potentially complex sentences.
3. Use Grounded-Segment-Anything to generate boxes and masks.
export CUDA_VISIBLE_DEVICES=0
python automatic_label_demo.py \
  --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
  --grounded_checkpoint groundingdino_swint_ogc.pth \
  --sam_checkpoint sam_vit_h_4b8939.pth \
  --input_image assets/demo3.jpg \
  --output_dir "outputs" \
  --openai_key your_openai_key \
  --box_threshold 0.25 \
  --text_threshold 0.2 \
  --iou_threshold 0.5 \
  --device "cuda"

The pseudo labels and model prediction visualizations will be saved in output_dir, as follows:
Grounded-Segment-Anything + Inpainting demo

CUDA_VISIBLE_DEVICES=0 python grounded_sam_inpainting_demo.py \
  --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py \
  --grounded_checkpoint groundingdino_swint_ogc.pth \
  --sam_checkpoint sam_vit_h_4b8939.pth \
  --input_image assets/inpaint_demo.jpg \
  --output_dir "outputs" \
  --box_threshold 0.3 \
  --text_threshold 0.25 \
  --det_prompt "bench" \
  --inpaint_prompt "A sofa, high quality, detailed" \
  --device "cuda"

Grounded-Segment-Anything + Inpainting Gradio App

python gradio_.py

The author provides a visual web page here, making it more convenient to try various examples.
Netizens also found a deeper meaning in the project's logo: a mosaic-style bear sitting on the ground. It sits on the ground because "ground" echoes Grounding; the segmented picture can be seen as a mosaic style, and "mosaic" puns on "mask"; and a bear is the subject of the logo because the author's main example image is a bear.
After seeing Grounded-SAM, netizens said they knew it was coming, but didn't expect it to come so soon.

Project author Ren Tianhe said, "The zero-shot detector we use is by far the best."

A web demo will also be available online in the future.

Finally, the author said the project can support more extended applications based on generative models, such as fine-grained editing across multiple domains and the construction of high-quality, reliable data factories. People from all fields are welcome to participate.
Reference:
https://github.com/IDEA-Research/Grounded-Segment-Anything
https://www.reddit.com/r/MachineLearning/comments/12gnnfs/r_groundedsegmentanything_automatically_detect/
https://zhuanlan.zhihu.com/p/620271321
This article comes from the WeChat official account: Xin Zhiyuan (ID: AI_era)