With just a single reference image, anyone can become the protagonist of a video.
With the development of diffusion models, generating high-quality images and videos from input text has become a reality, but text alone offers only limited control over the generated visual content.
To overcome this limitation, researchers have explored additional control signals and ways to edit existing content. Both directions improve the controllability of the generation process to some extent, but they still rely on text to describe the target content.
In practice, a new requirement arises: what if the content a user wants to generate cannot be described in language?
For example, a user may want to generate a video of an ordinary person, but simply putting that person's name in the input text is useless, because the language model cannot recognize individual names that are absent from its training corpus.
One feasible solution is to train a personalized model for the given individual.
For example, DreamBooth and Dreamix learn individual concepts from multiple images to generate personalized content, but both methods require separate learning for each individual, with multiple training images and fine-tuning per concept.
Recently, researchers from the National University of Singapore (NUS) and Huawei's Noah's Ark Lab have made new progress in personalized video editing. Through the cooperation of multiple expert models, no additional training or fine-tuning of personalized concepts is needed: a single target reference image is enough to replace the protagonist or background of an existing video, or to generate a new video featuring a specific protagonist.
Project home page: https://make-a-protagonist.github.io/
Paper address: https://arxiv.org/pdf/2305.08850.pdf
Code address: https://github.com/Make-A-Protagonist/Make-A-Protagonist
This research brings new possibilities to the field of personalized video editing and makes it easier and more efficient to generate personalized content.
Make-A-Protagonist separates a video into protagonist and background and applies visual or language reference information to each, enabling protagonist editing, background editing, and text-to-video generation with a specific protagonist.
The protagonist editing function lets the user keep the same scene description while replacing the protagonist in the video with the one in a reference image. In other words, users can swap the main character of a video for a character from an image of their choice.
The background editing function lets the user keep the same protagonist description as the original video (for example, "Suzuki Jimny") and use the original video frames as visual information, while changing the text description of the scene (for example, "in the rain"). In this way, the protagonist stays the same while the scene changes to create different visual effects.
The text-to-video function with a specific protagonist combines protagonist editing with background editing: the user supplies a reference image as the protagonist and a scene description to create new video content. For videos with multiple protagonists, Make-A-Protagonist can modify a single character or several at once.
Unlike DreamBooth and Dreamix, Make-A-Protagonist needs only a single reference image and no per-concept fine-tuning, which makes it more flexible across application scenarios. It gives users a simple and efficient way to perform personalized video editing and generation.
Method
Make-A-Protagonist uses several powerful expert models to parse the original video and the visual and language reference information, then combines a visual-and-language-conditioned video generation model with a mask-based denoising sampling algorithm to achieve general video editing. The method consists of three key parts: original video parsing, visual and language information parsing, and video generation.
Specifically, Make-A-Protagonist's inference process has three steps. First, the BLIP-2, GroundingDINO, Segment Anything and XMem models parse the original video to obtain the protagonist mask and the control signals of the original video.
Next, the visual and language information is parsed with CLIP and the DALL-E 2 Prior. Finally, a visual-and-language-conditioned video generation model and a mask-based denoising sampling algorithm generate new content from the parsed information.
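To make this three-step flow concrete, here is a minimal, hypothetical outline of how the stages could be organized. Every helper name below is a placeholder standing in for the expert models named in the text (BLIP-2, GroundingDINO, Segment Anything, XMem, CLIP, DALL-E 2 Prior); it sketches the data flow only and is not the project's actual code.

```python
from typing import Any, Dict, List


def parse_source_video(frames: List[Any]) -> Dict[str, Any]:
    """Step 1: caption, protagonist phrase, per-frame protagonist masks,
    and ControlNet control signals (placeholder)."""
    raise NotImplementedError


def parse_reference(reference_image: Any, scene_prompt: str) -> Dict[str, Any]:
    """Step 2: CLIP embedding of the masked reference image plus text
    features, optionally mapped by a DALL-E 2 Prior (placeholder)."""
    raise NotImplementedError


def generate_video(parsed: Dict[str, Any], conditions: Dict[str, Any]) -> List[Any]:
    """Step 3: mask-based denoising sampling with the fine-tuned
    video generation model (placeholder)."""
    raise NotImplementedError


def edit_video(frames: List[Any], reference_image: Any, scene_prompt: str) -> List[Any]:
    # Orchestrate the three stages described in the text.
    parsed = parse_source_video(frames)
    conditions = parse_reference(reference_image, scene_prompt)
    return generate_video(parsed, conditions)
```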
The innovation of Make-A-Protagonist lies in its visual-and-language-conditioned video generation model and its mask-based denoising sampling algorithm, which integrate multiple expert models and merge several sources of information to advance video editing.
These models let the system understand the original video and the visual and language information more accurately and generate high-quality video content.
Make-A-Protagonist gives users a powerful and flexible tool to easily edit general videos and create unique, striking visual works.
1. Original video parsing
The goal of original video parsing is to obtain a language description (caption) of the original video, a text description of the protagonist, the protagonist's segmentation results, and the control signals required by ControlNet.
For the caption and the protagonist text description, Make-A-Protagonist uses the BLIP-2 model.
By adapting BLIP-2's image network to video input, the model parses the video and generates a video description in captioning mode; this description is used by the video generation network during training and video editing.
For the protagonist's text description, Make-A-Protagonist uses VQA mode to ask "What is the protagonist of the video?" and uses the answer to further parse the protagonist information in the original video.
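As an illustration, a minimal sketch of querying BLIP-2 for a caption and a protagonist description via the Hugging Face transformers API is given below, using only the first frame as a stand-in for the clip; the checkpoint name, file path and prompts are assumptions, not the paper's exact configuration.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

frame = Image.open("frame_000.png").convert("RGB")  # placeholder frame path

# Captioning mode: no text prompt, the model simply describes the frame.
inputs = processor(images=frame, return_tensors="pt").to(device)
caption_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip()

# VQA mode: ask what the protagonist is and keep the answer.
question = "Question: what is the protagonist in the video? Answer:"
inputs = processor(images=frame, text=question, return_tensors="pt").to(device)
answer_ids = model.generate(**inputs, max_new_tokens=10)
protagonist = processor.batch_decode(answer_ids, skip_special_tokens=True)[0].strip()

print("caption:", caption)
print("protagonist:", protagonist)
```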
For protagonist segmentation in the original video, Make-A-Protagonist takes the protagonist text description above, uses the GroundingDINO model to locate the corresponding detection box in the first frame, and uses the Segment Anything model to obtain the first frame's segmentation mask. With the help of a tracking network (XMem), Make-A-Protagonist then obtains segmentation results for the whole video sequence.
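A simplified sketch of the first-frame step is shown below, assuming the protagonist's detection box from GroundingDINO is already available in pixel coordinates; the box and file paths are placeholders, and mask propagation with XMem is not shown.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

frame = np.array(Image.open("frame_000.png").convert("RGB"))  # first frame (placeholder path)
protagonist_box = np.array([120, 80, 480, 400])               # xyxy box from GroundingDINO (placeholder)

# Load Segment Anything and prompt it with the detection box.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(frame)
masks, scores, _ = predictor.predict(box=protagonist_box, multimask_output=False)

first_frame_mask = masks[0]  # boolean HxW protagonist mask, later propagated over the clip by XMem
```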
In addition, Make-A-Protagonist uses ControlNet to preserve the details and motion of the original video, so the corresponding control signals must be extracted from it; this work uses depth and pose signals.
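As an example of the depth branch, a per-frame monocular depth map could be extracted with an off-the-shelf estimator as in the sketch below; the model choice and file paths are assumptions and may differ from the paper's setup, and the pose branch is omitted.

```python
from PIL import Image
from transformers import pipeline

# Off-the-shelf monocular depth estimator (illustrative model choice).
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

frame = Image.open("frame_000.png").convert("RGB")   # placeholder frame path
depth_map = depth_estimator(frame)["depth"]          # PIL image with per-pixel depth
depth_map.save("depth_000.png")                      # later fed to a depth ControlNet
```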
Through these parsing steps, Make-A-Protagonist accurately obtains the language description, protagonist information and segmentation results of the original video, together with the control signals, laying a solid foundation for subsequent video generation and editing.
2. Visual and language information parsing
For visual signals, Make-A-Protagonist uses the CLIP image embedding as the generation condition. To remove the influence of the reference image's background, Make-A-Protagonist proceeds as in original video parsing: GroundingDINO and Segment Anything produce the segmentation mask of the protagonist in the reference image, and the masked image is fed into the CLIP visual model to obtain the reference visual information.
Language information is mainly used to control the background, and it is used in two ways. First, the CLIP language model extracts text features that serve as the keys and values of the attention layers.
Second, the DALL-E 2 Prior network transforms the language features into visual features to enhance their representation ability.
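A minimal sketch of how the visual and language conditions could be extracted with CLIP is shown below; the model name, file paths and prompt are placeholders, and the DALL-E 2 Prior mapping from text features into the image-embedding space is only indicated in a comment.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

reference = np.array(Image.open("reference.png").convert("RGB"))  # reference image (placeholder)
mask = np.load("reference_mask.npy")                              # HxW bool mask from Segment Anything
masked = (reference * mask[..., None]).astype(np.uint8)           # zero out the background

with torch.no_grad():
    vis_inputs = proc(images=Image.fromarray(masked), return_tensors="pt")
    visual_embed = clip.get_image_features(**vis_inputs)          # protagonist (visual) condition

    txt_inputs = proc(text=["a car driving down a mountain road in the rain"],
                      return_tensors="pt", padding=True)
    text_embed = clip.get_text_features(**txt_inputs)             # background (language) condition

# In the full method, text_embed would additionally be mapped into the CLIP
# image-embedding space by a DALL-E 2 Prior before fusion; that step is omitted here.
```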
3. Video generation
3.1 Video generation network training
To make full use of the visual information, Make-A-Protagonist adopts Stable UnCLIP as the pre-trained model and fine-tunes it on the original video so that video is generated under visual conditioning.
In each training iteration, Make-A-Protagonist extracts the CLIP image embedding of a randomly chosen frame of the video and feeds it into the residual blocks as visual information.
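A schematic sketch of that per-iteration conditioning is given below: one randomly chosen frame's CLIP image embedding is computed and would then condition the residual blocks of the video UNet, which is not shown here; the encoder checkpoint is an assumption.

```python
import random
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
image_encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")


def sample_visual_condition(frames):
    """frames: list of PIL images from the training video.
    Returns the CLIP image embedding of one randomly chosen frame, which
    would be injected into the residual blocks of the video UNet."""
    frame = random.choice(frames)
    pixels = image_processor(images=frame, return_tensors="pt").pixel_values
    with torch.no_grad():
        return image_encoder(pixels).image_embeds  # shape (1, 768)
```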
3.2 Mask-based denoising sampling
To integrate the visual and language information, the paper proposes mask-based denoising sampling, which fuses the two kinds of information in both the feature space and the latent space.
Specifically, in the feature space, Make-A-Protagonist uses the protagonist mask of the original video to apply the visual information to the protagonist region and the language information transformed by the DALL-E 2 Prior to the background region.
In the latent space, Make-A-Protagonist uses the same protagonist mask to fuse the denoising result conditioned only on visual information with the denoising result obtained after feature fusion.
Fusing information in both the feature space and the latent space makes the result more realistic and more consistent with the visual and language conditions. A schematic sketch of both fusions is given below.
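The sketch writes the two fusions as plain masked blends; the tensors are placeholders for the quantities described above, the mask is assumed to be broadcastable over them, and the assignment of regions follows the description in the text rather than the paper's exact formulation.

```python
def fuse_features(visual_feat, language_feat, mask):
    """Feature space: visual condition on the protagonist region,
    DALL-E 2 Prior-transformed language condition on the background."""
    return mask * visual_feat + (1.0 - mask) * language_feat


def fuse_latents(latent_visual_only, latent_feature_fused, mask):
    """Latent space: keep the visual-only denoising result on the protagonist
    region and the feature-fused result on the background."""
    return mask * latent_visual_only + (1.0 - mask) * latent_feature_fused
```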
Summary
Make-A-Protagonist introduces a new video editing framework that makes full use of visual and language information.
The framework supports independent editing of visual and language components: multiple expert networks parse the original video and the visual and language information, and a video generation network with a mask-based sampling strategy integrates this information.
Make-A-Protagonist demonstrates excellent video editing ability and can be widely applied to protagonist editing, background editing, and text-to-video generation with a specific protagonist.
The emergence of Make-A-Protagonist brings new possibilities to the field of video editing. It creates a flexible and innovative tool for users to edit and shape video content in an unprecedented way.
Both professional editors and creative enthusiasts can create unique and wonderful visual works through Make-A-Protagonist.
Reference:
https://make-a-protagonist.github.io/
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).