
A photo and a voice clip produce a hyper-realistic video! Nanjing University and others propose a new framework that accurately reproduces lip movements.


Thanks to CTOnews.com netizen "Assassin" for the tip!

Xin Zhiyuan reports

Editor: moist and sleepy

[Xin Zhiyuan Introduction] Recently, researchers from Nanjing University and other institutions developed a general framework that lets the person in a photo speak multiple languages from a single audio clip. Both the head movement and the mouth shape look very natural.

One audio clip plus one photo, and the person in the photo instantly starts talking.

In the generated talking animation, not only are the mouth shape and audio seamlessly aligned, but the facial expressions and head pose are also natural and expressive.

The supported image styles are also diverse: besides ordinary photos, cartoon images, ID photos, and the like all produce natural results.

With multilingual support on top, the person in the photo comes to life in an instant and can even speak a foreign language.

This is VividTalk, a general framework proposed by researchers from Nanjing University and other institutions that needs only a voice clip and a single picture to generate high-quality talking-head video.

Paper address: https://arxiv.org/abs/2312.01841

VividTalk is a two-stage framework composed of audio-to-mesh generation and mesh-to-video generation.
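As a rough illustration of this two-stage design, here is a minimal PyTorch sketch; AudioToMesh and MeshToVideo are placeholder modules with made-up dimensions, not the authors' released code.

```python
import torch
import torch.nn as nn

class AudioToMesh(nn.Module):
    """Stage 1 (sketch): map an audio feature sequence plus a reference mesh
    to a sequence of driven 3D meshes."""
    def __init__(self, audio_dim=256, n_vertices=5023):
        super().__init__()
        self.motion_head = nn.Linear(audio_dim, n_vertices * 3)

    def forward(self, audio_feats, ref_vertices):
        # audio_feats: (T, audio_dim); ref_vertices: (n_vertices, 3)
        offsets = self.motion_head(audio_feats).view(-1, ref_vertices.shape[0], 3)
        return ref_vertices.unsqueeze(0) + offsets          # (T, n_vertices, 3)

class MeshToVideo(nn.Module):
    """Stage 2 (sketch): render driven meshes plus a reference image into frames."""
    def forward(self, driven_meshes, ref_image):
        # Placeholder: a real model predicts dense motion and warps/synthesizes frames.
        T = driven_meshes.shape[0]
        return ref_image.unsqueeze(0).repeat(T, 1, 1, 1)    # (T, 3, H, W)

audio_feats = torch.randn(100, 256)       # ~100 audio frames
ref_vertices = torch.randn(5023, 3)       # reconstructed reference mesh
ref_image = torch.randn(3, 256, 256)      # reference portrait

driven = AudioToMesh()(audio_feats, ref_vertices)
frames = MeshToVideo()(driven, ref_image)
print(driven.shape, frames.shape)
```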

In the first stage, the one-to-many mapping between facial motion and the blendshape distribution is taken into account, and blendshapes and 3D vertices are used as intermediate representations, where blendshapes provide coarse motion and vertex offsets describe fine-grained lip motion.

In addition, a multi-branch Transformer network is used to make full use of the audio context to model the relationship with the intermediate representation.
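A minimal sketch of what such a multi-branch audio-to-motion network could look like in PyTorch, assuming a shared Transformer encoder over the audio context with separate heads for blendshape coefficients and lip vertex offsets; all dimensions and layer counts are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class MultiBranchAudioMotionGenerator(nn.Module):
    """Sketch of a multi-branch Transformer: a shared encoder consumes the audio
    context, then separate heads predict coarse blendshape coefficients and
    fine-grained lip vertex offsets. Sizes are illustrative."""
    def __init__(self, audio_dim=256, d_model=256, n_blendshapes=52, n_lip_vertices=500):
        super().__init__()
        self.in_proj = nn.Linear(audio_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.blendshape_head = nn.Linear(d_model, n_blendshapes)       # global, coarse motion
        self.lip_offset_head = nn.Linear(d_model, n_lip_vertices * 3)  # local, fine lip motion

    def forward(self, audio_feats):
        # audio_feats: (B, T, audio_dim)
        h = self.encoder(self.in_proj(audio_feats))
        blendshapes = self.blendshape_head(h)                            # (B, T, n_blendshapes)
        lip_offsets = self.lip_offset_head(h).view(*h.shape[:2], -1, 3)  # (B, T, n_lip_vertices, 3)
        return blendshapes, lip_offsets

gen = MultiBranchAudioMotionGenerator()
bs, offs = gen(torch.randn(1, 100, 256))
print(bs.shape, offs.shape)
```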

To learn rigid head motion from audio more reasonably, the researchers cast this problem as a code-query task in a discrete, finite space and build a learnable head pose codebook with a reconstruction-and-mapping mechanism.

After that, the two learned motions are applied to the reference identity, resulting in a driven mesh.
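The standard way to combine such motions is a blendshape-style linear deformation plus per-vertex offsets; the sketch below assumes that convention (the basis, vertex counts, and lip indices are hypothetical) and is not taken from the paper's code.

```python
import torch

def drive_mesh(ref_vertices, blendshape_basis, blendshape_coeffs, lip_offsets, lip_index):
    """Sketch of applying the learned motions to the reference identity.
    ref_vertices:      (V, 3)    reconstructed reference mesh
    blendshape_basis:  (K, V, 3) per-blendshape vertex displacements
    blendshape_coeffs: (T, K)    coarse expression motion per frame
    lip_offsets:       (T, L, 3) fine-grained offsets for lip vertices
    lip_index:         (L,)      indices of lip vertices in the mesh
    """
    # Coarse motion: weighted sum of blendshape displacements added to the reference mesh.
    driven = ref_vertices + torch.einsum('tk,kvc->tvc', blendshape_coeffs, blendshape_basis)
    # Fine motion: add lip-related vertex offsets on top.
    driven[:, lip_index] += lip_offsets
    return driven                                            # (T, V, 3)

T, K, V, L = 100, 52, 5023, 500
driven = drive_mesh(torch.randn(V, 3), torch.randn(K, V, 3),
                    torch.randn(T, K), torch.randn(T, L, 3),
                    torch.randperm(V)[:L])
print(driven.shape)
```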

In the second stage, projected textures of both the inner and outer face regions (such as the torso) are rendered from the driven mesh and the reference image to fully model the motion.

A novel dual-branch motion model is then designed to model dense motion, which is fed to the generator to synthesize the final video frame by frame.

VividTalk can generate lip-synced talking-head videos with expressive facial expressions and natural head poses.

As shown in the table below, both the visual results and the quantitative analysis demonstrate the superiority of the new method in generation quality and model generalization.

Framework implementation method

Given an audio sequence and a reference facial image as input, the new method can generate talking-head video with a variety of facial expressions and natural head poses.

The VividTalk framework consists of two stages, called audio-to-mesh generation and mesh-to-video generation.

The goal of the audio-to-mesh generation stage is to generate a 3D driven mesh from the input audio sequence and the reference facial image.

Specifically, FaceVerse is first used to reconstruct the reference facial image in 3D.

Next, non-rigid facial expression motion and rigid head motion are learned from the audio to drive the reconstructed mesh.

To this end, the researchers proposed a multi-branch BlendShape and vertex offset generator and a learnable head pose codebook.

BlendShape and Vertex offset Generator

Learning a general model to generate accurate mouth movements and expressive facial expressions with a specific person style is challenging in two ways:

1) The first challenge is audio-motion correlation: because the audio signal is most strongly correlated with mouth motion, it is difficult to model non-mouth motion from audio alone.

2) The mapping from audio to facial expression motion is inherently one-to-many, meaning the same audio input may correspond to more than one correct motion pattern, which can yield faces that lose personal characteristics.

To address the audio-motion correlation problem, the researchers use blendshapes and vertex offsets as intermediate representations: blendshapes provide global, coarse-grained facial expression motion, while lip-related vertex offsets provide local, fine-grained lip motion.

To address the loss of personal characteristics, the researchers propose a multi-branch Transformer-based generator that models the motion of each part separately and injects a subject-specific style to preserve personal features.
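One common way to realize such style injection is a learned per-subject embedding fused with the audio features; the sketch below assumes that mechanism (the embedding table, fusion by addition, and all sizes are illustrative), not the paper's exact design.

```python
import torch
import torch.nn as nn

class StyleInjection(nn.Module):
    """Sketch: inject a subject-specific style code into the audio features so
    that the generated motion keeps personal characteristics."""
    def __init__(self, n_subjects=100, d_model=256):
        super().__init__()
        self.style_table = nn.Embedding(n_subjects, d_model)

    def forward(self, audio_feats, subject_id):
        # audio_feats: (B, T, d_model); subject_id: (B,)
        style = self.style_table(subject_id).unsqueeze(1)   # (B, 1, d_model)
        return audio_feats + style                          # broadcast over time

inject = StyleInjection()
out = inject(torch.randn(2, 100, 256), torch.tensor([3, 7]))
print(out.shape)
```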

Learnable head pose codebook

Head pose is another important factor affecting the realism of talking-head videos. However, it is not easy to learn head pose directly from audio, because the relationship between the two is weak, which can lead to implausible and discontinuous results.

Inspired by previous research, using a discrete codebook as a prior can guarantee high-fidelity generation even when the input is degraded.

The researchers therefore cast this problem as a code-query task in a discrete, finite head pose space and carefully design a two-stage training mechanism: in the first stage, a rich head pose codebook is constructed; in the second stage, the input audio is mapped to the codebook to generate the final result, as shown in the figure below.
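The sketch below shows one plausible VQ-style realization of such a two-stage codebook, assuming nearest-neighbour code lookup and simple linear encoders and decoders; the code count, pose dimensionality, and prediction head are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class HeadPoseCodebook(nn.Module):
    """Sketch of a learnable head-pose codebook. Stage 1: an encoder/decoder
    learns discrete pose codes by reconstruction. Stage 2: an audio encoder
    predicts which code to query. All sizes are illustrative."""
    def __init__(self, n_codes=128, code_dim=64, pose_dim=6, audio_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, code_dim)
        self.pose_enc = nn.Linear(pose_dim, code_dim)    # stage-1 encoder
        self.pose_dec = nn.Linear(code_dim, pose_dim)    # shared decoder
        self.audio_enc = nn.Linear(audio_dim, n_codes)   # stage-2 code predictor

    def quantize(self, z):
        # Nearest-neighbour lookup in the codebook.
        dist = torch.cdist(z, self.codebook.weight)      # (N, n_codes)
        idx = dist.argmin(dim=-1)
        return self.codebook(idx), idx

    def reconstruct_pose(self, pose):
        # Stage 1: pose -> code -> pose (reconstruction objective).
        z_q, _ = self.quantize(self.pose_enc(pose))
        return self.pose_dec(z_q)

    def pose_from_audio(self, audio_feats):
        # Stage 2: audio -> code index -> decoded head pose.
        idx = self.audio_enc(audio_feats).argmax(dim=-1)
        return self.pose_dec(self.codebook(idx))

model = HeadPoseCodebook()
print(model.reconstruct_pose(torch.randn(10, 6)).shape)   # (10, 6)
print(model.pose_from_audio(torch.randn(10, 256)).shape)  # (10, 6)
```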

Mesh-to-video generation

As shown in the figure below, the researchers propose a dual-branch motion-VAE to model 2D dense motion, which is then used as input to the generator to synthesize the final video.

Converting 3D-domain motion directly into 2D-domain motion is difficult and inefficient, because the network has to discover the correspondence between motions in the two domains in order to model them well.

To ease the network's task and achieve better performance, the researchers use a projected texture representation so that this transformation is carried out in the 2D domain.

As shown above, in the face branch, the reference projected texture and the driven projected texture are concatenated and fed into the encoder, and an MLP then outputs a 2D facial motion map.
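A minimal sketch of such a face branch, assuming RGB projected textures, a small convolutional encoder, and an MLP that emits a flow-like 2D motion map; the channel counts and resolutions are illustrative only.

```python
import torch
import torch.nn as nn

class FaceMotionBranch(nn.Module):
    """Sketch of the face branch: the reference and driven projected textures
    are concatenated along channels, passed through a small conv encoder, and
    an MLP outputs a coarse 2D facial motion map."""
    def __init__(self, tex_channels=3, feat_dim=128, motion_size=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2 * tex_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # MLP head produces a (2, motion_size, motion_size) flow-like motion map.
        self.mlp = nn.Linear(feat_dim, 2 * motion_size * motion_size)
        self.motion_size = motion_size

    def forward(self, ref_texture, driven_texture):
        x = torch.cat([ref_texture, driven_texture], dim=1)   # (B, 6, H, W)
        motion = self.mlp(self.encoder(x))
        return motion.view(-1, 2, self.motion_size, self.motion_size)

branch = FaceMotionBranch()
flow = branch(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print(flow.shape)   # (1, 2, 64, 64)
```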

To further enhance lip motion and model it more accurately, the researchers also select lip-related landmarks and convert them into Gaussian maps, a more compact and efficient representation.

An hourglass network then takes the difference of the Gaussian maps as input and outputs 2D lip motion, which is concatenated with the facial motion and decoded into a dense motion map and an occlusion map.
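The sketch below illustrates one way to turn lip landmarks into Gaussian maps and form the subtraction that an hourglass network would consume; the landmark count, map resolution, and sigma are hypothetical.

```python
import torch

def landmarks_to_gaussian_map(landmarks, size=64, sigma=2.0):
    """Sketch: render 2D lip landmarks as a single-channel Gaussian heatmap,
    a compact representation of lip shape. Coordinates are assumed to be
    normalized to [0, 1]."""
    ys = torch.arange(size).view(size, 1).float()
    xs = torch.arange(size).view(1, size).float()
    heatmap = torch.zeros(size, size)
    for x, y in landmarks * (size - 1):
        heatmap += torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmap.clamp(max=1.0)

ref_lips = torch.rand(20, 2)      # 20 reference lip landmarks (normalized)
drv_lips = torch.rand(20, 2)      # 20 driven lip landmarks

# The difference of the two Gaussian maps is what the hourglass network consumes.
diff_map = landmarks_to_gaussian_map(drv_lips) - landmarks_to_gaussian_map(ref_lips)
print(diff_map.shape)   # (64, 64)
```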

Finally, the reference image is warped according to the predicted dense motion map, and the warped image, together with the occlusion map, is fed to the generator to synthesize the final video frame by frame.
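For the warping step, a standard choice is grid_sample with the dense motion field as a sampling grid and the occlusion map as a mask; the sketch below assumes that formulation (an identity grid stands in for a predicted motion field) and is not the paper's generator.

```python
import torch
import torch.nn.functional as F

def warp_and_mask(ref_image, dense_motion, occlusion):
    """Sketch of the final step: deform the reference image with the predicted
    dense motion field (sampling grid in [-1, 1]) and apply the occlusion map;
    a generator would then inpaint the occluded regions frame by frame."""
    # ref_image: (B, 3, H, W); dense_motion: (B, H, W, 2); occlusion: (B, 1, H, W)
    warped = F.grid_sample(ref_image, dense_motion, align_corners=True)
    return warped * occlusion

B, H, W = 1, 256, 256
# Identity sampling grid as a stand-in for a predicted dense motion field.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing='ij')
identity_grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)          # (1, H, W, 2)

out = warp_and_mask(torch.randn(B, 3, H, W), identity_grid, torch.ones(B, 1, H, W))
print(out.shape)   # (1, 3, 256, 256)
```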

Experimental results

Datasets

HDTF is a high-resolution audio-visual dataset containing more than 16 hours of video of 346 subjects. VoxCeleb is another, larger dataset with more than 100,000 videos and 1,000 identities.

The researchers first filtered the two datasets to remove invalid data, such as clips in which audio and video were out of sync.

The face region in each video is then cropped and resized to 256 × 256.

Finally, the processed videos are split 80% / 10% / 10% for training, validation, and testing.
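A simple sketch of the split described above, assuming the clips have already been cropped and resized by an earlier preprocessing step; the clip names and seed are hypothetical.

```python
import random

def split_dataset(clip_ids, seed=0):
    """Sketch of the 80/10/10 split: clip_ids is a list of preprocessed video
    clips (already cropped to the face and resized to 256x256)."""
    ids = list(clip_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (ids[:n_train],                       # training
            ids[n_train:n_train + n_val],        # validation
            ids[n_train + n_val:])               # testing

train, val, test = split_dataset([f"clip_{i:04d}" for i in range(1000)])
print(len(train), len(val), len(test))   # 800 100 100
```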

Implementation details

In the experiments, the researchers used FaceVerse, a state-of-the-art single-image reconstruction method, to reconstruct the videos and obtain ground-truth blendshapes and meshes for supervision.

In the training process, the Audio-To-Mesh stage and the Mesh-To-Video stage are trained separately.

Specifically, within the Audio-To-Mesh stage, the BlendShape and vertex offset generator and the learnable head pose codebook are also trained separately.

At inference time, the model works end-to-end by cascading the two stages.

For optimization, the Adam optimizer is used, with learning rates of 1 × 10 and 1 × 10 for the two stages, respectively. Total training time on 8 NVIDIA V100 GPUs was 2 days.
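A minimal sketch of the separate-stage Adam setup described above; the two nn.Linear modules stand in for the real stages, and the learning-rate values are placeholders, not the paper's reported settings.

```python
import torch

# Stand-ins for the two stages (hypothetical modules for illustration only).
audio_to_mesh = torch.nn.Linear(256, 52)    # stands in for the Audio-To-Mesh stage
mesh_to_video = torch.nn.Linear(64, 3)      # stands in for the Mesh-To-Video stage

# Each stage gets its own Adam optimizer and is trained separately.
opt_stage1 = torch.optim.Adam(audio_to_mesh.parameters(), lr=1e-4)  # placeholder lr
opt_stage2 = torch.optim.Adam(mesh_to_video.parameters(), lr=1e-4)  # placeholder lr
```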

Comparison with SOTA

It can be seen that the method proposed by the researchers generates high-quality talking-head videos with accurate lip synchronization and expressive facial motion.

By comparison:

SadTalker fails to generate accurate fine-grained lip motion, and the video quality is lower.

TalkLip produces blurred results and changes the skin color to a slightly yellowish color, losing identity information to some extent.

MakeItTalk cannot generate an accurate mouth shape, especially in cross-identity dubbing settings.

Wav2Lip tends to synthesize blurred mouth regions and, when given a single reference image as input, outputs video with a static head pose and eyes.

PC-AVS requires a driving video as input and struggles to preserve identity.

Quantitative comparison

As shown in the table below, the new approach performs better in terms of image quality and identity preservation, as reflected by a lower FID and a higher CSIM.

Thanks to the novel learnable codebook mechanism, the head poses generated by the new method are more diverse and natural.

Although the new method has a lower SyncNet score than Wav2Lip, it can drive the reference image with audio alone rather than a video, and it generates higher-quality frames.
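As a reference point, CSIM is typically computed as the cosine similarity between identity embeddings of real and generated frames; the sketch below assumes embeddings from some face-recognition network, with random tensors standing in for them.

```python
import torch
import torch.nn.functional as F

def csim(id_embed_real, id_embed_generated):
    """Sketch of the CSIM metric: cosine similarity between identity embeddings
    of real and generated frames, averaged over frames."""
    return F.cosine_similarity(id_embed_real, id_embed_generated, dim=-1).mean()

real = F.normalize(torch.randn(100, 512), dim=-1)   # per-frame identity embeddings (stand-ins)
fake = F.normalize(torch.randn(100, 512), dim=-1)
print(float(csim(real, fake)))
```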

Reference:

https://humanaigc.github.io/vivid-talk/

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).
