Original title: "AIGC is developing too fast! Meta launches its first text-to-4D video synthesizer: are 3D game modelers going to be laid off?"
From text to 2D images, to video, to 3D models, and now, finally, to video of 3D models!
AI generative models have made great progress recently. In the image domain, users can generate images from natural-language prompts (e.g. DALL-E 2, Stable Diffusion), extend along the time dimension to generate continuous video (e.g. Phenaki), or extend along the spatial dimension to generate 3D models directly (e.g. DreamFusion).
So far, however, these tasks have mostly been studied in isolation, with little technical overlap between them.
Recently, researchers at Meta AI combined the advantages of video and 3D generative models to propose a new text-to-4D (3D + time) generation system, MAV3D (Make-A-Video3D), which takes a natural-language description as input and outputs a dynamic 3D scene representation that can be rendered from any viewpoint.
Paper link: https://arxiv.org/abs/2301.11280
Project link: https://make-a-video3d.github.io/
MAV3D is also the first model capable of generating a dynamic 3D scene from a given text description.
The method uses a 4D dynamic Neural Radiance Field (NeRF), optimizing the scene's appearance, density, and motion consistency by querying a text-to-video (T2V) diffusion model. The dynamic video output generated from the provided text can be viewed from any camera position and angle and can be composited into any 3D environment.
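As background, a dynamic NeRF conditions the density σ and the color c on time t as well as position: the color of a camera ray r viewed from direction d at time t is rendered with the standard NeRF volume-rendering integral (this is the generic formulation shared by NeRF-style methods, not an equation quoted from the MAV3D paper):

$$C(\mathbf{r}, t) = \int_{s_n}^{s_f} T(s)\,\sigma\big(\mathbf{r}(s), t\big)\,\mathbf{c}\big(\mathbf{r}(s), \mathbf{d}, t\big)\,\mathrm{d}s, \qquad T(s) = \exp\!\left(-\int_{s_n}^{s} \sigma\big(\mathbf{r}(u), t\big)\,\mathrm{d}u\right)$$

Training MAV3D then amounts to adjusting the parameters of σ and c so that videos rendered this way are scored highly by the T2V diffusion model.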
This method can be used to generate animated 3D assets for video games, visual effects, or augmented and virtual reality.
Unlike image and video generation, where large amounts of captioned data are available on the Internet for training, there is no ready-made collection of 4D models at all.
Notably, MAV3D training does not require any 3D or 4D data; the T2V model only needs to be trained on text-image pairs and unlabeled videos.
In the experimental section, the researchers carried out comprehensive quantitative and qualitative experiments to demonstrate the effectiveness of the method, which significantly improves over previously established internal baselines.
Because there is no training data for text-to-4D dynamic scenes, the researchers considered several ways to approach the task.
One approach would be to take a pretrained 2D video generator and extract a 4D reconstruction from the generated video. However, reconstructing the shape of deformable objects from video, i.e. Non-Rigid Structure from Motion (NRSfM), remains a very challenging problem.
The task becomes much easier given multiple simultaneous viewpoints of an object. Although multi-camera setups are rare in real data, the researchers observe that existing video generators implicitly model arbitrary viewpoints of the scenes they generate.
In other words, the video generator can serve as a "statistical" multi-camera setup for reconstructing the geometry and photometry of deformable objects.
The MAV3D algorithm achieves this by optimizing a dynamic NeRF and decoding the input text into videos rendered from random viewpoints sampled around the object.
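To make the overall loop concrete, here is a heavily simplified PyTorch sketch of the idea: render a clip from the learnable 4D scene, score it against the prompt, and back-propagate into the scene parameters. The class, the fake renderer, and the placeholder loss are illustrative stand-ins, since Meta has not released MAV3D's code.

```python
import torch
import torch.nn as nn

class TinyDynamicScene(nn.Module):
    """Toy stand-in for MAV3D's 4D scene: maps (x, y, z, t) to an RGB value.
    The real model uses six multi-resolution feature planes and a proper
    volumetric renderer rather than this single MLP."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))

    def render(self, n_frames=8, res=32):
        # Toy "renderer": evaluate the field on a fixed image-plane grid per
        # time step and treat the RGB output as the rendered frames
        # (no real ray marching or camera sampling here).
        ts = torch.linspace(0.0, 1.0, n_frames)
        y, x = torch.meshgrid(torch.linspace(-1, 1, res),
                              torch.linspace(-1, 1, res), indexing="ij")
        frames = []
        for t in ts:
            coords = torch.stack(
                [x, y, torch.zeros_like(x), torch.full_like(x, t)], dim=-1)
            frames.append(torch.sigmoid(self.mlp(coords)))   # (res, res, 3)
        return torch.stack(frames)                           # (n_frames, res, res, 3)

def placeholder_sds_loss(video):
    # Placeholder for the score-distillation term; in MAV3D this gradient comes
    # from a frozen text-to-video diffusion model conditioned on the prompt.
    return ((video - 0.5) ** 2).mean()

scene = TinyDynamicScene()
opt = torch.optim.Adam(scene.parameters(), lr=1e-3)
for step in range(100):
    video = scene.render()                # render a short clip of the 4D scene
    loss = placeholder_sds_loss(video)    # score it against the text prompt (stubbed)
    opt.zero_grad()
    loss.backward()
    opt.step()
```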
Directly using a video generator to optimize the dynamic NeRF does not, however, yield satisfactory results; several difficulties must be overcome:
1. An effective, end-to-end learnable dynamic 3D scene representation is needed.
2. A source of supervision is needed, because there are no large-scale (text, 4D) paired datasets to learn from.
3. The output resolution needs to be scaled up in both space and time, because 4D outputs demand a lot of memory and compute.
The MAV3D model builds on recent work on Neural Radiance Fields (NeRFs), combining advances in efficient (static) NeRFs and dynamic NeRFs, and represents the 4D scene as a set of six multi-resolution feature planes.
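To give a rough idea of what a six-plane 4D representation looks like in code, the following single-resolution PyTorch sketch featurizes a point (x, y, z, t) by bilinearly sampling six learnable planes (xy, xz, yz, xt, yt, zt) and combining the samples. The class and parameter names are hypothetical; the real model uses a multi-resolution pyramid with additional decoder networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SixPlane4D(nn.Module):
    """Toy single-resolution six-plane 4D feature grid: a query point
    (x, y, z, t) is featurized by bilinearly sampling the xy, xz, yz,
    xt, yt, zt planes and multiplying the sampled features."""
    def __init__(self, feat_dim=16, res=64):
        super().__init__()
        # One (feat_dim, res, res) learnable grid per coordinate pair.
        self.pairs = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]  # xy, xz, yz, xt, yt, zt
        self.planes = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(1, feat_dim, res, res)) for _ in self.pairs])

    def forward(self, xyzt):
        # xyzt: (N, 4) coordinates normalized to [-1, 1].
        feats = 1.0
        for plane, (a, b) in zip(self.planes, self.pairs):
            grid = xyzt[:, [a, b]].view(1, -1, 1, 2)                  # (1, N, 1, 2)
            sampled = F.grid_sample(plane, grid, align_corners=True)  # (1, C, N, 1)
            feats = feats * sampled[0, :, :, 0].t()                   # (N, C)
        return feats  # in the full model, small MLPs decode this into density and color

rep = SixPlane4D()
points = torch.rand(1024, 4) * 2 - 1    # random (x, y, z, t) queries
features = rep(points)                  # (1024, 16) feature vector per 4D point
```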
To supervise this representation without any corresponding (text, 4D) data, the researchers propose a multi-stage training pipeline for dynamic scene rendering and demonstrate the importance of each component in achieving high-quality results.
A key observation is that directly optimizing the dynamic scene with the text-to-video (T2V) model via Score Distillation Sampling (SDS) leads to visual artifacts and suboptimal convergence.
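For reference, SDS (introduced in DreamFusion for text-to-3D) updates the scene parameters θ with a gradient of roughly the following form, where x = g(θ) is the rendered output (here a video clip rather than a single image), x_t its noised version, ε̂_φ the diffusion model's noise prediction for prompt y at noise level t, and w(t) a weighting function; the temporal-aware SDS described below is a variant of this idea:

$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta) = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_\phi(\mathbf{x}_t;\, y,\, t) - \epsilon\big)\,\frac{\partial \mathbf{x}}{\partial \theta} \,\right]$$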
So the researchers choose to first fit a static 3D scene to the text prompt using a text-to-image (T2I) model, and only then extend the 3D scene model with dynamics.
In addition, a new temporal-aware SDS loss and a motion regularization term are introduced, which prove to be crucial for realistic and challenging motion.
The model is then extended to higher-resolution output through an additional temporal-aware super-resolution fine-tuning stage.
Finally, SDS with the super-resolution module of the T2V model provides high-resolution gradient information for supervising the 3D scene model, increasing its visual fidelity and allowing higher-resolution outputs to be sampled at inference time.
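Putting the stages together, the training schedule can be summarized with the following toy driver. The losses are dummy quadratic stubs standing in for the T2I SDS, the temporal-aware T2V SDS plus motion regularizer, and the SR-module SDS respectively; only the staged structure reflects the description above.

```python
import torch
import torch.nn as nn

# Toy placeholder scene; the real model is the six-plane dynamic NeRF sketched above.
scene = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 4))

def run_stage(name, loss_fn, steps=50, lr=1e-3):
    """Generic optimization stage; each MAV3D stage swaps in a different critic."""
    opt = torch.optim.Adam(scene.parameters(), lr=lr)
    for _ in range(steps):
        queries = torch.rand(256, 4) * 2 - 1          # random (x, y, z, t) points
        loss = loss_fn(scene(queries))
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"{name}: final loss {loss.item():.4f}")

# Stage 1: fit a static scene with SDS from the text-to-image model.
run_stage("static T2I SDS", lambda out: (out ** 2).mean())
# Stage 2: add dynamics with temporal-aware SDS from the T2V model plus motion regularization.
run_stage("dynamic T2V SDS + motion reg", lambda out: ((out - 0.3) ** 2).mean())
# Stage 3: super-resolution fine-tuning with SDS from the T2V super-resolution module.
run_stage("temporal-aware SR fine-tune", lambda out: ((out - 0.5) ** 2).mean())
```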
In the experiments, the evaluation uses CLIP R-Precision on the generated videos, which measures the consistency between the text and the generated scene and reflects the accuracy of retrieving the input prompt from the rendered frames. The researchers used the ViT-B/32 variant of CLIP and extracted frames at different viewpoints and time steps.
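CLIP R-Precision can be sketched as follows: embed rendered frames and a pool of candidate prompts with CLIP ViT-B/32, then count how often the true prompt is the top-ranked match for a frame. This sketch uses the openai `clip` package, random images as stand-ins for rendered views, and a tiny hypothetical prompt pool (the real metric uses a much larger one); it illustrates the metric, not the paper's evaluation code.

```python
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# In the real evaluation, frames are rendered from the 4D scene at different
# viewpoints and time steps; here random images stand in for them.
frames = [Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
          for _ in range(8)]

true_prompt = "a corgi playing with a ball"          # hypothetical example prompt
distractors = ["a spaceship landing on the moon",
               "a panda eating bamboo",
               "a robot dancing in the rain"]
prompts = [true_prompt] + distractors                # true prompt is index 0

image_input = torch.stack([preprocess(f) for f in frames]).to(device)
text_input = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image_input)
    text_feat = model.encode_text(text_input)
image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# For each frame, rank all prompts by cosine similarity; R-Precision is the
# fraction of frames for which the true prompt is ranked first.
sims = image_feat @ text_feat.T                      # (num_frames, num_prompts)
r_precision = (sims.argmax(dim=-1) == 0).float().mean().item()
print(f"CLIP R-Precision: {r_precision:.2f}")
```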
In addition, four qualitative metrics were obtained by asking human annotators for their preference between two generated videos: (i) video quality, (ii) fidelity to the text prompt, (iii) amount of motion, and (iv) realism of motion.
Text-to-4D comparison
Since there is no previous text-to-4D method, the researchers constructed three baselines based on T2V generation for comparison; each converts a sequence of 2D frames into a 3D scene representation in a different way.
The first sequence is obtained with a one-shot neural scene renderer (Point-E); the second by applying pixelNeRF independently to each frame; and the third by applying D-NeRF together with camera poses extracted with COLMAP.
The results show that the method exceeds the baseline models on the objective R-Precision metric and was rated higher by human annotators on all metrics.
In addition, the researchers explored how the method performs when rendered from different camera viewpoints.
Ablation experiments
1. No super-resolution fine-tuning: a model trained without scene super-resolution (SR) fine-tuning, using the same number of steps as MAV3D (stage 3). Human annotators preferred the model trained with SR in terms of quality, text alignment, and motion.
Super-resolution fine-tuning also enhances the quality of the rendered videos, producing high-resolution video with finer detail and less noise.
2. No pre-training: when the dynamic scene is optimized directly, without static scene pre-training, for the same number of steps as MAV3D, the result is much lower scene quality or poor convergence; the model with static pre-training was preferred in 73% of cases for video quality and in 65% for realistic motion.
Reference:
https://arxiv.org/abs/2301.11280
This article comes from the WeChat official account: Xin Zhiyuan (ID: AI_era)