Recently, researchers from Meta and CMU proposed a new 6-DoF video representation method that achieves megapixel-resolution rendering at 18 frames per second on a single RTX 3090, and it may bring a revolutionary high-quality experience to VR.
HyperReel, the 6-DoF video representation model proposed by Meta and Carnegie Mellon University, may herald the birth of a new VR "killer app".
So-called "six-degrees-of-freedom" (6-DoF) video is, simply put, ultra-high-definition 4D experiential playback.
In it, the user is fully immersed in the dynamic scene and can move freely; whenever they change their head position (3 DoF) and orientation (3 DoF), the corresponding view is generated.
Paper address: https://arxiv.org/abs/2301.02238
Compared with previous work, HyperReel's biggest advantages lie in its memory and computational efficiency, both of which are crucial for portable VR headsets.
Using only vanilla PyTorch, HyperReel achieves megapixel-resolution rendering at 18 frames per second on a single NVIDIA RTX 3090.
TL;DR version:
1. The paper presents a ray-conditioned sample prediction network that enables high-fidelity, high-frame-rate rendering at high resolution, together with a compact, memory-efficient dynamic volume representation.
2. The 6-DoF video representation method HyperReel combines these two core components, achieving an ideal balance among speed, quality, and memory while rendering at megapixel resolution in real time.
3. HyperReel outperforms other methods in memory requirements, rendering speed, and other respects.
The paper explains that volumetric scene representations can provide photorealistic view synthesis for static scenes and form the basis of existing 6-DoF video technology.
However, the volume rendering program that drives these representations requires careful tradeoffs in terms of quality, rendering speed, and memory efficiency.
Existing methods share a drawback: they cannot simultaneously deliver real-time performance, a small memory footprint, and high-quality rendering, all of which are essential in highly challenging real-world scenarios.
To address these problems, the researchers proposed HyperReel, a 6-DoF video representation method based on NeRF (neural radiance field) technology.
Among them, the two core parts of HyperReel are:
1. A ray-conditioned sample prediction network that enables high-fidelity, high-frame-rate rendering at high resolution.
2. A compact and memory-efficient dynamic volume representation.
Compared with other methods, HyperReel's 6-DoF video pipeline not only performs extremely well in visual quality, but also requires very little memory.
At the same time, HyperReel can render at 18 frames per second at megapixel resolution without any custom CUDA code.
Specifically, HyperReel achieves a balance among high rendering quality, speed, and memory efficiency by combining the sample prediction network with a keyframe-based volume representation.
The sample prediction network can not only accelerate volume rendering, but also improve rendering quality, especially for challenging view-dependent scenes.
In terms of volume representation based on keyframes, the researchers use an extension of TensoRF.
This method can represent a complete video sequence while the memory consumption is roughly the same as that of a single static frame TensoRF.
Real-time demonstration
Next, the researchers demonstrate in real time how HyperReel renders dynamic and static scenes at 512x512 pixel resolution.
It is worth noting that smaller models were used for the Technicolor and Shiny scenes, so the rendering frame rate exceeds 40 FPS. The full model is used for the remaining datasets, and HyperReel still provides real-time inference.
Technicolor
Shiny
Stanford
Immersive
DoNeRF
Implementation method
To implement HyperReel, the first problem to consider is how to optimize a volume representation for static view synthesis.
Volume representations like NeRF model the density and appearance of every point in a static scene in 3D space.
More specifically, a function F_θ maps a position x and a direction ω along a ray to a color L_e(x, ω) and a density σ(x).
The trainable parameters θ here can be neural network weights, entries of an N-dimensional array, or a combination of both.
A new view of the static scene can then be rendered with the volume rendering integral:
C(o, ω) = ∫ T(o, x_t) σ(x_t) L_e(x_t, ω) dt    (Equation 1)
where x_t = o + tω and T(o, x_t) denotes the transmittance from o to x_t.
In practice, multiple sample points are taken along a given ray, and Equation 1 is evaluated by numerical quadrature:
C(o, ω) ≈ Σ_k w_k L_e(x_k, ω)    (Equation 2)
where the weight w_k specifies the contribution of each sample point's color to the output.
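To make the quadrature concrete, below is a minimal PyTorch sketch (our illustration, not the authors' code) of how per-sample colors and densities along one ray are composited into a pixel color; the computed weights correspond to w_k in Equation 2.
```python
import torch

def composite_ray(colors, densities, deltas):
    """colors: (K, 3) per-sample colors; densities: (K,); deltas: (K,) spacing between samples."""
    alpha = 1.0 - torch.exp(-densities * deltas)          # per-segment opacity
    # Transmittance: probability the ray reaches sample k without being absorbed earlier.
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = trans * alpha                               # w_k in Equation 2
    return (weights[:, None] * colors).sum(dim=0)         # composited pixel color
```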
In HyperReel, given a set of images and camera poses of a static scene, the training goal is to reconstruct the measured color associated with each ray.
Most scenes are made up of solid objects whose surfaces lie on a 2D manifold within the 3D scene volume. In this case, only a small number of sample points affect the rendered color of each ray.
Therefore, to accelerate volume rendering, the researchers want to query color and opacity only at points with non-zero weights.
As shown in the figure below, the researchers use a feed-forward network to predict a set of sample locations.
Specifically, a sample prediction network maps a ray (o, ω) to the sample points (x_1, ..., x_n) used for the volume rendering in Equation 2.
Here, the researchers use the Plücker parameterization to represent rays.
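For reference, here is a minimal sketch of the Plücker parameterization (the helper name is ours, not from the paper's code): a ray with origin o and unit direction ω is encoded by the 6-vector (ω, o × ω), which does not depend on where o lies along the ray.
```python
import torch

def pluecker_coordinates(origin, direction):
    """origin, direction: (..., 3); direction is assumed to be normalized."""
    moment = torch.cross(origin, direction, dim=-1)   # o x w, the moment of the ray
    return torch.cat([direction, moment], dim=-1)     # (..., 6) input to the sample network
```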
But there is a problem: giving the network too much flexibility may hurt view-synthesis quality. For example, if (x_1, ..., x_n) are completely arbitrary points, the renderings may not appear multi-view consistent.
To address this, the researchers use the sample prediction network to predict the parameters of a set of geometric primitives G_1, ..., G_n, where the primitive parameters can vary with the input ray. The sample points are then obtained by intersecting the ray with each primitive.
As shown in figure (a), given an input ray with camera origin o travelling in direction ω, the researchers first re-parameterize the ray using Plücker coordinates.
As shown in figure (b), a network takes this ray as input and outputs the parameters of a set of geometric primitives G_1, ..., G_n (such as axis-aligned planes and spheres) and a set of displacement vectors.
As shown in figure (c), to generate the sample points (x_1, ..., x_n) for volume rendering, the researchers compute the intersections between the ray and the geometric primitives and add the displacement vectors to the results. The advantage of predicting geometric primitives is that the sampled signal is smooth and easy to interpolate.
The displacement vectors give the sample points additional flexibility to better capture complex view-dependent appearance.
As shown in figure (d), the researchers finally perform volume rendering with Equation 2 to produce a pixel color, which is supervised against the corresponding observation during training.
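As an illustration of step (c), here is a minimal sketch, with placeholder arguments standing in for the sample network's predictions, of intersecting a ray with predicted axis-aligned planes and adding the displacement vectors:
```python
import torch

def primitive_sample_points(origin, direction, plane_offsets, axis, displacements):
    """
    origin, direction: (3,) ray; plane_offsets: (K,) predicted positions of axis-aligned planes
    along dimension `axis`; displacements: (K, 3) predicted per-sample offsets.
    All "predicted" inputs here are placeholders for the network outputs described above.
    """
    t = (plane_offsets - origin[axis]) / direction[axis]        # ray/plane intersection distances
    points = origin[None, :] + t[:, None] * direction[None, :]  # intersection points along the ray
    return points + displacements                               # flexibility for view-dependent effects
```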
Keyframe-based dynamic volumes
The method above makes it possible to sample the volume of a 3D scene efficiently.
But how should the volume itself be represented? In the static case, the researchers use the memory-efficient Tensorial Radiance Fields (TensoRF) approach; in the dynamic case, they extend TensoRF to a keyframe-based dynamic volume representation.
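To illustrate why such a factored volume is memory-efficient, below is a rough sketch assuming a TensoRF-style vector-matrix factorization (the exact layout HyperReel uses may differ): a 3D feature grid is stored as outer products of learned 2D planes and 1D lines and queried per sample point.
```python
import torch
import torch.nn.functional as F

def factored_features(planes, lines, xyz):
    """
    planes: three tensors of shape (1, C, R, R); lines: three tensors of shape (1, C, R, 1);
    xyz: (N, 3) query points in [-1, 1]. Returns (C, N) features for the appearance/density heads.
    """
    plane_axes, line_axes = [(0, 1), (0, 2), (1, 2)], [2, 1, 0]
    feats = 0.0
    for plane, line, (a, b), c in zip(planes, lines, plane_axes, line_axes):
        pgrid = xyz[:, [a, b]].view(1, -1, 1, 2)                   # coordinates into the 2D plane factor
        lgrid = torch.stack([torch.zeros_like(xyz[:, c]), xyz[:, c]], dim=-1).view(1, -1, 1, 2)
        pfeat = F.grid_sample(plane, pgrid, align_corners=True)[0, :, :, 0]   # (C, N)
        lfeat = F.grid_sample(line, lgrid, align_corners=True)[0, :, :, 0]    # (C, N)
        feats = feats + pfeat * lfeat                              # outer-product (vector-matrix) term
    return feats
```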
The following figure explains the process of extracting dynamic sample point representations from keyframe-based representations.
As shown in step (1) of the figure, the researchers first use the velocities output by the sample prediction network to translate the sample points (x_1, ..., x_n) at time τ into the nearest keyframe.
Then, as shown in step (2), they query the outer products of the space-time textures to generate the appearance features of each sample point, which are converted into color via Equation 10 of the paper.
The opacity of each sample is extracted in the same process.
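Here is a minimal sketch of the warp in step (1); variable names are illustrative rather than taken from the released code:
```python
import torch

def warp_to_keyframe(points, velocities, tau, keyframe_times):
    """points, velocities: (K, 3); tau: query time; keyframe_times: (M,) keyframe timestamps."""
    nearest = keyframe_times[torch.argmin(torch.abs(keyframe_times - tau))]
    # Advect each sample point to the nearest keyframe before querying the keyframe volume.
    return points + velocities * (nearest - tau)
```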
Results
Comparison on static scenes
Here, the researchers compared HyperReel with existing static view synthesis methods (including NeRF, InstantNGP, and three sampling-network-based methods).
DoNeRF dataset
The DoNeRF dataset contains six synthetic sequences with an image resolution of 800 × 800 pixels.
As shown in Table 1, HyperReel outperforms all baselines in quality and substantially improves on the performance of other sampling-network schemes.
At the same time, HyperReel is implemented in vanilla PyTorch and can render an image of 800 × 800 pixels at a speed of 6.5 FPS on a single RTX 3090 GPU (or 29 FPS with a Tiny model).
In addition, compared with R2L's deep MLP of 88 layers and 256 hidden units, the researchers' network with 6 layers and 256 hidden units plus a TensoRF volume backbone runs inference faster.
LLFF dataset
The LLFF dataset contains 8 real-world sequences with 1008 × 756 pixel images.
As shown in Table 1, HyperReel's method is better than DoNeRF, AdaNeRF, TermiNeRF and InstantNGP, but the quality achieved is slightly worse than that of NeRF.
Due to erroneous camera calibration and sparse input views, this dataset poses a great challenge to explicit volume representations.
Comparison on dynamic scenes
Technicolor dataset
The Technicolor light field dataset contains videos of various indoor environments captured by a time-synchronized 4x4 camera rig, where each frame of each video stream is 2048 × 1088 pixels.
The researchers compared HyperReel with Neural 3D Video on five sequences from this dataset (Birthday, Fabien, Painter, Theater, Trains) at full image resolution, each 50 frames long.
As shown in Table 2, HyperReel exceeds the quality of Neural 3D Video while training for only 1.5 hours per sequence (rather than more than 1,000 hours for Neural 3D Video) and rendering faster.
Neural 3D Video dataset
The Neural 3D Video dataset contains 6 indoor multi-view video sequences shot by 20 cameras at a resolution of 2704 × 2028 pixels.
As shown in Table 2, HyperReel outperforms all baseline methods on this dataset, including the latest work such as NeRFPlayer and StreamRF.
In particular, HyperReel surpasses NeRFPlayer in quality while rendering about 40 times faster, and it also exceeds StreamRF in quality, even though StreamRF uses a Plenoxels backbone (with custom CUDA kernels to speed up inference).
In addition, HyperReel consumes much less memory per frame on average than both StreamRF and NeRFPlayer.
Google Immersive dataset
The Google Immersive dataset contains light field video for a variety of indoor and outdoor environments.
As shown in Table 2, HyperReel is 1 dB better in quality and faster in rendering than NeRFPlayer.
Unfortunately, HyperReel has not yet reached the rendering speed required for VR (ideally 72 FPS in stereo).
However, because the method is implemented in vanilla PyTorch, its performance can be further optimized with engineering work such as custom CUDA kernels.
About the author
Benjamin Attal, the paper's first author, is currently a PhD student at the Carnegie Mellon University Robotics Institute. His research interests include virtual reality, as well as computational imaging and displays.
Reference:
https://arxiv.org/abs/2301.02238
https://hyperreel.github.io
https://hub.baai.ac.cn/view/23146
https://twitter.com/DrJimFan/status/1611791338034593793
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era); editors: sleepy, Aeneas.