
A Guided Tour of 3D Face Technology


This article comes from Megvii Research (author: Yan Dong) and is reprinted with authorization by AI Technology Review. For further reprints, please contact Megvii Research.

Contents

Introduction

Basic knowledge of 3D faces

A first look at 3D faces

Camera model

3D camera

3D face data

3D face-related tasks

Common Pipeline

3D face recognition

3D face reconstruction

Summary

Introduction

With the development of deep learning technology, face-related tasks have become a research hotspot in both academia and industry. Well-known face tasks include face detection, face identity recognition, facial expression recognition, and so on. Most of them take 2D RGB faces (which carry some texture information) as input. With the emergence and development of 3D scanning and imaging technology, a new route has opened up for face-related tasks: the 3D face.

Compared with the abundant literature and review articles introducing 2D face-related tasks, introductory material on 3D faces is scarce. This article sorts out the basic knowledge of 3D faces and summarizes some fundamental literature on 3D face recognition and reconstruction.

Basic knowledge of 3D faces

A first look at 3D faces

2D/2.5D/3D face

Generally speaking, RGB, grayscale, and infrared face images are 2D faces: they represent color or texture from a single viewpoint and carry no spatial information. The images used for training in deep learning are usually 2D.

2.5D is face depth data captured from a single viewpoint. The surface it shows is not continuous: if you try to rotate the face, gully-like hollow regions appear, because the depth data of the occluded parts was not captured during shooting.

What about 3D faces? A 3D face is generally composed of several depth images from different viewpoints, which fully describe the surface shape of the face; the face is presented in space as a dense point cloud with complete depth information.

Here is a question: to which dimension does an RGB-D face belong (note that dimension has nothing to do with texture or color)?

Camera model

Before tackling 3D face-related tasks, there is a basic but very important piece of background: the camera model. Without it you cannot get started in 3D. For camera models, it is recommended to read "14 Lectures on Visual SLAM" (link: https://github.com/gaoxiang12/slambook) or "Introduction to SLAM" (link: https://www.cnblogs.com/wangguchangqing/p/8126333.html). Here we give the shortest possible introduction to the camera model.

The camera model involves four coordinate systems: pixel coordinates, image coordinates, camera coordinates, and world coordinates. Camera imaging is the process of mapping three-dimensional points in real space onto the imaging plane (a two-dimensional space), also known as a projective transformation.

Camera coordinates → image coordinates

The mapping from the camera coordinate system to the image coordinate system can be explained by pinhole imaging. With the help of similar triangles, a point P = (X, Y, Z) in camera coordinates maps to the image-plane point (x, y) as:

x = f X / Z,  y = f Y / Z

where f is the focal length of the camera.

(Figure: camera pinhole imaging, from https://www.cnblogs.com/wangguchangqing/p/8126333.html)

In homogeneous form, the mapping from camera coordinates to image coordinates is:

Z [x, y, 1]^T = [[f, 0, 0], [0, f, 0], [0, 0, 1]] [X, Y, Z]^T

Image coordinates → pixel coordinates

Generally, a 2D image is represented by pixel values, and the coordinate origin is usually the upper-left corner of the image, so pixel coordinates differ from imaging-plane coordinates by a scaling and a translation of the origin:

u = α x + c_x,  v = β y + c_y

Substituting the camera-coordinate expressions for the image coordinates gives the relationship between pixel coordinates and camera coordinates:

u = f_x X / Z + c_x,  v = f_y Y / Z + c_y,  where f_x = α f and f_y = β f

To keep the homogeneous form (common to many transformation matrices), this is rewritten slightly as:

Z [u, v, 1]^T = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]] [X, Y, Z]^T = K P

In other words, K is the often-mentioned camera intrinsic matrix (Camera Intrinsics). K has four unknowns related to the camera's construction: f_x and f_y depend on the focal length and pixel size of the camera, while the translation (c_x, c_y) depends on the size of the camera's imaging plane.

World coordinates → camera coordinates

In fact, the camera coordinate system is not a particularly "stable" coordinate system, because the camera's origin and axis directions change as it moves, so a more stable coordinate system is needed to better represent the projective transformation. The constant coordinate system we usually use is the world coordinate system.

The camera coordinate system and the world coordinate system differ by a rotation matrix and a translation vector (from "14 Lectures on Visual SLAM"):

P_c = R P_w + t

Similarly, to preserve homogeneity, this is rewritten in the following form:

[P_c; 1] = [[R, t], [0^T, 1]] [P_w; 1] = T [P_w; 1]

where the transformation matrix T = [[R, t], [0^T, 1]] is the so-called camera extrinsic matrix (Camera Extrinsics).

Going from the world coordinate system to the pixel coordinate system is thus one complete projection: the transformation from camera coordinates to pixel coordinates requires the camera intrinsics, and the transformation from world coordinates to camera coordinates requires the camera extrinsics. In summary, it is written as:

Z [u, v, 1]^T = K [R | t] [X_w, Y_w, Z_w, 1]^T
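To make this chain concrete, here is a minimal NumPy sketch of the full pinhole projection from world coordinates to pixel coordinates (the intrinsic values are hypothetical, chosen only for illustration):

import numpy as np

# Project Nx3 world-coordinate points to Nx2 pixel coordinates
# using intrinsics K and extrinsics R, t (world -> camera).
def project_points(points_world, K, R, t):
    points_cam = points_world @ R.T + t   # world -> camera coordinates
    uv_h = points_cam @ K.T               # camera -> homogeneous pixel coords
    return uv_h[:, :2] / uv_h[:, 2:3]     # divide by depth Z

K = np.array([[500.0, 0.0, 320.0],        # hypothetical f_x, c_x
              [0.0, 500.0, 240.0],        # hypothetical f_y, c_y
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)             # camera frame == world frame
pts = np.array([[0.1, -0.2, 2.0]])
print(project_points(pts, K, R, t))       # -> [[345. 190.]]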

3D camera

According to how they work, cameras can be divided into monocular cameras (Monocular), binocular cameras (Stereo), and depth cameras (RGB-D); the essence of a camera is to reflect the three-dimensional world in a two-dimensional form.

A monocular camera has a single lens. Because it can capture only one image from one viewpoint at a time, it loses the depth of the scene. For example, if we know that an image point p lies on the imaging plane but not its distance, the corresponding 3D point can be anywhere on the ray from the camera origin through p. This is also why, on trips or at graduation, you can shoot those forced-perspective photos of someone "holding" a person in their hand.

(quoted from "Visual slam XIV")

So how do you take a picture with depth information? One way is to obtain depth with a binocular camera. As the name implies, a binocular camera has "two eyes". The line between the aperture centers of the left-eye and right-eye cameras forms the baseline, and a point P in space projects onto the image planes of both cameras. The distance from P to the baseline, i.e., the depth of P, can then be solved by similar triangles (see the formula below). In practical applications, disparity is easy to compute where object texture is rich, and given the amount of computation involved, binocular depth estimation is generally done on a GPU or FPGA.

(Figure from "14 Lectures on Visual SLAM")

z = f b / d, where the disparity d = u_L - u_R is the difference between the point's horizontal pixel coordinates in the left and right images, b is the baseline, and f is the focal length.
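As a quick numeric sketch of this relation (the focal length and baseline values below are hypothetical):

import numpy as np

# Stereo depth from disparity: z = f * b / d.
def depth_from_disparity(disparity, f=500.0, baseline=0.1):
    depth = np.zeros_like(disparity, dtype=np.float64)
    valid = disparity > 0                 # zero disparity = no match found
    depth[valid] = f * baseline / disparity[valid]
    return depth

disp = np.array([[10.0, 0.0], [25.0, 50.0]])
print(depth_from_disparity(disp))         # -> [[5. 0.] [2. 1.]]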

With the continuous evolution of technology, depth cameras now make it much more convenient to obtain the depth of an image. One kind of depth camera is the RGB-D camera based on structured light. Taking the human face as an example, the scanner emits a light pattern (such as a grating) onto the target face, computes the surface shape from the pattern's deformation, and from that computes the depth information of the face.

(Figure from "14 Lectures on Visual SLAM")

There is also an RGB camera in the figure, so how do we achieve a one-to-one correspondence between depth and RGB? After measuring depth, an RGB-D camera usually matches depth pixels to color-image pixels according to the relative positions of the two cameras set at manufacture, and outputs a one-to-one corresponding color map and depth map. We can then read the color information and distance information at the same image position, compute the 3D camera coordinates of each pixel, and generate a point cloud (Point Cloud).
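A minimal sketch of this back-projection, inverting the pinhole model above (the intrinsics are the same hypothetical values used earlier):

import numpy as np

# Back-project an HxW depth map (meters) into an Nx3 point cloud:
# X = (u - c_x) * Z / f_x, Y = (v - c_y) * Z / f_y.
def depth_to_point_cloud(depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    v, u = np.indices(depth.shape)        # pixel row/column grids
    z = depth.ravel()
    x = (u.ravel() - cx) * z / fx
    y = (v.ravel() - cy) * z / fy
    points = np.stack([x, y, z], axis=1)
    return points[z > 0]                  # drop pixels with no depth

depth = np.full((480, 640), 2.0)          # a flat "wall" 2 m away
print(depth_to_point_cloud(depth).shape)  # -> (307200, 3)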

There are also depth cameras based on the time-of-flight principle (Time of Flight, ToF). A ToF camera emits pulsed light toward the target and determines the distance to the object from the flight time between the outgoing and returning beams. Unlike a laser sensor, a ToF camera obtains the depth of every pixel in the image while emitting the pulsed light, whereas a laser generally obtains depth by point-by-point scanning.

(Figure from "14 Lectures on Visual SLAM")

To sum up, 3D face tasks generally use depth cameras to obtain the depth information of faces. Depth cameras include binocular cameras, RGB-D cameras based on the infrared structured-light principle (such as the first-generation Kinect), and ToF cameras based on the light time-of-flight principle (such as the second-generation Kinect).

3D face data

3D face-related tasks generally represent data in three ways: point cloud, mesh, and depth map.

Point cloud

In a 3D point cloud, each point corresponds to a 3D coordinate (x, y, z). Many 3D scanning devices use this data format to store the collected 3D face information. Sometimes the texture attributes of the face are concatenated onto the shape information, in which case the representation of each point becomes (x, y, z, r, g, b).

The disadvantage of the point-cloud representation is that each point's neighborhood information is hard to obtain, because points are generally stored unordered. In general, point cloud data is fitted to a smooth surface to reduce the impact of noise.

Mesh

A 3D mesh is represented by pre-computed and indexed information on the 3D surface. It needs more memory and storage space than point cloud data, but because of its flexibility, a mesh is better suited to 3D transformations such as affine transformation, rotation, and scaling. Each piece of 3D mesh data consists of the following elements: points, lines (edges), and triangular faces. Two-dimensional texture coordinates can also be stored with the point information, which helps reconstruct a more accurate 3D model.

Depth map (Depth/Range)

Depth images are also known as 2.5D or range images. The z-axis value of the 3D face is projected onto a 2D plane, giving something like a smoothed 3D surface. Because this is a two-dimensional representation, many existing 2D image-processing methods can be applied directly. This kind of data can be displayed directly as a grayscale image, or converted into a 3D mesh via triangulation.
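As a sketch of the triangulation idea (assuming vertices obtained by back-projecting each depth pixel, as in the point-cloud snippet earlier), a regular depth grid can be meshed by splitting every 2x2 pixel cell into two triangles:

import numpy as np

# Triangulate an HxW pixel grid: each 2x2 cell yields two triangles,
# expressed as index triples into the back-projected vertex array.
def grid_triangles(h, w):
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    upper = np.stack([tl, tr, bl], axis=1)    # upper-left triangles
    lower = np.stack([tr, br, bl], axis=1)    # lower-right triangles
    return np.concatenate([upper, lower])

print(grid_triangles(3, 3).shape)             # -> (8, 3): 4 cells x 2 triangles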

The first essential ingredient for 3D face work is 3D data, but the current situation is that little public data exists, far less than for 2D face images. High-precision 3D faces can only be collected with expensive equipment, and the process is tedious. This article collects the commonly used existing 3D/2.5D face datasets. For an introduction to the databases and 3D face tasks, see "3D face Research" (link: http://blog.csdn.net/alec1987/article/details/7469501).

3D face-related tasks

Common Pipeline

The pipeline of 2D face-related tasks generally divides into data preprocessing, feature extraction, feature analysis, and so on. What about the pipeline of 3D faces? This article borrows figures from "3D Face Analysis: Advances and Perspectives" (link: https://link.springer.com/chapter/10.1007/978-3-319-12484-1) to explain.

A general 3D/2.5D face analysis framework is shown above. We obtain a 3D/2.5D representation of the face (mesh, point cloud, depth) from the device, and then obtain a usable 3D/2.5D face through preprocessing operations such as sphere cropping, noise removal, repairing missing depth, point cloud registration, and so on.

Next, the preprocessed face can be represented in many ways, such as surface normals, curvature, a UV map, or the commonly used CNN features; after extracting a feature, various face tasks can be carried out, such as recognition, expression analysis, gender classification, age classification, and so on.

Given that the purpose of this article is to organize introductory knowledge of 3D faces, here we briefly introduce related work on 3D face reconstruction and recognition, including the lines of development and some approachable papers.

3D face recognition

In the first few decades of 3D face recognition, hand-designed features combined with classifiers or metric methods were used for face verification and identification. In recent years, with the rise of deep learning, some work has gradually turned to data-driven training of 3D face recognition models. This article briefly summarizes 3D face recognition methods as follows:

1. Traditional recognition methods

3D face recognition based on Point Cloud data

This kind of method usually does not consider facial features in 3D space, but directly matches the 3D point clouds. Common methods are ICP (Iterative Closest Point, link: https://en.wikipedia.org/wiki/Iterative_closest_point) and the Hausdorff distance (link: https://en.wikipedia.org/wiki/Hausdorff_distance).

As a rigid registration algorithm, ICP can correct for translation and rotation of the 3D point cloud itself, but it is not robust to the surface deformations caused by facial expression and occlusion, and it is time-consuming.

Some variants match using normal vectors sampled from the face surface, since normal information is more discriminative. Here we briefly introduce the ICP algorithm: ICP, the iterative closest point method, registers two point clouds with each other, analogous to landmark alignment for 2D faces.

Suppose there are two sets of matched points:

P = {p_1, ..., p_n},  P' = {p'_1, ..., p'_n}

We iteratively look for a rotation R and a translation t satisfying

p_i = R p'_i + t, for all i

that is, we solve

min_{R, t} (1/2) Σ_{i=1}^{n} || p_i - (R p'_i + t) ||^2

For the specific solution process, see Chapter 7 of "14 Lectures on Visual SLAM" (link: https://github.com/gaoxiang12/slambook).
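A minimal sketch of the closed-form SVD step used inside each ICP iteration (full ICP alternates this step with nearest-neighbor correspondence search); this follows the standard derivation, not any particular paper:

import numpy as np

# Given matched Nx3 point sets P and Q, find R, t minimizing
# sum || p_i - (R q_i + t) ||^2 via SVD of the covariance matrix.
def align_point_sets(P, Q):
    cp, cq = P.mean(axis=0), Q.mean(axis=0)   # centroids
    H = (Q - cq).T @ (P - cp)                 # 3x3 covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cp - R @ cq

# Sanity check: recover a known rigid transform.
rng = np.random.default_rng(0)
Q = rng.normal(size=(100, 3))
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
P = Q @ R_true.T + np.array([1.0, 2.0, 3.0])
R, t = align_point_sets(P, Q)
print(np.allclose(R, R_true), np.round(t, 6)) # -> True [1. 2. 3.]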

The Hausdorff distance evaluates the distance between two point sets by computing the maximum over the nearest-point pairs between the 3D point clouds of two faces. However, this algorithm is still not robust to expression and occlusion. An improved Hausdorff-distance algorithm uses the contours of 3D faces to filter candidates in the database.
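A minimal NumPy sketch of the symmetric Hausdorff distance between two small point clouds:

import numpy as np

# Directed Hausdorff distance: worst-case nearest-neighbor distance A -> B.
def directed_hausdorff(A, B):
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # NxM pairwise
    return d.min(axis=1).max()

def hausdorff(A, B):
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

A = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
B = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
print(hausdorff(A, B))                        # -> 2.0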

The template-face method deforms seed points of a 3D face template to fit the face under test and uses the fitting parameters for face recognition; a person-specific deformable face model can be generated through dense 3D face point cloud alignment.

3D face recognition based on facial Features

3D face recognition based on facial features can be divided into two kinds: local features and global features. For more information, see "3D Face Analysis: Advances and Perspectives" (link: https://link.springer.com/chapter/10.1007/978-3-319-12484-1) and "3D face recognition: a survey" (link: https://www.researchgate.net/publication/329202680_3D_face_recognition_a_survey).

Local features come in two kinds. One is features based on facial region components such as the nose, eyes, and mouth; these can be roughly divided into feature extraction based on facial landmarks, curvature, or patches. The other is feature extraction based on local descriptors, such as wavelet features extracted from depth images, SIFT, 2D-LBP, MRF, and LSP; there are also operators for extracting features directly from 3D data, such as 3D-LBP. Global features transform the whole face and extract features from it. Face data may be stored in different ways (point cloud, image, mesh); for example, the 3D face model can be represented as spherical harmonic features (SHF), or the 3D face surface can be homeomorphically mapped to a 2D mesh for sparse representation, using the sparse coefficients as features.

2. Deep learning recognition methods

CNNs have made great progress in 2D face recognition, but 2D faces are easily affected by makeup, pose, illumination, and expression. A 3D face inherently contains the spatial shape information of the face and is less affected by such external factors; compared with a 2D face, 3D face data carries more information. However, because 3D face data is hard to obtain and some of the available data lacks precision, the development of 3D face recognition has not been as hot.

Face recognition based on depth map

Common depth-map face recognition methods include extracting LBP-like features, fusing multiple depth frames, depth-map normalization, and so on. Two depth-map face recognition papers are briefly introduced here.

"Robust Face Recognition with Deeply Normalized Depth Images"

This paper presents a typical depth-map face recognition pipeline, split into two networks: a normalization network and a feature extraction network. The normalization network converts the input depth map into an HHA image (link: https://blog.csdn.net/WillWinston/article/details/78723507) and regresses 3DMM parameters through a CNN (3DMM is introduced in the 3D reconstruction section below); after the 3D point cloud is reconstructed, it can be projected into a normalized depth map. The feature extraction network is essentially similar to an ordinary 2D face recognition network and produces a feature vector characterizing the depth-map face.

"Led3D: A Lightweight and Efficient Deep Approach to Recognizing Low-quality 3D Faces"

This CVPR 2019 paper addresses low-quality depth-map face recognition. Its face preprocessing and data augmentation operations for depth maps are worth consulting. The paper uses the normal map of the sphere-cropped depth face as the network input, and experiments show this represents the depth face better. The authors also carefully designed a lightweight recognition network (mainly multi-layer feature fusion plus an attention mechanism), which is worth referring to.

Face recognition based on RGB-D

RGB-D face recognition is basically built on 2D face recognition: the depth map aligned with RGB is fed into the CNN as an extra channel. One advantage of RGB-D is that it adds the spatial shape information of the face. There are many face recognition papers on RGB-D images, but the basic idea is fusion either at the feature level or at the pixel level.
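A minimal PyTorch sketch of the pixel-level fusion idea (the layer sizes and class count are hypothetical, not taken from any cited paper):

import torch
import torch.nn as nn

# Pixel-level RGB-D fusion: stack the aligned depth map onto the RGB
# channels and give the first convolution 4 input channels.
class RGBDNet(nn.Module):
    def __init__(self, num_classes=100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1),   # 4 = R, G, B, D
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, num_classes)

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)    # Bx3xHxW + Bx1xHxW -> Bx4xHxW
        return self.fc(self.features(x).flatten(1))

net = RGBDNet()
logits = net(torch.randn(2, 3, 112, 112), torch.randn(2, 1, 112, 112))
print(logits.shape)                           # -> torch.Size([2, 100])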

"Accurate and robust face recognition from RGB-D images with a deep learning approach"

This 2016 paper proposes a deep-learning-based face recognition algorithm for RGB-D images. It pre-trains and transfers learning from RGB images and multi-frame-fused depth images, and fuses them at the feature level to improve the recognition rate.

Face recognition based on Depth/RGB-3DMM

In the past two years, some work has used 3DMM to regress a face model from the depth map or RGB image and applied it to recognition tasks. The general idea of this line of work is to expand 3D face data by regressing 3DMM parameters (expression, pose, shape) and use the result to train a CNN; examples are FR3DNet (link: https://arxiv.org/abs/1711.05942) and Deep 3D Face Identification (link: https://arxiv.org/abs/1703.10714).

"Deep 3D Face Identification"

This paper is one of the first to apply deep neural networks to the 3D face recognition task. The main idea is to fit the depth map to a 3D face model with expression via 3DMM + BFM, thereby expanding the depth data; it also applies data augmentations such as random occlusion and pose transformation, and finally fine-tunes a 2D face recognition network.
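A minimal sketch of the random-occlusion augmentation idea for depth maps (the patch-size limit is a hypothetical choice):

import numpy as np

# Zero out a random rectangle of the depth map to imitate occluded
# or missing data; 0 is treated as invalid depth.
def random_occlusion(depth, max_frac=0.3, rng=np.random.default_rng()):
    h, w = depth.shape
    ph = rng.integers(1, int(h * max_frac) + 1)
    pw = rng.integers(1, int(w * max_frac) + 1)
    y = rng.integers(0, h - ph + 1)
    x = rng.integers(0, w - pw + 1)
    out = depth.copy()
    out[y:y + ph, x:x + pw] = 0.0
    return out

aug = random_occlusion(np.full((112, 112), 1.5))
print((aug == 0).sum() > 0)                   # -> True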

"Learning from Millions of 3D Scans for Large-scale 3D Face Recognition"

This paper is a major work on 3D face recognition: it really does create millions of 3D face scans and proposes the 3D face recognition network FR3DNet. Tested on existing public datasets, the results are very good (the data-driven approach essentially saturates the benchmarks). To create a new identity, the paper finds the two 3D faces in the authors' private dataset with the largest bending-energy difference and merges them into a new 3D face (see the original for details). It also argues that in a face point-cloud recognition network, large convolution kernels help perceive the shape information of the point cloud better.

There are also many other data-driven 3D face recognition works, such as 3DMM-CNN (link: https://arxiv.org/abs/1612.04904). To sum up, deep-learning-based 3D face recognition is limited by the shortage of data and the insufficient precision of the existing data. Researchers' first task is heavy data augmentation or generating large numbers of virtual 3D faces, but whether these methods truly generalize well remains open to question. Perhaps the era of 3D face recognition has not arrived yet.

3D face reconstruction

Another focus of 3D face research is 3D face reconstruction, that is, recovering a 3D face model from one or more RGB face images. It has many applications, such as face animation, dense face alignment, and face attribute manipulation. Strictly speaking, reconstructing 3D from RGB is an ill-posed problem, because an RGB image captures texture features and carries no spatial information; but given its practical value, a number of 3D reconstruction methods have been proposed in recent years.

This article introduces several popular 3D face reconstruction methods for beginners' reference. For a fuller survey and recommendations, see "3D face Reconstruction Summary" (link: https://blog.csdn.net/u011681952/article/details/82623328). (Figure: an example of 3D face reconstruction, taken from PRNet.)

Face Reconstruction based on traditional methods

Traditional 3D face reconstruction methods generally use information expressed by the image itself, such as parallax and relative height, to complete the reconstruction. A common example is 3D reconstruction via binocular vision, where the difficulty lies in matching corresponding feature points across views. For such work, see "A Survey of Different 3D Face Reconstruction Methods" (link: https://pdfs.semanticscholar.org/d4b8/8be6ce77164f5eea1ed2b16b985c0670463a.pdf).

Model-based face Reconstruction

Two models are commonly used in 3D face reconstruction: the generic CANDIDE model, and 3DMM.

Among the many generic models, CANDIDE-3 is the most famous, consisting of 113 vertices and 168 faces. Simply put, these vertices and faces are adjusted so that their features match the image to be reconstructed: a global adjustment aligns the facial features and other facial landmarks as closely as possible, a local adjustment refines the local details of the face, and the reconstructed face is then obtained by vertex interpolation.

The pros and cons of this model are obvious: with so few template vertices, reconstruction is fast, but accuracy is severely limited, and facial details are reconstructed poorly.

The entry-level algorithm for 3D faces is the 3D Morphable Model (3DMM), a linear face-model representation proposed by Volker Blanz in "A Morphable Model For The Synthesis Of 3D Faces" in 1999. A 2D face image can be used to generate its corresponding 3D face model via:

S = S_mean + Σ_i α_i s_i

T = T_mean + Σ_i β_i t_i

where S_mean and T_mean are the average face shape and texture, s_i and t_i are the shape and texture principal components obtained by PCA over a set of 3D face scans, and α_i and β_i are the coefficients to be solved for.

So how do we reconstruct 3D from 2D? First we need to understand how a 3D model is projected onto the 2D plane. With the camera model introduced at the beginning, projecting the 3D model onto the 2D plane can be expressed as a (weak perspective) projection:

V_2d = f * P_r * R * V_3d + t_2d,  with P_r = [[1, 0, 0], [0, 1, 0]]

where f is a scale factor, R the rotation matrix, and t_2d a 2D translation.

An average face deformation model is constructed in advance from a face database. Given a new face image, the image is matched against the model and the model's parameters are adjusted, deforming the model until the difference between the model and the face image is minimized; the texture is then optimized, completing the face modeling.

In general, the supervision used in the 2D-3D reconstruction process is the correspondence between 2D facial landmarks and the orthogonal projection of the associated 3D vertices.
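A minimal sketch of landmark-supervised 3DMM fitting under an orthographic projection. The mean shape, basis, and landmark indices below are random stand-ins, not a real model such as BFM, and a real fit would also regularize the coefficients and estimate pose:

import numpy as np

n_verts, n_comp = 5000, 199
mean_shape = np.random.randn(3 * n_verts)     # stand-in mean face (x1,y1,z1,...)
basis = np.random.randn(3 * n_verts, n_comp)  # stand-in PCA shape basis

lm_idx = np.arange(68)                        # hypothetical landmark vertex ids
lm_2d = np.random.randn(68, 2)                # detected 2D landmarks

# Orthographic projection of landmark vertices: keep only x and y rows.
rows = np.stack([3 * lm_idx, 3 * lm_idx + 1]).T.ravel()
A = basis[rows]                               # (136, 199)
b = lm_2d.ravel() - mean_shape[rows]

# Least-squares solve for the shape coefficients alpha.
alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
shape = mean_shape + basis @ alpha            # fitted 3D face
print(shape.shape)                            # -> (15000,)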

End-to-end face Reconstruction based on CNN

With the 3DMM model, a single 2D face can be reconstructed in 3D, but a practical problem is that traditional 3DMM reconstruction is an iterative fitting process, which is inefficient and unsuitable for real-time 3D face reconstruction. From the principle of 3DMM, what needs to be adjusted are its 199-dimensional parameters (the number differs across bases). So why not regress the basis parameters with a CNN? That way, the parameters can be predicted by a network, achieving fast 3DMM reconstruction.

But one question remains: where does the training data come from? For this reason, most papers choose to fit 3DMM offline on large numbers of face images and use the fits as ground truth for neural network training. Although the problem is ill-posed, this works well in practice. Below, several easy-to-understand end-to-end CNN-based 3D face reconstruction methods are introduced.

"Disentangling Features in 3D Face Shapes for Joint Face Reconstruction and Recognition"

This paper uses a CNN to regress Identity Shape and Residual Shape parameters, with an expression similar to 3DMM; the difference is that in addition to the ordinary reconstruction loss (usually an element-wise L2 loss), an identification loss is added to keep the identity features of the reconstructed face unchanged.

"End-to-end 3D face reconstruction with deep neural networks"

The idea of this paper is also to regress 3DMM parameters. The authors argue that high-level semantic features represent identity information while mid-level features represent expression, so the corresponding parameters are regressed from different levels of the network to achieve 3D face reconstruction.

"Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network"

Another common end-to-end 3D face reconstruction method is the Position map Regression Network (PRN), strongly recommended (open-source code: https://github.com/YadiraF/PRNet).

This paper proposes an end-to-end position map regression network that jointly performs 3D face reconstruction and dense face alignment.

The authors introduce the UV position map, which stores the 3D point cloud coordinates of a face in a 2D image. A 3D point cloud containing 65536 points can thus be represented as a 256 x 256 x 3 2D image, with each pixel storing the spatial coordinates of one point. 3D face reconstruction can therefore be achieved by regressing the UV position map of the input image with an encoder-decoder network.

By designing a loss function with different weights for different facial regions, the authors achieve high-precision face reconstruction and dense landmark alignment.
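A minimal sketch of such a region-weighted loss on the UV position map (the weight values and region below are hypothetical, not PRNet's actual weight mask):

import numpy as np

# Weighted MSE over the UV position map: pixels in important facial
# regions (eyes, nose, mouth) contribute more to the loss.
def weighted_position_loss(pred, gt, weight_mask):
    sq_err = ((pred - gt) ** 2).sum(axis=-1)  # per-pixel squared error
    return (sq_err * weight_mask).mean()

pred = np.random.rand(256, 256, 3)
gt = np.random.rand(256, 256, 3)
mask = np.ones((256, 256))
mask[96:160, 64:192] = 4.0                    # hypothetical high-weight region
print(weighted_position_loss(pred, gt, mask))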

"3D Dense Face Alignment via Graph Convolution Networks"

The above UV position map regression has one problem: when the UV image is mapped back to the 3D face mesh, stripe artifacts can appear. In some recent 3D face reconstruction work, another approach, multi-level regression of the 3D face mesh, has achieved good results.

The authors of this paper regress the final mesh under multiple supervision tasks by increasing the number of regressed mesh vertices step by step. At the same time, using graph convolutions captures the connectivity between points more intrinsically, achieving a good reconstruction effect.

3D face reconstruction has been a hot topic in recent years, and every year many papers at various conferences propose new 3D face reconstruction schemes; from an entry-level point of view, mastering the common methods above will lay a good foundation for further research.

Summary

This article has introduced the basics of 3D face technology, including fundamentals such as the camera model, how 3D cameras work, and 3D face data processing, and has summarized methods for 3D face recognition and reconstruction, hoping to serve as a starting point and to help readers get started with 3D faces. Due to time constraints, some summaries may be imperfect; corrections are welcome.

