With the progress of deep learning, research on face-related tasks has become a hot topic in both academia and industry. Well-known face tasks include face detection, face recognition and facial expression recognition, which mainly take a 2D RGB face (usually containing some texture information) as input. The emergence of 3D scanning and imaging technology has opened up a new line of exploration for face-related tasks: the 3D face.
Compared with the many introductory articles and reviews on 2D face tasks, introductory material on 3D faces is relatively scarce. This article reviews the basic knowledge of 3D faces and summarizes some fundamental literature on 3D face recognition and reconstruction.
Basic knowledge of 3D face
Generally speaking, RGB, grayscale and infrared face images are 2D faces: they represent color or texture from a specific viewpoint and contain no spatial information. The images used to train deep learning models are usually 2D.
2.5D refers to facial depth data captured from a single viewpoint. Because of the viewing angle it describes a discontinuous surface: when you try to rotate the face, gully-like gaps appear, since the depth of the occluded parts was never captured during shooting.
What about 3D faces? A 3D face is usually built from multiple depth images taken from different angles; it fully describes the surface shape of the face and presents the face as a dense point cloud in space with depth information.
Camera model
The camera model involves four coordinate systems: pixel coordinates, image coordinates, camera coordinates, and world coordinates (did the phrase "reference frame" from high-school physics just flash through your mind?). The camera imaging process maps points in three-dimensional space onto the imaging plane, a two-dimensional space; this is also known as projection transformation.
Camera coordinates → image coordinates
The mapping from the camera coordinate system to the image coordinate system can be explained by pinhole imaging. Using similar triangles, a point P = (X, Y, Z) in the camera coordinate system is mapped to the point (x, y) on the image plane:

x = fX / Z,  y = fY / Z

where f is the focal length of the camera. In homogeneous form, the mapping from camera coordinates to image coordinates is

Z [x, y, 1]^T = [[f, 0, 0], [0, f, 0], [0, 0, 1]] [X, Y, Z]^T
Image coordinates → pixel coordinates
A 2D image is usually represented by pixel values, and the coordinate origin is usually the upper-left corner of the image, so pixel coordinates differ from imaging-plane coordinates by a scaling and a translation of the origin.
Expressing the image coordinates in terms of camera coordinates gives the relationship between pixel coordinates (u, v) and camera coordinates:

u = f_x · X / Z + c_x,  v = f_y · Y / Z + c_y

To keep the expression homogeneous (as is common for transformation matrices), this is rewritten slightly:

Z [u, v, 1]^T = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]] [X, Y, Z]^T = K P

The 3×3 matrix K is what is usually called the camera intrinsic matrix (Camera Intrinsics). K has four unknowns determined by the camera itself: f_x and f_y depend on the focal length and the physical pixel size, while the translation terms c_x and c_y depend on the size of the camera's imaging plane.
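As a minimal sketch (with made-up intrinsic values, not those of any real camera), the pinhole projection with an intrinsic matrix K can be written in a few lines of NumPy:

```python
import numpy as np

# Hypothetical intrinsics: fx, fy, cx, cy below are illustrative values only.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project_to_pixels(points_cam, K):
    """Project N x 3 points in camera coordinates to N x 2 pixel coordinates (u, v)."""
    uvw = points_cam @ K.T              # [u*Z, v*Z, Z] for each point
    return uvw[:, :2] / uvw[:, 2:3]     # perspective division by the depth Z

pts = np.array([[0.1, -0.05, 1.2]])     # one point 1.2 m in front of the camera
print(project_to_pixels(pts, K))        # ~ [[361.7, 219.2]]
```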
World coordinates → camera coordinates
In fact, the camera coordinate system is not a particularly "stable" coordinate system: as the camera moves, both the origin and the direction of each axis change, so a more stable coordinate system is needed to describe the projective transformation. The fixed coordinate system we usually use is the world coordinate system.
The camera coordinate system differs from the world coordinate system by a rotation matrix R and a translation vector t:

P_cam = R · P_world + t

Similarly, to keep the expression homogeneous, this is rewritten in the following form:

[P_cam, 1]^T = [[R, t], [0, 1]] [P_world, 1]^T

The matrix [R | t] is the so-called camera extrinsic matrix (Camera Extrinsics).
Going from the world coordinate system all the way to the pixel coordinate system therefore chains the two steps: the extrinsics first convert world coordinates to camera coordinates, and the intrinsics then convert camera coordinates to pixel coordinates. In 3D face work this full projection is often approximated by a weak perspective projection.
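A hedged sketch of the full chain (with an assumed identity rotation and a made-up translation, purely for illustration):

```python
import numpy as np

def world_to_pixels(points_world, K, R, t):
    """Chain extrinsics and intrinsics: world coords -> camera coords -> pixel coords."""
    points_cam = points_world @ R.T + t      # extrinsics: P_cam = R @ P_world + t
    uvw = points_cam @ K.T                   # intrinsics
    return uvw[:, :2] / uvw[:, 2:3]          # perspective division by depth

# Assumed pose: camera axes aligned with the world, shifted 0.5 m along z.
R = np.eye(3)
t = np.array([0.0, 0.0, 0.5])
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
print(world_to_pixels(np.array([[0.0, 0.0, 1.0]]), K, R, t))   # -> [[320., 240.]]
```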
3D camera
According to how they work, cameras can be divided into monocular cameras (Monocular), binocular cameras (Stereo) and depth cameras (RGB-D); the essence of a camera is to project the three-dimensional world onto a two-dimensional image.
A monocular camera has a single lens. Because it can only capture an image from one viewpoint at a time, it loses the depth of the scene: if an image point P is known on the imaging plane, the corresponding 3D point can lie anywhere on the ray joining the camera origin and P, since its distance is unknown. This is exactly why the forced-perspective trick photos people take when traveling or graduating (a carefully placed person appearing to hold up a distant object) work.
So how do we capture depth? One way is to obtain it with a binocular camera. As the name implies, a binocular camera has left and right cameras whose optical centers are separated by a baseline b. A point P in space is projected onto the left and right images at horizontal positions u_L and u_R, so by similar triangles the distance of P to the baseline, i.e. its depth, can be solved from the disparity:

z = f · b / d,  where d = u_L − u_R

In practice, disparity is easier to compute where the texture is rich, and because of the amount of computation involved, binocular depth estimation is usually done on a GPU or FPGA.
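A minimal sketch of depth from disparity, with illustrative focal length and baseline values:

```python
import numpy as np

def depth_from_disparity(disparity_px, f_px, baseline_m):
    """Stereo depth: z = f * b / d, with disparity d in pixels and baseline b in meters."""
    d = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full_like(d, np.inf)              # zero disparity -> point at infinity
    depth[d > 0] = f_px * baseline_m / d[d > 0]
    return depth

print(depth_from_disparity([32.0, 8.0], f_px=500.0, baseline_m=0.12))  # ~ [1.875, 7.5] m
```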
With the continuous development of technology, the emergence of depth cameras has made it much easier to capture depth. One type of depth camera is the RGB-D camera based on structured light. Taking the face as an example, the scanner projects a light pattern (such as a grating) onto the target surface and infers the shape of the surface from how the pattern deforms, thereby recovering the facial depth information.
Such a device also contains an RGB camera, so how are depth and RGB put into one-to-one correspondence? After measuring depth, an RGB-D camera typically pairs depth and color pixels according to the relative positions of the sensors fixed at manufacturing time, and outputs a color map and a depth map that correspond pixel by pixel. We can then read color and distance at the same image position, compute the 3D camera coordinates of each pixel, and generate a point cloud.
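A minimal sketch of that last step, back-projecting a depth map into a point cloud with the intrinsic matrix K (random depths stand in for a real capture):

```python
import numpy as np

def depth_to_point_cloud(depth, K):
    """Back-project an H x W depth map (meters) into an N x 3 point cloud in camera coordinates."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    v, u = np.indices(depth.shape)            # pixel row/column grid
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]                 # drop pixels with no valid depth

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
cloud = depth_to_point_cloud(np.random.uniform(0.4, 0.8, (480, 640)), K)
```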
There are also depth cameras based on time-of-flight (ToF), which emit pulsed light toward the target and determine the distance from the round-trip flight time of the beam. Unlike a laser scanner, a ToF camera captures the depth of every pixel in the image at once while emitting the pulse, whereas a laser usually obtains depth by scanning point by point.
To sum up, 3D face tasks usually use depth cameras to obtain facial depth information. Depth cameras include binocular cameras, RGB-D cameras based on infrared structured light (such as the first-generation Kinect), and ToF cameras based on time of flight (such as the Kinect 2).
3D face data
Data for 3D face tasks are usually represented in one of three ways: point cloud, mesh, or depth map.
Point cloud (Point cloud)
In a 3D point cloud, each point corresponds to a 3D coordinate (x, y, z). Many 3D scanning devices use this data format to store the acquired 3D facial information. Sometimes the texture attributes of the face are concatenated with the shape information, so that each point becomes (x, y, z, r, g, b), where (x, y, z) is the spatial coordinate.
The disadvantage of point clouds is that the neighborhood information of each point is hard to obtain, because the points are usually stored in no particular order. In general, point cloud data are fitted to smooth surfaces to reduce the impact of noise.
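Because the storage is unordered, a spatial index such as a k-d tree is often built to recover local neighborhoods; a small sketch with SciPy (the random cloud stands in for a scanned face):

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(10000, 3)             # toy point cloud, illustrative only

tree = cKDTree(points)                         # spatial index over the unordered points
dists, idx = tree.query(points[0], k=8)        # 8 nearest neighbours of the first point
neighbourhood = points[idx]                    # local patch, usable for normal/curvature estimation
```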
Grid (Mesh)
A 3D mesh represents the 3D surface with pre-computed and indexed information. Compared with point cloud data it needs more memory and storage space, but because of its flexibility a mesh is better suited to 3D transformations such as affine transformation, rotation and scaling. Each 3D mesh consists of the following elements: vertices, edges, and triangular faces. Two-dimensional texture coordinates can also be stored with the vertex information, which helps reconstruct a more accurate 3D model.
Depth (Depth/Range)
A depth image, also known as a 2.5D or range image, projects the z-axis values of a 3D face onto a 2D plane, resembling a smooth sampling of the 3D surface. Because it is a two-dimensional representation, many existing methods for processing 2D images can be applied directly. The data can be displayed directly as a grayscale image or converted to a 3D mesh by triangulation.
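A minimal sketch of that triangulation, splitting every 2×2 pixel block of the depth grid into two triangles (the random depth image is only a stand-in):

```python
import numpy as np

def depth_to_mesh(depth):
    """Triangulate a depth image on its pixel grid: two triangles per 2x2 pixel block."""
    h, w = depth.shape
    v, u = np.indices((h, w))
    vertices = np.stack([u, v, depth], axis=-1).reshape(-1, 3)   # one (x, y, z) vertex per pixel
    idx = np.arange(h * w).reshape(h, w)
    a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()           # top-left, top-right of each block
    c, d = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()             # bottom-left, bottom-right
    faces = np.concatenate([np.stack([a, b, c], 1), np.stack([b, d, c], 1)])
    return vertices, faces

verts, faces = depth_to_mesh(np.random.uniform(0.4, 0.8, (64, 64)))
```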
3D face related tasks
Commonly used Pipeline
The pipeline of 2D face tasks is generally divided into data preprocessing, feature extraction, feature analysis and so on. What does the pipeline of a 3D face task look like? This article follows "3D Face Analysis: Advances and Perspectives".
In a general 3D/2.5D facial analysis framework, we first obtain a 3D/2.5D representation of the face (mesh, point cloud, depth) from the capture device, and after some preprocessing operations (such as spherical cropping, noise removal, repair of missing depth, point cloud registration and so on) we obtain usable 3D/2.5D face data.
Next, the preprocessed surface is described by features computed with one of many methods, such as surface normals, curvature, UV maps, or the commonly used CNN features. After feature extraction, a variety of facial tasks can be performed, such as recognition, expression analysis, gender classification and age estimation.
Since the purpose of this article is to sort out introductory knowledge of 3D faces, the following briefly introduces work on 3D face reconstruction and recognition, including the development of the field and some relatively approachable papers.
3D face recognition
In the first decades of 3D face recognition research, hand-designed facial features combined with classifiers or distance metrics were used for face verification and identification. In recent years, with the rise of deep learning, some work has trained 3D face recognition models in a data-driven way. The methods of 3D face recognition are briefly summarized as follows:
1. Traditional recognition methods
3D face recognition based on Point Cloud data
This class of methods usually does not extract facial features in 3D space but directly matches 3D point clouds. Common approaches are ICP (Iterative Closest Point, link: https://en.wikipedia.org/wiki/Iterative_closest_point) and the Hausdorff distance (link: https://en.wikipedia.org/wiki/Hausdorff_distance).
As a rigid registration algorithm, ICP can correct for the translation and rotation of the 3D point cloud itself, but it is not robust to the surface deformations caused by expressions and occlusion, and its time cost is relatively high.
Some variants instead match normal vectors sampled from the facial surface, since normal information is more discriminative. The ICP algorithm itself is briefly introduced here: ICP aligns two point clouds by iterating over nearest-point correspondences, playing a role roughly analogous to keypoint-based alignment for 2D faces.
Suppose there are two point clouds P = {p_1, ..., p_n} and Q = {q_1, ..., q_n}. ICP iteratively searches for a rotation R and a translation t that minimize the alignment error, i.e. it solves

(R, t) = argmin_{R, t} Σ_i || q_i − (R p_i + t) ||²
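A minimal sketch of one vanilla ICP iteration, assuming two roughly overlapping clouds: nearest-neighbour matching followed by the closed-form SVD (Kabsch) fit of the rigid transform.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_step(P, Q):
    """One ICP iteration: match each point of P to its nearest neighbour in Q,
    then solve for the rigid transform (R, t) in closed form via SVD (Kabsch)."""
    matches = Q[cKDTree(Q).query(P)[1]]            # nearest-neighbour correspondences
    mu_p, mu_q = P.mean(0), matches.mean(0)
    H = (P - mu_p).T @ (matches - mu_q)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                       # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_q - R @ mu_p
    return P @ R.T + t, R, t                       # transformed P and the estimated (R, t)

# Repeating icp_step on the transformed cloud until the error stops decreasing gives vanilla ICP.
```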
The Hausdorff distance evaluates the distance between two point sets by taking the maximum over the nearest-point distances between the 3D point clouds of two faces. However, the algorithm still suffers from expression and occlusion problems. An improved Hausdorff-distance algorithm uses the contours of 3D faces to filter candidates in the database.
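A small sketch of the symmetric Hausdorff distance between two point clouds, again using a k-d tree for the nearest-neighbour queries:

```python
import numpy as np
from scipy.spatial import cKDTree

def directed_hausdorff(A, B):
    """Max over points a in A of the distance from a to its nearest neighbour in B."""
    return cKDTree(B).query(A)[0].max()

def hausdorff(A, B):
    """Symmetric Hausdorff distance between point clouds A and B."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

print(hausdorff(np.random.rand(500, 3), np.random.rand(500, 3)))
```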
There is also the template-face method: a deformable template 3D face is deformed via seed points to fit the face of the person being tested, and the fitting parameters are then used for recognition; this approach can also produce a dense point-to-point alignment between the specific 3D point cloud and the deformable face model.
3D face recognition based on facial Features
3D face recognition based on facial features can be divided into local-feature and global-feature methods. For more details, see "3D Face Analysis: Advances and Perspectives".
Local features come in two flavors. One is based on facial region or component information, such as the nose, eye and mouth regions; these can be roughly divided into methods based on facial keypoints, curvature, and patches. The other uses descriptor algorithms, such as wavelet features extracted from depth images, SIFT, 2D-LBP, MRF and LSP, as well as features extracted directly from 3D data, such as 3D-LBP. Global features transform the entire face and extract features from it; the facial data can be stored in different ways, such as point clouds, images or meshes. For example, a 3D face model can be represented by spherical harmonic features (SHF), or the 3D facial surface can be mapped onto a 2D mesh for sparse representation, with the sparse coefficients used as features.
2. Deep learning recognition methods
CNNs have made great progress in 2D face recognition. However, 2D faces are easily affected by makeup, pose, lighting and expression. The 3D surface itself contains the spatial shape information of the face and is less affected by such external factors, so 3D facial data carry more information than 2D faces. However, because 3D facial data are hard to acquire and some of the available data lack accuracy, 3D face recognition has not developed as vigorously.
Face recognition based on depth map
Common approaches to face recognition from depth maps include LBP feature extraction, multi-frame depth-map fusion, depth-map normalization and so on. Two depth-map face recognition papers are briefly introduced here.
"Robust Face Recognition with Deeply Normalized Depth Images"
This paper can be regarded as a typical depth-map face recognition pipeline, split into two networks: a normalization network and a feature extraction network. The normalization network converts the input depth map into an HHA image and regresses the 3DMM parameters with a CNN (3DMM is described in the 3D reconstruction section below); after the 3D point cloud is reconstructed from those parameters, it can be projected into a normalized depth map. The feature extraction network is essentially the same as an ordinary 2D face recognition network and produces the feature vector representing the depth-map face.
"Led3D: A Lightweight and Efficient Deep Approach to Recognizing Low-quality 3D Faces"
This is a CVPR 2019 paper on face recognition from low-quality depth maps. Its preprocessing and data augmentation of depth maps are worth referring to. The paper feeds surface normal maps computed from the depth into the network, and the experiments show that normals characterize the face better. The authors also carefully design a lightweight recognition network (mainly multi-layer feature fusion and an attention mechanism), which is worth borrowing.
Face recognition based on RGB-D
Face recognition based on RGB-D largely follows 2D face recognition methods: the depth map aligned with the RGB image is fed into the CNN as an additional channel. The advantage of RGB-D is that it adds the spatial shape information of the face. There are many RGB-D face recognition papers, but the basic idea is to fuse the two modalities either at the feature level or at the pixel level.
"Accurate and robust face recognition from RGB-D images with a deep learning approach"
This 2016 paper proposes a deep-learning face recognition algorithm for RGB-D images. RGB images and multi-frame fused depth images are used for pre-training and transfer learning, and the two modalities are fused at the feature level to improve recognition.
Face recognition based on depth / RGB-3DMM
In the past two years, work that regresses a 3DMM face model from depth or RGB images and applies it to recognition has appeared. The general idea is to augment 3D face data by regressing 3DMM parameters (expression, pose, shape) and use the augmented data for CNN training, for example FR3DNet (link: https://arxiv.org/abs/1711.05942) and 3D face recognition (link: https://arxiv.org/abs/1703.10714).
"Deep 3D Face Identification"
This paper is the first to apply deep neural networks to the 3D face recognition task. The main idea is to fit the depth map to a 3D face model using 3DMM + BFM so as to augment the depth data, and finally to feed the augmented data (with random occlusion and pose transformations) into a fine-tuned 2D face recognition network.
"Learning from Millions of 3D Scans for Large-scale 3D Face Recognition"
This paper is a major work on 3D face recognition that really does create face data at the million scale. It proposes a 3D face recognition network, FR3DNet, and evaluates it on the existing public datasets with very good results (the data-driven approach essentially sweeps the benchmarks). New identities are created by finding the two 3D faces in the authors' private dataset with the greatest bending-energy difference and combining them into a new 3D face (see the original paper for details). A point cloud recognition network for 3D faces is also proposed, in which large convolution kernels help the network perceive the shape information of the point cloud better.
There are also many other data-driven 3D face recognition works, such as 3DMMCNN (link: https://arxiv.org/abs/1612.04904). The conclusion is that deep-learning-based 3D face recognition is limited by the scarcity of data and the insufficient accuracy of the existing data, so the first task for researchers is large-scale data augmentation or the generation of large numbers of virtual 3D faces. Whether these methods generalize well is still debatable, and the era of 3D face recognition may not have arrived yet.
3D face reconstruction
Another interesting direction in 3D face research is 3D face reconstruction, which rebuilds the 3D model of a face from one or more RGB images. It has many application scenarios, such as face animation and dense face alignment. Strictly speaking, reconstructing a 3D face from RGB is an ill-posed problem, because an RGB image represents texture and carries no spatial information; nevertheless, given its practical value, a number of 3D reconstruction methods have been proposed over the years.
Face Reconstruction based on traditional methods
Traditional 3D face reconstruction methods usually rely on information carried by the images themselves, such as parallax and relative height, to complete the reconstruction. 3D reconstruction through binocular vision is the most common; the difficulty lies in matching the corresponding features across viewpoints. For such work, refer to "A Survey of Different 3D Face Reconstruction Methods" (link: https://pdfs.semanticscholar.org/d4b8/8be6ce77164f5eea1ed2b16b985c0670463a.pdf).
Model-based facial reconstruction
Two models are commonly used in 3D face reconstruction: one is the generic CANDIDE model, the other is the 3D Morphable Model (3DMM).
Among the generic models, CANDIDE-3 is the most famous, containing 113 vertices and 168 faces. Simply put, these vertices and faces are adjusted so that their features match the image to be reconstructed: a global adjustment aligns the facial outline and the main features as well as possible, local adjustments then refine the facial details, and finally vertex interpolation yields the reconstructed face.
The advantages and disadvantages of this model are obvious: because the template has so few vertices, reconstruction is fast, but accuracy is seriously limited and facial details are reconstructed poorly.
The entry-level algorithm for 3D faces is the 3D Morphable Model (3DMM), a linear face model proposed by Volker Blanz in "A Morphable Model For The Synthesis Of 3D Faces" in 1999. A 3D face model corresponding to a 2D face image can be generated as:

S = S̄ + Σ_i α_i s_i + Σ_j β_j e_j

where S̄ is the mean face shape, s_i and e_j are the shape and expression bases obtained by PCA, and α_i and β_j are the corresponding coefficients. These two sets of bases are needed to reconstruct a face with the 3DMM model. At present, the BFM bases are widely used ([download address](https://faces.dmi.unibas.ch/bfm/main.php?nav=1-2&id=downloads), [paper address](https://ieeexplore.ieee.org/document/5279762)).
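A minimal sketch of the linear model itself, with made-up dimensions and random placeholder bases (the real BFM bases and their sizes differ):

```python
import numpy as np

# Illustrative sizes only; they do not match the real BFM bases.
n_vertices, n_id, n_exp = 5000, 80, 64
mean_shape = np.zeros(3 * n_vertices)                 # flattened (x, y, z) of the mean face
shape_basis = np.random.randn(3 * n_vertices, n_id)   # placeholder for the PCA shape basis
exp_basis = np.random.randn(3 * n_vertices, n_exp)    # placeholder for the PCA expression basis

def build_face(alpha, beta):
    """3DMM linear combination: S = mean + shape_basis @ alpha + exp_basis @ beta."""
    s = mean_shape + shape_basis @ alpha + exp_basis @ beta
    return s.reshape(-1, 3)                            # N x 3 vertex coordinates

vertices = build_face(np.random.randn(n_id) * 0.1, np.random.randn(n_exp) * 0.1)
```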
So how do we reconstruct 3D from 2D? First, we need to understand how the 3D model is projected onto the 2D plane. Using the camera model described at the beginning, the projection of the 3D model onto the 2D plane (commonly approximated as a weak perspective projection) can be expressed as:

V_2d = f · P_orth · R · S + t_2d

where f is a scale factor, P_orth is the orthographic projection matrix [[1, 0, 0], [0, 1, 0]], R is the rotation matrix and t_2d is the 2D translation.
An average face deformation model is first built from a face database. Given a new face image, the image is matched against the model and the model parameters are adjusted to deform the model until the difference between the model and the face image is minimized; the texture is then optimized and adjusted to complete the face modeling.
A typical 2D-to-3D fitting process supervises the reconstruction with the 2D facial keypoints and the orthographic projections of the corresponding 3D vertices.
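A hedged sketch of that fitting objective under a weak-perspective projection (all parameter names here are illustrative, not taken from any particular paper):

```python
import numpy as np

def weak_perspective(S, f, R, t2d):
    """Project N x 3 vertices to 2D: V_2d = f * (R @ S^T)[:2]^T + t2d."""
    return f * (S @ R.T)[:, :2] + t2d

def landmark_loss(f, R, t2d, landmarks_3d, landmarks_2d):
    """Squared distance between projected 3D landmarks and the detected 2D landmarks."""
    proj = weak_perspective(landmarks_3d, f, R, t2d)
    return np.sum((proj - landmarks_2d) ** 2)

# Alternately updating (f, R, t2d) and the 3DMM coefficients to reduce this loss
# is the classic iterative fitting loop the text refers to.
```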
End-to-end face Reconstruction based on CNN
With the 3DMM model we can reconstruct a 3D face from a single 2D image, but the practical problem is that traditional 3DMM reconstruction is an iterative fitting process and therefore inefficient, which makes it unsuitable for real-time 3D face reconstruction. Looking at the principle of 3DMM, what we have to adjust are the basis coefficients (for BFM roughly 199 shape dimensions; different bases have different dimensionality), so why not regress the basis coefficients with a CNN? That way the parameters are predicted by the network and fast 3DMM reconstruction becomes possible.
But there is a question: where does the training data come from? For this reason, most papers use a traditional 3DMM fitting pipeline to fit a large number of face images and take the results as ground truth, which is then used to train the neural network. Although this is itself an ill-posed problem, it works well in practice. Several easy-to-understand end-to-end CNN face reconstruction methods are introduced below.
"Disentangling Features in 3D Face Shapes for Joint Face Reconstruction and Recognition"
This paper uses a CNN to regress Identity Shape and Residual Shape parameters, with a formulation similar to 3DMM. The difference is that in addition to the usual reconstruction loss (typically an element-wise L2 loss), a recognition loss is added to ensure that the reconstructed face preserves the identity characteristics.
"End-to-end 3D face reconstruction with deep neural networks"
The idea of this paper is also to regress the 3DMM parameters. The authors argue that high-level semantic features represent identity information while mid-level features represent expression, so the corresponding parameters are regressed from different levels of the network to achieve 3D face reconstruction.
"Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network"
Another common end-to-end 3D face reconstruction method is the Position map Regression Network (PRN), which is highly recommended (open-source code: https://github.com/YadiraF/PRNet).
This paper proposes an end-to-end position map regression network for joint 3D face reconstruction and dense face alignment.
The authors introduce the UV position map, which stores the 3D point cloud coordinates of a face in a 2D image. For example, a 3D point cloud containing 65,536 points can be represented as a 256×256×3 2D image through a UV position map (each pixel stores the spatial coordinates of one point), so 3D face reconstruction can be achieved by regressing the UV position map of the input image with an encoder-decoder network.
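The idea is easy to see in code; a tiny sketch with a random map standing in for a network's output:

```python
import numpy as np

# Hypothetical UV position map: 256 x 256 pixels, each storing one (x, y, z) coordinate.
uv_position_map = np.random.rand(256, 256, 3)

# Reshaping the map recovers the 65,536-point face point cloud it encodes.
point_cloud = uv_position_map.reshape(-1, 3)
assert point_cloud.shape == (65536, 3)
```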
By designing loss functions with different weights for different facial regions, the authors finally achieve accurate face reconstruction and dense keypoint alignment.
"3D Dense Face Alignment via Graph Convolution Networks"
The UV-position-map regression above has its own problems: when the final UV map is converted into a 3D face mesh, stripe artifacts appear. In some recent 3D face reconstruction work, regressing the 3D face mesh in multiple stages also achieves good reconstruction results.
In this paper, the authors progressively increase the number of regressed mesh vertices, completing the final mesh regression under multiple supervision tasks. At the same time, graph convolutions are used so that the connectivity between mesh vertices is better exploited, finally achieving a good reconstruction.
3D face reconstruction has been a hot topic in recent years, and new 3D face reconstruction pipelines are proposed at the major conferences every year. From an introductory point of view, however, mastering the common methods above lays a good foundation for further research.
Source: https://www.toutiao.com/a6720019399383728652/