2025-02-23 Update From: SLTechnology News & Howtos
Shulou (Shulou.com) 06/03 Report
2020-04-13 21:14:56 author | Ma Chen
Editor | Jia Wei
This article interprets a paper on facial action unit detection by Ma Chen of Tsinghua University: "AU R-CNN: Encoding expert prior knowledge into R-CNN model for facial action unit detection".
The paper is among the first to combine prior knowledge with object detection techniques for Action Unit (AU) recognition, and it achieves state-of-the-art results on the BP4D and DISFA databases: measured by F1 score, it reaches a best average of 63% on BP4D.
Paper link: https://arxiv.org/abs/1812.05788
Code link: https://github.com/sharpstill/AU_R-CNN
The Facial Action Coding System (FACS), developed by Ekman and Friesen, defines 44 facial action units (AUs), each an anatomically based muscle movement. Combinations of these units can represent essentially all facial expressions (frowning, smiling, and so on); AUs are the building blocks of facial expression.
Face AU detection, as defined in this paper, is the task of identifying which AUs appear on the face in each frame of a video. Because an AU is only a subtle movement of facial muscles, and different muscles move to different degrees, AU detection is challenging. It has important applications in lie detection, driver-assistance systems (e.g., detecting whether the driver is drowsy), and more.
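Concretely, the per-frame target in multi-label AU detection is a binary vector with one entry per annotated AU. A minimal sketch; the AU numbers below are the 12 AUs commonly annotated in BP4D, used here only for illustration:

```python
import numpy as np

# The 12 AUs annotated in BP4D (illustrative; check the database documentation).
AU_LIST = [1, 2, 4, 6, 7, 10, 12, 14, 15, 17, 23, 24]

def encode_aus(active_aus, au_list=AU_LIST):
    """Encode the set of AUs active in one frame as a binary label vector."""
    return np.array([1.0 if au in active_aus else 0.0 for au in au_list],
                    dtype=np.float32)

# AU6 (cheek raiser) + AU12 (lip corner puller) together signal a smile.
smile = encode_aus({6, 12})
```

A detector then predicts one such vector per frame, and the per-AU F1 scores are averaged for evaluation.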
Figure 1. Example of Action Unit
Figure 1 shows examples of Action Units, illustrating how subtle facial movements are coded as AUs.
The page https://imotions.com/blog/facial-action-coding-system/ provides animated demos that readers can explore on their own.
The shortcomings of existing methods can be summarized as follows:
1. Existing methods introduce the concept of an "AU center" as the important region of an AU and define it as the vicinity of facial landmarks. This definition is coarse and the localization imprecise: an AU arises in the specific area where a facial muscle moves, which is not necessarily near a landmark.
2. Previous studies apply a CNN to the whole-face image rather than to the local region where an AU actually occurs.
3. Face AU recognition is a multi-label classification problem, and the multi-label constraint can be narrowed to a finer granularity, namely local facial regions, to achieve higher accuracy.
1. Method
The AU R-CNN framework is shown in Figure 2. The hardest part of AU detection is that facial features vary in size, every face looks different, and expressions appear in different locations. How can such a challenging problem be tackled? This work stands on the shoulders of its predecessors and uses facial landmarks. Landmarks provide rich positional information about the face; exploited fully, they can factor out the differences between faces and make AU detection more accurate. The framework therefore first divides the face into different regions and detects each region independently, as shown in Figure 2:
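The detect-then-merge step can be sketched as follows, under the assumption that each region head outputs per-AU probabilities and the partition rule assigns each AU to exactly one region (the region names and AU indices here are hypothetical):

```python
import numpy as np

def merge_region_predictions(region_probs, region_to_aus, num_aus):
    """Assemble a whole-face AU probability vector from independent region heads.

    region_probs:  {region name: probabilities output by that region's head}
    region_to_aus: {region name: indices of the AUs that region is responsible for}
    """
    face = np.zeros(num_aus, dtype=np.float32)
    for region, probs in region_probs.items():
        for au_idx, p in zip(region_to_aus[region], probs):
            face[au_idx] = p
    return face

# Toy example with two regions covering three AUs.
face_pred = merge_region_predictions(
    region_probs={"eyes": [0.9, 0.2], "mouth": [0.7]},
    region_to_aus={"eyes": [0, 1], "mouth": [2]},
    num_aus=3,
)
```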
Figure 2. Overview of the AU R-CNN framework. First, 68 facial landmarks are located; then each ROI region is detected independently; finally, the per-ROI detection results are combined to give the whole-face result.
Figure 3. Key points and facial segmentation map
To exploit both the landmark information and the definitions of the AUs, the paper introduces expert prior knowledge: AU R-CNN encodes the mapping between each AU and its related facial regions as expert knowledge, formulated as the AU partition rule shown in Table 1:
Table 1. The AU partition rule (i.e., the expert prior knowledge)
The AU partition rule groups AUs that occur in the same facial area, such as the AUs around the eyes, into a single AU group (left side of Table 1). The whole face is thereby divided into nine regions, each represented by a set of ROIs. Finally, the minimal enclosing rectangle of each ROI set represents the AU group region, as shown in Figure 4.
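Computing the minimal enclosing rectangle of an ROI's landmark points is straightforward; a sketch with axis-aligned boxes, assuming landmark coordinates are (x, y) pairs:

```python
import numpy as np

def enclosing_box(points):
    """Axis-aligned minimal enclosing rectangle of a set of 2-D points,
    returned as (x_min, y_min, x_max, y_max)."""
    pts = np.asarray(points, dtype=np.float32)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    return float(x_min), float(y_min), float(x_max), float(y_max)

# The landmark points of one AU group's ROIs collapse to a single box.
box = enclosing_box([(10, 20), (30, 5), (25, 40)])
```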
Figure 4. AU groups and their enclosing rectangles, which are then fed into the R-CNN detection head
Another difficulty is that multiple AUs may occur within the same region, so the paper uses a sigmoid cross-entropy loss, optimizing the network parameters by backpropagation.
Figure 5. The overall network structure of AU R-CNN. On the left, bounding boxes for the different regions are cropped according to the prior knowledge; on the right, the detection head classifies each region separately. The ground-truth labels are likewise split by region, and the sigmoid cross-entropy loss is computed at the end.
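For reference, the multi-label sigmoid cross-entropy loss in its numerically stable form, max(x, 0) - x*y + log(1 + exp(-|x|)), averaged over a region's AU labels (a sketch, not the paper's exact implementation):

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Mean sigmoid cross entropy between raw logits x and binary labels y,
    using the stable form max(x, 0) - x*y + log(1 + exp(-|x|))."""
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(labels, dtype=np.float64)
    return float(np.mean(np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))))

# A logit of 0 gives probability 0.5, so the loss equals log(2) for either label.
loss = sigmoid_cross_entropy([0.0], [1.0])
```

Because each label is an independent Bernoulli target, this loss (unlike softmax cross entropy) lets several AUs be active in the same region at once.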
2. AU R-CNN Extensions
AU R-CNN can serve as a base framework from which many extensions and variants are derived. Since successive video frames are temporally related, a ConvLSTM can be used to model the relationship between consecutive frames. As shown in the figure below, each region's box forms its own timeline and is modeled and learned with a separate ConvLSTM.
Figure 6. The ConvLSTM extension of AU R-CNN, which learns and models the inter-frame relationships in a video.
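To make the ConvLSTM idea concrete, here is a toy cell with 1x1 kernels (i.e., per-pixel linear maps) in plain NumPy. Real ConvLSTMs use larger spatial kernels and trained weights; this sketch only shows the gating that carries one region's state across frames:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ToyConvLSTMCell:
    """ConvLSTM cell with 1x1 kernels; keeps hidden/cell maps of shape (C_h, H, W)."""

    def __init__(self, in_ch, hid_ch, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix per gate: input, forget, output, candidate.
        self.W = rng.normal(0.0, 0.1, size=(4, hid_ch, in_ch + hid_ch))
        self.b = np.zeros((4, hid_ch, 1, 1))

    def step(self, x, h, c):
        z = np.concatenate([x, h], axis=0)                  # (in+hid, H, W)
        gates = np.einsum("goc,chw->gohw", self.W, z) + self.b
        i, f, o = sigmoid(gates[0]), sigmoid(gates[1]), sigmoid(gates[2])
        g = np.tanh(gates[3])
        c = f * c + i * g                                   # new cell state
        h = o * np.tanh(c)                                  # new hidden state
        return h, c

# One cell per face region: run it over that region's feature maps frame by frame.
cell = ToyConvLSTMCell(in_ch=3, hid_ch=8)
h = np.zeros((8, 4, 4)); c = np.zeros((8, 4, 4))
for frame in [np.ones((3, 4, 4))] * 5:
    h, c = cell.step(frame, h, c)
```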
In the experiments, however, the author found that this cross-frame modeling does not work well; its overall average F1 score is even lower than that of single-frame detection. The reasons are analyzed in the experiment section.
Besides spatio-temporal convolution such as ConvLSTM, the framework can also be extended with other approaches, such as two-stream networks, as shown in the table below:
3. Experiments
Experiments are carried out on the BP4D and DISFA databases. A commendable aspect of the experimental section is that the author evaluates the standard AU R-CNN with ResNet-101, VGG-16, and VGG-19 backbones.
The results below show that AU R-CNN with a ResNet-101 backbone achieves the best performance:
The main goal of the ablation study is to quantify how much better local-region detection is than whole-face detection with a standard CNN, so the method is also compared against a standard CNN at several input resolutions.
The DISFA database consists of continuous expression videos; the results on it are as follows:
Finally, the author summarizes the different AU R-CNN extensions and the scenarios to which each applies:
4. Conclusion
In this paper, the author studies how to integrate prior knowledge into the R-CNN object detection framework, using an RoI pooling layer to detect each facial region separately. Extensive experiments demonstrate the effectiveness of the method, which achieves state-of-the-art results.
Source: https://www.toutiao.com/i6815184084155761159/