Faced: CPU Real-time face Detection based on Deep Learning

2025-04-05 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

Is it possible to build an object detection model with real-time performance without using a GPU? faced is a proof of concept: a custom object detection model for a single object class (in this case, faces) that runs in real time on a CPU.

What's the problem?

In many cases, single-class object detection is all that is required: we want to detect the location of every object of one specific class in an image. For example, we may need to detect faces for a facial recognition system, or track a person's face across frames.

More importantly, most of the time we want to run these models in real time. If frames arrive at a rate of x per second, the model must process each frame in less than 1/x seconds, so every image can be handled as soon as it becomes available.
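This real-time constraint can be made concrete with a small sketch (not part of faced itself; `model_fn` stands in for any inference function):

```python
import time

def per_frame_budget(sample_rate_hz):
    """Maximum seconds the model may spend on one frame to keep up
    with a stream sampled at `sample_rate_hz` frames per second."""
    return 1.0 / sample_rate_hz

def meets_realtime(model_fn, frame, sample_rate_hz):
    """Time one inference call and check whether it fits the budget."""
    start = time.perf_counter()
    model_fn(frame)
    return (time.perf_counter() - start) < per_frame_budget(sample_rate_hz)

# A webcam at 30 FPS leaves about 33 ms of compute per frame.
print(round(per_frame_budget(30.0) * 1000, 1))  # → 33.3
```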

Now, the easiest solution for this task (and for many other tasks in computer vision) is to apply transfer learning to a previously trained model (a standard model trained on a large dataset, such as those found in TensorFlow Hub or the TF Object Detection API):

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

There are many trained object detection architectures, such as Faster R-CNN, SSD, or YOLO, that achieve impressive accuracy with real-time performance when running on a GPU.

GPUs are expensive but necessary during the training phase. At inference time, however, having a dedicated GPU is often not feasible, and none of the common object detection models (such as those above) run in real time without one. So, how can we reframe the problem of detecting a single object class in order to achieve real-time performance on a CPU?

Main idea: simple tasks require fewer learnable features

All of the above architectures are designed to detect multiple object classes (they are trained on the COCO or PASCAL VOC datasets). To classify each bounding box into the appropriate class, these architectures require massive feature extraction, which translates into huge numbers of learnable parameters, filters, and layers. In other words, the networks are very large.

If we define a simpler task (rather than multi-class bounding box classification), then we can expect the network to need fewer learned features to perform it. Detecting only faces in an image is obviously easier than detecting cars, people, traffic signs, and dogs all in the same model. The number of features a deep learning model needs to recognize faces (or any single object class) is smaller than the number needed to recognize dozens of classes at once. The first task requires less information than the second.

A single-class object detection model needs fewer learnable features. Fewer parameters mean a smaller network, and a smaller network runs faster because it requires less computation.

So, the question becomes: how much accuracy can we keep while achieving real-time performance on a CPU?

The main concept of faced is to build the smallest possible network that (hopefully) runs in real time on a CPU while maintaining accuracy.

Architecture

faced is an ensemble of two neural networks, both implemented in TensorFlow.

Main network

faced's main architecture is based on YOLO's. Basically, it is a fully convolutional network (FCN) that runs a 288×288 input image through a series of convolutional and pooling layers (no other layer types are involved).

The convolutional layers extract spatially aware features, and the pooling layers increase the receptive field of the subsequent convolutional layers.

The architecture's output is a 9×9 grid (versus the 13×13 grid in YOLO). Each grid cell is responsible for predicting whether a face lies inside it (in contrast to YOLO, where each cell can detect up to 5 different objects).

Each grid cell has 5 associated values. The first is the probability p that the cell contains the center of a face. The other 4 values are the (x_center, y_center, width, height) of the detected face, relative to the cell.
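Decoding one cell's 5 values into absolute pixel coordinates can be sketched as follows. The exact encoding is not spelled out in the article, so this assumes a YOLO-style convention: x and y are the face-center offsets within the cell in [0, 1], and w and h are the box size as fractions of the full image.

```python
def decode_cell(row, col, pred, grid_size=9, img_size=288):
    """Convert one grid cell's (p, x, y, w, h) prediction into a
    probability and an absolute (x0, y0, x1, y1) box in pixels.
    Encoding assumptions are noted in the lead-in above."""
    p, x, y, w, h = pred
    cell = img_size / grid_size            # 288 / 9 = 32 px per cell
    cx = (col + x) * cell                  # absolute center x
    cy = (row + y) * cell                  # absolute center y
    bw, bh = w * img_size, h * img_size    # absolute box size
    return p, (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)

# A face centered in the middle cell, half the image wide and tall:
print(decode_cell(4, 4, (0.9, 0.5, 0.5, 0.5, 0.5)))
```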

Main architecture

The exact architecture is defined as follows:

2× [convolutional layer with 8 filters] on the 288×288 image

Max pooling (288×288 to 144×144 feature map)

2× [convolutional layer with 16 filters] on the 144×144 feature map

Max pooling (144×144 to 72×72 feature map)

2× [convolutional layer with 32 filters] on the 72×72 feature map

Max pooling (72×72 to 36×36 feature map)

2× [convolutional layer with 64 filters] on the 36×36 feature map

Max pooling (36×36 to 18×18 feature map)

2× [convolutional layer with 128 filters] on the 18×18 feature map

Max pooling (18×18 to 9×9 feature map)

4× [convolutional layer with 192 filters] on the 9×9 feature map

1× [convolutional layer with 5 filters] on the 9×9 feature map, producing the final grid

All activation functions are leaky_relu.
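As a sanity check on the layer list above, the following sketch traces the spatial size and channel count through the network, assuming 'same'-padded convolutions (which keep the spatial size) and 2×2 max pools (which halve it):

```python
# Filters per conv block, in order; each block (2 conv layers) is
# followed by a 2x2 max pool that halves the spatial size.
BLOCK_FILTERS = [8, 16, 32, 64, 128]

def trace(img_size=288):
    """Return (spatial size, channels) after each stage of the
    architecture described above."""
    size, shapes = img_size, []
    for filters in BLOCK_FILTERS:
        shapes.append((size, filters))   # after the conv block
        size //= 2                       # after the max pool
    shapes.append((size, 192))           # the 4 conv layers of 192 filters
    shapes.append((size, 5))             # final 5-filter conv -> output grid
    return shapes

# Ends with (9, 5): the 9x9 grid with 5 values per cell.
print(trace())
```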

faced has 6,993,517 parameters; YOLOv2 has 51,000,657. faced is about 13% of YOLO's size!
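The size comparison is easy to verify from the two parameter counts:

```python
faced_params = 6_993_517
yolov2_params = 51_000_657

# faced's parameter count as a fraction of YOLOv2's.
print(f"{faced_params / yolov2_params:.1%}")  # → 13.7%
```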

Auxiliary network

The output of the main network is not as accurate as expected. Therefore, a small convolutional neural network (CNN) was implemented that takes a small image containing a face (cropped using the main architecture's output) as input, and outputs a regression on the face's true bounding box.

The network takes a bounding box that contains a face and predicts a corrected bounding box.

Its only task is to complement and refine the output coordinates of the main architecture.

The specific architecture of the network is irrelevant.
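Feeding this auxiliary network requires cropping the region proposed by the main network. The article does not describe faced's exact cropping policy, so the margin and clipping behavior below are assumptions for illustration:

```python
import numpy as np

def crop_for_refinement(img, box, margin=0.2):
    """Crop the region proposed by the main network, enlarged by a
    relative `margin` (an assumed value) so the true face stays inside
    the crop, and clipped to the image bounds."""
    x0, y0, x1, y1 = box
    h, w = img.shape[:2]
    mx, my = (x1 - x0) * margin, (y1 - y0) * margin
    x0 = max(0, int(x0 - mx)); y0 = max(0, int(y0 - my))
    x1 = min(w, int(x1 + mx)); y1 = min(h, int(y1 + my))
    return img[y0:y1, x0:x1]

img = np.zeros((288, 288, 3), dtype=np.uint8)
print(crop_for_refinement(img, (72, 72, 216, 216)).shape)  # → (201, 201, 3)
```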

Data set

Both networks were trained on the WIDER FACE dataset.

Multiple scenes of WIDER

"the WIDER FACE data set is a benchmark data set for face detection. We selected 32203 images and marked 393703 faces with high variations in size, posture and occlusion, as shown in the sample images.

Training

Training was done on an Nvidia Titan XP GPU and took about 20 hours. Batch normalization was used to help convergence, and dropout (at a 40% rate) was used as a regularization method to avoid overfitting.
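The 40%-rate regularizer is standard (inverted) dropout; a minimal numpy sketch of the idea, independent of faced's actual TensorFlow implementation:

```python
import numpy as np

def dropout(x, rate=0.4, rng=None, training=True):
    """Inverted dropout: during training, zero each activation with
    probability `rate` and scale survivors by 1/(1-rate) so the
    expected activation is unchanged; at inference it is a no-op."""
    if not training:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

x = np.ones((1000,))
y = dropout(x, rate=0.4)
print(round(float((y == 0).mean()), 2))  # roughly 0.4 of units dropped
```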

Inference and non-maximum suppression

At inference time with faced, the image is first resized to 288×288 to feed the network. The image then runs through the FCN, producing the 9×9 grid output described above.

Each cell has a probability p of containing a face. Cells are filtered by a configurable threshold t: only cells with p > t are kept. For the kept cells, the face is localized using the cell's (x_center, y_center, width, height) prediction.
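The thresholding step can be sketched as a simple scan over the 9×9×5 output (this is an illustration of the described behavior, not faced's actual code):

```python
import numpy as np

def keep_cells(grid, thresh=0.5):
    """Return (row, col, prediction) for every cell of the 9x9x5 output
    whose face probability (the first of the 5 values) exceeds `thresh`;
    only these cells go on to the localization step."""
    kept = []
    for row in range(grid.shape[0]):
        for col in range(grid.shape[1]):
            if grid[row, col, 0] > thresh:
                kept.append((row, col, tuple(float(v) for v in grid[row, col])))
    return kept

grid = np.zeros((9, 9, 5))
grid[4, 4] = (0.9, 0.5, 0.5, 0.5, 0.5)
print(keep_cells(grid))  # → [(4, 4, (0.9, 0.5, 0.5, 0.5, 0.5))]
```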

In some cases, multiple cells may compete for the same face. Suppose the center of a face lies exactly where 4 cells meet. All 4 cells may then have a high p (the probability of containing the face's center). If we kept all of them and projected each cell's face coordinates, we would see 4 similar bounding boxes around the same face. This problem is solved by a technique called non-maximum suppression. The result is shown in the following figure:

Non-maximum suppression example
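A standard greedy implementation of non-maximum suppression looks like this (a generic sketch; the IoU threshold value is an assumption, not faced's documented setting):

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(dets, iou_thresh=0.5):
    """Greedy non-maximum suppression: repeatedly keep the
    highest-probability detection and drop any remaining box that
    overlaps it by more than `iou_thresh`. `dets` is a list of
    (p, x0, y0, x1, y1)."""
    dets = sorted(dets, key=lambda d: d[0], reverse=True)
    kept = []
    while dets:
        best = dets.pop(0)
        kept.append(best)
        dets = [d for d in dets if iou(best[1:], d[1:]) <= iou_thresh]
    return kept

# Three near-identical boxes around one face collapse to a single
# detection; the distant low-probability box survives.
dets = [(0.9, 70, 70, 210, 210), (0.8, 72, 72, 216, 216),
        (0.7, 68, 70, 208, 212), (0.2, 0, 0, 30, 30)]
print(len(nms(dets)))  # → 2
```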

Performance: faced achieves the following inference speeds:

Considering that YOLOv2 cannot even reach 1 FPS on an i5 2015 MacBook Pro, that is quite good.

Results

Let's see some results!

Facial image

Now let's compare faced with Haar cascades, a traditional computer vision approach that does not use deep learning. Both methods run at similar speeds, while faced achieves considerably higher accuracy.

Haar Cascades [left] vs faced [right]

How do I use faced?

faced is very simple to use; it can be embedded in Python code or run as a command-line program.

See the GitHub repository for further instructions:

https://github.com/iitzco/faced

Conclusion

faced is a proof of concept that you don't always need to rely on general-purpose trained models when they are overkill for your particular problem and performance is an issue. Don't underestimate the value of spending time designing custom neural network architectures for your specific problem: such specialized networks will be a much better solution than general ones.
