
A 2019 Guide to Object Detection

Updated 2025-01-28 | From: SLTechnology News&Howtos

Object detection is widely used in video surveillance, self-driving cars, object and person tracking, and other fields. This article covers the basics of object detection and reviews some of the most commonly used algorithms along with a few newer methods.

Original title | A 2019 Guide to Object Detection

Author | Derrick Mwiti

Translation | Lincoln 213 (Xi'an Jiaotong University), Chen Hua Mark (Wuhan University), BBuf (Southwest University of Science and Technology)

Compilation | Pita

Object detection is a computer vision technique for locating instances of objects such as cars, buildings, and people in images or video.

How Object Detection Works

Object detection locates objects in an image and draws a bounding box around each one. The process usually has two steps: classifying the object to determine its type, and then drawing a box around it. Image classification has been covered before, so let's review some common model architectures for object detection:

R-CNN

Fast R-CNN

Faster R-CNN

Mask R-CNN

SSD (Single Shot MultiBox Detector)

YOLO (You Only Look Once)

CenterNet (Objects as Points)

Data Augmentation Strategies for Object Detection
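
Whatever the architecture, the result of the two steps described above is the same kind of record: a class, a confidence score, and a box. The following is a minimal sketch of that output. The names Detection and keep_confident and the 0.5 threshold are illustrative assumptions and do not come from any of the papers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object: what it is, how confident we are, and where it is."""
    label: str      # predicted class, e.g. "car"
    score: float    # confidence in [0, 1]
    x_min: float    # bounding box corners, in pixels
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        """Area of the bounding box in square pixels."""
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


def keep_confident(detections, threshold=0.5):
    """Drop detections whose score is below the (illustrative) threshold."""
    return [d for d in detections if d.score >= threshold]


detections = [
    Detection("car", 0.92, 30, 40, 130, 100),
    Detection("person", 0.31, 200, 50, 230, 140),
]
confident = keep_confident(detections)
```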

R-CNN model

R-CNN combines two main ideas: applying high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects, and supervised pre-training on an auxiliary task when labeled training data is scarce.

View the paper: Rich feature hierarchies for accurate object detection and semantic segmentation (https://arxiv.org/abs/1311.2524).

The pre-training is followed by domain-specific fine-tuning, which gives a large performance boost. Because the method combines region proposals with a convolutional neural network, the authors named it R-CNN (Regions with CNN features).

Paper link: https://arxiv.org/pdf/1311.2524.pdf

The model first extracts about 2000 bottom-up region proposals from the input image. A large CNN then computes a feature vector for each proposal. Finally, class-specific linear support vector machines (SVMs) classify each region. The model achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010 (http://host.robots.ox.ac.uk/pascal/VOC/voc2010/index.html).

The detection system consists of three modules. The first generates category-independent region proposals, which define the set of candidate detections available to the detector. The second is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third is a set of class-specific linear SVMs.
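
The three modules can be sketched as three functions chained together. This is only a structural sketch: propose_regions and extract_features are hypothetical stand-ins that return random boxes and random vectors instead of running selective search and a real CNN, and the classifier weights are random rather than trained.

```python
import random

# Figures quoted in the text; the demo below uses much smaller values to stay fast.
NUM_PROPOSALS = 2000
FEATURE_DIM = 4096


def propose_regions(image_w, image_h, n, rng):
    """Module 1 stand-in: random boxes instead of real region proposals."""
    boxes = []
    for _ in range(n):
        x0 = rng.randint(0, image_w - 2)
        y0 = rng.randint(0, image_h - 2)
        x1 = rng.randint(x0 + 1, image_w - 1)
        y1 = rng.randint(y0 + 1, image_h - 1)
        boxes.append((x0, y0, x1, y1))
    return boxes


def extract_features(box, dim, rng):
    """Module 2 stand-in: a fixed-length random vector instead of CNN features."""
    return [rng.random() for _ in range(dim)]


def classify(features, weights, biases):
    """Module 3: one linear scorer per class, in the way a linear SVM is applied."""
    scores = [sum(w * f for w, f in zip(w_c, features)) + b
              for w_c, b in zip(weights, biases)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def detect(image_w, image_h, n_proposals, dim, n_classes, seed=0):
    """Run the three modules in order and return (box, class_index, score) triples."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_classes)]
    biases = [0.0] * n_classes
    results = []
    for box in propose_regions(image_w, image_h, n_proposals, rng):
        features = extract_features(box, dim, rng)
        cls, score = classify(features, weights, biases)
        results.append((box, cls, score))
    return results


results = detect(640, 480, n_proposals=20, dim=8, n_classes=3)
```

Note that the feature extractor runs once per proposal, which is where most of the cost of this design comes from.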

Paper link: https://arxiv.org/pdf/1311.2524.pdf

The model uses selective search to generate region proposals, grouping similar regions based on color, texture, shape, and size. For feature extraction, it computes a 4096-dimensional feature vector for each region proposal using a Caffe CNN implementation: a 227 × 227 RGB image is forward-propagated through five convolutional layers and two fully connected layers. The model described in the paper achieves a 30% relative improvement over the previous best results on PASCAL VOC 2012.

Some of the disadvantages of R-CNN:

Training is a multi-stage pipeline. The convolutional network is first fine-tuned on object proposals, then SVMs are fitted to the ConvNet features, and finally bounding-box regressors are learned.

Training is expensive in both space and time, because deep networks such as VGG16 take up a lot of space.

Object detection is slow because a ConvNet forward pass is performed for each object proposal.

Fast R-CNN

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection.

View the paper: Fast R-CNN (https://arxiv.org/abs/1504.08083)

It is implemented in Python and C++ using Caffe (https://github.com/rbgirshick/fast-rcnn). The model achieves a mean average precision of 66% on PASCAL VOC 2012, compared with 62% for R-CNN.

Paper link: https://arxiv.org/pdf/1504.08083.pdf

Compared with R-CNN, Fast R-CNN has a higher mean average precision, trains in a single stage, can update all network layers during training, and needs no disk storage for feature caching.

In this architecture, Fast R-CNN takes as input an entire image and a set of object proposals. The network processes the image with convolutional and max-pooling layers to produce a convolutional feature map. Then, for each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.
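
To see how a pooling layer can turn regions of different sizes into outputs of the same size, here is a minimal pure-Python sketch of RoI max pooling on a single-channel feature map. The function name, the 2 × 2 output size, and the way bin boundaries are rounded are assumptions made for illustration, not the reference implementation.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one region of a 2-D feature map into an out_h x out_w grid.

    feature_map: list of rows (H x W), a single channel for brevity.
    roi: (row0, col0, row1, col1) in feature-map cells, end-exclusive.
    """
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Row range of bin i; every bin covers at least one cell.
        rs = r0 + (i * h) // out_h
        re = max(rs + 1, r0 + ((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            cs = c0 + (j * w) // out_w
            ce = max(cs + 1, c0 + ((j + 1) * w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled


fm = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
whole = roi_max_pool(fm, (0, 0, 4, 6))   # a 4 x 6 region
small = roi_max_pool(fm, (1, 1, 4, 4))   # a 3 x 3 region
```

Both calls return a 2 × 2 grid even though the regions differ in size, which is what lets the fully connected layers that follow accept any proposal.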

Next, the feature vectors are fed into fully connected layers that branch into two output layers. One produces softmax probability estimates over the object classes, and the other produces four real-valued numbers for each object class, which encode the position of that class's bounding box.
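
A small sketch of how the two output branches might be read together: take the softmax over the class scores, pick the most probable class, and then take that class's four box numbers. The function names and the example values are hypothetical, and the background class used in the paper is left out for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_heads(class_scores, box_outputs):
    """Combine the two branches for one region.

    class_scores: one raw score per class (first branch, before softmax).
    box_outputs: one 4-tuple per class (second branch).
    Returns (class_index, probability, box_numbers_for_that_class).
    """
    probs = softmax(class_scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best], box_outputs[best]


class_scores = [0.1, 2.0, 0.3]
box_outputs = [
    (0.0, 0.0, 0.0, 0.0),
    (0.1, -0.2, 0.05, 0.3),
    (0.4, 0.4, -0.1, -0.1),
]
probs = softmax(class_scores)
cls, prob, box = read_heads(class_scores, box_outputs)
```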

Faster R-CNN

View the paper: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (https://arxiv.org/abs/1506.01497)

The paper proposes a training scheme that alternates between fine-tuning for the region proposal task and fine-tuning for object detection.

Source: https://arxiv.org/pdf/1506.01497.pdf

The Faster R-CNN model consists of two modules: a deep convolutional network that proposes regions (the Region Proposal Network, RPN), and a Fast R-CNN detector that uses the proposed regions. The RPN takes an image as input and outputs a set of rectangular object proposals, each with a score indicating how likely it is to contain an object.
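
In the Faster R-CNN paper, the rectangular proposals are predicted relative to reference boxes called anchors, a detail this article does not go into. As a rough sketch of the idea, the code below places anchors of three scales and three aspect ratios at every feature-map position. The scales, ratios, and stride of 16 follow the paper's default setup as I understand it; the function name and box layout are assumptions.

```python
import math


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x_min, y_min, x_max, y_max) anchors for every feature-map cell.

    Each anchor has area scale * scale and height / width equal to ratio.
    """
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Center of this feature-map cell in image coordinates.
            cx = (j + 0.5) * stride
            cy = (i + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(2, 3)
```

With three scales and three ratios there are nine anchors per position, so a 2 × 3 feature map yields 54 of them.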

Source: https://arxiv.org/pdf/1506.01497.pdf

Mask R-CNN

View the paper: Mask R-CNN (https://arxiv.org/abs/1703.06870)

The model proposed in this paper extends the Faster R-CNN architecture described above, and it can also be used to estimate human poses.

Image source: https://arxiv.org/pdf/1703.06870.pdf

In this model, objects are classified and localized with a bounding box and a segmentation mask that labels the pixels belonging to each object. The model extends Faster R-CNN by adding the prediction of a segmentation mask for each region of interest. Faster R-CNN produces two outputs for each candidate object, a class label and a bounding box; Mask R-CNN adds a third, the object mask.

SSD: Single Shot MultiBox Detector

This paper (https://arxiv.org/abs/1512.02325) proposes a model that detects objects in an image using a single deep neural network. The network applies small convolutional filters to feature maps to score the presence of each object category.

Source: https://arxiv.org/pdf/1512.02325.pdf

The method uses a feed-forward convolutional network to produce a set of bounding boxes along with scores for the object classes present in each box. Extra convolutional feature layers are added so that the network can detect objects at multiple scales. Each feature map cell is associated with a set of default bounding boxes. The following image shows the detection performance of the SSD512 model on animals, vehicles, and furniture.

Source: https://arxiv.org/pdf/1512.02325.pdf
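
To get a feel for how many boxes come out of several feature maps at different scales, the sketch below adds up the default boxes for the smaller SSD300 variant. The feature map sizes and boxes per cell are taken from the SSD paper as I recall them, not from this article, so treat them as assumptions.

```python
# (feature map side length, default boxes per cell) for each detection layer.
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def count_default_boxes(layers):
    """Total number of default boxes over all square feature maps."""
    return sum(size * size * per_cell for size, per_cell in layers)


total_boxes = count_default_boxes(SSD300_LAYERS)
print(total_boxes)  # 8732
```

Most of the boxes come from the largest, finest feature map, which is the one responsible for small objects.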

YOLO (You Only Look Once)

As the name suggests, You Only Look Once (YOLO) looks at an image just once. The paper presents a method that uses a single neural network to predict bounding boxes and class probabilities in one evaluation. It is a representative single-stage object detector, in contrast to two-stage detectors such as Faster R-CNN.

Paper: You Only Look Once: Unified, Real-Time Object Detection (https://arxiv.org/abs/1506.02640)

The YOLO model runs in real time at 45 frames per second. YOLO frames object detection as a regression problem, so it does not need a complex processing pipeline, which is the main reason for its speed.

YOLO can process streaming video in real time with a latency of less than 25 milliseconds. During training, YOLO sees the entire image, so it can take contextual information into account when detecting objects.
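
As a quick sanity check on the speed figure, the time available per frame follows directly from the frame rate. The helper name below is just for illustration.

```python
def frame_time_ms(fps):
    """Time budget per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


yolo_ms = frame_time_ms(45)
print(round(yolo_ms, 1))  # 22.2
```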

In YOLO, each bounding box is predicted using features from the entire image. Each bounding box has five predicted values: x, y, w, h, and a confidence score. (x, y) is the position of the box center relative to the bounds of its grid cell, and w and h are the width and height of the box relative to the whole image.
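
The encoding above can be turned back into pixel coordinates with a little arithmetic. The following is a minimal sketch that assumes (x, y) are offsets in [0, 1] within the cell and (w, h) are fractions of the image size; the function name and the 448 × 448 image used in the example are illustrative.

```python
def decode_box(cell_row, cell_col, x, y, w, h, grid_size, image_w, image_h):
    """Convert one cell-relative prediction into pixel corner coordinates."""
    cell_w = image_w / grid_size
    cell_h = image_h / grid_size
    # Box center: cell origin plus the offset inside the cell.
    cx = (cell_col + x) * cell_w
    cy = (cell_row + y) * cell_h
    # Box size: fractions of the whole image.
    bw = w * image_w
    bh = h * image_h
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)


# A box centered in the middle cell of a 7 x 7 grid.
box = decode_box(3, 3, 0.5, 0.5, 0.25, 0.5,
                 grid_size=7, image_w=448, image_h=448)
print(box)  # (168.0, 112.0, 280.0, 336.0)
```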

The YOLO model is implemented as a convolutional neural network and trained on the PASCAL VOC detection dataset. The convolutional layers of the network extract features, and the fully connected layers predict the box coordinates and the class probabilities.

The YOLO network architecture is inspired by the GoogLeNet model (https://ai.google/research/pubs/pub43022) for image classification. The network has 24 convolutional layers followed by two fully connected layers. The main limitations of YOLO are that it can predict only one class of object per grid cell and that it struggles with small objects (such as birds).

Translator's Note 1: Many readers are unsure what the "grid" in YOLO means. Look at the rightmost block in the image above, which is the raw output of the YOLO model: a feature map of shape 7 × 7 × 30. The 7 × 7 size results from passing the original 448 × 448 image through a series of convolution, downsampling, and padding operations, so each of the 7 × 7 grid cells corresponds to a region of the original image. The 30 channels mean that each grid cell predicts 30 values: two candidate boxes (x, y, w, h, confidence) and the probabilities of the 20 classes that an object in the cell may belong to (the PASCAL VOC dataset contains 20 object classes). Readers may wonder why two candidate boxes are predicted but only one set of class probabilities. This is the limitation of YOLO mentioned above: it can predict only one class of object per grid cell.
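
The arithmetic behind the 30 channels can be written out in a few lines. In this sketch the function names are hypothetical, and the ordering of values inside a cell (boxes first, then class probabilities) is an assumption chosen to match the description in the note; a real implementation may lay them out differently.

```python
def yolo_output_depth(boxes_per_cell=2, num_classes=20):
    """Values per grid cell: 5 per box (x, y, w, h, confidence) plus class scores."""
    return boxes_per_cell * 5 + num_classes


def yolo_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Shape of the full output tensor."""
    return (grid_size, grid_size, yolo_output_depth(boxes_per_cell, num_classes))


def split_cell(values, boxes_per_cell=2):
    """Split one cell's vector into its boxes and its shared class probabilities."""
    boxes = [tuple(values[i * 5:(i + 1) * 5]) for i in range(boxes_per_cell)]
    class_probs = list(values[boxes_per_cell * 5:])
    return boxes, class_probs


shape = yolo_output_shape()
boxes, class_probs = split_cell(list(range(30)))
print(shape)  # (7, 7, 30)
```

Both boxes in a cell share the same 20 class probabilities, which is why a cell can report only one class.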

YOLO achieves 63.4% mAP on the PASCAL VOC dataset. The original paper also proposes a smaller version, Fast YOLO. Fast YOLO was the fastest general-purpose object detector on PASCAL at the time, achieving 52.7% mAP.

Translator's Note 2: As of July 2019, the authors of YOLO have released two follow-up versions: YOLO9000 (also known as YOLOv2) and YOLOv3. YOLO9000 switches the backbone to the faster fully convolutional network Darknet-19 and adopts the anchor mechanism from Faster R-CNN to improve detection quality. YOLOv3 revises the loss function, introduces a feature pyramid, and proposes an efficient backbone network, Darknet-53. YOLOv3 is now one of the most commonly used object detection algorithms.

Papers: YOLO9000: Better, Faster, Stronger (https://arxiv.org/abs/1612.08242)

YOLOv3: An Incremental Improvement (https://pjreddie.com/media/files/papers/YOLOv3.pdf)

CenterNet: Objects as Points

Paper: Objects as Points (https://arxiv.org/abs/1904.07850v2)

This paper proposes modeling an object as a single point. It uses keypoint estimation to find the center point of each object and regresses the object's other properties from there. These properties include 3D location, pose, orientation, and size. In other words, with CenterNet the various properties of an object are regressed as outputs of the network. CenterNet is faster and more accurate than comparable detectors based on bounding-box regression.

Translator's Note 3: Another object detection paper published at almost the same time, CenterNet: Keypoint Triplets for Object Detection, is also referred to as CenterNet. In this article, CenterNet always refers to the Objects as Points work.

So how are these properties regressed? In CenterNet, the input image is fed into the network to produce a heatmap (confidence map) that indicates where object centers are likely to be; peaks in the heatmap are very likely to be object centers. Besides this heatmap, the CenterNet output has several other channels. The predicted value of each object property is obtained by reading the corresponding channel at the location of each heatmap peak.
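
The peak lookup described above can be sketched in plain Python. Here a peak is a cell that is at least as large as its eight neighbours and above a threshold; the function names, the 0.3 threshold, and the single size channel are assumptions for illustration rather than the paper's exact procedure.

```python
def find_peaks(heatmap, threshold=0.3):
    """Return (row, col, score) for local maxima of a 2-D heatmap."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = heatmap[r][c]
            if v < threshold:
                continue
            neighbours = [heatmap[rr][cc]
                          for rr in range(max(0, r - 1), min(h, r + 2))
                          for cc in range(max(0, c - 1), min(w, c + 2))
                          if (rr, cc) != (r, c)]
            if all(v >= n for n in neighbours):
                peaks.append((r, c, v))
    return peaks


def read_attributes(peaks, size_map):
    """Read the regressed (width, height) channel at each peak location."""
    return [(r, c, score, size_map[r][c]) for r, c, score in peaks]


heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.6],
    [0.0, 0.0, 0.1, 0.2],
]
# Hypothetical size channel: only the values at the peaks matter here.
size_map = [[(0, 0)] * 4 for _ in range(4)]
size_map[1][1] = (40, 30)
size_map[2][3] = (12, 20)

objects = read_attributes(find_peaks(heatmap), size_map)
```

Each detected object is just a peak location plus whatever the other channels say at that location.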

CenterNet achieves 45.1% AP at 1.4 fps on the COCO object detection dataset. The following table compares CenterNet with other object detectors.

Data Augmentation Strategies for Object Detection

Data augmentation is the process of creating new image data by manipulating the original images (for example, by rotating or scaling them). It often leads to better training results.

Paper: Learning Data Augmentation Strategies for Object Detection (https://arxiv.org/abs/1906.11172v1)

This paper does not propose a new model architecture. Instead, it proposes image transformation strategies for training object detection networks that can be transferred to other object detection datasets.

In this paper, the augmentation policy applied during training consists of N operations. The operations the authors use include changing color channel values, applying geometric transformations, and changing only the pixels inside the annotated bounding boxes.
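
What sets augmentation for detection apart from augmentation for classification is that geometric transformations must be applied to the box annotations as well as the pixels. The sketch below shows this for the simplest case, a horizontal flip. It is a hand-written example, not one of the learned policies from the paper, and it assumes boxes are (x_min, y_min, x_max, y_max) in pixels with the maximum edges exclusive.

```python
def hflip(image, boxes):
    """Mirror an image (a list of pixel rows) left to right and update its boxes."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    # A box edge at x ends up at width - x, so the min and max edges swap roles.
    new_boxes = [(width - x_max, y_min, width - x_min, y_max)
                 for x_min, y_min, x_max, y_max in boxes]
    return flipped, new_boxes


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
boxes = [(0, 0, 1, 2)]   # covers the leftmost column

flipped_image, flipped_boxes = hflip(image, boxes)
print(flipped_boxes)  # [(3, 0, 4, 2)]
```

Flipping twice returns both the image and the boxes to their original state, which is a handy check when writing transformations like this.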

In experiments on the COCO dataset, the authors find that the optimized data augmentation policy improves mAP (mean average precision) by 2.3 points, allowing a single model to reach 50.7% mAP.

Conclusion

After reading this article, you should be familiar with some of the most common recent methods for general-purpose object detection.

Several of the papers mentioned here come with code implementations. Try them out to see how they perform.

https://www.toutiao.com/i6723778178361328136/
