2025-01-17 Update From: SLTechnology News & Howtos
2019-08-01 14:02:31
Produced by Big Data Digest
Compiled by: Zhang Ruiyi, serenity
Computer vision is an interdisciplinary field that studies how machines can understand the high-level semantics of digital images or videos. It gives machines the intelligence to "see", replicating the visual abilities of the human brain (chiefly the visual cortex).
Imagine that we want to design a navigation product for blind users. If the system's camera captures an image while the user is crossing the road, what visual tasks would we need to complete?
Image classification: assign labels to the objects that appear in the picture, such as people, buildings, streets, and vehicles.
Object detection: locate the objects of interest in the picture or video. For a blind-navigation system, vehicles, pedestrians, traffic signs, and traffic lights are all objects that need attention.
Image semantic segmentation: outline the vehicles and roads in the field of view. This requires semantic segmentation to trace the contour of each foreground object in the image.
Scene text recognition: road names, the countdown seconds on a traffic light, store names, and so on; this text is also essential for a navigation aid to work.
The four tasks above all belong to the field of computer vision (CV). CV comprises eight main tasks; the other four are image generation, human key-point detection, video classification, and metric learning.
Object detection is one of the major tasks in CV and plays an important role in image understanding. In this article we introduce the basics of object detection and review some of the most commonly used algorithms along with some newer methods. (Note: a link to each paper discussed is given at the end of its section.)
How object detection works
Object detection locates objects in an image and draws a bounding box around each of them. This usually involves two steps: classifying the object's type, then drawing a box around it. Let's now review some common model architectures for object detection:
R-CNN
Fast R-CNN
Faster R-CNN
Mask R-CNN
SSD (Single Shot MultiBox Detector)
YOLO (You Only Look Once)
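All of these architectures share the same output primitive, the bounding box, and overlap between boxes is usually measured with intersection over union (IoU). A minimal sketch in plain Python, assuming the common `(x1, y1, x2, y2)` corner convention:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle (empty if boxes don't overlap).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

IoU appears everywhere in the models below: matching predictions to ground truth during training, suppressing duplicate detections, and scoring results.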
R-CNN
This technique combines two key ideas: applying high-capacity convolutional neural networks to bottom-up region proposals in order to localize and segment objects, and supervised pre-training on an auxiliary task followed by domain-specific fine-tuning, which yields a significant performance boost. The paper's authors named the algorithm R-CNN (Regions with CNN features) because it combines region proposals with convolutional neural networks.
The model takes an image, extracts about 2,000 bottom-up region proposals, computes features for each proposal with a large CNN, and then classifies each region with class-specific linear support vector machines (SVMs). The model achieves a mean average precision of 53.7% on PASCAL VOC 2010.
The detection system has three modules. The first generates category-independent region proposals that define the set of candidate detections available to the detector; the second is a large convolutional neural network that extracts a fixed-length feature vector from each region; the third is a set of class-specific linear SVMs.
The model uses selective search to generate region proposals; selective search groups similar regions based on color, texture, shape, and size. For feature extraction, the model applies a Caffe CNN to each region proposal to obtain a 4,096-dimensional feature vector, computed by forward-propagating the region through five convolutional layers and two fully connected layers. The model explained in the link at the end of this section improves on the previous best PASCAL VOC 2012 results by about 30%.
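The per-region scoring step can be sketched as follows. This is an illustrative stand-in, not the paper's code: `extract_features` plays the role of the 4,096-dimensional CNN feature extractor, and the toy weight dictionaries stand in for the learned class-specific SVMs.

```python
def rcnn_score_proposals(proposals, extract_features, svm_weights, svm_biases):
    """Score every region proposal with class-specific linear SVMs.

    Each proposal triggers its own feature computation, which is exactly why
    R-CNN inference is slow: roughly 2,000 CNN forward passes per image.
    """
    results = []
    for box in proposals:
        feats = extract_features(box)  # stands in for the CNN feature vector
        scores = {
            cls: sum(w * f for w, f in zip(weights, feats)) + svm_biases[cls]
            for cls, weights in svm_weights.items()
        }
        results.append((box, scores))
    return results
```

Usage with toy stand-ins:

```python
results = rcnn_score_proposals(
    [(0, 0, 10, 10)],
    lambda box: [1.0, 2.0],      # fake 2-d "features" for the sketch
    {"car": [1.0, 0.0]},         # one linear SVM per class
    {"car": 0.5},
)
# results[0][1]["car"] is the linear score 1.0*1.0 + 0.0*2.0 + 0.5 = 1.5
```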
Some of the disadvantages of R-CNN are:
Training is a multi-stage pipeline: fine-tuning a convolutional network on object proposals, fitting SVMs to the ConvNet features, and finally learning bounding-box regressors.
Training is expensive in space and time, because deep networks such as VGG16 consume a great deal of both.
Object detection is slow, because a ConvNet forward pass is performed for every region proposal.
Links to related papers and reference content:
https://arxiv.org/abs/1311.2524?source=post_page
http://host.robots.ox.ac.uk/pascal/VOC/voc2010/index.html?source=post_page
https://heartbeat.fritz.ai/a-beginners-guide-to-convolutional-neural-networks-cnn-cf26c5ee17ed?source=post_page
Fast R-CNN
The paper linked below proposes Fast R-CNN, a fast region-based convolutional network method for object detection, implemented in Caffe (using Python and C++). The model achieves a mean average precision of 66% on PASCAL VOC 2012, compared with 62% for R-CNN.
Compared with R-CNN, Fast R-CNN offers higher mean average precision, single-stage training, training that updates all network layers, and no need for disk storage to cache features.
Architecturally, Fast R-CNN takes an image together with a set of region proposals as input. It processes the image with convolutional and max-pooling layers to produce a convolutional feature map, and then, for each region proposal, a region of interest (RoI) pooling layer extracts a fixed-size feature vector from the feature map.
These feature vectors are fed into fully connected layers that branch into two output layers: one produces softmax probability estimates over the object classes, and the other produces four real numbers per object class encoding the position of that object's bounding box.
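The RoI pooling step described above can be illustrated with a toy pure-Python version. Real implementations operate on multi-channel tensors with a spatial scale factor; this sketch pools a single 2-D feature map, with the RoI given as inclusive (x1, y1, x2, y2) feature-map coordinates.

```python
def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool an RoI of a 2-D feature map into a fixed out_h x out_w grid.

    Splitting the RoI into bins and taking the max of each bin turns regions
    of arbitrary size into fixed-length feature vectors, which is the idea
    behind Fast R-CNN's RoI pooling layer.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1 + 1, x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Integer bin boundaries chosen so the bins cover the whole RoI.
            ys = y1 + (i * h) // out_h
            ye = y1 + ((i + 1) * h + out_h - 1) // out_h
            xs = x1 + (j * w) // out_w
            xe = x1 + ((j + 1) * w + out_w - 1) // out_w
            row.append(max(fmap[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

For example, pooling a 4x4 map down to 2x2 keeps the max of each quadrant, so downstream fully connected layers always see the same input size regardless of the proposal's shape.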
Related content reference links:
https://github.com/rbgirshick/fast-rcnn?source=post_page
Faster R-CNN
Faster R-CNN uses a region proposal network to achieve near-real-time object detection, and proposes a training scheme that alternates between fine-tuning for the region proposal task and fine-tuning for object detection.
The Faster R-CNN model consists of two modules: a deep convolutional network that proposes candidate regions (the Region Proposal Network, RPN), and a Fast R-CNN detector that uses those proposals. The RPN takes an image as input and outputs rectangular region proposals, each with an objectness score.
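Both the RPN and the final detector produce many overlapping scored rectangles, which are conventionally pruned with non-maximum suppression (NMS). A minimal greedy sketch:

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    Repeatedly keep the highest-scoring remaining box and discard any box
    that overlaps it by more than iou_thresh. Returns indices of kept boxes.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

With two nearly identical boxes and one distant box, only the higher-scoring duplicate and the distant box survive, which is exactly the deduplication a proposal stage needs.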
Reference links to related papers:
https://arxiv.org/abs/1506.01497?source=post_page
Mask R-CNN
The model proposed in the paper linked below extends the Faster R-CNN architecture above and can also estimate human poses.
In this model, objects are classified and localized with bounding boxes and semantic segmentation, which classifies each pixel in the picture. Faster R-CNN produces two outputs per candidate object, a class label and a bounding-box offset; Mask R-CNN extends it by adding a third branch that predicts a segmentation mask on each region of interest (RoI).
Reference links to related papers:
https://arxiv.org/abs/1703.06870?source=post_page
SSD: Single Shot MultiBox Detector
The paper linked below proposes a model that predicts the objects in an image with a single deep neural network. The network applies small convolutional filters to feature maps to generate scores for each object category.
The approach uses a feed-forward convolutional neural network to produce a collection of bounding boxes and scores for specific object categories, and adds convolutional feature layers that allow detection at multiple scales. In this model, each feature-map cell is associated with a set of default bounding boxes. The paper reports the performance of the SSD512 model on animals, vehicles, and furniture.
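The default-box idea can be sketched as follows; the grid size, scale, and aspect ratios below are arbitrary example values, not the paper's exact configuration.

```python
def default_boxes(fmap_size, scale, aspect_ratios):
    """Generate SSD-style default boxes for one square feature map.

    Returns (cx, cy, w, h) tuples in [0, 1] image coordinates: one box per
    aspect ratio, centered on each feature-map cell. SSD repeats this over
    several feature maps with increasing scales to detect objects of
    different sizes.
    """
    boxes = []
    for i in range(fmap_size):        # rows of the feature map
        for j in range(fmap_size):    # columns of the feature map
            cx = (j + 0.5) / fmap_size
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                # Same area at every aspect ratio: w/h == ar, w*h == scale**2.
                boxes.append((cx, cy, scale * ar ** 0.5, scale / ar ** 0.5))
    return boxes
```

A 4x4 feature map with three aspect ratios yields 48 default boxes; the network then predicts, for each one, class scores and offsets relative to that box.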
Related content reference links:
https://arxiv.org/abs/1512.02325?source=post_page
You Only Look Once (YOLO)
The paper linked below proposes a single neural network that predicts bounding boxes and class probabilities for an image in one evaluation.
The YOLO model processes 45 frames per second in real time. YOLO frames image detection as a regression problem, which keeps its pipeline very simple and the model very fast.
It can process streaming video in real time with less than 25 milliseconds of latency. During training, YOLO sees the entire image, so it can incorporate context into its detections.
In YOLO, each bounding box is predicted from the features of the whole image. Each bounding box has five predictions: x, y, w, h, and confidence. (x, y) represents the center of the box relative to the bounds of the grid cell, and w and h are the predicted width and height relative to the whole image.
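Decoding one cell's prediction into an absolute box can be sketched like this, with all coordinates normalized to [0, 1] and a 7x7 grid as in the paper's default configuration:

```python
def decode_yolo_box(pred, row, col, grid_size):
    """Decode one YOLO cell prediction (x, y, w, h, confidence).

    (x, y) is the box center's offset inside grid cell (row, col); w and h
    are already fractions of the whole image. Returns
    (x1, y1, x2, y2, confidence) in image-relative coordinates.
    """
    x, y, w, h, conf = pred
    cx = (col + x) / grid_size   # absolute center from cell index + offset
    cy = (row + y) / grid_size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, conf)
```

For instance, a prediction of (0.5, 0.5, 0.2, 0.4) in the center cell of a 7x7 grid decodes to a box centered at (0.5, 0.5) covering 20% of the image width and 40% of its height.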
The model is implemented as a convolutional neural network and evaluated on the PASCAL VOC detection dataset. The network's convolutional layers extract features, while its fully connected layers predict the output probabilities and coordinates.
The network architecture is inspired by the GoogLeNet image-classification model and has 24 convolutional layers followed by two fully connected layers. The model's main limitations are that each grid cell can predict only one class, and that it performs poorly on small objects that appear in groups, such as flocks of birds.
The fast variant of the model achieves a mean average precision of 52.7%, while the full model reaches 63.4%.
Reference link:
https://arxiv.org/abs/1506.02640?source=post_page
Objects as Points
The paper linked below proposes modeling an object as a single point. It uses keypoint estimation to find the center point and regresses all other object properties from it.
These properties include size, 3D location, orientation, and even pose. The method, CenterNet, is center-based, and is faster and more accurate than comparable bounding-box-based detectors.
Properties such as object size and pose are regressed from the image features at the center location. In this model, the image is fed into a convolutional neural network that generates heatmaps; the peaks of these heatmaps represent the centers of objects in the image. To estimate human pose, the model detects 2D joint locations and regresses them from the center position.
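Reading object centers off a heatmap can be sketched in plain Python. CenterNet does the equivalent on the GPU with a 3x3 max-pooling operation in place of box-based NMS; this loop expresses the same "keep local maxima above a threshold" rule:

```python
def heatmap_peaks(heatmap, threshold=0.5):
    """Return (row, col, score) for every cell of a 2-D heatmap that is a
    local maximum (>= all of its up-to-8 neighbours) and above threshold.
    Each surviving peak corresponds to one detected object center."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = heatmap[r][c]
            if v < threshold:
                continue
            neighbours = [heatmap[rr][cc]
                          for rr in range(max(0, r - 1), min(h, r + 2))
                          for cc in range(max(0, c - 1), min(w, c + 2))
                          if (rr, cc) != (r, c)]
            if all(v >= n for n in neighbours):
                peaks.append((r, c, v))
    return peaks
```

At each surviving peak, the model then reads out the regressed width, height, and other properties to reconstruct the full detection.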
This model achieves a COCO average precision of 45.1% at 1.4 frames per second. The paper compares these results against those of other recent research.
Reference links to papers:
https://arxiv.org/abs/1904.07850v2?source=post_page
Data Augmentation Strategies for Object Detection
Data augmentation creates new training images by transforming the originals, for example by rotating or resizing them.
Although this is a strategy rather than a model architecture, the paper linked below proposes augmentation transformations learned on one object detection dataset that transfer to other detection datasets. The transformations are applied during training.
In this model, the augmentation policy is defined as a set of N sub-policies, one of which is selected at random during training. Some of the operations applied include distorting color channels, distorting the image geometrically, and distorting only the pixel content inside the bounding-box annotations. Experiments on the COCO dataset show that an optimized data augmentation policy improves detection accuracy by more than +2.3 mAP, allowing a single inference model to reach 50.7 mAP.
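The random sub-policy selection can be sketched as follows; the transforms passed in are illustrative stubs standing in for the paper's color, geometric, and box-level operations:

```python
import random

def make_augmentation_policy(sub_policies, seed=None):
    """Build a policy in the spirit of learned augmentation: each call picks
    one sub-policy (a list of transforms) uniformly at random and applies
    its transforms to the image in order."""
    rng = random.Random(seed)

    def apply(image):
        for transform in rng.choice(sub_policies):
            image = transform(image)
        return image

    return apply
```

In training, `apply` would be called once per image per step, so different images see different transformation chains while the overall distribution of operations follows the learned policy.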
Reference links to related papers:
https://arxiv.org/abs/1906.11172v1?source=post_page
Summary
You should now be familiar with some of the most common, and some of the most recent, object detection techniques used in a variety of settings. The papers and abstracts linked above also contain links to their code implementations. Don't limit yourself: object detection can also run on a smartphone. In short, keep exploring and keep learning.
Related reports:
https://heartbeat.fritz.ai/a-2019-guide-to-object-detection-9509987954c3
https://www.toutiao.com/a6720074844945252867/