Example Analysis of the YOLO Target Detection Structure from V1 to V3


This article presents an example analysis of the YOLO target detection structure from V1 to V3. It is quite detailed and has real reference value; interested readers should read to the end!

Target detection and evaluation metrics

IoU (Intersection over Union) metric

IoU is short for Intersection over Union: as the name implies, the ratio of an intersection to a union. Suppose there are two sets A and B; then IoU equals the intersection of A and B divided by the union of A and B:

IoU = |A ∩ B| / |A ∪ B|

In object detection, IoU is the intersection-over-union ratio of the prediction box (Prediction) and the real box (Ground truth). As shown in the figure below, in detecting a kitten, the purple border is the prediction box and the red border is the ground-truth box.
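As a concrete illustration (not from the original article), here is a minimal Python sketch of IoU for axis-aligned boxes, assuming the (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    """
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width/height clamp to 0 when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, iou((0, 0, 10, 10), (5, 5, 15, 15)) gives 25 / 175 ≈ 0.14.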

In object detection tasks, a detection is usually counted as correct (a recall) when IoU ≥ 0.5. If the IoU threshold is set higher, the recall rate drops, but the predicted boxes are located more accurately. Ideally, the more the prediction box overlaps the ground-truth box, the better: if the two overlap completely, the intersection and union areas are equal and IoU equals 1.

YOLOv1

Innovations of YOLOv1:

The whole image is used as the input to the network, and the locations and categories of bounding boxes are regressed directly at the output layer (object detection is treated as a regression problem).

Fast: the first work in one-stage detection.

Earlier object detection methods had to generate candidate regions before detection; although their accuracy was relatively high, they ran slowly.

YOLO combines recognition and localization in a single network; the structure is simple, detection is fast, and the faster Fast YOLO variant can reach 155 FPS.

Advantages and disadvantages of YOLOv1

Compared with previous object detection methods, the YOLO model has several advantages:

YOLO detects objects very quickly.

Because there is no complex detection pipeline, an image only needs a single pass through the neural network to obtain detection results, so YOLO completes the object detection task very quickly. The standard version of YOLO reaches 45 FPS on a Titan X GPU, and the faster Fast YOLO reaches 155 FPS. Moreover, YOLO's mAP is more than twice that of earlier real-time object detection systems.

YOLO makes fewer background errors (false positives).

Unlike object detection systems based on sliding windows or region proposals, whose classifiers see only local image information, YOLO sees the whole image during training and testing, so it can use context when detecting objects and is less likely to mistake background for objects. Compared with Fast R-CNN, YOLO makes less than half as many background errors.

YOLO learns generalizable representations of objects.

When YOLO is trained on natural images and tested on artwork, it performs much better than earlier object detection systems such as DPM and R-CNN, because it learns highly generalizable features that transfer to other domains.

Although YOLO has these advantages, it also has some disadvantages:

The object detection accuracy of YOLO is lower than that of other state-of-the-art object detection systems.

YOLO is prone to object localization errors.

YOLO is poor at detecting small objects (especially dense small objects, because each grid cell can predict only 2 boxes).

Low recall rate

The biggest disadvantage of YOLOv1 is that it is not accurate enough

Network structure and detection process

Network structure

The YOLO network borrows from the GoogLeNet classification architecture; the difference is that YOLO uses 1x1 and 3x3 convolution layers instead of inception modules. As shown in the figure below, the detection network contains 24 convolution layers and 2 fully connected layers. The convolution layers extract image features, while the fully connected layers predict box positions and class probability values.

Detection process

First, the image is scaled to a fixed size.

YOLO divides the input image into an S × S grid (7 × 7 in the paper); each grid cell is responsible for detecting objects whose centers fall inside it.

Each grid cell predicts B bounding boxes (B = 2 in the paper). Five values are predicted for each bounding box: the center coordinates x, y of the box (relative to the bounds of its grid cell), the width and height w, h of the box (relative to the width and height of the original input image), and the confidence score of the box (the IOU between the bounding box and the ground-truth box).

At the same time, each grid cell predicts the conditional probabilities of C classes (C = 20 in the paper): a C-dimensional vector giving, for an object present in the cell, the probability that it belongs to each category (background is not among the C classes).

Each grid cell thus predicts 2 × 5 + 20 = 30 values, which are mapped to a 30-dimensional vector.
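To make the tensor layout concrete, here is a small sketch of unpacking one cell's 30 values; the exact ordering of values inside the vector is an assumption for illustration:

```python
S, B, C = 7, 2, 20           # grid size, boxes per cell, classes (YOLOv1 paper)
values_per_cell = B * 5 + C  # 2*5 + 20 = 30; the full output is 7 x 7 x 30

def unpack_cell(cell):
    """Split one cell's 30 values into boxes and class probabilities.

    `cell` is a flat sequence of 30 floats; the layout below (B boxes of
    5 values each, then C class probabilities) is one common convention,
    assumed here for illustration.
    """
    boxes = []
    for b in range(B):
        x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
        boxes.append({"x": x, "y": y, "w": w, "h": h, "conf": conf})
    class_probs = cell[B * 5:]  # 20 conditional class probabilities
    return boxes, class_probs
```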

Finally, YOLO uses the non-maximum suppression (NMS) algorithm to extract the most likely objects and their corresponding bounding boxes from the output. (The NMS procedure is given below, followed by a code sketch.)

1. Set a Score threshold and an IOU (overlap) threshold.

2. For each object class, iterate over all candidate boxes belonging to that class: ① filter out the candidate boxes whose Score is below the Score threshold;

② among the remaining candidates, find the box with the largest Score and add it to the output list;

③ for each remaining candidate, compute its IOU with every box already in the output list from ②; if the IOU exceeds the IOU threshold, discard the candidate (a high IOU means heavy overlap), otherwise add it to the output list;

④ the boxes in the final output list are all the predicted bounding boxes for this class of objects in the image.

3. Return to step 2 and process the next object class.
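Below is a minimal Python sketch of this per-class procedure, written as the standard greedy loop and reusing the iou() helper from the earlier sketch; the box format and threshold values are assumptions:

```python
def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Per-class non-maximum suppression.

    boxes:  list of (x1, y1, x2, y2) candidates for one class
    scores: matching confidence scores
    Returns the indices of the kept boxes.
    Uses iou() from the sketch earlier in this article.
    """
    # Step ①: drop candidates below the Score threshold.
    candidates = [i for i, s in enumerate(scores) if s >= score_thresh]
    # Visit candidates in descending score order (highest Score first).
    candidates.sort(key=lambda i: scores[i], reverse=True)

    keep = []
    for i in candidates:
        # Steps ② and ③: keep a box unless it overlaps a kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```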

If the overlap (IOU) threshold is set too high, too few proposal boxes are suppressed, which produces a large number of false positives (FP) and in turn lowers detection accuracy (because of the imbalance between objects and background, FP grows much faster than TP).

If the overlap threshold is set too low, proposal boxes are suppressed too severely, and the recall rate drops sharply.

What are the input, output, and loss function?

Input: in the paper, the input is a 448 × 448 image.

Loss function

As shown in the figure above, the loss function has four parts: coordinate prediction (blue box), confidence prediction for bounding boxes containing objects (red box), confidence prediction for bounding boxes containing no objects (yellow box), and classification prediction (purple box).

Bounding boxes of different sizes are not equally sensitive to prediction deviations: the same deviation matters much more for a small box than for a large one. To balance this difference in sensitivity, the author takes the square root of the bounding box width and height before computing the L2 loss. YOLO also attaches more importance to coordinate prediction and gives the coordinate loss a larger weight, denoted λ_coord; in Pascal VOC training, λ_coord = 5 while the classification error weight is 1.

The confidence of a bounding box is defined as Pr(object) × IOU: the probability that the box contains some object times the IOU between the box and that object's ground truth, where Pr(object) = 1 if an object is present and 0 otherwise. Because most grid cells in an image contain no object, the confidences of the boxes in those cells are pushed toward 0. These object-free cells far outnumber the cells containing objects, contribute more to the gradient update, and can make training unstable. To balance this, the YOLO loss gives a smaller weight, denoted λ_noobj, to the confidence error of boxes without objects; in Pascal VOC training λ_noobj = 0.5, while the confidence error of boxes with objects has weight 1.
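To make the weighting concrete, below is a much-simplified sketch of the per-box loss terms (plain Python for illustration; it omits the classification term and the "responsible predictor" selection details, and the variable names are assumptions):

```python
from math import sqrt

LAMBDA_COORD = 5.0   # larger weight for coordinate errors (Pascal VOC setting)
LAMBDA_NOOBJ = 0.5   # smaller weight for object-free confidence errors

def box_loss(pred, target, has_object):
    """Simplified YOLOv1 loss terms for one predicted box.

    pred/target: dicts with keys x, y, w, h, conf.
    has_object:  True if this box is responsible for a ground-truth object.
    """
    if has_object:
        # Coordinate loss; sqrt damps the penalty for large boxes.
        coord = ((pred["x"] - target["x"]) ** 2 +
                 (pred["y"] - target["y"]) ** 2 +
                 (sqrt(pred["w"]) - sqrt(target["w"])) ** 2 +
                 (sqrt(pred["h"]) - sqrt(target["h"])) ** 2)
        # Confidence target is the IOU with the ground truth (weight 1).
        conf = (pred["conf"] - target["conf"]) ** 2
        return LAMBDA_COORD * coord + conf
    # No object: only a down-weighted confidence loss toward 0.
    return LAMBDA_NOOBJ * (pred["conf"] - 0.0) ** 2
```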

Output: the result is a 7 × 7 × 30 tensor.

YOLOv2

Innovations of YOLOv2

Although YOLOv1 is fast, its localization is imprecise and its recall rate is low. To improve localization accuracy and recall, YOLOv2 proposes several improvement strategies on top of YOLOv1.

Batch Normalization

YOLOv2 adds a Batch Normalization (BN) layer after each convolution layer and removes dropout. The BN layer has a regularizing effect, speeds up model convergence, and helps prevent overfitting. Using BN layers raises YOLOv2's mAP by 2%.
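As a sketch of the pattern (PyTorch, which the article itself does not use; the channel counts are arbitrary, and the leaky ReLU follows Darknet's convention):

```python
import torch.nn as nn

def conv_bn(in_ch, out_ch, kernel_size):
    """Convolution followed by BatchNorm, the YOLOv2 pattern (no dropout).

    Bias is disabled because BN's learned shift makes it redundant.
    """
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),  # Darknet uses leaky ReLU
    )
```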

High Resolution Classifier

Most detection models use mainstream classification networks (such as VGG or ResNet) pre-trained on ImageNet as feature extractors, and most of those networks are trained on inputs smaller than 256x256; the low resolution limits detection ability. YOLOv2 raises the input resolution to 448x448. To let the network adapt to the new resolution, YOLOv2 first fine-tunes the classification network on ImageNet for 10 epochs at 448x448. High-resolution input improves YOLOv2's mAP by about 4%.

Convolutional With Anchor Boxes (convolutional prediction with anchor boxes)

YOLOv1 uses fully connected layers to predict bounding boxes directly, which loses spatial information and makes localization inaccurate. YOLOv2 removes YOLOv1's fully connected layers, uses anchor boxes to predict the bounding boxes, and removes one pooling layer to obtain a higher-resolution feature map. Because objects tend to appear near the center of a picture, a feature map with a single central cell makes it easier to detect objects whose center points fall there, so the width and height of the feature map are kept odd.

By shrinking the network and using 416x416 input, with a total downsampling stride of 32, YOLOv2 obtains a 13x13 feature map. It then predicts five anchor boxes for each cell of the 13x13 feature map, and for each anchor box predicts the location of the bounding box, a confidence, and a set of class probability values. With anchor boxes, YOLOv2 can predict 13x13x5 = 845 bounding boxes; the model's recall rate rises from 81% to 88%, while mAP dips from 69.5% to 69.2% (recall up 7 points, accuracy down 0.3 points).
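The arithmetic behind the 13x13 grid and the 845 boxes can be checked directly (a trivial sketch):

```python
input_size = 416
stride = 32                   # total downsampling factor of the network
grid = input_size // stride   # 416 / 32 = 13
anchors_per_cell = 5
num_boxes = grid * grid * anchors_per_cell
print(grid, num_boxes)        # 13, 845
```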

New Network: Darknet-19

YOLOv2 adopts Darknet-19, whose structure is shown in the figure below: 19 convolution layers and 5 max-pooling layers, mainly 3x3 and 1x1 convolutions. The 1x1 convolutions compress the number of feature-map channels to reduce computation and parameters, and a BN layer follows each convolution layer to accelerate convergence and prevent overfitting. Finally, global average pooling is used for the prediction. With Darknet-19, the model's mAP does not increase significantly, but the amount of computation drops.

Dimension Clusters (dimension clustering)

In Faster R-CNN and SSD, the prior boxes are set manually, with a degree of subjectivity. YOLOv2 runs k-means clustering on the bounding boxes of the training set, using the IOU between boxes as the clustering metric. Balancing model complexity against recall, five cluster centers are chosen, yielding five prior boxes; these turn out to contain fewer short, wide boxes and more thin, tall ones, which better match the shape of pedestrians. Comparative experiments show that the prior boxes obtained by cluster analysis have a higher average IOU than manually chosen ones, making the model easier to train.
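A minimal sketch of the idea follows; it is illustrative only: boxes are reduced to (w, h) pairs, IOU is computed as if the boxes shared a corner, and random initialization with a fixed iteration count is a simplification of the paper's procedure:

```python
import random

def wh_iou(wh_a, wh_b):
    """IOU of two boxes aligned at a common corner: only w and h matter."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=5, iters=100):
    """Cluster (w, h) pairs with the 1 - IOU distance used by YOLOv2."""
    centers = random.sample(boxes_wh, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for wh in boxes_wh:
            # Nearest center = smallest 1 - IOU = largest IOU.
            i = max(range(k), key=lambda c: wh_iou(wh, centers[c]))
            clusters[i].append(wh)
        # Recompute each center as the per-cluster median width and height.
        for i, cl in enumerate(clusters):
            if cl:
                ws = sorted(w for w, _ in cl)
                hs = sorted(h for _, h in cl)
                centers[i] = (ws[len(ws) // 2], hs[len(hs) // 2])
    return centers
```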

Direct location prediction

Faster R-CNN uses anchor boxes to predict offsets of the bounding box relative to the prior box. Because the offsets are unconstrained, a predicted box can fall anywhere in the picture, which destabilizes the model and lengthens training. YOLOv2 follows YOLOv1 in predicting coordinates relative to the grid cell, so the ground-truth values lie between 0 and 1; the network's raw predictions are passed through a sigmoid function so the outputs also lie between 0 and 1. Let the offset of a grid cell from the top-left corner of the image be (cx, cy), and let the width and height of the prior box be pw and ph; then the predicted box center (bx, by) and size (bw, bh) on the feature map are given by bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw · e^tw, bh = ph · e^th.
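A direct sketch of this decoding (tx, ty, tw, th denote the network's raw outputs for one anchor; the names follow the paper's notation):

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLOv2 box decoding: the center is constrained to its grid cell."""
    bx = sigmoid(tx) + cx   # center x, in feature-map units
    by = sigmoid(ty) + cy   # center y, in feature-map units
    bw = pw * exp(tw)       # width, scaled from the prior box
    bh = ph * exp(th)       # height, scaled from the prior box
    return bx, by, bw, bh
```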

Combined with Dimension Clusters, constraining the position predictions of the bounding boxes makes the model easier to train stably, raising its mAP by about 5%.

Fine-Grained Features (fine-grained features)

YOLOv2 borrows SSD's idea of detecting on multi-scale feature maps and proposes a passthrough layer that connects a high-resolution feature map with a low-resolution one, achieving multi-scale detection. YOLOv2 takes the input of Darknet-19's last max-pooling layer, a 26x26x512 feature map, reduces it with a 1x1x64 convolution to 26x26x64, and then applies the passthrough layer to obtain a 13x13x256 feature map (each 2x2 local region of the original map is split into new channels, i.e. the spatial size shrinks by 4x while the channel count grows by 4x). This map is concatenated with the 13x13x1024 feature map to form a 13x13x1280 feature map, on which the predictions are made. Fine-Grained Features improve YOLOv2's performance by 1%.
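The passthrough rearrangement is a space-to-depth reshape, sketched below in PyTorch (stride 2, matching the 26x26x64 → 13x13x256 example above):

```python
import torch

def passthrough(x, stride=2):
    """Space-to-depth: trade spatial resolution for channels.

    x: (N, C, H, W) -> (N, C * stride**2, H // stride, W // stride)
    """
    n, c, h, w = x.shape
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

x = torch.randn(1, 64, 26, 26)
print(passthrough(x).shape)   # torch.Size([1, 256, 13, 13])
```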

Multi-Scale Training

The Darknet-19 network used in YOLOv2 contains only convolution and pooling layers, so there is no restriction on the input image size. YOLOv2 trains with multi-scale input: during training, a new input size is randomly chosen every 10 batches. Because Darknet-19's total downsampling stride is 32, the input sizes are multiples of 32, drawn from {320, 352, ..., 608}. With Multi-Scale Training, the model adapts to inputs of different sizes: low-resolution input lowers the mAP slightly but runs faster, while high-resolution input yields a higher mAP at lower speed. YOLOv2 also borrows techniques from many other detectors, such as Faster R-CNN's anchor boxes and SSD's multi-scale detection, and adds many tricks of its own in the network design, improving detection accuracy while preserving speed. Multi-Scale Training also lets the same model handle inputs of different sizes, allowing a free tradeoff between speed and accuracy.
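The size schedule reduces to a random choice over multiples of 32 (a trivial sketch):

```python
import random

sizes = list(range(320, 608 + 1, 32))   # {320, 352, ..., 608}
# Every 10 batches, pick a new input resolution at random.
input_size = random.choice(sizes)
```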

Problems with YOLOv2

YOLOv2 fixes the defects of YOLOv1 and greatly improves detection performance, but some problems remain; for example, it cannot handle overlapping class labels (an object that belongs to more than one category).

Innovations of YOLOv3

New network structure: Darknet-53

Feeding 256x256 images into classification models built on Darknet-19, ResNet-101, ResNet-152, and Darknet-53 gives the experimental results shown in the figure below. Darknet-53 performs better than ResNet-101 and is 1.5 times faster; it performs about as well as ResNet-152 while being almost twice as fast. Note also that, compared with the other network structures, Darknet-53 achieves the highest number of floating-point operations per second, indicating that its structure makes better use of the GPU.

Fused FPN

YOLOv3 borrows the idea of FPN and extracts features at different scales. Compared with YOLOv2, YOLOv3 takes feature maps from the last three stages; it not only makes an independent prediction on each feature map, but also upsamples the small feature map to the size of the larger one and concatenates the two before predicting again. Using the dimension-clustering idea, anchor boxes of 9 scales are clustered and distributed evenly across the feature maps of the 3 scales.
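The fusion step itself reduces to an upsample followed by a channel-wise concatenation (a PyTorch sketch with made-up channel counts):

```python
import torch
import torch.nn.functional as F

small = torch.randn(1, 256, 13, 13)    # coarse, semantically strong map
large = torch.randn(1, 512, 26, 26)    # finer map from an earlier stage

# Upsample the small map to the large map's size, then concatenate.
up = F.interpolate(small, scale_factor=2, mode="nearest")
fused = torch.cat([up, large], dim=1)  # (1, 768, 26, 26)
print(fused.shape)
```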

Using logistic regression instead of softmax as the classifier

In practice, an object may carry more than one category label, and simple single-label classification is limiting in real scenes. For example, a car belongs both to the car category and to the vehicle category, but a single label can express only one of these. YOLOv3 therefore replaces the softmax layer in the network with logistic regression classifiers, turning single-label classification into multi-label classification. Replacing softmax with multiple logistic classifiers does not reduce accuracy and maintains YOLO's detection precision.
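The difference is easy to see in a tiny sketch: softmax forces classes to compete for one label, while independent sigmoids let car and vehicle both score high (the logits and class names are made up for illustration):

```python
import torch

logits = torch.tensor([3.0, 2.5, -1.0])   # scores for [car, vehicle, person]

softmax_p = torch.softmax(logits, dim=0)  # sums to 1: one label wins
sigmoid_p = torch.sigmoid(logits)         # independent per-class probabilities

print(softmax_p)  # ~[0.61, 0.37, 0.01] - "car" suppresses "vehicle"
print(sigmoid_p)  # ~[0.95, 0.92, 0.27] - both car and vehicle fire
```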

That is the full content of "Example Analysis of the YOLO Target Detection Structure from V1 to V3". Thank you for reading! We hope the content is helpful; for more related knowledge, please follow our industry information channel!
