An example Analysis of the principle of YOLO v3

2025-04-20 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Many newcomers are unclear about how YOLO v3 works. This article walks through the principle in detail to help with that, and should be useful to anyone who needs it.

Basic idea of algorithm

First, features are extracted from the input image by the feature extraction network, producing a feature map of a specific size. The input image is divided into 13 × 13 grid cells, and if the center of an object's ground truth box falls in a grid cell, that grid cell is responsible for predicting the object. Each grid cell predicts a fixed number of bounding boxes (three in YOLO v3), and logistic regression is used to determine which of them is used for the prediction.

Network structure

DBL, shown in the network structure diagram, is the basic component of YOLO v3: a Darknet convolution layer followed by BatchNormalization (BN) and LeakyReLU. Except for the last convolution layer, BN and LeakyReLU are inseparable from the convolution layer in YOLO v3, and together they form the smallest component.

The backbone network uses five resn structures. The n is a number (res1, res2, ..., res8, and so on) indicating that the res_block contains n res_units; this is the large component of YOLO v3. YOLO v3 borrows the residual structure of ResNet, which allows the network to be deeper. The res_block is explained in the lower right corner of the network structure diagram, and its basic component is also DBL.

The prediction branches contain a tensor concatenation (concat) operation. It concatenates an intermediate Darknet layer with the upsampled output of a later layer. Note that concatenation differs from the add operation in the res_unit structure: concatenation increases the channel dimension of the tensor, while add leaves the tensor dimensions unchanged.
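The difference between concat and add is easiest to see in the tensor shapes. The following is a minimal NumPy sketch; the feature map sizes and channel counts are illustrative assumptions, not values taken from the article.

```python
import numpy as np

# Hypothetical feature maps in (height, width, channels) layout.
upsampled = np.zeros((26, 26, 256))  # a later layer after x2 upsampling
skip = np.zeros((26, 26, 512))       # an intermediate Darknet layer

# Concatenation stacks the channels, so the channel dimension grows.
concatenated = np.concatenate([upsampled, skip], axis=-1)
print(concatenated.shape)  # (26, 26, 768)

# Element-wise add (as in a res_unit) needs equal shapes and keeps them.
shortcut = np.zeros((26, 26, 512))
added = skip + shortcut
print(added.shape)  # (26, 26, 512)
```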

Yolo_body has 252 layers in total. The 23 res_units correspond to 23 add layers. There are 72 BN layers and 72 LeakyReLU layers; in the network structure each BN layer is followed by a LeakyReLU layer. There are 2 upsampling operations and 2 tensor concatenation operations, and 5 zero-padding layers corresponding to the 5 res_blocks. There are 75 convolution layers in total, 72 of which are followed by BatchNormalization and LeakyReLU to form a DBL. The outputs at three different scales correspond to the remaining three convolution layers, and the last convolution layer has 255 convolution kernels. For the 80 classes of the COCO dataset, 3 × (80 + 4 + 1) = 255, where 3 is the number of bounding boxes per grid cell, 4 is the number of coordinate values of a box, and 1 is the confidence.

The following figure shows the specific network structure.

Mapping input to output

Leaving aside the details of the network structure, YOLO v3 maps an input image to output tensors at three scales, which represent the probability of each kind of object at each position in the image.

Let's look at how many predictions YOLO v3 makes. For a 416 × 416 input image, 3 prior boxes are set in each grid cell of the feature map at each scale, giving a total of 13 × 13 × 3 + 26 × 26 × 3 + 52 × 52 × 3 = 10647 predictions. Each prediction is a (4 + 1 + 80) = 85-dimensional vector, which contains the box coordinates (4 values), the box confidence (1 value), and the object class probabilities (80 classes for the COCO dataset).
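The arithmetic above can be checked in a few lines. This is only a sketch; the variable names are illustrative.

```python
grid_sizes = [13, 26, 52]  # side lengths of the three output feature maps
boxes_per_cell = 3         # prior boxes per grid cell
num_classes = 80           # COCO

# One prediction per prior box per grid cell, summed over the three scales.
total_predictions = sum(s * s * boxes_per_cell for s in grid_sizes)
# 4 box coordinates + 1 confidence + one probability per class.
vector_length = 4 + 1 + num_classes

print(total_predictions)  # 10647
print(vector_length)      # 85
```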

Bounding box prediction (Bounding Box Prediction)

YOLO v3 still uses the k-means clustering method from YOLO v2 to set the initial sizes of the bounding boxes. This prior knowledge helps a great deal with bounding box initialization: using a large number of bounding boxes may secure accuracy, but it greatly slows the algorithm.
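As a rough illustration of anchor clustering, the sketch below runs k-means on box widths and heights, using 1 − IoU as the distance as described in the YOLO v2 paper. The function names, the random initialisation, the mean-based centroid update, and the synthetic boxes are all assumptions made for this example; it is not the authors' code.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (N, 2) boxes and (K, 2) centroids, all aligned at one corner."""
    inter_w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = inter_w * inter_h
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_b + area_c - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs into k anchors, sorted by area."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Smallest (1 - IoU) distance is the same as largest IoU.
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]

# Synthetic (width, height) pairs stand in for real ground truth boxes.
rng = np.random.default_rng(1)
boxes = rng.uniform(5, 400, size=(500, 2))
anchors = kmeans_anchors(boxes, k=9)
print(anchors.shape)  # (9, 2)
```

Sorting by area makes it easy to hand the three smallest anchors to the largest feature map and the three largest anchors to the smallest one.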

On the COCO dataset, the nine clusters are shown in the table below. Note that the larger the feature map, the smaller the receptive field and the more sensitive it is to small objects, so small anchor boxes are assigned to it. The smaller the feature map, the larger the receptive field and the more sensitive it is to large objects, so large anchor boxes are assigned to it.

YOLO v3 predicts relative positions directly: it predicts the coordinates of the bounding box center relative to the upper left corner of the grid cell. The network directly predicts (tx, ty, tw, th, to), and the position, size, and confidence of the bounding box are then calculated by the following coordinate offset formula.
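For reference, the offset formula as given in the YOLO v2 and YOLO v3 papers can be written as follows, where σ is the sigmoid function:

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h} \\
\Pr(\text{object}) \cdot \mathrm{IoU}(b, \text{object}) &= \sigma(t_o)
\end{aligned}
```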

tx, ty, tw, and th are the predicted outputs of the model. cx and cy are the coordinates of the grid cell. For example, if the feature map of a layer is 13 × 13, there are 13 × 13 grid cells, and the grid cell in row 0, column 1 has cx = 1 and cy = 0, since cx is the horizontal offset and cy the vertical one. pw and ph are the width and height of the prior (anchor) box. bx, by, bw, and bh are the center coordinates and size of the predicted bounding box. Sum of squared error loss is used when training these coordinate values, because this error is quick to compute.
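To make the roles of these symbols concrete, here is a minimal sketch of decoding one prediction. It assumes the sigmoid and exponential offsets from the YOLO v2/v3 papers and a prior expressed in grid units; `decode_box` and its arguments are hypothetical names chosen for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, cell, prior, grid_size):
    """Decode raw outputs t = (tx, ty, tw, th, to) for one prior box.

    cell = (cx, cy) is the grid cell offset, prior = (pw, ph) is the prior
    size in grid units (an assumption of this sketch).
    """
    tx, ty, tw, th, to = t
    cx, cy = cell
    pw, ph = prior
    bx = sigmoid(tx) + cx   # center x, in grid units
    by = sigmoid(ty) + cy   # center y, in grid units
    bw = pw * np.exp(tw)    # width, in grid units
    bh = ph * np.exp(th)    # height, in grid units
    confidence = sigmoid(to)
    # Divide by the grid size to express the box as fractions of the image.
    return np.array([bx, by, bw, bh]) / grid_size, confidence

# All-zero outputs place the center in the middle of cell (6, 6) of a 13 x 13 grid.
box, conf = decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell=(6, 6), prior=(3.0, 2.0), grid_size=13)
print(box, conf)
```

Because the sigmoid keeps the center offset between 0 and 1, the predicted center cannot leave the grid cell that is responsible for it.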

Note: confidence = Pr(Object) × IoU expresses both how confident the model is that the box contains an object and how accurate the box is. If the box corresponds to background, the value should be 0; if it corresponds to foreground, the value should be its IoU with the corresponding ground truth (GT) box.

YOLO v3 uses logistic regression to predict an objectness score for each bounding box. If a bounding box prior overlaps a ground truth box more than any other bounding box prior does, its score should be 1. If a prior is not the best but overlaps a ground truth object by more than a threshold (0.5 in YOLO v3), the prediction is ignored. YOLO v3 assigns only one bounding box prior to each ground truth object. A prior that is not assigned to a ground truth object incurs no coordinate or class prediction loss, only objectness loss.

Multi-scale prediction

As the network structure diagram shows, YOLO v3 predicts 3 boxes per grid cell, and each box needs five basic parameters (x, y, w, h, confidence). YOLO v3 outputs feature maps at three scales, labelled y1, y2, and y3 in the diagram. All three have a depth of 255, and their side lengths are in the ratio 13:26:52.

The feature map produced by each prediction task has size N × N × [3 × (4 + 1 + 80)], where N is the grid size, 3 is the number of bounding boxes per grid cell, 4 is the number of bounding box coordinates, 1 is the objectness value, and 80 is the number of categories. COCO has 80 categories, so each box outputs one probability per category. Hence 3 × (5 + 80) = 255, which is where the 255 comes from.
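The depth calculation, and the way one N × N × 255 output can be viewed as 3 boxes of 85 values per cell, can be sketched as follows. The reshape assumes that the 255 channels are laid out box by box, which is a common convention but an assumption here.

```python
import numpy as np

num_anchors, num_coords, num_conf, num_classes = 3, 4, 1, 80
values_per_box = num_coords + num_conf + num_classes
depth = num_anchors * values_per_box
print(depth)  # 255

N = 13
raw = np.zeros((N, N, depth))  # stand-in for one output feature map
# Assumed layout: the 255 channels hold box 0's 85 values, then box 1's, then box 2's.
per_box = raw.reshape(N, N, num_anchors, values_per_box)
print(per_box.shape)  # (13, 13, 3, 85)
```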

YOLO v3 uses upsampling to build these multi-scale feature maps. Starting from the feature map produced by Darknet-53, six DBL structures and a final convolution layer give the first feature map, on which the first prediction (y1) is made.

On the y1 branch, the output of the penultimate convolution layer, counted from the end, passes through one DBL structure and one ×2 upsampling, and the upsampled feature is concatenated with the feature tensor output by the second res8 structure. Six DBL structures and a final convolution layer then give the second feature map, on which the second prediction (y2) is made.

On the y2 branch, the output of the third convolution layer from the end passes through one DBL structure and one ×2 upsampling, and the upsampled feature is concatenated with the feature tensor output by the first res8 structure. Six DBL structures and a final convolution layer then give the third feature map, on which the third prediction (y3) is made.

For the network as a whole, the feature maps output by YOLO v3 multi-scale prediction are y1: (13 × 13), y2: (26 × 26), and y3: (52 × 52). The network receives a 416 × 416 image and downsamples it five times by convolutions with a stride of 2 (416 / 2^5 = 13), which gives y1 (13 × 13). The penultimate convolution layer of the y1 branch is upsampled (×2) and concatenated with the last 26 × 26 feature map tensor to give y2 (26 × 26). The penultimate convolution layer of the y2 branch is upsampled (×2) and concatenated with the last 52 × 52 feature map tensor to give y3 (52 × 52).
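These sizes follow from dividing the input size by the total stride of each output. A minimal sketch, assuming strides of 32, 16, and 8 for y1, y2, and y3 (32 = 2^5 comes from the text; 16 and 8 are inferred from the two ×2 upsampling steps):

```python
def grid_sizes(input_size, strides=(32, 16, 8)):
    """Side length of each output feature map for a square input image."""
    return [input_size // s for s in strides]

sizes_416 = grid_sizes(416)
print(sizes_416)  # [13, 26, 52]
```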

To get a feel for the sizes of the 9 prior boxes, see the following picture: the blue boxes are the prior boxes obtained by clustering, the yellow box is the ground truth, and the red box is the grid cell in which the center of the object lies.

Three cases of prediction boxes

Prediction boxes fall into three cases: positive examples (positive), negative examples (negative), and ignored samples (ignore).

(1) Positive example: take any ground truth and compute its IoU with all 10647 boxes described above. The prediction box with the largest IoU is the positive example. A prediction box can be assigned to only one ground truth: if the first ground truth has already been matched to a positive box, the next ground truth looks for its largest-IoU box among the remaining 10646 boxes. A positive example produces confidence loss, box loss, and class loss. The box label is the corresponding ground truth box (computed from the real x, y, w, h); the class label is 1 for the true class and 0 for the rest; the confidence label is 1.

(2) Ignored sample: any box other than a positive example whose IoU with any ground truth is greater than the threshold (0.5 in the paper) is ignored. Ignored samples produce no loss.

Why are there ignored samples?

Because YOLO v3 detects at multiple scales, the same object can be detected more than once across scales. For example, for one real object, the third box of feature map 1 has an IoU of 0.98 with the ground truth, while the first box of feature map 2 has an IoU of 0.95 with it. The second box has also detected the ground truth; if its confidence label were forced to 0, the network would not learn well.

(3) Negative example: any box other than a positive example whose IoU with every ground truth is less than the threshold (0.5) is a negative example. (The box with the largest IoU for a ground truth remains a positive example even if that IoU is below the threshold.) A negative example produces only confidence loss, with a confidence label of 0.
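The three cases can be sketched as a labelling function over an IoU matrix. This is an illustrative reading of the rules above, not code from any YOLO implementation; `assign_labels`, the label constants, and the toy IoU values are assumptions, and the sketch assumes there are at least as many prediction boxes as ground truths.

```python
import numpy as np

POSITIVE, NEGATIVE, IGNORE = 1, 0, -1

def assign_labels(iou, ignore_thresh=0.5):
    """Label prediction boxes from a (num_gt, num_pred) IoU matrix.

    Returns the label of each prediction box and the index of the ground
    truth matched to it (-1 if none).
    """
    num_gt, num_pred = iou.shape
    labels = np.full(num_pred, NEGATIVE)
    matched_gt = np.full(num_pred, -1)
    if num_gt == 0:
        return labels, matched_gt
    # Ignored: IoU with some ground truth exceeds the threshold.
    labels[iou.max(axis=0) > ignore_thresh] = IGNORE
    # Positive: each ground truth in turn takes its best box that is still
    # free, whatever the IoU value. This overrides the ignore label.
    taken = np.zeros(num_pred, dtype=bool)
    for g in range(num_gt):
        candidates = np.where(taken, -np.inf, iou[g])
        best = int(np.argmax(candidates))
        taken[best] = True
        labels[best] = POSITIVE
        matched_gt[best] = g
    return labels, matched_gt

# Two ground truths, four prediction boxes (toy values).
iou = np.array([[0.9, 0.6, 0.1, 0.0],
                [0.2, 0.1, 0.3, 0.0]])
labels, matched_gt = assign_labels(iou)
print(labels.tolist())  # [1, -1, 1, 0]
```

In the toy example, box 1 is ignored (IoU 0.6 with the first ground truth, but not its best match), and box 2 is positive even though its IoU of 0.3 is below the threshold.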

As shown in the following figure:

λ is a weight parameter used to balance the box loss, the obj and noobj confidence losses, and the class loss.

For positive examples, the indicator 1_ij^obj is 1; for negative examples, 1_ij^noobj is 1; for ignored samples, both are 0.

The class loss uses cross entropy as the loss function.
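How the obj and noobj indicators gate the confidence term can be sketched as below. The use of binary cross entropy for the confidence term, the λ values, and the function names are assumptions for illustration; the article does not spell out the exact formula.

```python
import numpy as np

def binary_cross_entropy(target, prob, eps=1e-7):
    prob = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))

def confidence_loss(labels, pred_conf, lambda_obj=1.0, lambda_noobj=1.0):
    """labels: 1 = positive, 0 = negative, -1 = ignored; pred_conf: sigmoid outputs."""
    obj_mask = (labels == 1).astype(float)    # plays the role of 1_ij^obj
    noobj_mask = (labels == 0).astype(float)  # plays the role of 1_ij^noobj
    loss_obj = np.sum(obj_mask * binary_cross_entropy(1.0, pred_conf))
    loss_noobj = np.sum(noobj_mask * binary_cross_entropy(0.0, pred_conf))
    return lambda_obj * loss_obj + lambda_noobj * loss_noobj

labels = np.array([1, -1, 0])
pred_conf = np.array([0.9, 0.8, 0.1])
loss = confidence_loss(labels, pred_conf)
print(loss)
```

The ignored box (label -1) has both masks at 0, so its predicted confidence has no effect on the loss.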

Category prediction

For category prediction, the softmax classifier in the YOLO v2 network assumes that a target belongs to only one category, and assigns each box to the category with the highest score. In some complex scenes, however, a target may belong to several classes (with overlapping category labels), so YOLO v3 replaces the softmax layer with multiple independent logistic classifiers to handle multi-label classification, without a drop in accuracy.

For example, the softmax layer in the original classification network assumes that an image or an object belongs to only one category. In some complex scenes an object may belong to several classes: if your categories include both woman and person and there is a woman in the image, the detection result should carry both the woman and the person labels. This is multi-label classification, and it needs a logistic classifier for each category. The logistic classifier uses the sigmoid function, which constrains its output to the range 0 to 1. After feature extraction, if the sigmoid output for a class is greater than 0.5, the target in the bounding box is taken to belong to that class.
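A small numeric sketch shows the difference. The class names and raw scores below are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

class_names = ["person", "woman", "car"]  # hypothetical label set
logits = np.array([3.0, 2.5, -4.0])       # hypothetical raw class outputs

sigmoid_probs = sigmoid(logits)  # each class judged independently
softmax_probs = softmax(logits)  # classes compete, probabilities sum to 1

predicted = [n for n, p in zip(class_names, sigmoid_probs) if p > 0.5]
print(predicted)  # ['person', 'woman']
```

With softmax the same scores share a total probability of 1, so only "person" ends up above 0.5 here; with independent sigmoids both "person" and "woman" pass the threshold.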

Object score and class confidence

Object score: the probability that a bounding box contains an object. It is close to 1 for the red box and the boxes around it, and close to 0 for almost all boxes near the corners. The object score is also passed through a sigmoid function so that it represents a probability.

Class confidence: the probability that the detected object belongs to a specific class. Previous YOLO versions used softmax to convert class scores into class probabilities. In YOLO v3 the author uses the sigmoid function instead, because softmax assumes that classes are mutually exclusive (an object labelled "Person" could not also be "Woman"), whereas in many cases an object is both a "Person" and a "Woman".

Output processing

Our network generates 10647 candidate boxes, but there is only one dog in the image. How do we reduce 10647 boxes to 1? First, we filter boxes by object score, discarding those below a threshold (0.5, say). Then NMS (non-maximum suppression) handles the case where several boxes detect the same object (for example, the three boxes of the red grid cell, or adjacent cells, detect the same object), removing the redundant detections.

Specifically: discard boxes with low scores (the box has little confidence that it has detected a class); then, when several boxes overlap heavily and all detect the same object, keep only one of them (NMS).

To make this easier to follow, we use the car image above. First, a threshold is used to filter out some of the boxes. The model output has shape (19, 19, 3, 85), with each box described by 85 numbers (this example uses a 19 × 19 grid rather than 13 × 13). Split (19, 19, 3, 85) into the following shapes:

Box_confidence: (19, 19, 3, 1), the confidence that each of the 3 boxes in each of the 19 × 19 cells contains an object.

Boxes: (19, 19, 3, 4), the coordinates of each of the 3 boxes in each cell.

Box_class_probs: (19, 19, 3, 80), the probabilities of the 80 classes for each of the 3 boxes in each cell.

For each box, we compute the following element-wise product to get the probability that the box contains an object of each class, as shown in the following figure:
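The element-wise product and the threshold filter can be sketched in NumPy as follows. The tensor names follow the list above; `filter_boxes`, the 0.5 threshold, and the random inputs are assumptions for illustration.

```python
import numpy as np

def filter_boxes(box_confidence, boxes, box_class_probs, threshold=0.5):
    """Keep boxes whose best class score reaches the threshold."""
    # Broadcasting (19, 19, 3, 1) * (19, 19, 3, 80) -> (19, 19, 3, 80).
    box_scores = box_confidence * box_class_probs
    box_classes = np.argmax(box_scores, axis=-1)    # best class per box
    box_class_scores = np.max(box_scores, axis=-1)  # score of that class
    keep = box_class_scores >= threshold            # boolean mask, (19, 19, 3)
    return box_class_scores[keep], boxes[keep], box_classes[keep]

# Random tensors stand in for real network output.
rng = np.random.default_rng(0)
box_confidence = rng.uniform(size=(19, 19, 3, 1))
boxes = rng.uniform(size=(19, 19, 3, 4))
box_class_probs = rng.uniform(size=(19, 19, 3, 80))

scores, kept_boxes, classes = filter_boxes(box_confidence, boxes, box_class_probs)
print(scores.shape, kept_boxes.shape, classes.shape)
```

Indexing with the boolean mask flattens the grid, so the result is a plain list of surviving boxes with their scores and class indices.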

Even after some boxes have been filtered out by the class score threshold, many overlapping boxes remain. The second step is NMS, which relies on IoU, as shown in the following figure.

Non-maximum suppression comes down to three steps: (1) select the box with the highest score; (2) compute its overlap with all other boxes and remove those whose overlap exceeds the IoU threshold; (3) go back to step 1 and repeat until no boxes with a lower score than the current box remain.
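The steps above can be sketched in NumPy. This is a minimal single-class version that assumes boxes in (x1, y1, x2, y2) corner format with positive area; the function names and the 0.5 threshold are illustrative choices.

```python
import numpy as np

def iou_one_to_many(box, others):
    """IoU of one (x1, y1, x2, y2) box with an (M, 4) array of boxes."""
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2])
    y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by non-maximum suppression."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]               # step 1: highest remaining score
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]  # step 2: drop heavy overlaps
    return keep                       # step 3 is the loop itself

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

In the example, the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box does not overlap and is kept.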

Loss Function

The YOLO v3 paper does not state the loss function explicitly. In fact, among the YOLO papers only YOLO v1 gives an explicit loss formula. YOLO v1 uses a sum-squared error loss, which is simply a sum of squared differences. In object detection, several key pieces of information need to be determined: (x, y), (w, h), class, and confidence. The loss can be divided into these four parts, each term chosen according to the characteristics of the quantity it covers, and the terms are then added together to form the final loss function, so that the network can be trained end to end.

In-depth explanation of the YOLO v3 network (video)

Video address: https://www.bilibili.com/video/BV12y4y1v7L6?from=search&seid=442233808730191461

How the real value is encoded

Design of the predicted anchor boxes

IoU of the anchor box and the target box
