This article explains how to build a YoloV5 object detection platform with PyTorch. The content is presented step by step and is easy to follow, so please work through it from start to finish.
The implementation of YoloV5: 1. Analysis of the overall structure
Before learning YoloV5, we need some understanding of the work YoloV5 does as a whole; this will help us understand the details of the network later.
Like previous versions of Yolo, the whole of YoloV5 can still be divided into three parts: Backbone, FPN and Yolo Head.
Backbone can be called the backbone feature extraction network of YoloV5. Based on its structure and the naming of earlier Yolo backbones, I generally call it CSPDarknet. The input image first undergoes feature extraction in CSPDarknet, and the extracted features can be called feature layers: they are feature sets of the input image. In the backbone part we obtain three feature layers to build the next stage of the network; I call these three feature layers the effective feature layers.
FPN can be called the enhanced feature extraction network of YoloV5. The three effective feature layers obtained in the backbone are fused in this part; the purpose of feature fusion is to combine feature information of different scales. In the FPN part, the effective feature layers already obtained are used to continue extracting features. YoloV5 still uses the PANet structure: we not only upsample features for fusion, but also downsample them again for further fusion.
Yolo Head is the classifier and regressor of YoloV5. Through CSPDarknet and FPN we obtain three enhanced effective feature layers. Each feature layer has a width, a height and a number of channels, so we can regard the feature map as a collection of feature points, each with several channels. What Yolo Head actually does is judge these feature points and decide whether an object corresponds to each of them. Like previous versions of Yolo, YoloV5 does not use a decoupled head: classification and regression are carried out together in a single 1x1 convolution.
Therefore, the work of the whole YoloV5 network is feature extraction, feature enhancement, and prediction of the objects corresponding to the feature points.
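As a preview of the pieces built in the rest of this article, the end-to-end flow can be sketched as follows. This is a minimal sketch: it assumes the YoloBody class defined later (backbone + FPN + heads) is importable from a nets.yolo module, and it reuses the anchors_mask and phi values that appear in the training configuration further below.

import torch
from nets.yolo import YoloBody   # module path assumed; YoloBody is defined later in this article

anchors_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
model = YoloBody(anchors_mask, num_classes=20, phi='s')   # 20 classes, as for VOC

x = torch.randn(1, 3, 640, 640)            # one 640x640 RGB image
out_20, out_40, out_80 = model(x)          # predictions on the 20x20, 40x40 and 80x80 grids
print(out_20.shape, out_40.shape, out_80.shape)
# torch.Size([1, 75, 20, 20]) torch.Size([1, 75, 40, 40]) torch.Size([1, 75, 80, 80])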
2. Analysis of network structure
1. Introduction to the backbone network Backbone
The backbone feature extraction network used by YoloV5 is CSPDarknet, which has five important characteristics:
1. It uses the residual convolutions of the residual network (Residual). The residual convolutions in CSPDarknet can be divided into two parts: the trunk part, a 1x1 convolution followed by a 3x3 convolution, and the residual edge part, which does no processing and directly adds the input of the block to its output.
The whole backbone of YoloV5 is built from these residual convolutions:
class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super(Bottleneck, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))
The characteristic of a residual network is that it is easy to optimize and can gain accuracy from considerably increased depth. Its internal residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks.
2. It uses the CSPnet structure. The CSPnet structure is not complicated: the stack of the original residual blocks is split into two parts:
The trunk part continues to stack the original residual blocks.
The other part, like a residual edge, is directly connected to the end after a small amount of processing.
Therefore, it can be considered that there is a large residual edge in CSP.
class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(C3, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # act=FReLU(c2)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])
        # self.m = nn.Sequential(*[CrossConv(c_, c_, 3, 1, g, 1.0, shortcut) for _ in range(n)])

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
3. It uses the Focus structure: every other pixel of the input image is sampled, giving four independent feature layers, which are then stacked along the channel dimension. The width and height information is thereby concentrated into the channel dimension, and the stacked feature layer has 12 channels instead of the original 3. The slicing in the Focus forward pass looks like this:

    def forward(self, x):
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))
4. It uses the SiLU activation function. SiLU is an improved version of Sigmoid and ReLU: it has no upper bound, is bounded below, and is smooth and non-monotonic. SiLU performs better than ReLU in deep models and can be regarded as a smooth ReLU activation function.
class SiLU(nn.Module):
    @staticmethod
    def forward(x):
        return x * torch.sigmoid(x)
5. It uses the SPP structure, which extracts features with max pooling of different kernel sizes to enlarge the receptive field of the network. In YoloV4, SPP was used in the FPN; in YoloV5, the SPP module is used in the backbone feature extraction network.
class SPP(nn.Module):
    # Spatial pyramid pooling layer used in YOLOv3-SPP
    def __init__(self, c1, c2, k=(5, 9, 13)):
        super(SPP, self).__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * (len(k) + 1), c2, 1, 1)
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])

    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [m(x) for m in self.m], 1))
The whole backbone implementation code is as follows:
import torch
import torch.nn as nn


class SiLU(nn.Module):
    @staticmethod
    def forward(x):
        return x * torch.sigmoid(x)


def autopad(k, p=None):
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p


class Focus(nn.Module):
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Focus, self).__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)

    def forward(self, x):
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))


class Conv(nn.Module):
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2, eps=0.001, momentum=0.03)
        self.act = SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def fuseforward(self, x):
        return self.act(self.conv(x))


class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, shortcut, groups, expansion
        super(Bottleneck, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_, c2, 3, 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))


class C3(nn.Module):
    # CSP Bottleneck with 3 convolutions
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super(C3, self).__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c1, c_, 1, 1)
        self.cv3 = Conv(2 * c_, c2, 1)  # act=FReLU(c2)
        self.m = nn.Sequential(*[Bottleneck(c_, c_, shortcut, g, e=1.0) for _ in range(n)])

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))


class SPP(nn.Module):
    # Spatial pyramid pooling layer used in YOLOv3-SPP
    def __init__(self, c1, c2, k=(5, 9, 13)):
        super(SPP, self).__init__()
        c_ = c1 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * (len(k) + 1), c2, 1, 1)
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])

    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [m(x) for m in self.m], 1))


class CSPDarknet(nn.Module):
    def __init__(self, base_channels, base_depth):
        super().__init__()
        #   The input image is 640, 640, 3 and the initial base channel count is 64.
        #   Focus structure: 640, 640, 3 -> 320, 320, 12 -> 320, 320, 64
        self.stem = Focus(3, base_channels, k=3)
        #   Conv: 320, 320, 64 -> 160, 160, 128;  CSPLayer: 160, 160, 128 -> 160, 160, 128
        self.dark2 = nn.Sequential(
            Conv(base_channels, base_channels * 2, 3, 2),
            C3(base_channels * 2, base_channels * 2, base_depth),
        )
        #   Conv: 160, 160, 128 -> 80, 80, 256;  CSPLayer: 80, 80, 256 -> 80, 80, 256
        self.dark3 = nn.Sequential(
            Conv(base_channels * 2, base_channels * 4, 3, 2),
            C3(base_channels * 4, base_channels * 4, base_depth * 3),
        )
        #   Conv: 80, 80, 256 -> 40, 40, 512;  CSPLayer: 40, 40, 512 -> 40, 40, 512
        self.dark4 = nn.Sequential(
            Conv(base_channels * 4, base_channels * 8, 3, 2),
            C3(base_channels * 8, base_channels * 8, base_depth * 3),
        )
        #   Conv: 40, 40, 512 -> 20, 20, 1024;  SPP: 20, 20, 1024 -> 20, 20, 1024;  CSPLayer: 20, 20, 1024 -> 20, 20, 1024
        self.dark5 = nn.Sequential(
            Conv(base_channels * 8, base_channels * 16, 3, 2),
            SPP(base_channels * 16, base_channels * 16),
            C3(base_channels * 16, base_channels * 16, base_depth, shortcut=False),
        )

    def forward(self, x):
        x = self.stem(x)
        x = self.dark2(x)
        #   The output of dark3 is 80, 80, 256 and is an effective feature layer
        x = self.dark3(x)
        feat1 = x
        #   The output of dark4 is 40, 40, 512 and is an effective feature layer
        x = self.dark4(x)
        feat2 = x
        #   The output of dark5 is 20, 20, 1024 and is an effective feature layer
        x = self.dark5(x)
        feat3 = x
        return feat1, feat2, feat3

2. Building the FPN feature pyramid for enhanced feature extraction
For feature utilization, YoloV5 extracts multiple feature layers for object detection; three feature layers are extracted in total.
The three feature layers are located at different depths of the CSPDarknet trunk: the middle layer, the lower-middle layer and the bottom layer. When the input is (640, 640, 3), the shapes of the three feature layers are feat1 = (80, 80, 256), feat2 = (40, 40, 512) and feat3 = (20, 20, 1024).
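As a quick sanity check, the backbone can also be run on its own. This is a small sketch, assuming the backbone code above is saved as nets/CSPdarknet.py (the module path imported by the FPN code later in this article) and using base_channels=64, base_depth=3, the configuration that matches the channel counts quoted here:

import torch
from nets.CSPdarknet import CSPDarknet

backbone = CSPDarknet(base_channels=64, base_depth=3)
dummy = torch.randn(1, 3, 640, 640)     # one 640x640 RGB image
feat1, feat2, feat3 = backbone(dummy)
print(feat1.shape)  # torch.Size([1, 256, 80, 80])
print(feat2.shape)  # torch.Size([1, 512, 40, 40])
print(feat3.shape)  # torch.Size([1, 1024, 20, 20])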
After obtaining the three effective feature layers, we use them to build the FPN as follows:
The feat3 = (20, 20, 1024) feature layer goes through one 1x1 convolution to adjust its channels, giving P5. P5 is upsampled (UpSampling2d) and combined with the feat2 = (40, 40, 512) feature layer, and a CSPLayer then extracts features to obtain P5_upsample, whose shape is (40, 40, 512).
The P5_upsample = (40, 40, 512) feature layer goes through one 1x1 convolution to adjust its channels, giving P4. P4 is upsampled (UpSampling2d) and combined with the feat1 = (80, 80, 256) feature layer, and a CSPLayer then extracts features to obtain P3_out, whose shape is (80, 80, 256).
The P3_out = (80, 80, 256) feature layer is downsampled by a 3x3 convolution, stacked with P4, and a CSPLayer then extracts P4_out; the resulting feature layer is (40, 40, 512).
The P4_out = (40, 40, 512) feature layer is downsampled by a 3x3 convolution, stacked with P5, and a CSPLayer then extracts P5_out; the resulting feature layer is (20, 20, 1024).
The feature pyramid fuses feature layers of different shapes, which helps the network extract better features.
import torch
import torch.nn as nn

from nets.CSPdarknet import CSPDarknet, C3, Conv


#---------------------------------------------------#
#   yolo_body
#---------------------------------------------------#
class YoloBody(nn.Module):
    def __init__(self, anchors_mask, num_classes, phi):
        super(YoloBody, self).__init__()
        depth_dict = {'s': 0.33, 'm': 0.67, 'l': 1.00, 'x': 1.33}
        width_dict = {'s': 0.50, 'm': 0.75, 'l': 1.00, 'x': 1.25}
        dep_mul, wid_mul = depth_dict[phi], width_dict[phi]

        base_channels = int(wid_mul * 64)       # 64
        base_depth = max(round(dep_mul * 3), 1)  # 3
        #-----------------------------------------------#
        #   The input image is 640, 640, 3.
        #   The initial base channel count is 64.
        #-----------------------------------------------#

        #---------------------------------------------------#
        #   Build the CSPDarknet backbone and obtain three
        #   effective feature layers whose shapes are:
        #   80, 80, 256
        #   40, 40, 512
        #   20, 20, 1024
        #---------------------------------------------------#
        self.backbone = CSPDarknet(base_channels, base_depth)

        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")

        self.conv_for_feat3 = Conv(base_channels * 16, base_channels * 8, 1, 1)
        self.conv3_for_upsample1 = C3(base_channels * 16, base_channels * 8, base_depth, shortcut=False)

        self.conv_for_feat2 = Conv(base_channels * 8, base_channels * 4, 1, 1)
        self.conv3_for_upsample2 = C3(base_channels * 8, base_channels * 4, base_depth, shortcut=False)

        self.down_sample1 = Conv(base_channels * 4, base_channels * 4, 3, 2)
        self.conv3_for_downsample1 = C3(base_channels * 8, base_channels * 8, base_depth, shortcut=False)

        self.down_sample2 = Conv(base_channels * 8, base_channels * 8, 3, 2)
        self.conv3_for_downsample2 = C3(base_channels * 16, base_channels * 16, base_depth, shortcut=False)

        self.yolo_head_P3 = nn.Conv2d(base_channels * 4, len(anchors_mask[2]) * (5 + num_classes), 1)
        self.yolo_head_P4 = nn.Conv2d(base_channels * 8, len(anchors_mask[1]) * (5 + num_classes), 1)
        self.yolo_head_P5 = nn.Conv2d(base_channels * 16, len(anchors_mask[0]) * (5 + num_classes), 1)

    def forward(self, x):
        #   backbone
        feat1, feat2, feat3 = self.backbone(x)

        P5 = self.conv_for_feat3(feat3)
        P5_upsample = self.upsample(P5)
        P4 = torch.cat([P5_upsample, feat2], 1)
        P4 = self.conv3_for_upsample1(P4)

        P4 = self.conv_for_feat2(P4)
        P4_upsample = self.upsample(P4)
        P3 = torch.cat([P4_upsample, feat1], 1)
        P3 = self.conv3_for_upsample2(P3)

        P3_downsample = self.down_sample1(P3)
        P4 = torch.cat([P3_downsample, P4], 1)
        P4 = self.conv3_for_downsample1(P4)

        P4_downsample = self.down_sample2(P4)
        P5 = torch.cat([P4_downsample, P5], 1)
        P5 = self.conv3_for_downsample2(P5)

        #   The third feature layer:  y3 = (batch_size, 75, 80, 80)
        out2 = self.yolo_head_P3(P3)
        #   The second feature layer: y2 = (batch_size, 75, 40, 40)
        out1 = self.yolo_head_P4(P4)
        #   The first feature layer:  y1 = (batch_size, 75, 20, 20)
        out0 = self.yolo_head_P5(P5)
        return out0, out1, out2
3. Using Yolo Head to obtain the prediction result.
Using the FPN feature pyramid, we obtain three enhanced features whose shapes are (20, 20, 1024), (40, 40, 512) and (80, 80, 256). These three feature layers are then fed into the Yolo Head to obtain the prediction results.
For each feature layer, a single convolution adjusts the number of channels; the final number of channels is related to the number of classes to be distinguished. In YoloV5, each feature point on each feature layer has three prior boxes.
If you use the VOC training set, there are 20 classes, so the final channel dimension is 75 = 3x25, and the shapes of the three feature layers become (20, 20, 75), (40, 40, 75) and (80, 80, 75).
The last 75 can be split into three groups of 25, corresponding to the 25 parameters of the 3 prior boxes; each 25 can in turn be split into 4 + 1 + 20.
The first four parameters are used to judge the regression parameters of each feature point, and the prediction box can be obtained after the regression parameters are adjusted.
The fifth parameter is used to determine whether each feature point contains an object.
The last 20 parameters are used to determine the type of object contained in each feature point.
If you use the COCO training set, there are 80 classes, so the final channel dimension is 255 = 3x85, and the shapes of the three feature layers become (20, 20, 255), (40, 40, 255) and (80, 80, 255).
The last 255 can be split into three groups of 85, corresponding to the 85 parameters of the 3 prior boxes; each 85 can in turn be split into 4 + 1 + 80.
The first four parameters are used to judge the regression parameters of each feature point, and the prediction box can be obtained after the regression parameters are adjusted.
The fifth parameter is used to determine whether each feature point contains an object.
The last 80 parameters are used to determine the type of object contained in each feature point.
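The head's output channel count can be checked with a few lines of arithmetic. This is a small sketch (the helper name is ours), using the 3 prior boxes per feature point described above:

def head_channels(num_classes, anchors_per_point=3):
    # 4 regression parameters + 1 objectness score + one score per class, for each prior box
    return anchors_per_point * (4 + 1 + num_classes)

print(head_channels(20))   # VOC:  75
print(head_channels(80))   # COCO: 255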
The Yolo Head is implemented together with the FPN: it consists of the three 1x1 convolutions self.yolo_head_P3, self.yolo_head_P4 and self.yolo_head_P5 in the YoloBody code shown above.
3. Decoding the prediction results
1. Obtaining the prediction box and score
From the second step, we obtain the prediction results of the three feature layers, whose shapes are (N, 20, 20, 255), (N, 40, 40, 255) and (N, 80, 80, 255).
However, these predictions do not yet correspond to the positions of the final prediction boxes on the image and still need to be decoded. In YoloV5, each feature point on each feature layer has three prior boxes.
The last dimension, 255, of each feature layer can be split into three groups of 85, corresponding to the 85 parameters of the 3 prior boxes. We first reshape the results to (N, 3, 20, 20, 85), (N, 3, 40, 40, 85) and (N, 3, 80, 80, 85).
Each 85 can be split into 4 + 1 + 80.
The first four parameters are used to judge the regression parameters of each feature point, and the prediction box can be obtained after the regression parameters are adjusted.
The fifth parameter is used to determine whether each feature point contains an object.
The last 80 parameters are used to determine the type of object contained in each feature point.
Taking the (20, 20) feature layer as an example, this feature layer is equivalent to dividing the image into 20x20 feature points; if a feature point falls inside the bounding box of an object, it is used to predict that object.
As shown in the figure, the blue dots are the 20x20 feature points. Here we demonstrate the decoding operation for the three prior boxes of the black dot on the left:
1. Calculate the centers of the prediction boxes: the first two values of the regression prediction are used to offset the center coordinates of the feature point's three prior boxes; after the offset, they become the three red points on the right.
2. Calculate the widths and heights of the prediction boxes: the next two values of the regression prediction are used to scale the prior boxes' widths and heights (in the code below, the prediction box width is (2w)^2 times the prior box width, and likewise for the height).
3. At this point, the prediction boxes can be drawn on the image.
Besides this decoding, non-maximum suppression is also required to prevent boxes of the same class from piling up.
def decode_box(self, inputs):
    outputs = []
    for i, input in enumerate(inputs):
        #   There are three inputs, with shapes
        #   batch_size, 255, 20, 20
        #   batch_size, 255, 40, 40
        #   batch_size, 255, 80, 80
        batch_size = input.size(0)
        input_height = input.size(2)
        input_width = input.size(3)

        #   With a 640x640 input, stride_h = stride_w = 32, 16, 8
        stride_h = self.input_shape[0] / input_height
        stride_w = self.input_shape[1] / input_width
        #   The scaled_anchors obtained here are relative to the feature layer
        scaled_anchors = [(anchor_width / stride_w, anchor_height / stride_h)
                          for anchor_width, anchor_height in self.anchors[self.anchors_mask[i]]]

        #   Reshape to
        #   batch_size, 3, 20, 20, 85
        #   batch_size, 3, 40, 40, 85
        #   batch_size, 3, 80, 80, 85
        prediction = input.view(batch_size, len(self.anchors_mask[i]),
                                self.bbox_attrs, input_height, input_width).permute(0, 1, 3, 4, 2).contiguous()

        #   Adjustment parameters of the prior box centers
        x = torch.sigmoid(prediction[..., 0])
        y = torch.sigmoid(prediction[..., 1])
        #   Width and height adjustment parameters of the prior boxes
        w = torch.sigmoid(prediction[..., 2])
        h = torch.sigmoid(prediction[..., 3])
        #   Objectness confidence: is there an object?
        conf = torch.sigmoid(prediction[..., 4])
        #   Class confidences
        pred_cls = torch.sigmoid(prediction[..., 5:])

        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor

        #   Generate the grid; the prior box centers sit at the top-left corners of the cells
        #   batch_size, 3, 20, 20
        grid_x = torch.linspace(0, input_width - 1, input_width).repeat(input_height, 1).repeat(
            batch_size * len(self.anchors_mask[i]), 1, 1).view(x.shape).type(FloatTensor)
        grid_y = torch.linspace(0, input_height - 1, input_height).repeat(input_width, 1).t().repeat(
            batch_size * len(self.anchors_mask[i]), 1, 1).view(y.shape).type(FloatTensor)

        #   Generate the prior box widths and heights on the grid
        #   batch_size, 3, 20, 20
        anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
        anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))
        anchor_w = anchor_w.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(w.shape)
        anchor_h = anchor_h.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(h.shape)

        #   Adjust the prior boxes with the predictions:
        #   first offset the centers towards the bottom right,
        #   then adjust the widths and heights.
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        pred_boxes[..., 0] = x.data * 2. - 0.5 + grid_x
        pred_boxes[..., 1] = y.data * 2. - 0.5 + grid_y
        pred_boxes[..., 2] = (w.data * 2) ** 2 * anchor_w
        pred_boxes[..., 3] = (h.data * 2) ** 2 * anchor_h

        #   Normalize the outputs to fractions of the input size
        _scale = torch.Tensor([input_width, input_height, input_width, input_height]).type(FloatTensor)
        output = torch.cat((pred_boxes.view(batch_size, -1, 4) / _scale,
                            conf.view(batch_size, -1, 1),
                            pred_cls.view(batch_size, -1, self.num_classes)), -1)
        outputs.append(output.data)
    return outputs

2. Score filtering and non-maximum suppression
After obtaining the final prediction results, score filtering and non-maximum suppression must still be carried out.
Score filtering keeps only the prediction boxes whose scores exceed the confidence threshold.
Non-maximum suppression keeps only the highest-scoring box of each class within a given region.
The process of score filtering and non-maximum suppression can be summarized as follows:
1. Find the boxes in the image whose score is greater than the threshold. Filtering by score before filtering overlapping boxes significantly reduces the number of boxes.
2. Loop over the classes. The role of non-maximum suppression is to keep the highest-scoring box of the same class in a given region; looping over the classes lets us apply non-maximum suppression to each class separately.
3. Sort the boxes of the current class by score from high to low.
4. Each time, take the box with the highest score, compute its overlap with all the other prediction boxes, and remove those whose overlap is too large.
The results of score filtering and non-maximum suppression can then be used to draw the prediction boxes.
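The four steps can also be sketched on their own with torchvision's built-in nms. This is a simplified illustration only, not the project's actual function (which, as shown below, additionally converts box formats and rescales boxes back to the original image):

import torch
from torchvision.ops import nms

def filter_and_nms(boxes, obj_conf, class_probs, conf_thres=0.5, nms_thres=0.4):
    # boxes: (N, 4) as x1, y1, x2, y2; obj_conf: (N,); class_probs: (N, num_classes)
    class_conf, class_pred = torch.max(class_probs, dim=1)
    scores = obj_conf * class_conf
    # 1. keep only boxes whose score exceeds the confidence threshold
    mask = scores >= conf_thres
    boxes, scores, class_pred = boxes[mask], scores[mask], class_pred[mask]
    kept = []
    # 2. loop over the classes present in the remaining predictions
    for c in class_pred.unique():
        idx = torch.where(class_pred == c)[0]
        # 3-4. nms sorts this class's boxes by score and drops any box whose IoU
        #      with a higher-scoring box of the same class exceeds nms_thres
        kept.append(idx[nms(boxes[idx], scores[idx], nms_thres)])
    return torch.cat(kept) if kept else torch.empty(0, dtype=torch.long)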
The following figure shows the result with non-maximum suppression applied.
The following figure shows the result without non-maximum suppression.
The implementation code is:
def non_max_suppression(self, prediction, num_classes, input_shape, image_shape, letterbox_image, conf_thres=0.5, nms_thres=0.4):
    #   nms below is torchvision.ops.nms, imported at the top of the file.
    #   Convert the predictions to top-left / bottom-right corner format.
    #   prediction  [batch_size, num_anchors, 85]
    box_corner = prediction.new(prediction.shape)
    box_corner[:, :, 0] = prediction[:, :, 0] - prediction[:, :, 2] / 2
    box_corner[:, :, 1] = prediction[:, :, 1] - prediction[:, :, 3] / 2
    box_corner[:, :, 2] = prediction[:, :, 0] + prediction[:, :, 2] / 2
    box_corner[:, :, 3] = prediction[:, :, 1] + prediction[:, :, 3] / 2
    prediction[:, :, :4] = box_corner[:, :, :4]

    output = [None for _ in range(len(prediction))]
    for i, image_pred in enumerate(prediction):
        #   Take the max over the class predictions.
        #   class_conf  [num_anchors, 1]    class confidence
        #   class_pred  [num_anchors, 1]    class index
        class_conf, class_pred = torch.max(image_pred[:, 5:5 + num_classes], 1, keepdim=True)

        #   First round of filtering using the confidence
        conf_mask = (image_pred[:, 4] * class_conf[:, 0] >= conf_thres).squeeze()

        #   Filter the predictions by confidence
        image_pred = image_pred[conf_mask]
        class_conf = class_conf[conf_mask]
        class_pred = class_pred[conf_mask]
        if not image_pred.size(0):
            continue
        #   detections  [num_anchors, 7]
        #   7 = x1, y1, x2, y2, obj_conf, class_conf, class_pred
        detections = torch.cat((image_pred[:, :5], class_conf.float(), class_pred.float()), 1)

        #   All classes contained in the predictions
        unique_labels = detections[:, -1].cpu().unique()

        if prediction.is_cuda:
            unique_labels = unique_labels.cuda()
            detections = detections.cuda()

        for c in unique_labels:
            #   Predictions of this class after score filtering
            detections_class = detections[detections[:, -1] == c]

            #   Using the official (torchvision) non-maximum suppression is faster!
            keep = nms(
                detections_class[:, :4],
                detections_class[:, 4] * detections_class[:, 5],
                nms_thres
            )
            max_detections = detections_class[keep]

            # # Sort by object confidence
            # _, conf_sort_index = torch.sort(detections_class[:, 4] * detections_class[:, 5], descending=True)
            # detections_class = detections_class[conf_sort_index]
            # # Non-maximum suppression
            # max_detections = []
            # while detections_class.size(0):
            #     # Take the box with the highest confidence and remove the boxes
            #     # whose overlap with it is greater than nms_thres
            #     max_detections.append(detections_class[0].unsqueeze(0))
            #     if len(detections_class) == 1:
            #         break
            #     ious = bbox_iou(max_detections[-1], detections_class[1:])
            #     detections_class = detections_class[1:][ious < nms_thres]
            # # Stack
            # max_detections = torch.cat(max_detections).data

            # Add max detections to outputs
            output[i] = max_detections if output[i] is None else torch.cat((output[i], max_detections))

        if output[i] is not None:
            output[i] = output[i].cpu().numpy()
            box_xy, box_wh = (output[i][:, 0:2] + output[i][:, 2:4]) / 2, output[i][:, 2:4] - output[i][:, 0:2]
            output[i][:, :4] = self.yolo_correct_boxes(box_xy, box_wh, input_shape, image_shape, letterbox_image)
    return output

4. The training part
1. What is needed to compute the loss
Computing the loss is actually a comparison between the network's predictions and the ground truth. Like the predictions, the network's loss consists of three parts: the Reg part, the Obj part and the Cls part. The Reg part judges the regression parameters of the feature points, the Obj part judges whether a feature point contains an object, and the Cls part judges the class of the object contained by a feature point.
2. The positive sample matching process
In YoloV5, the matching of positive samples during training can be divided into two parts:
a. matching the prior boxes;
b. matching the feature points.
Positive sample matching means finding which prior boxes are considered to correspond to a ground-truth box and are responsible for predicting it.
a. Matching the prior boxes
The YoloV5 network designs 9 prior boxes of different sizes in total; each output feature layer corresponds to 3 prior boxes.
For any ground-truth box gt, YoloV5 no longer uses IoU for positive sample matching. Instead it matches directly by the width-height ratio, computing the ratios between the ground-truth box and the 9 prior boxes of different sizes.
If the ratio between a ground-truth box and a prior box is greater than the set threshold, the match between that ground-truth box and that prior box is considered poor and the prior box is treated as a negative sample.
For example, suppose there is a ground-truth box whose width and height are [200, 200], i.e. a square. The 9 default prior boxes of YoloV5 are [10,13], [16,30], [33,23], [30,61], [62,45], [59,119], [116,90], [156,198], [373,326], and the threshold is set to 4.
We now need to compute the width-height ratios between this ground-truth box and the 9 prior boxes. There are two cases when comparing: the ground-truth box's width and height may be larger than the prior box's, or the prior box's may be larger than the ground-truth box's. We therefore compute both the ground-truth width and height divided by the prior box width and height, and the prior box width and height divided by the ground-truth width and height, and then take the maximum among them.
The following list is the comparison result, a matrix of shape [9, 4]: 9 stands for the 9 prior boxes and 4 for the two ratios in each direction.
[[20.         15.38461538  0.05        0.065     ]
 [12.5         6.66666667  0.08        0.15      ]
 [ 6.06060606  8.69565217  0.165       0.115     ]
 [ 6.66666667  3.27868852  0.15        0.305     ]
 [ 3.22580645  4.44444444  0.31        0.225     ]
 [ 3.38983051  1.68067227  0.295       0.595     ]
 [ 1.72413793  2.22222222  0.58        0.45      ]
 [ 1.28205128  1.01010101  0.78        0.99      ]
 [ 0.53619303  0.61349693  1.865       1.63      ]]
Taking the maximum of each prior box's comparison result gives the following vector:
[20.         12.5         8.69565217  6.66666667  4.44444444  3.38983051
  2.22222222  1.28205128  1.865     ]
We then check which prior boxes have a comparison value smaller than the threshold. The four prior boxes [59,119], [116,90], [156,198] and [373,326] all satisfy the requirement.
[116,90], [156,198] and [373,326] belong to the 20,20 feature layer.
[59,119] belongs to the 40,40 feature layer.
At this point we know which prior box sizes can be used to predict this ground-truth box.
b. Matching the feature points
In the past Yolo series, each ground-truth box was predicted by the top-left feature point of the grid cell containing its center.
For the selected feature layer, we first compute which grid cell the ground-truth box falls into; the top-left feature point of that cell is then one of the feature points responsible for the prediction.
At the same time, using a rounding rule, we find the two nearest neighboring cells; all three cells are considered responsible for predicting this ground-truth box.
The red point indicates the center of the ground-truth box; besides the cell it falls in, its two nearest neighboring cells are also selected. From this we can see that the value range of the XY offset of the prediction box is no longer 0-1 but -0.5 to 1.5.
After the corresponding feature points are found, the prior boxes selected in step a at those feature points are responsible for predicting the ground-truth box.
3. Computing the loss
From the first part, the loss of YoloV5 consists of three parts:
1. The Reg part. From part 2 we know which prior box corresponds to each ground-truth box. After obtaining the prior box corresponding to each box, we take the prediction box corresponding to that prior box and compute the CIOU loss between the ground-truth box and the prediction box as the Reg component of the loss.
2. The Obj part. From part 2 we know which prior box corresponds to each ground-truth box. All prior boxes corresponding to ground-truth boxes are positive samples, and the remaining prior boxes are negative samples. Based on the positive and negative samples and on the network's predictions of whether each feature point contains an object, we compute a cross-entropy loss as the Obj component of the loss.
3. The Cls part. From part 2 we know which prior box corresponds to each ground-truth box. After obtaining the prior box corresponding to each box, we take that prior box's class predictions and compute a cross-entropy loss between them and the ground-truth class as the Cls component of the loss.
The implementation is as follows:

import math

import numpy as np
import torch
import torch.nn as nn


class YOLOLoss(nn.Module):
    def __init__(self, anchors, num_classes, input_shape, cuda, anchors_mask=[[6, 7, 8], [3, 4, 5], [0, 1, 2]], label_smoothing=0):
        super(YOLOLoss, self).__init__()
        #   The 13x13 feature layer corresponds to the anchors [142, 110], [192, 243], [459, 401]
        #   The 26x26 feature layer corresponds to the anchors [36, 75], [76, 55], [72, 146]
        #   The 52x52 feature layer corresponds to the anchors [12, 16], [19, 36], [40, 28]
        self.anchors = anchors
        self.num_classes = num_classes
        self.bbox_attrs = 5 + num_classes
        self.input_shape = input_shape
        self.anchors_mask = anchors_mask
        self.label_smoothing = label_smoothing

        self.threshold = 4

        self.balance = [0.4, 1.0, 4]
        self.box_ratio = 5
        self.cls_ratio = 0.5
        self.obj_ratio = 1
        self.cuda = cuda

    def clip_by_tensor(self, t, t_min, t_max):
        t = t.float()
        result = (t >= t_min).float() * t + (t < t_min).float() * t_min
        result = (result <= t_max).float() * result + (result > t_max).float() * t_max
        return result

    def MSELoss(self, pred, target):
        return torch.pow(pred - target, 2)

    def BCELoss(self, pred, target):
        epsilon = 1e-7
        pred = self.clip_by_tensor(pred, epsilon, 1.0 - epsilon)
        output = -target * torch.log(pred) - (1.0 - target) * torch.log(1.0 - pred)
        return output

    def box_giou(self, b1, b2):
        """
        Input:
            b1: tensor, shape=(batch, feat_w, feat_h, anchor_num, 4), xywh
            b2: tensor, shape=(batch, feat_w, feat_h, anchor_num, 4), xywh
        Returns:
            giou: tensor, shape=(batch, feat_w, feat_h, anchor_num, 1)
        """
        #   Top-left and bottom-right corners of the prediction boxes
        b1_xy = b1[..., :2]
        b1_wh = b1[..., 2:4]
        b1_wh_half = b1_wh / 2.
        b1_mins = b1_xy - b1_wh_half
        b1_maxes = b1_xy + b1_wh_half
        #   Top-left and bottom-right corners of the ground-truth boxes
        b2_xy = b2[..., :2]
        b2_wh = b2[..., 2:4]
        b2_wh_half = b2_wh / 2.
        b2_mins = b2_xy - b2_wh_half
        b2_maxes = b2_xy + b2_wh_half

        #   IoU between the ground-truth boxes and the prediction boxes
        intersect_mins = torch.max(b1_mins, b2_mins)
        intersect_maxes = torch.min(b1_maxes, b2_maxes)
        intersect_wh = torch.max(intersect_maxes - intersect_mins, torch.zeros_like(intersect_maxes))
        intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
        b1_area = b1_wh[..., 0] * b1_wh[..., 1]
        b2_area = b2_wh[..., 0] * b2_wh[..., 1]
        union_area = b1_area + b2_area - intersect_area
        iou = intersect_area / union_area

        #   Smallest box enclosing both boxes
        enclose_mins = torch.min(b1_mins, b2_mins)
        enclose_maxes = torch.max(b1_maxes, b2_maxes)
        enclose_wh = torch.max(enclose_maxes - enclose_mins, torch.zeros_like(intersect_maxes))
        #   GIoU
        enclose_area = enclose_wh[..., 0] * enclose_wh[..., 1]
        giou = iou - (enclose_area - union_area) / enclose_area
        return giou

    #   Label smoothing
    def smooth_labels(self, y_true, label_smoothing, num_classes):
        return y_true * (1.0 - label_smoothing) + label_smoothing / num_classes

    def forward(self, l, input, targets=None):
        #   l        index of the effective feature layer being used
        #   input    shape: bs, 3*(5+num_classes), 13, 13
        #                   bs, 3*(5+num_classes), 26, 26
        #                   bs, 3*(5+num_classes), 52, 52
        #   targets  ground-truth labels  [batch_size, num_gt, 5]
        #   Number of images, height and width of the feature layer
        bs = input.size(0)
        in_h = input.size(2)
        in_w = input.size(3)
        #   Stride: how many pixels of the original image one feature point covers.
        #   13x13 -> 32 pixels, 26x26 -> 16 pixels, 52x52 -> 8 pixels
        #   stride_h = stride_w = 32, 16, 8
        stride_h = self.input_shape[0] / in_h
        stride_w = self.input_shape[1] / in_w
        #   The scaled_anchors obtained here are relative to the feature layer
        scaled_anchors = [(a_w / stride_w, a_h / stride_h) for a_w, a_h in self.anchors]
        #   Reshape the input:
        #   bs, 3*(5+num_classes), 13, 13 => batch_size, 3, 13, 13, 5 + num_classes
        prediction = input.view(bs, len(self.anchors_mask[l]), self.bbox_attrs, in_h, in_w).permute(0, 1, 3, 4, 2).contiguous()

        #   Adjustment parameters of the prior box centers
        x = torch.sigmoid(prediction[..., 0])
        y = torch.sigmoid(prediction[..., 1])
        #   Width and height adjustment parameters of the prior boxes
        w = torch.sigmoid(prediction[..., 2])
        h = torch.sigmoid(prediction[..., 3])
        #   Objectness confidence: is there an object?
        conf = torch.sigmoid(prediction[..., 4])
        #   Class confidences
        pred_cls = torch.sigmoid(prediction[..., 5:])
        #   The targets the network should have predicted
        y_true, noobj_mask, box_loss_scale = self.get_target(l, targets, scaled_anchors, in_h, in_w)
        #   Decode the predictions so their overlap with the ground truth can be judged
        pred_boxes = self.get_pred_boxes(l, x, y, h, w, targets, scaled_anchors, in_h, in_w)

        if self.cuda:
            y_true = y_true.cuda()
            noobj_mask = noobj_mask.cuda()
            box_loss_scale = box_loss_scale.cuda()
        #   y_true[..., 2:3] and y_true[..., 3:4] are the normalized width and height of the
        #   ground-truth boxes (between 0 and 1).  Large boxes get a smaller weight,
        #   small boxes a larger one.
        box_loss_scale = 2 - box_loss_scale
        #   GIoU between the predictions and the ground truth
        giou = self.box_giou(pred_boxes[y_true[..., 4] == 1], y_true[..., :4][y_true[..., 4] == 1])
        loss_loc = torch.sum((1 - giou) * box_loss_scale[y_true[..., 4] == 1])
        #   Confidence loss
        loss_conf = torch.sum(self.BCELoss(conf[y_true[..., 4] == 1], giou.detach().clamp(0))) + \
                    torch.sum(self.BCELoss(conf, y_true[..., 4]) * noobj_mask)
        loss_cls = torch.sum(self.BCELoss(pred_cls[y_true[..., 4] == 1],
                                          self.smooth_labels(y_true[..., 5:][y_true[..., 4] == 1], self.label_smoothing, self.num_classes)))

        loss = loss_loc * self.box_ratio + loss_conf * self.balance[l] * self.obj_ratio + loss_cls * self.cls_ratio
        num_pos = torch.sum(y_true[..., 4])
        num_pos = torch.max(num_pos, torch.ones_like(num_pos))
        return loss, num_pos

    def get_near_points(self, x, y, i, j):
        sub_x = x - i
        sub_y = y - j
        if sub_x > 0.5 and sub_y > 0.5:
            return [[0, 0], [1, 0], [0, 1]]
        elif sub_x < 0.5 and sub_y > 0.5:
            return [[0, 0], [-1, 0], [0, 1]]
        elif sub_x < 0.5 and sub_y < 0.5:
            return [[0, 0], [-1, 0], [0, -1]]
        else:
            return [[0, 0], [1, 0], [0, -1]]

    def get_target(self, l, targets, anchors, in_h, in_w):
        #   Number of images
        bs = len(targets)
        #   Marks which prior boxes contain no object
        noobj_mask = torch.ones(bs, len(self.anchors_mask[l]), in_h, in_w, requires_grad=False)
        #   Makes the network pay more attention to small objects
        box_loss_scale = torch.zeros(bs, len(self.anchors_mask[l]), in_h, in_w, requires_grad=False)
        #   anchors_best_ratio
        box_best_ratio = torch.zeros(bs, len(self.anchors_mask[l]), in_h, in_w, requires_grad=False)
        #   batch_size, 3, 13, 13, 5 + num_classes
        y_true = torch.zeros(bs, len(self.anchors_mask[l]), in_h, in_w, self.bbox_attrs, requires_grad=False)
        for b in range(bs):
            if len(targets[b]) == 0:
                continue
            batch_target = torch.zeros_like(targets[b])
            #   Centers of the positive samples on the feature layer
            batch_target[:, [0, 2]] = targets[b][:, [0, 2]] * in_w
            batch_target[:, [1, 3]] = targets[b][:, [1, 3]] * in_h
            batch_target[:, 4] = targets[b][:, 4]
            batch_target = batch_target.cpu()

            #   batch_target          : num_true_box, 4
            #   anchors               : 9, 2
            #   ratios_of_gt_anchors  : num_true_box, 9, 2
            #   ratios_of_anchors_gt  : num_true_box, 9, 2
            #   ratios                : num_true_box, 9, 4
            #   max_ratios            : num_true_box, 9
            ratios_of_gt_anchors = torch.unsqueeze(batch_target[:, 2:4], 1) / torch.unsqueeze(torch.FloatTensor(anchors), 0)
            ratios_of_anchors_gt = torch.unsqueeze(torch.FloatTensor(anchors), 0) / torch.unsqueeze(batch_target[:, 2:4], 1)
            ratios = torch.cat([ratios_of_gt_anchors, ratios_of_anchors_gt], dim=-1)
            max_ratios, _ = torch.max(ratios, dim=-1)

            for t, ratio in enumerate(max_ratios):
                #   ratio : 9
                over_threshold = ratio < self.threshold
                over_threshold[torch.argmin(ratio)] = True
                for k, mask in enumerate(self.anchors_mask[l]):
                    if not over_threshold[mask]:
                        continue
                    #   Grid cell the ground-truth box belongs to
                    i = torch.floor(batch_target[t, 0]).long()
                    j = torch.floor(batch_target[t, 1]).long()

                    offsets = self.get_near_points(batch_target[t, 0], batch_target[t, 1], i, j)
                    for offset in offsets:
                        local_i = i + offset[0]
                        local_j = j + offset[1]

                        if local_i >= in_w or local_i < 0 or local_j >= in_h or local_j < 0:
                            continue

                        if box_best_ratio[b, k, local_j, local_i] != 0:
                            if box_best_ratio[b, k, local_j, local_i] > ratio[mask]:
                                y_true[b, k, local_j, local_i, :] = 0
                            else:
                                continue

                        #   Class of the ground-truth box
                        c = batch_target[t, 4].long()
                        #   noobj_mask marks the feature points without an object
                        noobj_mask[b, k, local_j, local_i] = 0
                        #   tx, ty: ground-truth values of the center adjustment parameters
                        y_true[b, k, local_j, local_i, 0] = batch_target[t, 0]
                        y_true[b, k, local_j, local_i, 1] = batch_target[t, 1]
                        y_true[b, k, local_j, local_i, 2] = batch_target[t, 2]
                        y_true[b, k, local_j, local_i, 3] = batch_target[t, 3]
                        y_true[b, k, local_j, local_i, 4] = 1
                        y_true[b, k, local_j, local_i, c + 5] = 1
                        #   xywh proportion: large objects get a small loss weight,
                        #   small objects a large one
                        box_loss_scale[b, k, local_j, local_i] = batch_target[t, 2] * batch_target[t, 3] / in_w / in_h
                        #   Best ratio of the current prior box
                        box_best_ratio[b, k, local_j, local_i] = ratio[mask]
        return y_true, noobj_mask, box_loss_scale

    def get_pred_boxes(self, l, x, y, h, w, targets, scaled_anchors, in_h, in_w):
        #   Number of images
        bs = len(targets)

        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor
        #   Generate the grid; the prior box centers sit at the top-left corners of the cells
        grid_x = torch.linspace(0, in_w - 1, in_w).repeat(in_h, 1).repeat(
            int(bs * len(self.anchors_mask[l])), 1, 1).view(x.shape).type(FloatTensor)
        grid_y = torch.linspace(0, in_h - 1, in_h).repeat(in_w, 1).t().repeat(
            int(bs * len(self.anchors_mask[l])), 1, 1).view(y.shape).type(FloatTensor)

        #   Generate the prior box widths and heights
        scaled_anchors_l = np.array(scaled_anchors)[self.anchors_mask[l]]
        anchor_w = FloatTensor(scaled_anchors_l).index_select(1, LongTensor([0]))
        anchor_h = FloatTensor(scaled_anchors_l).index_select(1, LongTensor([1]))

        anchor_w = anchor_w.repeat(bs, 1).repeat(1, 1, in_h * in_w).view(w.shape)
        anchor_h = anchor_h.repeat(bs, 1).repeat(1, 1, in_h * in_w).view(h.shape)
        #   Compute the adjusted centers and widths/heights of the prior boxes
        pred_boxes_x = torch.unsqueeze(x * 2. - 0.5 + grid_x, -1)
        pred_boxes_y = torch.unsqueeze(y * 2. - 0.5 + grid_y, -1)
        pred_boxes_w = torch.unsqueeze((w * 2) ** 2 * anchor_w, -1)
        pred_boxes_h = torch.unsqueeze((h * 2) ** 2 * anchor_h, -1)
        pred_boxes = torch.cat([pred_boxes_x, pred_boxes_y, pred_boxes_w, pred_boxes_h], dim=-1)
        return pred_boxes

Training your own YoloV5 model
First, go to Github and download the corresponding repository. After downloading, unzip it with decompression software and open the folder with your programming software.
Note that the opened root directory must be the directory where the files are stored; if the root directory is wrong, the relative paths will be incorrect and the code will not run.
I. Preparation of the dataset
This article uses the VOC format for training. You need to prepare your own dataset before training; if you do not have one, you can download the VOC12+07 dataset through the Github link.
Before training, place the label files in the Annotation folder under VOC2007 inside the VOCdevkit folder.
Before training, place the image files in the JPEGImages folder under VOC2007 inside the VOCdevkit folder.
At this point, the placement of the dataset is complete.
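Laid out as a directory tree, the structure described above (only the folders mentioned here) looks like this:

VOCdevkit/
    VOC2007/
        Annotation/     # label files (VOC XML)
        JPEGImages/     # image files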
II. Processing of the dataset
After placing the dataset, we need to process it in order to obtain 2007_train.txt and 2007_val.txt for training; this uses voc_annotation.py in the root directory.
There are some parameters to be set in voc_annotation.py.
They are annotation_mode, classes_path, trainval_percent, train_percent and VOCdevkit_path. For your first training run you only need to modify classes_path.
trainval_percent is used to specify the ratio of (training set + validation set) to test set; by default (training set + validation set) : test set = 9:1. train_percent is used to specify the ratio of training set to validation set within (training set + validation set); by default training set : validation set = 9:1. These two options are only valid when annotation_mode is 0 or 1.
trainval_percent = 0.9
train_percent = 0.9
VOCdevkit_path points to the folder where the VOC dataset is located; by default it points to the VOC dataset under the root directory.
VOCdevkit_path = 'VOCdevkit'
classes_path is used to point to the txt file corresponding to the detection classes. Taking the VOC dataset as an example, the txt we use is model_data/voc_classes.txt.
When training your own dataset, you can create your own cls_classes.txt and write in it the classes you need to distinguish.
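For example, for a hypothetical dataset that only distinguishes cats and dogs, cls_classes.txt would simply list one class name per line:

cat
dog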
III. Start network training
With voc_annotation.py we have generated 2007_train.txt and 2007_val.txt, so training can now begin.
There are many training parameters; you can read the comments carefully after downloading the repository. The most important one is still classes_path in train.py.
classes_path points to the txt file corresponding to the detection classes, the same txt as in voc_annotation.py! It must be modified when training on your own dataset!
After modifying classes_path, you can run train.py to start training. After training for multiple epochs, the weights will be generated in the logs folder.
The other parameters are used as follows:
#-------------------------------#
#   Whether to use Cuda.
#   Set it to False if you have no GPU.
#-------------------------------#
Cuda = True
#--------------------------------------------------------#
#   Be sure to modify classes_path before training so that
#   it corresponds to your own dataset.
#--------------------------------------------------------#
classes_path = 'model_data/voc_classes.txt'
#---------------------------------------------------------------------#
#   anchors_path is the txt file of the prior boxes; it is generally not modified.
#   anchors_mask helps the code find the corresponding prior boxes and is generally not modified.
#---------------------------------------------------------------------#
anchors_path = 'model_data/yolo_anchors.txt'
anchors_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
#----------------------------------------------------------------------------------------------------------------------------#
#   See the README for downloading the weight file; it can be downloaded from the network disk.
#   The pre-trained weights of the model are universal across datasets, because the features are universal.
#   The important part of the pre-trained weights is the backbone feature extraction network, which is used to extract features.
#   Pre-trained weights must be used in 99% of cases; without them the backbone weights are too random,
#   feature extraction is not effective, and the training results will not be good.
#
#   If training is interrupted, model_path can be set to a weight file in the logs folder to reload the partially trained weights.
#   At the same time, modify the parameters of the freezing or unfreezing stage below to keep the model's epochs continuous.
#
#   When model_path = '', the weights of the whole model are not loaded.
#
#   Here the weights of the whole model are used, so they are loaded in train.py.
#   If you want the model to train from scratch, set model_path = '' and Freeze_Train = False below;
#   training then starts from zero with no frozen-backbone stage.
#   Generally speaking, training from scratch performs poorly, because the weights are too random
#   and feature extraction is not effective.
#
#   Networks are generally not trained from scratch; at least the backbone weights are used. Some papers mention that
#   pre-training is unnecessary, mainly because their datasets are large and their parameter-tuning ability is excellent.
#   If you must train the network's backbone, you can look into the imagenet dataset: first train a classification model,
#   whose backbone is shared with this model, and train on that basis.
#----------------------------------------------------------------------------------------------------------------------------#
model_path = 'model_data/yolov5_s.pth'
#------------------------------------------------------#
#   The input shape; it must be a multiple of 32.
#------------------------------------------------------#
input_shape = [640, 640]
#------------------------------------------------------#
#   The version of YoloV5 used: s, m, l, x
#------------------------------------------------------#
phi = 's'
#------------------------------------------------------#
#   Tricks applied from YoloV4:
#   mosaic            mosaic data augmentation, True or False
#                     mosaic augmentation was unstable in actual testing, so the default is False
#   Cosine_lr         cosine annealing learning rate, True or False
#   label_smoothing   label smoothing, generally below 0.01, e.g. 0.01 or 0.005
#------------------------------------------------------#
mosaic = False
Cosine_lr = False
label_smoothing = 0

#----------------------------------------------------#
#   Training is divided into two stages: the freezing stage and the unfreezing stage.
#   Insufficient video memory has nothing to do with the dataset size; if memory is insufficient, reduce batch_size.
#   Affected by the BatchNorm layer, the minimum batch_size is 2; it cannot be 1.
#----------------------------------------------------#
#----------------------------------------------------#
#   Training parameters for the freezing stage.
#   The backbone of the model is frozen, so the feature extraction network does not change.
#   It occupies little video memory; only the rest of the network is fine-tuned.
#----------------------------------------------------#
Init_Epoch = 0
Freeze_Epoch = 50
Freeze_batch_size = 16
Freeze_lr = 1e-3
#----------------------------------------------------#
#   Training parameters for the unfreezing stage.
#   The backbone of the model is no longer frozen, so the feature extraction network changes.
#   It occupies a lot of video memory; all the parameters of the network change.
#----------------------------------------------------#
UnFreeze_Epoch = 100
Unfreeze_batch_size = 8
Unfreeze_lr = 1e-4
#------------------------------------------------------#
#   Whether to use freeze training.
#   By default the backbone is frozen first and unfrozen afterwards.
#------------------------------------------------------#
Freeze_Train = True
#------------------------------------------------------#
#   Whether to use multi-threaded data loading.
#   Enabling it speeds up data reading but uses more memory.
#   Computers with little memory can set this to 2 or 0.
#------------------------------------------------------#
num_workers = 4
#----------------------------------------------------#
#   Get the image paths and labels
#----------------------------------------------------#
train_annotation_path = '2007_train.txt'
val_annotation_path = '2007_val.txt'

IV. Prediction of training results
Two files, yolo.py and predict.py, are used to predict the training results.
First, we need to modify model_path and classes_path in yolo.py. These two parameters must be modified.
model_path points to the trained weight file, which is in the logs folder.
classes_path points to the txt corresponding to the detection classes.
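For instance, the two values might end up looking like the following (a sketch only: the weight file name is a hypothetical example of what training writes into the logs folder):

model_path   = 'logs/ep100-loss0.052-val_loss0.061.pth'   # hypothetical weight file produced by train.py
classes_path = 'model_data/cls_classes.txt'               # the same class txt used by voc_annotation.py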
After completing the modification, you can run predict.py for detection. Once it is running, enter an image path to detect objects in that image.