Artificial intelligence is driving a new round of business change, and algorithms are a core underlying technology powering it. In an era of rising algorithms and a fast-moving technology wave, algorithm engineers can keep pace only by continually improving their own skills. Recently, Singularity Cloud algorithm engineer "Deltoid" made a new breakthrough in the field of object detection algorithms.
Abstract
Convolutional neural networks have significantly improved object detection accuracy: the deeper the network, the larger the accuracy gain, but also the more floating-point computation required. Through knowledge distillation, many researchers transfer knowledge from a deeper, larger teacher network to a small student network in order to improve the student's detection performance. Most knowledge distillation methods, however, require carefully designed cost functions and target two-stage detection algorithms. This paper proposes a clean and effective knowledge distillation scheme for one-stage object detection: the feature maps generated by the teacher network are treated as real samples, the feature maps generated by the student network as fake samples, and the two are trained adversarially in order to improve the student network's performance in one-stage object detection.
1 Introduction
In recent years, with the development of object detection algorithms, researchers have found that the deeper and larger the convolutional neural network used as the backbone, the higher the accuracy of the detection algorithm. As detection accuracy has improved, vision-based detection has gradually moved from non-critical domains into critical ones (such as autonomous driving and medicine). However, to guarantee detection accuracy, ever larger backbones must be used, which lowers detection speed and raises the cost of computing hardware. Many methods have therefore been proposed to improve detection speed while preserving accuracy, such as reducing the number of floating-point operations through depthwise separable convolutions [1,2], or through pointwise group convolution and channel shuffle [3,4], cutting computation while maintaining the accuracy and capacity of the backbone network. Although these methods achieve considerable speed-ups, they require careful design and tuning of the backbone. Many researchers believe that although a deeper backbone has larger capacity and performs better on tasks such as image classification and object detection, some specific tasks do not need such a large capacity; hence, provided the network's accuracy is preserved, it can be compressed, quantized, or channel-pruned [5, 6, 7, 8, 9].
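As an illustration of the first family of speed-ups, here is a minimal sketch (in PyTorch, not the paper's GluonCV code; the channel sizes are arbitrary) contrasting a standard 3x3 convolution with its depthwise separable factorization:

```python
import torch
import torch.nn as nn

in_ch, out_ch = 64, 128

# Standard convolution: 3*3*in_ch*out_ch multiply-accumulates per output pixel.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

# Depthwise separable: a per-channel 3x3 depthwise conv followed by a 1x1
# pointwise conv; roughly an 8-9x reduction in FLOPs and parameters here.
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depthwise
    nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # pointwise
)

x = torch.randn(1, in_ch, 32, 32)
assert standard(x).shape == separable(x).shape  # same output shape

params = lambda m: sum(p.numel() for p in m.parameters())
print(f"standard: {params(standard)} params, separable: {params(separable)} params")
```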
On the other hand, work on knowledge distillation [10, 11, 12, 13] shows that if a deeper, larger model is fully trained as a teacher net and a relatively shallow model is chosen as a student net, then training the student with the teacher's outputs or intermediate results as soft labels, combined with the true labels of the real samples, can greatly improve the student's performance on specific tasks. However, most of these methods require very complex cost functions and training procedures, and they are mostly applied to image classification and two-stage object detection, rarely to one-stage detection. A simpler, more effective knowledge distillation method applicable to one-stage object detection is therefore needed. This paper proposes a simple and effective knowledge distillation architecture that significantly improves the student net's performance in one-stage detection. Unlike conventional knowledge distillation, and drawing on the generative adversarial network architecture [14], we take the backbones of a heavy detection network and a light detection network as the teacher net and student net respectively. The feature maps generated by the teacher net serve as real samples, while the student net acts as the generator and its feature maps serve as fake samples. Finally, a neural network is designed as a discriminator over the real and fake samples, and the two are trained adversarially.
The main contributions of this paper are as follows:
(1) A network architecture that requires no complex hand-designed cost function is proposed, and it can be applied to one-stage object detection.
(2) A generative adversarial architecture is used to avoid complex knowledge-transfer design, letting the student net automatically acquire dark knowledge from the teacher net.
2 Related Works
Deep learning object detection architectures fall into two main families. One is one-stage detection, such as SSD [15] proposed by Liu et al., which directly regresses object positions and categories with a convolutional neural network. The other is two-stage detection, such as Fast R-CNN [16] proposed by Girshick et al. and the later Faster R-CNN [17] and R-FCN [18], which first regress candidate boxes with a convolutional network, then classify each candidate box and refine its position.
Network pruning: many researchers believe deep neural networks are over-parameterized, with many redundant neurons and connections. He et al. view the neurons in each CNN layer as sparse and use LASSO regression to find the most representative neurons in each layer, from which the layer's output is reconstructed. Zhuang et al. [9] argue that layer-by-layer channel pruning hurts the discriminative power of the CNN, and preserve each layer's discriminative power by adding auxiliary losses during the fine-tuning and pruning stages.
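A toy sketch of the LASSO-based channel selection idea, on synthetic data rather than real CNN activations (channel indices and coefficients below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_channels = 1000, 32

# X[:, c] stands in for the contribution of channel c to a layer's (flattened)
# output; here the "true" layer only uses a sparse subset of channels.
X = rng.standard_normal((n_samples, n_channels))
true_coef = np.zeros(n_channels)
true_coef[[3, 7, 21]] = [1.5, -2.0, 0.8]
y = X @ true_coef + 0.01 * rng.standard_normal(n_samples)

# The L1 penalty drives most channel coefficients to exactly zero; zeroed
# channels are pruned and the kept ones reconstruct the layer output.
lasso = Lasso(alpha=0.05).fit(X, y)
kept = np.flatnonzero(lasso.coef_)
print("channels kept:", kept)  # ideally [3, 7, 21]
```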
Network quantization: Wu et al. [20] accelerate and compress a model's convolutional and fully connected layers with a k-means clustering algorithm, achieve better quantization by reducing the estimation error of each layer's output response, and propose an effective training scheme that suppresses the error accumulated across quantized layers. Jacob et al. [21] propose quantizing weights and inputs to uint8 and biases to uint32, using the quantized forward pass during training while leaving the backward error correction unquantized, so as to speed up inference while preserving CNN performance.
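A minimal sketch of uint8 affine quantization in the spirit of the scheme described above (the scale/zero-point convention is a common one and is an assumption here; the quantization-aware training machinery is omitted):

```python
import numpy as np

def quantize_uint8(x: np.ndarray):
    """Map float values to uint8 with an affine scale and zero point."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize_uint8(x)
print("max abs error:", np.abs(dequantize(q, s, z) - x).max())  # about scale/2
```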
Knowledge distillation is a way to compress a model while preserving accuracy. Hinton et al. [10] proposed taking the teacher net's output as a soft label and advocated a temperature-scaled cross-entropy instead of an L2 loss. Romero et al. [19] argue that more unlabeled data should be fed to the student net to mimic, so that the student can approach the teacher. Chen et al. [12], when optimizing a two-stage detection network, extract dark knowledge from the teacher's intermediate feature maps and from the RPN/RCNN respectively, and have the student mimic them. Other researchers transfer the teacher net's attention information to the student network: Zagoruyko et al. [22] proposed spatial attention to pass the teacher's heat maps to the student, and Yim et al. [23] take the relationships between the teacher's layers as the student's mimicry target. However, these distillation schemes require very complex loss functions and dark-knowledge extraction methods, and they are mostly used in two-stage detection algorithms, rarely in one-stage detection. To obtain a simple and effective form of knowledge distillation, we follow the generative adversarial network architecture [14]: the feature maps generated by the teacher network are taken as real samples, those generated by the student network as fake samples, and the two are trained adversarially to improve the student network's performance in one-stage detection.
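Hinton-style soft-label distillation can be sketched as follows (PyTorch; the temperature T and mixing weight alpha are illustrative hyperparameters, not values from this paper):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Temperature-scaled KL divergence against the teacher's softened outputs;
    # the T*T factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Ordinary cross-entropy against the hard ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 20, requires_grad=True)  # student logits, 20 classes
t = torch.randn(8, 20)                      # teacher logits
y = torch.randint(0, 20, (8,))
print(distillation_loss(s, t, y))
```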
3 Method
In this paper we use the one-stage detection algorithm SSD [15] as our object detector. The SSD architecture has two main parts: 1) a backbone network acting as the feature extractor, and 2) a head that detects the category and location of targets from the features extracted by the backbone. To obtain a good knowledge distillation effect, it is important to make rational use of both parts.
3.1 Overall Structure
Fig 1 shows the overall structure of our model. We first take a larger-capacity SSD model and, after full training, split it into a backbone network and an SSD head, with the backbone serving as the teacher net; a smaller-capacity CNN is then chosen as the student net. The multiple feature maps generated by the teacher net are treated as true samples and the corresponding feature maps generated by the student net as fake samples. Each true/fake pair is fed into its corresponding discriminator network in the D Net module (Fig 2), while the fake samples are simultaneously fed into the SSD head.
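The structure in Fig 1 can be sketched as follows (PyTorch; the single-convolution "backbones" and the small discriminator are hypothetical stand-ins for illustration, not the paper's actual networks, and only one feature level is shown):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """One per feature level: scores a feature map as teacher (1) or student (0)."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 1),
            nn.AdaptiveAvgPool2d(1),
        )
    def forward(self, f):
        return torch.sigmoid(self.net(f)).flatten(1)

# Stand-ins for the real backbones; the student must emit feature maps with
# the same shapes as the teacher's so the discriminators can compare them.
teacher_backbone = nn.Conv2d(3, 256, 3, padding=1).eval()  # frozen, pre-trained
student_backbone = nn.Conv2d(3, 256, 3, padding=1)
discriminators = nn.ModuleList([Discriminator(256)])

x = torch.randn(2, 3, 300, 300)
with torch.no_grad():
    real = teacher_backbone(x)   # "true sample"
fake = student_backbone(x)       # "fake sample", also fed to the SSD head
print(discriminators[0](real).shape, discriminators[0](fake).shape)
```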
3.2 Training Process
The training objective combines the standard generative adversarial objective with the SSD detection losses:

min_{θs} max_{θd} (1/N) Σ_{i=1}^{N} [ log D(Teacher(x_i); θd) + log(1 − D(Student(x_i; θs); θd)) ] + Lconf + Lloc    (1)
In Equation 1, N denotes the batch size, D the discriminator networks, and Teacher and Student the teacher net and student net respectively; θt, θs and θd denote the weights of the teacher net, the student net and the discriminator networks in the D Net module. Lconf is the SSD classification loss and Lloc is the SSD bounding-box loss.
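A hedged sketch of one training step for Equation 1, assuming `teacher` and `student` return lists of feature maps and `ssd_head` returns the two SSD losses (all names are hypothetical placeholders for the real GluonCV modules):

```python
import torch
import torch.nn.functional as F

def train_step(x, targets, teacher, student, ssd_head, d_nets, opt_s, opt_d):
    # Teacher features are fixed "real" samples.
    with torch.no_grad():
        real_feats = teacher(x)

    # --- update the discriminators (the max over theta_d) ---
    fake_feats = [f.detach() for f in student(x)]
    d_loss = torch.zeros(())
    for d, r, f in zip(d_nets, real_feats, fake_feats):
        pr, pf = d(r), d(f)
        d_loss = d_loss + F.binary_cross_entropy(pr, torch.ones_like(pr)) \
                        + F.binary_cross_entropy(pf, torch.zeros_like(pf))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- update the student (the min over theta_s) ---
    feats = student(x)
    adv_loss = torch.zeros(())
    for d, f in zip(d_nets, feats):
        pf = d(f)
        adv_loss = adv_loss + F.binary_cross_entropy(pf, torch.ones_like(pf))
    l_conf, l_loc = ssd_head(feats, targets)  # SSD classification / box losses
    s_loss = adv_loss + l_conf + l_loc
    opt_s.zero_grad(); s_loss.backward(); opt_s.step()
    return d_loss.item(), s_loss.item()
```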
4 Experiment
In this section we verify our approach with experiments on PASCAL VOC, which contains 20 object categories. The hardware is two NVIDIA GTX 1080 Ti GPUs, and the training framework is GluonCV.
4.1 Training and testing data
Due to time constraints, we trained on the PASCAL VOC 2012 trainval and PASCAL VOC 2007 trainval sets and tested on the PASCAL VOC 2007 test set. The dataset contains the category and location of each annotated object. Evaluation follows the PASCAL VOC competition protocol, measuring detection accuracy by mAP at IoU = 0.5. On COCO, the COCO 2017 train set is used for training and the COCO 2017 test set for testing.
4.2 Results
We compare the native SSD with SSD distilled under different teacher nets; the student net improves by up to 2.8 mAP. Interestingly, however, when the teacher net is ResNet101 and the student net is ResNet18, the improvement is smaller than with a ResNet50 teacher. On COCO, using ResNet50 as the teacher net and MobileNet as the student net improves MobileNet-SSD by 4 mAP.
Table 1. Test results of different student nets without GAN knowledge distillation and with GAN knowledge distillation under different teacher nets.
This method has now also been applied to Faster R-CNN; owing to time constraints it has so far only been tested on PASCAL VOC 2007, and COCO training is in progress.
Table 2. MobileNetV1 with GAN knowledge distillation on COCO.
Table 3. The teacher net is a Faster R-CNN with a ResNet101 backbone, trained on PASCAL VOC 2007 trainval and reaching 74.8+ mAP on the PASCAL VOC 2007 test set. The first and second rows use the GAN knowledge distillation method [1]; the third row shows the result of Distilling Object Detectors with Fine-grained Feature Imitation [2] from CVPR 2019.