This article explains model compression and acceleration of deep learning models in detail. The editor finds it very practical and shares it here for reference; I hope you gain something from reading it.
1. Model compression and acceleration
Let's first look at a definition of deep learning model compression and acceleration:
Compressing and accelerating a deep learning model means exploiting the redundancy in network parameters and network structure to simplify the model, obtaining one with fewer parameters and a more concise structure without hurting its ability to complete the task. The compressed model needs less computation and memory, and can therefore meet a wider range of application requirements than the original model.
Against what background was deep learning model compression and acceleration proposed?
As the performance of deep learning models has improved, their computation has become more and more complex, and both computational overhead and memory requirements have grown. As shown in the figure below, the 8-layer AlexNet has 61 million parameters and requires about 729 million floating-point operations, costing roughly 233 MB of memory. The later VGG-16 has 138 million parameters and requires roughly 15.5 billion floating-point operations, costing about 553 MB of memory. To overcome the vanishing-gradient problem of very deep networks, Kaiming He proposed ResNet, which for the first time pushed the top-5 classification error below 5% in the ILSVRC competition; even the relatively shallow ResNet-50 has 25 million parameters, requires about 3.9 billion floating-point operations, and costs about 102 MB of memory.
A large number of parameters means more memory for storage, while more floating-point operations mean higher training cost and longer computation time, which greatly limits deployment on resource-constrained devices such as smartphones and smart bands.
2. Overview of compression and acceleration methods
Depending on whether parameters or structure are compressed, compression methods can be divided into the following seven categories, as shown in the table below:
2.1 Parameter pruning
Parameter pruning starts from a pre-trained large model, designs evaluation criteria for the network parameters, and deletes the "redundant" parameters according to those criteria. By pruning granularity, parameter pruning can be divided into unstructured pruning and structured pruning.
Unstructured pruning has fine granularity: any desired proportion of "redundant" parameters anywhere in the network can be removed. However, it leaves the network structure irregular, which is hard to accelerate effectively.
Structured pruning has coarser granularity: the smallest unit removed is a group of parameters within a filter. By attaching evaluation factors to filters or feature maps, entire filters or even whole channels can be deleted to "narrow" the network, which can be accelerated directly on existing software and hardware. It may, however, reduce prediction accuracy, so the model usually needs fine-tuning to restore its performance.
(1) Unstructured pruning
As shown in the figure below, there are dense connections between the inputs and outputs of convolutional layers and fully connected layers. By designing criteria to evaluate the importance of the connections between neurons and deleting the redundant ones, the model can be compressed.
Depending on how the importance of connections between neurons is evaluated, there are several common forms:
Prune by the norm of the connection weight: connections whose norm falls below a specified threshold are deleted, and the network is retrained to recover performance (a minimal sketch follows this list).
Use synaptic strength to indicate the importance of connections between neurons: borrowing the concept of a synapse from biology, synaptic strength is defined as the product of the Batch Normalization (BN) layer's scaling factor γ and the Frobenius norm of the corresponding filter.
Judge the importance of connections at the model initialization stage by sampling the training set several times, generate the pruning mask, and then train once, avoiding the iterative prune-and-fine-tune process.
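To make the first criterion concrete, here is a minimal PyTorch sketch of magnitude-based unstructured pruning under a fixed threshold. The layer size and the threshold value are illustrative assumptions, not settings taken from any of the surveyed papers.

```python
# A minimal sketch of magnitude-based unstructured pruning in PyTorch.
# The threshold and layer choice are illustrative assumptions.
import torch
import torch.nn as nn

def magnitude_prune_(module: nn.Module, threshold: float) -> float:
    """Zero out weights whose absolute value falls below `threshold`.

    Returns the fraction of weights removed. The zeroed entries form an
    irregular (unstructured) sparsity pattern, which is why dedicated
    sparse kernels are needed to turn this into a real speed-up.
    """
    with torch.no_grad():
        weight = module.weight
        mask = weight.abs() >= threshold   # keep only "important" connections
        weight.mul_(mask)                  # delete the rest in place
        return 1.0 - mask.float().mean().item()

if __name__ == "__main__":
    layer = nn.Linear(512, 256)
    sparsity = magnitude_prune_(layer, threshold=0.02)
    print(f"pruned {sparsity:.1%} of the connections")
    # In practice the pruned network is then retrained (fine-tuned)
    # to recover the accuracy lost by deleting connections.
```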
(2) Structured pruning
Group-level pruning forces the filters of each layer to share the same sparsity pattern (that is, every cube in the figure deletes the square at the same position), so the layer becomes a set of sparse matrices with identical structure, as shown in the figure below:
Filter-level pruning can also be regarded as channel-level pruning. As shown in the figure above, deleting some of a layer's filters (that is, deleting entire cubes in the figure) is equivalent to deleting the feature maps they produce, together with the parts of the next layer's filters that would have convolved with those feature maps.
Evaluation criteria for filters fall into four categories (a sketch of the first follows): (1) based on the magnitude of the filter norm; (2) a custom filter scoring factor; (3) minimizing reconstruction error; (4) other methods.
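As an illustration of criterion (1), the following sketch prunes a convolutional layer at the filter level by keeping only the filters with the largest L1 norms. The layer sizes and keep ratio are assumptions chosen for the example.

```python
# A minimal sketch of filter-level (structured) pruning by L1 norm.
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    """Return a new, narrower Conv2d that keeps only the filters with the
    largest L1 norms. Because whole filters (output channels) are removed,
    the result is a regular dense layer that runs faster without any
    special sparse kernels."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    # L1 norm of each filter: sum of |w| over (in_channels, kH, kW)
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    keep_idx = torch.argsort(scores, descending=True)[:n_keep]

    new_conv = nn.Conv2d(conv.in_channels, n_keep,
                         kernel_size=conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.copy_(conv.weight[keep_idx])
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias[keep_idx])
    return new_conv

if __name__ == "__main__":
    conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
    slim = prune_conv_filters(conv, keep_ratio=0.5)
    print(slim)  # the next layer's input channels must be pruned
                 # accordingly, and the model fine-tuned afterwards
```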
2.2 Parameter quantization
Parameter quantization means representing the typical 32-bit floating-point network parameters with lower bit widths. The network parameters include weights, activations, gradients, and errors; a single bit width (such as 16-bit, 8-bit, 2-bit, or 1-bit) can be used throughout, or different bit widths can be combined freely based on experience or a chosen strategy.
The advantages of parameter quantization are:
It significantly reduces parameter storage and memory footprint: quantizing parameters from 32-bit floating point to 8-bit integers cuts storage by 75%, which is a great help when deploying deep learning models on edge devices with limited computing resources.
It speeds up computation and reduces device energy consumption: with the memory bandwidth needed to read one 32-bit float, four 8-bit integers can be read instead, and integer arithmetic is faster than floating-point arithmetic, which naturally lowers power consumption.
There are also some limitations:
Reducing the bit width of network parameters loses some information, which lowers inference accuracy; part of the accuracy can be recovered by fine-tuning, but at extra time cost. In addition, when parameters are quantized to unusual bit widths, many existing training methods and hardware platforms no longer apply, so special system architectures have to be designed, and flexibility is low.
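The sketch below illustrates the arithmetic behind the 32-bit-to-8-bit case discussed above, using simple symmetric per-tensor quantization. The scaling scheme and tensor size are assumptions chosen for the example; real toolchains offer more elaborate schemes.

```python
# A minimal sketch of uniform 8-bit quantization of a weight tensor,
# illustrating the 32-bit -> 8-bit storage reduction described above.
import torch

def quantize_int8(w: torch.Tensor):
    """Map float32 weights to int8 plus a single float scale."""
    scale = w.abs().max() / 127.0          # symmetric range [-127, 127]
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float32 tensor for computation."""
    return q.to(torch.float32) * scale

if __name__ == "__main__":
    w = torch.randn(256, 256)
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)
    print("storage: %d -> %d bytes" % (w.numel() * 4, q.numel() * 1))  # 75% smaller
    print("max abs error:", (w - w_hat).abs().max().item())
```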
2.3 Low-rank decomposition
Low-rank decomposition sparsifies the convolution kernel matrix by merging dimensions and imposing low-rank constraints. Because most weight vectors lie in a low-rank subspace, a small number of basis vectors suffice to reconstruct the kernel matrix, reducing storage.
A neural network filter can be viewed as a four-dimensional tensor: width w, height h, number of channels c, and number of kernels n. Since c and n strongly influence the overall network structure, and the (w, h) kernel matrix is information-redundant and low-rank, low-rank decomposition can be used to compress the network.
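A minimal sketch of the idea, applied to a fully connected layer for simplicity: the weight matrix is factorized with a truncated SVD into two low-rank factors, reducing the parameter count from m x n to r(m + n). The rank and layer sizes are illustrative assumptions.

```python
# A minimal sketch of low-rank decomposition of a fully connected layer
# via truncated SVD.
import torch
import torch.nn as nn

def low_rank_factorize(fc: nn.Linear, rank: int) -> nn.Sequential:
    """Replace fc (out x in) with Linear(in, rank) followed by
    Linear(rank, out), initialized from the truncated SVD of fc.weight."""
    W = fc.weight.detach()                       # shape (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank, :]

    first = nn.Linear(fc.in_features, rank, bias=False)
    second = nn.Linear(rank, fc.out_features, bias=fc.bias is not None)
    with torch.no_grad():
        first.weight.copy_(Vh_r)                         # (rank, in)
        second.weight.copy_(U_r * S_r.unsqueeze(0))      # (out, rank)
        if fc.bias is not None:
            second.bias.copy_(fc.bias)
    return nn.Sequential(first, second)

if __name__ == "__main__":
    fc = nn.Linear(1024, 1024)
    approx = low_rank_factorize(fc, rank=64)
    x = torch.randn(8, 1024)
    err = (fc(x) - approx(x)).abs().max().item()
    print("weights: %d -> %d, max error %.4f"
          % (1024 * 1024, 64 * (1024 + 1024), err))
```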
Low-rank decomposition compresses and accelerates large convolution kernels and small-to-medium networks well, and past research on it is relatively mature, but it has fallen out of favor in the last couple of years. Besides the high cost of the matrix decomposition itself, layer-by-layer decomposition does not allow global parameter compression and requires extensive retraining to converge. Moreover, recently proposed networks increasingly use 1x1 convolutions, and such small kernels are ill-suited to low-rank decomposition, making further compression and acceleration difficult.
2.4 Parameter sharing
Parameter sharing maps the network parameters onto a small amount of data using a structured matrix or clustering, thereby reducing the parameter count. The principle is similar to parameter pruning in that both exploit the large number of redundant parameters; however, instead of directly deleting unimportant parameters, parameter sharing designs a mapping that represents all parameters with a small shared set, reducing the storage requirement (a minimal sketch of such a mapping appears at the end of this subsection).
Because fully connected layers contain a huge number of parameters and account for most of the storage of the whole model, parameter sharing works well for removing their redundancy, and since it is easy to apply, it combines well with other methods. Its drawbacks are that it does not generalize easily, so removing redundancy from convolutional layers remains a challenge; and for the common structured-matrix mappings, it is hard to find a suitable structured matrix for a given weight matrix, and the theoretical basis is not yet sufficient.
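Below is a minimal sketch of one common parameter-sharing mapping: clustering a layer's weights into a small codebook so that only per-weight indices and the codebook need to be stored. The codebook size and the simple 1-D k-means routine are illustrative assumptions.

```python
# A minimal sketch of parameter sharing by weight clustering.
# 16 clusters -> 4-bit indices per weight plus a small codebook.
import torch

def cluster_weights(w: torch.Tensor, n_clusters: int = 16, n_iter: int = 20):
    """Simple 1-D k-means over the flattened weights.
    Returns (codebook, indices); w is approximated by codebook[indices]."""
    flat = w.flatten()
    # initialize centroids evenly over the weight range
    codebook = torch.linspace(flat.min().item(), flat.max().item(), n_clusters)
    for _ in range(n_iter):
        # assign each weight to its nearest centroid
        idx = (flat.unsqueeze(1) - codebook.unsqueeze(0)).abs().argmin(dim=1)
        # move each centroid to the mean of its assigned weights
        for k in range(n_clusters):
            members = flat[idx == k]
            if members.numel() > 0:
                codebook[k] = members.mean()
    return codebook, idx.reshape(w.shape)

if __name__ == "__main__":
    w = torch.randn(256, 256)
    codebook, idx = cluster_weights(w)
    w_shared = codebook[idx]            # reconstructed (shared) weights
    print("unique values:", codebook.numel(),
          "max abs error:", (w - w_shared).abs().max().item())
```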
2.5 Compact networks
Although the four methods above, which exploit parameter redundancy to reduce the number or the precision of parameters, can simplify the network structure, they usually start from a large pre-trained model and compress its parameters; most of them also lose accuracy and need fine-tuning to restore performance.
Designing a more compact new network structure is a different approach to network compression and acceleration: filters, layers, or even whole networks with special structures are constructed and trained from scratch to obtain networks suitable for deployment on resource-limited devices such as mobile platforms. There is no need to store a pre-trained model, as parameter-compression methods require, nor to recover performance through fine-tuning, which saves time. Such networks feature small storage, low computation, and good performance.
The drawback is that their special structures are hard to combine with other compression and acceleration methods, and they generalize poorly, so they are not well suited to serve as pre-trained models for helping other models train.
(1) Convolution kernel level
Here are some typical networks as examples:
SqueezeNet replaces 3x3 convolutions with 1x1 convolutions to reduce the number of feature maps, splits the convolution layer into a squeeze layer and an expand layer, and reduces the use of pooling layers.
MobileNet splits a standard convolution into a depth-wise convolution and a point-wise convolution, reducing the number of multiplications (a minimal sketch follows this list).
MobileNetV2 adds an extra 1x1 expand layer before the depth-wise convolution to increase the number of channels, so it extracts more features than MobileNet.
ShuffleNet adopts point-wise group convolution and channel shuffle to overcome the high cost and channel constraints of point-wise convolution.
ShuffleNetV2 introduces channel split on top of ShuffleNet to reduce memory access cost.
Wang et al. proposed a fully learnable group convolution module (FLGC) that can be embedded in any deep neural network for acceleration.
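The sketch below shows the depth-wise separable convolution used by MobileNet-style compact networks, as referenced in the list above, and compares its parameter count with a standard convolution. The channel counts are illustrative assumptions.

```python
# A minimal sketch of a depth-wise separable convolution block:
# a depth-wise 3x3 convolution (one filter per channel) followed by a
# 1x1 point-wise convolution that mixes information across channels.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # groups=in_ch gives each input channel its own 3x3 filter
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

if __name__ == "__main__":
    block = DepthwiseSeparableConv(64, 128)
    standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
    n_sep = sum(p.numel() for p in block.parameters())
    n_std = sum(p.numel() for p in standard.parameters())
    print(f"separable: {n_sep} params vs standard: {n_std} params")
```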
(2) Layer level
Huang et al. proposed stochastic depth for training ResNet-like networks with residual connections: for each mini-batch, a random subset of blocks is dropped and bypassed with identity mappings (a minimal sketch follows). Dong et al. equipped each convolutional layer with a low-cost collaborative layer (LCCL) that predicts which positions will become zero after the ReLU. Li et al. divided network layers into weight layers (such as convolutional and fully connected layers) and non-weight layers (such as pooling and ReLU layers) and proposed merging non-weight layers with weight layers; after the independent non-weight layers are removed, running time drops significantly.
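Here is a minimal sketch of the stochastic-depth idea: during training, a residual block is skipped at random for an entire mini-batch and replaced by the identity mapping. The survival probability and the block's contents are illustrative assumptions.

```python
# A minimal sketch of a residual block with stochastic depth.
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    def __init__(self, channels: int, survival_prob: float = 0.8):
        super().__init__()
        self.survival_prob = survival_prob
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        if self.training:
            # randomly skip this block's residual branch for the whole batch
            if torch.rand(1).item() > self.survival_prob:
                return x                       # identity bypass
            return torch.relu(x + self.body(x))
        # at test time, scale the residual branch by its survival probability
        return torch.relu(x + self.survival_prob * self.body(x))
```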
(3) Network structure level
Kim et al. proposed SplitNet, which automatically learns to split the network layers into multiple groups and obtains a tree-structured network whose subnetworks share the lower-level weights. Gordon et al. proposed MorphNet, which optimizes the network by cycling through shrinking and expanding phases: in the shrinking phase, inefficient neurons are removed via a sparsity regularizer; in the expanding phase, a width multiplier uniformly enlarges all layers, so that layers with more important neurons receive more computational resources. Kim et al. proposed NestedNet, a nested sparse network in which each layer consists of multiple levels; high-level and low-level networks share parameters in a network-in-network (NIN) manner, with the low-level networks learning common knowledge and the high-level networks learning task-specific knowledge.
2.6 Knowledge distillation
The idea behind knowledge distillation was first proposed by Buciluǎ et al., who trained a compressed model on pseudo-labeled data to reproduce the output of the original strong classifier. Unlike other compression and acceleration methods, which only use the target network to be compressed, knowledge distillation requires two kinds of networks: a teacher model and a student model.
The pre-trained teacher model is usually a large neural network with good performance. As shown in the figure below, the output of the teacher model's softmax layer serves as the soft target and the ground-truth labels serve as the hard target; both are fed into the total loss that guides the training of the student model, transferring the teacher's knowledge to the student so that the student's performance approaches the teacher's. Since the student model is more compact and efficient, this achieves model compression.
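A minimal sketch of the loss described above, combining the soft target (the teacher's softened softmax output) with the hard target (the ground-truth labels). The temperature and the weighting between the two terms are illustrative assumptions.

```python
# A minimal sketch of the soft-target / hard-target distillation loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.7):
    """Combine the soft-target loss (match the teacher's softened softmax
    output) with the hard-target loss (cross-entropy on the labels)."""
    # soft targets: KL divergence between softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # hard targets: standard cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

if __name__ == "__main__":
    student_logits = torch.randn(8, 10, requires_grad=True)
    teacher_logits = torch.randn(8, 10)          # from the frozen teacher
    labels = torch.randint(0, 10, (8,))
    loss = distillation_loss(student_logits, teacher_logits, labels)
    loss.backward()
    print("distillation loss:", loss.item())
```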
Knowledge distillation can make deep networks shallower and greatly reduce computational cost, but it also has limitations. Because the softmax output is used as the knowledge, it is generally applied to classification tasks with a softmax loss and generalizes poorly to other tasks; moreover, there is still much room for improvement in its compression ratio and in the performance of the distilled model.
2.7 Hybrid methods
Each of the compression and acceleration methods above works well on its own, but each also has its limitations; combining them lets them complement one another. By combining different methods, or by choosing different methods for different network layers, researchers have designed integrated compression and acceleration frameworks that achieve better compression ratios and speed-ups. Parameter pruning, parameter quantization, low-rank decomposition, and parameter sharing are often combined to greatly reduce a model's memory and storage needs and to ease deployment on mobile platforms with limited computing resources.
Knowledge distillation can be combined with compact networks by choosing a compact structure for the student model, which preserves the compression ratio while improving the student's performance. Hybrid methods integrate the strengths of the individual techniques and further strengthen the compression and acceleration effect; they will be an important research direction in deep learning model compression and acceleration in the future.
As shown in the figure below, Han et al. proposed Deep Compression, which combines parameter pruning, parameter quantization, and Huffman coding, achieving an excellent compression effect (a rough sketch of the idea follows).
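The sketch below follows the spirit of that pipeline: magnitude pruning, then clustering the surviving weights into a codebook, then an entropy-based estimate of what Huffman coding the cluster indices could save. All thresholds and sizes are illustrative assumptions rather than the original paper's settings, and the bookkeeping for storing the sparse positions of surviving weights is deliberately ignored.

```python
# A rough, simplified sketch in the spirit of a prune -> quantize ->
# Huffman-code pipeline. Not the original Deep Compression implementation.
import torch

def compress_estimate(w: torch.Tensor, prune_threshold=0.02, n_clusters=16):
    # 1) parameter pruning: drop small-magnitude weights
    mask = w.abs() >= prune_threshold
    kept = w[mask]

    # 2) parameter quantization: cluster surviving weights into a codebook
    codebook = torch.linspace(kept.min().item(), kept.max().item(), n_clusters)
    idx = (kept.unsqueeze(1) - codebook.unsqueeze(0)).abs().argmin(dim=1)

    # 3) Huffman coding (estimated): the entropy of the index distribution
    #    is a lower bound on the average code length in bits per index
    counts = torch.bincount(idx, minlength=n_clusters).float()
    probs = counts[counts > 0] / counts.sum()
    entropy_bits = -(probs * probs.log2()).sum().item()

    # storage of sparse index positions is ignored for simplicity
    original_bits = w.numel() * 32
    compressed_bits = kept.numel() * entropy_bits + n_clusters * 32
    return original_bits / compressed_bits

if __name__ == "__main__":
    w = torch.randn(512, 512) * 0.05
    print(f"estimated compression ratio: {compress_estimate(w):.1f}x")
```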
3. Comparison of compression effects
(1) The table below lists representative methods from parameter pruning, compact networks, parameter sharing, knowledge distillation, and hybrid compression, showing their compression results for LeNet-5 on the MNIST dataset. Except for one referenced method with a large accuracy loss, the compression results of the other methods are good. In terms of accuracy, the adaptive fastfood transform performs best: it achieves compression while even improving accuracy. In terms of parameter compression, the hybrid methods achieve the largest compression ratios with only a slight drop in accuracy.
(2) The next table shows the compression results for VGG-16 on the CIFAR-10 dataset using representative methods from parameter pruning, compact networks, parameter sharing, and hybrid methods. The compression effects of these four kinds of methods differ considerably. Overall, structured pruning works well: it compresses and accelerates the network while accuracy even improves slightly. The weighted random coding method achieves a parameter compression ratio as high as 159x with only a slight accuracy drop.
4. Future research directions
So far, deep learning model compression and acceleration techniques are still immature, and there is plenty of room for progress at the level of actual deployment and production. Below are some research directions worth attention and discussion.
As a form of transfer learning, knowledge distillation lets a small model learn as much as possible from a large one; it is flexible and independent of the hardware platform, but its compression ratio and post-distillation performance still need improvement. Future work on knowledge distillation could explore the following directions: going beyond the softmax output by combining intermediate feature layers and using other forms of knowledge; choosing better student model structures and integrating with other methods; and breaking through task boundaries, for example transferring knowledge from image classification to other domains.
Combine model compression with hardware architecture design. At present, most compression and acceleration methods optimize the model only at the software level, and because hardware platforms differ, the speed-ups of different methods are hard to compare. In the future, hardware architectures can be designed specifically for mainstream compression and acceleration methods, which would both accelerate models further and make different methods easier to compare.
Develop smarter strategies for selecting model structures. Currently, both parameter pruning and the design of more compact structures take an existing model as the backbone and either manually select which structures to remove or use heuristics to shrink the search space. In the future, strategies such as reinforcement learning could automatically search for network structures and obtain better ones.
This concludes the analysis of deep learning model compression and acceleration. I hope the content above has been of some help; if you found the article useful, please share it so more people can see it.