In this issue, the editor will show you how to implement quantization-aware training of convolutional neural networks in PyTorch. The article is rich in content and analyzed from a professional point of view; I hope you gain something from reading it.
1. Preface
Deep learning is used more and more widely on mobile devices, whose computing power and storage space are far lower than those of GPU servers. Because of this, we need to design deep learning networks tailored to mobile to meet everyday needs; lightweight networks such as SqueezeNet, MobileNet and ShuffleNet were designed for exactly this purpose. Beyond improving the network architecture, however, model pruning and quantization are probably the most commonly used optimization methods. Pruning deletes the unimportant channels of a trained "big model" to accelerate the network without hurting accuracy. Quantization approximates the weights and biases, originally stored as (high-precision) floating-point numbers, with low-precision integers (INT8 is the most common choice). After quantizing to low precision, optimization techniques such as NEON on mobile platforms can be used to accelerate the computation, and the quantized model is much smaller than the original, so it is better suited to the mobile environment. Be aware, though, that quantizing a high-precision model to a low-precision one always causes some accuracy drop, so finding a good trade-off between performance and accuracy is very important.
This article introduces some details of reproducing the paper https://arxiv.org/abs/1806.08342 in PyTorch and gives some self-test results. Note that the code implements "Quantization Aware Training"; "Post Training Quantization" may be discussed separately later. The code implementation is based on the 666DZY666 blogger's repository https://github.com/666DZY666/model-compression.
2. Symmetric quantization
The author Liang Depeng has already explained these concepts very clearly in a video. If you prefer not to read the text, you can watch that video (a popular-science introduction to deep learning quantization techniques) and then skip directly to Section 4. For the completeness of this article, however, I will still introduce the two quantization methods here.
The quantization formula of symmetric quantization is as follows:

x_q = round(x_f / Δ)

Here Δ is the quantization scaling factor, and x_f and x_q are the values before and after quantization, respectively. Dividing the raw floating-point data by the scaling factor quantizes it into a fixed interval, e.g. [-128, 127] for "signed 8-bit" (0 to 255 for unsigned).
There is a trick here: the weight is quantized to [-127, 127] instead, which reduces the risk of overflow when accumulating.
Because the value range of 8-bit is [-2^7, 2^7-1], the range of the product of two 8-bit values is (-2^14, 2^14], and accumulating two such products reaches (-2^15, 2^15]; so at most two products can be accumulated, and the second accumulation already risks overflow. For example, if two adjacent products both happen to be 2^14, their sum exceeds 2^15 - 1 (the largest value representable by a positive int16).
Therefore, if the quantized weights are restricted to [-127, 127], the result of a single multiplication is always smaller than (-128) × (-128) = 2^14.
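To make the bound concrete, here is a quick numeric check of the arithmetic above (a minimal sketch, assuming an int16 accumulator):

INT16_MAX = 2 ** 15 - 1                 # 32767

# Unrestricted int8 operands: both can be -128
worst_product = (-128) * (-128)         # 16384 = 2^14
print(worst_product + worst_product > INT16_MAX)   # True: two accumulations can overflow

# Weights clipped to [-127, 127]: one operand is at most 127 in magnitude
clipped_product = 127 * 128             # 16256 < 2^14
print(2 * clipped_product <= INT16_MAX)            # True: two accumulations stay in range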
The corresponding dequantization formula is:

x_f' = x_q * Δ

That is, the dequantized value is obtained by multiplying the quantized value by the scaling factor. This process is of course lossy, as shown in the following figure: the orange line represents the data range before quantization, while the blue line represents the quantized data range (note the restricted range used for weights).
Schematic diagram of quantization and dequantization
Take the float32 value at the black dot on the orange line above: dividing it by the scaling factor quantizes it to a value in the target interval, which is then rounded. If we then dequantize, multiplying by the scaling factor maps it back to the black dot on the upper line, and this number replaces the original one when the network's forward pass continues.
So how is the scaling factor obtained? For symmetric quantization it is simply:

Δ = max(abs(x_f)) / 127

3. Asymmetric quantization
Compared with symmetric quantization, asymmetric quantization has an additional zero-point offset. The steps for asymmetrically quantizing a float32 number to an int8 integer ([-128, 127] if signed, [0, 255] if unsigned) are scaling, rounding, adding the zero-point offset, and overflow protection (clamping), as shown in the following figure:
(Figure: the asymmetric quantization process; for an 8-bit unsigned integer the number of quantization levels Nlevels is 256.)
The scaling factor and zero-point offset are then calculated as:

Δ = (max(x_f) - min(x_f)) / (Nlevels - 1)
z = round(-min(x_f) / Δ)
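To make the two schemes concrete, here is a minimal, self-contained PyTorch sketch of symmetric and asymmetric quantization followed by dequantization; it is my own illustration of the formulas above, and the function names are made up for this example:

import torch

def symmetric_quantize(x, num_bits=8):
    # Symmetric: map max(|x|) to 127, no zero point
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for int8
    delta = x.abs().max() / qmax
    x_q = torch.clamp(torch.round(x / delta), -qmax, qmax)
    return x_q, delta

def asymmetric_quantize(x, num_bits=8):
    # Asymmetric (unsigned): map [min, max] to [0, 255] with a zero-point offset
    nlevels = 2 ** num_bits                        # 256 for 8 bits
    delta = (x.max() - x.min()) / (nlevels - 1)
    zero_point = torch.round(-x.min() / delta)
    x_q = torch.clamp(torch.round(x / delta) + zero_point, 0, nlevels - 1)
    return x_q, delta, zero_point

x = torch.randn(5)

x_q, delta = symmetric_quantize(x)
print(x)
print(x_q * delta)                                 # symmetric dequantization, close to x

x_q, delta, zp = asymmetric_quantize(x)
print((x_q - zp) * delta)                          # asymmetric dequantization, close to x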
4. Interim summary
Applying the two algorithms above directly to each trained network (i.e. post-training quantization, PTQ) gives the accuracy results shown below:

(Figure: accuracy results of applying the two quantization algorithms to each network; the entries in red mark the notable accuracy drops.)

5. Training-time simulated quantization
For this we need to simulate quantization during network training. Training has two phases, forward and backward; the quantization in the forward phase is exactly what Sections 2 and 3 describe. Note, however, that the scaling factor is now computed differently for weights and for activations.
For the weight scaling factor, the calculation is the same as in Sections 2 and 3, that is:

weight_scale = max(abs(weight)) / 127

For activations, however, the scaling factor is no longer computed from the instantaneous maximum; instead, an exponential moving average (EMA) tracks the quantization range during training. The update formula is:

moving_max = moving_max * momentum + max(abs(activation)) * (1 - momentum)

Here momentum is a number close to 1 (0.99 in the PyTorch experiment later), and the scaling factor is then:

activation_scale = moving_max / 128
In the back-propagation phase, the gradient is computed with the straight-through estimator, i.e. the gradient with respect to the simulated-quantized weight is passed through unchanged to the full-precision weight:

∂L/∂w = ∂L/∂w_q

In other words, the gradient obtained in back-propagation is the gradient of the simulated-quantized weights, and this gradient is used to update the full-precision weights kept before quantization.
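The quantizer code below calls Round.apply, which is not shown in this excerpt. A minimal straight-through-estimator rounding function consistent with the gradient rule above might look like this (a sketch, assuming the repo defines Round in a similar way):

import torch

class Round(torch.autograd.Function):
    # Rounding with a straight-through estimator:
    # the forward pass rounds, the backward pass lets the gradient through unchanged.
    @staticmethod
    def forward(ctx, input):
        return torch.round(input)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # STE: treat round() as the identity for gradients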
The code for this part is as follows. Note that in this experiment int8 is simulated with float32, so there is no real on-device acceleration; the goal is only to verify that the algorithm works:
class Quantizer(nn.Module):
    def __init__(self, bits, range_tracker):
        super().__init__()
        self.bits = bits
        self.range_tracker = range_tracker
        self.register_buffer('scale', None)       # quantization scale factor
        self.register_buffer('zero_point', None)  # quantization zero point

    def update_params(self):
        raise NotImplementedError

    # quantization
    def quantize(self, input):
        output = input * self.scale - self.zero_point
        return output

    def round(self, input):
        output = Round.apply(input)
        return output

    # clamping (truncation)
    def clamp(self, input):
        output = torch.clamp(input, self.min_val, self.max_val)
        return output

    # dequantization
    def dequantize(self, input):
        output = (input + self.zero_point) / self.scale
        return output

    def forward(self, input):
        if self.bits == 32:
            output = input
        elif self.bits == 1:
            print('! Binary quantization is not supported !')
            assert self.bits != 1
        else:
            self.range_tracker(input)
            self.update_params()
            output = self.quantize(input)     # quantize
            output = self.round(output)
            output = self.clamp(output)       # clamp
            output = self.dequantize(output)  # dequantize
        return output
6. Code implementation
Symmetric and asymmetric quantization are implemented here, based on https://github.com/666DZY666/model-compression/blob/master/quantization/WqAq/IAO/models/util_wqaq.py. The detail to pay attention to is that the weights are quantized per channel (a separate scaling factor is computed for each output channel), while the activations use a single scaling factor for the whole tensor; the paper mentions that this combination works best.
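To illustrate the difference between the two granularities, here is a standalone sketch (not the repo's code) of computing a per-channel scale for a conv weight and a per-layer scale for an activation tensor:

import torch

w = torch.randn(16, 3, 3, 3)     # conv weight: (out_channels, in_channels, kH, kW)
a = torch.randn(8, 16, 32, 32)   # activation:  (N, C, H, W)

# Weights: one symmetric scaling factor per output channel
w_scale = w.abs().amax(dim=(1, 2, 3), keepdim=True) / 127   # shape (16, 1, 1, 1)

# Activations: a single scaling factor for the whole tensor
# (in the real code this maximum is additionally smoothed with an EMA)
a_scale = a.abs().max() / 128

print(w_scale.shape, a_scale)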
The code implementation of this part is as follows:
# ********************* range_trackers (range statistics: track the pre-quantization range) *********************
class RangeTracker(nn.Module):
    def __init__(self, q_level):
        super().__init__()
        self.q_level = q_level

    def update_range(self, min_val, max_val):
        raise NotImplementedError

    @torch.no_grad()
    def forward(self, input):
        if self.q_level == 'L':    # A: min_max shape = (1, 1, 1, 1), layer level
            min_val = torch.min(input)
            max_val = torch.max(input)
        elif self.q_level == 'C':  # W: min_max shape = (N, 1, 1, 1), channel level
            min_val = torch.min(torch.min(torch.min(input, 3, keepdim=True)[0], 2, keepdim=True)[0], 1, keepdim=True)[0]
            max_val = torch.max(torch.max(torch.max(input, 3, keepdim=True)[0], 2, keepdim=True)[0], 1, keepdim=True)[0]
        self.update_range(min_val, max_val)

class GlobalRangeTracker(RangeTracker):  # W: min_max shape = (N, 1, 1, 1), channel level; keeps the min/max of this batch compared with all previous ones -- (N, C, W, H)
    def __init__(self, q_level, out_channels):
        super().__init__(q_level)
        self.register_buffer('min_val', torch.zeros(out_channels, 1, 1, 1))
        self.register_buffer('max_val', torch.zeros(out_channels, 1, 1, 1))
        self.register_buffer('first_w', torch.zeros(1))

    def update_range(self, min_val, max_val):
        temp_minval = self.min_val
        temp_maxval = self.max_val
        if self.first_w == 0:
            self.first_w.add_(1)
            self.min_val.add_(min_val)
            self.max_val.add_(max_val)
        else:
            self.min_val.add_(-temp_minval).add_(torch.min(temp_minval, min_val))
            self.max_val.add_(-temp_maxval).add_(torch.max(temp_maxval, max_val))

class AveragedRangeTracker(RangeTracker):  # A: min_max shape = (1, 1, 1, 1), layer level; keeps a running min/max -- (N, C, W, H)
    def __init__(self, q_level, momentum=0.1):
        super().__init__(q_level)
        self.momentum = momentum
        self.register_buffer('min_val', torch.zeros(1))
        self.register_buffer('max_val', torch.zeros(1))
        self.register_buffer('first_a', torch.zeros(1))

    def update_range(self, min_val, max_val):
        if self.first_a == 0:
            self.first_a.add_(1)
            self.min_val.add_(min_val)
            self.max_val.add_(max_val)
        else:
            self.min_val.mul_(1 - self.momentum).add_(min_val * self.momentum)
            self.max_val.mul_(1 - self.momentum).add_(max_val * self.momentum)
The self.register_buffer call registers a tensor in the module: it is saved and loaded together with the model's state, but it is not a parameter, so it does not participate in back-propagation.
❝
In general, PyTorch stores a network's state as an OrderedDict. There are actually two kinds of entries: one is the parameters contained in the model's modules, i.e. nn.Parameter (and we can of course define additional nn.Parameter entries in the network); the other is buffers. The former are updated by every optimizer step, the latter are not.
❞
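As a quick, self-contained illustration of the difference (not from the original repo):

import torch
import torch.nn as nn

class Example(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(3))            # trainable, updated by optimizer.step()
        self.register_buffer('running_max', torch.zeros(1))   # saved in state_dict, never receives a gradient

m = Example()
print([name for name, _ in m.named_parameters()])   # ['weight']
print(list(m.state_dict().keys()))                  # ['weight', 'running_max']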
In addition, a convolution layer is usually followed by a BN layer, and to speed up forward inference the BN parameters are often folded into the convolution parameters, so training-time simulated quantization should follow the same process: first fold the BN parameters into the convolution parameters, and then quantize the folded parameters. The specific process is explained by this slide from Liang Depeng:
(Slide by Liang Depeng)
Therefore the code implementation has two versions: training-time simulated quantization without BN folding, and with BN folding. But why does the fused form look like the figure above? Consider a convolution y = w * x + b followed by batch normalization:

y_bn = gamma * (y - mu) / sqrt(var + eps) + beta
     = (gamma * w / sqrt(var + eps)) * x + beta + (b - mu) * gamma / sqrt(var + eps)

So:

w_fold = w * gamma / sqrt(var + eps)
b_fold = beta + (b - mu) * gamma / sqrt(var + eps)

Here w and b are the weight and bias of the convolution layer, x and y are its input and output, and gamma, beta, mu and var are the BN scale, shift, mean and variance. From this, the weight and bias after folding in the batchnorm parameters follow directly.
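As a sanity check of this derivation, here is a minimal sketch of folding a BatchNorm2d (in eval mode, using its running statistics) into a preceding Conv2d; this is my own illustration rather than the repo's code:

import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Returns a single conv whose output equals bn(conv(x)) in eval mode.
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    factor = bn.weight / torch.sqrt(bn.running_var + bn.eps)          # gamma / sqrt(var + eps)
    fused.weight.data = conv.weight.data * factor.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = bn.bias.data + (conv_bias - bn.running_mean) * factor
    return fused

conv = nn.Conv2d(3, 8, 3, padding=1)
bn = nn.BatchNorm2d(8).eval()
x = torch.randn(1, 3, 16, 16)
print(torch.allclose(bn(conv(x)), fold_bn_into_conv(conv, bn)(x), atol=1e-5))  # True

During QAT, the BNFold_Conv2d_Q layer below performs the same folding on the fly, using batch statistics in training mode and running statistics in eval mode.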
The training-time simulated quantization code without BN folding is implemented as follows (with comments):
# ********************* quantized convolution (quantize both A and W, then convolve) *********************
class Conv2d_Q(nn.Conv2d):
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_size,
        stride=1,
        padding=0,
        dilation=1,
        groups=1,
        bias=True,
        a_bits=8,
        w_bits=8,
        q_type=1,
        first_layer=0,
    ):
        super().__init__(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            dilation=dilation,
            groups=groups,
            bias=bias
        )
        # instantiate quantizers (A: layer level, W: channel level)
        if q_type == 0:
            self.activation_quantizer = SymmetricQuantizer(bits=a_bits, range_tracker=AveragedRangeTracker(q_level='L'))
            self.weight_quantizer = SymmetricQuantizer(bits=w_bits, range_tracker=GlobalRangeTracker(q_level='C', out_channels=out_channels))
        else:
            self.activation_quantizer = AsymmetricQuantizer(bits=a_bits, range_tracker=AveragedRangeTracker(q_level='L'))
            self.weight_quantizer = AsymmetricQuantizer(bits=w_bits, range_tracker=GlobalRangeTracker(q_level='C', out_channels=out_channels))
        self.first_layer = first_layer

    def forward(self, input):
        # quantize A and W
        if not self.first_layer:
            input = self.activation_quantizer(input)
        q_input = input
        q_weight = self.weight_quantizer(self.weight)
        # quantized convolution
        output = F.conv2d(
            input=q_input,
            weight=q_weight,
            bias=self.bias,
            stride=self.stride,
            padding=self.padding,
            dilation=self.dilation,
            groups=self.groups
        )
        return output
The code implementation with BN folding is as follows (with comments):
def reshape_to_activation(input):
    return input.reshape(1, -1, 1, 1)

def reshape_to_weight(input):
    return input.reshape(-1, 1, 1, 1)

def reshape_to_bias(input):
    return input.reshape(-1)

# ********************* BN folding + quantized convolution (after BN folding, quantize A and W, then convolve) *********************
class BNFold_Conv2d_Q(Conv2d_Q):
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_size,
        stride=1,
        padding=0,
        dilation=1,
        groups=1,
        bias=False,
        eps=1e-5,
        momentum=0.01,  # considering the jitter caused by quantization, a smaller momentum lowers the weight of the batch statistics and suppresses jitter to some extent; experiments show quantized training works better this way, with acc up by about 1%
        a_bits=8,
        w_bits=8,
        q_type=1,
        first_layer=0,
    ):
        super().__init__(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            dilation=dilation,
            groups=groups,
            bias=bias
        )
        self.eps = eps
        self.momentum = momentum
        self.gamma = Parameter(torch.Tensor(out_channels))
        self.beta = Parameter(torch.Tensor(out_channels))
        self.register_buffer('running_mean', torch.zeros(out_channels))
        self.register_buffer('running_var', torch.ones(out_channels))
        self.register_buffer('first_bn', torch.zeros(1))
        init.uniform_(self.gamma)
        init.zeros_(self.beta)
        # instantiate quantizers (A: layer level, W: channel level)
        if q_type == 0:
            self.activation_quantizer = SymmetricQuantizer(bits=a_bits, range_tracker=AveragedRangeTracker(q_level='L'))
            self.weight_quantizer = SymmetricQuantizer(bits=w_bits, range_tracker=GlobalRangeTracker(q_level='C', out_channels=out_channels))
        else:
            self.activation_quantizer = AsymmetricQuantizer(bits=a_bits, range_tracker=AveragedRangeTracker(q_level='L'))
            self.weight_quantizer = AsymmetricQuantizer(bits=w_bits, range_tracker=GlobalRangeTracker(q_level='C', out_channels=out_channels))
        self.first_layer = first_layer

    def forward(self, input):
        # training mode
        if self.training:
            # do an ordinary convolution to get the pre-BN activations used for the BN statistics
            output = F.conv2d(
                input=input,
                weight=self.weight,
                bias=self.bias,
                stride=self.stride,
                padding=self.padding,
                dilation=self.dilation,
                groups=self.groups
            )
            # update BN statistics (batch and running)
            dims = [dim for dim in range(4) if dim != 1]
            batch_mean = torch.mean(output, dim=dims)
            batch_var = torch.var(output, dim=dims)
            with torch.no_grad():
                if self.first_bn == 0:
                    self.first_bn.add_(1)
                    self.running_mean.add_(batch_mean)
                    self.running_var.add_(batch_var)
                else:
                    self.running_mean.mul_(1 - self.momentum).add_(batch_mean * self.momentum)
                    self.running_var.mul_(1 - self.momentum).add_(batch_var * self.momentum)
            # BN folding
            if self.bias is not None:
                bias = reshape_to_bias(self.beta + (self.bias - batch_mean) * (self.gamma / torch.sqrt(batch_var + self.eps)))
            else:
                bias = reshape_to_bias(self.beta - batch_mean * (self.gamma / torch.sqrt(batch_var + self.eps)))  # b folds the batch statistics
            weight = self.weight * reshape_to_weight(self.gamma / torch.sqrt(self.running_var + self.eps))        # w folds the running statistics
        # eval mode
        else:
            # print(self.running_mean, self.running_var)
            # BN folding
            if self.bias is not None:
                bias = reshape_to_bias(self.beta + (self.bias - self.running_mean) * (self.gamma / torch.sqrt(self.running_var + self.eps)))
            else:
                bias = reshape_to_bias(self.beta - self.running_mean * (self.gamma / torch.sqrt(self.running_var + self.eps)))  # b folds the running statistics
            weight = self.weight * reshape_to_weight(self.gamma / torch.sqrt(self.running_var + self.eps))                      # w folds the running statistics
        # quantize A and the BN-folded W
        if not self.first_layer:
            input = self.activation_quantizer(input)
        q_input = input
        q_weight = self.weight_quantizer(weight)
        # quantized convolution
        if self.training:  # training mode
            output = F.conv2d(
                input=q_input,
                weight=q_weight,
                bias=self.bias,  # note: no bias is added here (self.bias is None)
                stride=self.stride,
                padding=self.padding,
                dilation=self.dilation,
                groups=self.groups
            )
            # (here the effect of folding the running statistics into the weight is corrected to that of folding the batch statistics during training) running -> batch
            output *= reshape_to_activation(torch.sqrt(self.running_var + self.eps) / torch.sqrt(batch_var + self.eps))
            output += reshape_to_activation(bias)
        else:  # eval mode
            output = F.conv2d(
                input=q_input,
                weight=q_weight,
                bias=bias,  # note: the bias is added here to form a complete conv+bn
                stride=self.stride,
                padding=self.padding,
                dilation=self.dilation,
                groups=self.groups
            )
        return output
Note that bias is set to None during training, i.e. the bias is not quantized during training.
7. Experimental results
The Quantization Aware Training experiment is done on CIFAR-10, and the network structure is as follows:
import torch
import torch.nn as nn
import torch.nn.functional as F
from .util_wqaq import Conv2d_Q, BNFold_Conv2d_Q

class QuanConv2d(nn.Module):
    def __init__(self, input_channels, output_channels,
                 kernel_size=-1, stride=-1, padding=-1, groups=1, last_relu=0, abits=8, wbits=8, bn_fold=0, q_type=1, first_layer=0):
        super(QuanConv2d, self).__init__()
        self.last_relu = last_relu
        self.bn_fold = bn_fold
        self.first_layer = first_layer

        if self.bn_fold == 1:
            self.bn_q_conv = BNFold_Conv2d_Q(input_channels, output_channels,
                    kernel_size=kernel_size, stride=stride, padding=padding, groups=groups, a_bits=abits, w_bits=wbits, q_type=q_type, first_layer=first_layer)
        else:
            self.q_conv = Conv2d_Q(input_channels, output_channels,
                    kernel_size=kernel_size, stride=stride, padding=padding, groups=groups, a_bits=abits, w_bits=wbits, q_type=q_type, first_layer=first_layer)
            self.bn = nn.BatchNorm2d(output_channels, momentum=0.01)  # considering the jitter caused by quantization, a smaller momentum lowers the weight of the batch statistics and suppresses jitter to some extent; experiments show quantized training works better this way, with acc up by about 1%
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        if not self.first_layer:
            x = self.relu(x)
        if self.bn_fold == 1:
            x = self.bn_q_conv(x)
        else:
            x = self.q_conv(x)
            x = self.bn(x)
        if self.last_relu:
            x = self.relu(x)
        return x

class Net(nn.Module):
    def __init__(self, cfg=None, abits=8, wbits=8, bn_fold=0, q_type=1):
        super(Net, self).__init__()
        if cfg is None:
            cfg = [192, 160, 96, 192, 192, 192, 192, 192]

        # model: full A/W quantization (except input and output)
        self.quan_model = nn.Sequential(
            QuanConv2d(3, cfg[0], kernel_size=5, stride=1, padding=2, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type, first_layer=1),
            QuanConv2d(cfg[0], cfg[1], kernel_size=1, stride=1, padding=0, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            QuanConv2d(cfg[1], cfg[2], kernel_size=1, stride=1, padding=0, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            QuanConv2d(cfg[2], cfg[3], kernel_size=5, stride=1, padding=2, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            QuanConv2d(cfg[3], cfg[4], kernel_size=1, stride=1, padding=0, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            QuanConv2d(cfg[4], cfg[5], kernel_size=1, stride=1, padding=0, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            QuanConv2d(cfg[5], cfg[6], kernel_size=3, stride=1, padding=1, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            QuanConv2d(cfg[6], cfg[7], kernel_size=1, stride=1, padding=0, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            QuanConv2d(cfg[7], 10, kernel_size=1, stride=1, padding=0, last_relu=1, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            nn.AvgPool2d(kernel_size=8, stride=1, padding=0),
        )

    def forward(self, x):
        x = self.quan_model(x)
        x = x.view(x.size(0), -1)
        return x
The number of training epochs is 30, and the learning-rate schedule is:
def adjust_learning_rate(optimizer, epoch):
    if args.bn_fold == 1:
        if args.model_type == 0:
            update_list = [12, 15, 25]
        else:
            update_list = [8, 12, 20, 25]
    else:
        update_list = [15, 17, 20]
    if epoch in update_list:
        for param_group in optimizer.param_groups:
            param_group['lr'] = param_group['lr'] * 0.1
    return
Type                                     Acc       Note
Original model (NIN)                     91.01%    full precision
Symmetric quantization, BN not fused     88.88%    INT8
Symmetric quantization, BN fused         86.66%    INT8
Asymmetric quantization, BN not fused    88.89%    INT8
Asymmetric quantization, BN fused        87.30%    INT8
It is not clear why quantization loses 1-2 points of accuracy here. According to Liang Depeng's experimental results in MXNet, classification tasks should not lose accuracy, so there may be a problem with this code; experienced readers are welcome to point it out.
For reference, the quantization-aware training accuracies of some classification networks reported in the white paper are as follows:
The above is how to implement quantization-aware training of convolutional neural networks in PyTorch. If you happen to have similar doubts, you can refer to the analysis above; if you want to learn more, you are welcome to follow the industry information channel.