How to Implement Quantization Aware Training of Convolutional Neural Networks in PyTorch


In this post I will walk through how to implement quantization aware training of convolutional neural networks in PyTorch. The article covers the topic from a practical point of view; I hope you get something out of it.

1. Preface

Deep learning is being deployed on mobile devices more and more widely, while the compute power and storage of mobile hardware are far below those of GPU servers. Because of this, we need to tailor deep learning networks to mobile so that they meet everyday needs; lightweight networks such as SqueezeNet, MobileNet and ShuffleNet were designed for exactly this purpose. Besides improving the network architecture itself, model pruning and quantization are probably the most commonly used optimization methods. Pruning deletes the unimportant channels of a trained "big model" to speed up the network without hurting accuracy. Quantization approximates the floating-point (high precision) weights and biases with low-precision integers (INT8 is the most common choice). After quantizing to low precision, optimization techniques such as NEON on mobile platforms can be used to accelerate the computation, and the model size shrinks accordingly, so the model is better suited to the mobile environment. Note, however, that quantizing a high-precision model to low precision inevitably costs some accuracy, so finding a good trade-off between performance and accuracy is very important.

This article reproduces some details of the paper https://arxiv.org/abs/1806.08342 in PyTorch and gives some self-test results. Note that the code implements Quantization Aware Training; Post Training Quantization may be covered separately later. The code is based on the implementation by the blogger 666DZY666: https://github.com/666DZY666/model-compression.

2. Symmetric quantization

The author Liang Depeng has already explained these concepts very clearly in a video. If you would rather not read a text version, you can watch his video ("a popular-science introduction to deep learning quantization") and then skip directly to Section 4. For completeness, however, I will still introduce the two quantization schemes here.

The quantization formula of symmetric quantization is as follows:

x_q = clamp(round(x_f / Δ), -2^(n-1), 2^(n-1) - 1)   (symmetric quantization formula)

Here Δ is the quantization scaling factor, and x_f and x_q are the values before and after quantization, respectively. Dividing the raw floating-point data by the scaling factor quantizes it into a fixed integer interval, for example [-128, 127] for signed 8-bit (0 to 255 for unsigned).

There is one trick: the weights are quantized to [-127, 127] instead, which reduces the risk of overflow during accumulation.

Because the value range of signed 8-bit is [-2^7, 2^7 - 1], the product of two 8-bit values lies in (-2^14, 2^14], and accumulating just twice already reaches (-2^15, 2^15]. So at most two products can be accumulated, and even the second accumulation risks overflow: for example, if two adjacent products both happen to be exactly 2^14, their sum exceeds 2^15 - 1 (the largest positive value an int16 can represent).

Therefore, if the quantized weights are restricted to [-127, 127], the magnitude of a single product is always less than (-128) × (-128) = 2^14, which lowers the overflow risk during accumulation.

The corresponding inverse quantization formula is:

x_f ≈ x_q * Δ   (symmetric dequantization formula)

The dequantized result is obtained by multiplying the quantized value by the scaling factor. Of course this process is lossy, as shown in the figure below: the orange line represents the range before quantization and the blue line represents the quantized data range (note that the weights are quantized to [-127, 127]).

Schematic diagram of quantization and dequantization

Take the float32 value marked by the black dot on the orange line above: divide it by the scaling factor to map it into the quantized interval, then round it. To dequantize, multiply by the scaling factor to get back (approximately) the original black dot, and that recovered number replaces the original one for the rest of the network's forward pass.
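Here is a minimal numeric sketch of that round trip (my own toy example, not code from the repo), using the max-abs scaling factor that is derived just below:

import torch

# Toy sketch of symmetric fake quantization: quantize -> round -> dequantize.
x = torch.tensor([-0.62, -0.10, 0.00, 0.33, 0.90])
delta = x.abs().max() / 127                          # scaling factor for signed 8-bit weights
x_q = torch.clamp(torch.round(x / delta), -127, 127)
x_hat = x_q * delta                                  # lossy reconstruction
print(x_q)                                           # integer levels in [-127, 127]
print((x - x_hat).abs().max())                       # error is at most delta / 2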

So how do you get this scaling factor? As follows:

Δ = max(|x_f|) / (2^(n-1) - 1), which for signed 8-bit weights is simply max(abs(weight)) / 127.

3. Asymmetric quantization

Compared with symmetric quantization, asymmetric quantization has an extra zero offset (zero point). Asymmetrically quantizing a float32 number to an int8 integer (range [-128, 127] if signed, [0, 255] if unsigned) consists of scaling, rounding, adding the zero offset, and overflow protection (clamping), as shown in the figure below:

The asymmetric quantization process (for 8-bit unsigned integers the number of levels N_levels is 256)

Then the scaling factor and the zero offset are computed as follows:

Δ = (max(x_f) - min(x_f)) / (N_levels - 1)

z = round(-min(x_f) / Δ)

so that the quantized value is x_q = clamp(round(x_f / Δ) + z, 0, N_levels - 1).
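As a toy illustration of these formulas (my own snippet, not the repo's quantizer, which uses a slightly different but equivalent convention), asymmetric 8-bit quantization of a small tensor looks like this:

import torch

# Toy sketch of asymmetric (affine) 8-bit quantization following the formulas above.
x = torch.tensor([-0.5, 0.0, 0.3, 1.2])
n_levels = 256
delta = (x.max() - x.min()) / (n_levels - 1)      # scaling factor
z = torch.round(-x.min() / delta)                 # zero offset: real 0.0 maps exactly onto an integer
x_q = torch.clamp(torch.round(x / delta) + z, 0, n_levels - 1)
x_hat = (x_q - z) * delta                         # dequantize
print(x_q, x_hat)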

4. Interim summary

Applying the above two algorithms directly to each trained network (i.e. Post Training Quantization, PTQ) gives the following test accuracies:

Accuracy of each network after directly applying the two quantization algorithms (the quantized results are the rows marked in red in the original figure)

5. Quantization Aware Training (training-time simulated quantization)

With Quantization Aware Training, quantization is simulated during network training. Training has two phases, forward and backward; the quantization in the forward phase is exactly what Sections 2 and 3 describe. Note, however, that the scaling factors for weights and activations are now computed differently.

For the weight scaling factor, the calculation is the same as in Sections 2 and 3, that is:

weight_scale = max(abs(weight)) / 127

However, the scaling factor of the activations is no longer computed from a simple maximum; instead, an exponential moving average (EMA) is used to track the quantization range during training. The update formula is:

moving_max = moving_max * momentum + max(abs(activation)) * (1 - momentum)

Here momentum is a number close to 1 (0.99 in the PyTorch experiments later), and the scaling factor is then:

activation_scale = moving_max / 128
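A minimal sketch of this EMA range tracking (toy example, the variable names are mine) over a few training batches:

import torch

# Toy sketch of tracking the activation range with an EMA during training.
momentum = 0.99
moving_max = torch.tensor(0.0)
for _ in range(100):                         # pretend these are training iterations
    activation = torch.randn(32, 64, 8, 8)   # a fake activation batch
    batch_max = activation.abs().max()
    moving_max = moving_max * momentum + batch_max * (1 - momentum)
activation_scale = moving_max / 128
print(activation_scale)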

Then the formula for calculating the gradient in the back propagation phase is as follows:

Gradient formula in the QAT back-propagation stage (straight-through estimator)

The gradient obtained in back propagation is the gradient with respect to the simulated-quantized weights, and this gradient is used to update the original (pre-quantization) floating-point weights; in other words, the rounding step is treated as the identity in the backward pass (the straight-through estimator).
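The Round.apply call in the Quantizer code below relies on exactly this straight-through estimator. The article does not show that class; a minimal sketch of what it presumably looks like is:

import torch
from torch.autograd import Function

class Round(Function):
    # Straight-through estimator: round in the forward pass,
    # pass the gradient through unchanged in the backward pass.
    @staticmethod
    def forward(ctx, input):
        return torch.round(input)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.clone()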

The code for this part is shown below. Note that in this experiment int8 is only simulated with float32, so there is no real speed-up on the device side; the goal is simply to verify that the algorithm works:

class Quantizer(nn.Module):
    def __init__(self, bits, range_tracker):
        super().__init__()
        self.bits = bits
        self.range_tracker = range_tracker
        self.register_buffer('scale', None)       # quantization scale factor
        self.register_buffer('zero_point', None)  # quantization zero point

    def update_params(self):
        raise NotImplementedError

    # quantize
    def quantize(self, input):
        output = input * self.scale - self.zero_point
        return output

    def round(self, input):
        output = Round.apply(input)
        return output

    # clamp (truncation)
    def clamp(self, input):
        output = torch.clamp(input, self.min_val, self.max_val)
        return output

    # dequantize
    def dequantize(self, input):
        output = (input + self.zero_point) / self.scale
        return output

    def forward(self, input):
        if self.bits == 32:
            output = input
        elif self.bits == 1:
            print('! Binary quantization is not supported !')
            assert self.bits != 1
        else:
            self.range_tracker(input)
            self.update_params()
            output = self.quantize(input)       # quantize
            output = self.round(output)
            output = self.clamp(output)         # clamp
            output = self.dequantize(output)    # dequantize
        return output
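The SymmetricQuantizer and AsymmetricQuantizer used later are subclasses of this Quantizer that fill in update_params and the clamp bounds min_val/max_val. They are not reproduced in this article; a sketch consistent with the Quantizer above (based on my reading of the repo, so treat the details as an approximation rather than the authoritative implementation) would be:

# Sketch of the quantizer subclasses assumed by the code below (approximation of the repo).
class SignedQuantizer(Quantizer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.register_buffer('min_val', torch.tensor(float(-(1 << (self.bits - 1)))))     # e.g. -128
        self.register_buffer('max_val', torch.tensor(float((1 << (self.bits - 1)) - 1)))  # e.g. 127

class UnsignedQuantizer(Quantizer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.register_buffer('min_val', torch.tensor(0.0))
        self.register_buffer('max_val', torch.tensor(float((1 << self.bits) - 1)))        # e.g. 255

class SymmetricQuantizer(SignedQuantizer):
    def update_params(self):
        quantized_range = torch.min(torch.abs(self.min_val), torch.abs(self.max_val))  # range after quantization
        float_range = torch.max(torch.abs(self.range_tracker.min_val),
                                torch.abs(self.range_tracker.max_val))                 # range before quantization
        self.scale = quantized_range / float_range   # note: quantize() multiplies by this scale
        self.zero_point = torch.zeros_like(self.scale)

class AsymmetricQuantizer(UnsignedQuantizer):
    def update_params(self):
        quantized_range = self.max_val - self.min_val
        float_range = self.range_tracker.max_val - self.range_tracker.min_val
        self.scale = quantized_range / float_range
        self.zero_point = torch.round(self.range_tracker.min_val * self.scale)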

6. Code implementation

Symmetric and asymmetric quantization are implemented here based on https://github.com/666DZY666/model-compression/blob/master/quantization/WqAq/IAO/models/util_wqaq.py. One detail worth noting is that the weights are quantized per channel (a separate scaling factor for each output channel), while the activations use a single scaling factor for the whole tensor; the paper mentions that this combination works best.
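To make the distinction concrete (toy snippet, not repo code): for a weight tensor of shape (N, C, kH, kW) the per-channel scale has shape (N, 1, 1, 1), while a per-tensor (layer-level) activation scale is a single number.

import torch

w = torch.randn(16, 8, 3, 3)                                          # weight, shape (N, C, kH, kW)
per_channel_scale = w.abs().amax(dim=(1, 2, 3), keepdim=True) / 127   # shape (16, 1, 1, 1)
a = torch.randn(32, 8, 14, 14)                                        # activation batch
per_tensor_scale = a.abs().max() / 128                                # a single scalar
print(per_channel_scale.shape, per_tensor_scale.shape)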

The code implementation of this part is as follows:

# ********************* range trackers (track the range before quantization) *********************
class RangeTracker(nn.Module):
    def __init__(self, q_level):
        super().__init__()
        self.q_level = q_level

    def update_range(self, min_val, max_val):
        raise NotImplementedError

    @torch.no_grad()
    def forward(self, input):
        if self.q_level == 'L':    # A: min_max_shape = (1, 1, 1, 1), layer level
            min_val = torch.min(input)
            max_val = torch.max(input)
        elif self.q_level == 'C':  # W: min_max_shape = (N, 1, 1, 1), channel level
            min_val = torch.min(torch.min(torch.min(input, 3, keepdim=True)[0], 2, keepdim=True)[0], 1, keepdim=True)[0]
            max_val = torch.max(torch.max(torch.max(input, 3, keepdim=True)[0], 2, keepdim=True)[0], 1, keepdim=True)[0]

        self.update_range(min_val, max_val)

class GlobalRangeTracker(RangeTracker):  # W: min_max_shape = (N, 1, 1, 1), channel level; keeps the min/max of this batch combined with all previous ones -- input (N, C, W, H)
    def __init__(self, q_level, out_channels):
        super().__init__(q_level)
        self.register_buffer('min_val', torch.zeros(out_channels, 1, 1, 1))
        self.register_buffer('max_val', torch.zeros(out_channels, 1, 1, 1))
        self.register_buffer('first_w', torch.zeros(1))

    def update_range(self, min_val, max_val):
        temp_minval = self.min_val
        temp_maxval = self.max_val
        if self.first_w == 0:
            self.first_w.add_(1)
            self.min_val.add_(min_val)
            self.max_val.add_(max_val)
        else:
            self.min_val.add_(-temp_minval).add_(torch.min(temp_minval, min_val))
            self.max_val.add_(-temp_maxval).add_(torch.max(temp_maxval, max_val))

class AveragedRangeTracker(RangeTracker):  # A: min_max_shape = (1, 1, 1, 1), layer level; keeps a running min/max -- input (N, C, W, H)
    def __init__(self, q_level, momentum=0.1):
        super().__init__(q_level)
        self.momentum = momentum
        self.register_buffer('min_val', torch.zeros(1))
        self.register_buffer('max_val', torch.zeros(1))
        self.register_buffer('first_a', torch.zeros(1))

    def update_range(self, min_val, max_val):
        if self.first_a == 0:
            self.first_a.add_(1)
            self.min_val.add_(min_val)
            self.max_val.add_(max_val)
        else:
            self.min_val.mul_(1 - self.momentum).add_(min_val * self.momentum)
            self.max_val.mul_(1 - self.momentum).add_(max_val * self.momentum)

The self.register_buffer calls register tensors that live inside the module but are not trained: they are written and read when the model is saved and loaded (they appear in the state_dict), but they do not participate in back propagation.

In general, PyTorch stores a network's parameters in an OrderedDict (the state_dict). There are actually two kinds of entries: one is the parameters contained in the model's modules, i.e. nn.Parameter (and of course we can define extra nn.Parameter values in the network ourselves); the other is buffers. The former are updated on every optimizer.step(), the latter are not.
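A tiny example (a hypothetical module, just for illustration) showing the difference:

import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3))                  # updated by the optimizer
        self.register_buffer('running_max', torch.zeros(3))   # saved/loaded, but never trained

m = Demo()
print([name for name, _ in m.named_parameters()])  # ['w']
print(list(m.state_dict().keys()))                 # ['w', 'running_max']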

In addition, a convolution layer is usually followed by a BN layer, and to speed up inference the BN parameters are often folded into the convolution's weights and bias. Quantization aware training should therefore follow the same process: first fold the BN parameters into the convolution parameters, then quantize the folded parameters. The process is explained by this slide from Liang Depeng:

Slide by Liang Depeng

Therefore, the code comes in two versions: training-time simulated quantization without BN folding, and with BN folding. But why does the fusion look like the figure above? Look at the following formulas:

The BN layer computes y = gamma * (W * x + b - mu) / sqrt(var + eps) + beta, so:

W_fused = gamma * W / sqrt(var + eps)

b_fused = beta + gamma * (b - mu) / sqrt(var + eps)

Here W and b are the weight and bias of the convolution layer, x and y are the input and output, gamma, beta, mu and var are the BN scale, shift, mean and variance, and eps is a small constant. From these formulas the weight and bias after folding in the batch-norm parameters can be derived.
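As a standalone illustration (the helper name is mine, not from the repo), inference-time BN folding can be written as:

import torch

def fuse_conv_bn(conv_w, conv_b, gamma, beta, running_mean, running_var, eps=1e-5):
    # Fold BN parameters into the convolution weight and bias using the formulas above.
    std = torch.sqrt(running_var + eps)
    w_fused = conv_w * (gamma / std).reshape(-1, 1, 1, 1)   # scale each output channel
    if conv_b is None:
        conv_b = torch.zeros_like(running_mean)
    b_fused = beta + (conv_b - running_mean) * gamma / std
    return w_fused, b_fused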

The training-time simulated quantization code without BN folding is implemented as follows (with comments):

# ********************* quantized convolution (quantize A and W, then convolve) *********************
class Conv2d_Q(nn.Conv2d):
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_size,
        stride=1,
        padding=0,
        dilation=1,
        groups=1,
        bias=True,
        a_bits=8,
        w_bits=8,
        q_type=1,
        first_layer=0
    ):
        super().__init__(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            dilation=dilation,
            groups=groups,
            bias=bias
        )
        # instantiate the quantizers (A: layer level, W: channel level)
        if q_type == 0:
            self.activation_quantizer = SymmetricQuantizer(bits=a_bits, range_tracker=AveragedRangeTracker(q_level='L'))
            self.weight_quantizer = SymmetricQuantizer(bits=w_bits, range_tracker=GlobalRangeTracker(q_level='C', out_channels=out_channels))
        else:
            self.activation_quantizer = AsymmetricQuantizer(bits=a_bits, range_tracker=AveragedRangeTracker(q_level='L'))
            self.weight_quantizer = AsymmetricQuantizer(bits=w_bits, range_tracker=GlobalRangeTracker(q_level='C', out_channels=out_channels))
        self.first_layer = first_layer

    def forward(self, input):
        # quantize A and W
        if not self.first_layer:
            input = self.activation_quantizer(input)
        q_input = input
        q_weight = self.weight_quantizer(self.weight)
        # quantized convolution
        output = F.conv2d(
            input=q_input,
            weight=q_weight,
            bias=self.bias,
            stride=self.stride,
            padding=self.padding,
            dilation=self.dilation,
            groups=self.groups
        )
        return output

The code implementation with BN folding is as follows (with comments):

def reshape_to_activation(input):
    return input.reshape(1, -1, 1, 1)

def reshape_to_weight(input):
    return input.reshape(-1, 1, 1, 1)

def reshape_to_bias(input):
    return input.reshape(-1)

# ********************* BN folding + quantized convolution (after BN folding, quantize A and W, then convolve) *********************
class BNFold_Conv2d_Q(Conv2d_Q):
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_size,
        stride=1,
        padding=0,
        dilation=1,
        groups=1,
        bias=False,
        eps=1e-5,
        momentum=0.01,  # a smaller momentum weakens the contribution of the batch statistics and suppresses the jitter caused by quantization; in experiments quantized training works better, acc up by about 1%
        a_bits=8,
        w_bits=8,
        q_type=1,
        first_layer=0
    ):
        super().__init__(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            dilation=dilation,
            groups=groups,
            bias=bias
        )
        self.eps = eps
        self.momentum = momentum
        self.gamma = Parameter(torch.Tensor(out_channels))
        self.beta = Parameter(torch.Tensor(out_channels))
        self.register_buffer('running_mean', torch.zeros(out_channels))
        self.register_buffer('running_var', torch.ones(out_channels))
        self.register_buffer('first_bn', torch.zeros(1))
        init.uniform_(self.gamma)
        init.zeros_(self.beta)

        # instantiate the quantizers (A: layer level, W: channel level)
        if q_type == 0:
            self.activation_quantizer = SymmetricQuantizer(bits=a_bits, range_tracker=AveragedRangeTracker(q_level='L'))
            self.weight_quantizer = SymmetricQuantizer(bits=w_bits, range_tracker=GlobalRangeTracker(q_level='C', out_channels=out_channels))
        else:
            self.activation_quantizer = AsymmetricQuantizer(bits=a_bits, range_tracker=AveragedRangeTracker(q_level='L'))
            self.weight_quantizer = AsymmetricQuantizer(bits=w_bits, range_tracker=GlobalRangeTracker(q_level='C', out_channels=out_channels))
        self.first_layer = first_layer

    def forward(self, input):
        # training mode
        if self.training:
            # run an ordinary convolution first to get the activations for the BN statistics
            output = F.conv2d(
                input=input,
                weight=self.weight,
                bias=self.bias,
                stride=self.stride,
                padding=self.padding,
                dilation=self.dilation,
                groups=self.groups
            )
            # update the BN statistics (batch and running)
            dims = [dim for dim in range(4) if dim != 1]
            batch_mean = torch.mean(output, dim=dims)
            batch_var = torch.var(output, dim=dims)
            with torch.no_grad():
                if self.first_bn == 0:
                    self.first_bn.add_(1)
                    self.running_mean.add_(batch_mean)
                    self.running_var.add_(batch_var)
                else:
                    self.running_mean.mul_(1 - self.momentum).add_(batch_mean * self.momentum)
                    self.running_var.mul_(1 - self.momentum).add_(batch_var * self.momentum)
            # BN folding
            if self.bias is not None:
                bias = reshape_to_bias(self.beta + (self.bias - batch_mean) * (self.gamma / torch.sqrt(batch_var + self.eps)))
            else:
                bias = reshape_to_bias(self.beta - batch_mean * (self.gamma / torch.sqrt(batch_var + self.eps)))   # bias folds in the batch statistics
            weight = self.weight * reshape_to_weight(self.gamma / torch.sqrt(self.running_var + self.eps))         # weight folds in the running statistics
        # test mode
        else:
            # print(self.running_mean, self.running_var)
            # BN folding
            if self.bias is not None:
                bias = reshape_to_bias(self.beta + (self.bias - self.running_mean) * (self.gamma / torch.sqrt(self.running_var + self.eps)))
            else:
                bias = reshape_to_bias(self.beta - self.running_mean * (self.gamma / torch.sqrt(self.running_var + self.eps)))   # bias folds in the running statistics
            weight = self.weight * reshape_to_weight(self.gamma / torch.sqrt(self.running_var + self.eps))                       # weight folds in the running statistics

        # quantize A and the BN-folded W
        if not self.first_layer:
            input = self.activation_quantizer(input)
        q_input = input
        q_weight = self.weight_quantizer(weight)
        # quantized convolution
        if self.training:  # training mode
            output = F.conv2d(
                input=q_input,
                weight=q_weight,
                bias=self.bias,  # note: no bias is added here (self.bias is None)
                stride=self.stride,
                padding=self.padding,
                dilation=self.dilation,
                groups=self.groups
            )
            # (convert the weight's folded running statistics into the effect of folding the batch statistics, in training mode) running -> batch
            output *= reshape_to_activation(torch.sqrt(self.running_var + self.eps) / torch.sqrt(batch_var + self.eps))
            output += reshape_to_activation(bias)
        else:  # test mode
            output = F.conv2d(
                input=q_input,
                weight=q_weight,
                bias=bias,  # note: bias is added here, giving a complete conv + bn
                stride=self.stride,
                padding=self.padding,
                dilation=self.dilation,
                groups=self.groups
            )
        return output

Note that during training the quantized convolution is called with bias=self.bias, which is None (the layer is created with bias=False), and the folded bias is added afterwards; in other words, the bias is not quantized.

7. Experimental results

The Quantization Aware Training experiments are done on CIFAR10, with the following network structure:

import torch
import torch.nn as nn
import torch.nn.functional as F
from .util_wqaq import Conv2d_Q, BNFold_Conv2d_Q

class QuanConv2d(nn.Module):
    def __init__(self, input_channels, output_channels,
                 kernel_size=-1, stride=-1, padding=-1, groups=1, last_relu=0, abits=8, wbits=8, bn_fold=0, q_type=1, first_layer=0):
        super(QuanConv2d, self).__init__()
        self.last_relu = last_relu
        self.bn_fold = bn_fold
        self.first_layer = first_layer

        if self.bn_fold == 1:
            self.bn_q_conv = BNFold_Conv2d_Q(input_channels, output_channels,
                                             kernel_size=kernel_size, stride=stride, padding=padding, groups=groups, a_bits=abits, w_bits=wbits, q_type=q_type, first_layer=first_layer)
        else:
            self.q_conv = Conv2d_Q(input_channels, output_channels,
                                   kernel_size=kernel_size, stride=stride, padding=padding, groups=groups, a_bits=abits, w_bits=wbits, q_type=q_type, first_layer=first_layer)
            self.bn = nn.BatchNorm2d(output_channels, momentum=0.01)  # a smaller momentum weakens the contribution of the batch statistics and suppresses the jitter caused by quantization; in experiments quantized training works better, acc up by about 1%
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        if not self.first_layer:
            x = self.relu(x)
        if self.bn_fold == 1:
            x = self.bn_q_conv(x)
        else:
            x = self.q_conv(x)
            x = self.bn(x)
        if self.last_relu:
            x = self.relu(x)
        return x

class Net(nn.Module):
    def __init__(self, cfg=None, abits=8, wbits=8, bn_fold=0, q_type=1):
        super(Net, self).__init__()
        if cfg is None:
            cfg = [192, 160, 96, 192, 192, 192, 192, 192]

        # the model: A and W fully quantized (except input and output)
        self.quan_model = nn.Sequential(
            QuanConv2d(3, cfg[0], kernel_size=5, stride=1, padding=2, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type, first_layer=1),
            QuanConv2d(cfg[0], cfg[1], kernel_size=1, stride=1, padding=0, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            QuanConv2d(cfg[1], cfg[2], kernel_size=1, stride=1, padding=0, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            QuanConv2d(cfg[2], cfg[3], kernel_size=5, stride=1, padding=2, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            QuanConv2d(cfg[3], cfg[4], kernel_size=1, stride=1, padding=0, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            QuanConv2d(cfg[4], cfg[5], kernel_size=1, stride=1, padding=0, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),

            QuanConv2d(cfg[5], cfg[6], kernel_size=3, stride=1, padding=1, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            QuanConv2d(cfg[6], cfg[7], kernel_size=1, stride=1, padding=0, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            QuanConv2d(cfg[7], 10, kernel_size=1, stride=1, padding=0, last_relu=1, abits=abits, wbits=wbits, bn_fold=bn_fold, q_type=q_type),
            nn.AvgPool2d(kernel_size=8, stride=1, padding=0)
        )

    def forward(self, x):
        x = self.quan_model(x)
        x = x.view(x.size(0), -1)
        return x
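As a quick sanity check (my own snippet, not part of the original training script), the network can be instantiated and run on a dummy CIFAR10-sized batch:

model = Net(abits=8, wbits=8, bn_fold=1, q_type=0)   # symmetric quantization with BN folding
x = torch.randn(4, 3, 32, 32)                        # a fake CIFAR10 batch
print(model(x).shape)                                # expected: torch.Size([4, 10])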

The models are trained for 30 epochs, with the following learning-rate schedule:

def adjust_learning_rate(optimizer, epoch):
    if args.bn_fold == 1:
        if args.model_type == 0:
            update_list = [12, 15, 25]
        else:
            update_list = [8, 12, 20, 25]
    else:
        update_list = [15, 17, 20]
    if epoch in update_list:
        for param_group in optimizer.param_groups:
            param_group['lr'] = param_group['lr'] * 0.1
    return
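For context, a minimal training loop driving this schedule might look like the following (model, optimizer, train_loader and args are assumed to be set up elsewhere, so this is only a sketch):

for epoch in range(30):
    adjust_learning_rate(optimizer, epoch)   # decay the LR at the epochs listed above
    model.train()
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(data), target)
        loss.backward()
        optimizer.step()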

Type                                      Acc       Note
Original model (NIN)                      91.01%    full precision
Symmetric quantization, BN not folded     88.88%    INT8
Symmetric quantization, BN folded         86.66%    INT8
Asymmetric quantization, BN not folded    88.89%    INT8
Asymmetric quantization, BN folded        87.30%    INT8

It is not clear why quantization loses 1-2 points of accuracy here. According to Liang Depeng's experiments in MXNet, this classification task should not lose accuracy, so I do not know whether there is a problem with this code; experienced readers are welcome to point it out.

The white paper also reports the quantization-aware-training accuracies of several classification networks; see the tables in https://arxiv.org/abs/1806.08342 for those results.

That is how to implement quantization aware training of convolutional neural networks in PyTorch. If you have similar questions, the analysis above may help you understand the topic; to go further, dig into the linked paper and repository.
