This article mainly looks at what problems need attention in PyTorch half-precision network training. The content is simple and clear, and I hope it helps resolve your doubts. Let me walk you through it.
When training a network with half-precision floating point in PyTorch, pay attention to the following problems:
1. To run the network on GPU, both the model and the input sample data need .cuda().half() (see the sketch after this list).
2. Converting the model parameters to half precision does not require indexing into each layer; calling model.cuda().half() directly is enough.
3. For the half-precision model with the optimizer I use, Adam, when the gradient of some parameters is 0, those zero-gradient weights become NaN after the weight update, which is very strange; Adam does not have this problem with the full-precision data type.
In addition, the SGD algorithm has no such problem with either half-precision or full-precision computation.
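As a minimal sketch of points 1 and 2 (assuming a CUDA GPU; the layer sizes and the MSE criterion are placeholders standing in for whatever loss you actually use), converting the model and data with .cuda().half() could look like this:

import torch

D_in, D_out = 64, 10                      # hypothetical dimensions for illustration
model = torch.nn.Linear(D_in, D_out)
criterion = torch.nn.MSELoss()            # stand-in loss function

# Points 1 and 2: move the whole model to GPU and cast it to half in one call,
# no need to touch individual layers.
model = model.cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# The input samples also need .cuda().half().
img = torch.randn(8, D_in).cuda().half()
label = torch.randn(8, D_out).cuda().half()

out = model(img)
loss = criterion(out, label)
loss.backward()
optimizer.step()
optimizer.zero_grad()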
Another question: I do not know whether it is because the network structure is relatively small, but training with half precision was not faster than with full precision. This is worth further exploration.
Regarding the problem above: it is true that when the network is very small, half-precision floating point has no obvious speed advantage on a 1080Ti, but when the network gets larger, half precision is faster than full precision.
The exact speedup depends on the size of the model and the size of the input samples; in my test it was roughly 1/6 faster. At the same time, half precision has an advantage in memory usage; the impact on accuracy has not been explored.
If the network gets larger, the number of epochs also increases, and the time difference between half precision and full precision shows up during training.
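A rough way to check this on one's own hardware (just a sketch, not the author's benchmark; the layer sizes, batch size, and MSE criterion are made up) is to time some forward/backward steps in fp32 and fp16 with proper CUDA synchronization:

import time
import torch

def time_steps(model, img, label, criterion, n_steps=50):
    # Time n_steps forward/backward passes; synchronize so GPU work is counted.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_steps):
        loss = criterion(model(img), label)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    torch.cuda.synchronize()
    return time.time() - start

D_in, D_hidden, D_out, batch = 1024, 4096, 1024, 256   # made-up sizes
net = torch.nn.Sequential(torch.nn.Linear(D_in, D_hidden),
                          torch.nn.ReLU(),
                          torch.nn.Linear(D_hidden, D_out))
criterion = torch.nn.MSELoss()
x = torch.randn(batch, D_in)
y = torch.randn(batch, D_out)

t_fp32 = time_steps(net.cuda().float(), x.cuda().float(), y.cuda().float(), criterion)
t_fp16 = time_steps(net.cuda().half(), x.cuda().half(), y.cuda().half(), criterion)
print(f'fp32: {t_fp32:.3f}s  fp16: {t_fp16:.3f}s')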
Addendum: the difference between half-precision, mixed-precision, and single-precision training in PyTorch, and amp.initialize
Look at the code:

mixed_precision = True
try:  # Mixed precision training https://github.com/NVIDIA/apex
    from apex import amp
except:
    mixed_precision = False  # not installed

model, optimizer = amp.initialize(model, optimizer, opt_level='O1', verbosity=1)
To help improve training efficiency in PyTorch, NVIDIA provides the mixed-precision training tool Apex. It claims to speed up model training by 2-4x without reducing performance, and to cut training memory consumption to half of what it was.
The documentation is at: https://nvidia.github.io/apex/index.html
The tool provides three modules: amp, parallel, and normalization. Since it is still at version 0.1, the functionality is quite basic; the normalization module only provides a reimplementation of the LayerNorm layer. In fact, in later use we will find that most problems come from PyTorch's BN layers.
The second module is a reimplementation of PyTorch distributed training. The documentation describes it as equivalent to the implementation in PyTorch, and either one can be used in code. In practice, when training with mixed precision, using Apex's reimplemented parallel tools can avoid some bugs.
The default training method is single precision, float32:

import torch

model = torch.nn.Linear(D_in, D_out)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for img, label in dataloader:
    out = model(img)
    loss = LOSS(out, label)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Half precision: model(img.half())
Next is the implementation of mixed precision, which mainly uses Apex's amp tool.
The code is modified to:
Add this line:

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

The full code becomes:

import torch

model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for img, label in dataloader:
    out = model(img)
    loss = LOSS(out, label)
    # loss.backward()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
    optimizer.zero_grad()
The actual process is: call amp.initialize to configure the model and optimizer according to the chosen opt_level, and wrap the loss with amp.scale_loss for the backward pass.
You should pay attention to the following points:
Before calling amp.initialize, the model needs to be placed on the GPU, that is, you need to call cuda() or to() first.
The model must not go through any distributed-setup functions before calling amp.initialize.
At this point, the input data does not need to be converted to half precision.
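Putting those points together, the required ordering looks roughly like this (a sketch assuming Apex is installed; the sizes are made up, and the distributed wrapper is shown commented out only to illustrate that it must come after amp.initialize):

import torch
from apex import amp
# from apex.parallel import DistributedDataParallel   # only needed for distributed runs

D_in, D_out = 64, 10                                   # hypothetical sizes
model = torch.nn.Linear(D_in, D_out).cuda()            # 1) put the model on GPU first
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# 2) call amp.initialize before any distributed wrapping of the model
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
# model = DistributedDataParallel(model)               # 3) wrap afterwards, if needed

# 4) the input data stays in float32; amp handles the casts internally
img = torch.randn(8, D_in).cuda()
out = model(img)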
When computing with mixed precision, the most critical parameter is opt_level. It has four possible settings: 'O0', 'O1', 'O2', 'O3' (the letter O followed by a digit). The full amp.initialize call actually takes many more input parameters.
In actual use, however, it is enough to set only opt_level, which is also how the examples in the documentation do it; under different opt_level settings the other parameters become ineffective. (Known bug: setting keep_batchnorm_fp32 while using 'O1' causes an error.)
To sum up:
O0 is equivalent to the original single-precision training. O1 uses half precision for most computations, but all model parameters are kept in single precision, and a small number of operations that work better in single precision (such as softmax) stay in single precision. O2, compared with O1, also converts the model parameters to half precision.
O3 is basically equivalent to the pure half-precision run in the initial experiment. It is worth mentioning that regardless of whether the model runs in half precision during optimization, the saved model is a single-precision model, which guarantees that it can be used normally in other applications. This is also a big selling point of Apex.
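One way to see the difference between the levels for yourself (a sketch assuming Apex is installed; the layer sizes are made up) is to initialize a small model and inspect the parameter dtype:

import torch
from apex import amp

model = torch.nn.Linear(64, 10).cuda()                 # made-up sizes
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Under 'O0'/'O1' the stored parameters stay float32 (half is used inside patched ops);
# rerunning with opt_level='O2' or 'O3' would print torch.float16 here instead.
model, optimizer = amp.initialize(model, optimizer, opt_level='O1', verbosity=0)
print(next(model.parameters()).dtype)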
In PyTorch, the BN layer behaves differently in train and eval modes.
With a single-precision network, cuDNN is called to accelerate the computation. During regular training the BN layers are set to train mode. Apex optimizes this case through the keep_batchnorm_fp32 parameter, ensuring that the BN layers still use cuDNN and reach the best computation speed.
But in some fine-tuning scenarios the BN layers are set to eval (this is the case with my model). Then the keep_batchnorm_fp32 setting no longer takes effect, and training produces bugs caused by mismatched data types. At this point you have to manually set all BN layers to half precision, so cuDNN acceleration cannot be used.
Reference code for this setting is as follows:

def fix_bn(m):
    classname = m.__class__.__name__
    if classname.find('BatchNorm') != -1:
        m.eval().half()

model.apply(fix_bn)
In actual testing, the accuracy of the final model does not feel much different; there may be a slight drop. Training time does not change much, which may vary from model to model. GPU memory overhead is indeed greatly reduced.
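To verify the memory saving on one's own model, a rough sketch (the sizes are arbitrary, and torch.cuda.max_memory_allocated only counts memory PyTorch itself allocates) is to compare the peak allocated memory of an fp32 and an fp16 forward/backward step:

import torch

def peak_mem_mb(model, img, label, criterion):
    # Reset the peak-memory counter, run one forward/backward step, report the new peak.
    torch.cuda.reset_peak_memory_stats()
    loss = criterion(model(img), label)
    loss.backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024 ** 2

D_in, D_out, batch = 4096, 4096, 512                   # arbitrary sizes
criterion = torch.nn.MSELoss()
net = torch.nn.Linear(D_in, D_out)
x, y = torch.randn(batch, D_in), torch.randn(batch, D_out)

print('fp32 peak MB:', peak_mem_mb(net.cuda().float(), x.cuda().float(), y.cuda().float(), criterion))
print('fp16 peak MB:', peak_mem_mb(net.cuda().half(), x.cuda().half(), y.cuda().half(), criterion))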
That is all for what needs attention in PyTorch half-precision network training. Thank you for reading! I hope the content shared here is helpful.