What are the practical tips for PyTorch model training?


This article explains in detail the practical tips for PyTorch model training. It is shared here for your reference; I hope you will have a solid understanding of the topic after reading it.

Introduction

A step-by-step guide that is very practical.

Let's face it, your model may still be stuck in the Stone Age. I bet you're still using 32-bit precision or, gasp, perhaps even training on only one GPU.

I know there are all kinds of neural network acceleration guides online, but there was no checklist (there is now). Use this list to make sure you squeeze every last bit of performance out of your model, step by step.

This guide goes from the simplest changes to the most complex ones that get the most out of your network. I'll show you sample Pytorch code and the related flags you can use with the Pytorch-Lightning Trainer, so you don't have to write the code yourself!

Pytorch-Lightning

You can find every optimization I discuss here in the Pytorch library Pytorch-Lightning. Lightning is a wrapper on top of Pytorch that automates training while giving researchers complete control over the key model components. Lightning uses the latest best practices and minimizes the places where you could make mistakes.

We define a LightningModule for MNIST and use the Trainer to train the model.

from pytorch_lightning import Trainer

model = LightningModule(…)
trainer = Trainer()
trainer.fit(model)

1. DataLoaders

This is probably the easiest place to gain speed. Gone are the days of saving h5py or numpy files to speed up data loading; loading image data with the Pytorch DataLoader is easy (for NLP data, see TorchText).

In Lightning, you don't need to specify a training loop; just define the DataLoaders and the Trainer will call them when needed.

dataset = MNIST(root=self.hparams.data_root, train=train, download=True)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    x, y = batch
    model.training_step(x, y)
    ...

2. Number of workers in DataLoaders

Another bit of acceleration magic is loading batches in parallel: instead of loading one batch at a time, you can load num_workers batches at once.

# slow
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# fast (use 10 workers)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=10)
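How many workers is best depends on your machine. As a rough starting point (my own rule of thumb, not something from this article), use one worker per CPU core and tune from there. A minimal sketch, reusing the dataset from the snippet above:

import os
from torch.utils.data import DataLoader

# start with one worker per CPU core, then adjust based on profiling
num_cpus = os.cpu_count() or 1
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=num_cpus)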

3. Batch size

Before starting the next optimization steps, increase the batch size to the maximum your CPU RAM or GPU RAM allows.

The next sections focus on reducing the memory footprint so that you can keep increasing the batch size.

Remember, you will likely need to update your learning rate as well. A good rule of thumb is: if you double the batch size, double the learning rate.
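As a toy illustration of that rule of thumb (my numbers, not the article's): if the batch size grows from 32 to 128, scale the learning rate by the same factor of 4.

# hypothetical starting point
base_batch_size = 32
base_lr = 0.01

# batch size increased 4x, so scale the learning rate 4x as well
new_batch_size = 128
new_lr = base_lr * (new_batch_size / base_batch_size)  # 0.04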

4. Gradient accumulation

By the time you hit the limit of your computing resources, your batch size may still be too small (say, 8). In that case, we need to simulate a larger batch size so that gradient descent gets a good estimate of the gradient.

Suppose we want to reach a batch size of 128. We perform 16 forward and backward passes with a batch size of 8, and then perform a single optimization step.

# clear last step
optimizer.zero_grad()

# 16 accumulated gradient steps
scaled_loss = 0
for accumulated_step_i in range(16):
    out = model.forward()
    loss = some_loss(out, y)
    loss.backward()
    scaled_loss += loss.item()

# update weights after 16 accumulated steps. effective batch = 8 * 16
optimizer.step()

# loss is now scaled up by the number of accumulated batches
actual_loss = scaled_loss / 16

In lightning, everything is done for you, just set accumulate_grad_batches=16:

trainer = Trainer(accumulate_grad_batches=16)
trainer.fit(model)

5. Retained computation graphs

One of the easiest ways to blow up your memory is to log and store your losses.

losses = []

...
losses.append(loss)
print(f'current loss: {torch.mean(losses)}')

The problem with the above is that loss still holds a copy of the entire computation graph. In this case, call .item() to release it.

# bad
losses.append(loss)

# good
losses.append(loss.item())
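If you need to keep the tensor itself rather than a Python float, .detach() also drops the attached graph (a general Pytorch idiom, not something from this article):

# also fine: detach() keeps the tensor but drops the computation graph
losses.append(loss.detach())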

Lightning takes great care to make sure a copy of the computation graph is never retained.

6. Single GPU training

Once you have completed the previous steps, it's time to move on to GPU training. Training on a GPU parallelizes the math across the many GPU cores. The speedup you get depends on the type of GPU you are using. I recommend the 2080Ti for individuals and the V100 for companies.

At first glance, this may overwhelm you, but you really only need to do two things: 1) move your model to GPU, and 2) whenever you run the data through it, put the data on the GPU.

# put model on GPU
model.cuda(0)

# put data on gpu (cuda on a variable returns a cuda copy)
x = x.cuda(0)

# runs on GPU now
model(x)

If you use Lightning, you don't have to do anything except set Trainer(gpus=1).

# ask lightning to use gpu 0 for training
trainer = Trainer(gpus=[0])
trainer.fit(model)

When training on GPU, the main thing to pay attention to is to limit the number of transfers between CPU and GPU.

# expensive
x = x.cuda(0)

# very expensive
x = x.cpu()
x = x.cuda(0)

If you run out of memory, do not move data back to the CPU to save memory. Before resorting to that, try to optimize your code or the distribution of memory across your GPUs in other ways.
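One general Pytorch idiom that helps limit transfers (my addition, not from the original article) is to create tensors directly on the GPU rather than creating them on the CPU and copying them over:

import torch

# bad: allocates on the CPU, then copies to the GPU
x = torch.rand(32, 128).cuda(0)

# better: allocates directly on the GPU, no CPU-to-GPU transfer
x = torch.rand(32, 128, device=torch.device('cuda:0'))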

Another thing to watch out for is calling operations that force GPU synchronization. Clearing the memory cache is one example.

# really bad idea. Stops all the GPUs until they all catch up
torch.cuda.empty_cache()

If you use Lightning, the only place such a problem can occur is in your LightningModule definition; Lightning itself is careful not to make these mistakes.

7. 16-bit precision

16-bit precision is an amazing technique that halves the memory footprint. Most models are trained with 32-bit precision numbers. However, recent research has found that 16-bit models can also work well. Mixed precision means using 16-bit for some operations while keeping things like the weights at 32-bit.

To use 16-bit precision in Pytorch, install NVIDIA's apex library and make these changes to your model.

from apex import amp

# enable 16-bit on the model and the optimizer
model, optimizers = amp.initialize(model, optimizers, opt_level='O2')

# when doing .backward, let amp do it so it can scale the loss
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

The amp package takes care of most of the work. It will even scale the loss if the gradients explode or go to zero.
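If you would rather not install apex, newer Pytorch releases ship native mixed precision as torch.cuda.amp. Here is a minimal sketch under that assumption, reusing the model, loader, optimizer, and some_loss names from the snippets above:

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for x, y in loader:
    optimizer.zero_grad()
    # run the forward pass in 16-bit where it is safe to do so
    with autocast():
        out = model(x.cuda(0))
        loss = some_loss(out, y.cuda(0))
    # scale the loss so 16-bit gradients don't underflow, then step
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()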

In Lightning, enabling 16-bit requires no changes to your model and none of the code above. Just set Trainer(precision=16).

trainer = Trainer(precision=16)
trainer.fit(model)

8. Move to multiple GPUs

Now things get really interesting. There are three (maybe more?) ways to do multi-GPU training.

Sub-batch training

A) copy the model into each GPU, B) give each GPU a portion of the batch

The first method is called "sub-batch training". This strategy copies the model to each GPU, and each GPU gets a portion of the batch.

from torch.nn import DataParallel

# copy model on each GPU and give a fourth of the batch to each
model = DataParallel(model, device_ids=[0, 1, 2, 3])

# out has 4 outputs (one for each gpu)
out = model(x.cuda(0))

In Lightning, you just need to tell the Trainer how many GPUs to use; you don't have to do anything else.

# ask lightning to use 4 GPUs for training
trainer = Trainer(gpus=[0, 1, 2, 3])
trainer.fit(model)

Model-parallel training

Put different parts of the model on different GPUs; the batch moves through them sequentially.

Sometimes your model may be too big to fit entirely in memory. For example, a sequence-to-sequence model with an encoder and a decoder may take up 20GB of RAM when generating output. In this case, we want to put the encoder and the decoder on separate GPUs.

# each model is sooo big we can't fit both in memory
encoder_rnn.cuda(0)
decoder_rnn.cuda(1)

# run input through encoder on GPU 0
encoder_out = encoder_rnn(x.cuda(0))

# run output through decoder on the next GPU
out = decoder_rnn(encoder_out.cuda(1))

# normally we want to bring all outputs back to GPU 0
out = out.cuda(0)

For this type of training, you don't need to specify any GPUs to the Lightning Trainer; instead, you put the submodules of your LightningModule on the correct GPUs yourself.

class MyModule(LightningModule):

    def __init__(self):
        self.encoder = RNN(...)
        self.decoder = RNN(...)

    def forward(self, x):
        # models won't be moved after the first forward because
        # they are already on the correct GPUs
        self.encoder.cuda(0)
        self.decoder.cuda(1)

        out = self.encoder(x)
        out = self.decoder(out.cuda(1))

# don't pass GPUs to trainer
model = MyModule()
trainer = Trainer()
trainer.fit(model)

Mixture of the two

In the case above, the encoder and the decoder can each still benefit from data parallelization.

# change these lines
self.encoder = RNN(...)
self.decoder = RNN(...)

# to these
# now each RNN is based on a different gpu set
self.encoder = DataParallel(self.encoder, device_ids=[0, 1, 2, 3])
self.decoder = DataParallel(self.decoder, device_ids=[4, 5, 6, 7])

# in forward...
out = self.encoder(x.cuda(0))

# notice inputs on first gpu in device
out = self.decoder(out.cuda(4))
