This article explains in detail the methods of multi-GPU training in PyTorch. The editor finds it very practical and shares it here as a reference; I hope you get something out of it after reading.
Method 1: torch.nn.DataParallel

1. Principle
Consider the following analogy: a child has 4 assignments to do alone; if each takes 60 min, the total is 240 min.
The assignments here stand for the data to be processed in PyTorch.
Alternatively, the child can spend 3 min handing the assignments out to three helpers, all four then work for 60 min at the same time, and finally the child spends 3 min collecting the results, for a total of 66 min.
The child is the main GPU. Its workflow is: distribute -> run in parallel -> gather the results.
This is the first parallel method provided by PyTorch: torch.nn.DataParallel.
This method is also known as the single-process multi-GPU training mode (DP mode): all the parallel cards are controlled by a single process; in other words, gradient propagation is carried out on the main GPU.
When using torch.nn.DataParallel for multi-GPU parallel training, the matching data-reading class is torch.utils.data.DataLoader.
2. The commonly used matching code is as follows:

train_datasets = customData(train_txt)  # create dataset
train_dataloaders = torch.utils.data.DataLoader(
    train_datasets, opt.batch_size, num_workers=train_num_workers, shuffle=True)  # create dataloader
model = efficientnet_b0(num_classes=opt.num_class)  # create model
device_list = list(map(int, list(opt.device_id)))
print("Using gpu", "".join([str(v) for v in device_list]))
device = device_list[0]  # main GPU: the one that distributes work, gathers results and updates gradients
model = torch.nn.DataParallel(model, device_ids=device_list)
model.to(device)
for data in train_dataloaders:
    model.train(True)
    inputs, labels = data
    inputs = Variable(inputs.to(device))  # put the data onto the main GPU
    labels = Variable(labels.to(device))

3. Advantages and disadvantages
Advantages: very easy to configure.
Disadvantages: the GPU load is unbalanced; the main GPU carries a heavy load while the other GPUs are lightly loaded.
Method 2: torch.distributed

1. Code description
This method was originally intended for multi-machine, multi-card (multi-node, multi-GPU) training, but it can also be used for single-machine multi-GPU training (i.e. with the number of nodes set to 1).
The initialization code is as follows and must be placed at the very top of the script:
from torch.utils.data.distributed import DistributedSampler
torch.distributed.init_process_group(backend="nccl")
Here is a simple demo.py as an illustration:
import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader
import os
from torch.utils.data.distributed import DistributedSampler

# 1) initialize
torch.distributed.init_process_group(backend="nccl")

input_size = 5
output_size = 2
batch_size = 30
data_size = 90

# 2) configure the gpu for each process
local_rank = torch.distributed.get_rank()
print('local_rank', local_rank)
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size).to('cuda')

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

dataset = RandomDataset(input_size, data_size)
# 3) use DistributedSampler
rand_loader = DataLoader(dataset=dataset, batch_size=batch_size,
                         sampler=DistributedSampler(dataset))

class Model(nn.Module):
    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("In Model: input size", input.size(),
              "output size", output.size())
        return output

model = Model(input_size, output_size)

# 4) move the model to the corresponding gpu
model.to(device)

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")  # printed before wrapping

# 5) wrap the model
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank],
                                                  output_device=local_rank)

for data in rand_loader:
    if torch.cuda.is_available():
        input_var = data
    else:
        input_var = data
    output = model(input_var)
    print("Outside: input size", input_var.size(),
          "output_size", output.size())
(1) Startup method: torch.distributed provides a launcher program, torch.distributed.launch, for starting training. This helper can be used to launch multiple processes per node for distributed training; it spawns several distributed training processes on each training node.
(2) Launch command:
CUDA_VISIBLE_DEVICES=1,2,3,4 python -m torch.distributed.launch --nproc_per_node=2 torch_ddp.py
The parameters are explained below:
CUDA_VISIBLE_DEVICES: sets the IDs of the GPUs we are allowed to use.
torch.distributed.launch: used to launch multi-node, multi-GPU training (an illustrative multi-node launch command is sketched after this list).
nproc_per_node: the number of processes to start; it is usually set to the number of available GPUs, i.e. one process per available GPU.
local_rank: the meaning of this parameter is explained in the scenarios below.
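Since the launcher's multi-node role is only mentioned in passing above, here is a minimal sketch of what a two-node launch could look like. It is only an illustration: the node count, master address and port below are placeholder assumptions, and one command is run on each node.

# on node 0 (the master); assuming 2 nodes with 4 GPUs each, the address/port are placeholders
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node=4 \
    --master_addr="192.168.1.1" --master_port=29500 demo.py
# on node 1
python -m torch.distributed.launch --nnodes=2 --node_rank=1 --nproc_per_node=4 \
    --master_addr="192.168.1.1" --master_port=29500 demo.py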
(3) Descriptions of several scenarios:
Scenario 1: run the above command directly.
The output is as follows:
local_rank 1
local_rank 0
Let's use 4 GPUs!
Let's use 4 GPUs!
In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([15, 5]) output_size torch.Size([15, 2])
In Model: input size torch.Size([30, 5]) output size torch.Size([30, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([15, 5]) output_size torch.Size([15, 2])
You can see that local_rank is printed as 0 and 1: the count matches the nproc_per_node we set, not the number of GPUs we made visible. (Each process also receives 45 of the 90 samples from DistributedSampler, which is why each prints one batch of 30 and one of 15.) This is the meaning of local_rank.
local_rank: the number of the current process on the current node; since we started 2 processes, the process numbers are 0 and 1.
Many blogs state flatly that local_rank is the GPU number used by the process, which is not accurate: this number is not a GPU ID!
When the launch command is used, the torch.distributed.launch tool passes a --local_rank argument to each of the nproc_per_node processes by default, and the value can then be obtained:
local_rank = torch.distributed.get_rank()
(Strictly speaking, torch.distributed.get_rank() returns the global rank; on a single node with one process per GPU it coincides with local_rank, which is why it is used here.)
Because the --local_rank argument is passed in by default, you can also write it as below, and the output is the same as torch.distributed.get_rank():
import argparse

parser = argparse.ArgumentParser()
# Note: this argument must be declared in this form even if it is not used in the code,
# because the launch tool passes it in by default.
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()
local_rank = args.local_rank
print('local_rank', args.local_rank)
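For completeness: depending on the PyTorch version, the launcher can also expose this value as an environment variable instead of a command-line argument (for example when torch.distributed.launch is started with --use_env, or when the newer torchrun launcher is used). A minimal sketch, assuming the launcher sets LOCAL_RANK:

import os

# assumes the launcher (torch.distributed.launch --use_env, or torchrun) sets LOCAL_RANK;
# falls back to 0 so the snippet also runs without a launcher
local_rank = int(os.environ.get("LOCAL_RANK", 0))
print('local_rank', local_rank)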
Scenario 2: set nproc_per_node to 4, i.e. the number of processes equals the number of available GPUs.
The output is as follows:
local_rank 2
local_rank 3
local_rank 1
local_rank 0
Let's use 4 GPUs!
Let's use 4 GPUs!
Let's use 4 GPUs!
Let's use 4 GPUs!
In Model: input size torch.Size([23, 5]) output size torch.Size([23, 2])
Outside: input size torch.Size([23, 5]) output_size torch.Size([23, 2])
In Model: input size torch.Size([23, 5]) output size torch.Size([23, 2])
Outside: input size torch.Size([23, 5]) output_size torch.Size([23, 2])
In Model: input size torch.Size([23, 5]) output size torch.Size([23, 2])
Outside: input size torch.Size([23, 5]) output_size torch.Size([23, 2])
In Model: input size torch.Size([23, 5]) output size torch.Size([23, 2])
Outside: input size torch.Size([23, 5]) output_size torch.Size([23, 2])
You can see that there are now 4 local_rank values, the same as the number of processes, and each process gets a single batch of 23 samples (DistributedSampler pads the 90 samples to 92 so they divide evenly among 4 processes). The available GPU IDs we set are 1, 2, 3, 4, while the printed local_rank values are 0, 1, 2, 3, which again shows that local_rank is not the GPU ID.
Although device_ids in the code is set to [local_rank], and local_rank takes the values 0, 1, 2 and 3, the GPUs actually used are the available ones 1, 2, 3, 4. You can check this with nvidia-smi; the process PIDs are 86478, 86479, 86480, 864782.
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
Scenario 3: set nproc_per_node to 4, but do not set the available GPU IDs
python -m torch.distributed.launch --nproc_per_node=4 ddp.py
Checking GPU usage with nvidia-smi at this point shows that the GPU IDs in use are the same as the local_rank values. Comparing this with scenario 2, we can summarize:
When no available GPU IDs are set, the GPU ID used equals local_rank; in essence the process number is used directly as the GPU number, so the definition of local_rank as the process number still holds.
When the available GPU IDs are set (via CUDA_VISIBLE_DEVICES), the GPUs actually used are those IDs; local_rank then indexes into the list of visible devices rather than naming a physical GPU. A small verification sketch follows.
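To check this mapping yourself, here is a minimal sketch that each process could run after torch.cuda.set_device(local_rank), assuming local_rank is defined as in demo.py. The printed index is the logical device index seen by the process, i.e. after any CUDA_VISIBLE_DEVICES remapping, while nvidia-smi reports physical GPU IDs.

import torch

# print which logical CUDA device this process ended up on;
# with CUDA_VISIBLE_DEVICES set, logical index 0 maps to the first visible physical GPU
current = torch.cuda.current_device()
print("local_rank", local_rank,
      "-> logical cuda device", current,
      torch.cuda.get_device_name(current))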
Scenario 4: set nproc_per_node to 5, which exceeds the number of usable GPUs
The output is as follows; you can see that an error is raised because the number of processes exceeds the number of usable GPUs:
local_rank 3
local_rank 2
local_rank 4
local_rank 1
local_rank 0
THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=59 error=101: invalid device ordinal
Traceback (most recent call last):
  File "ddp.py", line 18, in <module>
    torch.cuda.set_device(local_rank)
  File "/home/yckj3822/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 281, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error: invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59
This concludes the article on the ways of doing multi-GPU training in PyTorch. I hope the content above is helpful and lets you learn something new; if you think the article is good, please share it so more people can see it.