How to use Dataset and DataLoader in Pytorch

2025-07-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to use Dataset and DataLoader in PyTorch. Many people are not very familiar with these two classes, so it is shared here for reference.

I. Preface

Make sure the following packages are installed:

scikit-image

NumPy

II. Dataset

An example:

    # Import the required packages
    import torch
    import torch.utils.data.dataset as Dataset
    import numpy as np

    # Fabricate some data
    Data = np.asarray([[1, 2], [3, 4], [5, 6], [7, 8]])
    Label = np.asarray([[0], [1], [0], [2]])
    # Data [1, 2] has the label [0]; Data [3, 4] has the label [1]

    # Create a subclass
    class subDataset(Dataset.Dataset):
        # Initialization: define the data content and labels
        def __init__(self, Data, Label):
            self.Data = Data
            self.Label = Label

        # Return the dataset size
        def __len__(self):
            return len(self.Data)

        # Get the data content and label for one index
        def __getitem__(self, index):
            data = torch.Tensor(self.Data[index])
            label = torch.IntTensor(self.Label[index])
            return data, label

    # Main function
    if __name__ == '__main__':
        dataset = subDataset(Data, Label)
        print(dataset)
        print('dataset size is:', dataset.__len__())
        print(dataset.__getitem__(0))
        print(dataset[0])

Now that we have an overall picture of Dataset, let's look at the details:

    # Create a subclass
    class subDataset(Dataset.Dataset):

The subclass inherits from Dataset.Dataset, not from Dataset. With the import used here (import torch.utils.data.dataset as Dataset), the name Dataset refers to a module, not a class. The class lives inside that module, so the base class has to be written as Dataset.Dataset.
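The module/class distinction can be checked directly. The following is a minimal sketch, assuming a standard PyTorch installation; the alias `dataset_module` and the class `DirectDataset` are illustrative names, not part of the article's code.

```python
import inspect

import torch.utils.data.dataset as dataset_module  # a module (the article aliases it as Dataset)
from torch.utils.data import Dataset                # the class itself

# The aliased name is a module, so it cannot serve as a base class directly.
print(inspect.ismodule(dataset_module))
# The class lives inside that module; torch.utils.data re-exports the same object.
print(inspect.isclass(dataset_module.Dataset))
print(dataset_module.Dataset is Dataset)


class DirectDataset(Dataset):
    """Hypothetical subclass that inherits from the directly imported class."""

    def __len__(self):
        return 0
```

Importing the class with `from torch.utils.data import Dataset` avoids the `Dataset.Dataset` spelling altogether; which style to use is a matter of preference.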

The __len__ and __getitem__ methods are the two most important ones: the former returns the size of the dataset, and the latter retrieves the data and label for a given index. Any later processing of the data is basically built on top of these two methods.
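Python routes `len(dataset)` to `__len__` and `dataset[i]` to `__getitem__`, which is why both call styles produce the same result. A minimal sketch follows; `PairDataset` and its sample values are hypothetical, and it uses `torch.tensor` with an explicit dtype rather than the `torch.Tensor` / `torch.IntTensor` constructors used in the article.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class PairDataset(Dataset):
    """Hypothetical map-style dataset holding (feature, label) pairs."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        # Called by len(dataset)
        return len(self.features)

    def __getitem__(self, index):
        # Called by dataset[index]
        x = torch.tensor(self.features[index], dtype=torch.float32)
        y = torch.tensor(self.labels[index], dtype=torch.int32)
        return x, y


features = np.asarray([[1, 2], [3, 4], [5, 6], [7, 8]])
labels = np.asarray([[0], [1], [0], [2]])
dataset = PairDataset(features, labels)

print(len(dataset))   # same value as dataset.__len__()
x, y = dataset[1]     # same call as dataset.__getitem__(1)
print(x, y)
```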

III. DataLoader

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None)

Function: build an iterable data loader

dataset: a Dataset object, which determines where and how the data is read

batch_size: the batch size

num_workers: the number of child processes used to read the data (0 means the data is read in the main process); this setting only concerns CPU-side loading

shuffle: whether the data is reshuffled at every epoch

drop_last: whether to discard the last batch when the number of samples is not divisible by batch_size

Epoch: one pass in which all training samples have been fed into the model

Iteration: one batch of samples fed into the model

Batch size: determines how many Iterations make up one Epoch
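The relationship between batch size, Epoch, and Iteration is simple arithmetic. Here is a minimal sketch in plain Python; the function name `iterations_per_epoch` and the sample counts are illustrative assumptions, not taken from the article.

```python
import math


def iterations_per_epoch(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches needed to pass over num_samples once."""
    if drop_last:
        # The final incomplete batch is discarded.
        return num_samples // batch_size
    # The final incomplete batch is kept as a smaller batch.
    return math.ceil(num_samples / batch_size)


print(iterations_per_epoch(87, 8))                  # 11: ten full batches plus one batch of 7
print(iterations_per_epoch(87, 8, drop_last=True))  # 10: the batch of 7 is dropped
print(iterations_per_epoch(80, 8))                  # 10: divides evenly, drop_last makes no difference
```

For a map-style dataset, this should match what `len()` reports on a DataLoader built with the same batch_size and drop_last.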

Let me give you an example:

    import torch
    import torch.utils.data.dataset as Dataset
    import torch.utils.data.dataloader as DataLoader
    import numpy as np

    Data = np.asarray([[1, 2], [3, 4], [5, 6], [7, 8]])
    Label = np.asarray([[0], [1], [0], [2]])

    # Create a subclass
    class subDataset(Dataset.Dataset):
        # Initialization: define the data content and labels
        def __init__(self, Data, Label):
            self.Data = Data
            self.Label = Label

        # Return the dataset size
        def __len__(self):
            return len(self.Data)

        # Get the data content and label for one index
        def __getitem__(self, index):
            data = torch.Tensor(self.Data[index])
            label = torch.IntTensor(self.Label[index])
            return data, label

    if __name__ == '__main__':
        dataset = subDataset(Data, Label)
        print(dataset)
        print('dataset size is:', dataset.__len__())
        print(dataset.__getitem__(0))
        print(dataset[0])

        # Create the DataLoader iterator. This takes the Dataset defined earlier
        # and lets DataLoader perform some operations on the data: set shuffle=True
        # if the data needs to be shuffled, and num_workers=4 to read the data
        # with four processes.
        dataloader = DataLoader.DataLoader(dataset, batch_size=2, shuffle=False, num_workers=4)
        for i, item in enumerate(dataloader):  # enumerate can be used to extract the batches
            print('i:', i)
            data, label = item  # unpack the batch into data and labels
            print('data:', data)
            print('label:', label)

IV. Put Dataset data and labels on the GPU (a wrong execution order in the code will cause bugs)

To sum up, there are two ways to solve the problem.

1. If the data is converted to a GPU type inside the __getitem__ method of the Dataset class, the num_workers parameter of DataLoader must be set to 0, because that parameter concerns CPU-side loading; once the data is on the GPU, it can only be read by a single process. If the conversion is done in the DataLoader loop instead, so that multiple child processes read the data first and the batches are moved to the GPU afterwards, num_workers does not need to be changed. In other words, the conversion code is moved from the __getitem__ method into the DataLoader loop.

2. In practice, however, datasets and labels are rarely as simple as the ones constructed above. On Kaggle, for example, labels are usually stored in CSV files, which calls for pandas to read them.
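The two points above can be sketched together: a Dataset that reads its samples from a CSV table with pandas and returns CPU tensors, with the move to the GPU done in the DataLoader loop. This is only a sketch under stated assumptions: the CSV content, the column names `x1`, `x2`, `label`, and the class name `CsvDataset` are all hypothetical, and the code falls back to the CPU when no GPU is available.

```python
import io

import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset

# Hypothetical CSV content; a real project would pass a file path to pd.read_csv.
CSV_TEXT = """x1,x2,label
1,2,0
3,4,1
5,6,0
7,8,2
"""


class CsvDataset(Dataset):
    """Hypothetical dataset backed by a CSV table; returns CPU tensors."""

    def __init__(self, csv_source):
        frame = pd.read_csv(csv_source)
        self.features = frame[["x1", "x2"]].to_numpy(dtype="float32")
        self.labels = frame["label"].to_numpy(dtype="int64")

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        # Keep samples on the CPU here so worker processes never touch the GPU.
        return torch.from_numpy(self.features[index]), torch.tensor(self.labels[index])


# Fall back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

dataset = CsvDataset(io.StringIO(CSV_TEXT))
# num_workers=0 keeps the sketch portable; because tensors are moved to the
# device only inside the loop, a larger value could be used here.
loader = DataLoader(dataset, batch_size=2, shuffle=False, num_workers=0)

batches = []
for data, label in loader:
    # The device transfer happens in the loop, not in __getitem__.
    data, label = data.to(device), label.to(device)
    batches.append((data, label))
    print(data.device, data.shape, label.tolist())
```

Keeping `__getitem__` free of device transfers means the same Dataset works unchanged on machines with and without a GPU.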

