This article introduces how to use MoCo-v2 in PyTorch to reduce computational constraints. The content is quite detailed; interested readers can refer to it, and I hope it helps you.
Introduction
The SimCLR paper (http://cse.iitkgp.ac.in/~arastogi/papers/simclr.pdf) explains how this framework benefits from larger models and larger batch sizes, and can produce results comparable to supervised models if enough computing power is available.
But these requirements make the framework computationally expensive. Wouldn't it be nice if we could keep the simplicity and power of this framework while lowering its computational requirements, so that everyone could access it? MoCo-v2 comes to the rescue.
Datasets
This time we will implement MoCo-v2 in PyTorch on a somewhat larger dataset and train our model on Google Colab. We will use the Imagenette and Imagewoof datasets.
Some images from the Imagenette dataset
A quick summary of these datasets (more information here: https://github.com/fastai/imagenette):
Imagenette consists of 10 easily classified classes from Imagenet, with a total of 9479 training images and 3935 validation images.
Imagewoof is a set of 10 hard-to-classify classes from Imagenet, since all of the classes are dog breeds. It has 9035 training images and 3939 validation images.
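For reference, here is a minimal sketch of loading Imagenette with torchvision once the archive from the link above has been downloaded and extracted. The ./imagenette2 path is an assumption, and the actual training pipeline applies the augmentations described in the next section:

import torchvision.datasets as datasets
import torchvision.transforms as transforms

# a sketch, assuming the Imagenette archive (see the fastai link above)
# has been extracted to ./imagenette2
train_data = datasets.ImageFolder(
    root='./imagenette2/train',                  # assumed extraction path
    transform=transforms.ToTensor())
print(len(train_data), len(train_data.classes))  # training images, 10 classes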
Contrastive Learning
Contrastive learning in self-supervised learning is based on the idea that we want different views of images from the same category to have similar representations. However, since we do not know which images belong to the same category, what is usually done is to pull together the representations of different views of the same image. We call these different views positive pairs.
In addition, we want images from different categories to have representations that are far apart. Representations of views of different images, regardless of their category, can be pushed away from each other. We call these different views negative pairs.
In this context, what is a view of an image? A view can be thought of as looking at some part of an image in a modified way; essentially, it is a transformation of the image.
Depending on the task at hand, some transformations work better than others. SimCLR shows that applying random cropping and color jitter works well on a variety of tasks, including image classification. This essentially came out of a grid search, choosing pairs of transformations from options such as rotation, cropping, cutout, noise, blur, and Sobel filtering.
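As one concrete example, here is a minimal sketch of such an augmentation pipeline with torchvision. The parameter values follow the usual MoCo-v2 recipe (random resized crop plus color jitter, random grayscale, and horizontal flip; MoCo-v2 also adds Gaussian blur) and are illustrative rather than the exact values used in this article:

import torchvision.transforms as transforms

# two views of the same image are produced by applying this transform twice;
# the parameter values are illustrative
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor()])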
The mapping from views to representation space is done by a neural network; typically, ResNets are used for this purpose. Below is the pipeline from image to representation.
How do negative pairs arise?
Within the same image, we can get multiple views thanks to random cropping, and in this way we can create positive pairs.
But how do we generate negative pairs? Negative pairs are representations that come from different images. The SimCLR paper creates them within the same batch. If a batch contains N images, then each image yields 2 representations, which makes 2*N representations in total. For a particular representation x, one representation forms a positive pair with x (the representation that comes from the same image as x), and all the other representations (exactly 2*N − 2 of them) form negative pairs with x.
These representations improve if we have a large number of negative samples at hand. In SimCLR, however, a large number of negative samples is only possible with a large batch size, which leads to higher demands on computational power. MoCo-v2 provides another way of generating negative samples. Let's look at it more closely.
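To make the counting concrete, here is a small pure-PyTorch sketch of the SimCLR-style batch view described above (our own illustration, not code from this article). For a batch of N = 4 images we get 2N = 8 representations; after masking self-similarities, each row of the similarity matrix holds 1 positive and 2N − 2 = 6 negative entries:

import torch
import torch.nn.functional as F

N, C = 4, 25                                # batch size, representation dimension
z1 = F.normalize(torch.randn(N, C), dim=1)  # first views (random stand-ins)
z2 = F.normalize(torch.randn(N, C), dim=1)  # second views
z = torch.cat([z1, z2], dim=0)              # 2N = 8 representations
sim = z @ z.t()                             # (2N, 2N) cosine similarity matrix
sim.fill_diagonal_(float('-inf'))           # a representation is not its own pair
# in each row, one entry is the positive and 2N - 2 = 6 entries are negatives
print(sim.shape)                            # torch.Size([8, 8])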
Dynamic Dictionary
We can look at contrastive learning in a slightly different way: as matching queries to keys. We now have two encoders, one for queries and one for keys. Furthermore, in order to have a large number of negative samples, we need a large dictionary of encoded keys.
A positive pair in this context means that a query matches its key; a query and a key match if they both come from the same image. An encoded query should be similar to the key it matches and dissimilar from all the other keys.
For negative pairs, we maintain a large dictionary containing the encoded keys of previous batches; these serve as negative samples for the queries. We maintain the dictionary as a queue: new batches are enqueued and the oldest batches are dequeued. You can change the number of negative samples by changing the size of this queue.
Challenges of this approach
As the key encoder changes, keys enqueued at later points in time may become inconsistent with keys enqueued earlier. For contrastive learning to work, all the keys compared against a query must come from the same or a similar encoder, so that the comparisons are meaningful and consistent.
Another challenge is that learning the key encoder's parameters through backpropagation is not feasible, because that would require computing gradients for all the samples in the queue (which would result in a huge computation graph).
To solve these two problems, MoCo implements the key encoder as a momentum-based moving average of the query encoder [1]. This means that it updates the key encoder parameters in this way:

θ_k ← m·θ_k + (1 − m)·θ_q

where m is very close to 1 (e.g., a typical value is 0.999), which ensures that we get encoded keys from similar encoders at different points in time.
Loss Function: InfoNCE
We want a query to be close to all of its positive samples and far from all of its negative samples. The InfoNCE loss function captures this; it stands for Info Noise Contrastive Estimation. For a query q, its matching key k, and the negative keys k_i in the queue, the InfoNCE loss function is:

L(q, k) = −log [ exp(q·k/τ) / ( exp(q·k/τ) + Σ_{i=1..K} exp(q·k_i/τ) ) ]

We can rewrite it, treating the matching key as k_0, as:

L(q, k) = −log [ exp(q·k_0/τ) / Σ_{i=0..K} exp(q·k_i/τ) ]
When the similarity between q and k increases and the similarity between q and negative samples decreases, the loss value decreases.
Here is the code for the loss function:
import torch

τ = 0.05

def loss_function(q, k, queue):
    # N is the batch size
    N = q.shape[0]
    # C is the dimension of the representations
    C = q.shape[1]
    # bmm stands for batch matrix multiplication:
    # if mat1 is a b×n×m tensor and mat2 is a b×m×p tensor,
    # the output is a b×n×p tensor
    pos = torch.exp(torch.div(torch.bmm(q.view(N, 1, C), k.view(N, C, 1)).view(N, 1), τ))
    # matrix multiplication between the queries and the queue tensor,
    # then a sum over the queue dimension
    neg = torch.sum(torch.exp(torch.div(torch.mm(q.view(N, C), torch.t(queue)), τ)), dim=1)
    # the denominator of the InfoNCE loss
    denominator = neg + pos
    return torch.mean(-torch.log(torch.div(pos, denominator)))
Let's look at this loss function again and compare it to the categorical cross-entropy loss function.
CE = −Σ_i true_i · log(pred_i)

where pred_i is the predicted probability that the data point belongs to class i, and true_i is the actual probability that the point belongs to class i (it can be soft, but in most cases it is one-hot).
If you are unfamiliar with this topic, you can watch this video to better understand cross-entropy: www.youtube.com/watch?v=ErfnhcEV1O8. Also note that we often convert scores to probability values with a function like softmax.
We can think of the InfoNCE loss function as a cross-entropy loss: the correct "class" for a query q is its matching key, and the underlying classifier is softmax-based, trying to classify among the K+1 keys.
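This view also suggests an equivalent way of implementing the loss: stack the positive and negative logits and feed them to the standard cross-entropy loss, with the "correct class" always at index 0. This is how the official MoCo code implements it; the sketch below is our illustration and matches the loss_function defined earlier:

import torch
import torch.nn.functional as F

def info_nce_as_cross_entropy(q, k, queue, tau=0.05):
    # positive logits: (N, 1); negative logits: (N, K)
    l_pos = torch.sum(q * k, dim=1, keepdim=True)
    l_neg = torch.mm(q, queue.t())
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # the positive key is "class 0" for every query
    labels = torch.zeros(logits.shape[0], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)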
InfoNCE is also connected to the mutual information between the encoded representations; see [4] for more details on this.
MoCo-v2 Framework
Now, let's put everything together and see what the entire MoCo-v2 algorithm looks like.
Step 1:
We need a query encoder and a key encoder. Initially, the key encoder has the same parameters as the query encoder; they are copies of each other. As training progresses, the key encoder becomes a moving average of the query encoder, one that progresses very slowly.
Due to computational constraints, we use the ResNet-18 architecture for our implementation. On top of the usual ResNet architecture, we add a few dense layers to bring the dimensionality of the representation down to 25. Some of these layers will later act as the projection head.
import copy
from collections import OrderedDict

import torch.nn as nn
from torchvision.models import resnet18

# define our deep learning architecture
resnetq = resnet18(pretrained=False)
classifier = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(resnetq.fc.in_features, 100)),
    ('added_relu1', nn.ReLU(inplace=True)),
    ('fc2', nn.Linear(100, 50)),
    ('added_relu2', nn.ReLU(inplace=True)),
    ('fc3', nn.Linear(50, 25))
]))
resnetq.fc = classifier
resnetk = copy.deepcopy(resnetq)

# move the resnets to the device
resnetq.to(device)
resnetk.to(device)

Step 2:
Now that we have the encoders, and assuming we have set up the other important data structures, it is time to start the training loop and understand the pipeline.
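One of those data structures is the queue itself. Here is a minimal sketch of one way to initialize it, by filling it with encoded keys from the first few batches before training starts; the trainloader name and the 'image 2' batch key are assumptions chosen to match the code below:

# a sketch: fill the queue with encoded keys before training starts
K = 8192        # maximum queue size
queue = None
with torch.no_grad():
    for sample_batched in trainloader:          # assumed DataLoader
        xk = sample_batched['image 2'].to(device)
        k = resnetk(xk)
        k = torch.div(k, torch.norm(k, dim=1).reshape(-1, 1))
        queue = k if queue is None else torch.cat((queue, k), 0)
        if queue.shape[0] >= K:
            queue = queue[:K]
            break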
In this step, we retrieve the encoded queries and keys for a training batch. We normalize the representations so that they have unit L2 norm.
One caveat: the code in all the subsequent steps lives inside the batch and epoch loops. We also detach the tensor "k" from its gradients, because we don't need to backpropagate through the key encoder part of the graph; the key encoder is updated by the momentum update equation instead.
# zero the gradients
optimizer.zero_grad()

# retrieve xq and xk for this batch
xq = sample_batched['image 1']
xk = sample_batched['image 2']

# move them to the device
xq = xq.to(device)
xk = xk.to(device)

# get their outputs
q = resnetq(xq)
k = resnetk(xk)
k = k.detach()

# normalize the outputs so that they are unit vectors
q = torch.div(q, torch.norm(q, dim=1).reshape(-1, 1))
k = torch.div(k, torch.norm(k, dim=1).reshape(-1, 1))

Step 3:
Now we pass the query, key, and queue to the loss function defined earlier and store the value in a list. Then, as usual, we call the backward function on the loss value and run the optimizer.
# get the loss value
loss = loss_function(q, k, queue)

# put that loss in the epoch loss list
epoch_losses_train.append(loss.cpu().data.item())

# backpropagate
loss.backward()

# run the optimizer
optimizer.step()

Step 4:
We enqueue the latest batch into our queue. If the queue then becomes larger than our defined maximum queue size (K), we dequeue the oldest batch from it. Both queue operations can be done with torch.cat and slicing.
# update the queue
queue = torch.cat((queue, k), 0)

# dequeue if the queue gets larger than the max queue size (K)
# the batch size is 256; this can be replaced with a variable
if queue.shape[0] > K:
    queue = queue[256:, :]

Step 5:
Now we move on to the last step of the training loop: updating the key encoder. We do this with the following for loop.
# update the key encoder as a momentum-based moving average of the query encoder
for θ_k, θ_q in zip(resnetk.parameters(), resnetq.parameters()):
    θ_k.data.copy_(moment * θ_k.data + θ_q.data * (1.0 - moment))

Some training details
Training the ResNet-18 model on the Imagenette and Imagewoof datasets took close to 18 hours of GPU time. For this, we used a 16 GB GPU on Google Colab. We used a batch size of 256, a tau value of 0.05, and a learning rate of 0.001, eventually reduced to 1e-5, with a weight decay of 1e-6. Our queue size was 8192, and the momentum value for the key encoder was 0.999.
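As a hedged sketch of how these hyperparameters might be wired up: the article only states the rates, so the choice of Adam and of cosine annealing for the learning-rate decay below are our assumptions:

import torch.optim as optim

# assumptions: Adam and cosine annealing; the article only gives the initial
# rate (0.001), the final rate (1e-5), and the weight decay (1e-6)
num_epochs = 200    # assumed
optimizer = optim.Adam(resnetq.parameters(), lr=0.001, weight_decay=1e-6)
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs, eta_min=1e-5)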
Results
The first 3 layers (counting ReLU as a layer) define the projection head, which we remove for the downstream image classification task. On the remaining network, we train a linear classifier.
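A minimal sketch of this linear-evaluation setup follows. We read "the first 3 layers" of the head as the three layers closest to the output (fc2, added_relu2, fc3), so the 100-dimensional cut point below is our assumption:

import torch.nn as nn
import torch.optim as optim

# a sketch: drop the projection head, freeze the encoder,
# and train only a linear classifier on top
resnetq.fc = nn.Sequential(*list(resnetq.fc.children())[:2])  # keep fc1 + added_relu1
for p in resnetq.parameters():
    p.requires_grad = False                 # freeze the encoder
linear_clf = nn.Linear(100, 10).to(device)  # 10 Imagenette classes
clf_optimizer = optim.Adam(linear_clf.parameters(), lr=0.001)  # assumed rate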
Using MoCo-v2 with only 10% of the labeled training data, we got 64.2% accuracy on Imagenette. By comparison, state-of-the-art supervised learning methods reach close to 95% accuracy.
For Imagewoof, we got 38.6% accuracy with 10% of the labeled data. Contrastive learning on this dataset was less effective than we expected. We suspect this is because, first of all, the dataset is very hard: all of its classes are dog breeds.
Second, we believe color is an important distinguishing feature for these classes. Applying color jitter may cause views from images of different classes to become indistinguishable from each other. By comparison, the accuracy of supervised methods on this dataset is close to 90%.
Design changes that could help bridge the gap between self-supervised and supervised models:
Use larger and wider models.
Use larger batch and dictionary sizes.
Use more data, if you can, bringing in all the unlabeled data at once.
Train large models on large amounts of data, and then distill them.
That's all on how to use MoCo-v2 in PyTorch to reduce computational constraints. I hope the above content helps you learn more. If you think the article is good, you can share it so that more people can see it.