This article explains how high-performance, large-scale image recognition can be achieved without batch normalization (BN), following the NFNet paper.
Introduction and overview
The paper focuses on convolutional residual networks for image recognition that do not use BN. Without BN, such networks usually train poorly and do not scale to larger batch sizes, but the networks built in this paper (NFNets) can be trained with large batches and are more efficient than previous state-of-the-art methods such as LambdaNets. The paper's comparison shows that, for the same top-1 accuracy on ImageNet, NFNet trains 8.7 times faster than EfficientNet-B7. The model is state of the art without any extra training data and also sets a new state of the art for transfer learning. NFNets currently rank second on the global leaderboard, behind only methods that use semi-supervised pre-training and extra data.
What's wrong with BN?
When data propagates through a network, it is transformed by each layer, and if the network is constructed badly these transformations progressively distort the signal. In machine learning it is good practice to center data around the mean and scale it to unit variance, but as the signal passes through deeper layers, especially through activations such as ReLU that keep only the positive part of the signal, the intermediate representations can become heavily skewed and no longer centered. Most methods in machine learning work better when the data is well conditioned (centered on the mean, not too skewed, and so on).
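As a small illustration (my own, not from the paper), a minimal PyTorch sketch of how a ReLU shifts standardized data away from zero mean:

```python
import torch

# Minimal sketch (assumes PyTorch): standardized data loses its zero mean
# once it passes through a ReLU, which keeps only the positive part.
x = torch.randn(10_000)            # roughly zero mean, unit variance
x = (x - x.mean()) / x.std()       # explicit standardization
h = torch.relu(x)                  # only the positive half of the signal survives

print(x.mean().item(), x.std().item())  # ~0.0, ~1.0
print(h.mean().item(), h.std().item())  # mean pushed above 0, variance shrunk
```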
BN has three notable shortcomings. First, it is computationally expensive and incurs memory overhead: the batch mean and variance must be computed and stored for the backward pass, which increases the time required to evaluate gradients in some networks.
Second, it introduces a discrepancy between the model's behavior during training and during inference. At inference time you do not want any batch dependence, you want to be able to feed a single data point, and both modes should produce the same result.
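A minimal sketch (assuming PyTorch's nn.BatchNorm1d) of this train/inference mismatch: the same input gives different outputs depending on whether the layer uses the current batch statistics or the stored running statistics.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
batch = torch.randn(8, 4)

bn.train()
out_train = bn(batch)   # normalized with this batch's mean and variance

bn.eval()
out_eval = bn(batch)    # normalized with the accumulated running statistics

print(torch.allclose(out_train, out_eval))  # typically False
```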
Third, BN breaks the independence between training examples in a mini-batch: the output for one example now depends on the other examples in the batch.
This has two main consequences. First, the batch size affects batch normalization. With a small batch the estimated mean is a very noisy approximation, whereas with a large batch it is a good one. We know that for some applications training with large batches is beneficial: it stabilizes training, reduces training time, and so on.
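An illustrative sketch (my own, not from the paper) of how noisy the per-batch mean estimate is at different batch sizes:

```python
import torch

data = torch.randn(1_000_000)   # synthetic activations with true mean 0
for bs in (8, 2048):
    n = (len(data) // bs) * bs
    batch_means = data[:n].reshape(-1, bs).mean(dim=1)
    # the spread of the per-batch means shrinks as the batch size grows
    print(bs, batch_means.std().item())
```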
Second, distributed training becomes cumbersome. With data parallelism, a batch is split into, say, three parts, each of which is forwarded through a copy of the network on a different machine. If each of the three copies contains a BN layer, you cannot simply forward the signal through each BN layer independently: the batch statistics have to be communicated between the BN layers, because otherwise no layer sees the mean and variance of the whole batch. This coupling between examples also allows the network to "cheat" certain loss functions.
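In practice, frameworks work around this by synchronizing batch statistics across devices; a hedged PyTorch sketch (it assumes a torch.distributed process group is already initialized before the forward pass):

```python
import torch.nn as nn

# Sketch: convert ordinary BN layers to SyncBatchNorm so every replica
# normalizes with the mean/variance of the full (global) batch.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```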
Paper contributions
The authors propose Adaptive Gradient Clipping (AGC), which clips gradients based on the unit-wise ratio of gradient norm to parameter norm. They show that AGC allows unnormalized networks to be trained with larger batch sizes and stronger data augmentation.
The authors design a family of unnormalized ResNets called NFNets that set the best validation accuracies across a range of training latencies on ImageNet. The NFNet-F1 model matches the accuracy of EfficientNet-B7 while training 8.7 times faster, and the largest model sets a new state of the art (86.5% top-1 accuracy) without extra data.
The authors also report that, after pre-training on a large private dataset of 300 million labeled images and fine-tuning on ImageNet, NFNets reach much higher validation accuracy than batch-normalized networks. The best model reaches 89.2% top-1 after fine-tuning.
Adaptive Gradient Clipping (AGC)
Gradient clipping is commonly used in language modeling to stabilize training, and recent work has shown that it allows training with larger learning rates than plain gradient descent. Gradient clipping is usually implemented by constraining the norm of the gradient. Specifically, for a gradient vector G = ∂L/∂θ, where L is the loss and θ is the vector of all model parameters, the standard clipping algorithm clips the gradient before updating θ:

G → λ · G / ‖G‖ if ‖G‖ > λ, otherwise G is left unchanged.
During training it is not desirable for the optimizer to make huge jumps on its way toward a minimum, so gradient clipping simply says: whenever the gradient of any parameter is very large, clip it. If the gradient direction is good we will see it again; if it is bad, we want to limit its impact. The problem is that this method is very sensitive to the clipping threshold λ, because it is not adaptive.
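A minimal sketch of standard (non-adaptive) clipping by global norm in PyTorch; torch.nn.utils.clip_grad_norm_ provides essentially the same behavior out of the box:

```python
import torch

def clip_grad_by_global_norm(parameters, clip_value):
    """Rescale all gradients if the global gradient norm exceeds clip_value (lambda)."""
    grads = [p.grad for p in parameters if p.grad is not None]
    total_norm = torch.norm(torch.stack([g.detach().norm() for g in grads]))
    if total_norm > clip_value:
        scale = clip_value / (total_norm + 1e-6)
        for g in grads:
            g.detach().mul_(scale)

# usage: loss.backward(); clip_grad_by_global_norm(model.parameters(), 1.0); optimizer.step()
```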
What AGC does is scale gradients not by their norm alone, but by the ratio of the gradient norm to the parameter (weight) norm. This may be confusing at first glance, but refer to page 4 of the paper for a clearer description of AGC.
The clipping threshold λ is a scalar hyperparameter that must be tuned. Empirically, the authors found that while standard clipping allowed them to train at higher batch sizes than before, training stability was extremely sensitive to the choice of the threshold, requiring fine-grained tuning whenever the model depth, batch size, or learning rate changed. Prior optimizers instead ignore the scale of the gradient by adopting an adaptive learning rate that is inversely proportional to the gradient norm.
Note that the optimal clipping parameter λ may depend on the choice of optimizer, learning rate, and batch size. Empirically, the authors found that λ should be smaller for larger batch sizes.
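A minimal, per-tensor sketch of AGC in PyTorch (the paper clips unit-wise, i.e., per output row of each weight matrix; this simplified version uses one norm per parameter tensor, and the function names are my own):

```python
import torch

def adaptive_grad_clip(parameters, clip=0.01, eps=1e-3):
    """Clip each gradient when its norm exceeds clip * the parameter's norm."""
    for p in parameters:
        if p.grad is None:
            continue
        param_norm = p.detach().norm().clamp(min=eps)  # ||W||, floored to avoid zero
        grad_norm = p.grad.detach().norm()             # ||G||
        max_norm = clip * param_norm                   # largest allowed gradient norm
        if grad_norm > max_norm:
            p.grad.detach().mul_(max_norm / (grad_norm + 1e-6))

# usage: loss.backward(); adaptive_grad_clip(model.parameters(), clip=0.01); optimizer.step()
```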
Ablation of Adaptive Gradient Clipping (AGC)
For example, in Figure 1, if you compare the normalizer-free networks (NF-ResNet and NF-ResNet + AGC), you can see that beyond a certain batch size (2048) the variant without AGC simply collapses, while the AGC variant keeps training. This appears to be an inherent problem of large-batch training. The authors note that the clipping threshold λ is critical: Figure 2 shows that λ depends strongly on batch size. For small batch sizes clipping can use fairly large thresholds, but for large batches the threshold must be kept very low, because training crashes if the threshold is set higher.