
What are gradient vanishing and gradient explosion, and what are the six solutions?


Many beginners do not know how to deal with gradient vanishing and gradient explosion, or what the six common solutions are, so this article summarizes the causes of these problems and their solutions. I hope it helps you solve them.

1. Gradient vanishing

According to the chain rule, if the product of each layer's output derivative and the previous layer's weight is less than 1, then even if each factor is as large as 0.99, after propagating through enough layers the partial derivative of the error with respect to the input layer tends to 0.

As a result, the hidden-layer neurons close to the input layer are barely adjusted.
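As a purely illustrative calculation (the per-layer factor of 0.99 and the depth of 1000 layers are assumed values, not from the article), the chain-rule product shrinks rapidly with depth:

# Assumed values: each layer contributes a factor of 0.99 to the chain-rule product.
depth = 1000
per_layer_factor = 0.99
input_layer_gradient = per_layer_factor ** depth
print(input_layer_gradient)  # roughly 4.3e-05, so the early layers barely update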

2. Gradient explosion

According to the chain rule, if the product of each layer's output derivative and the previous layer's weight is greater than 1, then after propagating through enough layers the partial derivative of the error with respect to the input layer tends to infinity.

As a result, the hidden-layer neurons close to the input layer receive extremely large updates.
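The mirror case, again with assumed values: a per-layer factor only slightly greater than 1 blows up just as quickly with depth:

# Assumed values: each layer contributes a factor of 1.01 to the chain-rule product.
depth = 1000
per_layer_factor = 1.01
input_layer_gradient = per_layer_factor ** depth
print(input_layer_gradient)  # roughly 2.1e+04, so early-layer updates become huge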

3. Solutions to gradient vanishing and gradient explosion

This article presents six common solutions to gradient vanishing and gradient explosion. You are welcome to read and learn.

3.1 Pre-training plus fine-tuning

This method comes from a paper published by Hinton in 2006. To address the gradient problem, Hinton proposed an unsupervised layer-by-layer training method: train one layer of hidden nodes at a time, using the output of the previous hidden layer as its input and its own output as the input of the next hidden layer. This layer-by-layer process is called "pre-training". After pre-training is complete, the whole network is "fine-tuned".

Hinton used this method when training deep belief networks (Deep Belief Networks): after each layer has been pre-trained, the BP algorithm is used to train the whole network. The idea is equivalent to finding a local optimum for each layer first and then combining them to search for the global optimum. The method has some merits, but it is not widely used today.
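A minimal sketch of the idea, assuming the TensorFlow/Keras API with placeholder data and layer sizes (none of these values come from the article or from Hinton's paper): each hidden layer is first trained inside a small autoencoder, then the pretrained layers are stacked and the whole network is fine-tuned with labels.

import tensorflow as tf

# Placeholder sizes and stand-in unlabeled data for the sketch.
layer_sizes = [784, 256, 64]
x_unlabeled = tf.random.normal([1000, layer_sizes[0]])

# Greedy layer-by-layer "pre-training": each hidden layer is trained as the
# encoder of a small autoencoder, and its output becomes the next layer's input.
pretrained_layers = []
inputs = x_unlabeled
for units in layer_sizes[1:]:
    encoder = tf.keras.layers.Dense(units, activation="sigmoid")
    decoder = tf.keras.layers.Dense(inputs.shape[-1], activation="sigmoid")
    autoencoder = tf.keras.Sequential([encoder, decoder])
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(inputs, inputs, epochs=1, verbose=0)   # unsupervised training of this layer
    pretrained_layers.append(encoder)
    inputs = encoder(inputs)                               # this layer's codes feed the next layer

# "Fine-tuning": stack the pretrained encoders, add an output layer, and train
# the whole network end to end with labeled data (the labels here are hypothetical).
model = tf.keras.Sequential(pretrained_layers + [tf.keras.layers.Dense(10, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(x_labeled, y_labeled, epochs=...)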

3.2 Gradient clipping and regularization

Gradient clipping is mainly proposed for gradient explosion. The idea is to set a clipping threshold and, when updating the gradients, forcibly limit any gradient that exceeds the threshold to that range. This direct method prevents gradient explosion.
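For illustration, here is one way to apply such a threshold in TensorFlow 2 using clipping by global norm; the toy model, the random data and the threshold of 5.0 are assumptions of this sketch, not values from the article.

import tensorflow as tf

# Toy model and data, just to produce some gradients.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
x = tf.random.normal([32, 10])
y = tf.random.normal([32, 1])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))
grads = tape.gradient(loss, model.trainable_variables)

# Force the global gradient norm below the threshold before applying the update.
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)
optimizer.apply_gradients(zip(clipped_grads, model.trainable_variables))

Keras optimizers can also clip for you via the clipnorm or clipvalue arguments, e.g. tf.keras.optimizers.SGD(learning_rate=0.1, clipnorm=5.0).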

Note: WGAN also involves a clipping operation, but its purpose is different: there, clipping is used to guarantee the Lipschitz condition rather than to stabilize gradient updates.

Introduction to WGAN (Wasserstein GAN)

We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we show that the corresponding optimization problem is sound, and provide extensive theoretical work highlighting the deep connections to other distances between distributions.

Another way to address gradient explosion is weight regularization (weights regularization). The most common forms are L1 and L2 regularization, and every deep learning framework provides regularization APIs. For example, in TensorFlow, if regularization parameters were set when the network was built, the regularization loss can be computed directly with the following code:

regularization_loss = tf.add_n(tf.losses.get_regularization_losses(scope='my_resnet_50'))

If the regularization parameters were not set at initialization, you can also compute an L2 regularization loss with the following code:

l2_loss = tf.add_n([tf.nn.l2_loss(var) for var in tf.trainable_variables() if 'weights' in var.name])

Regularization limits overfitting by constraining the network weights. Look carefully at the regularization term in the loss function, which for L2 regularization has the form:

loss = (y - W^T x)^2 + α ||W||^2

where α is the coefficient of the regularization term. If a gradient explosion occurs, the norm of the weights becomes very large, so the regularization term can partially suppress the explosion.

Note: in practice, gradient vanishing occurs more often than gradient explosion in deep neural networks.

3.3 Activation functions such as ReLU, Leaky ReLU, ELU

ReLU: the idea is simple. If the derivative of the activation function is 1, there is no gradient vanishing or explosion, and every layer of the network is updated at the same rate; this is how ReLU came about.

The main contributions of ReLU are:

It solves the gradient vanishing and explosion problems.

The computation is simple and fast.

It speeds up network training.

At the same time, there are some shortcomings:

Because the negative part is always 0, some neurons may never activate (this can be partially mitigated by using a small learning rate).

The output is not zero-centered.

Leaky ReLU is designed to address ReLU's zero region. Its mathematical expression is: leakyrelu = max(k * x, x)

where k is the leak coefficient, usually 0.01 or 0.02, or it can be learned. Leaky ReLU removes the effect of the zero region while keeping all the advantages of ReLU.
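For example, with TensorFlow's built-in activations (the sample inputs are arbitrary, and alpha plays the role of the leak coefficient k above):

import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 1.0, 3.0])
print(tf.nn.relu(x).numpy())                    # negative inputs become 0
print(tf.nn.leaky_relu(x, alpha=0.01).numpy())  # negative inputs keep a small slope k*x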

3.4 Batchnorm

Batchnorm is one of the most important techniques to emerge since deep learning took off. It is now widely used in major networks, accelerating convergence and improving training stability. In essence, batchnorm addresses the gradient problems that arise during back propagation.

Batchnorm's full name is batch normalization, or BN for short.

The output x of a layer is normalized through the normalization operation, which keeps the network stable.

By keeping the mean and variance of each layer's output consistent, batchnorm removes the amplifying and shrinking effect of the weights w, and thus mitigates gradient vanishing and explosion.
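A minimal sketch of the computation itself, assuming TensorFlow and illustrative shapes: each feature is normalized to zero mean and unit variance over the batch, then rescaled by learnable parameters gamma and beta.

import tensorflow as tf

x = tf.random.normal([32, 64]) * 10.0 + 5.0    # a layer output with a large, shifted scale
mean, variance = tf.nn.moments(x, axes=[0])    # per-feature statistics over the batch
gamma = tf.ones([64])                          # learnable scale
beta = tf.zeros([64])                          # learnable shift
x_bn = tf.nn.batch_normalization(x, mean, variance, beta, gamma, variance_epsilon=1e-5)

In practice this is wrapped in a layer such as tf.keras.layers.BatchNormalization, which also tracks running statistics for inference.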

For details, please refer to the article:

http://blog.csdn.net/qq_25737169/article/details/79048516

3.5 Residual structure

In fact, it was the emergence of residual networks that brought the ImageNet competition to an end. Since residual connections were proposed, almost no deep network has done without them. Compared with the earlier networks of a few layers or a few dozen layers, residual networks can easily reach hundreds or even more than a thousand layers without worrying about the gradient vanishing too quickly. The reason lies in the shortcut part of the residual block.

When it comes to residual structures, one paper must be mentioned:

Deep Residual Learning for Image Recognition
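A hedged sketch of a basic residual block in Keras (the filter count and kernel size are illustrative, not taken from the paper): the shortcut adds the block input to the block output, giving gradients an identity path to flow back through.

import tensorflow as tf

def residual_block(x, filters=64):
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.Add()([y, shortcut])        # the shortcut connection
    return tf.keras.layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)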

3.6 LSTM

LSTM, short for long short-term memory network (long-short term memory networks), is much less prone to gradient vanishing, mainly because of the complex "gates" inside it. Through these internal "gates", the LSTM can "remember" information from earlier steps, which is why it is often used for text generation.
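A minimal usage sketch, assuming Keras and illustrative sequence dimensions: the gating happens inside the LSTM cell, so from the outside it is just a recurrent layer applied to a batch of sequences.

import tensorflow as tf

sequences = tf.random.normal([8, 20, 32])   # (batch, time steps, features)
lstm = tf.keras.layers.LSTM(64)             # the gates are handled inside the cell
final_state = lstm(sequences)               # shape (8, 64), the last hidden state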

After reading the above, have you mastered what gradient vanishing and gradient explosion are and the six solutions to them? Thank you for reading!
