This article mainly presents an example-based analysis of regularization methods, dropout, and dataset augmentation. It goes into some detail and should be of reference value; interested readers are encouraged to read on.
Regularization method: prevent overfitting and improve generalization ability
When there is not enough training data, or when a network is overtrained, overfitting often occurs. Intuitively, as training proceeds and the complexity of the model grows, the error on the training data keeps decreasing while the error on the validation set gradually increases: the trained network fits the training set very well but does not generalize to data outside it.
There are many methods for preventing overfitting, which are discussed below. One concept needs to be explained first. In machine learning we usually divide the original data set into three parts: training data, validation data, and testing data. What is the validation data for? It is precisely there to help avoid overfitting: during training we use it to choose hyperparameters (for example, deciding the stopping epoch for early stopping based on validation accuracy, or choosing the learning rate based on validation performance). Why not do this directly on the testing data? Because if we tuned on the testing data, then as training went on the network would gradually overfit the testing data as well, and the final test accuracy would lose its meaning as a reference. So the training data is used to compute gradients and update the weights, the validation data is used as described above, and the testing data provides a final accuracy that tells us how good the network really is.
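As a small illustration of how the validation set drives early stopping, here is a minimal sketch; the callables train_one_epoch and evaluate are hypothetical placeholders, not functions from this article:

def train_with_early_stopping(model, train_data, val_data,
                              train_one_epoch, evaluate,
                              max_epochs=100, patience=5):
    # train_one_epoch(model, data) performs the gradient updates for one epoch,
    # using the training data only; evaluate(model, data) returns an accuracy.
    # The test set is never touched here, so it stays valid for the final report.
    best_val_acc, best_epoch = 0.0, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)
        val_acc = evaluate(model, val_data)   # validation accuracy picks the stopping epoch
        if val_acc > best_val_acc:
            best_val_acc, best_epoch = val_acc, epoch
        elif epoch - best_epoch >= patience:
            break                             # no improvement for `patience` epochs: stop early
    return model, best_epoch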
There are many ways to avoid overfitting: early stopping, dataset augmentation (data augmentation), regularization, including L1 and L2 (L2 regularization is also called weight decay), and dropout.
L2 regularization (weight decay)
L2 regularization adds a regularization term to the cost function:

C = C0 + (λ / 2n) · Σ w²

Here C0 is the original cost function, and the added term is the L2 regularization term: the sum of the squares of all weights w, divided by the training-set size n. λ is the regularization coefficient, which weighs the regularization term against the C0 term. There is also a factor 1/2 that you will often see; it is there mainly to make the later derivation convenient: differentiating the squared term produces a 2, which cancels exactly with the 1/2.
How does the L2 regularization term avoid overfitting? Let's derive it and see. First take the derivatives:

∂C/∂w = ∂C0/∂w + (λ / n) w
∂C/∂b = ∂C0/∂b
It can be seen that the L2 regularization term has no effect on the update of b, but it does affect the update of w:

b → b − η ∂C0/∂b
w → w − η ∂C0/∂w − (η λ / n) w = (1 − η λ / n) w − η ∂C0/∂w
Without L2 regularization, the coefficient in front of w in the update rule is 1; with it, the coefficient becomes 1 − η λ / n. Since η, λ and n are all positive, 1 − η λ / n is less than 1, so its effect is to shrink w; this is where the name weight decay comes from. Of course, once the gradient term is taken into account as well, the final value of w may still end up larger or smaller.
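A minimal NumPy sketch of this update rule, assuming illustrative names (eta, lam, grad_C0_w, grad_C0_b are not from the article):

import numpy as np

def l2_sgd_step(w, b, grad_C0_w, grad_C0_b, eta=0.1, lam=0.01, n=1000):
    # One gradient step with L2 regularization (weight decay).
    # The regularization only touches w: it is first scaled by (1 - eta*lam/n),
    # then the usual gradient step is applied. b is updated exactly as without
    # regularization, matching the derivation above.
    w = (1.0 - eta * lam / n) * w - eta * grad_C0_w
    b = b - eta * grad_C0_b
    return w, b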
It should also be mentioned that for mini-batch based stochastic gradient descent, the update formulas for w and b differ slightly from those given above:

w → (1 − η λ / n) w − (η / m) Σx ∂Cx/∂w
b → b − (η / m) Σx ∂Cx/∂b

Comparing with the update formula for w above, the gradient term has become the sum of the derivatives over the examples x in the mini-batch, multiplied by η and divided by m, where m is the number of samples in a mini-batch.
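The same step in mini-batch form, again as a sketch with hypothetical names (per_example_grads_w holds one gradient per example in the batch):

import numpy as np

def l2_minibatch_step(w, per_example_grads_w, eta=0.1, lam=0.01, n=1000):
    # per_example_grads_w has shape (m, ...) and holds dC_x/dw for each of the
    # m examples x in the mini-batch. Note that the decay factor still uses the
    # full training-set size n, while the gradient is averaged over the batch.
    m = per_example_grads_w.shape[0]
    avg_grad = per_example_grads_w.sum(axis=0) / m
    return (1.0 - eta * lam / n) * w - eta * avg_grad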
So far we have only shown that the L2 regularization term makes w "smaller", but we have not explained why a smaller w prevents overfitting. One explanation that is supposedly "obvious" is that smaller weights mean, in some sense, a lower-complexity network that fits the data just about right (this principle is also known as Occam's razor). It has also been verified empirically that L2-regularized models often perform better than unregularized ones. Of course, to many people (myself included) this explanation does not seem all that obvious, so a slightly more mathematical explanation is added here.
When a model overfits, the coefficients of the fitted function are often very large. Why? Intuitively, overfitting means the fitted function has to take care of every single data point, so the resulting function fluctuates heavily: in some very small intervals its value changes dramatically. That means the derivative of the function (in absolute value) is very large in those intervals, and since the independent variable itself can be either large or small, only sufficiently large coefficients can guarantee such large derivative values.
Regularization constrains the norm of the parameters so that they cannot grow too large, and can therefore reduce overfitting to a certain extent.
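As a quick, purely illustrative check of the "large coefficients" point, the sketch below fits a degree-9 polynomial to a few noisy points with and without an L2 penalty and compares the coefficient magnitudes; all names and values are made up for this example:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 15)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.size)

# Design matrix for a degree-9 polynomial fit.
X = np.vander(x, N=10, increasing=True)

# Ordinary least squares: minimizes ||Xw - y||^2 with no penalty.
w_plain = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge / L2-regularized least squares: w = (X^T X + lam * I)^-1 X^T y.
lam = 0.1
w_l2 = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("largest |coefficient|, unregularized:", np.abs(w_plain).max())
print("largest |coefficient|, L2-regularized:", np.abs(w_l2).max())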
L1 regularization
L1 regularization adds an L1 regularization term to the original cost function: the sum of the absolute values of all weights w, multiplied by λ / n (unlike the L2 term, this one is not multiplied by 1/2, for the reason mentioned above):

C = C0 + (λ / n) · Σ |w|
As before, compute the derivative first:

∂C/∂w = ∂C0/∂w + (λ / n) sgn(w)
In the formula above, sgn(w) denotes the sign of w. The update rule for the weight w is then:

w → w − η λ sgn(w) / n − η ∂C0/∂w
Compared with the original update rule, there is an extra term η · λ · sgn(w) / n. When w is positive, the updated w becomes smaller; when w is negative, the updated w becomes larger. So its effect is to push w towards 0, making as many weights in the network as possible equal to 0, which is equivalent to reducing the complexity of the network and thus prevents overfitting.
There is one issue not mentioned above: what do we do when w is 0? When w = 0, |w| is not differentiable, so we can only update w by the original, unregularized rule, which amounts to dropping the term η · λ · sgn(w) / n. We can therefore simply define sgn(0) = 0, which unifies the w = 0 case with the others. (When programming, let sgn(0) = 0, sgn(w > 0) = 1, and sgn(w < 0) = −1.)
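A minimal NumPy sketch of the L1 update; np.sign already follows the convention just described (sign(0) = 0, sign of a positive number is 1, sign of a negative number is -1), and the names eta, lam, grad_C0_w are illustrative:

import numpy as np

def l1_sgd_step(w, grad_C0_w, eta=0.1, lam=0.01, n=1000):
    # np.sign returns 0 where w == 0, so weights that are exactly zero receive
    # no regularization push, exactly as argued above.
    return w - eta * lam / n * np.sign(w) - eta * grad_C0_w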