Summary of gradient descent (Gradient Descent)

2025-01-28 Update From: SLTechnology News&Howtos


Gradient descent (Gradient Descent) is one of the most commonly used methods for solving the model parameters of machine learning algorithms, that is, unconstrained optimization problems; the other commonly used method is the least squares method. Here is a complete summary of the gradient descent method.

1. Gradient

In calculus, we take the partial derivative of a multivariate function with respect to each of its parameters and write the partial derivatives as a vector: this vector is the gradient. For example, for the function f(x, y), taking partial derivatives with respect to x and y gives the gradient vector (∂f/∂x, ∂f/∂y)^T, abbreviated as grad f(x, y) or ∇f(x, y). The gradient vector at a specific point (x0, y0) is (∂f/∂x0, ∂f/∂y0)^T, or ∇f(x0, y0). If the function has three parameters, the gradient is (∂f/∂x, ∂f/∂y, ∂f/∂z)^T, and so on.

So what is the point of this gradient vector? Geometrically, it points in the direction in which the function increases most rapidly. Specifically, for the function f(x, y), at the point (x0, y0), moving along the direction of the gradient vector (∂f/∂x0, ∂f/∂y0)^T is where f(x, y) increases fastest. In other words, it is easiest to find the maximum of the function by moving along the gradient direction. Conversely, moving in the opposite direction, −(∂f/∂x0, ∂f/∂y0)^T, the function decreases most rapidly, which makes it easiest to find the minimum.
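This property can be checked numerically. A minimal sketch, assuming the made-up example function f(x, y) = x² + y² and central-difference approximations of the partial derivatives:

```python
# Numerically check that f decreases fastest along the negative gradient.
# f(x, y) = x**2 + y**2 is a made-up example function.

def f(x, y):
    return x ** 2 + y ** 2

def gradient(x, y, h=1e-6):
    """Approximate (df/dx, df/dy) with central differences."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

x0, y0 = 3.0, 4.0
gx, gy = gradient(x0, y0)                      # ~ (6, 8) at (3, 4)
norm = (gx ** 2 + gy ** 2) ** 0.5

# A small step against the gradient lowers f; the same step along it raises f.
step = 0.1
down = f(x0 - step * gx / norm, y0 - step * gy / norm)
up = f(x0 + step * gx / norm, y0 + step * gy / norm)
print(down < f(x0, y0) < up)                   # → True
```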

2. Gradient descent and gradient ascent

In machine learning algorithms, when minimizing a loss function, the gradient descent method can be used to solve for the minimum step by step, obtaining the minimized loss function and the corresponding model parameters. Conversely, if we need to find the maximum of the loss function, we iterate with the gradient ascent method.

The gradient descent method and the gradient ascent method can be transformed into each other. For example, if we need to find the minimum of a loss function f(θ), we iterate with gradient descent. But we can equivalently maximize −f(θ) instead, and then the gradient ascent method comes in handy.
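To make the equivalence concrete, here is a minimal sketch that maximizes a made-up concave function f(θ) = −(θ − 2)², whose maximum is at θ = 2, by running gradient descent on −f:

```python
# Maximize f(θ) = -(θ - 2)**2 by running gradient descent on -f(θ).

def neg_f_grad(theta):
    return 2.0 * (theta - 2.0)          # d(-f)/dθ, since -f(θ) = (θ - 2)**2

theta, alpha = 0.0, 0.1
for _ in range(200):
    theta -= alpha * neg_f_grad(theta)  # descent on -f is ascent on f

print(round(theta, 4))                  # → 2.0, the maximizer of f
```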

The following is a detailed summary of the gradient descent method.

3. Detailed explanation of gradient descent algorithm

3.1 An intuitive explanation of gradient descent

First, let's look at an intuitive explanation of gradient descent. Suppose we are somewhere on a large mountain and, not knowing the way down, decide to take it one step at a time: at each position we compute the gradient of the current position and take a step along the negative gradient direction, that is, downhill along the steepest direction; then we compute the gradient at the new position and again step along the steepest downhill direction. We continue step by step until we feel we have reached the foot of the mountain. Of course, proceeding this way, we may not reach the foot of the mountain at all, but only some low point partway down.

From this explanation we can see that gradient descent does not necessarily find the global optimal solution; it may find only a local optimum. Of course, if the loss function is convex, the solution obtained by gradient descent is guaranteed to be the global optimum.

3.2 Related concepts of gradient descent

Before we take a closer look at the gradient descent algorithm, let's take a look at some related concepts.

1. Step size (learning rate): the step size determines how far we move along the negative gradient direction in each iteration of gradient descent. In the downhill example above, the step size is the length of the step we take along the steepest downhill direction from the current position.

2. Feature: the input part of a sample. For example, for two single-feature samples (x^(0), y^(0)) and (x^(1), y^(1)), the first sample's feature is x^(0) and its output is y^(0).

3. Hypothesis function (hypothesis function): in supervised learning, the function used to fit the input samples, written h_θ(x). For example, for single-feature samples (x_i, y_i), a possible fitting function is h_θ(x) = θ_0 + θ_1 x.

4. Loss function (loss function): to evaluate how well a model fits, a loss function is usually used to measure the goodness of fit. Minimizing the loss function means the fit is best, and the corresponding model parameters are the optimal parameters. In linear regression, the loss function is usually the squared difference between the sample output and the hypothesis function. For example, for samples (x_i, y_i), using linear regression, the loss function is:

J(θ_0, θ_1) = Σ_{i=1}^{m} (h_θ(x_i) − y_i)^2

where x_i denotes the i-th sample's feature, y_i the i-th sample's output, and h_θ(x_i) the hypothesis function.
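The loss above can be computed directly. A minimal sketch, using made-up data generated by y = 2x (so θ_0 = 0, θ_1 = 2 gives zero loss):

```python
# Squared-error loss J(θ0, θ1) = Σ_i (hθ(x_i) - y_i)**2 for hθ(x) = θ0 + θ1*x.

def loss(theta0, theta1, xs, ys):
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]            # generated by y = 2x

print(loss(0.0, 2.0, xs, ys))   # → 0.0 (perfect fit)
print(loss(0.0, 1.0, xs, ys))   # → 14.0 (errors -1, -2, -3 squared and summed)
```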

3.3 detailed algorithm for gradient descent

The gradient descent algorithm can be expressed algebraically or in matrix form (also called the vector form). If you are not familiar with matrix analysis, the algebraic form is easier to understand; the matrix form, however, is more concise, and thanks to the use of matrices the implementation logic is clearer at a glance. We introduce the algebraic form first and then the matrix form.

3.3.1 Algebraic description of gradient descent method

1. Prerequisites: confirm the hypothesis function and the loss function of the optimization model.

For example, for linear regression, the hypothesis function is h_θ(x_1, x_2, ..., x_n) = θ_0 + θ_1 x_1 + ... + θ_n x_n, where θ_i (i = 0, 1, 2, ..., n) are the model parameters and x_i (i = 1, 2, ..., n) are the n feature values of each sample. To simplify the notation we add a feature x_0 = 1, so that h_θ(x_0, x_1, ..., x_n) = Σ_{i=0}^{n} θ_i x_i.

Still for linear regression, corresponding to the hypothesis function above, the loss function is:

J(θ_0, θ_1, ..., θ_n) = (1/2m) Σ_{j=0}^{m} (h_θ(x_0^{(j)}, x_1^{(j)}, ..., x_n^{(j)}) − y_j)^2

2. Initialize the algorithm parameters: mainly θ_0, θ_1, ..., θ_n, the termination tolerance ε, and the step size α. With no prior knowledge, I like to initialize all the θ values to 0 and the step size to 1, and then optimize them during tuning.

3. Algorithm process:

1) Determine the gradient of the loss function at the current position. For θ_i, the gradient expression is:

∂J(θ_0, θ_1, ..., θ_n)/∂θ_i

2) Multiply the gradient of the loss function by the step size to obtain the descent distance at the current position, i.e., α · ∂J(θ_0, θ_1, ..., θ_n)/∂θ_i, which corresponds to one step in the downhill example above.

3) Check whether the gradient descent distance is less than ε for every θ_i. If so, the algorithm terminates, and the current values θ_i (i = 0, 1, ..., n) are the final result. Otherwise proceed to step 4.

4) Update all the θ values. For θ_i, the update expression is as follows. After updating, return to step 1.

θ_i = θ_i − α · ∂J(θ_0, θ_1, ..., θ_n)/∂θ_i

Let us describe gradient descent in detail with a linear regression example. Suppose our samples are (x_1^{(0)}, x_2^{(0)}, ..., x_n^{(0)}, y_0), (x_1^{(1)}, x_2^{(1)}, ..., x_n^{(1)}, y_1), ..., (x_1^{(m)}, x_2^{(m)}, ..., x_n^{(m)}, y_m), and the loss function, as given in the prerequisites, is:

J(θ_0, θ_1, ..., θ_n) = (1/2m) Σ_{j=0}^{m} (h_θ(x_0^{(j)}, x_1^{(j)}, ..., x_n^{(j)}) − y_j)^2

Then, in step 1 of the algorithm, the partial derivative with respect to θ_i is computed as follows:

∂J(θ_0, θ_1, ..., θ_n)/∂θ_i = (1/m) Σ_{j=0}^{m} (h_θ(x_0^{(j)}, x_1^{(j)}, ..., x_n^{(j)}) − y_j) · x_i^{(j)}

Since the samples have no x_0, all x_0^{(j)} are set to 1 in the formula above.

In step 4, the update expression for θ_i is:

θ_i = θ_i − α · (1/m) Σ_{j=0}^{m} (h_θ(x_0^{(j)}, x_1^{(j)}, ..., x_n^{(j)}) − y_j) · x_i^{(j)}

From this example we can see that the gradient direction at the current point is determined by all the samples; the factor 1/m is included to make the expression easier to understand as an average. Since the step size is also a constant, their product is a constant as well, so α · (1/m) can be represented by a single constant.
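The four algebraic steps above can be sketched in code. A minimal, illustrative implementation (the data, step size α, and tolerance ε are assumptions, with x_0 = 1 prepended to every sample as in the prerequisites):

```python
# Batch gradient descent for linear regression, following steps 1-4 above.

def h(theta, x):
    """Hypothesis h_θ(x) = Σ_i θ_i * x_i, with x[0] == 1."""
    return sum(t * xi for t, xi in zip(theta, x))

def fit(samples, ys, alpha=0.1, eps=1e-8, max_iters=100000):
    m, n = len(samples), len(samples[0])
    theta = [0.0] * n                           # initialize all θ_i to 0
    for _ in range(max_iters):
        # step 1: gradient for each θ_i (with the 1/m averaging factor)
        grad = [sum((h(theta, x) - y) * x[i]
                    for x, y in zip(samples, ys)) / m for i in range(n)]
        # step 2: descent distances; step 3: the ε termination check
        steps = [alpha * g for g in grad]
        if all(abs(s) < eps for s in steps):
            break
        # step 4: update every θ_i, then loop back to step 1
        theta = [t - s for t, s in zip(theta, steps)]
    return theta

# Made-up samples from y = 1 + 2*x1, with the added feature x0 = 1
samples = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
theta = fit(samples, ys)
print([round(t, 3) for t in theta])             # → [1.0, 2.0]
```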

The main difference between the variants of the gradient descent method discussed in section 4 below lies in how they use the samples; here we use all of them.

3.3.2 Matrix description of gradient descent method

This part explains the matrix expression of the gradient descent method. Compared with the algebraic form in 3.3.1, it requires some basic knowledge of matrix analysis, especially matrix differentiation.

1. Prerequisites: as in 3.3.1, confirm the hypothesis function and the loss function of the optimization model. For linear regression, the hypothesis function h_θ(x_1, x_2, ..., x_n) = θ_0 + θ_1 x_1 + ... + θ_n x_n has the following matrix expression:

h_θ(X) = Xθ, where the hypothesis function h_θ(X) is an m×1 vector, θ is an (n+1)×1 vector containing the n+1 model parameters of the algebraic form, and X is an m×(n+1) matrix; m is the number of samples and n+1 the number of features of each sample (including the added x_0 = 1).

The expression of the loss function is J(θ) = ½ (Xθ − Y)^T (Xθ − Y), where Y is the output vector of the samples, with dimension m×1.

2. Initialize the algorithm parameters: the θ vector can be initialized to default values or to values obtained from tuning. The algorithm's termination tolerance ε and step size α are the same as in 3.3.1.

3. Algorithm process:

1) Determine the gradient of the loss function at the current position. For the θ vector, the gradient expression is:

∂J(θ)/∂θ

2) Multiply the gradient of the loss function by the step size to obtain the descent distance at the current position, i.e., α · ∂J(θ)/∂θ, which corresponds to one step in the downhill example above.

3) Check whether the gradient descent distance is less than ε for every value in the θ vector. If so, the algorithm terminates, and the current θ vector is the final result. Otherwise proceed to step 4.

4) Update the θ vector. The update expression is as follows. After updating, return to step 1.

θ = θ − α · ∂J(θ)/∂θ

Let us again use the linear regression example to describe the concrete algorithm.

The partial derivative of the loss function with respect to the θ vector is:

∂J(θ)/∂θ = X^T (Xθ − Y)

In step 4, the update expression for the θ vector is: θ = θ − α X^T (Xθ − Y)
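The matrix-form update can be sketched in a few lines. A minimal sketch using NumPy (an assumption here), on the same kind of made-up y = 1 + 2·x1 data:

```python
# Matrix-form gradient descent θ = θ - α·Xᵀ(Xθ - Y) for linear regression.
import numpy as np

def fit_matrix(X, Y, alpha=0.01, iters=5000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * X.T @ (X @ theta - Y)  # gradient of ½(Xθ-Y)ᵀ(Xθ-Y)
    return theta

# X carries a leading column of ones (the added x0 = 1); Y follows y = 1 + 2*x1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y = np.array([1.0, 3.0, 5.0, 7.0])
theta = fit_matrix(X, Y)
print(np.round(theta, 3))                       # → [1. 2.]
```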

Compared with the algebraic form in 3.3.1, the matrix form is clearly much simpler. It uses the chain rule for matrix differentiation together with two matrix derivative formulas.

Formula 1: ∂(X X^T)/∂X = 2X

Formula 2: ∂(Xθ)/∂θ = X^T

If you need to become familiar with matrix differentiation, refer to Zhang Xianda's book "Matrix Analysis and Applications".

3.4 Algorithm tuning for gradient descent

Tuning is required when using gradient descent. Which aspects need tuning?

1. Step size selection. In the algorithm description above I mentioned taking the step size as 1, but the actual value depends on the data sample. We can try several values, from large to small, run the algorithm with each, and examine the iterative behavior. If the loss function keeps getting smaller, the chosen value is valid; otherwise, decrease the step size. As noted earlier, if the step size is too large, the iteration moves too fast and may even skip over the optimal solution; if it is too small, iteration is too slow and the algorithm takes a very long time to finish. The algorithm therefore needs to be run several times with different step sizes to find a good value.
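This trial-and-error procedure can be sketched as follows (a made-up one-parameter loss; the candidate step sizes are arbitrary):

```python
# Try step sizes from large to small and check whether the loss decreases.

def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)

def run(alpha, iters=10):
    theta = 0.0
    for _ in range(iters):
        theta -= alpha * grad(theta)
    return loss(theta)

for alpha in [1.5, 0.5, 0.1]:
    final = run(alpha)
    verdict = "valid" if final < loss(0.0) else "too large, decrease it"
    print(f"alpha={alpha}: final loss {final:.4g} ({verdict})")
```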

2. Initial value selection for the parameters. Different initial values may lead to different minima, since gradient descent only finds a local minimum; of course, if the loss function is convex, the result is guaranteed to be the global optimum. Because of the risk of landing in a local optimum, run the algorithm several times with different initial values, compare the minimum loss achieved by each run, and choose the initial values that minimize the loss function.

3. Normalization. Because different features of the samples have different value ranges, iteration may be slow. To reduce the influence of feature scale, the feature data can be normalized: for each feature x, compute its mean x̄ and standard deviation std(x), and then transform it to:

x ← (x − x̄) / std(x)

The feature's new mean is 0 and its new variance is 1, which can greatly reduce the number of iterations.
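A minimal sketch of this normalization step, using Python's statistics module on made-up feature values:

```python
# Normalize a feature to mean 0 and standard deviation 1: (x - mean) / std.
from statistics import mean, pstdev

def normalize(xs):
    mu, sd = mean(xs), pstdev(xs)
    return [(x - mu) / sd for x in xs]

feature = [10.0, 20.0, 30.0, 40.0]
z = normalize(feature)
print(round(mean(z), 10))       # → 0.0
print(round(pstdev(z), 10))     # → 1.0
```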

4. The gradient descent family (BGD, SGD, MBGD)

4.1 batch gradient descent method (Batch Gradient Descent)

The batch gradient descent method is the most common form of gradient descent: all the samples are used to update the parameters. It corresponds to the linear regression gradient descent algorithm in 3.3.1 above; that is, the algorithm of 3.3.1 is batch gradient descent.

θ_i = θ_i − α Σ_{j=0}^{m} (h_θ(x_0^{(j)}, x_1^{(j)}, ..., x_n^{(j)}) − y_j) · x_i^{(j)}

Since we have m samples, all of them are used to compute the gradient here.

4.2 Stochastic gradient descent (Stochastic Gradient Descent)

The principle of stochastic gradient descent is similar to that of batch gradient descent, except that instead of using the data of all m samples, a single sample j is selected to compute the gradient. The corresponding update formula is:

θ_i = θ_i − α (h_θ(x_0^{(j)}, x_1^{(j)}, ..., x_n^{(j)}) − y_j) · x_i^{(j)}

Stochastic gradient descent and the batch gradient descent of 4.1 are two extremes: one uses all the data for each gradient step, the other a single sample. Their respective advantages and disadvantages are correspondingly pronounced. As for training speed, stochastic gradient descent uses only one sample per step and is very fast, while batch gradient descent is unsatisfactory when the sample size is very large. As for accuracy, stochastic gradient descent determines the gradient direction from a single sample, so the solution may well not be optimal. As for convergence rate, because stochastic gradient descent iterates on one sample at a time, the iteration direction varies widely, and it cannot converge quickly to the local optimum.
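A minimal sketch of the stochastic update, with the same kind of made-up y = 1 + 2·x1 data as before (α, the update count, and the fixed seed are assumptions):

```python
# Stochastic gradient descent: each update uses one randomly chosen sample j.
import random

def sgd(samples, ys, alpha=0.05, updates=800, seed=0):
    rng = random.Random(seed)               # fixed seed for a repeatable run
    theta = [0.0] * len(samples[0])
    for _ in range(updates):
        j = rng.randrange(len(samples))     # pick a single sample at random
        err = sum(t * x for t, x in zip(theta, samples[j])) - ys[j]
        theta = [t - alpha * err * x for t, x in zip(theta, samples[j])]
    return theta

samples = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # x0 = 1 prepended
ys = [1.0, 3.0, 5.0, 7.0]                                    # y = 1 + 2*x1
theta = sgd(samples, ys)
print([round(t, 2) for t in theta])         # converges near [1.0, 2.0]
```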

So is there a middle ground that combines the advantages of both methods? Yes: the mini-batch gradient descent method of 4.3.

4.3 Mini-batch gradient descent (Mini-batch Gradient Descent)

Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent: for m samples, we use x of them to iterate, where 1 < x < m.
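A minimal sketch of mini-batch updates (batch size x = 2 on the same kind of made-up y = 1 + 2·x1 data; the batch size, α, epoch count, and seed are all illustrative assumptions):

```python
# Mini-batch gradient descent: each update averages the gradient over a
# small batch of x samples, with 1 < x < m.
import random

def minibatch_gd(samples, ys, batch=2, alpha=0.05, epochs=400, seed=0):
    rng = random.Random(seed)
    m, n = len(samples), len(samples[0])
    idx = list(range(m))
    theta = [0.0] * n
    for _ in range(epochs):
        rng.shuffle(idx)                      # fresh random batches each epoch
        for start in range(0, m, batch):
            chosen = idx[start:start + batch]
            grad = [0.0] * n
            for j in chosen:
                err = sum(t * x for t, x in zip(theta, samples[j])) - ys[j]
                for i in range(n):
                    grad[i] += err * samples[j][i]
            theta = [t - alpha * g / len(chosen) for t, g in zip(theta, grad)]
    return theta

samples = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # x0 = 1 prepended
ys = [1.0, 3.0, 5.0, 7.0]                                    # y = 1 + 2*x1
theta = minibatch_gd(samples, ys)
print([round(t, 2) for t in theta])           # converges near [1.0, 2.0]
```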
