
[Machine Learning] (2): Gradient Descent Algorithm


In the previous chapter we gave a brief overview of machine learning; from here on we will work through its algorithms step by step. Before getting to the classical learning algorithms, let's take a look at the "gradient descent" algorithm.

Part 1: Algorithm background

As the running example, we again use the relationship between house price and house size from the previous chapter. Houses of different sizes sell for different prices, and we want to estimate the price of a new house by analyzing the existing data samples. This is a simple regression problem in supervised learning; when the target depends on every feature only to the first order, it is called a linear regression problem. Our basic approach to the problem is as follows:

First, we run a learning algorithm on the training set to obtain a hypothesis about the problem (essentially a mapping h(x) = y). Then, given a new house, we predict its price from its size. Before solving the problem we also need to fix some notation for the explanations that follow:

1. m: the number of training samples (the size of the training set)

2. x: the input variable, also called the feature; in this example, the size of the house

3. y: the output variable, also called the target variable; in this example, the house price

4. (x, y): a single training sample

5. (x^(i), y^(i)): the i-th training sample
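To make the notation concrete, here is a minimal Python sketch of how such a training set could be represented; the sizes and prices are made-up values for illustration only and are not from the original article:

# A tiny made-up training set: house size -> house price (arbitrary units).
# Each pair is one training sample (x^(i), y^(i)).
training_set = [
    (50.0, 150.0),
    (80.0, 230.0),
    (120.0, 340.0),
    (150.0, 420.0),
]

m = len(training_set)    # m: number of training samples
x_new = 100.0            # a new house whose price we want to predict

def h(theta0, theta1, x):
    # Hypothesis h(x) = theta0 + theta1 * x: a linear mapping from size to price.
    return theta0 + theta1 * x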

Part 2: Gradient descent

For the house price and house size example above, if we only consider a linear relationship, then y is a linear function of x. To make the example richer, assume the house size and the number of rooms are two features on which the house price depends linearly; our hypothesis h(x) can then be written as a function of x. For convenience, the formulas and their explanations are listed together below (a reconstruction of the formulas in standard notation follows the list):

(1) The hypothesis is a linear function when the two features, house size and number of rooms, are taken into account: x0 is fixed at 1 (the intercept term), x1 is the size of the house, and x2 is the number of rooms.

(2) The function J is our objective (cost) function: the hypothesis that best fits the existing training set is the one that minimizes the sum of squared errors over all samples. The factor of 1/2 in front is only there to simplify later derivations and needs no special attention.

(3) Formula (3) is the update rule of the gradient descent algorithm. Once all samples of the training set are taken into account, the unknowns become the coefficients, so we repeatedly adjust the coefficients to find the values that minimize J. The constant α here is the learning rate, the step size of each descent: if it is too small, the algorithm takes too long to converge; if it is too large, it may overshoot the minimum.

(4) Formula (4) is the gradient descent update obtained when there is only one sample.

(5) Formula (5) is the gradient descent update that considers all m samples (batch gradient descent).

(6) Formula (6) is the stochastic gradient descent update.
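The original formula images are not reproduced in this text version; based on the descriptions above, the standard linear-regression forms they refer to can be reconstructed as follows (a reconstruction in the usual notation, which may differ cosmetically from the originals):

(1)  h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 = \theta^{T} x, \qquad x_0 = 1

(2)  J(\theta) = \tfrac{1}{2} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^2

(3)  \theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta)

(4)  \theta_j := \theta_j - \alpha \bigl( h_\theta(x) - y \bigr) x_j \qquad \text{(single sample)}

(5)  \theta_j := \theta_j - \alpha \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr) x_j^{(i)} \qquad \text{(all m samples, for every j)}

(6)  \theta_j := \theta_j - \alpha \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr) x_j^{(i)} \qquad \text{(one sample i at a time, for every j)}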

Note that the idea of gradient descent is actually very simple. If the cost over all sample values is drawn as a surface, it is like a person standing at some point on a hillside who, before each step, picks the direction that goes downhill fastest. That fastest-downhill direction is given by the partial derivatives at that point, i.e. the negative gradient.

By the same token, the descent path can also be viewed on a contour map of the cost function, where each step moves toward the minimum at the center of the contours.
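Equivalently, writing update rule (3) in vector form makes the "steepest downhill direction" explicit (same notation as above): each step moves against the gradient of J,

\theta := \theta - \alpha \, \nabla_{\theta} J(\theta), \qquad \nabla_{\theta} J(\theta) = \Bigl( \frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial \theta_1}, \frac{\partial J}{\partial \theta_2} \Bigr)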

From formula (5) above we can see that every update of the coefficients traverses the entire sample set, which is why it is called "batch gradient descent"; on a very large training set this is clearly inefficient. Hence "stochastic gradient descent": the idea is to use only one sample at a time to update all the parameter values, which requires on the order of m × n single-parameter updates (m is the sample size, n is the number of parameters). The pseudocode is (a runnable Python sketch follows it):

Repeat {
    For j = 1 to m {
        θ_i := θ_i − α · (h(x^(j)) − y^(j)) · x_i^(j)    (formula (6), for all i)
    }
}
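As a concrete illustration, here is a minimal runnable Python sketch of both variants for the two-feature hypothesis above; the data values, learning rate, and iteration counts are made up for demonstration and are not from the original article:

import random

# Made-up training data: each x is (house size, number of rooms); y is the price.
X = [(50.0, 2.0), (80.0, 3.0), (120.0, 4.0), (150.0, 5.0)]
Y = [150.0, 230.0, 340.0, 420.0]
m = len(X)

def h(theta, x):
    # Hypothesis (1): h(x) = theta0*x0 + theta1*x1 + theta2*x2 with x0 = 1.
    return theta[0] + theta[1] * x[0] + theta[2] * x[1]

def batch_gradient_descent(alpha=1e-5, iterations=2000):
    # Formula (5): every parameter update sums the error over all m samples.
    theta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        grad = [0.0, 0.0, 0.0]
        for x, y in zip(X, Y):
            err = h(theta, x) - y
            for i, xi in enumerate((1.0, x[0], x[1])):   # x0 = 1 (intercept term)
                grad[i] += err * xi
        for i in range(3):
            theta[i] -= alpha * grad[i]
    return theta

def stochastic_gradient_descent(alpha=1e-5, epochs=2000):
    # Formula (6): each parameter update uses a single (shuffled) sample.
    theta = [0.0, 0.0, 0.0]
    samples = list(zip(X, Y))
    for _ in range(epochs):
        random.shuffle(samples)
        for x, y in samples:
            err = h(theta, x) - y
            for i, xi in enumerate((1.0, x[0], x[1])):
                theta[i] -= alpha * err * xi
    return theta

print("batch:     ", batch_gradient_descent())
print("stochastic:", stochastic_gradient_descent())

With this fixed step size, the stochastic version keeps jittering slightly around the minimum rather than settling exactly on it, which matches the convergence remark below.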

Stochastic gradient descent converges faster than batch gradient descent, but its final result may be less accurate and may oscillate back and forth around the minimum. Because the core of the algorithm is to step against the partial derivatives, gradient descent (with a suitable step size) converges to a local minimum; for linear regression the cost function J is convex, so that local minimum is also the global minimum.
