This article explains how to quickly master the Adam optimization algorithm. The material is straightforward and easy to follow, so let's work through it step by step.
This tutorial is divided into three parts; they are:
Gradient descent
Adam optimization algorithm
Adam gradient descent
  Two-dimensional test problem
  Gradient descent optimization with Adam
  Adam visualization
Gradient descent
Gradient descent is an optimization algorithm. It is technically referred to as a first-order optimization algorithm because it explicitly makes use of the first derivative of the objective function. The first derivative, or simply the "derivative", is the rate of change or slope of the objective function at a specific point, that is, for a specific input. If the objective function takes multiple input variables, it is referred to as a multivariate function, and the input variables can be thought of as a vector. In turn, the derivative of a multivariate objective function may also be taken as a vector and is generally referred to as the gradient.
Gradient: the first derivative of a multivariate objective function.
For a specific input, the derivative or gradient points in the direction of the steepest ascent of the objective function.
Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill to locate the minimum of the objective function. The gradient descent algorithm requires the objective function being optimized and the derivative function for that objective function. The objective function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the objective function for a given set of inputs. The gradient descent algorithm also requires a starting point (x) in the problem, such as a randomly selected point in the input space.
Assuming we are minimizing the objective function, the derivative is calculated and a step is taken in the input space that moves the objective function downhill. A downhill movement is made by first calculating how far to move in the input space, computed as the step size (called alpha or the learning rate) multiplied by the gradient. This amount is then subtracted from the current point, ensuring we move against the gradient, or down the objective function.
x(t) = x(t-1) - step_size * f'(x(t-1))
The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step-size hyperparameter.
Step size (alpha): a hyperparameter that controls how far the algorithm moves in the search space, relative to the gradient, on each iteration.
If the step size is too small, movement through the search space will be slow and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
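To make the update rule concrete, here is a minimal sketch of plain gradient descent in Python, assuming an illustrative one-dimensional objective f(x) = x^2 with derivative f'(x) = 2x (the names f, df, and the chosen values are examples for this sketch, not taken from the tutorial code below):

# minimal gradient descent sketch for f(x) = x^2 (illustrative example)
def f(x):
    return x ** 2.0

# derivative f'(x) = 2x
def df(x):
    return 2.0 * x

x = 0.8          # randomly chosen starting point
step_size = 0.1  # alpha, the learning rate
for t in range(20):
    # x(t) = x(t-1) - step_size * f'(x(t-1))
    x = x - step_size * df(x)
print(x, f(x))

Each iteration moves x against the gradient; with this step size the iterates shrink geometrically toward the minimum at 0.0.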
Now that we are familiar with the gradient descent optimization algorithm, let's take a look at the Adam algorithm.
Adam optimization algorithm
Adam, short for Adaptive Movement Estimation, is an extension of the gradient descent optimization algorithm. The algorithm was described by Diederik Kingma and Jimmy Lei Ba in their 2014 paper titled "Adam: A Method for Stochastic Optimization". Adam is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. produce a better final result. This is achieved by calculating a separate step size for each input parameter being optimized. Importantly, each step size is automatically adapted throughout the search process based on the gradients (partial derivatives) encountered for each variable.
Let's step through each element of the algorithm. First, for each parameter being optimized as part of the search, we must maintain a moment vector and an exponentially weighted infinity norm, referred to as m and v (really the Greek letter nu) respectively. They are initialized to 0.0 at the start of the search.
m = 0
v = 0
The algorithm is executed iteratively over time t starting at t=1, and each iteration involves calculating a new set of parameter values x, e.g. going from x(t-1) to x(t). It is perhaps easiest to understand the algorithm if we focus on updating one parameter; this generalizes to updating all parameters via vector operations. First, the gradient (the partial derivatives) for the current time step is calculated.
g(t) = f'(x(t-1))
Next, the first moment is updated using the gradient and the hyperparameter beta1.
m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
Then, the second moment is updated using the squared gradient and the hyperparameter beta2.
v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
The first and second moments are biased because they are initialized with zero values. Next, the first and second moments are bias-corrected, starting with the first moment:
mhat(t) = m(t) / (1 - beta1(t))
And then the second moment:
vhat(t) = v(t) / (1 - beta2(t))
Note that beta1(t) and beta2(t) refer to the beta1 and beta2 hyperparameters decayed on a schedule over the iterations of the algorithm. A static decay schedule can be used, although the paper recommends the following:
beta1(t) = beta1^t
beta2(t) = beta2^t
Finally, we can calculate the value of the parameter for the iteration.
x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
Where alpha is the step-size hyperparameter, eps is a small value (epsilon) such as 1e-8 that ensures we do not encounter a divide-by-zero error, and sqrt() is the square root function.
Note that a more efficient reordering of the update rule listed in the paper can be used:
alpha(t) = alpha * sqrt(1 - beta2(t)) / (1 - beta1(t))
x(t) = x(t-1) - alpha(t) * m(t) / (sqrt(v(t)) + eps)
To review, there are three hyperparameters for the algorithm; they are:
alpha: the initial step size (learning rate); a typical value is 0.001.
beta1: the decay factor for the first moment (momentum); a typical value is 0.9.
beta2: the decay factor for the infinity norm; a typical value is 0.999.
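As a quick check of the equations above, a single Adam update for one parameter can be sketched in Python as follows (the function name adam_step and its default values are illustrative assumptions for this sketch; t is the 1-based iteration counter):

from math import sqrt

# one Adam update for a single parameter (illustrative sketch; t starts at 1)
def adam_step(x, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
    m = beta1 * m + (1.0 - beta1) * g
    # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
    v = beta2 * v + (1.0 - beta2) * g ** 2
    # bias corrections: mhat(t) = m(t) / (1 - beta1^t), vhat(t) = v(t) / (1 - beta2^t)
    mhat = m / (1.0 - beta1 ** t)
    vhat = v / (1.0 - beta2 ** t)
    # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
    x = x - alpha * mhat / (sqrt(vhat) + eps)
    return x, m, v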
Next, let's look at how to implement the algorithm from scratch in Python.
Adam gradient descent
In this section, we will explore how to use Adam to implement the gradient descent optimization algorithm.
Two-dimensional test problem
First, let's define an optimization function. We will use a simple two-dimensional function that squares the input of each dimension, and define the range of valid inputs from -1.0 to 1.0.
The objective() function below implements this function.
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
We can create a three-dimensional plot of the objective function to get a feeling for the curvature of the response surface. The complete example of plotting the objective function is listed below.
# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()
Running the example creates a three-dimensional surface plot of the objective function. We can see the familiar bowl shape with the global minimum at f(0, 0) = 0.
We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search. The example below creates a contour plot of the objective function.
# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()
Running the example creates a two-dimensional contour plot of the objective function. We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to mark the specific points explored during the search.
Now that we have a test objective function, let's take a look at how to implement the Adam optimization algorithm.
Gradient descent optimization with Adam
We can apply gradient descent with Adam to the test problem. First, we need a function that calculates the derivative for this function.
f(x) = x^2
f'(x) = x * 2
The derivative of x^2 is x * 2 in each dimension. The derivative() function implements this below.
# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])
Next, we can implement gradient descent optimization. First, we can select a random point within the bounds of the problem as a starting point for the search. This assumes we have an array that defines the bounds of the search, with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of that dimension.
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
score = objective(x[0], x[1])
Next, we need to initialize the first moment and the second moment to zero.
# initialize first and second moments
m = [0.0 for _ in range(bounds.shape[0])]
v = [0.0 for _ in range(bounds.shape[0])]
We then run a fixed number of iterations of the algorithm, defined by the "n_iter" hyperparameter.
...
# run iterations of gradient descent
for t in range(n_iter):
    ...
The first step is to calculate the gradient (the partial derivatives) for the current set of parameters using the derivative() function.

# calculate gradient g(t)
g = derivative(x[0], x[1])
Next, we need to perform the Adam update calculations. We will perform these calculations one variable at a time using an imperative programming style to improve readability.
In practice, I recommend using NumPy vector operations to improve efficiency.
...
# build a solution one variable at a time
for i in range(x.shape[0]):
    ...
First, we need to calculate the first moment.
# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
And then the second moment.
# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
Then the bias correction is applied to the first and second moments.
# mhat(t) = m(t) / (1 - beta1(t))
mhat = m[i] / (1.0 - beta1**(t+1))
# vhat(t) = v(t) / (1 - beta2(t))
vhat = v[i] / (1.0 - beta2**(t+1))
And then finally the updated variable value.
# x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
Then repeat this for each parameter you want to optimize. At the end of the iteration, we can evaluate the new parameter values and report the performance of the search.
# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))
We can tie all of this together into a function named adam() that takes the names of the objective and derivative functions as well as the algorithm hyperparameters, and returns the best solution found at the end of the search along with its evaluation.
The complete function is listed below.
# gradient descent algorithm with adam
def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize first and second moments
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent updates
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(x.shape[0]):
            # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
            # mhat(t) = m(t) / (1 - beta1(t))
            mhat = m[i] / (1.0 - beta1**(t+1))
            # vhat(t) = v(t) / (1 - beta2(t))
            vhat = v[i] / (1.0 - beta2**(t+1))
            # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
            x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]
Note: we have intentionally used lists and an imperative coding style instead of vectorized operations to improve readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.
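As one possible vectorized variant (a sketch under the assumption that x, g, m, and v are NumPy arrays of the same shape; the function name adam_update is made up for illustration), the inner loop collapses to a few array operations:

# vectorized Adam update (illustrative sketch; t matches the 0-based loop above)
from numpy import sqrt

def adam_update(x, g, m, v, t, alpha, beta1, beta2, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g**2
    mhat = m / (1.0 - beta1**(t + 1))
    vhat = v / (1.0 - beta2**(t + 1))
    x = x - alpha * mhat / (sqrt(vhat) + eps)
    return x, m, v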
We can then define our hyperparameters and call the adam () function to optimize our test target function.
In this case, we will use 60 iterations of the algorithm with an initial step size of 0.02 and beta1 and beta2 values of 0.8 and 0.999 respectively. These hyperparameter values were found after a little trial and error.
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# steps size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
best, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))
Taking all this together, a complete example of gradient descent optimization using Adam is listed below.
# gradient descent optimization with adam for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adam
def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize first and second moments
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent updates
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(x.shape[0]):
            # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
            # mhat(t) = m(t) / (1 - beta1(t))
            mhat = m[i] / (1.0 - beta1**(t+1))
            # vhat(t) = v(t) / (1 - beta2(t))
            vhat = v[i] / (1.0 - beta2**(t+1))
            # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
            x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# steps size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
best, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))
Run the example to apply the Adam optimization algorithm to our test problem and report the search performance of the algorithm for each iteration.
Note: your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
In this case, we can see that a near-optimal solution is found after perhaps 53 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.
>50 f([-0.00056912 -0.00321961]) = 0.00001
>51 f([-0.00052452 -0.00286514]) = 0.00001
>52 f([-0.00043908 -0.00251304]) = 0.00001
>53 f([-0.0003283 -0.00217044]) = 0.00000
>54 f([-0.00020731 -0.00184302]) = 0.00000
>55 f([-8.95352320e-05 -1.53514076e-03]) = 0.00000
>56 f([ 1.43050285e-05 -1.25002847e-03]) = 0.00000
>57 f([ 9.67123406e-05 -9.89850279e-04]) = 0.00000
>58 f([0.00015359 -0.00075587]) = 0.00000
>59 f([0.00018407 -0.00054858]) = 0.00000
Done!
f([0.00018407 -0.00054858]) = 0.000000
Adam visualization
We can plot the progress of the Adam search on a contour plot of the domain. This can provide an intuition for the progress of the search over the iterations of the algorithm. We must update the adam() function to maintain a list of all solutions found during the search, then return this list at the end of the search. The updated version of the function with these changes is listed below.
# gradient descent algorithm with adam
def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
    solutions = list()
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize first and second moments
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent updates
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(bounds.shape[0]):
            # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
            # mhat(t) = m(t) / (1 - beta1(t))
            mhat = m[i] / (1.0 - beta1**(t+1))
            # vhat(t) = v(t) / (1 - beta2(t))
            vhat = v[i] / (1.0 - beta2**(t+1))
            # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
            x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
        # evaluate candidate point
        score = objective(x[0], x[1])
        # keep track of solutions
        solutions.append(x.copy())
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return solutions
We can then perform the search as before, this time retrieving the list of solutions rather than the best final solution.
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# steps size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
solutions = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
Then, as before, we can create a contour plot of the objective function.
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
Finally, we can draw each solution found during the search as a white dot connected by a line.
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
Tying this together, the complete example of performing the Adam optimization on the test problem and plotting the results on a contour plot is listed below.
# example of plotting the adam search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adam
def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
    solutions = list()
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize first and second moments
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent updates
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(bounds.shape[0]):
            # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
            # mhat(t) = m(t) / (1 - beta1(t))
            mhat = m[i] / (1.0 - beta1**(t+1))
            # vhat(t) = v(t) / (1 - beta2(t))
            vhat = v[i] / (1.0 - beta2**(t+1))
            # x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
            x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
        # evaluate candidate point
        score = objective(x[0], x[1])
        # keep track of solutions
        solutions.append(x.copy())
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# steps size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
solutions = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()
Running the example performs the search as before, except in this case a contour plot of the objective function is created.
In this case, we can see that a white dot is shown for each solution found during the search, starting away from the optima and progressively getting closer to the optima at the center of the plot.
Thank you for reading. That covers "how to quickly master the Adam optimization algorithm". After studying this article, you should have a deeper understanding of the Adam optimization algorithm; its use in practice still needs to be verified through hands-on experimentation.