How to optimize gradient descent using Nadam


This article introduces how to use Nadam for gradient descent optimization. It works through the algorithm in detail with runnable examples, so interested readers should find it a useful reference.

Gradient descent is an optimization algorithm that follows the negative gradient of the objective function to locate the minimum value of the function.

A limitation of gradient descent is that the progress of the search can slow down if the gradient becomes flat or the curvature is large. Momentum can be added to gradient descent, incorporating some inertia into the updates. This can be further improved by using the gradient of a projected new position rather than the current position, a variant called Nesterov Accelerated Gradient (NAG) or Nesterov momentum.

Another limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions of gradient descent, such as the Adaptive Moment Estimation (Adam) algorithm, use a separate step size for each input variable, but the step size may shrink rapidly to very small values. Nesterov-accelerated Adaptive Moment Estimation, or Nadam, is an extension of the Adam algorithm that incorporates Nesterov momentum and can give the optimization algorithm better performance.

In this tutorial, you will discover how to use Nadam for gradient descent optimization from scratch. After completing this tutorial, you will know:

Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.

Nadam is an extension of the Adam version of gradient descent that incorporates Nesterov momentum.

How to implement the Nadam optimization algorithm from scratch, apply it to an objective function, and evaluate the results.

Overview of the tutorial

This tutorial is divided into three parts; they are:

Gradient descent

Nadam optimization algorithm

Gradient descent with Nadam

Two-dimensional test problem

Gradient descent optimization with Nadam

Visualizing the Nadam optimization

Gradient descent

Gradient descent is an optimization algorithm. It is technically referred to as a first-order optimization algorithm because it explicitly uses the first derivative of the objective function.

The first derivative, or simply "the derivative", is the rate of change or slope of the objective function at a specific point, i.e., for a specific input.

If the objective function takes multiple input variables, it is called a multivariate function, and the input variables can be treated as a vector. In turn, the derivative of a multivariate objective function can also be regarded as a vector, usually called the gradient.

Gradient: the first derivative of a multivariate objective function.

For a particular input, the derivative or gradient points in the direction of steepest ascent of the objective function. For example, for f(x, y) = x^2 + y^2, the gradient at the point (0.5, -0.3) is the vector (1.0, -0.6). Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill to locate the minimum of the function.

The gradient descent algorithm requires an objective function that is being optimized and the derivative function of that objective. The objective function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the objective function for a given set of inputs. The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.

Assuming we are minimizing the objective function, we calculate the derivative and take a step in the input space that moves downhill on the objective function. The downhill movement is made by first calculating how far to move in the input space: the step size (called alpha or the learning rate) multiplied by the gradient. This value is then subtracted from the current point, ensuring that we move against the gradient, i.e., down the objective function.

x(t) = x(t-1) - step * f'(x(t-1))

The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

Step size: a hyperparameter that controls how far the algorithm moves in the search space, relative to the gradient, on each iteration.

If the step size is too small, movement through the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optimum.
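To make the update rule concrete, here is a minimal sketch of plain gradient descent on a one-dimensional function; the gradient_descent function and the values used here are illustrative and are not part of this tutorial's code.

# minimal sketch of plain gradient descent on f(x) = x^2 (illustrative, not tutorial code)
def gradient_descent(derivative, x, step_size, n_iter):
    for _ in range(n_iter):
        # step against the gradient, scaled by the step size
        x = x - step_size * derivative(x)
    return x

# the derivative of f(x) = x^2 is f'(x) = 2 * x
solution = gradient_descent(lambda x: 2.0 * x, 1.0, 0.1, 50)
print(solution)  # approaches the minimum at x = 0.0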

Now that we are familiar with the gradient descent optimization algorithm, let's take a look at the Nadam algorithm.

Nadam optimization algorithm

Nesterov-accelerated Adaptive Moment Estimation, or the Nadam algorithm, is an extension of the Adaptive Moment Estimation (Adam) optimization algorithm that adds Nesterov Accelerated Gradient (NAG), or Nesterov momentum, an improved type of momentum. More broadly, the Nadam algorithm is an extension of the gradient descent optimization algorithm. Timothy Dozat described the algorithm in his 2016 paper "Incorporating Nesterov Momentum into Adam", although a version of the paper was written in 2015 as a Stanford project report of the same name.

Momentum adds an exponentially decaying moving average of the gradient (the first moment) to the gradient descent algorithm. This has the effect of smoothing out noisy objective functions and improving convergence.

Adam is an extension of gradient descent that adds the first and second moments of the gradient and automatically adapts a learning rate for each parameter being optimized.

NAG is an extension of momentum in which the momentum update is performed using the gradient at the projected update of the parameter rather than at the actual current value. In some cases, this has the effect of slowing the search down as the optimum is approached, rather than overshooting it.

Nadam is an extension of Adam that uses NAG momentum instead of classical momentum. Let's step through each element of the algorithm. Nadam uses a decaying step size (alpha) and a schedule for the first-moment (mu) hyperparameter to improve performance; for simplicity, we will ignore this aspect for now and use constant values. First, for each parameter being optimized in the search, we must maintain the first and second moments of the gradient, referred to as m and n respectively, and initialize them to 0.0 at the start of the search.

m = 0
n = 0

The algorithm iterates over time t, starting at t = 1, and each iteration involves calculating a new set of parameter values x, e.g., going from x(t-1) to x(t). The algorithm may be easier to understand if we focus on updating one parameter; this generalizes to updating all parameters via vector operations. First, the gradient (the partial derivatives) for the current time step is calculated.

g(t) = f'(x(t-1))

Next, update the first moment using the gradient and the hyperparameter "mu".

m(t) = mu * m(t-1) + (1 - mu) * g(t)

Then update the second moment using the "nu" hyperparameter.

n(t) = nu * n(t-1) + (1 - nu) * g(t)^2

Next, the first moment is bias-corrected using Nesterov momentum.

mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))

The second moment is then bias-corrected. Note: bias correction is an aspect of Adam that counters the fact that the first and second moments are initialized to zero at the start of the search.

nhat = nu * n(t) / (1 - nu)

Finally, we can calculate the value of the parameter for this iteration.

x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat

Here alpha is the step size (learning rate) hyperparameter, sqrt() is the square root function, and eps (epsilon) is a small value, such as 1e-8, added to avoid a divide-by-zero error.

To review, the algorithm has three hyperparameters. They are:

alpha: the initial step size (learning rate); a typical value is 0.002.

mu: the decay factor for the first moment (beta1 in Adam); a typical value is 0.975.

nu: the decay factor for the second moment (beta2 in Adam); a typical value is 0.999.
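To make the equations concrete, the following snippet works through a single Nadam update for one parameter of f(x) = x^2, using the typical hyperparameter values listed above. This is an illustration of the update equations only, not part of the tutorial's implementation.

# one hand-worked Nadam update for a single parameter of f(x) = x^2
# (illustrative values only; not part of the tutorial's implementation)
from math import sqrt

alpha, mu, nu, eps = 0.002, 0.975, 0.999, 1e-8
x, m, n = 1.0, 0.0, 0.0

g = 2.0 * x                                   # g(t) = f'(x(t-1)) = 2.0
m = mu * m + (1.0 - mu) * g                   # m(t) = 0.05
n = nu * n + (1.0 - nu) * g**2                # n(t) = 0.004
mhat = (mu * m / (1.0 - mu)) + ((1.0 - mu) * g / (1.0 - mu))  # 1.95 + 2.0 = 3.95
nhat = nu * n / (1.0 - nu)                    # 3.996
x = x - alpha / (sqrt(nhat) + eps) * mhat     # about 0.9960
print(x)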

Next, let's look at how we can implement the algorithm from scratch in Python.

Gradient descent with Nadam

In this section, we will explore how to implement the gradient descent optimization algorithm with Nadam momentum.

Two-dimensional test problem

First, let's define an optimization function. We will use a simple two-dimensional function that squares the input of each dimension, with valid inputs ranging from -1.0 to 1.0. The objective() function below implements this.

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

We can create a three-dimensional plot of the function to get a feel for the curvature of the response surface. The complete example of plotting the objective function is listed below.

# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.gca(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function. We can see the familiar bowl shape with the global minimum at f(0, 0) = 0.

We can also create a two-dimensional contour plot of the function. This will be helpful later when we want to plot the progress of the search. The example below creates a contour plot of the objective function.

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0, 0], bounds[0, 1], 0.1)
yaxis = arange(bounds[1, 0], bounds[1, 1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a two-dimensional contour plot of the objective function. We can see the bowl shape compressed into contours shown as a color gradient. We will use this plot to draw the specific points explored during the search.

Now that we have a test objective function, let's take a look at how to implement the Nadam optimization algorithm.

Gradient descent optimization with Nadam

We can apply gradient descent with Nadam to the test problem. First, we need a function that calculates the derivative of this function.

The derivative of x^2 is x * 2 in each dimension.

f(x) = x^2
f'(x) = x * 2

The derivative() function below implements this.

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

Next, we can implement gradient descent optimization with Nadam. First, we can select a random point within the bounds of the problem as a starting point for the search. This assumes we have an array that defines the bounds of the search, with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of the dimension.

# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
score = objective(x[0], x[1])

Next, we need to initialize the moment vectors.

# initialize decaying moving averages
m = [0.0 for _ in range(bounds.shape[0])]
n = [0.0 for _ in range(bounds.shape[0])]

We then run a fixed number of iterations of the algorithm, defined by the "n_iter" hyperparameter.

...
# run iterations of gradient descent
for t in range(n_iter):
    ...

The first step is to calculate the gradient for the current set of parameters.

...
# calculate gradient g(t)
g = derivative(x[0], x[1])

Next, we need to perform the Nadam update calculations. To improve readability, we will perform these calculations one variable at a time in an imperative style. In practice, I recommend using NumPy vector operations for efficiency (a minimal vectorized sketch is shown after the complete example below).

...
# build a solution one variable at a time
for i in range(x.shape[0]):
    ...

First, we need to calculate the first moment vector.

# m(t) = mu * m(t-1) + (1 - mu) * g(t)
m[i] = mu * m[i] + (1.0 - mu) * g[i]

And then the second moment vector.

# n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
n[i] = nu * n[i] + (1.0 - nu) * g[i]**2

Then the bias-corrected Nesterov momentum.

# mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
mhat = (mu * m[i] / (1.0 - mu)) + ((1.0 - mu) * g[i] / (1.0 - mu))

Then the bias-corrected second moment.

# nhat = nu * n(t) / (1 - nu)
nhat = nu * n[i] / (1.0 - nu)

Finally, update the parameters.

# x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat

This is then repeated for each parameter being optimized. At the end of each iteration, we can evaluate the new parameter values and report the performance of the search.

# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))

We can tie all of this together into a function named nadam() that takes the names of the objective and derivative functions, as well as the algorithm hyperparameters, and returns the best solution found at the end of the search and its evaluation.

# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize decaying moving averages
    m = [0.0 for _ in range(bounds.shape[0])]
    n = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(bounds.shape[0]):
            # m(t) = mu * m(t-1) + (1 - mu) * g(t)
            m[i] = mu * m[i] + (1.0 - mu) * g[i]
            # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
            n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
            # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
            mhat = (mu * m[i] / (1.0 - mu)) + ((1.0 - mu) * g[i] / (1.0 - mu))
            # nhat = nu * n(t) / (1 - nu)
            nhat = nu * n[i] / (1.0 - nu)
            # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
            x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]

Then, we can define the bounds of the function and the hyperparameters, and call the function to perform the optimization. In this case, we will run the algorithm for 50 iterations with an initial alpha of 0.02, mu of 0.8, and nu of 0.999, found after a little trial and error.

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# step size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
best, score = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)

At the end of the run, we will report on the best solution found.

# summarize the result
print('Done!')
print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of gradient descent with Nadam applied to our test problem is listed below.

# gradient descent optimization with nadam for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize decaying moving averages
    m = [0.0 for _ in range(bounds.shape[0])]
    n = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(bounds.shape[0]):
            # m(t) = mu * m(t-1) + (1 - mu) * g(t)
            m[i] = mu * m[i] + (1.0 - mu) * g[i]
            # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
            n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
            # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
            mhat = (mu * m[i] / (1.0 - mu)) + ((1.0 - mu) * g[i] / (1.0 - mu))
            # nhat = nu * n(t) / (1 - nu)
            nhat = nu * n[i] / (1.0 - nu)
            # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
            x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# step size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
best, score = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)
print('Done!')
print('f(%s) = %f' % (best, score))

Running the example applies the optimization algorithm with Nadam to our test problem and reports the performance of the search for each iteration of the algorithm.

Note: your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 44 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

>40 f([ 5.07445337e-05 -3.32910019e-03]) = 0.00001
>41 f([-1.84325171e-05 -3.00939427e-03]) = 0.00001
>42 f([-6.78814472e-05 -2.69839367e-03]) = 0.00001
>43 f([-9.88339249e-05 -2.40042096e-03]) = 0.00001
>44 f([-0.00011368 -0.00211861]) = 0.00000
>45 f([-0.00011547 -0.00185511]) = 0.00000
>46 f([-0.0001075  -0.00161122]) = 0.00000
>47 f([-9.29922627e-05 -1.38760991e-03]) = 0.00000
>48 f([-7.48258406e-05 -1.18436586e-03]) = 0.00000
>49 f([-5.54299505e-05 -1.00116899e-03]) = 0.00000
Done!
f([-5.54299505e-05 -1.00116899e-03]) = 0.000001
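As noted earlier, the per-variable loop is written out for readability. Below is a minimal vectorized sketch of the same update, assuming the objective() and derivative() functions defined above; the nadam_vectorized name is introduced here for illustration and is not part of the original tutorial code.

# minimal vectorized sketch of the nadam update (illustrative; assumes
# the objective() and derivative() functions defined above)
from numpy import asarray, sqrt, zeros
from numpy.random import rand, seed

def nadam_vectorized(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
    # random initial point within the bounds
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # first and second moment vectors, one element per parameter
    m = zeros(bounds.shape[0])
    n = zeros(bounds.shape[0])
    for t in range(n_iter):
        g = derivative(x[0], x[1])
        # all parameters are updated at once with array operations
        m = mu * m + (1.0 - mu) * g
        n = nu * n + (1.0 - nu) * g**2
        mhat = (mu * m / (1.0 - mu)) + ((1.0 - mu) * g / (1.0 - mu))
        nhat = nu * n / (1.0 - nu)
        x = x - alpha / (sqrt(nhat) + eps) * mhat
    return [x, objective(x[0], x[1])]

# same setup as the complete example above
seed(1)
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
best, score = nadam_vectorized(objective, derivative, bounds, 50, 0.02, 0.8, 0.999)
print('f(%s) = %f' % (best, score))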

Visualizing the Nadam optimization

We can plot the progress of the Nadam search on a contour plot of the domain. This can provide an intuition for the progress of the search over the iterations of the algorithm. We must update the nadam() function to maintain a list of all solutions found during the search, then return this list at the end of the search. The updated version of the function with these changes is listed below.

# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
    solutions = list()
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize decaying moving averages
    m = [0.0 for _ in range(bounds.shape[0])]
    n = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(bounds.shape[0]):
            # m(t) = mu * m(t-1) + (1 - mu) * g(t)
            m[i] = mu * m[i] + (1.0 - mu) * g[i]
            # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
            n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
            # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
            mhat = (mu * m[i] / (1.0 - mu)) + ((1.0 - mu) * g[i] / (1.0 - mu))
            # nhat = nu * n(t) / (1 - nu)
            nhat = nu * n[i] / (1.0 - nu)
            # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
            x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
        # evaluate candidate point
        score = objective(x[0], x[1])
        # store solution
        solutions.append(x.copy())
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return solutions

We can then perform the search as before, this time retrieving the list of solutions rather than the best final solution.

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# step size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
solutions = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)

Then, we can create a contour plot of the objective function, as before.

# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0, 0], bounds[0, 1], 0.1)
yaxis = arange(bounds[1, 0], bounds[1, 1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can draw each solution found during the search as a white dot connected by a line.

# plot the sample as white circles
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying this all together, the complete example of performing the Nadam optimization on the test problem and plotting the results on a contour plot is listed below.

# example of plotting the nadam search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from numpy.random import rand
from numpy.random import seed
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with nadam
def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):
    solutions = list()
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    score = objective(x[0], x[1])
    # initialize decaying moving averages
    m = [0.0 for _ in range(bounds.shape[0])]
    n = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(bounds.shape[0]):
            # m(t) = mu * m(t-1) + (1 - mu) * g(t)
            m[i] = mu * m[i] + (1.0 - mu) * g[i]
            # n(t) = nu * n(t-1) + (1 - nu) * g(t)^2
            n[i] = nu * n[i] + (1.0 - nu) * g[i]**2
            # mhat = (mu * m(t) / (1 - mu)) + ((1 - mu) * g(t) / (1 - mu))
            mhat = (mu * m[i] / (1.0 - mu)) + ((1.0 - mu) * g[i] / (1.0 - mu))
            # nhat = nu * n(t) / (1 - nu)
            nhat = nu * n[i] / (1.0 - nu)
            # x(t) = x(t-1) - alpha / (sqrt(nhat) + eps) * mhat
            x[i] = x[i] - alpha / (sqrt(nhat) + eps) * mhat
        # evaluate candidate point
        score = objective(x[0], x[1])
        # store solution
        solutions.append(x.copy())
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# step size
alpha = 0.02
# factor for average gradient
mu = 0.8
# factor for average squared gradient
nu = 0.999
# perform the gradient descent search with nadam
solutions = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0, 0], bounds[0, 1], 0.1)
yaxis = arange(bounds[1, 0], bounds[1, 1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the sample as white circles
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()

Running the example performs the search as before, except in this case a contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optimum and progressively getting closer to the optimum at the center of the plot.

That's all for this article on how to use Nadam for gradient descent optimization. Thank you for reading, and I hope the content shared here is helpful.
