This article presents an example-based analysis of the Adam optimization algorithm. It is fairly detailed and should be a useful reference, so interested readers are encouraged to read to the end.
According to the Oxford Dictionary, optimization is the act of making the best or most effective use of a situation or resource; put simply, it is making something as good as it can be. Often, if something can be modelled mathematically, there is a good chance it can be optimized. This plays a crucial role in deep learning (and perhaps in AI as a whole), because the optimization algorithm you choose can be the difference between getting high-quality results in minutes, hours, or days (and sometimes even weeks).
What is Adam Optimizer?
The Adam optimizer is an extension of stochastic gradient descent (SGD) that can be used in place of the classical SGD procedure to update network weights more efficiently.
Note that the name Adam is not an acronym. In fact, the authors (Diederik P. Kingma of OpenAI and Jimmy Lei Ba of the University of Toronto) point out in their paper, first published as a conference paper at ICLR 2015 under the title "Adam: A Method for Stochastic Optimization", that the name is derived from adaptive moment estimation.
The authors do not hesitate to list the many appealing benefits of applying Adam to non-convex optimization problems, which I share below:
Simple to implement (we will implement Adam later in this article, and you will see first-hand how little code the implementation takes when a powerful deep learning framework does the heavy lifting; a short sketch follows this list)
Computationally efficient
Low memory requirements
Invariant to diagonal rescaling of the gradients (this means Adam's updates are unchanged if the gradients are multiplied by a diagonal matrix with only positive entries; see the related Stack Exchange discussion for a deeper explanation)
Ideal for problems with large data and/or parameters
Suitable for non-stationary targets
Suitable for very noisy and/or sparse gradient problems
Hyperparameters have intuitive interpretations and usually require very little tuning (we will cover this in detail in the configuration section)
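As a quick illustration of the "simple to implement" point, here is a minimal sketch that trains a placeholder linear model with PyTorch's built-in torch.optim.Adam. The model, data, and step count are invented for illustration and are not part of the original article.

import torch
import torch.nn as nn

# Placeholder model and data, just to show how the optimizer is wired up.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

x = torch.randn(32, 10)
y = torch.randn(32, 1)

for step in range(100):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = criterion(model(x), y)  # forward pass and loss
    loss.backward()                # backpropagate to compute gradients
    optimizer.step()               # Adam parameter update

Swapping in a different optimizer is a one-line change, which is what makes this kind of experimentation so cheap.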
How Adam works
In short, Adam uses momentum and adaptive learning rates to accelerate convergence.
Momentum
When explaining momentum, researchers and practitioners alike like to use the analogy of a ball rolling down a hill, gathering speed as it heads toward a local minimum. Essentially, what we need to know is that the momentum method accelerates stochastic gradient descent in the relevant direction and dampens oscillations.
To introduce momentum into our neural network, we add a fraction of the update vector from the past time step to the current update vector. This is what gives the ball its extra push. It can be expressed mathematically, as shown below.
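The original figure is not reproduced here; written out in standard notation (consistent with the symbol descriptions in the next paragraph), the momentum update is:

v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)
\theta = \theta - v_t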
In this momentum update, θ denotes the parameters of the network (weights, biases, or activations), η is the learning rate, J is the objective function we want to optimize, and γ is a constant term, also called the momentum coefficient. v_{t-1} is the update vector of the past time step, and v_t is that of the current time step.
The momentum term γ is usually initialized to 0.9 or a similar value, as mentioned in Sebastian Ruder's paper "An overview of gradient descent optimization algorithms".
Adaptive learning rate
An adaptive learning rate can be thought of as an adjustment to the learning rate during the training phase, for example by reducing the learning rate according to a predefined schedule; we see this idea in AdaGrad, RMSprop, Adam, and AdaDelta. For more details on this topic, Suki Lau has written a very useful blog post called "Learning Rate Schedules and Adaptive Learning Rate Methods for Deep Learning".
Without spending too much time on the AdaGrad optimization algorithm itself, here is a brief explanation of RMSprop, how it improves on AdaGrad, and how it changes the learning rate over time.
RMSprop (root mean square propagation) was developed by Geoff Hinton and, as described in "An overview of gradient descent optimization algorithms", it addresses AdaGrad's radically diminishing learning rate. In short, RMSprop changes the learning rate more slowly than AdaGrad, while still retaining AdaGrad's benefit of faster convergence; see the mathematical expressions below.
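The original figure is not reproduced here; in standard notation (matching the description in the next paragraph), the RMSprop update is:

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t

Here g_t is the gradient at step t and ε is a small constant that prevents division by zero.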
The first equation defines E[g²]_t, an exponentially decaying average of the squared gradients. Geoff Hinton recommends setting γ to 0.9, and a good default value for the learning rate η is 0.001.
This allows the learning rate to adapt over time, which is important to understand because the same mechanism is present in Adam. When we put the two together (momentum and RMSprop), we get Adam; the equations below show the algorithm in detail.
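The original image is not reproduced here; written out, the update rules from the Adam paper are:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\hat{m}_t = m_t / (1 - \beta_1^t)
\hat{v}_t = v_t / (1 - \beta_2^t)
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t

m_t and v_t are exponentially decaying averages of the gradient and the squared gradient (the first and second moment estimates), the hatted terms are their bias-corrected versions, and the paper suggests the defaults β1 = 0.9, β2 = 0.999, and ε = 10⁻⁸.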
If you have taken Andrew Ng's deep learning course, you may remember him saying that "Adam can be understood as RMSprop with momentum." The formulas above are where that statement comes from.
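To make that connection concrete, here is a minimal NumPy sketch of a single Adam update that follows the formulas above. The function name adam_step and the toy quadratic example are my own illustration, not code from the paper.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update; m and v are the running moment estimates, t is the 1-based step count.
    m = beta1 * m + (1 - beta1) * grad       # momentum-like first moment
    v = beta2 * v + (1 - beta2) * grad ** 2  # RMSprop-like second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
# theta steadily moves toward the minimum at 0.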
That is all the content of this article on the Adam optimization algorithm. Thank you for reading, and I hope it has been helpful!