Many newcomers are unclear about the basic principles of LSTM. To help with that, this article explains them in detail; readers who need this background should come away with something useful.
The RNN in its original form is not sufficient for more complex sequence modeling problems: it suffers from a severe vanishing gradient problem, and the most visible symptom is that, as the number of layers grows, the network gradually becomes untrainable. Long Short-Term Memory (LSTM) is a special RNN structure designed to alleviate the vanishing gradient problem.
The trouble with deep neural networks: gradient explosion and gradient vanishing
In the earlier explanation of ordinary deep neural networks and deep convolutional networks, Figure 1 showed a simple two-layer network. When the network structure becomes deeper, the network runs into exploding or vanishing gradients during training. So what are exploding and vanishing gradients, and how do they arise?
Figure 1 Two-layer network
Because of the way neural networks are trained, every type of neural network updates its weights by computing gradients through backpropagation. A loss function is defined, and the gradients of that loss with respect to the inputs and outputs of each layer are then computed. Once training begins, the system updates the parameters of each layer according to the backpropagation mechanism until training stops. As the network gets deeper, however, training becomes less stable and problems often appear; gradient explosion and gradient vanishing are two of the more serious ones. A minimal sketch of this update loop is shown below.
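The snippet below illustrates the training mechanism just described, assuming PyTorch as the framework (the original text does not name one): a forward pass computes the loss, backpropagation computes the gradient of the loss with respect to every layer's parameters, and the optimizer applies the update. The network shape and data are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Small two-layer network, loss function, and optimizer (all illustrative).
model = nn.Sequential(nn.Linear(10, 32), nn.Sigmoid(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(64, 10)          # dummy inputs
y = torch.randn(64, 1)           # dummy targets

for step in range(100):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(x), y)  # forward pass and loss
    loss.backward()              # backpropagation: gradients w.r.t. every layer
    optimizer.step()             # update each layer's parameters from its gradient
```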
Gradient explosion occurs when, during training, the gradients become larger and larger, so the network weights receive wildly large updates. This situation is easy to spot: because the gradients are so large, the computed parameter updates overflow, and the updated parameter values are full of NaNs, indicating that the exploding gradients have caused numerical overflow in the parameter update. That is the basic picture of gradient explosion.
Gradient vanishing is the opposite: during training the gradients become smaller and smaller until, effectively, no update takes place. As the network deepens, the errors at the deep layers can barely influence the weight updates of the earlier layers because the gradients shrink on the way back. Once the weights can no longer be updated effectively, the training mechanism of the neural network breaks down.
Why do gradients grow or shrink during neural network training? This can be explained using the backpropagation derivation from the first lecture of this book as an example.
Equations (11.1)~(11.8) give the backpropagation parameter-update derivation for a two-layer network. The weights of the input-to-hidden layer are relatively far from the output layer, and from those formulas it can be seen that the gradient of the loss with respect to the input-to-hidden weights and biases is obtained by repeatedly multiplying the next layer's weights by the derivative of the activation function. If the product of the activation-function derivative and the next layer's weight is greater than 1 (or much greater than 1), the gradient tends to explode as the network deepens and is updated; if that product is less than 1, the gradient of the shallow layers becomes smaller and smaller as the network deepens, and the gradient tends to vanish. In this sense, the backpropagation mechanism itself creates the two unstable factors of gradient explosion and gradient vanishing. For example, for a 100-layer deep network, if the per-layer gradient factor is 1.1, the gradient after backpropagating from output to input may become 1.1^100 ≈ 13780.61, a value large enough to cause numerical overflow; if the per-layer factor is 0.9, the gradient reaching the input layer may be 0.9^100 ≈ 0.000026561398, small enough for the gradient to effectively vanish. This is only a simplified, hypothetical example; the actual backpropagation computation is more complex. A toy version of this compounding calculation is sketched below.
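The compounding effect described above can be reproduced with a few lines of plain Python; the layer count and the per-layer factors are simply the hypothetical numbers from the example, not measurements of any real network.

```python
# Toy illustration of how per-layer factors compound through backpropagation:
# a factor of 1.1 per layer explodes, a factor of 0.9 per layer vanishes.
layers = 100

exploding = 1.1 ** layers
vanishing = 0.9 ** layers

print(f"1.1^{layers} = {exploding:.2f}")    # ~13780.61
print(f"0.9^{layers} = {vanishing:.12f}")   # ~0.000026561399
```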
In general, gradients that are too large or too small produce parameters that are too large or too small, and either way the neural network fails to train. The goal, therefore, is to keep the gradient computation within a normal range, neither too large nor too small; this is also the guiding idea behind the solutions to both problems.
So how are gradient explosion and gradient vanishing handled? Gradient explosion is relatively easy to deal with: during training the gradients can simply be clipped. Gradient vanishing is more troublesome. From the analysis above we know that one key cause of vanishing gradients is the activation function; the Sigmoid activation is particularly prone to it, so in general we replace it with the more robust ReLU activation and add batch normalization (BN) layers to the network. This handles the common cases well, but it does not work in every situation, for example in RNNs, which we focus on later. A sketch of these mitigations follows this paragraph.
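Below is a sketch of both mitigations mentioned above, gradient clipping plus a ReLU/BatchNorm architecture; PyTorch is assumed here, and the network shape, clipping threshold, and data are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# ReLU instead of Sigmoid and a BN layer to ease vanishing gradients.
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.BatchNorm1d(64),   # batch normalization (BN) layer
    nn.ReLU(),            # ReLU activation
    nn.Linear(64, 1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = loss_fn(model(x), y)
loss.backward()

# Clip ("trim") the global gradient norm before the parameter update
# to guard against exploding gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```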
That is the basic explanation of gradient explosion and gradient vanishing. Now let's return to the topic and talk about the protagonist of this article: LSTM.
LSTM: giving RNNs a better memory mechanism. All of the groundwork above was laid for LSTM. Ordinary neural networks and convolutional networks suffer from exploding and vanishing gradients; does the recurrent neural network (RNN) as well? Of course it does, and the vanishing and exploding gradient problems do even more damage to RNNs. When an RNN becomes deep, it loses memory to some extent because vanishing gradients prevent the weights of the earlier layers from being updated. To address this, researchers proposed several well-known improvements on top of the traditional RNN structure; because these improvements do not depart from the classical RNN architecture, they are usually called RNN variants. The best known are the GRU (Gated Recurrent Unit) and the LSTM (Long Short-Term Memory network). GRU and LSTM are broadly similar in structure, with some differences; in this lecture we take the more representative LSTM and explain it in detail.
Before diving into the technical details of LSTM, a few things should be clear. First, LSTM is essentially still an RNN. Second, LSTM makes relatively complex modifications to the traditional RNN structure, which lets it handle the exploding and vanishing gradient problems better than the classical RNN and gives the recurrent network stronger and longer memory; that is precisely the value of LSTM. Now let's focus on the technical details, starting with a quick comparison in code.
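As a quick illustration (assuming PyTorch, which the original text does not mention), the snippet below shows the most visible difference in practice: an LSTM layer carries an extra cell state c alongside the hidden state h that a plain RNN returns.

```python
import torch
import torch.nn as nn

x = torch.randn(5, 3, 10)             # (sequence length, batch, input features)

rnn = nn.RNN(input_size=10, hidden_size=20)
lstm = nn.LSTM(input_size=10, hidden_size=20)

rnn_out, h_rnn = rnn(x)               # classic RNN: outputs + final hidden state
lstm_out, (h_lstm, c_lstm) = lstm(x)  # LSTM: outputs + (hidden state, cell/memory state)

print(rnn_out.shape, h_rnn.shape)     # torch.Size([5, 3, 20]) torch.Size([1, 3, 20])
print(lstm_out.shape, h_lstm.shape, c_lstm.shape)
```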
Let's first place the classical RNN structure and the LSTM structure side by side to get a macro-level picture, and then take the LSTM structure diagram apart piece by piece. Figure 2 shows the standard RNN structure and Figure 3 shows the LSTM structure.
Figure 2 RNN structure
Figure 3 Structure of LSTM
As can be seen from Figure 3, an LSTM unit is much more complex than an RNN unit: each LSTM cell contains four interacting network layers. Zooming in on the LSTM cell and labeling each structure gives Figure 4.
Figure 4 LSTM unit
According to Figure 4, a complete LSTM cell can be represented by equations (11.9) to (11.14), where the bracket notation [h_(t-1), x_t] indicates the merging (concatenation) of the two vectors. We now take the LSTM unit apart and interpret it module by module, following the structure diagram and the formulas; a consolidated code sketch of the whole cell follows the output gate section below.
1. Memory cell. The memory cell is shown in red in Figure 5. A straight arrow runs through the top of the LSTM cell, from input to output. Compared with the RNN, the LSTM adds the cell state c as a memory input. The memory cell provides the memory function: it can still carry information from earlier layers and earlier steps when the network structure deepens, and this straight path makes it easy for information to stay in memory as it flows between cells.
Figure 5 LSTM memory cell
2. Forget gate. In the standard notation, the forget gate is computed as f_t = σ(W_f · [h_(t-1), x_t] + b_f). Its function is to decide which information to discard from the memory cell c, which is handled by a Sigmoid function. The position of the forget gate within the overall structure is shown in Figure 6: it takes the current input and the previous hidden state and applies a combined weighting to them.
Figure 6 Forget gate
3. Memory cell candidate value and update gate. The update gate determines what information should be stored in the memory cell; in addition, a tanh layer computes the candidate value for the memory cell. The update gate in LSTM deserves a bit of extra attention. In standard notation, the candidate value and the update gate are computed as c̃_t = tanh(W_c · [h_(t-1), x_t] + b_c) and i_t = σ(W_i · [h_(t-1), x_t] + b_i). Their positions in the overall structure are shown in Figure 7.
Figure 7 Memory cell candidate value and update gate
4. Memory cell update. The forget gate, the update gate, the memory cell value of the previous step, and the memory cell candidate value jointly determine the new cell state: c_t = f_t * c_(t-1) + i_t * c̃_t. The position of the memory cell update within the LSTM structure is shown in Figure 8.
Figure 8 Memory cell update
5. Output gate. The LSTM provides a separate output gate, computed in standard notation as o_t = σ(W_o · [h_(t-1), x_t] + b_o), with the new hidden state given by h_t = o_t * tanh(c_t). The position of the output gate is shown in Figure 9.
Figure 9 Output gate
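The standard equations walked through above can be written out directly as code. The following is a minimal NumPy sketch of a single LSTM time step; the function name, the random toy weights, and the dimensions are illustrative assumptions rather than anything from the original text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM time step in the standard formulation sketched above.

    Each weight matrix acts on the concatenation [h_prev, x_t]; all names
    here are illustrative, not taken from the original text.
    """
    z = np.concatenate([h_prev, x_t])      # merge hidden state and input

    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # update (input) gate
    c_tilde = np.tanh(W_c @ z + b_c)       # memory cell candidate value
    c_t = f_t * c_prev + i_t * c_tilde     # memory cell update
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # new hidden state

    return h_t, c_t

# Tiny usage example with random toy weights.
n_in, n_hidden = 4, 3
rng = np.random.default_rng(0)
shape = (n_hidden, n_hidden + n_in)
W_f, W_i, W_c, W_o = (rng.standard_normal(shape) for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_cell_step(rng.standard_normal(n_in), h, c,
                      W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o)
print(h, c)
```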
That is the complete LSTM structure. It looks complicated, but it becomes quite clear once taken apart step by step. LSTM has wide and deep applications in natural language processing, question answering systems, stock forecasting, and more.