1. What is LSTM?
As you read this article, you infer the meaning of the current word from your understanding of the words that came before it. You don't throw everything away and start thinking from scratch with each new word; your thoughts have persistence. LSTM networks have this property.
This article introduces another kind of gated recurrent neural network: long short-term memory (LSTM) [1]. Its structure is somewhat more complex than that of the gated recurrent unit; like the GRU, it is designed to address the vanishing-gradient problem in RNNs, and it can be regarded as an extension of the GRU.
If you first understand how the GRU works, LSTM will be much easier to follow. See the companion article: Three-Step Understanding of the Gated Recurrent Unit (GRU).
The LSTM introduces three gates, namely the input gate, the forget gate, and the output gate, as well as a memory cell with the same shape as the hidden state (some literature treats the memory cell as a special kind of hidden state) that records additional information.
2. Input gate, forget gate, and output gate
Like the reset gate and update gate in the gated recurrent unit, the gates of the LSTM take the current time-step input Xt and the previous time-step hidden state H_{t-1} as input, and their outputs are computed by fully connected layers with the sigmoid activation function. As a result, the elements of all three gates take values in the range [0, 1].
Specifically, assume the number of hidden units is h. Given the mini-batch input X_t ∈ R^(n×d) at time step t (n is the number of samples, d is the number of inputs) and the previous time-step hidden state H_{t-1} ∈ R^(n×h), the input gate I_t, forget gate F_t, and output gate O_t at time step t, each in R^(n×h), are computed as follows:

Input gate: I_t = σ(X_t W_xi + H_{t-1} W_hi + b_i)
Forget gate: F_t = σ(X_t W_xf + H_{t-1} W_hf + b_f)
Output gate: O_t = σ(X_t W_xo + H_{t-1} W_ho + b_o)

where W_xi, W_xf, W_xo ∈ R^(d×h) and W_hi, W_hf, W_ho ∈ R^(h×h) are weight parameters, b_i, b_f, b_o ∈ R^(1×h) are bias parameters, and σ is the sigmoid function.
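As an illustration, here is a minimal NumPy sketch of the three gate computations (the dimensions n, d, h and the randomly initialized weights are hypothetical, chosen only for demonstration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, d, h = 2, 5, 4                      # hypothetical batch size, input size, hidden size
rng = np.random.default_rng(0)

X = rng.standard_normal((n, d))        # X_t: current time-step input
H_prev = rng.standard_normal((n, h))   # H_{t-1}: previous hidden state

# One (input-weight, hidden-weight, bias) triple per gate
W_xi, W_hi, b_i = rng.standard_normal((d, h)), rng.standard_normal((h, h)), np.zeros(h)
W_xf, W_hf, b_f = rng.standard_normal((d, h)), rng.standard_normal((h, h)), np.zeros(h)
W_xo, W_ho, b_o = rng.standard_normal((d, h)), rng.standard_normal((h, h)), np.zeros(h)

I = sigmoid(X @ W_xi + H_prev @ W_hi + b_i)  # input gate, elements in [0, 1]
F = sigmoid(X @ W_xf + H_prev @ W_hf + b_f)  # forget gate, elements in [0, 1]
O = sigmoid(X @ W_xo + H_prev @ W_ho + b_o)  # output gate, elements in [0, 1]
```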
3. Candidate memory cell
Next, the LSTM computes the candidate memory cell. Its computation is similar to that of the three gates introduced in the previous section, but it uses the tanh function, whose range is [-1, 1], as the activation function.
Specifically, the candidate memory cell C̃_t ∈ R^(n×h) at time step t is computed as:

C̃_t = tanh(X_t W_xc + H_{t-1} W_hc + b_c)

where W_xc ∈ R^(d×h) and W_hc ∈ R^(h×h) are weight parameters and b_c ∈ R^(1×h) is a bias parameter.
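Continuing in the same style, a minimal NumPy sketch of the candidate memory cell (again with hypothetical dimensions and randomly initialized weights):

```python
import numpy as np

n, d, h = 2, 5, 4                      # hypothetical batch size, input size, hidden size
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))        # X_t
H_prev = rng.standard_normal((n, h))   # H_{t-1}
W_xc, W_hc, b_c = rng.standard_normal((d, h)), rng.standard_normal((h, h)), np.zeros(h)

C_tilde = np.tanh(X @ W_xc + H_prev @ W_hc + b_c)  # candidate memory cell, elements in [-1, 1]
```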
4. Memory cell
We can control the flow of information through the input, forget, and output gates, whose elements all lie in [0, 1]; this is done by element-wise multiplication (denoted ⊙). The memory cell at the current time step combines information from the previous time step's memory cell and the current time step's candidate memory cell, with the forget gate and input gate controlling this flow of information:

C_t = F_t ⊙ C_{t-1} + I_t ⊙ C̃_t

The forget gate controls whether the information in the previous time step's memory cell C_{t-1} is passed on to the current time step, while the input gate controls how the current time-step input X_t flows into the current memory cell via the candidate memory cell C̃_t. If the forget gate is always approximately 1 and the input gate always approximately 0, the past memory cell is saved and passed through time to the current time step unchanged. This design helps address the vanishing-gradient problem in recurrent neural networks and better captures dependencies between time steps that are far apart in a time series.
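The following toy example (hypothetical hand-picked values, not from the original article) illustrates the update C_t = F_t ⊙ C_{t-1} + I_t ⊙ C̃_t and the "save and pass through time" behavior:

```python
import numpy as np

F = np.array([[0.9, 0.1]])        # forget gate: first unit keeps old memory, second discards it
I = np.array([[0.1, 0.9]])        # input gate: first unit admits little, second admits the candidate
C_prev = np.array([[2.0, -1.0]])  # C_{t-1}: previous memory cell
C_tilde = np.array([[0.5, 0.8]])  # C̃_t: candidate memory cell

C = F * C_prev + I * C_tilde      # element-wise products (⊙)
print(C)  # [[1.85 0.62]] -> first unit mostly preserves the past, second mostly takes the new candidate
```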
5. Hidden state
With the memory cell in place, we can also use the output gate to control the flow of information from the memory cell to the hidden state H_t ∈ R^(n×h):

H_t = O_t ⊙ tanh(C_t)
The tanh function here ensures that the elements of the hidden state lie between -1 and 1. Note that when the output gate is approximately 1, the memory cell's information is passed on to the hidden state for the output layer to use; when the output gate is approximately 0, the information is retained only in the memory cell. This completes the computation of the LSTM hidden state.
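Putting the pieces together, here is a minimal NumPy sketch of one complete LSTM forward step (all dimensions, weights, and inputs are hypothetical and randomly initialized; it illustrates the equations above, not a production implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X, H_prev, C_prev, params):
    """One LSTM time step following the equations above."""
    (W_xi, W_hi, b_i, W_xf, W_hf, b_f,
     W_xo, W_ho, b_o, W_xc, W_hc, b_c) = params
    I = sigmoid(X @ W_xi + H_prev @ W_hi + b_i)         # input gate
    F = sigmoid(X @ W_xf + H_prev @ W_hf + b_f)         # forget gate
    O = sigmoid(X @ W_xo + H_prev @ W_ho + b_o)         # output gate
    C_tilde = np.tanh(X @ W_xc + H_prev @ W_hc + b_c)   # candidate memory cell
    C = F * C_prev + I * C_tilde                         # memory cell update
    H = O * np.tanh(C)                                   # hidden state
    return H, C

n, d, h = 2, 5, 4                                        # hypothetical batch, input, hidden sizes
rng = np.random.default_rng(0)
params = tuple(
    rng.standard_normal(shape)
    for shape in [(d, h), (h, h), (h,)] * 4              # (W_x*, W_h*, b_*) for 3 gates + candidate
)
X_t = rng.standard_normal((n, d))
H, C = np.zeros((n, h)), np.zeros((n, h))                # initial hidden state and memory cell
H, C = lstm_step(X_t, H, C, params)
print(H.shape, C.shape)                                  # (2, 4) (2, 4)
```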
6. The difference between LSTM and GRU
LSTM and GRU are very similar in structure. The main differences are:

- Computing the candidate state: in both, the candidate state is computed from the previous state and the current input, but the GRU has a reset gate that controls how much of the previous state enters this computation, while the LSTM has no comparable gate.
- Forming the new state: the LSTM uses two separate gates, a forget gate and an input gate, while the GRU uses a single update gate.
- Output: the LSTM modulates its output through an output gate, while the GRU applies no gating to its output.
- Trade-offs: the GRU is the simpler model, with only two gates, so it is computationally faster and easier to scale up to larger networks; the LSTM, with three gates, is more powerful and more flexible.
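To make the size difference concrete, the sketch below compares the parameter counts of LSTM and GRU layers of the same width using tf.keras (this assumes TensorFlow 2.x; the exact GRU count depends on Keras's bias convention, but the LSTM layer always carries four weight groups against the GRU's three):

```python
import tensorflow as tf

d, h = 28, 64  # hypothetical input feature size and number of hidden units

lstm = tf.keras.layers.LSTM(h)
gru = tf.keras.layers.GRU(h)

# Build each layer on inputs of shape (batch, time, features) so the weights exist.
lstm.build((None, None, d))
gru.build((None, None, d))

print("LSTM parameters:", lstm.count_params())  # four weight groups: 3 gates + candidate
print("GRU parameters:", gru.count_params())    # three weight groups: 2 gates + candidate
```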
7. Can LSTM use other activation functions?
Regarding the choice of activation functions: in the LSTM, the sigmoid function is used as the activation function for the forget gate, the input gate, and the output gate, and the hyperbolic tangent (tanh) is used when generating the candidate memory cell.
It is worth noting that both of these activation functions are saturating: once the input exceeds a certain magnitude, the output barely changes. If an unsaturating activation function such as ReLU were used instead, it would be difficult to achieve the gating effect.
The output of the sigmoid function lies between 0 and 1, which matches the physical meaning of a gate, and when the input is very large or very small the output is very close to 1 or 0, ensuring the gate is effectively open or closed. The tanh function is used when generating the candidate memory because its output lies between -1 and 1, which matches the fact that feature distributions are zero-centered in most scenarios. In addition, the tanh function has a larger gradient than the sigmoid function near an input of 0, which usually makes the model converge faster.
The choice of activation function is not set in stone, but it should be a reasoned one along these lines.
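A quick numerical illustration of this point (a toy example, not from the original article):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(np.round(sigmoid(x), 3))   # [0.002 0.119 0.5 0.881 0.998] -> saturates toward 0/1, a soft switch
print(np.round(np.tanh(x), 3))   # [-1. -0.964 0. 0.964 1.] -> zero-centered, saturates toward -1/1
print(np.maximum(x, 0.0))        # [0. 0. 0. 2. 6.] -> ReLU is unbounded above, so it cannot act as a [0, 1] gate
```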
8. Code implementation
MNIST digit classification: implementing an LSTM with TensorFlow
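The original links to a full TensorFlow implementation, which is not reproduced here. Below is a minimal sketch of the same idea using tf.keras (assuming TensorFlow 2.x), treating each 28-pixel row of an MNIST image as one time step of a 28-dimensional input; the hidden size and training settings are illustrative choices only:

```python
import tensorflow as tf

# Load MNIST and scale pixels to [0, 1]; each 28x28 image becomes 28 time steps of 28 features.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.LSTM(128),                          # hidden size 128 is an arbitrary choice
    tf.keras.layers.Dense(10, activation="softmax"),    # 10 digit classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=2, batch_size=128,
          validation_data=(x_test, y_test))
```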
[A series of easy-to-understand machine learning articles]
9. References
[1] Dive into Deep Learning (动手学深度学习)
Author: @mantchs
GitHub: https://github.com/NLP-LOVE/ML-NLP
You are welcome to join the discussion and help improve this project! Group number: [541954936]