Today I will introduce the principles of the Transformer and how it compares with the RNN encoder-decoder. The content is practical, and friends who need it are welcome to read along with the editor's line of thought. I hope it is helpful to you.
1. Comparison with the RNN encoder-decoder
The Transformer relies on the attention mechanism and uses neither RNN nor CNN, so it has a high degree of parallelism.
Through attention, it captures long-distance dependencies better than an RNN.
The feature extraction ability of the Transformer is stronger than that of the RNN family of models. The biggest problem of seq2seq is that all the information on the Encoder side is compressed into a single fixed-length tensor (a minimal attention sketch follows below).
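As a minimal sketch of the point above (assuming PyTorch; the function name, shapes, and toy values are illustrative), scaled dot-product attention lets every query look at all encoder states directly, instead of a single fixed-length context vector:

```python
# Minimal sketch (PyTorch assumed): scaled dot-product attention.
# Unlike a seq2seq RNN that squeezes the whole source into one fixed-length
# vector, every query can attend directly to every encoder state.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q: [batch, tgt_len, d_k], k/v: [batch, src_len, d_k]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # [batch, tgt_len, src_len]
    weights = F.softmax(scores, dim=-1)   # attention over ALL source positions
    return weights @ v                    # [batch, tgt_len, d_k]

q = torch.randn(2, 5, 64)       # 5 decoder positions, processed in parallel
k = v = torch.randn(2, 7, 64)   # 7 encoder states, all visible to every query
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                # torch.Size([2, 5, 64])
```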
2. Understanding the Transformer
1) RNN (LSTM, GRU) training is iterative and serial: the current word must be processed before the next word can be handled. The Transformer's encoder and decoder process all words in parallel during training, which increases computational efficiency.
2) The Transformer model consists of an Encoder and a Decoder (a small sketch of the parallel encoder follows below).
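The sketch below (assuming PyTorch; layer sizes are arbitrary) contrasts the two: the Transformer encoder handles the whole sequence in one call, while an RNN's recurrence is inherently serial over the positions:

```python
# Sketch (PyTorch assumed): the encoder sees the whole sequence in one call,
# whereas an RNN must step through it token by token.
import torch
import torch.nn as nn

d_model, seq_len, batch = 64, 10, 2
x = torch.randn(batch, seq_len, d_model)

# Transformer encoder: all positions are processed in parallel.
enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
enc_out = encoder(x)            # one call, shape [batch, seq_len, d_model]

# RNN encoder: the recurrence is serial over the sequence.
rnn = nn.GRU(d_model, d_model, batch_first=True)
rnn_out, _ = rnn(x)             # internally iterates position by position
print(enc_out.shape, rnn_out.shape)
```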
3. Positional encoding
1) Unlike an RNN, self-attention carries no positional information of its own, so positional encoding must be added to the input.
2) Representing positions in binary would waste space. A positional encoding should satisfy three conditions: it should output a unique code for each position; across sentences of different lengths, the distance between any two positions should be consistent; and its values should be bounded. sin and cos are continuous and differentiable.
3) Formula: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A short sketch follows.
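A minimal implementation of the sinusoidal formula above (assuming PyTorch; the function name and sizes are illustrative):

```python
# Sketch (PyTorch assumed): standard sinusoidal positional encoding,
# PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...).
import math
import torch

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)      # [max_len, 1]
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                  # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions use sin
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions use cos
    return pe

pe = positional_encoding(50, 64)
print(pe.shape)                           # torch.Size([50, 64])
print(pe.min().item(), pe.max().item())   # values stay bounded in [-1, 1]
```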
4. Residual connections and Layer Normalization
1) Residual connection: b' = a + b, where b = Sublayer(a) and the sublayer is either the attention layer or the feed-forward layer.
2) Normalization: Layer Normalization is commonly used with RNNs, and the Transformer also uses LN.
3) The difference between BN and LN: BN normalizes each feature dimension across the batch so that it has mean 0 and variance 1; LN does not involve the batch dimension and normalizes the different feature dimensions of each sample to mean 0 and variance 1 (a small sketch follows below).
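A small sketch of the residual-plus-LN pattern and the different axes of BN and LN (assuming PyTorch; the linear layer merely stands in for attention or the feed-forward sublayer, and the shapes are toy values):

```python
# Sketch (PyTorch assumed): b' = LayerNorm(a + sublayer(a)), plus the
# contrast between the axes that LN and BN normalize over.
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 5, 8
a = torch.randn(batch, seq_len, d_model)

sublayer = nn.Linear(d_model, d_model)   # stands in for attention / feed-forward
norm = nn.LayerNorm(d_model)
b = norm(a + sublayer(a))                # residual add, then LN over the feature dim

# LN: statistics over the feature dimension of each position, batch ignored.
print(b.mean(dim=-1).abs().max())        # ~0 for every token independently

# BN (for contrast): statistics over the batch for each feature dimension.
bn = nn.BatchNorm1d(d_model)
b_bn = bn(a.reshape(-1, d_model))        # mean/var computed across samples per feature
print(b_bn.mean(dim=0).abs().max())      # ~0 per feature, across the batch
```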
5. Mask
1) padding mask
In softmax, a score of 0 still contributes, since exp(0) = 1, so the invalid (padded) positions would take part in the computation and cause hidden problems. A mask operation is therefore needed so that the invalid regions do not participate; a large negative offset is usually added to the invalid positions.
Tips: we usually compute with mini-batches, i.e. more than one sentence at a time, so the input x has shape [batch_size, seq_length], where seq_length is the sentence length. A mini-batch is made up of sentences of unequal length, so the shorter sentences must be completed up to the maximum sentence length in the mini-batch, usually by filling with 0. This process is called padding (a small masking sketch follows below).
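A minimal padding-mask sketch (assuming PyTorch; the token ids, pad_id and shapes are made up). In practice either -inf or a very large negative value such as -1e9 is used as the offset:

```python
# Sketch (PyTorch assumed): build a padding mask and apply it as a large
# negative offset before softmax, so padded positions get ~0 attention weight.
import torch
import torch.nn.functional as F

pad_id = 0
# A mini-batch of token ids, shorter sentences padded with 0 up to the max length.
x = torch.tensor([[5, 7, 9, 2, 0, 0],
                  [3, 8, 0, 0, 0, 0]])     # [batch_size, seq_length]
pad_mask = (x == pad_id)                   # True where the token is padding

scores = torch.randn(2, 6, 6)              # toy attention scores [batch, q_len, k_len]
scores = scores.masked_fill(pad_mask.unsqueeze(1), float('-inf'))
weights = F.softmax(scores, dim=-1)
print(weights[0, 0])                       # the last two (padded) columns are exactly 0
```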
2) sequence mask
The sequence mask keeps the model from seeing future information. For example, if the Encoder input is "machine learning", the Decoder receives the target sequence, and while predicting each target word it must only see the words before it.
The Transformer Decoder uses self-attention, which differs from the time-step-driven mechanism of the RNN in Seq2Seq (where at time t only the words up to time t can be seen); without a mask, all future words would be exposed to the Decoder. The mask drives the upper-triangular attention weights to 0, since softmax(-inf) = 0 (a small sketch follows below).
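A minimal sequence (look-ahead) mask sketch (assuming PyTorch; the sequence length is a toy value):

```python
# Sketch (PyTorch assumed): positions in the upper triangle are set to -inf,
# so softmax gives them weight 0 and each decoder position can only attend
# to itself and earlier positions.
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(seq_len, seq_len)     # decoder self-attention scores
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float('-inf'))   # hide future positions
weights = F.softmax(scores, dim=-1)
print(weights)   # lower-triangular: row t has non-zero weights only for positions <= t
```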
6. Self-attention & multi-head attention
1) Self-attention: according to where the attention is applied, attention can be divided into self-attention and soft attention.
2) multi-head attention
Multiple heads can focus on different points and attend to different aspects; the information from all heads is finally combined, which helps capture richer features (a small sketch follows below).
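A minimal multi-head attention sketch (assuming PyTorch's built-in nn.MultiheadAttention; d_model, the number of heads and the input shapes are toy values). Each head attends in its own subspace, and the outputs are concatenated and projected back to d_model:

```python
# Sketch (PyTorch assumed): multi-head self-attention splits d_model across
# several heads, lets each head attend to a different aspect, then merges
# and projects the results back to d_model.
import torch
import torch.nn as nn

d_model, num_heads = 64, 8
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(2, 10, d_model)   # self-attention: q = k = v = x
out, attn_weights = mha(x, x, x)
print(out.shape)            # torch.Size([2, 10, 64])  -- heads merged back to d_model
print(attn_weights.shape)   # torch.Size([2, 10, 10])  -- averaged over heads by default
```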
These are the principles of the Transformer and how it compares with the RNN encoder-decoder. I hope this is helpful to you!