The basic concept of Transformers


This article explains the basic concepts of Transformers. The explanation is kept simple and clear so that it is easy to learn and understand.

What is a Transformer?

The full architecture diagram looks scary, doesn't it? Wouldn't it be easier if I told you that all of it boils down to one formula?

Attention(Q, K, V) = ∑ Similarity(Q, K) * V

Yes, everything the complex architecture does is make sure this formula works properly. So what are these Q, K, and V? And what are the different types of attention? Let's study it in depth! We will take a bottom-up approach.
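To make that concrete before we dive in, here is a tiny NumPy sketch of the formula for a single query; the numbers and shapes are made up purely for illustration, not taken from any real model.

```python
import numpy as np

# Toy numbers: one query vector and three key/value pairs, all made up.
q = np.array([1.0, 0.0])                              # query for the word we care about
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])    # a key for every input word
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])    # a value for every input word

scores = K @ q                                   # Similarity(Q, K) as dot products
weights = np.exp(scores) / np.exp(scores).sum()  # normalize so the weights sum to 1
output = weights @ V                             # the weighted sum: ∑ Similarity(Q, K) * V

print(weights)   # how much each input word matters for this query
print(output)    # the new representation of the queried word
```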

Input / output embedding

These can be Word2Vec, GloVe, FastText, or any other type of word embedding that converts text into some kind of meaningful vector. (Note: such word embeddings carry no context; there is only one fixed embedding per word.)
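As a quick illustration of "one fixed embedding per word", here is a minimal sketch with made-up vectors (real Word2Vec/GloVe/FastText vectors are learned from large corpora and much higher-dimensional):

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, invented only for illustration.
embedding_table = {
    "bank":  np.array([0.2, -0.1, 0.7,  0.05]),
    "river": np.array([0.1,  0.3, 0.6, -0.20]),
    "money": np.array([0.5, -0.4, 0.1,  0.30]),
}

def embed(tokens):
    # One fixed vector per word: "bank" gets the same embedding whether the
    # sentence is about rivers or about money, because there is no context.
    return np.stack([embedding_table[t] for t in tokens])

print(embed(["river", "bank"]))
print(embed(["money", "bank"]))   # the "bank" row is identical in both outputs
```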

Positional encoding (PE):

In RNNs (LSTM, GRU), the notion of a time step is built in, because the inputs/outputs are processed one at a time. For the Transformer, the authors encode time as sinusoidal signals that are added to the input and output embeddings to indicate the position of each token.

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), where pos is the position of the word and i is the dimension index of the vector. In other words, each dimension of the PE corresponds to a sinusoid, and the wavelengths form a geometric progression from 2π to 10000·2π. For even dimensions (2i) we use sine and for odd dimensions (2i+1) we use cosine. In this way we can give each position of the input sequence a distinct encoding, which lets us feed the whole input in parallel. This blog (https://kazemnejad.com/blog/transformer_architecture_positional_encoding/) explains the math behind PE very well.
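Here is a minimal NumPy sketch of this sinusoidal PE, following the standard formulation; max_len and d_model are arbitrary choices for the example.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000**(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
# The PE is simply added to the word embeddings of the same shape.
```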

However, more recent architectures use learned PE rather than the sinusoidal PE that can extrapolate to sequences of arbitrary length, and it works very well. In other words, these models do not need to generalize to sequences longer than those seen in training, because their input size is fixed (for example, BERT's 512 tokens), so they will never see a longer sequence at test time.

Types of attention

Encoder self-attention

This is bidirectional attention (and the only bidirectional attention mechanism here, which is why it is the only type of attention used in BERT), where every word attends to every other word. It captures truly bidirectional context in a sentence, something even a bi-LSTM cannot capture (a bi-LSTM concatenates the results of a forward autoregressive model and a backward autoregressive model rather than producing bidirectional context at its core; this is why some people argue that ELMo embeddings are not really bidirectional).

The main purpose of this attention is to produce, for each word, a representation weighted by how important every other word in the input is to it in context.

Decoder self-attention

The decoder in a Transformer is essentially autoregressive: when making predictions, each word in the output attends to all of its previous words but not to any future words (AR can also run in reverse, i.e. given future words, predict the previous one). If output words were allowed to attend to future words, it would lead to data leakage and the model would not learn anything useful.

Encoder-decoder attention (cross-attention, not self-attention):

The purpose of this attention is to relate every word in the input to the current output word. Basically, what we are trying to find here is the influence of each input word on the current output word.

This is done by taking the queries from the previous decoder layer and the keys and values from the encoder output. (The Query represents the word under consideration; the Key represents all the words and is used to compute the weight of each word relative to the word under consideration; the Value also represents all the words, but is used to compute the final weighted sum.)
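A rough sketch of that wiring (shapes, names, and random weights are mine for illustration, not from the article): queries come from the decoder side, while keys and values come from the encoder side.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model, d_k = 512, 64
rng = np.random.default_rng(0)
enc_out    = rng.standard_normal((10, d_model))  # encoder output: 10 input words
dec_hidden = rng.standard_normal((7,  d_model))  # previous decoder layer: 7 output words so far

# Hypothetical projection matrices (learned in a real model).
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))

Q = dec_hidden @ W_q    # queries: the output words under consideration
K = enc_out    @ W_k    # keys: all input words, used to weight them
V = enc_out    @ W_v    # values: all input words, used for the weighted sum

weights = softmax(Q @ K.T / np.sqrt(d_k))   # (7, 10): each output word over every input word
context = weights @ V                       # (7, 64): the cross-attention output
```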


Query (Q), key (K) and value (V)

The concepts of query, key and value come from retrieval systems. For example, when you type a query to search for videos on YouTube, the search engine matches your query against a set of keys (video title, description, etc.) associated with candidate videos in the database, and then shows you the best-matching videos (the values).

Q, K, and V are basically linear layers on top of the original word embedding that reduce its dimensionality (why reduce it? I will discuss the reason later). In effect, we project the original word embedding into three different (possibly identical) lower-dimensional spaces.
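In code this is nothing more than three matrix multiplications; the dimensions below (512 down to 64) are just example choices, not prescribed by the article.

```python
import numpy as np

d_model, d_k = 512, 64                    # original embedding size and the reduced Q/K/V size
rng = np.random.default_rng(0)
X = rng.standard_normal((5, d_model))     # embeddings (+ positional encodings) of 5 words

# Three separate learned linear layers; random here purely for illustration.
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # each is (5, 64): the same words in three projections
```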

Basically, think of it this way: whenever we need the similarity between two vectors, we simply take their dot product. To find the output for the first word, we take the Q representation of the first word and compute its dot product with the K representation of every word in the input. This tells us how each word in the input relates to the first word.

After taking the dot products, we divide the result by sqrt(d_k), where d_k is the dimension of the K vectors. This is done to stabilize the gradients, because the dot products can become very large.

We then apply softmax to normalize these values, because they are now treated as the weights of each word relative to the first word.

Remember what I said at the beginning of the post, that Transformers boil down to Attention(Q, K, V) = ∑ Similarity(Q, K) * V? Well, we have now completed the Similarity(Q, K) part of the equation: we have a distribution that describes the importance of each word in the input relative to the first word.

To complete the equation, we multiply each weight (from the softmax) by the corresponding V representation and add them up. Our final representation of the first word is therefore a weighted sum over all inputs, where each input word is weighted by its similarity (importance) relative to the first word.

We repeat this process for all the words. In matrix form, this can be written as Attention(Q, K, V) = softmax(QKᵀ / √d_k) V.
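Putting the last few steps together, here is a minimal, unbatched NumPy sketch of scaled dot-product attention, assuming Q, K, V have already been projected as described above:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity of every query with every key
    if mask is not None:                       # mask is True where attention is allowed
        scores = np.where(mask, scores, -1e9)  # block disallowed positions with a huge negative
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights                # weighted sum of values, plus the weights

# Example with 5 words and Q/K/V of width 64 (shapes are illustrative only).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 64)) for _ in range(3))
output, attn_weights = scaled_dot_product_attention(Q, K, V)
```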

(I will talk about the optional mask later; it appears only in the decoder.)

Multi-head attention

Until now we have only discussed single-head attention. A single head can focus on one particular group of words. What if we want multiple heads, each paying attention to a different set of words? (This is somewhat like ensembling: several similar models, each learning something different.) Once we have the outputs of multiple scaled dot-product attention heads, we concatenate them and multiply by a weight matrix (so each head can be weighted according to its importance) to produce the final output of the self-attention layer.

One question remains unanswered: why do Q, K, and V need to be reduced in dimension, even though this may lose some information about the original word? The answer is multi-head self-attention. Suppose the Word2Vec input embedding is (1 x 512) and we have eight attention heads. Then we keep the dimension of Q, K, and V at 1 x (512 / 8), i.e. 1 x 64. This way we get multi-head attention without any extra computational cost. The model now learns 24 different weight matrices (8 heads x 3 projections) instead of just three.
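A rough sketch of that wiring with the same made-up numbers (512-dimensional embeddings, 8 heads of size 64); a real implementation batches this and learns all the projections.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)
    return weights @ V

d_model, n_heads = 512, 8
d_head = d_model // n_heads               # 64: each head works in a smaller subspace
rng = np.random.default_rng(0)
X = rng.standard_normal((5, d_model))     # representations of 5 words

# 8 heads x 3 projections = 24 weight matrices, plus a final output projection W_o.
head_outputs = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    head_outputs.append(attention(X @ W_q, X @ W_k, X @ W_v))   # (5, 64) per head

W_o = rng.standard_normal((d_model, d_model))
multi_head_output = np.concatenate(head_outputs, axis=-1) @ W_o  # (5, 512)
```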

Masked self-attention (decoder only):

The Transformer decoder is essentially autoregressive; if we let it look at all the words during self-attention, it would learn nothing. To avoid this, we mask out the future words in the sequence while computing self-attention.

Once we have computed the scaled scores for all the words in the sequence, we apply the "look-ahead" mask to obtain the masked scores.

Now, when we take the softmax of the masked scores, the negative infinities become zero, leaving attention scores of zero for all future tokens in the sequence.
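A small sketch of the look-ahead mask and its effect on the softmax (the scores here are random and only for illustration):

```python
import numpy as np

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.standard_normal((seq_len, seq_len))        # scaled scores for a 4-token sequence

# Look-ahead mask: position i may attend only to positions 0..i (the lower triangle).
allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
masked_scores = np.where(allowed, scores, -np.inf)      # future positions get -infinity

weights = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True) # softmax: the -inf entries become 0

print(np.round(weights, 2))   # upper triangle (future tokens) is all zeros
```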

Summary (6 simple points):

From the walkthrough above we are now familiar with all the building blocks of the Transformer, so it's time to put them together! Well done getting this far. :)

1. Add the embeddings of all the words in the input sequence to their respective positional encodings to get the final input to our Transformer.

2. Transformer is a Seq2Seq model, so it consists of an encoder and a decoder. The encoder consists of N identical layers (N = 6 in the original paper). Each layer contains the following components:

Multi-head self-attention layer (encoder): takes the input vector of each word and converts it into a representation that encodes how each word should attend to all the other words in the sequence.

Add & norm: applied to the output of both the multi-head self-attention layer and the position-wise feed-forward network. It consists of a residual connection (so the gradient keeps flowing and does not get stuck) and a layer-normalization step (to keep the values from drifting too much, which speeds up training and acts as a regularizer).

Position-wise feed-forward layer: applied to each word vector separately and identically. It consists of two linear transformations with a ReLU activation in between. (A simplified sketch of one full encoder layer appears after this summary.)

3. After computing the output of all N encoder layers, the final output is turned into (key, value) pairs that are passed to each encoder-decoder attention block in the decoder. This completes the encoder part of our Transformer.

4. Because the decoder is autoregressive, it takes the list of previous outputs as input. These tokens are converted to word embeddings and added to their respective positional encodings to get the final input to the decoder.

5. The decoder also contains N identical layers (N = 6 in the original paper). Each layer contains the following components:

Masked multi-head self-attention layer (decoder): generates a representation for each position in the decoder that encodes all decoder positions up to and including that position. We need to prevent leftward information flow in the decoder to preserve the autoregressive property.

Multi-head cross-attention layer (encoder-decoder): this is the part of the Transformer where input and output words are mapped to each other. The keys and values come from the encoder output, the queries come from the previous decoder layer, and cross-attention is computed between them.

Add & norm: same as in the encoder.

Position-wise feed-forward layer: same as in the encoder.

6. After computing the output of all N decoder layers, the output passes through a linear layer that acts as a classifier whose size equals the vocabulary size. It is then fed into a softmax layer to obtain a probability distribution over the vocabulary. We take the index with the highest probability, and the word at that index is our predicted word.
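To tie the pieces of point 2 together, here is a highly simplified sketch of a single encoder layer (single-head attention, random weights, no dropout or batching; a real implementation uses multi-head attention and learned parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    return (weights / weights.sum(-1, keepdims=True)) @ V

def encoder_layer(x, p):
    # Self-attention sub-layer, then the residual connection and layer norm.
    attn_out = attention(x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]) @ p["W_o"]
    x = layer_norm(x + attn_out)
    # Position-wise feed-forward: two linear maps with a ReLU in between, then add & norm.
    hidden = np.maximum(0.0, x @ p["W_1"] + p["b_1"])
    return layer_norm(x + hidden @ p["W_2"] + p["b_2"])

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
p = {name: rng.standard_normal(shape) * 0.02 for name, shape in [
    ("W_q", (d_model, d_model)), ("W_k", (d_model, d_model)),
    ("W_v", (d_model, d_model)), ("W_o", (d_model, d_model)),
    ("W_1", (d_model, d_ff)), ("W_2", (d_ff, d_model)),
]}
p["b_1"], p["b_2"] = np.zeros(d_ff), np.zeros(d_model)

x = rng.standard_normal((5, d_model))   # embeddings + positional encodings of 5 tokens
out = encoder_layer(x, p)               # shape (5, 512), ready to feed into the next layer
```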

Shortcomings of Transformer

All good things have their downsides, and the Transformer is no exception.

Needless to say, Transformers are very large models, so they require a lot of compute and a lot of data to train. (Compared with the standard Transformer, the Reformer is more memory-efficient and faster: it essentially replaces dot-product attention with locality-sensitive hashing (LSH) attention, and uses reversible residual layers instead of standard residuals.)

For hierarchically structured data and tasks such as parsing, RNNs still seem to do better than Transformers. Some related work can be found in this paper (https://www.aclweb.org/anthology/D18-1503/).

Transformers for images

An image is not a sequence. However, an image can be interpreted as a sequence of patches and then processed by a Transformer encoder. If the image is split into small patches and the sequence of linear embeddings of these patches is provided, it can be used directly as input to a Transformer encoder. The image patches are treated the same way as tokens (words) in NLP tasks. This approach can replace the feature-extraction stage of the widely used CNN-based image-processing pipelines; the Vision Transformer is based on this idea.
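A minimal sketch of the "image as a sequence of patches" idea (the patch size, dimensions, and random projection are arbitrary; a real Vision Transformer adds a class token, positional embeddings, and learned weights):

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size=16, d_model=512):
    """Split an (H, W, C) image into patches and linearly embed each flattened patch."""
    H, W, C = image.shape
    patches = []
    for top in range(0, H, patch_size):
        for left in range(0, W, patch_size):
            patch = image[top:top + patch_size, left:left + patch_size, :]
            patches.append(patch.reshape(-1))        # flatten to patch_size * patch_size * C
    patches = np.stack(patches)                      # (num_patches, patch_size^2 * C)
    rng = np.random.default_rng(0)
    W_embed = rng.standard_normal((patches.shape[1], d_model)) * 0.02  # learned in practice
    return patches @ W_embed                         # (num_patches, d_model): "word-like" tokens

image = np.random.rand(224, 224, 3)        # a dummy RGB image
tokens = image_to_patch_embeddings(image)  # 14 x 14 = 196 patch tokens of width 512
print(tokens.shape)                        # (196, 512)
```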

Thank you for reading. That covers the basic concepts of Transformers; I believe you now have a deeper understanding of them, and the best way to consolidate it is to apply them in practice.
