
How to analyze and apply Reformer

2025-01-18 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

Many readers new to the topic are unsure how to analyze and apply Reformer. This article walks through the ideas behind the model and how they are applied, and should help answer those questions.

Introduction

The latest developments from Google AI.

Understanding sequential data, such as language, music, or video, is a challenging task, especially when it depends on a large amount of surrounding context. For example, if a person or an object disappears from view in a video and reappears much later, many models will forget what it looked like. In the language domain, long short-term memory (LSTM) neural networks cover enough context to translate sentence by sentence; here, the context window (the range of data considered during translation) spans from dozens of words to about a hundred. The more recent Transformer model not only improves sentence-by-sentence translation but can also generate entire Wikipedia articles through multi-document summarization. This is possible because the context window used by Transformer can be extended to thousands of words. With such a large context window, Transformer can be applied beyond text, for example to pixels or musical notes, enabling it to generate images and music.

However, extending Transformer to even larger context windows runs into limitations. Transformer's power comes from attention, which considers all possible pairs of words within the context window to understand the connections between them. For a text of 100K words, this would require evaluating 100K x 100K word pairs, or 10 billion pairs at every step, which is impractical. Another problem is the standard practice of storing the output of every model layer. For applications with large context windows, the memory required to store the outputs of multiple layers quickly becomes prohibitive (from gigabytes with a few layers to terabytes with models of thousands of layers). This means that practical Transformer models with many layers can only be used on a few paragraphs of text or to generate short pieces of music.
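
To make the scale of the problem concrete, the following short Python sketch works through the arithmetic quoted above. The 100K-word window comes from the text; the float32 storage assumption is ours, purely for illustration.

seq_len = 100_000            # a 100K-word context window
pairs = seq_len * seq_len    # attention compares every word with every other word
print(f"{pairs:,} word pairs per step")                        # 10,000,000,000

# Storing one full attention matrix in float32 (4 bytes per score),
# for a single head and a single example:
attention_matrix_gb = pairs * 4 / 1e9
print(f"~{attention_matrix_gb:.0f} GB per attention matrix")   # ~40 GB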

Today we introduce Reformer, a Transformer model designed to handle context windows of up to 1 million words, all on a single accelerator and using only 16 GB of memory. It combines two key techniques to address the attention and memory allocation problems that limit the use of long context windows in Transformer: Reformer uses locality-sensitive hashing (LSH) to reduce the complexity of attending over long sequences, and reversible residual layers to use the available memory more efficiently.

The attention problem

The first challenge when applying a Transformer model to a very large text sequence is how to handle the attention layer. LSH addresses this by computing a hash function that matches similar vectors together, instead of searching over all possible pairs of vectors. For example, in a translation task, each vector from the first layer of the network represents a word (with even larger contexts in subsequent layers), and vectors corresponding to the same word in different languages may receive the same hash. In the figure below, different colors depict different hashes, and similar words share the same color. Once the hashes are assigned, the sequence is rearranged so that elements with the same hash come together, and then divided into segments (or chunks) to enable parallel processing. Attention is then applied within these much shorter chunks (and their adjacent chunks, to cover any overflow), greatly reducing the computational load.

Locality-sensitive hashing: Reformer takes in an input sequence of keys, where each key is a vector representing an individual word (or pixel, in the case of images) in the first layer and a larger context in subsequent layers. LSH is applied to the sequence, after which the keys are sorted by their hash and chunked. Attention is applied only within a single chunk and its immediate neighbors.
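
As a rough illustration of the bucketing-and-chunking idea described above, here is a minimal NumPy sketch of angular LSH followed by sorting and chunking. It is a simplification for exposition only, not Reformer's actual implementation, which uses multiple hash rounds, shared query/key projections, and attention across neighboring chunks.

import numpy as np

def lsh_buckets(vectors, n_buckets, rng):
    # Assign each vector a bucket via a random rotation (angular LSH):
    # vectors pointing in similar directions tend to share a bucket.
    d = vectors.shape[-1]
    rotations = rng.normal(size=(d, n_buckets // 2))
    rotated = vectors @ rotations                             # (seq_len, n_buckets // 2)
    rotated = np.concatenate([rotated, -rotated], axis=-1)    # (seq_len, n_buckets)
    return np.argmax(rotated, axis=-1)                        # bucket id per position

def chunk_by_bucket(keys, n_buckets, chunk_size, rng):
    # Sort positions by bucket, then split into fixed-size chunks.
    # Attention would then be computed only within a chunk (and its
    # neighbors), instead of over all seq_len x seq_len pairs.
    buckets = lsh_buckets(keys, n_buckets, rng)
    order = np.argsort(buckets, kind="stable")
    return [order[i:i + chunk_size] for i in range(0, len(order), chunk_size)]

# Toy usage: 16 key vectors of dimension 64, 8 buckets, chunks of 4 positions.
rng = np.random.default_rng(0)
keys = rng.normal(size=(16, 64))
print(chunk_by_bucket(keys, n_buckets=8, chunk_size=4, rng=rng))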

The memory problem

Although LSH solves the attention problem, a memory problem remains. A single layer of the network can require several GB of memory and usually fits on a single GPU, so even a model processing long sequences could be executed if it had only one layer. But when training a multi-layer model with gradient descent, the activations of every layer must be saved for use in the backward pass. A typical Transformer model has a dozen or more layers, so memory quickly runs out if it is used to cache the values from each of those layers.
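
The following sketch gives a rough sense of the scale, using illustrative model dimensions of our own choosing (Reformer's actual configurations differ).

seq_len  = 1_000_000   # the 1M-word context window targeted by Reformer
d_model  = 1024        # hidden size per position (illustrative assumption)
per_layer_gb = seq_len * d_model * 4 / 1e9     # float32 activations
print(f"~{per_layer_gb:.1f} GB of activations per layer")               # ~4.1 GB
for n_layers in (12, 1000):
    print(f"~{per_layer_gb * n_layers:,.0f} GB cached across {n_layers} layers")
    # ~49 GB for 12 layers, ~4,096 GB (terabytes) for 1,000 layers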

The second novel technique implemented in Reformer is to recompute the input of each layer on demand during backpropagation, rather than storing it in memory. This is accomplished with reversible layers, where the activations from the last layer of the network are used to recover the activations of any intermediate layer, which amounts to running the network in reverse. In a typical residual network, each layer in the stack keeps adding to the vectors that flow through the network. Reversible layers, in contrast, maintain two sets of activations for each layer. One follows the standard procedure just described and is progressively updated from one layer to the next, but the other captures only the changes to the first. Thus, to run the network in reverse, one simply subtracts the activations applied at each layer.

Reversible layers: (a) In a standard residual network, the activations from each layer are used to update the inputs to the next layer. (b) In a reversible network, two sets of activations are maintained, and only one of them is updated after each layer. (c) This approach makes it possible to run the network in reverse to recover all intermediate values.
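
The minimal NumPy sketch below shows the reversible residual idea in isolation. F and G stand in for the attention and feed-forward sublayers; here they are arbitrary element-wise functions chosen only to keep the example self-contained, so this is an illustration of the technique rather than Reformer's actual implementation.

import numpy as np

class ReversibleBlock:
    # Forward:  y1 = x1 + F(x2)        Inverse:  x2 = y2 - G(y1)
    #           y2 = x2 + G(y1)                  x1 = y1 - F(x2)

    def __init__(self, f, g):
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Recompute the inputs from the outputs -- no need to cache x1, x2.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# Toy check that the inverse really recovers the inputs.
rng = np.random.default_rng(0)
block = ReversibleBlock(f=np.tanh, g=np.sin)   # placeholder sublayers
x1, x2 = rng.normal(size=(2, 8, 16))
y1, y2 = block.forward(x1, x2)
r1, r2 = block.inverse(y1, y2)
print(np.allclose(x1, r1), np.allclose(x2, r2))   # True True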

Applications of Reformer

Applying these two novel techniques makes Reformer highly efficient, allowing it to process text sequences of up to 1 million words on a single GPU using only 16 GB of memory. Because Reformer is so efficient, it can be applied directly to data with context windows much larger than virtually all current state-of-the-art text-domain datasets. Perhaps Reformer's ability to handle such large datasets will encourage the community to create them.

One area where there is no shortage of large-context data is image generation, so we experimented with Reformer on images. Below is an example of how Reformer can be used to "complete" partial images: starting from the image fragments shown in the top row of the figure below, Reformer generates full-frame images pixel by pixel (bottom row).

Top: image fragments used as input to Reformer. Bottom: the "completed" full-frame images. The original images are from the Imagenet64 dataset.

While Reformer shows great potential for image and video tasks, its application to text is even more exciting. Reformer can process entire novels, all at once and on a single device. In the future, as more datasets with long texts become available for training, techniques such as Reformer may make it possible to generate long, coherent text.

After reading the above, have you grasped how to analyze and apply Reformer? If you want to learn more, you are welcome to follow the industry information channel. Thank you for reading!
