Introduction: Machine translation and natural language processing have long relied on mathematical and statistical methods for analysis and processing. In recent years, with the rise of AlphaGo, deep learning has found great use beyond game AI, in computer vision, machine translation, and natural language processing. In 2014, as deep learning developed further, the seq2seq training and translation paradigm entered people's field of vision. End-to-end training, moreover, requires not only massive business data but also important extra modules in the network structure, and this is where attention mechanisms based on recurrent neural networks come into view. Beyond machine translation and natural language processing, the attention mechanism in computer vision is also very interesting, and this article briefly introduces attention methods in computer vision. A note in advance: the author does not work in these fields, so the article may contain gaps in understanding; readers are asked to point out any shortcomings.
Attention Mechanism
As the name suggests, the attention mechanism is essentially designed to mimic the way humans view objects. Generally speaking, when people look at a picture, beyond grasping it as a whole, they pay more attention to certain local information, such as the location of a table or the type of goods on display. In translation, people usually work sentence by sentence, but while reading a whole sentence they attend both to the information of each word itself and to the relationships between the words and their context. In natural language processing, if sentiment classification is to be carried out, certain words in a sentence will carry the emotion, including but not limited to keywords such as "happy" and "depressed." The other words in the sentence provide context; they are not useless, but they play a smaller role than the emotional keywords.
In the above description, the attention mechanism actually consists of two parts (a minimal sketch follows this list):
Deciding which part of the whole input deserves more attention;
Extracting features from those key parts to obtain the important information.
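As a minimal sketch of these two steps, assuming PyTorch: a small scoring layer decides which parts of the input matter (step 1), and a weighted sum extracts a summary from the key parts (step 2). The module name, dimensions, and scoring form are illustrative assumptions, not anything specified above.

```python
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # step 1: score each part of the input

    def forward(self, features):        # features: (batch, n_parts, dim)
        weights = torch.softmax(self.score(features), dim=1)  # which parts matter
        context = (weights * features).sum(dim=1)             # step 2: weighted summary
        return context, weights

att = SimpleAttention(dim=64)
context, weights = att(torch.randn(2, 10, 64))  # 2 samples, 10 parts each
```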
Generally speaking, in machine translation and natural language processing, people read and understand a sentence or paragraph in a certain order, following the grammatical rules of the language. In image classification, people likewise look at an image whole-then-part or part-then-whole, and when examining local details, especially handwritten phone numbers, house numbers, and the like, they do so in sequence. To simulate this mode of thinking and understanding in the human brain, recurrent neural networks (RNNs), which have a unique advantage on problems with obvious sequential structure, are where the attention mechanism is usually applied.
Although the attention mechanisms of machine translation, natural language processing, and computer vision sound similar by the above description, on closer examination the three differ in obvious ways.
In machine translation, a translator needs to render an existing sentence into another language, for example from English to Chinese, or from Chinese to French. In this case, the word order of the input and output languages is relatively fixed and follows certain grammatical rules.
In video classification or emotion recognition, a video is a sequence of timestamped segments; the input is the key segments of the video, that is, an ordered series of frames. The same is true of emotion recognition in NLP, since language itself is sequential.
Image recognition and object detection are fundamentally different from the first two. Object detection essentially mines the structure or position of objects within a picture, so the input is a single image with no obvious order; and from the perspective of the human brain, individual differences make it hard to find a universal way of observing pictures. Since everyone observes in their own order, the process is difficult to unify.
In this case, applying RNN-based attention mechanisms to machine translation and natural language processing is relatively natural, while applying attention mechanisms to computer vision requires the necessary modifications.
Attention Mechanism Based on RNN
In general, deep neural networks such as RNNs can be trained and run end to end, which gives them a unique advantage in machine translation and text recognition. An end-to-end RNN of this kind has a simpler name, sequence to sequence, abbreviated seq2seq. As the name implies, the input is one sentence and the output is another; the layers in between carry out two steps, encoding and decoding.
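As a rough illustration of this encoder-decoder split, here is a bare-bones seq2seq sketch, assuming PyTorch; the class names, GRU cells, and hidden size of 128 are all illustrative choices rather than anything specified in the text.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab, hid=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, hid)
        self.rnn = nn.GRU(hid, hid, batch_first=True)

    def forward(self, src):                # src: (batch, src_len) of token ids
        out, h = self.rnn(self.emb(src))   # h summarizes the whole input sentence
        return out, h

class Decoder(nn.Module):
    def __init__(self, vocab, hid=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, hid)
        self.rnn = nn.GRU(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, tgt, h):             # decode conditioned on the encoder state
        out, h = self.rnn(self.emb(tgt), h)
        return self.out(out), h            # per-step scores over the output vocabulary
```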
The RNN-based attention mechanism addresses the fact that, in many seq2seq problems, there are implicit connections between the input and the output, that is, between words (items) on each side. For example, when translating between Chinese and English, the source word meaning "China" aligns with the output word "China," and the word meaning "excellent" aligns with "excellent." Each time the model translates, it needs to know which word or words it is currently focusing on, so that it can make the necessary refinements within the whole sentence. With these preliminary considerations in mind, the RNN-based attention mechanism works as follows:
Build a nonlinear encoder-decoder model whose neural-network parameters can store enough information.
Besides attending to the overall information of the sentence, give different words different weights each time the next word is translated; decoding can then consider global and local information at the same time (see the sketch after this list).
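The per-step weighting in the second point can be sketched as follows, in the spirit of additive (Bahdanau-style) attention; the dimensions, layer names, and scoring form are assumptions for illustration, not the only way to do it.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hid):
        super().__init__()
        self.W = nn.Linear(hid * 2, hid)
        self.v = nn.Linear(hid, 1)

    def forward(self, dec_state, enc_outs):
        # dec_state: (batch, hid); enc_outs: (batch, src_len, hid)
        q = dec_state.unsqueeze(1).expand(-1, enc_outs.size(1), -1)
        scores = self.v(torch.tanh(self.W(torch.cat([q, enc_outs], dim=-1))))
        alpha = torch.softmax(scores, dim=1)     # one weight per source word
        context = (alpha * enc_outs).sum(dim=1)  # local info blended by the weights
        return context, alpha
```

At each decoding step the context vector carries the locally weighted information, while the decoder state carries the global sentence information, matching the two considerations above.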
Types of Attention Mechanisms
According to preliminary research, there are two ways to build attention mechanisms: one based on reinforcement learning and the other based on gradient descent. The reinforcement-learning variant is driven by a reward function that pushes the model to attend more closely to some local detail. The gradient-descent variant is trained through an objective function and the corresponding optimizer. Whether in NLP or CV, both routes are worth considering when adding an attention mechanism.
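A toy contrast of the two routes, assuming PyTorch: soft attention is a differentiable softmax weighting that gradient descent can train directly, while hard attention samples a discrete location, whose log-probability would be scaled by a reward in REINFORCE-style training. All names and shapes here are illustrative.

```python
import torch

scores = torch.randn(1, 5, requires_grad=True)   # scores for 5 candidate locations

# Soft attention: a differentiable weighting; gradients flow through it.
soft_weights = torch.softmax(scores, dim=1)

# Hard attention: a discrete choice; gradients do not flow through the sample,
# so training would weight log_prob by a reward signal instead.
dist = torch.distributions.Categorical(probs=soft_weights)
pick = dist.sample()
log_prob = dist.log_prob(pick)
```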
Attention in Computer Vision
The following briefly introduces a few recent papers on attention mechanisms in the field of computer vision.
Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition
In image recognition, one commonly needs to classify birds in images, identifying species, attributes, and so on. To distinguish different birds, beyond grasping the picture as a whole, more attention must go to local information, namely the bird's appearance: head, body, feet, color, and so on. Peripheral information such as flowers and grass is less important and serves only as reference; different birds may perch on trees or grass, so attending to the trees and grass is not critical to recognizing the bird. Introducing an attention mechanism, so that the deep learning model pays more attention to certain local information, is therefore a key technique in image recognition.
In this paper, the authors propose a CNN-based attention mechanism called the recurrent attention convolutional neural network (RA-CNN), which recursively analyzes local information and extracts the necessary features from it. Each sub-network in RA-CNN also contains a classification structure, so the pictures of the different regions each yield a probability over bird species. The attention mechanism is introduced through what the paper calls the Attention Proposal Sub-Network (APN), so that the whole network attends to local as well as overall information. The APN iteratively generates sub-regions from the full image, makes the necessary predictions on those sub-regions, and integrates the sub-region predictions into the classification probability for the whole image.
RA-CNN is characterized by end-to-end optimization: it can recognize birds and classify images without boxes or regions being labeled in advance. For datasets, the paper runs experiments not only on the CUB Birds dataset but also on Stanford Dogs and Stanford Cars, achieving good results.
In terms of network structure, RA-CNN takes the full image as input and outputs classification probabilities. Image features are extracted with a convolutional neural network (CNN), and the attention mechanism is added on top of that structure. In the paper's architecture diagram, the whole picture enters at the first scale and a classification probability is produced; an intermediate layer then outputs a coordinate pair and a size, where the coordinates mark the center point of a sub-image and the size gives its extent. The next sub-image is the one cropped from these coordinate values and size, a second network is built on it, and the image is iteratively enlarged so the network focuses on certain key positions. Images at the different scales output separate classification probabilities, which are then fused into the bird-recognition probability for the whole image.
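A simplified sketch of that zoom step, assuming PyTorch: the APN emits a center (tx, ty) and a half-size tl, and the sub-image is cropped and scaled back up for the next, finer-grained network. The real RA-CNN makes this step differentiable with a boxcar attention mask; the plain tensor slicing, pixel units, and lack of boundary clamping here are simplifications.

```python
import torch
import torch.nn.functional as F

def crop_and_zoom(img, tx, ty, tl, out_size=224):
    # img: (batch, C, H, W); tx, ty, tl in pixel units (illustrative)
    x0, x1 = int(tx - tl), int(tx + tl)
    y0, y1 = int(ty - tl), int(ty + tl)
    patch = img[:, :, y0:y1, x0:x1]            # crop the attended sub-region
    return F.interpolate(patch, size=(out_size, out_size),
                         mode='bilinear', align_corners=False)  # zoom back up
```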
Therefore, throughout the paper, there are several key points to note:
The calculation of classification probability, that is, the design of the final loss function;
How the coordinate values and size are produced to map from one picture to the next.
Once these components are in place, the entire RA-CNN network can be built.
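On the loss-design point, the paper combines per-scale classification losses with an inter-scale pairwise ranking loss that pushes the finer scale to be more confident about the true class than the coarser one. Below is a hedged reconstruction of that ranking term, assuming PyTorch; the margin value is an assumption.

```python
import torch

def pairwise_rank_loss(p_coarse, p_fine, margin=0.05):
    # p_coarse, p_fine: predicted probabilities of the true class at two
    # adjacent scales; the hinge penalizes the finer scale being no better.
    return torch.clamp(p_coarse - p_fine + margin, min=0).mean()
```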
(Figure omitted: experimental results of RA-CNN on CUB Birds, Stanford Dogs, and Stanford Cars.)
Multiple Granularity Descriptors for Fine-grained Categorization
This paper also classifies birds, but unlike RA-CNN it uses a hierarchical structure, because bird classification follows a certain hierarchy, roughly three levels: family -> genus -> species.
Therefore, the network design needs parallel structures corresponding to family, genus, and species. From front to back, the pipeline runs Detection Network, Region Discovery, Description Network; the parallel branches are Family-grained CNN + Family-grained Descriptor, Genus-grained CNN + Genus-grained Descriptor, and Species-grained CNN + Species-grained Descriptor. In region discovery, the authors use an energy-based idea, which lets the neural network focus on different parts of the image and ultimately predict the bird.
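A minimal sketch of the parallel idea, assuming PyTorch: one shared feature vector feeds three granularity heads. The class counts below are placeholders, and the real model uses separate detection and description networks per granularity rather than simple linear heads.

```python
import torch
import torch.nn as nn

class MultiGranularityHeads(nn.Module):
    def __init__(self, feat_dim, n_family, n_genus, n_species):
        super().__init__()
        self.family = nn.Linear(feat_dim, n_family)    # coarsest level
        self.genus = nn.Linear(feat_dim, n_genus)
        self.species = nn.Linear(feat_dim, n_species)  # finest level

    def forward(self, feat):  # feat: (batch, feat_dim) from a CNN backbone
        return self.family(feat), self.genus(feat), self.species(feat)

heads = MultiGranularityHeads(512, 30, 120, 200)       # placeholder counts
fam, gen, spe = heads(torch.randn(4, 512))
```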
Recurrent Models of Visual Attention
DeepMind's paper Recurrent Models of Visual Attention, published in 2014, introduced attention mechanisms into computer vision. In it, the authors use an attention mechanism based on reinforcement learning and train the model with a reward function. In terms of network structure, the model not only views the picture as a whole but also extracts the necessary information from local regions.
Overall, the network structure is an RNN, with the information and coordinates obtained at one stage passed on to the next. Unlike RA-CNN, this network makes a probability judgment only at the final classification step. The design simulates how humans see objects: not by attending to the whole picture all the time, but by scanning the image in some underlying order. Recurrent Models of Visual Attention essentially feeds the image in as a time series, processing part of the original image at each step and, based on past information and the task, choosing the next suitable position to process. This removes the need for prior location labels and object positioning.
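Here is a schematic of that glimpse loop, assuming PyTorch; the module names, patch size, affine-grid cropping, and final-step-only classifier are illustrative stand-ins for the paper's glimpse network, core RNN, and location network, and the reward-driven training of the location network is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRAM(nn.Module):
    """Schematic recurrent attention model: glimpse -> RNN -> next location."""
    def __init__(self, patch=8, hid=128, n_classes=10, n_steps=6):
        super().__init__()
        self.patch, self.n_steps = patch, n_steps
        self.glimpse_fc = nn.Linear(patch * patch + 2, hid)
        self.core = nn.GRUCell(hid, hid)
        self.loc_net = nn.Linear(hid, 2)       # emits the next (x, y) in [-1, 1]
        self.classifier = nn.Linear(hid, n_classes)

    def glimpse(self, img, loc):
        # img: (batch, 1, H, W); crop a small patch centered at loc via grid_sample
        batch = img.size(0)
        theta = torch.zeros(batch, 2, 3)
        theta[:, 0, 0] = theta[:, 1, 1] = self.patch / img.size(-1)
        theta[:, :, 2] = loc                   # translation part of the affine map
        grid = F.affine_grid(theta, (batch, 1, self.patch, self.patch),
                             align_corners=False)
        patch = F.grid_sample(img, grid, align_corners=False)
        return torch.cat([patch.flatten(1), loc], dim=1)

    def forward(self, img):
        batch = img.size(0)
        h = torch.zeros(batch, self.core.hidden_size)
        loc = torch.zeros(batch, 2)            # start at the image center
        for _ in range(self.n_steps):
            g = torch.relu(self.glimpse_fc(self.glimpse(img, loc)))
            h = self.core(g, h)                # carry info to the next stage
            loc = torch.tanh(self.loc_net(h))  # pick where to look next
        return self.classifier(h)              # one prediction, at the last step
```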
Multiple Object Recognition with Visual Attention
In doorplate (house-number) recognition, the network scans the image from left to right, much as humans recognize such objects. Besides doorplates, the paper also recognizes handwritten text, again achieving good results.
(Figure omitted: experimental results on doorplate and handwriting recognition.)
Summary
This article has given a preliminary introduction to the attention mechanism in computer vision. Beyond the methods above, there are surely more ingenious ones, and the author hopes readers can offer further advice.
References
Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition, CVPR, 2017.
Recurrent Models of Visual Attention, NIPS, 2014.
GitHub code: Recurrent-Attention-CNN, github.com/Jianlong-Fu/
Multiple Granularity Descriptors for Fine-grained Categorization, ICCV, 2015.
Multiple Object Recognition with Visual Attention, ICLR, 2015.
Understanding LSTM Networks, Colah's Blog, 2015, colah.github.io/posts/2
Survey on the attention based RNN model and its applications in computer vision, 2016.