Editor's note
In "How Machine Translation Is Made (Part One)", we reviewed the development history of machine translation. In this article, we share the theory, algorithms, and engineering practice behind a machine translation system and explain how neural machine translation works. After reading this article, you will learn:
How the neural machine translation model evolved and developed into the Transformer model that attracted the attention of NLP researchers.
Based on the Transformer model, how to build an industrial-level neural machine translation system.
Around 2013-2014, the previously tepid field of natural language processing (NLP) underwent earth-shaking changes: Mikolov and colleagues at Google Brain put forward large-scale word embedding techniques such as word2vec, and deep networks such as RNNs and CNNs were applied to a wide range of NLP tasks. NLP researchers around the world were excited and eager to try them, ready to bid farewell to a long period of mediocre progress and usher in a new era for NLP.
"The Big Bang" has also occurred in the field of machine translation in the past two years. In 2013, Nal Kalchbrenner and Phil Blunsom of Oxford University proposed end-to-end neural machine translation (Encoder-Decoder model). In 2014, Google's Ilya Sutskerver and others introduced LSTM into the Encoder-Decoder model. These two events indicate that the neural network-based machine translation is beginning to surpass the previous statistical model-based statistical machine translation (SMT) and quickly become the mainstream standard of online translation systems. After Google launched its neural machine translation system (GNMT) in 2016, there was a widely circulated saying on the Internet: "as an interpreter, when I saw this news, I understood the worries and fears of textile workers in the 18th century when they saw steam engines."
In 2015, the attention mechanism and memory-based neural networks alleviated the information-representation bottleneck of the Encoder-Decoder model, which is the key reason neural machine translation surpassed classical phrase-based machine translation. In 2017, Ashish Vaswani and colleagues at Google, building on the attention mechanism, proposed the Transformer model based on self-attention, and the Transformer family still holds state-of-the-art results across many NLP tasks. In summary, the development of neural machine translation (NMT) over the past decade has mainly gone through three stages: the plain encoder-decoder model (Encoder-Decoder), the attention mechanism model, and the Transformer model.
Below we analyze these three stages of NMT step by step. The small number of mathematical formulas and conceptual definitions may feel a little "mechanical"; if you find them hard to read, skip ahead to Part 4 to learn how to build your own industrial-grade NMT system.
01 New Dawn: Encoder-Decoder Model
This end-to-end machine translation model, proposed in 2013, has already been mentioned above. A sentence in a natural language can be regarded as time-series data, and recurrent neural networks such as the LSTM and GRU are well suited to processing time-series data. If we regard both the source-language sentence and the target-language sentence as independent time series, then machine translation is a sequence generation task. How is a sequence generation task carried out? Generally, with an encoder-decoder framework based on recurrent neural networks (also known as Sequence to Sequence, or Seq2Seq). The Seq2Seq model consists of two sub-models, an encoder and a decoder, each an independent recurrent neural network. The model maps a given source-language sentence to a continuous, dense vector with the encoder, and then uses the decoder to convert that vector into a target-language sentence.
The encoder (Encoder) encodes the input source-language sentence and transforms it into an intermediate semantic representation C:
C = F(x_1, x_2, …, x_m)
At time step i, the decoder (Decoder) generates the next target-language word y_i according to the intermediate semantic representation C output by the encoder and the previously generated history y_1, y_2, …, y_{i-1}:
y_i = G(C, y_1, y_2, …, y_{i-1})
Each y_i is generated in turn; that is, the Seq2Seq model generates the target-language sentence from the input source-language sentence. Although the source and target sentences differ in language and word order, they share the same semantics. After the encoder condenses the source sentence into a vector C in the embedding space, the decoder can use the semantic information implied in that vector to reproduce a target-language sentence with the same meaning. In a word, the Seq2Seq neural translation model simulates the two main steps of human translation:
The encoder (Encoder) comprehends the source text; the decoder (Decoder) re-expresses it in the target language.
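To make the idea concrete, here is a minimal PyTorch sketch of a Seq2Seq encoder-decoder. The GRU layers, vocabulary sizes and greedy decoding loop are illustrative assumptions, not the exact configuration of any system described in this article.

```python
# A minimal sketch of the Seq2Seq (encoder-decoder) idea: the encoder condenses
# the source sentence into a fixed-size vector C, and the decoder regenerates
# the target sentence from C one word at a time.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, src_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len)
        _, h = self.rnn(self.embed(src))         # h: (1, batch, hid_dim)
        return h                                 # the fixed-size vector C

class Decoder(nn.Module):
    def __init__(self, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, prev_token, hidden):       # one decoding step
        out, hidden = self.rnn(self.embed(prev_token), hidden)
        return self.out(out), hidden             # logits over target vocab

# Greedy decoding with untrained toy models, for illustration only.
encoder, decoder = Encoder(8000), Decoder(8000)
src = torch.randint(0, 8000, (1, 7))             # a toy source sentence
hidden = encoder(src)                            # C condenses the source
token = torch.tensor([[1]])                      # assumed <bos> id
for _ in range(10):
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(-1)                    # next target word y_i
```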
02 Breakthrough Leap: Attention Mechanism Model
2.1. Limitations of Seq2Seq Model
An important assumption of the Seq2Seq model is that the encoder can compress all the semantics of the input sentence into a semantic vector of fixed dimension, and that the decoder can use the information in that vector to regenerate a sentence with the same meaning in a different language. However, the performance of the encoder-decoder degrades sharply as the input sentence grows longer: using a fixed-dimensional intermediate semantic vector as the encoder output loses a great deal of detailed information. Recurrent neural networks therefore struggle with long input sentences, and the plain Seq2Seq model has a bottleneck in information representation.
The plain Seq2Seq model also processes the source sentence and the target sentence separately and cannot directly model the relationship between them. How can this limitation be overcome? In 2015, Bahdanau et al. applied the attention mechanism to jointly translating and aligning words for the first time, which removed the Seq2Seq bottleneck. The attention mechanism computes the relationship between each target word and every source word, thereby directly modeling the relationship between the source sentence and the target sentence. What kind of magic is this attention mechanism that has made NMT famous and able to win machine translation competitions?
2.2. The general principle of the attention mechanism
To explain it intuitively: in a database, a primary key (Key) uniquely identifies a data record (Value). When accessing a record, a query (Query) searches for the primary key that matches the query condition and fetches the corresponding Value. The attention mechanism follows a similar idea, but as a kind of soft addressing: instead of fetching a single record, it computes the degree of match between the Query and every Key, uses those matches as weights, and returns the weighted sum of all Values as the result of the query; that result is the attention. The general principle of the attention mechanism (refer to the figure above) is therefore: first, imagine that the source sentence is composed of a series of <Key, Value> pairs and the target sentence of a sequence of Query elements; then, given a Query element from the target sentence, compute the similarity or relevance between the Query and each Key to obtain a weight coefficient for the corresponding Value; finally, take the weighted sum of the Values to obtain the final Attention value. In essence, the attention mechanism is a weighted summation of the Values of the source-sentence elements, with Query and Key used to compute the weight coefficient of each Value. The general formula is:
Attention(Query, Source) = Σ_i Similarity(Query, Key_i) · Value_i
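As a toy illustration of this soft-addressing view, the NumPy sketch below turns the match scores between one Query and every Key into softmax weights and returns the weighted sum of the Values; the shapes are arbitrary assumptions.

```python
# Soft addressing: score Query against every Key, softmax the scores into
# weights, and return the weighted sum of the Values.
import numpy as np

def attention(query, keys, values):
    scores = keys @ query                  # similarity of Query with each Key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax -> weight coefficients
    return weights @ values                # weighted sum of the Values

keys = np.random.randn(5, 8)               # 5 source elements, dimension 8
values = np.random.randn(5, 8)
query = np.random.randn(8)                 # one target-side element
print(attention(query, keys, values).shape)    # (8,)
```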
In machine translation, the Seq2Seq model is generally a stack of multiple RNN layers such as LSTMs or GRUs. Google released its neural machine translation system GNMT in September 2016; it adopts the Seq2Seq + attention framework. The encoder network and decoder network each have eight LSTM hidden layers; the encoder outputs are weighted and averaged by the attention mechanism and then fed into each LSTM hidden layer of the decoder, and finally a softmax layer outputs the probability of every word in the target-language vocabulary.
How does GNMT compute the attention that brings such a significant performance improvement? Suppose (X, Y) is any source-target sentence pair from the parallel corpus. Then:
the source sentence of length m is X = x_1, x_2, …, x_m; the target sentence of length n is Y = y_1, y_2, …, y_n; and the encoder outputs d-dimensional vectors h = h_1, h_2, …, h_m as the encoding of X.
By the chain rule of probability, the conditional probability of the sentence pair factorizes as:
P(Y | X) = ∏_{i=1}^{n} P(y_i | y_0, y_1, …, y_{i-1}, X)
At time step i, the decoder obtains the target word y_i by maximizing P(Y | X) given the encoder output encoding and the previous i-1 decoder outputs.
The actual calculation steps of the GNMT attention mechanism are as follows: for each encoder output x_t, a small feed-forward network scores it against the previous decoder output, s_t = AttentionFunction(y_{i-1}, x_t); the scores are normalized with a softmax, p_t = exp(s_t) / Σ_k exp(s_k); and the attention result is the weighted sum a_i = Σ_t p_t · x_t.
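As a concrete reference, here is a hedged PyTorch sketch of additive (Bahdanau-style) attention, the family that GNMT's attention function belongs to; the layer sizes are assumptions for illustration, and this is not GNMT's exact implementation.

```python
# Additive attention: score each encoder output against the decoder state with
# a small feed-forward network, softmax the scores, and take the weighted
# average of the encoder outputs as the context vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, attn_dim=128):
        super().__init__()
        self.w_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.w_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, dec_dim); enc_outputs: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(
            self.w_dec(dec_state).unsqueeze(1) + self.w_enc(enc_outputs)))
        weights = F.softmax(scores.squeeze(-1), dim=-1)   # (batch, src_len)
        context = (weights.unsqueeze(-1) * enc_outputs).sum(dim=1)
        return context, weights                           # weighted average
```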
At this point you may be starting to feel a bit weary. Please bear with us a little longer, because the exciting moment has arrived: the protagonist of this article, the Transformer, takes the stage!
03 Highlight Moment: the Transformer Model Based on the Self-Attention Mechanism
In Part 2 we mentioned that the Seq2Seq + attention architecture works better than the plain Seq2Seq model, so what are the drawbacks of this combination? In fact, recurrent neural networks have a problem that troubled researchers for a long time: they cannot be parallelized effectively. The good news was not far off. In June 2017, Google published the paper "Attention Is All You Need", which, building on the attention mechanism, proposed the self-attention mechanism (self-attention) and a new neural network architecture, the Transformer. The model has the following advantages:
The traditional Seq2Seq model is based on RNNs, which limits training speed on GPUs, whereas the Transformer computes attention with a fully parallel mechanism that uses no RNNs or CNNs at all. The Transformer thus fixes the slow training that RNNs are most criticized for, using self-attention to achieve fast parallel computation; and the Transformer can be stacked very deep, fully exploiting the characteristics of DNN models and improving accuracy.
Let's take a closer look at the Transformer model architecture.
3.1. Transformer model architecture
The Transformer model is also essentially a Seq2Seq model, consisting of an encoder, a decoder, and the connection layers between them, as shown in the following figure. The encoder of "The Transformer" as introduced in the original paper: the encoder (Encoder) is a stack of six identical encoding layers (Encoder layers), each with two sublayers. The first sublayer is a multi-head attention (Multi-Head Attention) mechanism, and the second is a simple position-wise fully connected feed-forward network (Feed-Forward Network). A residual connection (Residual Connection) is applied around each sublayer, followed by layer normalization (Layer Normalization). The output of each sublayer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sublayer itself.
"The Transformer" decoder: the decoder Decoder is also stacked by six identical decoding layers Decoder Layer. In addition to the same two sublayers as in each encoder layer, the decoder inserts a third sublayer (Encoder-Decoder Attention layer), which performs Multi-HeadAttention on the output of the encoder stack. Similar to encoders, we use residual connections in each sublayer, and then standardize the layer.
The Transformer model calculates attention in three ways:
encoder self-attention: every encoder layer has a Multi-Head Attention layer; decoder self-attention: every decoder layer has a Masked Multi-Head Attention layer; encoder-decoder attention: every decoder layer has an Encoder-Decoder Attention layer, whose process is similar to the attention in the earlier seq2seq + attention models.
3.2. Self-attention mechanism
The core idea of the Transformer model is the self-attention mechanism (self-attention), which attends to different positions of the input sequence in order to compute a representation of that sequence. As the name implies, self-attention is not attention between the source sentence and the target sentence, but attention among the elements within the same sentence. In the attention computed by the general Seq2Seq model, the decoder output serves as the query vector Q and the encoder output sequence serves as the key vectors K and value vectors V, so that attention occurs between elements of the target sentence and all elements of the source sentence.
In the self-attention mechanism, the vector at each position of the encoder or decoder input sequence is transformed by three linear projections into three vectors: a query vector q, a key vector k, and a value vector v. The q of each position is matched against the k of every other position in the sequence; the match scores are passed through a softmax layer to obtain weights between 0 and 1, which are used to take a weighted average of the v of every position, finally yielding the output vector z for that position. The computation of self-attention is described below.
▶ Scaled dot-product attention
Scaled dot-product attention is the way self-attention is computed with vectors, in four steps:
1. Generate three vectors from the input vector of each encoder position (the word vector of each word): a query vector q, a key vector k, and a value vector v. In matrix form, these three vectors are created by multiplying the encoder/decoder input X by three weight matrices W^Q, W^K, and W^V.
2. Compute the scores. In the illustrated example, the input sentence is "Thinking Machines", and we compute the self-attention of the first word, "Thinking". Every word in the input sentence must be scored against "Thinking"; the score determines how much attention is paid to the rest of the sentence when encoding "Thinking". The score is the dot product of the key vector k of the word being scored with the query vector q of "Thinking": the first score is q_1 · k_1, the second is q_1 · k_2.
3. Scale and normalize: multiply each score by the scaling factor 1/√d_k (where d_k is the dimension of the key vector, d_k = 64) to make the gradients more stable, then pass the results through a softmax. The softmax normalizes the scores of all words so that they are positive and sum to 1; the softmax score determines each word's contribution to the encoding of the current position ("Thinking").
4. Multiply each value vector v by its softmax score, so that semantically related words are emphasized and unrelated words are weakened; sum the weighted value vectors to obtain the output z of the attention layer at this position.
Therefore, scaled dot-product attention can be computed with the following formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
In practice, attention is computed in matrix form for speed. Let us look at how the self-attention mechanism is implemented with matrix operations.
First, the query matrix Q, key matrix K, and value matrix V are obtained by multiplying the input matrix X by the weight matrices W^Q, W^K, and W^V. Since the score of any word is the dot product of its query vector q with the key vectors k of all words, we can stack the key vectors of all words into the key matrix K and the query vectors of all words into the query matrix Q; multiplying the two gives the attention score matrix A = QK^T. Then softmax is applied to the (scaled) score matrix A to obtain the normalized score matrix Â, which is multiplied by the value matrix V to obtain the output matrix Z = ÂV.
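Putting the formula and the matrix description together, a minimal NumPy sketch of scaled dot-product attention might look like this; Q, K, and V are assumed to be already projected from the input X, and the shapes are toy examples.

```python
# Matrix-form scaled dot-product attention: A = Q K^T / sqrt(d_k),
# row-wise softmax, then multiply by V to get the output matrix Z.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    A = Q @ K.T / np.sqrt(d_k)                      # attention score matrix
    A_hat = np.exp(A - A.max(axis=-1, keepdims=True))
    A_hat /= A_hat.sum(axis=-1, keepdims=True)      # row-wise softmax
    return A_hat @ V                                # output matrix Z

seq_len, d_k = 4, 64                                # toy sentence length
Q = np.random.randn(seq_len, d_k)                   # Q = X W^Q (precomputed)
K = np.random.randn(seq_len, d_k)                   # K = X W^K
V = np.random.randn(seq_len, d_k)                   # V = X W^V
Z = scaled_dot_product_attention(Q, K, V)           # shape (seq_len, d_k)
```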
▶ Multi-head attention
If only a single attention is computed, it is difficult to capture all of the information in the input sentence. To improve the model, the original paper proposes a new approach, multi-head attention (Multi-Head Attention). Rather than performing a single attention over vectors of dimension d_model, multi-head attention linearly projects K, Q, and V h times into different spaces of dimensions d_q, d_k, and d_v, and then performs attention in each of these projected spaces.
Here d_q = d_k = d_v = d_model / h = 64, projected onto h = 8 heads. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions; with a single attention head, averaging would weaken this information.
Multi-head attention maintains independent query/key/value weight matrices for each head, producing different query/key/value matrices (Q, K, V): each head's Q, K, and V are generated by multiplying the input matrix X by that head's weight matrices W^Q, W^K, and W^V. Using the same self-attention calculation as above, the eight different sets of weight matrices yield eight different output matrices Z, each representing a projection of the input's hidden vectors into a different space. Finally, the eight matrices are concatenated and reduced to a single output matrix Z by multiplying with a weight matrix W^O.
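Below is a compact PyTorch sketch of multi-head attention as just described, assuming the paper's d_model = 512 and h = 8 (so d_k = 64). Packing the per-head projections into single linear layers is an implementation choice for brevity, not something stated above.

```python
# Multi-head attention: project to per-head q/k/v, run scaled dot-product
# attention in each head, concatenate the heads, and map back with W^O.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)    # packs the h W^Q matrices
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)    # the output matrix W^O

    def forward(self, x):                          # x: (batch, seq, d_model)
        b, n, _ = x.shape
        split = lambda t: t.view(b, n, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        z = F.softmax(scores, dim=-1) @ v          # (batch, h, seq, d_k)
        z = z.transpose(1, 2).reshape(b, n, self.h * self.d_k)
        return self.w_o(z)                         # concatenated heads -> Z

x = torch.randn(2, 5, 512)
print(MultiHeadAttention()(x).shape)               # torch.Size([2, 5, 512])
```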
What information does each head of multi-head attention pay attention to in a sentence? Where do the different heads focus? Take these two sentences as examples: "The animal didn't cross the street because it was too tired" and "The animal didn't cross the street because it was too wide". What does "it" refer to in each sentence: "street" or "animal"? When the model encodes the word "it", attention focuses on both "animal" and "street"; in a sense, the model's representation of "it" is partly a representation of both. But the semantics differ: in the first sentence "it" points more strongly to "animal", and in the second sentence more strongly to "street".
3.3. Other structures of the Transformer model
▶ Residual connection and layer normalization
The encoder and decoder layers have a special structure: around each sublayer (the multi-head attention output and the feed-forward layer) there is a residual connection followed by layer normalization (LN). The residual connection builds a residual structure that rewrites the sublayer output as a residual of its input, so that small changes in the model can be noticed during training; this method is widely used in computer vision.
Before feeding data into an activation function we need to normalize it, because we do not want the inputs to fall in the saturation region of the activation function. LN is a regularization method in deep learning, usually contrasted with batch normalization (BN). The main idea of BN is to normalize each batch of data at each layer, whereas LN computes the mean and variance over each individual sample. The advantage of LN is that the normalization is computed independently for each single sample, rather than along the batch dimension as in BN.
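A tiny sketch of the difference in normalization axes, using default PyTorch layers and toy shapes:

```python
# LN normalizes over the features of each individual sample; BN normalizes
# each feature over the batch dimension.
import torch
import torch.nn as nn

x = torch.randn(4, 8)            # batch of 4 samples, 8 features each
ln = nn.LayerNorm(8)             # per-sample mean/variance (over features)
bn = nn.BatchNorm1d(8)           # per-feature mean/variance (over the batch)
print(ln(x).mean(dim=1))         # ~0 for every sample (row)
print(bn(x).mean(dim=0))         # ~0 for every feature (column)
```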
▶ Feed-forward neural network
The output of the attention sublayer in each encoding and decoding layer is connected to a fully connected feed-forward network (Feed-Forward Network, FFN), which consists of two linear transformations with a ReLU in between. The paper applies the FFN separately to each position (each token of the input sentence), so it is called a position-wise FFN. The formula is:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
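A minimal PyTorch sketch of such a position-wise FFN, assuming the paper's d_model = 512 and an inner dimension of 2048:

```python
# Position-wise FFN: two linear layers with a ReLU in between, applied
# identically at every position of the sequence.
import torch.nn as nn

position_wise_ffn = nn.Sequential(
    nn.Linear(512, 2048),   # x W1 + b1
    nn.ReLU(),              # max(0, ·)
    nn.Linear(2048, 512),   # · W2 + b2
)
```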
▶ Linear transformation and softmax layer
The decoder finally outputs a vector of real numbers. How do we turn floating-point numbers into a word? That is the job of the linear transformation layer, followed by the softmax layer. The linear layer is a simple fully connected network that projects the vector produced by the decoder into a much larger vector called the logits.
Suppose our model knows 10,000 distinct English words learned from the training set (the model's output vocabulary). Then the logits vector has 10,000 cells, each holding the score of one word. The softmax layer turns those scores into probabilities (all positive and summing to 1). The cell with the highest probability is selected, and its corresponding word is emitted as the output for this time step.
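A small sketch of this final step, with the 10,000-word vocabulary used as an illustrative size:

```python
# Project the decoder output to vocabulary-sized logits, softmax into
# probabilities, and greedily pick the most probable word for this time step.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 10_000, 512
generator = nn.Linear(d_model, vocab_size)    # produces the logits
dec_output = torch.randn(1, d_model)          # one decoder time step
probs = F.softmax(generator(dec_output), dim=-1)
next_word_id = probs.argmax(dim=-1)           # the highest-probability cell
```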
▶ Positional encoding
The input to the Seq2Seq model includes only word vectors, but the Transformer abandons recurrence and convolution and cannot extract sequence-order information by itself. If sequence-order information is missing, all the words might be right yet fail to form a meaningful sentence. How do the authors solve this problem? To let the model exploit the order of the sequence, information about the relative or absolute position of each word must be injected. The paper introduces positional encoding (Positional Encoding): encoding the position of each word in the sequence. The following figure visualizes the positional encodings for 20 words with an embedding dimension of 512.
Add the "positional coding" of each word in the sentence to the input embedding at the bottom of the encoder and decoder stack, and the positional encoding and word embedding have the same dimension d (model), so the two can be added. In this paper, sine and cosine functions of different frequencies are used to obtain position information.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos is the position and i is the dimension index: sine is used on even dimensions and cosine on odd dimensions. Each dimension of the positional encoding corresponds to a sinusoid.
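A NumPy sketch of this sinusoidal encoding, using the same 20 positions and 512 dimensions as the visualization mentioned above:

```python
# Sinusoidal positional encoding: sine on even dimensions, cosine on odd
# dimensions; the resulting matrix is added to the word embeddings.
import numpy as np

def positional_encoding(max_len=20, d_model=512):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = positional_encoding()
print(pe.shape)                                        # (20, 512)
```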
There is no doubt that the Transformer is currently the mainstream model for machine translation. Facing the strength of Google and other technology giants, how can the Percent Cognitive Intelligence Lab use the Transformer model to develop a market-competitive, industrial-grade multilingual neural translation system? Part 4 explains.
04 Practice of an Industrial-Grade Multilingual Neural Translation Model
4.1. Multilingual model translation framework
Google's GNMT trains on large parallel corpora of multiple languages simultaneously to obtain a single neural translation model that supports multiple source-language inputs and multiple target-language outputs, but this approach requires expensive computing resources for training and deployment.
Percent's neural translation system Deep Translator
The neural translation system Deep Translator currently supports translation among Chinese, English, Japanese, Russian, French, German, Arabic, Spanish, Portuguese, Italian, Hebrew, Persian and other languages in hundreds of directions. How can model training and online inference be carried out with limited server resources?
Unlike Google GNMT, which adopts a multi-language, single-model architecture, the multilingual translation model of Deep Translator proposed by the R&D team is an ensemble of multiple parallel sub-models. The scheme has two main characteristics. One is model independence: a separate translation model is trained for each language direction. The other is "bridged" translation: for language pairs with little Chinese-to-other-language parallel corpus, English, which is rich in corpus resources, is used as the intermediate language, that is, the source language is first translated into English and English is then translated into the target language.
What was the team's deeper reasoning in adopting this design? First, unlike Google's global Internet users, domestic enterprise end users have clear language-direction needs, require on-premises deployment, and have higher quality requirements in certain directions such as English, Chinese and Russian. They expect translation quality in these directions to keep improving and problems to be corrected promptly, while other, low-frequency directions mainly need to remain stable. As a result, frequently used language models are updated often, while rarely used ones are updated less frequently. If all languages were unified under one model, it would increase model complexity and hurt stability, because upgrading one language direction would inevitably update the parameters of the whole model, affecting the translation quality of every other direction; each upgrade would then require evaluating all directions, and any obvious regression would be time-consuming and laborious to fix. With independent models, by contrast, optimizing the parameters of one language direction does not affect any other direction, so the model-update workload is greatly reduced while the overall stability of the system's translation quality is preserved.
Second, an industrial-grade neural machine translation model demands high-quality parallel corpora: a usable translation model needs tens of millions of parallel training sentences, and the system supports a relatively large number of language directions, for many of which it is currently difficult to obtain enough bilingual training data. There are generally two solutions. One is the unsupervised translation model, which requires only monolingual training corpora, which are relatively easy to obtain; its drawback is that current unsupervised models are not yet mature enough for production use. The other is "bridging", since bilingual corpora between most languages and English are relatively easy to obtain; its drawbacks are some loss of accuracy from translating through English and roughly doubled computing cost. Analysis of user needs showed that users care more about translation quality than execution efficiency, and evaluation of the two approaches showed the bridged architecture currently translates better than the unsupervised model, so the English-bridged architecture was chosen in the end.
4.2. Building a billion-scale parallel corpus
Parallel corpora are the dream resource of neural machine translation researchers. It is no exaggeration to say that, until the Transformer architecture itself is surpassed, parallel corpus resources are the real competitive advantage in machine translation. However many parallel sentences Google and Facebook crawl from the massive Internet, parallel corpora in specialized industry domains remain scarce, because large volumes of monolingual corpora (electronic documents, books) and the output of professional translators never reach the Internet. Acquiring these resources and organizing them into parallel corpora is not free and requires a great deal of manpower, which is an obstacle to the deep application of neural machine translation in industry.
How does the Cognitive Intelligence Lab construct its own multilingual parallel corpora? Besides collecting open corpus resources from around the Internet, the development team designed a model and tools for constructing in-domain parallel corpora from the monolingual corpora in electronic documents, which can efficiently build high-quality, domain-specific parallel corpora to support model training. Building parallel corpora from monolingual corpora requires sentence splitting and sentence alignment, so how can sentence-level semantic similarity be computed over tens of millions of monolingual sentences? The team proposed learning semantic similarity through a translation classification task: given a pair of bilingual text inputs, a coding model is trained to represent various natural-language relationships (including similarity and relatedness). This greatly reduces model training time while preserving the performance of bilingual semantic-similarity classification. As a result, fast automatic alignment of bilingual texts was realized and a billion-scale parallel corpus was constructed.
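As an illustration of the alignment idea only (not the lab's actual model or code), the sketch below pairs sentences by the cosine similarity of their embeddings; the `encode` argument stands for a trained bilingual sentence encoder and is a hypothetical placeholder, and the threshold is an arbitrary assumption.

```python
# Toy sentence alignment: embed candidate sentences from both sides and keep
# the most similar pairs above a confidence threshold.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def align(src_sentences, tgt_sentences, encode, threshold=0.8):
    """Pair each source sentence with its best-matching target sentence."""
    src_vecs = [encode(s) for s in src_sentences]   # encode: hypothetical model
    tgt_vecs = [encode(t) for t in tgt_sentences]
    pairs = []
    for i, sv in enumerate(src_vecs):
        scores = [cosine(sv, tv) for tv in tgt_vecs]
        j = int(np.argmax(scores))
        if scores[j] >= threshold:                  # keep confident pairs only
            pairs.append((src_sentences[i], tgt_sentences[j], scores[j]))
    return pairs
```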
After collecting open-source parallel corpora online and constructing industry-domain parallel corpora, the Cognitive Intelligence Lab has accumulated the following amounts of high-quality parallel corpora.
4.3. Document format conversion, OCR and UI Design
Creating an industry-oriented machine translation system with a good user experience has always been the tireless pursuit of the Cognitive Intelligence Lab's R&D team. Realizing this goal requires not only an end-to-end neural translation model that achieves the best multilingual translation quality, but also an end-to-end translation system that supports multi-user collaboration. Such a system mainly needs to solve two problems: first, the technical problems of converting documents of many formats and languages and performing OCR on image text; second, providing a UI that supports collaboration among multiple users.
End users generally want the system to convert PDFs, images, slides and other formats into editable electronic files, translate them into the target language, and preserve the layout of the original document for easy reading. So how can document formats be converted, image text recognized, and the best results in this area achieved? Leading OCR technology brings the Deep Translator system closer to users' real work scenarios: PDF, PPT, image and other documents in multiple languages can be translated directly without manual conversion, and the output is delivered in editable formats such as PDF, Word and PPT with the original layout and style preserved, making it convenient for users to read the source and target texts side by side.
For research institutes and companies, the system must support collaborative use by multiple users and provide a friendly UI under limited server resources. After iterative polishing, the Deep Translator system has formed four major features: first, it provides document translation, text translation and document conversion to meet different user needs; second, it implements task-priority scheduling and queueing algorithms for urgent and normal translation tasks from multiple users; third, it supports rich operations such as batch upload and download of multiple documents per user, parameter configuration, and translation-progress monitoring; fourth, it supports multiple permissions, multi-role management and unified account authentication.
4.4. Product advantages and practical experience
The multilingual machine translation system Deep Translator launched by the Percent Cognitive Intelligence Lab supports on-premises deployment and customized model training, and reaches the best industrial-grade machine translation level in the industry. Table 1 shows the results of the translation-quality evaluation of Deep Translator on the official test set of the United Nations Parallel Corpus; its BLEU scores in mainstream directions such as English-Chinese and Russian translation reach the best level.
Since its launch in 2017, Deep Translator has served hundreds of customers, including domestic research institutes in aviation, electronics and other defense-related fields, and has earned a good reputation. In addition, in cooperation with Rongrong Network (www.rongrong.cn), it has been promoted and sold to thousands of military research institutes. We have gone further and further down the road of bringing machine translation services to industry, fulfilling the mission of serving national defense with cognitive intelligence technology.
References:
Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent Continuous Translation Models. In Proceedings of EMNLP 2013.
Ilya Sutskever, et al. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of NIPS 2014.
Dzmitry Bahdanau, et al. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR 2015.
Ashish Vaswani, et al. 2017. Attention Is All You Need. In Proceedings of NIPS 2017.
Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
Zhang Junlin. Attention Models in Deep Learning (2017 edition). https://zhuanlan.zhihu.com/p/37601161