Interpretation of a full-score ICLR 2020 paper | A mirror-generative machine translation model: MGNMT


2020-01-09 06:15:09

Paper link: https://static.aminer.cn/misc/pdf/minrror.pdf

I. Summary

Conventional neural machine translation (NMT) requires large amounts of parallel corpora, which are difficult to obtain for many language pairs. Fortunately, raw non-parallel corpora are easy to collect. Even so, existing methods based on non-parallel corpora still do not fully exploit them during training and decoding.

To address this, the paper proposes a mirror-generative machine translation model: MGNMT (mirror-generative NMT).

MGNMT is a unified framework that integrates the source-to-target and target-to-source translation models together with the language models of the two languages. The translation models and language models in MGNMT share a latent semantic space, so they can learn translation in both directions more effectively from non-parallel corpora. In addition, the translation models and language models can decode jointly, improving translation quality. Experiments show that the method is indeed effective: MGNMT consistently outperforms existing approaches across scenarios and language pairs, both resource-rich and low-resource.

II. Introduction

Neural machine translation is now very popular, but it depends heavily on large amounts of parallel corpora. In most machine translation scenarios, however, obtaining large parallel corpora is not easy. Moreover, because parallel corpora differ greatly across domains, NMT is usually hard to apply to specific domains (for example, the medical domain) where parallel data is limited. Therefore, when parallel corpora are insufficient, it is very important to make full use of non-parallel data (usually available at low cost) to achieve satisfactory translation performance.

Current NMT systems do not yet make the best use of non-parallel corpora in either the training or the decoding stage. For training, back-translation is generally used. Back-translation updates the translation models of the two directions separately, which is not efficient enough. Given source-language data x and target-language data y, back-translation first uses the tgt2src translation model to translate y into the source language, then updates the src2tgt translation model with the resulting pseudo sentence pairs (x̃, y). Similarly, the data x can be used to update the translation model in the opposite direction. Note that the translation models of the two directions are independent of each other and updated independently: an update to one model brings no direct benefit to the other (see the sketch below). Some scholars have therefore proposed joint back-translation and dual learning to let the two models implicitly benefit from each other through iterative training. However, the translation models in these methods are still independent. Ideally, when the translation models of the two directions are related, the gain brought by non-parallel corpora can be further improved: every update on one side also improves the other, and vice versa, making fuller use of non-parallel corpora.
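As a concrete illustration of why the two updates are decoupled, here is a minimal sketch of the standard back-translation loop described above. The src2tgt/tgt2src objects and their translate()/train_step() methods are hypothetical stand-ins, not the paper's actual code.

```python
# Minimal sketch of standard back-translation, assuming hypothetical src2tgt/tgt2src
# model objects with translate() and train_step() methods; not the paper's code.

def back_translation_step(src2tgt, tgt2src, mono_x, mono_y):
    """One round of back-translation on monolingual batches mono_x and mono_y."""
    # Translate target monolingual sentences y into pseudo-source sentences x~.
    pseudo_x = [tgt2src.translate(y) for y in mono_y]
    # Update the src2tgt model on the pseudo-parallel pairs (x~, y).
    src2tgt.train_step(list(zip(pseudo_x, mono_y)))

    # Symmetrically, translate source monolingual sentences x into pseudo-target y~.
    pseudo_y = [src2tgt.translate(x) for x in mono_x]
    # Update the tgt2src model on the pseudo-parallel pairs (x, y~).
    tgt2src.train_step(list(zip(mono_x, pseudo_y)))
    # The two updates are independent: neither model directly benefits from the
    # other's update, which is the inefficiency MGNMT aims to remove.
```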

For decoding, some scholars have proposed directly interpolating an external language model trained on the target language into the x -> y translation model. Introducing target-language knowledge in this way can indeed yield better translations, especially for specific domains. However, directly plugging in an independently trained language model at decoding time is not ideal, for the following reasons:

(1) The language model is external and independent of the translation model's training. Such simple insertion may prevent the two models from cooperating well and can even lead to conflicts.

(2) The language model is used only during decoding, not during training. This creates an inconsistency between training and decoding, which may hurt performance.

This paper proposes mirror-generative NMT (MGNMT) to address the above problems and make more efficient use of non-parallel corpora. MGNMT combines the translation models (both directions) and the language models (both languages) in a unified framework. Inspired by generative NMT (GNMT), MGNMT introduces a latent semantic variable z shared between x and y, and uses the symmetry, or mirror, property to decompose the conditional joint probability p(x, y | z):
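The decomposition, reconstructed here from the paper's description (the original's notation may differ slightly), splits the conditional joint probability symmetrically into the two translation models and the two language models:

```latex
\log p(x, y \mid z)
  = \tfrac{1}{2}\Big[\underbrace{\log p(y \mid x, z)}_{\text{src2tgt TM}}
                   + \underbrace{\log p(y \mid z)}_{\text{target LM}}
                   + \underbrace{\log p(x \mid y, z)}_{\text{tgt2src TM}}
                   + \underbrace{\log p(x \mid z)}_{\text{source LM}}\Big]
```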

The probabilistic graphical model of MGNMT is shown in Figure 1:

Figure 1: Probabilistic graphical model of MGNMT

The translation models of the two directions and the language models of the two languages are aligned through the shared latent semantic variable, as shown in Figure 2:

Figure 2: The mirror property of MGNMT

After the latent variable is introduced, the models become related to each other, while remaining conditionally independent given z. This gives MGNMT the following two advantages:

(1) During training, thanks to the latent variable, the translation models of the two directions are no longer independent but related to each other. An update in one direction therefore directly benefits the translation model of the other direction, improving the utilization efficiency of non-parallel corpora.

(2) During decoding, MGNMT can naturally make use of its internal target-side language model. This language model is learned jointly with the translation model, and using the two together helps produce better translations.

Experiments show that MGNMT achieves competitive results on parallel corpora and outperforms several strong baseline models in various scenarios and language pairs (including resource-rich, low-resource and cross-domain translation). Moreover, the joint learning of the translation model and the language model is found to indeed improve MGNMT's translation quality. The paper also shows that MGNMT is architecture-agnostic and can be applied to any neural sequence model, such as the Transformer or RNNs.

III. Methods

The overall framework of MGNMT is shown in Figure 3:

Figure 3: Framework of MGNMT

Here (x, y) denotes a source-target sentence pair, θ denotes the model parameters, D_xy denotes the parallel corpus, and D_x and D_y denote the respective non-parallel monolingual corpora.

MGNMT jointly models bilingual sentence pairs, specifically using the mirror property of the joint probability:

The latent variable z (a standard Gaussian prior is chosen in this paper) represents the semantics shared between x and y. It bridges the translation models and language models of the two directions. The training and decoding procedures on parallel and non-parallel corpora are introduced below.

Training on parallel corpora

Given a parallel sentence pair (x, y), the log-likelihood log p(x, y) is approximately maximized with stochastic gradient variational Bayes (SGVB). The approximate posterior is parameterized as q(z | x, y; φ):

The evidence lower bound (ELBO) of the joint log-likelihood can be derived from equation (1):
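Based on the mirror decomposition above, the ELBO in equation (2) takes roughly the following form (a reconstruction from the surrounding description; the paper's exact notation may differ):

```latex
\mathcal{L}(x, y; \theta, \phi)
  = \mathbb{E}_{q(z \mid x, y; \phi)}\Big[
      \tfrac{1}{2}\big(\log p(y \mid x, z) + \log p(y \mid z)
                     + \log p(x \mid y, z) + \log p(x \mid z)\big)\Big]
    - \mathrm{KL}\big(q(z \mid x, y; \phi)\,\big\|\,p(z)\big)
```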

The first term in equation (2) is the expected log-likelihood of the sentence pair, estimated by Monte Carlo sampling. The second term is the KL divergence between the approximate posterior and the prior of the latent variable. Using the reparameterization trick, all components are trained jointly with a gradient-based algorithm.
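For readers unfamiliar with the reparameterization trick, the sketch below shows how a Gaussian posterior sample and its KL term against the standard Gaussian prior are typically computed; the tensor names are hypothetical and this is not the authors' code.

```python
import torch

def sample_z_and_kl(mu, log_var):
    """Draw a reparameterized sample z ~ q(z | x, y) = N(mu, diag(exp(log_var)))
    and compute its KL divergence to the standard Gaussian prior p(z) = N(0, I).

    mu and log_var are (batch, latent_dim) tensors produced by the (hypothetical)
    posterior inference network.
    """
    eps = torch.randn_like(mu)                  # noise is sampled outside the graph
    z = mu + torch.exp(0.5 * log_var) * eps     # gradients flow through mu, log_var
    # Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions.
    kl = 0.5 * torch.sum(torch.exp(log_var) + mu.pow(2) - 1.0 - log_var, dim=-1)
    return z, kl
```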

Training on non-parallel corpora

The paper designs an iterative training method for MGNMT to make use of non-parallel corpora. During this training, the translation models of both directions benefit from their respective monolingual datasets and promote each other. The training procedure on non-parallel corpora is shown in Algorithm 1:

Given two non-parallel sentences, a source-language sentence x^(s) and a target-language sentence y^(t), the goal is to maximize a lower bound on the sum of their marginal log-likelihoods:
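Written out, equation (3) has roughly the following shape (a sketch based on the description; the paper's exact notation may differ):

```latex
\log p\big(x^{(s)}\big) + \log p\big(y^{(t)}\big)
  \;\ge\; \mathcal{L}^{s}\big(x^{(s)}; \theta, \phi\big)
        + \mathcal{L}^{t}\big(y^{(t)}; \theta, \phi\big)
```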

The two terms on the right-hand side of the inequality are the lower bounds on the marginal log-likelihoods of the source and target sentences, respectively.

Take the second term as an example. A source-language sentence x sampled from p(x | y^(t)) is used as the translation of y^(t) (that is, back-translation), yielding a pseudo-parallel sentence pair (x, y^(t)). The expression of this term is given directly in equation (4):

In the same way, the expression of the other term can be obtained:

Interested readers can find the detailed derivation in the appendix of the original paper.

From the two formulas above, pseudo-parallel corpora in both directions can be obtained and then combined to train MGNMT. Equation (3) can be optimized with a gradient-based method, as shown in equation (6):

The overall training procedure on non-parallel corpora is, to some extent, similar to joint back-translation. However, each iteration of joint back-translation uses monolingual data of one language only to update the translation model of one direction. In MGNMT, because the latent variable is drawn from the shared approximate posterior q(z | x, y; φ), it serves as a bridge so that monolingual data of either language benefits both directions.
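The iterative procedure of Algorithm 1 can be outlined as follows. The mgnmt object and its sample_translation()/train_step() methods are hypothetical stand-ins for the real implementation, so treat this as an illustrative sketch rather than the authors' code.

```python
def train_on_nonparallel(mgnmt, mono_src, mono_tgt, num_epochs=1):
    """Illustrative outline of Algorithm 1: iterative MGNMT training on
    non-parallel corpora (hypothetical interface)."""
    for _ in range(num_epochs):
        for x_s, y_t in zip(mono_src, mono_tgt):
            # Back-translate each monolingual sentence with the current models.
            y_pseudo = mgnmt.sample_translation(x_s, direction="src2tgt")
            x_pseudo = mgnmt.sample_translation(y_t, direction="tgt2src")
            # Both pseudo-pairs are scored under the shared posterior q(z | x, y; phi),
            # so one gradient step on the combined objective (equation (6)) benefits
            # both translation directions and both language models at once.
            mgnmt.train_step([(x_s, y_pseudo), (x_pseudo, y_t)])
```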

Decoding

MGNMT models both the translation model and the language model during decoding, so it can produce smoother, higher-quality translations. Given a source sentence x (or a target sentence y), the corresponding translation is found via arg max_y p(y | x) = arg max_y p(x, y). The specific decoding procedure is shown in Algorithm 2:

Taking the src2tgt direction as an example, the following is done for a given source sentence x:

(1) Sample an initial latent variable z from the standard Gaussian prior and obtain an initial translation ỹ = arg max_y p(y | x, z).

(2) Repeatedly re-sample the latent variable from the approximate posterior q(z | x, ỹ; φ) and re-decode with beam search to maximize the ELBO, iteratively refining ỹ:

At each decoding step, the score is determined by the x -> y translation model and the language model of y, which pushes the translation toward fluent target-language text. The reconstruction score used for reranking is determined by the y -> x translation model and the language model of x. Reranking here means reordering the candidate translations after decoding, and introducing the reconstruction score into reranking does improve translation quality.
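Putting the two steps together, the decoding loop can be sketched as below. The mgnmt object and its sample_prior()/sample_posterior()/beam_search()/reconstruction_score() methods are hypothetical, not the authors' interface.

```python
def mirror_decode(mgnmt, x, num_iters=3):
    """Illustrative sketch of Algorithm 2 for the src2tgt direction
    (hypothetical interface)."""
    z = mgnmt.sample_prior()                      # z ~ N(0, I)
    y = mgnmt.beam_search(x, z)                   # initial draft translation
    for _ in range(num_iters):
        z = mgnmt.sample_posterior(x, y)          # z ~ q(z | x, y; phi)
        # Beam search scores each step with the x->y translation model plus the
        # target-side language model, keeping candidates fluent in the target language.
        candidates = mgnmt.beam_search(x, z, return_candidates=True)
        # Rerank candidates with the reconstruction score from the y->x translation
        # model and the source-side language model.
        y = max(candidates, key=lambda cand: mgnmt.reconstruction_score(x, cand, z))
    return y
```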

IV. Experiments

Experimental datasets: WMT16 EN-RO, IWSLT16 EN-DE, WMT14 EN-DE and NIST EN-ZH.

MGNMT makes better use of non-parallel corpora.

The non-parallel corpora used for each language pair are summarized in Table 1:

Table 1: Dataset statistics for each translation task

The following two tables give the experimental results of each model on the datasets. MGNMT with non-parallel corpora achieves the best results in all experiments.

Table 2: BLEU scores on low-resource and cross-domain translation tasks

Notably, in Table 2, MGNMT with non-parallel corpora achieves the best results on the cross-domain datasets.

Table 3: BLEU scores on resource-rich language datasets

Taken together, the two tables show that MGNMT achieves good results in both low-resource and resource-rich settings, especially after adding non-parallel corpora.

Jointly learning the language model in MGNMT performs better.

Table 4 shows the impact of introducing a language model into decoding. LM-FUSION in the table denotes plugging in a pre-trained language model instead of training it jointly as MGNMT does. As shown, simply inserting an external LM is not as effective as the method in this paper.

Table 4: Results of introducing a language model into decoding

The influence of non-parallel corpus size

Both the Transformer and MGNMT benefit from more non-parallel data, but in general MGNMT benefits more.

Figure 4: Effect of non-parallel corpus size on BLEU

Next, consider whether monolingual non-parallel data in only one language also helps MGNMT in both translation directions. The experimental results show that the model's BLEU score indeed improves when only one language's monolingual data is added, indicating that the translation models of the two directions do promote each other.

Figure 5: Effect of monolingual non-parallel corpora on BLEU

V. Summary

This paper proposes a mirror-generative machine translation model, MGNMT, to make more efficient use of non-parallel corpora.

The model jointly learns the two translation directions and their respective language models through a shared bilingual latent semantic space. In MGNMT, both translation directions benefit from non-parallel corpora. In addition, MGNMT naturally uses the jointly learned target-side language model during decoding, which directly improves translation quality. Experiments show that MGNMT outperforms other methods on translation tasks across various language pairs.

Future directions

A future research direction is to apply MGNMT to fully unsupervised machine translation.

