This article introduces the EMNLP 2019 paper Tree Transformer, which integrates syntax tree information into the Transformer. The content is fairly detailed, and interested readers may find it a useful reference.
Introduction
There have in fact been many previous efforts to incorporate syntactic information into RNNs, such as ON-LSTM and PRPN, which model syntactic structure implicitly and improve language-model accuracy. This paper instead incorporates syntactic information into the Transformer, giving the attention better interpretability. At the same time, it can predict the syntax tree of a sentence in an unsupervised way, and language-model performance improves over a plain Transformer.
Model structure
The main difference is that a constituent attention module is added to the multi-head attention, which indicates whether a span of words can form a phrase. For example, in the model figure from the paper, "cute dog" forms a phrase, so the attention between these two words is high at layer 0. "The cute dog" forms a larger phrase, so "the" and "dog" have higher attention at layer 1.
Recall that self-attention mainly computes scaled dot products between the query and key vectors of pairs of words:

E = Q K^T / sqrt(d)

where d is the dimension of the query and key vectors (usually the hidden size divided by the number of heads). In this paper a new constituent prior C is added, where C_{i,j} represents the probability that words w_i and w_j lie within the same phrase. C is then multiplied element-wise with the original self-attention:

A = C ⊙ softmax(Q K^T / sqrt(d))
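A minimal NumPy sketch of this combination (single head, hypothetical shapes, not the authors' code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def constituent_attention(Q, K, V, C):
    """Single-head self-attention whose softmax-normalized scores are
    multiplied element-wise by a constituent prior C of shape [n, n]."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # [n, n] scaled dot products
    probs = C * softmax(scores, axis=-1)   # element-wise product with the prior
    return probs @ V                       # weighted sum of value vectors
```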
Note that the constituent prior C is shared across the different attention heads.
How is this constituent prior obtained? It is decomposed into a product of probabilities that adjacent words belong to the same phrase. Define a_k as the probability that words w_k and w_{k+1} belong to the same phrase; then C_{i,j} can be expressed as:

C_{i,j} = ∏_{k=i}^{j-1} a_k

In this way, C_{i,j} is large only if every adjacent pair of words from position i to position j has a high probability of being in the same phrase. In the implementation, logarithms are used (summing log a_k) to avoid values that are too small.
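A small sketch of how C could be built from the link probabilities a_k, working in log space as described (my own indexing, not the authors' code):

```python
import numpy as np

def constituent_prior(a):
    """Build C[i, j] = prod_{k=min(i,j)}^{max(i,j)-1} a[k] from the
    neighbor-link probabilities a (length n-1), in log space for stability."""
    n = len(a) + 1
    log_a = np.log(np.clip(a, 1e-9, 1.0))
    # cum[j] - cum[i] = sum_{k=i}^{j-1} log a[k]
    cum = np.concatenate([[0.0], np.cumsum(log_a)])
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    log_C = cum[np.maximum(i, j)] - cum[np.minimum(i, j)]
    return np.exp(log_C)   # symmetric, with C[i, i] = 1
```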
The next question is how the a_k are obtained. First, similarly to self-attention, a score for two adjacent words belonging to the same phrase is computed from the dot product of their query and key vectors:

s_{k,k+1} = (q_k · k_{k+1}) / d

where d is the dimension of the query and key vectors (the hidden size divided by h, the number of heads).
Note that direction matters: there are two scores s_{k,k+1} and s_{k+1,k}, and although they describe the same pair of words, they are not necessarily equal. To prevent the degenerate case where all scores are large and every probability becomes 1 (which would make the prior useless), the scores are constrained by normalization: for each word, a softmax is taken over its scores with its left and right neighbors:

p_{k,k+1}, p_{k,k-1} = softmax(s_{k,k+1}, s_{k,k-1})

Then, since p_{k,k+1} and p_{k+1,k} are generally different, the two directions are combined by taking their geometric mean:

â_k = sqrt(p_{k,k+1} · p_{k+1,k})

In this way, â_k is large only when both directions of the neighboring attention are large, which means the two adjacent words have a high probability of belonging to the same phrase.
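A sketch of this neighboring attention under my reading of the description (the separate query/key projection and the handling of boundary words are assumptions):

```python
import numpy as np

def neighbor_link_probs(q, k, d):
    """For each adjacent pair of words, return a_hat[t]: the probability that
    words t and t+1 belong to the same phrase. q, k: [n, d] vectors from a
    separate (assumed) linear projection used for the neighboring attention."""
    n = q.shape[0]
    # directed scores between adjacent words
    s_right = np.sum(q[:-1] * k[1:], axis=-1) / d     # s_{t, t+1}, t = 0..n-2
    s_left  = np.sum(q[1:] * k[:-1], axis=-1) / d     # s_{t+1, t}, t = 0..n-2
    # for every word, softmax over its left/right neighbor scores;
    # boundary words have only one neighbor, whose probability is left at 1
    p_right = np.ones(n - 1)   # p_{t, t+1}
    p_left  = np.ones(n - 1)   # p_{t+1, t}
    for i in range(n):
        left  = s_left[i - 1] if i > 0 else None       # score of word i with word i-1
        right = s_right[i]    if i < n - 1 else None   # score of word i with word i+1
        if left is not None and right is not None:
            e_l, e_r = np.exp(left), np.exp(right)
            p_left[i - 1] = e_l / (e_l + e_r)
            p_right[i]    = e_r / (e_l + e_r)
    # combine the two directions of each link with a geometric mean
    return np.sqrt(p_right * p_left)
```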
As the first model diagram shows, constituent attention is computed at every layer, not just one. Lower layers can represent the probability that two adjacent words belong to a small phrase, while higher layers represent the probability of belonging to larger phrases. There is also a desired property: if two words have a high probability of belonging to the same phrase at a lower layer, they must have at least as high a probability of belonging to a (larger) phrase at the higher layers. So at layer l the link probability is updated as

a^l_k = a^{l-1}_k + (1 − a^{l-1}_k) · â^l_k

with a_k initialized to zero below the first layer. In this way a constituent prior C^l is obtained for each layer.
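A sketch of that layer-wise update (bottom layer first; the names are my own):

```python
import numpy as np

def hierarchical_link_probs(a_hat_per_layer):
    """Apply a^l = a^{l-1} + (1 - a^{l-1}) * a_hat^l across layers, so the
    link probabilities can only grow as we move upward.
    a_hat_per_layer: list of length-(n-1) NumPy arrays, bottom layer first."""
    a_prev = np.zeros_like(a_hat_per_layer[0])   # initialized to zero below the first layer
    a_layers = []
    for a_hat in a_hat_per_layer:
        a_prev = a_prev + (1.0 - a_prev) * a_hat
        a_layers.append(a_prev)
    return a_layers   # one a^l per layer; each can be fed into constituent_prior(...)
```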
Unsupervised parsing
The figure in the paper shows the syntax-tree decoding algorithm, which is similar to the decoding algorithm in the syntactic-distance paper. Since a_k represents the probability that two adjacent words belong to the same phrase, one could simply find the smallest a_k, split the span into two sub-phrases there, and recurse. However, this may not work well, because the range of phrase sizes a single layer represents is limited and does not cover all phrases well. So decoding instead proceeds recursively from the top layer. First find the minimum value within the current span. If it is greater than a threshold (0.8 in the experiments), the split point is not credible; if we have already reached the lowest allowed layer (layer 3 in the experiments), there is nothing more to do, and the whole span is treated as a single phrase. If we have not yet reached that layer, move down one layer and look for a split point there. If the minimum is below the threshold, the split point is credible, and the span is divided there.
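A sketch of that top-down procedure under my reading of the description (the indexing, whether sub-spans stay at the same layer after a split, and the base cases are my own assumptions):

```python
def decode_tree(a_layers, lo, hi, layer, threshold=0.8, min_layer=3):
    """Recursively split the span [lo, hi) at its weakest link.
    a_layers[l][k]: probability (NumPy array per layer) that words k and k+1
    belong to the same phrase at layer l; higher l = larger phrases."""
    if hi - lo <= 1:
        return list(range(lo, hi))
    while layer >= min_layer:
        links = a_layers[layer][lo:hi - 1]
        if links.min() <= threshold:   # a credible split point exists at this layer
            split = lo + int(links.argmin())
            left = decode_tree(a_layers, lo, split + 1, layer, threshold, min_layer)
            right = decode_tree(a_layers, split + 1, hi, layer, threshold, min_layer)
            return [left, right]
        layer -= 1                     # all links too strong: try a lower layer
    return list(range(lo, hi))         # no credible split: the span is one phrase

# usage: tree = decode_tree(a_layers, 0, n, layer=len(a_layers) - 1)
```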
Experiments
First, unsupervised parsing results on the WSJ test set:
Tree-Transformer still does better than the earlier ON-LSTM and PRPN. It is slightly worse than DIORA trained on NLI, which is understandable: DIORA's training set is larger, it uses global decoding, and it even reaches the performance of URNNG. The best result is obtained with a 10-layer model.
Then there are the unsupervised parsing results on the WSJ10 test set:
On these very short sentences, Tree-Transformer is actually worse than PRPN and roughly on par with ON-LSTM. The paper does not analyze the reason, and does not even mention it.
Then there are unsupervised parsing results using different layers:
Recursing down to layer 3 at the lowest gives the best results. Using fewer layers, i.e., looking only at the higher layers, performs very poorly, and looking at a single layer alone is also poor. This suggests that higher-layer representations are more abstract and not well suited to representing syntactic information, while the lowest layers are too close to the word level and carry mostly surface information. This is consistent with a recent paper analyzing what attention in BERT means, which finds that middle-layer attention captures syntactic information.
Finally, the perplexity results of the language model:
Here the comparison is only against an ordinary Transformer, and Tree-Transformer still does better. Because a masked LM objective has to be used, the numbers cannot be compared with ON-LSTM, PRPN, and other autoregressive language models.
The paper also discusses the interpretability of attention in more detail, which I do not find very meaningful; attention interpretability has been controversial recently, and forced interpretations add little.
To summarize, Tree Transformer uses a constituent prior to express the probability that two words belong to the same phrase, and combines it with self-attention to determine the attention between the two words. A syntax-tree decoding algorithm is also proposed, but some problems remain.
The authors reportedly tried pre-training Tree Transformer with a vanilla Transformer, which makes the loss drop lower and the fit better, but the decoded syntax trees become worse. This actually makes sense: an analysis paper I have read pointed out that training a good language model does not necessarily mean learning good syntax trees; the two cannot be equated.