Alibaba open-sources ESIM, a new-generation human-machine dialogue model: accuracy sets a new world record of 94.1%


2025-07-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

Alibaba AI recently open-sourced its new-generation human-machine dialogue model, the Enhanced Sequential Inference Model (ESIM). ESIM is an enhanced LSTM-based model designed for natural language inference. According to Alibaba, since it was proposed in 2017 the model has been cited more than 200 times in papers from Google, Facebook, and other international research groups. It also took first place on both datasets of the Dialog System Technology Challenge (DSTC7), a leading international evaluation for dialogue systems, and raised the world record for human-machine dialogue accuracy to 94.1%.

ESIM has broad application prospects in intelligent customer service, navigation software, smart speakers, and similar scenarios. Alibaba AI published a paper describing the model, which Lei Feng.com AI Technology Review has compiled as follows.

Background

Human-machine dialogue systems, with their great potential and commercial value, are attracting more and more attention. With the recent introduction of deep learning models, the chances of successfully building an end-to-end dialogue system have improved. Building such a system is still challenging, however: it must remember and understand the text of a multi-turn dialogue, rather than consider only the current utterance as a single-turn system does.

Approaches to modeling multi-turn dialogue can be divided into generation-based methods and retrieval-based methods. A retrieval-based method selects the best response from a candidate pool for the multi-turn context, which can be viewed as a multi-turn response selection task. Typical response selection methods are either sequence-based or hierarchy-based. Sequence-based methods usually concatenate the context utterances into one long sequence, while hierarchy-based methods usually model each utterance separately and then model the interactions among utterances.

Some recent work has reported that hierarchy-based methods combined with complex neural networks achieve significant gains over sequence-based methods. In this paper, we nevertheless study a sequence-based method, namely the Enhanced Sequential Inference Model (ESIM), which was originally developed for the natural language inference (NLI) task.

In the DSTC7 response selection challenge, our model ranked first on both datasets (Advising and Ubuntu). In addition, it outperforms all previous models on two large-scale public benchmark datasets (including Lowe's Ubuntu), among them the state-of-the-art hierarchy-based models mentioned above. Our open source code is available at https://github.com/alibaba/ESIM.

Hierarchy-based methods usually use an additional neural network to model the relationships among the turns of a dialogue. They need to segment and truncate the utterances of a multi-turn dialogue so that every utterance has the same length, no longer than a maximum length. In real tasks, however, utterance lengths vary widely. With a large maximum length, hierarchy-based methods need a large amount of zero padding, which greatly increases computational complexity and memory cost; with a small maximum length, important information in the multi-turn context may be lost.

We propose using the sequence-based ESIM model for multi-turn response selection to address these problems of hierarchy-based methods. The method concatenates the multi-turn context into one long sequence and converts multi-turn response selection into binary classification of a sentence pair (that is, whether the candidate sentence is the next utterance of the current context).

Compared with hierarchy-based methods, ESIM has two main advantages. First, because ESIM does not require every utterance to have the same length, it needs less zero padding and can be more computationally efficient. Second, ESIM models the interactions among utterances implicitly but effectively, without an additional complex network structure, as described in the "Model description" section below.

Task description

The Dialog System Technology Challenge (DSTC7) is divided into three tracks, and our approach targets the "end-to-end response selection" track. This track focuses on multi-turn, goal-oriented dialogue, and the task is to select the correct response from a set of candidates. Participating systems may not use hand-crafted or rule-based data; they must use the Ubuntu and Advising datasets provided by the organizers, which are described in detail in the experiment section.

The "end-to-end response selection" track provides a series of subtasks with a similar structure that differ in the output and in the information available to the dialogue system. In Figure 1, "√" indicates that a subtask is evaluated on the marked dataset, and "×" indicates that it is not evaluated on that dataset.

Figure 1 Task description

Model description

The multi-turn response selection task is to select the next utterance from a candidate pool, given a multi-turn context. We convert the problem into binary classification: given a multi-turn context and a candidate response, the model only needs to decide whether the candidate is the correct next utterance. In this section, we introduce the Enhanced Sequential Inference Model (ESIM), which was originally developed for natural language inference. The model consists of three main components, namely input encoding (Input Encoding), local matching (Local Matching), and matching composition (Matching Composition), as shown in Figure 2.

Fig. 2 sentence pair classification based on attention mechanism

Input encoding

The input encoding stage encodes the dialogue and represents each token in its context. Unlike hierarchy-based methods, which encode the dialogue through a complex hierarchical structure, ESIM encodes it simply. First, the multi-turn context is concatenated into one long sequence, denoted $c = (c_1, \dots, c_m)$; the candidate response is denoted $r = (r_1, \dots, r_n)$. Pre-trained word embeddings $E \in \mathbb{R}^{d_e \times |V|}$ (where $|V|$ is the vocabulary size and $d_e$ is the embedding dimension) then convert $c$ and $r$ into two vector sequences $[E(c_1), \dots, E(c_m)]$ and $[E(r_1), \dots, E(r_n)]$. Many kinds of pre-trained word embeddings exist, and here we propose a way to use several of them at once: given $k$ pre-trained embeddings $E_1, \dots, E_k$, we concatenate all embeddings of token $i$, that is, $E(c_i) = [E_1(c_i); \dots; E_k(c_i)]$, and then use a feed-forward layer with ReLU to reduce the dimension from $(d_{e_1} + \dots + d_{e_k})$ to $d_h$.
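
The multi-embedding step can be illustrated with a minimal NumPy sketch. The vocabulary, the two random tables standing in for pre-trained embeddings, the dimensions, and the name `encode_tokens` are all assumptions made for illustration; the tables are stored row-per-token for convenience, and the projection weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["how", "do", "i", "install", "vim", "__eou__"]
# Hypothetical stand-ins for k = 2 pre-trained embedding tables (e.g. GloVe, fastText).
dims = [4, 3]                                  # d_e1, d_e2
tables = [rng.normal(size=(len(vocab), d)) for d in dims]

d_h = 5                                        # target dimension after projection
W = rng.normal(size=(sum(dims), d_h)) * 0.1    # untrained feed-forward weights
b = np.zeros(d_h)

def encode_tokens(tokens):
    """Concatenate all k embeddings of each token, then project with ReLU."""
    ids = [vocab.index(t) for t in tokens]
    concat = np.concatenate([tab[ids] for tab in tables], axis=1)  # (len, d_e1 + d_e2)
    return np.maximum(concat @ W + b, 0.0)                         # (len, d_h)

context = encode_tokens(["how", "do", "i", "install", "vim", "__eou__"])
print(context.shape)  # (6, 5)
```

Each token ends up with a single $d_h$-dimensional vector regardless of how many embedding tables are combined, which is what the BiLSTM encoder then consumes.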

To represent each token in its context, we feed the context and the response into a BiLSTM encoder to obtain the context-dependent hidden states $c^s$ and $r^s$:

$$c_i^s = \mathrm{BiLSTM}_1(E(c), i), \qquad r_j^s = \mathrm{BiLSTM}_1(E(r), j)$$

where $i$ and $j$ index the $i$-th token in the context and the $j$-th token in the response, respectively.

Local matching

Modeling the local semantic relationship between the context and the response is the key component for deciding whether a response is correct, because a correct response usually relates to some keywords in the context, and this can be captured by modeling local semantic relations. Instead of directly encoding the context and the response as two dense vectors, we use a cross-attention mechanism to align the tokens of the context with those of the response, and then compute the semantic relationship at the token level. The attention weights are computed as follows:

$$e_{ij} = (c_i^s)^\top r_j^s$$

Soft alignment captures the local relevance between the context and the response, and is computed from the attention matrix $e \in \mathbb{R}^{m \times n}$ in the equation above. Then, for the hidden state of the $i$-th token in the context, $c_i^s$ (which already encodes the token itself and its contextual meaning), the relevant semantics in the candidate response are identified as a vector $c_i^d$, called the dual vector here, which is a weighted combination of all the response states:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}, \qquad c_i^d = \sum_{j=1}^{n} \alpha_{ij}\, r_j^s$$

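where $\alpha \in \mathbb{R}^{m \times n}$ and $\beta \in \mathbb{R}^{m \times n}$ are the attention weight matrices normalized along axis 2 and axis 1, respectively. We perform the analogous computation for the hidden state $r_j^s$ of each token in the response:

$$\beta_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{m} \exp(e_{kj})}, \qquad r_j^d = \sum_{i=1}^{m} \beta_{ij}\, c_i^s$$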

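By comparing the vector pair $\langle c_i^s, c_i^d \rangle$, we can model the token-level semantic relationship between aligned token pairs. The same applies to the vector pair $\langle r_j^s, r_j^d \rangle$. We collect the local matching information as follows:

$$c_i^l = F([c_i^s;\, c_i^d;\, c_i^s - c_i^d;\, c_i^s \odot c_i^d]), \qquad r_j^l = F([r_j^s;\, r_j^d;\, r_j^s - r_j^d;\, r_j^s \odot r_j^d])$$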

Here, a heuristic matching approach based on the difference and the element-wise product yields the local matching vectors $c_i^l$ and $r_j^l$ for the context and the response, respectively. $F$ is a one-layer feed-forward neural network with ReLU, used to reduce the dimension.
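
The local matching step can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the released implementation: the hidden states are random stand-ins for BiLSTM outputs, the attention score is taken to be a plain dot product, the feed-forward layer F is omitted, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

def local_matching(c_s, r_s):
    """c_s: (m, d) context states, r_s: (n, d) response states."""
    e = c_s @ r_s.T                    # (m, n) attention matrix (dot-product scores)
    alpha = softmax(e, axis=1)         # normalise over response tokens
    beta = softmax(e, axis=0)          # normalise over context tokens
    c_d = alpha @ r_s                  # (m, d) dual vectors for the context
    r_d = beta.T @ c_s                 # (n, d) dual vectors for the response
    # Heuristic matching: original, dual, difference, element-wise product.
    c_feat = np.concatenate([c_s, c_d, c_s - c_d, c_s * c_d], axis=1)   # (m, 4d)
    r_feat = np.concatenate([r_s, r_d, r_s - r_d, r_s * r_d], axis=1)   # (n, 4d)
    return alpha, beta, c_feat, r_feat

rng = np.random.default_rng(0)
c_s = rng.normal(size=(5, 4))   # m = 5 context tokens, d = 4
r_s = rng.normal(size=(3, 4))   # n = 3 response tokens
alpha, beta, c_feat, r_feat = local_matching(c_s, r_s)
print(c_feat.shape, r_feat.shape)  # (5, 16) (3, 16)
```

Note that the context and the response keep their own lengths throughout; nothing is padded to a shared utterance length, which is the efficiency argument made for the sequence-based approach.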

Matching composition

Matching composition works as follows. To determine whether the response is the next utterance of the current context, we add a composition layer that composes the local matching vectors ($c^l$ and $r^l$) obtained above:

$$c_i^v = \mathrm{BiLSTM}_2(c^l, i), \qquad r_j^v = \mathrm{BiLSTM}_2(r^l, j)$$

We again use BiLSTMs as the building block of the composition layer, but their role here is completely different from that in the input encoding layer. The BiLSTM here reads the local matching vectors ($c^l$ and $r^l$) and learns to pick out the critical local matching vectors that determine the overall relationship between the context and the response.

The output hidden vectors of BiLSTM2 are converted into a fixed-length vector by pooling and fed to the final classifier, which determines the overall relationship. Max pooling and mean pooling are both applied and their results concatenated into one fixed-length vector. This vector is fed into a multilayer perceptron (MLP) classifier with one hidden layer, tanh activation, and a softmax output layer:

$$y = \mathrm{MLP}([c^v_{\max};\, c^v_{\mathrm{mean}};\, r^v_{\max};\, r^v_{\mathrm{mean}}])$$

The entire ESIM model is trained end to end by minimizing the cross-entropy loss.
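
A small NumPy sketch of the pooling and classification step may help make the shapes concrete. The inputs are random stand-ins for the BiLSTM2 outputs, the weights are untrained, and the names `pool` and `classify` are hypothetical.

```python
import numpy as np

def pool(v):
    """Max- and mean-pool a (length, d) sequence into one 2d-dimensional vector."""
    return np.concatenate([v.max(axis=0), v.mean(axis=0)])

def classify(c_v, r_v, W1, b1, W2, b2):
    x = np.concatenate([pool(c_v), pool(r_v)])  # fixed length regardless of m and n
    h = np.tanh(x @ W1 + b1)                    # one hidden layer with tanh
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())
    return p / p.sum()                          # softmax over the two classes

rng = np.random.default_rng(0)
d = 4
c_v = rng.normal(size=(7, d))   # stand-in for composed context vectors
r_v = rng.normal(size=(3, d))   # stand-in for composed response vectors
W1, b1 = rng.normal(size=(4 * d, 6)), np.zeros(6)
W2, b2 = rng.normal(size=(6, 2)), np.zeros(2)
probs = classify(c_v, r_v, W1, b1, W2, b2)
print(probs.shape)  # (2,)
```

Because pooling collapses the sequence axis, the classifier input has size 4d whether the context holds 7 tokens or 400.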

Sentence-encoding based method

For subtask 2 of the Ubuntu dataset, the next utterance must be selected from a candidate pool of 120,000 sentences. Applying the cross-attention based ESIM model directly would be computationally unacceptable. Instead, we first use a sentence-encoding based method to select the top 100 candidates from the 120,000 sentences, and then rerank them with ESIM, which also proves effective.
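
The two-stage idea (cheap shortlist, expensive rerank) can be sketched as follows. The two scoring functions are toy stand-ins chosen only so the example runs: word overlap plays the role of the sentence encoder, and overlap with a length penalty plays the role of ESIM. All names here are hypothetical.

```python
def retrieve_then_rerank(context, candidates, cheap_score, expensive_score, k=100):
    """Shortlist k candidates with a cheap scorer, then reorder them with a costly one."""
    shortlist = sorted(candidates, key=lambda r: cheap_score(context, r), reverse=True)[:k]
    return sorted(shortlist, key=lambda r: expensive_score(context, r), reverse=True)

# Toy scorers, purely illustrative.
def overlap(context, reply):
    return len(set(context.split()) & set(reply.split()))

def overlap_with_length(context, reply):
    return overlap(context, reply) - 0.1 * abs(len(context.split()) - len(reply.split()))

pool = ["try sudo apt install vim", "install vim", "what is your name", "apt is a package tool"]
ranked = retrieve_then_rerank("how do i install vim", pool, overlap, overlap_with_length, k=2)
print(ranked)  # ['try sudo apt install vim', 'install vim']
```

The expensive scorer is called only k times per context instead of once per candidate in the pool, which is where the saving comes from.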

The sentence-encoding based method uses the Siamese architecture shown in Figure 3. It applies neural networks with tied parameters to encode the context and the response, and then uses a neural network classifier to decide the relationship between the two sentences. Here, we encode sentences with a BiLSTM followed by multi-head self-attention pooling, and classify with an MLP.

Fig. 3 sentence pair classification based on sentence coding

We use the same input encoding process as ESIM. To transform a variable-length sentence into a fixed-length vector representation, we use a weighted summation of all BiLSTM hidden vectors ($H$):

$$A = \mathrm{softmax}(W_2\, \mathrm{ReLU}(W_1 H^\top + b_1) + b_2)^\top$$

Here $W_1 \in \mathbb{R}^{d_a \times 2d_h}$ and $W_2 \in \mathbb{R}^{d_m \times d_a}$ are weight matrices; $b_1 \in \mathbb{R}^{d_a}$ and $b_2 \in \mathbb{R}^{d_m}$ are biases; $d_a$ is the dimension of the attention network and $d_h$ is the dimension of the BiLSTM. $H \in \mathbb{R}^{T \times 2d_h}$ holds the hidden vectors of the BiLSTM, where $T$ is the sequence length. $A \in \mathbb{R}^{T \times d_m}$ is the multi-head attention weight matrix, where $d_m$ is the number of heads, a hyperparameter tuned on the held-out set. Instead of max pooling or mean pooling, we sum the BiLSTM hidden states $H$ according to the weight matrix $A$ to get the vector representation of the input sentence:

$$V = A^\top H$$

The matrix $V \in \mathbb{R}^{d_m \times 2d_h}$ can be flattened into a vector representation $v \in \mathbb{R}^{2 d_m d_h}$. To strengthen the relationship between the sentence pair, as in ESIM, we concatenate the embeddings of the two sentences with their absolute difference and element-wise product as the input to the MLP classifier:

$$y = \mathrm{MLP}([v_c;\, v_r;\, |v_c - v_r|;\, v_c \odot v_r])$$
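
The attention pooling and the pair features can be sketched in NumPy as below. This is a sketch under assumptions: the attention network is taken to be a two-layer form with ReLU, the parameter shapes and the names `attention_pool` and `pair_features` are hypothetical, the hidden states are random, and the same (untrained) parameters encode both sentences to mimic the Siamese setup.

```python
import numpy as np

def attention_pool(H, W1, b1, W2, b2):
    """H: (T, 2*d_h) BiLSTM states -> flattened (d_m * 2*d_h,) sentence vector."""
    scores = W2 @ np.maximum(W1 @ H.T + b1[:, None], 0.0) + b2[:, None]   # (d_m, T)
    scores = scores - scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A = (A / A.sum(axis=1, keepdims=True)).T      # (T, d_m); each head sums to 1 over T
    V = A.T @ H                                   # (d_m, 2*d_h), one row per head
    return V.reshape(-1), A

def pair_features(v_c, v_r):
    """Concatenate both embeddings, their absolute difference, and their product."""
    return np.concatenate([v_c, v_r, np.abs(v_c - v_r), v_c * v_r])

rng = np.random.default_rng(0)
d_h, d_a, d_m = 2, 3, 2
W1, b1 = rng.normal(size=(d_a, 2 * d_h)), np.zeros(d_a)
W2, b2 = rng.normal(size=(d_m, d_a)), np.zeros(d_m)

H_c = rng.normal(size=(6, 2 * d_h))   # context with T = 6
H_r = rng.normal(size=(4, 2 * d_h))   # response with T = 4
v_c, A_c = attention_pool(H_c, W1, b1, W2, b2)   # shared parameters for both sides
v_r, A_r = attention_pool(H_r, W1, b1, W2, b2)
features = pair_features(v_c, v_r)
print(v_c.shape, features.shape)  # (8,) (32,)
```

Since each sentence is encoded independently of the other, candidate vectors can be computed once and reused across contexts, which is what makes this method affordable for a very large candidate pool.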

The MLP has a ReLU activation layer, shortcut connections, and a softmax output layer, and the whole model is trained end to end by minimizing the cross-entropy loss.

Experiments

Datasets

We evaluated our model on the two datasets of the DSTC7 end-to-end response selection track, namely the Ubuntu and Advising datasets. To compare with previous methods, we also evaluated it on two large-scale public response selection benchmark datasets, namely Lowe's Ubuntu dataset and the E-commerce dataset.

Ubuntu dataset. The Ubuntu dataset consists of two-party conversations from the Ubuntu Internet Relay Chat (IRC) channel. In this challenge, each dialogue contains more than three turns, and the system is asked to select the next utterance from a given set of candidate sentences; Linux manual pages are provided to participants as external knowledge. We use a data augmentation strategy similar to the one proposed by Lowe: each utterance (starting from the second) is treated as a potential response, and the preceding utterances as its context, so a dialogue of length 10 yields nine training samples. To train the binary classifier, we need to sample negative (incorrect) responses from the candidate pool. Initially we used a 1:1 ratio of positive to negative responses to balance the samples; later we found that more negative responses, such as 1:4 or 1:9, improved the results. For efficiency, the final configuration uses a 1:4 positive-to-negative ratio for all subtasks except subtask 2, which uses 1:1.
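
The augmentation and negative sampling described above can be sketched as follows. The function name, the toy dialogue, and the distractor pool are hypothetical; the real sampling procedure may differ in detail.

```python
import random

def make_training_pairs(dialogue, candidate_pool, num_negatives=4, seed=0):
    """Treat every utterance from the second onward as a positive response to what precedes it."""
    rng = random.Random(seed)
    samples = []
    for t in range(1, len(dialogue)):
        context, positive = dialogue[:t], dialogue[t]
        samples.append((context, positive, 1))
        negatives = [c for c in candidate_pool if c != positive]
        for neg in rng.sample(negatives, num_negatives):   # 1:4 positive-to-negative ratio
            samples.append((context, neg, 0))
    return samples

dialogue = [f"utterance {i}" for i in range(10)]
pool = [f"distractor {i}" for i in range(20)]
samples = make_training_pairs(dialogue, pool)
positives = sum(label for _, _, label in samples)
print(positives, len(samples) - positives)  # 9 36
```

A 10-utterance dialogue gives nine positive samples, and with four negatives each the classifier sees 36 negative pairs for the same dialogue.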

Advising dataset. The Advising dataset consists of two-party conversations that simulate discussions between a student and an academic advisor. Structured information is provided as a database, including course information and personas; the data also includes paraphrases of the sentences and of the target responses. With a data augmentation strategy similar to the one used for the Ubuntu dataset, applied to the original dialogues and their paraphrases, the ratio between positive and negative responses is 1:4.33.

Lowe's Ubuntu dataset. This dataset is similar to the DSTC7 Ubuntu data. The training set contains one million context-response pairs with a 1:1 ratio of positive to negative responses. In the development and test sets, each context is paired with one positive response and nine negative responses.

E-commerce dataset. The E-commerce dataset is collected from real conversations between customers and customer service staff on Taobao, China's largest e-commerce platform. The ratio of positive to negative responses is 1:1 in the training and development sets and 1:9 in the test set.

Training details

We use spaCy to tokenize the text of the two DSTC7 datasets, and use the original tokenized text of the two public datasets without further preprocessing. The multi-turn context is then concatenated, with two special tokens inserted: __eou__, which marks the end of an utterance, and __eot__, which marks the end of a turn.
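
A minimal sketch of the concatenation step is shown below. The spellings `__eou__` and `__eot__`, the reading of the second token as an end-of-turn marker, the nested-list input format, and the function name are assumptions for illustration.

```python
def flatten_dialogue(turns):
    """turns: list of turns, each a list of tokenised utterances (lists of tokens)."""
    tokens = []
    for turn in turns:
        for utterance in turn:
            tokens.extend(utterance)
            tokens.append("__eou__")   # end of utterance
        tokens.append("__eot__")       # end of turn
    return tokens

turns = [
    [["hi"], ["my", "wifi", "is", "down"]],   # speaker A, two utterances in one turn
    [["which", "ubuntu", "version", "?"]],    # speaker B
]
flat = flatten_dialogue(turns)
print(" ".join(flat))
# hi __eou__ my wifi is down __eou__ __eot__ which ubuntu version ? __eou__ __eot__
```

The markers let a single flat sequence still carry the utterance and turn boundaries that a hierarchical model would represent structurally.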

Hyperparameters are tuned on the development set. We use GloVe and fastText as pre-trained word embeddings. For subtask 5 of the Ubuntu dataset, we additionally train word embeddings with word2vec on the provided Linux man pages. Details are shown in Figure 4.

Fig. 4 Statistics of the pre-trained word embeddings. Rows 1-3 are from GloVe, rows 4-5 from fastText, and row 6 from word2vec.

Note that for subtask 5 of the Advising dataset, we tried using the suggested course information as external knowledge but observed no improvement; we therefore submitted the Advising results without any external knowledge. For Lowe's Ubuntu and the E-commerce datasets, we pre-train word embeddings with word2vec on the training data. The pre-trained embeddings are kept fixed when training on the two DSTC7 datasets, but are fine-tuned on Lowe's Ubuntu and the E-commerce datasets.

Adam is used for optimization. The initial learning rate is 0.0002 for Lowe's Ubuntu dataset and 0.0004 for the rest. The mini-batch size is 128 for the DSTC7 datasets, 16 for Lowe's Ubuntu dataset, and 32 for the E-commerce dataset. The hidden size of the BiLSTMs and the MLP is 300.

To keep sequences within the maximum length, we truncate the response by cutting off its final tokens, but truncate the context in the reverse direction, cutting off its earliest tokens; we assume that the last few utterances of the context matter more than the earlier ones. For Lowe's Ubuntu dataset, the maximum lengths of the context and response sequences are 400 and 150, respectively; for the E-commerce dataset they are 300 and 50; for the remaining datasets they are 300 and 30.
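
The asymmetric truncation can be expressed with two slices. The function name is hypothetical, and the defaults are the 300 and 30 limits quoted above for the DSTC7 datasets.

```python
def truncate_pair(context_tokens, response_tokens, max_context=300, max_response=30):
    """Keep the LAST max_context context tokens and the FIRST max_response response tokens."""
    return context_tokens[-max_context:], response_tokens[:max_response]

context = [f"c{i}" for i in range(500)]
response = [f"r{i}" for i in range(40)]
ctx, rsp = truncate_pair(context, response)
print(len(ctx), ctx[0], ctx[-1])   # 300 c200 c499
print(len(rsp), rsp[0], rsp[-1])   # 30 r0 r29
```

The context loses its oldest 200 tokens while the most recent utterances survive intact, matching the assumption that recent turns matter most.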

Further details: for subtask 2 of DSTC7 Ubuntu, we use a BiLSTM hidden size of 400 and 4 attention heads for the sentence-encoding method. For subtask 4, the candidate pool may not contain the correct next utterance, so a threshold is needed: when the probability of the positive label is below the threshold, we predict that the candidate pool contains no correct next utterance. The threshold is selected on the development set from [0.50, 0.51, ..., 0.99].
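
Threshold selection on a grid can be sketched as below. The selection criterion is an assumption: this sketch simply maximises the accuracy of the "pool contains the answer" decision on toy development data, whereas the actual criterion used for the submission is not stated. The function name and the data are hypothetical.

```python
def pick_threshold(dev_examples):
    """dev_examples: list of (top_probability, pool_has_answer) pairs from the development set."""
    grid = [round(0.50 + 0.01 * i, 2) for i in range(50)]   # 0.50, 0.51, ..., 0.99

    def accuracy(th):
        return sum((p >= th) == has for p, has in dev_examples) / len(dev_examples)

    return max(grid, key=accuracy)   # first grid point with the best accuracy

dev = [(0.97, True), (0.91, True), (0.88, False), (0.62, False), (0.95, True), (0.55, False)]
best = pick_threshold(dev)
print(best)  # 0.89
```

At test time, a dialogue whose best candidate scores below `best` would be answered with "no correct response in the pool".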

Results

Figure 5 summarizes the results on all DSTC7 response selection subtasks. The challenge ranking uses the average of recall@10 and mean reciprocal rank (MRR, a standard metric for retrieval). On the Advising dataset, test case 1 (advising1) has some dependency on the training data, so the ranking is based on test case 2 (advising2). Our results rank first on seven subtasks and second on Ubuntu subtask 2, and first overall on both datasets of the DSTC7 response selection challenge. Subtask 3 may contain multiple correct responses, so mean average precision (MAP) is used as an additional metric.
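
For reference, the two ranking metrics and their average can be computed as follows. The ranks are made-up toy values, and the function names are hypothetical.

```python
def recall_at_k(ranks, k=10):
    """ranks: 1-based rank of the correct response for each test dialogue."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 2, 4, 20]                 # toy example: the fourth dialogue misses the top 10
r10 = recall_at_k(ranks)              # 3 of 4 within the top 10 -> 0.75
mrr = mean_reciprocal_rank(ranks)     # (1 + 0.5 + 0.25 + 0.05) / 4, about 0.45
challenge_score = (r10 + mrr) / 2     # about 0.6
print(r10, round(mrr, 4), round(challenge_score, 4))
```

Recall@10 only asks whether the correct response is in the top ten, while MRR rewards placing it as high as possible, so averaging the two balances coverage against precise ranking.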

Figure 5 Submission results on the hidden test sets of the DSTC7 response selection challenge. NA means not applicable. There are 8 test conditions in total.

Ablation analysis

For the Ubuntu and Advising datasets, the ablation analysis is shown in Figures 6 and 7, respectively. On Ubuntu subtask 1, ESIM reaches an R@10 of 0.854 and an MRR of 0.6401. If we remove the local matching and matching composition of the context to speed up training ("-CtxDec"), R@10 and MRR drop to 0.845 and 0.6210, respectively; if we additionally truncate the context by discarding its last words rather than its first words ("-CtxDec & -Rev"), R@10 and MRR fall further to 0.840 and 0.6174.

Ensembling the three models above by averaging the outputs of models with different parameter initializations and different structures ("Ensemble") gives an R@10 of 0.887 and an MRR of 0.6790. For Ubuntu subtask 2, the sentence-encoding method ("Sent-based") achieves an R@10 of 0.082 and an MRR of 0.0416. Ensembling several models with different parameter initializations ("Ensemble1") raises R@10 and MRR to 0.091 and 0.0475. Reranking the top 100 candidates predicted by "Ensemble1" with ESIM gives an R@10 of 0.125 and an MRR of 0.0713. Removing the local matching and matching composition of the context ("-CtxDec") lowers R@10 and MRR to 0.117 and 0.0620. Ensembling the two ESIM variants above ("Ensemble2") yields an R@10 of 0.134 and an MRR of 0.0770.

For Ubuntu subtask 4, we observe a trend similar to subtask 1. ESIM reaches 0.887 R@10 and 0.6434 MRR; with "-CtxDec", performance drops to 0.877 R@10 and 0.6277 MRR; with "-CtxDec & -Rev", it drops further to 0.875 R@10 and 0.6212 MRR. The ensemble of the three models ("Ensemble") reaches 0.909 R@10 and 0.6771 MRR.

For Ubuntu subtask 5, the dataset is the same as in subtask 1 except that the Linux man pages are available as external knowledge. Adding word embeddings pre-trained on the Linux man pages ("+W2V") gives 0.858 R@10 and 0.6394 MRR, comparable to ESIM without external knowledge. Ensembling the subtask 1 ensemble model (0.887 R@10 and 0.6790 MRR) with the "+W2V" model brings a further gain, to 0.890 R@10 and 0.6817 MRR.

Fig. 6 Development set ablation analysis of Ubuntu dataset in DSTC7

Figure 7 shows the development set ablation analysis for the Advising dataset in DSTC7. We use ESIM without the context's local matching and matching composition for computational efficiency, and observe a trend similar to the Ubuntu dataset: "-CtxDec & -Rev" lowers R@10 and MRR compared with "-CtxDec". Overall, the ensemble of the two models yields a significant gain over either single model.

Fig. 7 Development set ablation analysis of Advising dataset in DSTC7

Comparison with previous work

Figure 8 summarizes the results on the two public response selection benchmark datasets. The first group of models consists of sentence-encoding methods, which use hand-crafted features or neural networks to encode the response and the context, and then apply a cosine classifier or an MLP classifier to decide the relationship between the two sequences. Previous work used TF-IDF, RNN, CNN, LSTM, and BiLSTM to encode contexts and responses.

Figure 8 Comparison of different models on two large public benchmark datasets. All results except ours are taken from previous work.

The second group consists of sequence-based matching models, which usually use attention mechanisms, including MV-LSTM, Matching-LSTM, Attentive-LSTM, and Multi-Channels. These models compare the context and the response at the token level, rather than directly comparing two dense vectors as the sentence-encoding methods do, and they outperform the first group.

The third group consists of more complex hierarchy-based models, which usually model token-level and utterance-level information explicitly. The Multi-View model uses utterance relations from both a word sequence view and an utterance sequence view. The DL2R model uses neural networks to reformulate the last utterance with other utterances in the context. The SMN model uses CNNs and attention to match the response against each utterance in the context. The DUA and DAM models follow a framework similar to SMN; the former improves on it with gated self-attention and the latter with the Transformer structure.

Although previous hierarchy-based work claimed to achieve state-of-the-art performance by exploiting the hierarchical information of multi-turn contexts, our ESIM sequential matching model outperforms all previous models, including the hierarchy-based ones. On Lowe's Ubuntu dataset, ESIM improves markedly on the previous best results, held by the DAM model: R@1 rises to 79.6% (from 76.7%) and R@2 to 89.4% (from 87.4%), and R@5 also improves on DAM's 96.9%. On the E-commerce dataset, ESIM likewise improves substantially on the previous state of the art, held by the DUA model: R@1 rises to 57.0% (from 50.1%), R@2 to 76.7% (from 70.0%), and R@5 to 94.8% (from 92.1%). These results demonstrate the effectiveness of ESIM, a sequential matching method, for multi-turn response selection.

Conclusion

Previous state-of-the-art multi-turn response selection models use hierarchy-based (utterance-level and token-level) neural networks to explicitly model the interactions among the different turns of a dialogue. In this paper, however, we have shown that a sequence-based sequential matching model can outperform all previous models, including the state-of-the-art hierarchy-based methods. This suggests that the potential of sequential matching had not been fully exploited in the past. Notably, the model ranked first on both datasets in the DSTC7 end-to-end response selection challenge and achieved the best performance on two large-scale public response selection benchmark datasets. In future work on multi-turn response selection, we will also explore the effectiveness of external knowledge such as knowledge graphs and user profiles.

Links to papers:

https://arxiv.org/abs/1901.02609

Open source address:

https://github.com/alibaba/esim-response-selection

Lei Feng AI Science and Technology Review

https://www.leiphone.com/news/201908/ERJ2iCZEz8muSUlv.html

