How to understand BiLSTM and CRF algorithm

2025-01-18 Update From: SLTechnology News&Howtos


This article introduces the relevant knowledge of "how to understand the BiLSTM and CRF algorithm". Many people run into difficulties with this in practice, so let's work through these situations step by step. I hope you read carefully and come away with something!

1. Preface

Given a Chinese sentence meaning "what is the street-stall economy", the correct word segmentation is "what / is / street-stall / economy". Since the corresponding Chinese words have 2, 1, 2, and 2 characters, the segmentation labels for the characters are "be / s / be / be", where b marks the beginning of a multi-character word, e its end, and s a single-character word. From the picture below, you can see the problem LSTM has when doing sequence tagging.

BiLSTM word segmentation

BiLSTM can predict the probability of each label at every position and then use Softmax to pick the label with the highest probability as the prediction for that position. Predicting this way ignores the correlation between labels: in the picture above, BiLSTM predicts the first character as s and the second as e, but in word segmentation e can never follow s. BiLSTM alone does not consider the relationship between labels.
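A tiny illustration of the problem (the probabilities below are made up, not model output): taking the argmax independently at each position can produce a pair like s followed by e, which is impossible in segmentation.

```python
# Hypothetical per-position tag probabilities from a BiLSTM softmax.
tags = ["b", "m", "e", "s"]
probs = [
    [0.1, 0.1, 0.2, 0.6],  # position 1: argmax is "s"
    [0.2, 0.1, 0.5, 0.2],  # position 2: argmax is "e"
]
# Independent argmax at each position, with no transition constraint:
pred = [tags[max(range(len(tags)), key=lambda i: p[i])] for p in probs]
print(pred)  # ['s', 'e'] -- an invalid sequence: "e" cannot follow "s"
```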

Therefore, BiLSTM+CRF adds a CRF on top of the BiLSTM output layer so that the model can take the correlation between labels into account. This correlation is the transition matrix of the CRF, which represents the probability of transitioning from one state to another. Assume the transition matrix of the CRF is as shown in the following figure.

CRF state transition matrix

For the first two characters, the probability of the label sequence "se" is 0.8 × 0 × 0.7 = 0, while the probability of the label sequence "be" is 0.6 × 0.5 × 0.7 = 0.21.
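The same arithmetic as a quick check (the 0.8/0.6 emission values and 0/0.5 transition values are the figure's illustrative numbers):

```python
# P(path) = emit(char1 -> tag1) * trans(tag1 -> tag2) * emit(char2 -> tag2)
p_se = 0.8 * 0.0 * 0.7   # "s" then "e": the s -> e transition has probability 0
p_be = 0.6 * 0.5 * 0.7   # "b" then "e"
print(p_se, p_be)
```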

Therefore, BiLSTM+CRF scores the probability of the entire label path rather than just the probability of each label in isolation. The figure below shows BiLSTM with a CRF added to its output layer.

BiLSTM+CRF word segmentation

In the end, among all the paths, "besbebe" (be / s / be / be) has the highest probability, so it is the predicted result.

2. BiLSTM+CRF model

CRF involves two feature functions; readers unfamiliar with CRF can read the previous article. The first is the state feature function, also known as the emission probability, which represents the probability that word x corresponds to label y.

CRF state feature function

In BiLSTM+CRF, this feature function (the emission probability) is computed directly from the output of the LSTM. As shown in the figure in the first section, the LSTM computes the probability of each tag at every position.

The second feature function of CRF is the state transition feature function, which represents the probability of transitioning from one state y1 to another state y2.

CRF state transition feature function

The state transition feature function of CRF can be expressed as a state transition matrix, whose entries are learned during training. Therefore, BiLSTM+CRF adds a state transition matrix to the BiLSTM model, as in the following code.

    class BiLSTM_CRF(nn.Module):
        def __init__(self, vocab_size, tag2idx, embedding_dim, hidden_dim):
            super(BiLSTM_CRF, self).__init__()
            self.tagset_size = len(tag2idx)
            self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
            self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2,
                                num_layers=1, bidirectional=True)
            # CRF emission probabilities: the probability of each tag at each position
            self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)
            # Transition matrix: square, with dimension equal to the number of tags,
            # giving the probability of transitioning from one tag to another
            self.transitions = nn.Parameter(torch.randn(len(tag2idx), len(tag2idx)))

The probability that the label sequence of a given sentence x is y is calculated by the following formula.

P(y | x) = exp(score(x, y)) / Z(x),  where Z(x) = Σ_{y'} exp(score(x, y'))

The score in the formula is calculated as follows, where Emit corresponds to the emission probability (that is, the probability output by the LSTM), and Trans corresponds to the transition probability (that is, the corresponding entry of the CRF transition matrix).

score(x, y) = Σ_i Emit(x_i → y_i) + Σ_i Trans(y_{i-1} → y_i)
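A pure-Python sketch of this score, using made-up log-scores (not values from the article's figures); for brevity, one extra index plays the role of both Start and Stop in the transition table:

```python
# emissions[i][t]: log-score of tag t at position i (from the LSTM)
# transitions[a][b]: log-score of moving from tag a to tag b
def path_score(emissions, transitions, tags, start, stop):
    score = transitions[start][tags[0]] + emissions[0][tags[0]]
    for i in range(1, len(tags)):
        score += transitions[tags[i - 1]][tags[i]] + emissions[i][tags[i]]
    score += transitions[tags[-1]][stop]
    return score

emissions = [[1.0, 0.5], [0.2, 0.8]]                          # 2 positions, 2 tags
transitions = [[0.1, 0.3, 0.0], [0.4, 0.2, 0.0], [0.5, 0.1, 0.0]]  # index 2 = Start/Stop
score = path_score(emissions, transitions, [0, 1], start=2, stop=2)
print(score)
```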

BiLSTM+CRF is trained by maximum likelihood method, and the corresponding loss function is as follows:

Loss = -log P(y | x) = log Z(x) - score(x, y)

Here score(x, y) is easy to compute, while Z(x) is the sum over all tag sequences y' of exp(score(x, y')). If the length of the sequence is l and the number of tags is k, then there are k^l sequences, so Z(x) cannot be computed by direct enumeration; instead, the forward algorithm is used.
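On a tiny example (k = 2 tags, l = 2 positions, made-up log-scores, Start/Stop omitted for brevity), brute-force enumeration of all k^l sequences is still feasible and shows what Z(x) is:

```python
import math
from itertools import product

emissions = [[1.0, 0.5], [0.2, 0.8]]
transitions = [[0.1, 0.3], [0.4, 0.2]]

def seq_score(tags):
    s = emissions[0][tags[0]]
    for i in range(1, len(tags)):
        s += transitions[tags[i - 1]][tags[i]] + emissions[i][tags[i]]
    return s

# Z(x) as an explicit sum over all 2^2 = 4 tag sequences
log_Z = math.log(sum(math.exp(seq_score(t)) for t in product(range(2), repeat=2)))
print(round(log_Z, 4))
```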

Using any current mainstream deep learning framework, the loss can be optimized by automatic differentiation and gradient descent. After the model is trained, we can use the Viterbi algorithm (dynamic programming) to find the optimal path.

3. Loss function calculation

The difficult part of the BiLSTM+CRF loss function is computing log Z(x); below we denote log Z(x) by F.

We split the score into the sum of the emission probabilities p and the transition probabilities T. To simplify the problem, assume the length of the sequence is 3; we can then compute the log Z values for lengths 1, 2, and 3, as shown below.

In the above formulas, p denotes the emission probability, T the transition probability, Start marks the beginning of the sentence, and End its end. F(3) is the final value of log Z(x). By rearranging the equations, F(3) can be written in recursive form, as follows.
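The recursion can be sketched in pure Python with made-up log-scores (Start/End transitions omitted for brevity); for each tag t, F_i[t] = log_sum_exp over prev of (F_{i-1}[prev] + T[prev][t]) + p_i[t]:

```python
import math

emissions = [[1.0, 0.5], [0.2, 0.8]]    # p: emission log-scores
transitions = [[0.1, 0.3], [0.4, 0.2]]  # T: transition log-scores

F = list(emissions[0])  # F(1)
for emit in emissions[1:]:
    # F(i)[t] = log_sum_exp_prev(F(i-1)[prev] + T[prev][t]) + p_i[t]
    F = [math.log(sum(math.exp(F[prev] + transitions[prev][t])
                      for prev in range(len(F)))) + emit[t]
         for t in range(len(F))]
log_Z = math.log(sum(math.exp(f) for f in F))
print(round(log_Z, 4))
```

With these numbers, the result agrees with the brute-force sum over all tag sequences, which is the point of the forward algorithm.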

You can see that each step of the recursion performs the same operation, log_sum_exp. Take F(1) as an example:

First, compute exp: for every y1, compute exp(p(y1) + T(Start, y1))

Then sum: add up the exp values obtained in the previous step

Finally, take the log of the sum
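These three steps can be packaged into a small helper; a minimal sketch, with the usual max-subtraction trick added for numerical stability (it prevents overflow without changing the result):

```python
import math

# log_sum_exp in three steps: exp, sum, log.
def log_sum_exp(values):
    m = max(values)  # subtract the max so the exponentials cannot overflow
    return m + math.log(sum(math.exp(v - m) for v in values))

print(log_sum_exp([1.0, 2.0, 3.0]))  # about 3.4076
```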

With this, the forward algorithm for computing log Z can be written as follows:

    def forward_algorithm(self, probs):
        """probs: probability values output by the LSTM, of size [seq_len, num_tags],
        where num_tags is the number of tags."""
        num_tags = self.tagset_size
        # forward_var (the F in this article) holds the values from the previous
        # step; it is a vector whose dimension equals num_tags.
        # Initially only Start is 0; all other entries take a very small value (-10000.)
        forward_var = torch.full((1, num_tags), -10000.0)  # [1, num_tags]
        forward_var[0][Start] = 0.0
        for p in probs:  # probs is [seq_len, num_tags]; iterate over the sequence
            alphas_t = []  # cumulative probability of each tag at the next step
            for next_tag in range(num_tags):  # iterate over the tags
                # emission probability of next_tag at the next step
                emit_score = p[next_tag].view(1, -1).expand(1, num_tags)
                # probability of transitioning from every tag to next_tag;
                # self.transitions is a square matrix of size [num_tags, num_tags]
                trans_score = self.transitions[next_tag].view(1, -1)
                # next_tag_var = F(i-1) + p + T
                next_tag_var = forward_var + trans_score + emit_score
                alphas_t.append(log_sum_exp(next_tag_var).view(1))
            forward_var = torch.cat(alphas_t).view(1, -1)
        # finally transition to Stop, which marks the end of the sentence
        terminal_var = forward_var + self.transitions[Stop]
        alpha = log_sum_exp(terminal_var)
        return alpha

4. Viterbi algorithm decoding

After the model is trained, prediction uses the Viterbi algorithm to decode the sequence; interested readers can consult "Statistical Learning Methods". Let's look at the Viterbi formulas, starting with the meaning of the symbols:

From these, the recursive formula of the Viterbi algorithm can be obtained.

Finally, we can recover the optimal sequence from the values computed by the Viterbi algorithm.
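As a minimal illustration (pure Python, the same made-up log-scores as earlier, Start/Stop omitted for brevity), Viterbi keeps the best score ending in each tag plus a backpointer at every step, then walks the pointers back from the best final tag:

```python
def viterbi(emissions, transitions):
    k = len(emissions[0])
    scores = list(emissions[0])  # best score of any path ending in each tag
    backptrs = []
    for emit in emissions[1:]:
        new_scores, ptrs = [], []
        for t in range(k):
            # best previous tag for ending in tag t at this step
            best_prev = max(range(k), key=lambda p: scores[p] + transitions[p][t])
            ptrs.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][t] + emit[t])
        scores = new_scores
        backptrs.append(ptrs)
    # follow the backpointers from the best final tag
    best = max(range(k), key=lambda t: scores[t])
    path = [best]
    for ptrs in reversed(backptrs):
        best = ptrs[best]
        path.append(best)
    return list(reversed(path))

emissions = [[1.0, 0.5], [0.2, 0.8]]
transitions = [[0.1, 0.3], [0.4, 0.2]]
print(viterbi(emissions, transitions))  # [0, 1]
```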

That's all for "how to understand the BiLSTM and CRF algorithm". Thank you for reading. If you want to learn more about the topic, you can follow this site, where more high-quality practical articles are published!
