2025-02-14 Update From: SLTechnology News&Howtos > Internet Technology
Shulou(Shulou.com)06/01 Report--
This article explains in detail the exploration and practice of NER technology. The editor shares it for your reference, and we hope you gain some understanding of the relevant topics after reading.
1. Background
Named entity recognition (Named Entity Recognition, NER), also known as "proper name recognition", refers to identifying entities with specific meaning in text, including person names, place names, organization names, proper nouns, and so on. NER is an important basic tool in information extraction, question answering systems, syntactic analysis, machine translation, Semantic-Web-oriented metadata tagging, and other applications, and it plays an important role in putting natural language processing technology into practice. In the Meituan search scenario, NER is the underlying basic signal of Deep Query Understanding (DQU) and is mainly used in search recall, user intent identification, entity linking, and other stages. The quality of the NER signal directly affects the user's search experience.
The following briefly describes the application of entity recognition in search recall. In O2O search, a merchant POI is described by text fields such as merchant name, address, and category that are not strongly related to one another. If the O2O search engine simply matched the query against all text fields and intersected the hits, a large number of false recalls could occur. Our solution, shown in figure 1, lets a specific query perform inverted retrieval only in specific text fields, which we call "structured recall"; this ensures the strong relevance of recalled merchants. For example, for a query such as "Haidilao", some merchants' addresses are described as "a few hundred meters from Haidilao". If the query were matched over the full-text domain, these merchants would be recalled, which is obviously not what the user wants. Structured recall, by contrast, identifies "Haidilao" as a merchant based on NER and then searches only in the merchant-name text field, thus recalling only Haidilao brand merchants and accurately meeting the user's needs.
Different from other application scenarios, the NER task in Meituan search has the following characteristics:
The number of new entities is large and growing fast: the local life services field is developing rapidly, and new stores, new goods, and new services emerge constantly; user queries are often mixed with many non-standard expressions, abbreviations, and buzzwords (such as "worried", "cat sucking", etc.), which poses a great challenge to achieving both high accuracy and high coverage in NER.
Strong domain relevance: entity recognition in search is highly related to business supply. Beyond general semantics, business-related knowledge is needed to assist judgment. Take "cut a hair": in the general sense it is a description of an action, but in search it corresponds to a business entity.
High performance requirements: the time from when the user initiates a search to when the final result is presented is very short. As a basic module of DQU, NER needs to complete within milliseconds. Recent research and practice based on deep networks has significantly improved NER quality, but these models are often computationally heavy and slow at prediction time. How to optimize model performance to meet NER's latency budget is another major challenge in NER practice.
2. Technology selection
Given the characteristics of the NER task in the O2O field, our overall technology selection is a framework of "entity dictionary matching + model prediction", as shown in figure 2 below. Entity dictionary matching and model prediction each address different problems, and both are indispensable at the current stage. Below we answer three questions to explain this choice.
Why is entity dictionary matching needed?
**Answer:** There are four main reasons:
First, head queries in search are usually short and simply expressed, and concentrate on three types of entities: merchant, category, and address. Although entity dictionary matching is simple, its accuracy on such queries can reach more than 90%.
Second, NER is domain-related. Business entity dictionaries can be obtained by mining business data resources, and online dictionary matching then ensures that recognition results are adapted to the domain.
Third, onboarding a new business is more flexible: entity recognition for a new business scenario can be enabled simply by providing the business-related entity vocabulary.
Fourth, some downstream users of NER have extremely strict response-time requirements; dictionary matching is fast and has essentially no performance problems.
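The fast dictionary matching mentioned above is typically implemented with a trie over the entity lexicon. The following is a minimal sketch (not Meituan's actual implementation; entity names are romanized placeholders) of longest-match entity lookup against such a dictionary:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.entity_type = None  # set only on nodes that end an entity

def build_trie(entity_dict):
    """entity_dict maps entity text -> entity type."""
    root = TrieNode()
    for text, etype in entity_dict.items():
        node = root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        node.entity_type = etype
    return root

def match_entities(query, root):
    """Scan the query left to right, emitting the longest dictionary
    entity starting at each position as (start, text, type)."""
    results, i = [], 0
    while i < len(query):
        node, j, best = root, i, None
        while j < len(query) and query[j] in node.children:
            node = node.children[query[j]]
            j += 1
            if node.entity_type is not None:
                best = (i, query[i:j], node.entity_type)
        if best:
            results.append(best)
            i += len(best[1])
        else:
            i += 1
    return results
```

Because each query character is visited a bounded number of times, lookup cost is essentially linear in query length, which is why dictionary matching poses no latency problem for millisecond budgets.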
Why do we need model prediction when we have entity dictionary matching?
**Answer:** There are two reasons:
First, as search volume continues to grow, medium- and long-tail search traffic is expressed in complex ways and more and more OOV (Out of Vocabulary) problems appear. Entity dictionaries can no longer meet users' increasingly diverse needs, while model prediction has generalization ability and serves as an effective supplement to dictionary matching.
Second, entity dictionary matching cannot resolve ambiguity. Take "Yellow Crane Tower Food" as an example: in the entity dictionary, "Yellow Crane Tower" is a scenic spot in Wuhan, a merchant in Beijing, and also a cigarette brand. Dictionary matching has no disambiguation ability and would output all three types, whereas model prediction can use the context and will not output the cigarette reading of "Yellow Crane Tower".
How are the two results of entity dictionary matching and model prediction merged?
**Answer:** Currently, we use a trained CRF weight network as a scorer to score the NER paths produced by entity dictionary matching and by model prediction. When dictionary matching yields no result, or its path score is significantly lower than that of the model prediction, the model's result is used; in all other cases the dictionary matching result is used.
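The merge rule above can be sketched as follows. This is a simplified stand-in: `score_fn` represents the CRF path scorer, and `margin` (the "significantly lower" threshold) is a hypothetical tuning knob not specified in the article:

```python
def merge_ner_results(dict_path, model_path, score_fn, margin=1.0):
    """Prefer the dictionary-matching path unless it is absent or its
    CRF path score falls more than `margin` below the model path's."""
    if dict_path is None:          # no dictionary result: fall back to model
        return model_path
    if score_fn(model_path) - score_fn(dict_path) > margin:
        return model_path          # dictionary path is significantly worse
    return dict_path               # default: trust the dictionary
```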
After introducing our technology selection, we will introduce our work in entity dictionary matching and model online prediction, hoping to provide some help for your exploration in the field of O2O NER.
3. Entity dictionary matching
Traditional NER technology can only handle established, existing entities in the general domain and cannot handle entity types unique to a vertical domain. In the Meituan search scenario, the domain entity recognition problem can be solved well by offline mining of unique data such as POI structured information, merchant review data, and search logs. After continuous enrichment and accumulation of the offline entity library, online entity recognition via lightweight lexicon matching is simple, efficient, and controllable, and covers head and mid-tier traffic well. At present, the online NER recognition rate based on the entity library reaches 92%.
3.1 offline mining
Meituan has a rich variety of structured data, and a high-precision initial entity library can be obtained by processing in-domain structured data. For example, merchant names, categories, addresses, and goods or services sold can be obtained from basic merchant information, and movies, TV dramas, artists, and other entity types can be obtained from Maoyan entertainment data. However, entity names in user searches are often mixed with many non-standard expressions that differ from the standard entity names defined by the business, so mining domain entities from non-standard expressions becomes particularly important.
Existing new-word mining techniques are mainly divided into unsupervised learning, supervised learning, and distantly supervised learning. In unsupervised learning, candidate sets are generated from frequent sequences and screened by computing compactness and degree-of-freedom indicators. Although this method can produce a sufficient candidate set, filtering by feature thresholds alone cannot effectively balance precision and recall; in practice, a higher threshold is usually chosen to guarantee precision at the expense of recall. Most advanced new-word mining algorithms are supervised, usually involving complex parsing models or deep network models, and relying on domain experts to design various rules or on large amounts of manually labeled data. Distant supervision generates a small amount of labeled data from open-source knowledge bases; although this alleviates the high cost of human labeling to some extent, the small labeled sample can only support simple statistical models and cannot train complex models with strong generalization ability.
Our offline entity mining is multi-source and multi-method, and the data sources involved include structured business information database, encyclopedia entries, semi-structured search logs, and unstructured user reviews (UGC). The mining methods used also include a variety of rules, traditional machine learning models, deep learning models and so on. As a kind of unstructured text, UGC contains a large number of non-standard entity names. Next, we will introduce in detail an automatic mining method of vertical domain new words for UGC, which mainly consists of three steps, as shown in figure 3 below:
Step1: candidate sequence mining. Word sequences that occur frequently and continuously are effective candidates for potential new vocabularies. We use frequent sequences to generate sufficient candidate sets.
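Candidate generation from frequent sequences can be sketched as a simple character n-gram counter over the corpus (a minimal illustration, not the production miner; `n_max` and `min_freq` are hypothetical parameters):

```python
from collections import Counter

def frequent_ngrams(corpus, n_max=3, min_freq=2):
    """Collect contiguous character n-grams (2..n_max) that occur at
    least min_freq times across the corpus as new-word candidates."""
    counts = Counter()
    for sent in corpus:
        for n in range(2, n_max + 1):
            for i in range(len(sent) - n + 1):
                counts[sent[i:i + n]] += 1
    return {g: c for g, c in counts.items() if c >= min_freq}
```

On real UGC this produces millions of candidates, which is exactly why the later steps focus on filtering them.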
Step2: large-scale labeled corpus generation based on distant supervision. Frequent sequences change as the corpus changes, so manual labeling would be very costly. We use the entity dictionaries accumulated in the domain as a distant-supervision lexicon and take the intersection of the Step1 candidate sequences with the entity dictionaries as positive training examples. Analysis of the candidate sequences shows that only about 10% of the millions of frequent N-gram candidates are truly high-quality new words, so negative training examples are produced by negative sampling [1]. For the massive UGC corpus, we design statistical features along four dimensions to measure the availability of candidate phrases:
Frequency: a meaningful new word should occur with a certain frequency in the corpus; this is computed in Step1.
Compactness: mainly used to evaluate the co-occurrence strength of adjacent elements in a new phrase, including the t-distribution test, Pearson chi-square test, pointwise mutual information, likelihood ratio, and so on.
Informativeness: a newly discovered word should have real meaning and refer to a new entity or concept; this mainly considers the inverse document frequency, part-of-speech distribution, and stop-word distribution of the phrase in the corpus.
Completeness: a newly discovered word should exist as a whole in a given context, so the compactness of its sub-phrases and super-phrases is also considered when measuring the completeness of a phrase.
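As one concrete instance of the compactness dimension, pointwise mutual information for a two-word candidate can be computed from corpus counts like this (a sketch with toy counts; the word tokens are illustrative placeholders):

```python
import math
from collections import Counter

def pmi(bigram, unigram_counts, bigram_counts, total_tokens):
    """Pointwise mutual information: log p(w1,w2) / (p(w1) * p(w2)).
    High PMI means the two words co-occur far more often than chance,
    i.e. the candidate phrase is compact."""
    w1, w2 = bigram
    p_xy = bigram_counts[bigram] / total_tokens
    p_x = unigram_counts[w1] / total_tokens
    p_y = unigram_counts[w2] / total_tokens
    return math.log(p_xy / (p_x * p_y))
```

The other compactness statistics (t-test, chi-square, likelihood ratio) are computed from the same count tables and combined with the frequency, informativeness, and completeness features into the candidate's feature vector.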
After constructing the small labeled sample and extracting the multi-dimensional statistical features, a binary classifier is trained to predict the quality of candidate phrases. Because the negative training samples are produced by negative sampling, a small number of high-quality phrases are mixed into them. To reduce the influence of this negative-example noise on the phrase quality scores, the error can be reduced by ensembling multiple weak classifiers. After the candidate sequence set is scored by the model, candidates whose scores exceed a certain threshold form the positive pool and those with lower scores form the negative pool.
Step3: phrase quality assessment based on a deep semantic network. Given a large amount of labeled data, a deep network can automatically and effectively learn corpus features and produce a model with strong generalization ability. BERT learns text semantic representations from massive natural language text through a deep model and, after simple fine-tuning, set new records on multiple natural language understanding tasks, so we train our phrase quality rater based on BERT. To further improve the quality of the training data, we use search log data to distantly supervise the large-scale positive and negative pools generated in Step2, treating frequently searched terms as meaningful keywords. We take the intersection of the positive pool and the search log as positive samples, and the negative pool minus the search log set as negative samples, thereby improving the reliability and diversity of the training data. In addition, after obtaining phrase quality scores for the first time, we use a Bootstrapping method to update the training samples based on the existing phrase quality scores and the distant-supervision search log, and retrain. This iterative training improves the phrase quality rater and effectively reduces false positives and false negatives.
After a large number of new words or phrases are extracted from the UGC corpus, the types of the newly mined words are predicted with reference to AutoNER [2], thereby expanding the offline entity library.
3.2 online matching
The original online NER dictionary matching method performed bi-directional maximum matching directly on the query to obtain a candidate set of component recognitions, then filtered the final result by word frequency (here, entity search volume). This strategy is relatively crude and places high demands on the accuracy and coverage of the lexicon, so it has the following problems:
When the query contains an entity not covered by the lexicon, the character-based maximum matching algorithm easily produces segmentation errors. For example, for the search term "Haituo Valley", the lexicon can only match "Haituoshan", so the erroneous segmentation "Haituoshan / Valley" occurs.
The granularity is out of control. For example, the segmentation of the search word "Starbucks Coffee" depends on the coverage of "Starbucks", "Coffee" and "Starbucks Coffee".
The definition of node weight is unreasonable. For example, if entity search volume is used directly as the entity node weight, then when the user searches for "Xinyang Restaurant", "Xinyang Cuisine / Restaurant" scores higher than "Xinyang Restaurant".
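The first problem above can be reproduced with a minimal forward maximum matching sketch (one direction of the bi-directional matcher; romanized placeholder strings, hypothetical `max_len` parameter):

```python
def forward_max_match(query, lexicon, max_len=10):
    """Greedy longest-match segmentation: at each position, take the
    longest lexicon entry; fall back to a single character otherwise."""
    i, out = 0, []
    while i < len(query):
        for j in range(min(len(query), i + max_len), i, -1):
            if query[i:j] in lexicon or j == i + 1:
                out.append(query[i:j])
                i = j
                break
    return out
```

With a lexicon that contains only "haituoshan", the query "haituoshangu" ("Haituo Valley") is greedily cut after the known entity, stranding the remaining characters — the "Haituoshan / Valley" error described above.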
To solve the above problems, a CRF word segmentation model is introduced before entity dictionary matching: word segmentation criteria are formulated for Meituan's vertical-domain search, a training corpus is labeled manually, and the CRF segmentation model is trained on it. In addition, a two-stage repair method is designed for segmentation errors made by the model:
Combine the model's segmentation terms with domain dictionary matching terms, and solve for the term sequence with the optimal weight sum by dynamic programming.
Apply strong repair rules based on Pattern regular expressions. Finally, output the component recognition result based on entity library matching.
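The dynamic-programming step in the first repair stage can be sketched as a lattice search over candidate terms, maximizing the sum of term weights (a simplified illustration; `term_weights` stands in for the combined model/dictionary term scores, and the zero-weight single-character fallback is an assumption):

```python
def best_segmentation(query, term_weights):
    """DP over character positions: best[j] is the maximum weight sum of
    any segmentation of query[:j]; back[j] records the split point."""
    n = len(query)
    best = [float("-inf")] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        if best[i] == float("-inf"):
            continue
        for j in range(i + 1, n + 1):
            term = query[i:j]
            w = term_weights.get(term)
            if w is None and j == i + 1:
                w = 0.0  # fallback: single characters carry no weight
            if w is None:
                continue
            if best[i] + w > best[j]:
                best[j] = best[i] + w
                back[j] = i
    out, j = [], n
    while j > 0:
        i = back[j]
        out.append(query[i:j])
        j = i
    return out[::-1]
```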
4. Model online prediction
For long-tail and out-of-vocabulary queries, we use the model for online recognition. The evolution of the NER model has gone through the stages shown in figure 5 below. At present, the main models online are BERT [3] and a BERT+LR cascade model; some models under exploration have also been shown to be effective offline, and in the future we will gradually bring them online after weighing performance against benefit. There are three main problems in building the online NER model for search:
High performance requirements: as a basic module, NER model prediction needs to complete within milliseconds, but current deep learning models are computationally heavy and slow at prediction time.
Strong domain relevance: entity types in search are highly related to business supply, so it is difficult to guarantee recognition accuracy by considering general semantics alone.
Lack of labeled data: NER labeling is relatively difficult, requiring both entity boundaries and entity types; the labeling process is time-consuming and laborious, and large-scale labeled data is hard to obtain.
To address the high performance requirements, our online model underwent a series of performance optimizations when it was upgraded to BERT; for the domain-relevance problem, we propose a knowledge-enhanced NER method that fuses search log features and entity dictionary information; for the difficulty of obtaining training data, we propose a weakly supervised NER method. These technical points are introduced in detail below.
4.1 BERT model
BERT is a natural language processing method released by Google in October 2018. Upon release, it attracted widespread attention in academia and industry: BERT set new state-of-the-art results on 11 NLP tasks, and the method was rated the major NLP advance of 2018 and won the best paper award at NAACL 2019 [4,5]. BERT's technical approach is basically the same as that of the GPT method released earlier by OpenAI, differing only in technical details. The main contribution of both works is to use the idea of pre-training plus fine-tuning to solve natural language processing problems. Taking BERT as an example, applying the model involves two steps:
Pre-training (Pre-training), in which network parameters are learned from a large general corpus, including Wikipedia and Book Corpus, which contain large amounts of text exhibiting rich linguistic phenomena.
Fine-tuning (Fine-tuning), in which task-related labeled data is used to fine-tune the network parameters, eliminating the need to design a task-specific network for the target task.
Applying BERT to online entity recognition prediction poses one challenge: prediction is slow. We explored model distillation and prediction acceleration, and launched the BERT distillation model, the BERT+Softmax model, and the BERT+CRF model in stages.
4.1.1 Model distillation
We tried both pruning and distilling the BERT model. The results show that pruning causes severe accuracy loss for complex NLP tasks such as NER, while model distillation is feasible. Model distillation uses a simple model to approximate the output of a complex model, reducing the computation required for prediction while preserving prediction quality. Hinton elaborated the core idea in his 2015 paper [6]. The complex model is generally called the Teacher Model, and the simple distilled model the Student Model. Hinton's distillation method trains the Student Model on the probability distributions over pseudo-labeled data rather than on the hard labels; the view is that probability distributions provide more information and stronger constraints than labels and better ensure that Student and Teacher predictions agree. At the NeurIPS 2018 Workshop, [7] proposed a new network structure, BlendCNN, to approach the prediction quality of GPT, which is essentially model distillation. BlendCNN predicts 300 times faster than the original GPT, and on specific tasks its accuracy is even slightly higher. The following conclusions can be drawn about model distillation:
The essence of model distillation is function approximation. For a specific task, as long as the capacity of the Student Model matches the complexity of the problem, the Student Model can have a completely different architecture from the Teacher Model. An example of choosing the Student Model is shown in figure 6 below. Suppose the samples (x, y) of the problem are drawn from a polynomial function whose highest degree is d = 2, while the available Teacher Model uses a higher degree (say d = 5). When choosing a Student Model for prediction, its capacity cannot be lower than the complexity of the problem itself; that is, its corresponding degree must be at least d = 2.
Depending on the amount of unlabeled data, the constraint used in distillation can differ. As shown in figure 7, when the unlabeled data is small in scale, logits approximation can be used for learning, which imposes a strong constraint; when it is medium in scale, distribution approximation can be used; when it is very large, label approximation can be used, i.e., learning is guided only by the labels predicted by the Teacher Model.
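The three constraint strengths can be written as three training losses over the teacher's and student's logits. A minimal sketch (toy logit lists, not the production training code):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def logits_loss(student, teacher):
    # Strongest constraint: match the teacher's raw logits (MSE).
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def distribution_loss(student, teacher):
    # Medium constraint: cross-entropy against the teacher's soft distribution.
    p, q = softmax(teacher), softmax(student)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def label_loss(student, teacher):
    # Weakest constraint: only the teacher's hard pseudo-label supervises.
    label = max(range(len(teacher)), key=teacher.__getitem__)
    return -math.log(softmax(student)[label])
```

Moving down the list, the student is given progressively less information per example, which is why the weaker constraints only work when unlabeled data is abundant.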
With these conclusions, how do we apply model distillation to the search NER task? First, analyze the task. Compared with the tasks in the literature, search NER has one significant difference: as an online application, search has a huge amount of unlabeled data. User queries reach the order of tens of millions per day, a scale far larger than what offline evaluations provide. Accordingly, we simplify the distillation process: we do not restrict the architecture of the Student Model, choosing a mainstream neural network with fast inference to approximate BERT; and instead of logits approximation or distribution approximation as the training objective, we directly use label approximation to guide the Student Model.
We use IDCNN-CRF to approximate the BERT entity recognition model. IDCNN (Iterated Dilated CNN) is a multi-layer CNN network in which the lower layers use ordinary convolution: the output is a weighted sum over the positions delineated by the sliding window, with a distance of 1 between adjacent positions. The higher layers use dilated convolution (Atrous Convolution), where the distance between the positions covered by the sliding window is d (d > 1). Using dilated convolution in the higher layers reduces the amount of convolution computation without losing sequence-dependency information. In text mining, IDCNN is often used to replace LSTM. Experimental results show that, compared with the original BERT model, the distilled model's online prediction speed improves by tens of times with no obvious loss of accuracy.
4.1.2 Prediction acceleration
Because BERT contains a large number of small operators and the Attention computation is heavy, prediction takes a long time in actual online use. We mainly use the following three methods to accelerate model prediction; in addition, for high-frequency queries in the search log, we write the prediction results to a dictionary cache to further reduce the QPS pressure of online model prediction. The three acceleration methods are:
Operator fusion: reduce the overhead of BERT's many small operators by reducing the number of kernel launches and improving the memory-access efficiency of small operators. We investigated the Faster Transformer implementation here: on average latency it yields a speedup of about 1.4x~2x, and on TP999 about 2.1x~3x. This method is suitable for the standard BERT model. However, the open-source version of Faster Transformer is of low engineering quality, with many usability and stability problems, so it cannot be applied directly. We carried out secondary development based on NVIDIA's open-source Faster Transformer, mainly in terms of stability and ease of use:
Ease of use: support automatic conversion, support Dynamic Batch, support Auto Tuning.
Stability: fix memory leaks and thread safety issues.
Batching: the principle of batching is to merge multiple requests into one batch for inference, reducing the number of kernel launches and making full use of multiple GPU SMs, thereby improving overall throughput. With max_batch_size set to 4, the native BERT model keeps average latency within 6 ms, with a maximum throughput of 1300 QPS. This method is well suited to optimizing the BERT model in the Meituan search scenario, because search traffic has obvious peak periods and batching improves the model's throughput at peak.
Mixed precision: mixed precision refers to mixing FP32 and FP16. It can speed up BERT training and prediction and reduce GPU memory overhead while balancing the stability of FP32 and the speed of FP16. FP16 is used to accelerate computation during the model's forward and backward passes, while the weights are stored in FP32 format during training and parameter updates are performed in FP32. Updating parameters against FP32 master weights effectively avoids underflow and overflow. With essentially no impact on accuracy, mixed precision improves both training and prediction speed to a certain extent.
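The FP32 master-weights trick can be illustrated with NumPy (a toy SGD step, not the production training loop; the function name and learning rate are illustrative):

```python
import numpy as np

def sgd_step_mixed_precision(master_w32, grad16, lr=0.01):
    """Update FP32 master weights with an FP16 gradient cast up to FP32,
    then return an FP16 copy of the weights for the fast forward pass."""
    master_w32 -= lr * grad16.astype(np.float32)  # update happens in FP32
    return master_w32.astype(np.float16)          # FP16 copy for compute
```

The point is that tiny updates like lr * grad ≈ 1e-7 vanish when added directly to an FP16 weight near 1.0 (FP16 spacing there is about 5e-4), but accumulate correctly in the FP32 master copy.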
4.2 knowledge enhanced NER
How to embed external domain-specific knowledge into a language model as auxiliary information has been a research hotspot in recent years. K-BERT [8], ERNIE [9], and other models explore combining knowledge graphs with BERT, providing a good reference for us. NER in Meituan search is domain-related, and determining an entity's type is highly related to business supply, so we also explore how to fuse external knowledge such as POI supply information, user clicks, and the domain entity lexicon into the NER model.
4.2.1 Lattice-LSTM that integrates search log features
In O2O vertical search, a large number of entities are defined by merchants (such as merchant names and deal names), and entity information is hidden in the attributes of supplied POIs. Lattice-LSTM [10] enriches semantic information for Chinese entity recognition by adding word-vector inputs. We borrow this idea and, combining it with search user behavior, mine potential phrases in the query; these phrases carry POI attribute information, and embedding this hidden information into the model solves the domain new-word discovery problem to a certain extent. Compared with the original Lattice-LSTM method, recognition accuracy improves by 5 per mille (0.5 percentage points).
1) phrase mining and feature calculation
The process mainly includes two steps: matching position calculation and phrase generation, which are described in detail below.
Step1: match position calculation. The search log is processed, focusing on computing the detailed matches between the query and each document field, and on computing document weights (such as click-through rate). As shown in figure 9, the user query is "hand-woven". For document D1 (in search, a document is a POI), "hand" appears in the field "group list" and "woven" appears in the field "address". For document D2, "hand-woven" appears in both the merchant name and the group list. The match start position and match end position correspond to the start and end positions of the matched query substring.
Step2: phrase generation. Taking the results of Step1 as input, a model is used to infer candidate phrases. Multiple models can be used to produce results satisfying different assumptions. We model candidate phrase generation as an integer linear programming (Integer Linear Programming, ILP) problem and define an optimization framework in which the hyperparameters can be customized according to business requirements so that the results satisfy the corresponding assumptions. For a specific query Q, each segmentation result can be represented by integer variables xij: xij = 1 indicates that positions i through j of the query constitute a phrase, i.e., Qij is a phrase; xij = 0 indicates that they do not. The optimization goal is to maximize the collected matching score given the segmentation xij. The objective and constraint functions are shown in figure 10, where p denotes a document, f a field, wp the weight of document p, and wf the weight of field f; yijpf is the observed match, i.e., that the query substring Qij appears in field f of document p; Score(xijpf) is the observation score considered in the final segmentation scheme; w(xij) is the weight corresponding to segment Qij; and χmax is the maximum number of phrases the query may contain. Here χmax, wp, wf, and w(xij) are hyperparameters that must be set before solving the ILP problem. They can be set under different assumptions: manually from experience, or based on other signals, for example as shown in figure 10. The feature vector of the final phrase is represented by the click distribution over each POI attribute field.
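For short queries, the ILP above can be sanity-checked by brute-force enumeration of all segmentations, keeping at most χmax phrases and maximizing the summed match score. A minimal stand-in for the solver (the `score` function and placeholder strings are illustrative, not the real match statistics):

```python
from itertools import combinations

def best_phrase_split(query, score, k_max=2):
    """Enumerate every segmentation of `query` into contiguous spans
    (each span is an x_ij = 1 decision), keep those with at most k_max
    spans, and return the one maximizing the total score."""
    n = len(query)
    best, best_val = None, float("-inf")
    for r in range(n):  # r = number of interior cut points
        for cuts in combinations(range(1, n), r):
            bounds = [0, *cuts, n]
            spans = [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
            if len(spans) > k_max:  # the chi_max constraint
                continue
            val = sum(score(query[i:j]) for i, j in spans)
            if val > best_val:
                best_val, best = val, [query[i:j] for i, j in spans]
    return best, best_val
```

This is exponential in query length and only viable for tiny inputs; the ILP formulation exists precisely so the same optimum can be found efficiently with off-the-shelf solvers.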
2) Model structure
The model structure is shown in figure 11. The blue part represents a standard LSTM network (which can be trained separately or jointly with other models) whose input is word vectors; the orange part represents all word vectors in the current query; and the red part represents all phrase vectors computed in Step1 for the current query. The hidden-state input of the LSTM is composed of features at two levels: current text semantic features, including the current word-vector input and the hidden-layer output of the previous time step; and potential entity knowledge features, including the phrase features and word features of the current word. The calculation of the potential knowledge features and the feature combination method are described below (in the following formulas, σ denotes the sigmoid function and ⊙ denotes element-wise multiplication).
4.2.2 two-stage NER of fused entity Dictionary
To integrate domain dictionary knowledge into the model, we propose a two-stage NER recognition method that divides the NER task into two sub-tasks: entity boundary recognition and entity label recognition. Compared with traditional end-to-end NER methods, the advantage is that the entity boundary segmentation can be reused across domains. In addition, in the entity label recognition stage, the accumulated entity data and entity linking can be fully used to improve label accuracy; the disadvantage is that errors propagate between the stages.
In the first stage, the BERT model focuses on determining entity boundaries; in the second stage, the information gain brought by the entity dictionary is fused into the entity classification model. The second-stage classification could predict each entity separately, but that would lose the entity's context. Our approach is to use the entity dictionary as training data to train an IDCNN classification model, which encodes the segmentation results output by the first stage; this encoded information is added to the second-stage label recognition model, and decoding is performed jointly with the context vocabulary. Evaluated on benchmark labeled data, this model improves query-level accuracy by 1% over BERT-NER. We use IDCNN here mainly for performance reasons; depending on the scenario, it can be replaced with BERT or other classification models.
4.3 Weakly supervised NER
To address the difficulty of obtaining labeled data, we propose a weakly supervised scheme consisting of two stages: weakly supervised labeled-data generation and model training. The two stages are described in detail below.
Step 1: weakly supervised labeled-sample generation
Initial model: train an entity recognition model on a small labeled dataset; here we use the latest BERT-based model, obtaining an initial model, ModelA.
Dictionary data prediction: the entity recognition module has accumulated millions of high-quality entity entries as a dictionary, each in the format of entity text, entity type, and attribute information. We use the ModelA obtained in the previous step to predict on the dictionary data, producing entity recognition results.
Prediction-result correction: entity accuracy in the entity dictionary is high, so in theory at least one entity type in the model's prediction should match the type given in the dictionary. If none does, the model handles this kind of input poorly and we need to supplement samples for it: we correct the model's output on such inputs to obtain labeled text. We tried two correction methods, whole correction and partial correction. Whole correction sets the entire input to the dictionary's entity type; partial correction changes the type of a single term in the model's segmentation. For example, the dictionary gives "Brother Barbecue Personality diy" the type merchant, while the model predicts modifier + dish + category; since no term is typed as merchant, the model's prediction disagrees with the dictionary and its output labels must be corrected. There are three correction candidates: "merchant + dish + category", "modifier + merchant + category", and "modifier + dish + merchant". We choose the one closest to the model's prediction. The rationale is that the model has already converged toward the prediction distribution closest to the true distribution, so we only need to fine-tune that distribution rather than change it drastically. How do we choose the candidate closest to the model's prediction? We compute each correction candidate's probability score under the model and take its ratio to the probability of the model's current prediction (the model's current optimum); the probability ratio is computed as in Formula 2.
The candidate with the highest probability ratio is the final correction, i.e., the weakly supervised labeled sample. In the "Brother Barbecue Personality diy" example, the candidate "merchant + dish + category" has the highest probability ratio relative to the model output "modifier + dish + category", yielding the labeled data "Brother/merchant Barbecue/dish Personality diy/category".
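The candidate-selection step can be sketched as follows. This is a simplified illustration, assuming the model exposes per-token label probabilities and scoring a sequence as their product; the real system scores full sequences with the trained NER model, and the numbers below are invented for the example.

```python
import math

def sequence_log_prob(token_probs, labels):
    """Log-probability of a label sequence under per-token distributions.

    token_probs: list of dicts mapping label -> probability (model output).
    Simplified stand-in: the real model scores whole sequences.
    """
    return sum(math.log(p[l]) for p, l in zip(token_probs, labels))

def pick_correction(token_probs, model_labels, candidates):
    """Pick the candidate with the highest probability ratio
    p(candidate) / p(model prediction), i.e. the candidate closest to the
    model's current prediction distribution (ratios compared in log space)."""
    base = sequence_log_prob(token_probs, model_labels)
    return max(candidates, key=lambda c: sequence_log_prob(token_probs, c) - base)

# Toy numbers mirroring the "Brother Barbecue Personality diy" example:
# the model predicts modifier + dish + category, but the dictionary says
# one term must be "merchant"; candidates swap "merchant" into each slot.
token_probs = [
    {"modifier": 0.6, "merchant": 0.3, "dish": 0.05, "category": 0.05},
    {"dish": 0.7, "merchant": 0.1, "modifier": 0.1, "category": 0.1},
    {"category": 0.8, "merchant": 0.1, "modifier": 0.05, "dish": 0.05},
]
model_labels = ["modifier", "dish", "category"]
candidates = [
    ["merchant", "dish", "category"],
    ["modifier", "merchant", "category"],
    ["modifier", "dish", "merchant"],
]
best = pick_correction(token_probs, model_labels, candidates)
print(best)  # ['merchant', 'dish', 'category']
```

Note that the base score cancels when comparing candidates, so the ratio mainly serves as an interpretable measure of how far the correction drifts from the model's own prediction.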
Step 2: weakly supervised model training
There are two weakly supervised training strategies: one mixes the generated weakly supervised samples with the labeled samples and retrains the model from scratch; the other fine-tunes the ModelA trained on labeled samples using the weakly supervised samples. We tried both; experimentally, fine-tuning works better.
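The two strategies differ only in how data reaches the trainer, which a stub makes explicit. `train` below is a placeholder that merely records which datasets a "model" has seen; the data and model representation are illustrative assumptions, not the real training code.

```python
def train(init_model, data):
    """Stub trainer: a 'model' is just the list of dataset tags it saw,
    in training order. Stands in for real (re)training / fine-tuning."""
    return init_model + [sorted(set(tag for tag, _ in data))]

labeled_data = [("gold", "sample1"), ("gold", "sample2")]
weak_data = [("weak", "sample3"), ("weak", "sample4")]

# Strategy 1: retrain from scratch on the mixed data.
model_mixed = train([], labeled_data + weak_data)

# Strategy 2: train ModelA on gold data, then fine-tune on weak samples
# (the variant that worked better in the experiments above).
model_a = train([], labeled_data)
model_finetuned = train(model_a, weak_data)

print(model_mixed)      # [['gold', 'weak']]
print(model_finetuned)  # [['gold'], ['weak']]
```

The output shows the key difference: mixing exposes the model to both sources in one pass, while fine-tuning preserves a gold-only initialization and applies the weak signal as a second, gentler phase.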
This concludes our sharing on the exploration and practice of NER technology. I hope the above content is helpful and that you have learned something from it. If you found the article useful, please share it so more people can see it.