
How to use the Chinese pre-training Model ERNIE


This article shows how to use the Chinese pre-training model ERNIE. It is meant to be concise and easy to follow, and I hope you take something away from the detailed walkthrough.

I have been working with Chinese corpora recently and have tried several of the recently released pre-training models (ERNIE, BERT-CHINESE, WWM-BERT-CHINESE). After comparing them, I find that Baidu's ERNIE works better and is very convenient to use, so today I am recording the details, in the hope that it helps you make progress on your own projects.

1. A Glance at ERNIE

This article will not spend much time introducing the ERNIE model itself; there are plenty of introductions online, and anyone working in NLP is surely already familiar with it.

2. A taste of ERNIE source code

Okay, once we understand the general framework and principles of the ERNIE model, we can take a closer look at the concrete implementation. ERNIE is built on Baidu's own deep learning framework, PaddlePaddle (which Baidu promotes heavily and backs with free compute), whereas most people do their model training with TensorFlow or PyTorch. To set up the PaddlePaddle environment needed to run ERNIE, refer to the Quick Installation Guide (https://www.paddlepaddle.org.cn/#quick-start).

2.1 About the input

The model's pre-training input consists of sentence pairs with contextual relations, constructed from encyclopedia, news and forum-dialogue data. Baidu's internal lexical analysis tool segments each sentence at several granularities (characters, words, entities, and so on); the segmented text is then tokenized with the CharTokenizer in tokenization.py to obtain the plaintext token sequence and the segmentation boundaries. The plaintext tokens are mapped to ids using the dictionary config/vocab.txt, and during training consecutive tokens are randomly masked according to the segmentation boundaries. A sample input after this preprocessing looks like:

1 1048 492 1333 1361 1051 326 2508 5 1803 1827 98 2777 2696 983 121 4 19 9 634 551 844 85 2476 1 895 33 13 983 121 23 7 1093 24 46 660 12043 2 1263 6 328 33 121 126 398 276 315 5 63 44 35 25 12043 2; 0 0 0 ... 1 1 1; 0 1 2 3 4 5 ...; ...; ...

It consists of five parts, separated by semicolons:

token_ids: id representation of the input sentence pair

sentence_type_ids: 0 or 1, indicating which sentence each token belongs to

position_ids: absolute position encoding

seg_labels: word-segmentation boundary information; 0 marks the first character of a word, 1 marks a non-initial character, and -1 is a placeholder

next_sentence_label: whether the two sentences are contextually consecutive (0 means they are not)

This parsing is handled by the parse_line function in reader.pretraining.py.
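As a rough illustration only (not the actual implementation in reader.pretraining.py), splitting one such preprocessed line into its five fields might look like this:

# Illustrative sketch: split one preprocessed sample line into its five
# semicolon-separated fields. The real parse_line in reader.pretraining.py
# additionally validates lengths and handles max_seq_len.
def parse_sample_line(line):
    fields = line.strip().split(";")
    assert len(fields) == 5, "expected 5 semicolon-separated fields"
    token_ids           = [int(t) for t in fields[0].split()]
    sentence_type_ids   = [int(t) for t in fields[1].split()]
    position_ids        = [int(t) for t in fields[2].split()]
    seg_labels          = [int(t) for t in fields[3].split()]  # 0 / 1 / -1 word-boundary info
    next_sentence_label = int(fields[4])
    return token_ids, sentence_type_ids, position_ids, seg_labels, next_sentence_label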

2.2 About the mask policy: batching.py

We know that ERNIE's biggest improvement over BERT is its Chinese word / phrase / entity-level mask (BERT later adopted whole-word masking to train WWM-BERT), so let's first look at how ERNIE's mask mechanism is implemented.
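The actual logic lives in batching.py; as a simplified sketch of the idea only (a hypothetical helper, not the repo's code), word-level masking uses the seg_labels boundaries to mask whole words instead of single characters:

import random

# Simplified sketch of phrase/word-level masking: seg_labels marks word starts
# (0 = first character of a word, 1 = continuation, -1 = special token), so we
# group character positions into word spans and mask whole spans at once. The
# real policy in batching.py also handles entity spans, the masking budget and
# replacement rules.
def word_level_mask(token_ids, seg_labels, mask_id, mask_rate=0.15):
    spans, cur = [], []
    for i, lab in enumerate(seg_labels):
        if lab == -1:                 # [CLS]/[SEP] etc., never masked
            if cur:
                spans.append(cur)
                cur = []
            continue
        if lab == 0 and cur:          # a new word starts, close the previous span
            spans.append(cur)
            cur = []
        cur.append(i)
    if cur:
        spans.append(cur)

    output, mask_labels = list(token_ids), []
    for span in spans:
        if random.random() < mask_rate:   # mask the whole word span
            for pos in span:
                mask_labels.append((pos, token_ids[pos]))
                output[pos] = mask_id
    return output, mask_labels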

2.3 About rewriting the infer code

The ERNIE code is easy to use, but one shortcoming is that no official infer.py has been provided yet, i.e. a file that produces quick inference results after the model is trained. Huge numbers of people are asking for such an interface in the GitHub issues.

So our goal is to modify the source code to provide such an interface: feed in the file predict.tsv that we want predictions for, call the interface, and get back the corresponding task's result file pred_result. Let's take the classification task as an example and write an infer interface.

Step 1. classifier.py under finetune

Complete the predict function in this file.

Step 2. run_classifier.py

Modify the logic for the predict_only=True case.

Step 3. finetune_args.py

Add a do_predict parameter to this file (a rough sketch of all three changes follows below).
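As a rough sketch of what these three steps add up to (names such as do_predict and the helper functions below mirror the steps but are illustrative, not the exact upstream code):

import argparse

def add_predict_arg(parser):
    # Step 3: new switch added in finetune_args.py
    parser.add_argument("--do_predict", action="store_true",
                        help="run inference on predict.tsv and write pred_result")

def write_predictions(probabilities, texts, output_file="pred_result"):
    # Step 1/2: after the forward pass in classifier.py / run_classifier.py,
    # keep only the argmax label for each input line; no gold labels or
    # accuracy computation are needed in this predict-only path.
    with open(output_file, "w", encoding="utf-8") as f:
        for text, probs in zip(texts, probabilities):
            label = max(range(len(probs)), key=lambda i: probs[i])
            f.write("{}\t{}\n".format(label, text))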

OK, space is limited, so next we will move on to ERNIE in practice. That wraps up the first part of the source-code reading; much of the remaining code is similar to BERT's, and interested readers can also refer to my earlier BERT source-code analysis series (https://blog.csdn.net/Kaiyuan_sjtu/article/details/90265473).

3. ERNIE Practical Guide

Everything above has been fairly abstract, so now let's take a pragmatic look at how the pre-trained ERNIE model is actually applied. Compared with BERT, ERNIE is easier to use. In the BERT multi-class text classification post introduced earlier (https://blog.csdn.net/Kaiyuan_sjtu/article/details/88709580), we had to hand-write a Processor adapted to our task, whereas with ERNIE it is as simple as three steps:

Prepare the data in the required format (the source code uses tsv, but as mentioned in the BERT post it can be adapted to other formats)

Write the training script (a .sh file)

Run the script to get the result: bash run_script.sh

Preparatory work

With today's popular pre-training models, few of us are likely to train from scratch; instead we take the officially open-sourced model and fine-tune it on a specific task. So the first step is to download the model code (https://github.com/PaddlePaddle/ERNIE/tree/develop/ERNIE) and the corresponding parameters (https://baidu-nlp.bj.bcebos.com/ERNIE_stable-1.0.1.tar.gz).

The next step is to prepare our task's data so that it meets the ERNIE model's input requirements. Generally speaking, the label and text_a fields are separated by a tab, and sentence-pair tasks additionally require a text_b field. Sample inputs for each task are described in detail later.

OK, we keep stressing that ERNIE is super friendly and fast to use, so let's see how simple it is on actual tasks.

Sentiment classification

Sentiment classification is one of the classic basic tasks in NLP. Since text classification with BERT has been covered before, let's change things up a little here. We consider only the simplest sentiment classification setup: given an input sentence, the model must output a sentiment label, which can be binary (positive or negative only) or three-way (adding a neutral class). Let's look at the data. I found a financial news dataset on the Internet: positive and negative news headlines published by Wande Information on Snowball. The dataset contains 17,149 news items with date, company, stock code, positive/negative label, title and body, of which 12,514 are positive and 4,635 are negative.

Processed into the input format required by an ERNIE classification task, it looks something like this:
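Roughly speaking (the headlines below are invented for illustration, and the label convention of 1 = positive / 0 = negative is an assumption), each line of the processed tsv holds one tab-separated label and text, with a header row naming the columns:

label	text_a
1	公司中标重大项目，全年业绩有望大幅增长
0	公司因信息披露违规被证监会立案调查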

Put the processed data and the previously downloaded pre-trained model parameters in place, and you can start writing the script that runs the model:
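A minimal sketch of such a script (paths are placeholders, and the flag names follow the sample fine-tuning scripts shipped in the ERNIE repo; check finetune_args.py for the exact argument set your version supports):

set -eux
MODEL_PATH=./ERNIE_stable-1.0.1        # unpacked pre-trained parameters
TASK_DATA=./data/sentiment             # processed train.tsv / dev.tsv / test.tsv

python -u run_classifier.py \
    --use_cuda true \
    --do_train true --do_val true --do_test true \
    --batch_size 32 \
    --init_pretraining_params ${MODEL_PATH}/params \
    --train_set ${TASK_DATA}/train.tsv \
    --dev_set   ${TASK_DATA}/dev.tsv \
    --test_set  ${TASK_DATA}/test.tsv \
    --vocab_path ${MODEL_PATH}/vocab.txt \
    --ernie_config_path ${MODEL_PATH}/ernie_config.json \
    --checkpoints ./checkpoints \
    --num_labels 2 \
    --max_seq_len 128 \
    --learning_rate 5e-5 \
    --epoch 3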

And with that, the task is done... just run the script and wait for the output. Isn't that simple?

Of course, if you want to get fancier, you can read more papers. For example, a paper from Fudan turned the single-sentence classification task of ABSA (aspect-based sentiment analysis) into a sentence-pair matching task on top of BERT. Simply put, an auxiliary sentence is constructed for each aspect: a review like "the guo bao rou at this restaurant is super delicious" is paired with the aspect sentence "food taste", and the pair is classified as positive. The paper shows this trick works better than single-sentence classification. For details, see the paper:

"Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence": https://www.aclweb.org/anthology/N19-1035

Named entity recognition

Named entity recognition is another basic NLP task, which has been covered on this blog before: [Paper Notes] Named entity recognition papers: https://blog.csdn.net/Kaiyuan_sjtu/article/details/89143573

Handling NER is much the same as the sentiment classification above, except that NER is a sequence labeling task, so be careful to use run_sequence_labeling.py from the source code when running the script.
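For orientation, sequence-labeling data pairs each character with a label. A toy example in the common BIO scheme is shown below; the exact column layout and token/label separators are an assumption here and should be checked against the sample data shipped with run_sequence_labeling.py:

text_a	label
百 度 总 部 位 于 北 京	B-ORG I-ORG O O O O B-LOC I-LOC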

4. Interesting issues

Even more valuable than the source code on GitHub are the corresponding issues. A good open-source project attracts a lot of attention, and the issue area is full of interesting discussion, so don't miss it. Here are a few issues I found particularly interesting, for your reference.

About batch_size

Anyone who has just opened the ERNIE pretraining script may notice that batch_size is 8192. Good grief, surely that will blow up the memory! So you prudently change batch_size to 32, happily type bash script/pretrain.py, confidently press Enter... huh? An error?

If you are curious what goes wrong, try reproducing it yourself.

Yes, in pretraining the batch_size refers to the total number of input tokens in a batch, which is why it is so large.
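A quick back-of-the-envelope check (assuming a maximum sequence length of 512, which should be confirmed against the pretraining config):

# batch_size counts tokens, not sentences: 8192 tokens with sequences of up to
# 512 tokens means each batch holds only on the order of 16 sequences.
tokens_per_batch = 8192
max_seq_len = 512
print(tokens_per_batch // max_seq_len)   # -> 16 sequences per batch at most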

About the logic of the mask mechanism

As I said at the beginning, the biggest innovation of ERNIE is its mask mechanism, and the code implementation of this is also hotly discussed in the issue area.

About getting the intermediate vector representation of the input

Sometimes we need to obtain the sentence embedding and token embeddings; for that, refer to the scheme discussed in the corresponding issue.

Predicting masked words

Mask a word in a sentence, then use the model to predict it and obtain the candidate words and their probabilities.
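As a framework-agnostic sketch of the idea (not ERNIE's actual API), given the logits the model outputs at the masked position, you take a softmax over the vocabulary and read off the top candidates:

import numpy as np

# Sketch only: turn the logits predicted at a [MASK] position into the top
# candidate words with probabilities. `logits` would come from a forward pass
# on the masked sentence; `id_to_word` is the inverse of config/vocab.txt.
def top_candidates(logits, id_to_word, k=5):
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()                       # softmax over the vocabulary
    top = np.argsort(-probs)[:k]
    return [(id_to_word[i], float(probs[i])) for i in top]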

That covers how to use the Chinese pre-training model ERNIE. Did you pick up some new knowledge or skills? If you want to learn more or broaden your knowledge, you are welcome to follow the industry information channel.
