This article walks through a Java implementation of a Chinese word segmenter based on the structured averaged perceptron.
Java implementation of word Segmentation based on structured average Perceptron
I have been unusually prolific lately and wrote a Chinese word segmenter based on the averaged perceptron (AP). It reaches an F1 of 96.11% on the MSR corpus of Bakeoff-05. Best of all, only five training iterations are needed; including I/O such as corpus loading, the whole training run takes just 23 seconds. After 80% of the features are removed with a pruning algorithm, the F1 drops by less than 0.1% and the model size is kept to about 11 MB. With 100 training iterations, the F1 reaches 96.31%, and training takes a little over two minutes.
These figures were measured on an ordinary IBM-compatible PC.
This module has been integrated into the open-source HanLP 1.6; the documentation is in the project wiki, so feel free to use it. (HanLP 1.7 has since been released and can be used as well.)
Structured prediction
For the difference between structured and unstructured prediction, and for the structured perceptron itself, refer to Neubig's handout "The Structured Perceptron".
The AP segmenter implemented here predicts the BMES tag sequence of a whole sentence, which of course makes it a structured prediction problem.
Perceptron
Binary classification
The basic form of the perceptron, as described in Statistical Learning Methods, is a linear binary classification model defined by a separating hyperplane. It occupies only the second chapter of that book and could hardly be simpler. Yet in practice, the simpler the model, the more durable it tends to be.
The one thing worth adding here is that the perceptron is an online learning model: after seeing each training example, it can immediately update the whole model.
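To make the online update concrete, here is a minimal sketch of the classic binary perceptron (my own illustration, not code from this segmenter), with examples represented sparsely as arrays of feature indices and labels of +1 or -1:

// Minimal sketch of the binary perceptron's online update (illustrative only).
// Features are sparse indices into the weight vector; the label y is +1 or -1.
public class BinaryPerceptron {
    private final float[] weight;

    public BinaryPerceptron(int featureCount) {
        this.weight = new float[featureCount];
    }

    public int predict(int[] features) {
        float score = 0f;
        for (int f : features) score += weight[f];
        return score >= 0 ? +1 : -1;
    }

    // Online learning: after each example, the model may be updated immediately.
    public void update(int[] features, int gold) {
        if (predict(features) != gold) {
            for (int f : features) weight[f] += gold;  // add or subtract 1 on mistakes
        }
    }
}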
Multi-class classification
How do we extend binary classification to multiple classes? One way is to use multiple classifiers: for the four BMES tags, use four perceptrons, each responsible for one binary question ("is it B?", "is it M?", "is it E?", "is it S?"). In the implementation, of course, there is no need to literally create four perceptrons. By splicing their weight vectors together, the model can output "the score of B", "the score of M", "the score of E" and "the score of S"; taking the largest already gives a basic multi-class classifier. In word segmentation, however, transition features and an HMM-style Viterbi search also come into play, which are discussed later.
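The "spliced weight vectors" idea can be sketched roughly as follows; the four-tag layout and the feature indexing are assumptions for illustration, not this segmenter's actual code:

// Illustrative sketch of multi-class scoring with one weight block per BMES tag.
public class MultiClassPerceptron {
    static final int TAGS = 4;                // 0=B, 1=M, 2=E, 3=S
    private final float[] weight;             // four weight vectors spliced together
    private final int featureCount;

    public MultiClassPerceptron(int featureCount) {
        this.featureCount = featureCount;
        this.weight = new float[featureCount * TAGS];
    }

    // Score every tag and return the arg max: a greedy per-character classifier.
    public int predict(int[] features) {
        int best = 0;
        float bestScore = Float.NEGATIVE_INFINITY;
        for (int tag = 0; tag < TAGS; tag++) {
            float score = 0f;
            for (int f : features) score += weight[tag * featureCount + f];
            if (score > bestScore) { bestScore = score; best = tag; }
        }
        return best;
    }

    // On a mistake, reward the gold tag's block and penalize the predicted tag's block.
    public void update(int[] features, int gold) {
        int predicted = predict(features);
        if (predicted != gold) {
            for (int f : features) {
                weight[gold * featureCount + f] += 1f;
                weight[predicted * featureCount + f] -= 1f;
            }
        }
    }
}

Transition features and the Viterbi search discussed below turn this greedy per-character classifier into a sequence model.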
Averaged perceptron
The averaged perceptron keeps a running sum of every feature weight and, at the end of training, averages them to produce the final model. Why go to this trouble?
As mentioned above, the perceptron is an online learner: after each training example it may update the whole model. Suppose there are 10,000 examples and the model answers the first 9,999 perfectly, which suggests it is already close to ideal. If the last example happens to be a noise point, the naive perceptron will still modify itself after mispredicting it, possibly breaking its behaviour on the previous 9,999 examples and wasting the earlier training effort.
What is the remedy? One scheme is voting: record how many examples each intermediate model classifies correctly as its number of votes, and at the end of training take the model with the most votes as the final one. But this algorithm is impractical: with 5 iterations over 10,000 instances you would need to store 50,000 models and their vote counts, which is far too wasteful.
The practical choice is the averaged perceptron: add up the weight vectors of those 50,000 intermediate models and divide by 50,000 at the end, so that at any moment only one extra cumulative value per weight needs to be recorded, which is very efficient. For more on the averaged perceptron, see "200 lines of Python code to implement a perceptron POS tagger". Although that article is about part-of-speech tagging, the idea carries over directly to segmentation.
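A naive version of that running-sum trick might look like the sketch below (a real implementation usually accumulates lazily with timestamps rather than touching every weight on every example):

// Sketch of the averaged perceptron idea: keep a running sum of every weight
// and divide by the number of accumulation steps at the end (naive version).
public class AveragedWeights {
    private final float[] weight;     // current weights
    private final float[] total;      // sum of weights over all steps
    private long steps = 0;

    public AveragedWeights(int size) {
        weight = new float[size];
        total = new float[size];
    }

    public void update(int[] features, float delta) {
        for (int f : features) weight[f] += delta;
    }

    // Call once per processed example so the sum covers every intermediate model.
    public void accumulate() {
        steps++;
        for (int i = 0; i < weight.length; i++) total[i] += weight[i];
    }

    // The final model: the average of all intermediate models.
    public float[] average() {
        float[] avg = new float[weight.length];
        for (int i = 0; i < weight.length; i++) avg[i] = total[i] / steps;
        return avg;
    }
}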
Language model
HMM
Aren't we talking about a perceptron segmenter? What does it have to do with HMMs?
In fact, no segmenter based on sequence tagging can do without a hidden Markov chain, that is, the bigram (or even higher-order n-gram) transition scores between the four BMES tags. The AP segmenter is no exception: it takes the tag of the previous character as a feature. This feature is clearly useful for predicting the current tag; for example, if the previous tag is B, the current tag can never be S.
Features like y[i-1] are generally called transition features in linear-chain graphical models, while features that do not involve y[i-1] are called state features.
Viterbi
Since the AP segmenter uses transition features, Viterbi search is indispensable; it is also essential for optimizing the accuracy of the whole sequence. Given the three elements of a hidden Markov model, I wrote a piece of "runnable pseudocode" in Java:
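The original snippet is not reproduced here; the following is a minimal Viterbi sketch over the four BMES tags, assuming per-position tag scores and a 4x4 transition score matrix (the names and array shapes are my own, not the author's):

// Minimal Viterbi sketch over BMES tags: score[i][t] is the score of tag t at
// character i, trans[p][t] the transition score from tag p to tag t. Works with
// log-probabilities or raw perceptron scores alike.
public static int[] viterbi(float[][] score, float[][] trans) {
    int n = score.length, tags = trans.length;
    if (n == 0) return new int[0];
    float[][] best = new float[n][tags];
    int[][] back = new int[n][tags];
    best[0] = score[0].clone();
    for (int i = 1; i < n; i++) {
        for (int t = 0; t < tags; t++) {
            best[i][t] = Float.NEGATIVE_INFINITY;
            for (int p = 0; p < tags; p++) {
                float s = best[i - 1][p] + trans[p][t] + score[i][t];
                if (s > best[i][t]) { best[i][t] = s; back[i][t] = p; }
            }
        }
    }
    // Trace back from the best final tag.
    int[] path = new int[n];
    int last = 0;
    for (int t = 1; t < tags; t++) if (best[n - 1][t] > best[n - 1][last]) last = t;
    path[n - 1] = last;
    for (int i = n - 1; i > 0; i--) path[i - 1] = back[i][path[i]];
    return path;
}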
An implementation like this is a prototype that values clarity over efficiency; as the saying goes, "premature optimization is the root of all evil." I trust readers can work out what is going on.
Feature extraction
Define the character sequence as x and the tag sequence as y.
Transition features
The transition feature is the y[i-1] mentioned above.
State features
I used seven state features in total:
Deng Zhilong's thesis "Design and Implementation of an Efficient Chinese Word Segmentation and Part-of-Speech Tagging System Based on the Perceptron Algorithm" suggests using more complex character n-grams, character-category n-grams, reduplicated characters, dictionaries and other features. In my practice, however, every feature I dropped beyond the seven above made my AP segmenter slightly more accurate, perhaps because the corpus differs or because the feature extraction is implemented differently. In short, the emphasis here is on lean, efficient features.
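Purely for illustration, and not necessarily the author's exact seven features, character-window state features around position i could be extracted like this:

// Illustrative character-window state features (an assumption for demonstration;
// not necessarily the exact seven used here). '#' pads the sentence ends.
import java.util.ArrayList;
import java.util.List;

public class StateFeatures {
    public static List<String> extract(String sentence, int i) {
        char prev2 = i >= 2 ? sentence.charAt(i - 2) : '#';
        char prev  = i >= 1 ? sentence.charAt(i - 1) : '#';
        char cur   = sentence.charAt(i);
        char next  = i + 1 < sentence.length() ? sentence.charAt(i + 1) : '#';
        char next2 = i + 2 < sentence.length() ? sentence.charAt(i + 2) : '#';
        List<String> features = new ArrayList<>();
        features.add("U1=" + prev);            // previous character
        features.add("U2=" + cur);             // current character
        features.add("U3=" + next);            // next character
        features.add("B1=" + prev2 + prev);    // character bigrams around i
        features.add("B2=" + prev + cur);
        features.add("B3=" + cur + next);
        features.add("B4=" + next + next2);
        return features;
    }
}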
Training
In fact, not many iterations are needed; the model essentially converges within three iterations:
The fourth iteration seems to bring no gain, but fortunately we are using an averaged perceptron: after the weights are averaged, the model's performance improves.
The model size at this time:
Model pruning
The model pruning strategy described in "Design and Implementation of an Efficient Chinese Word Segmentation and Part-of-Speech Tagging System Based on the Perceptron Algorithm" works well. I set the compression ratio to 0.2, that is, 20% of the features are pruned, and the model accuracy stays unchanged:
Because I shuffle the training data randomly, accuracy fluctuates slightly from run to run. Here you can see that pruning took about one extra minute, and accuracy stayed at 96.11 after pruning.
The model size at this time:
How about pruning 50%?
The model size at this time:
As you can see, after 80% of the features are pruned, the model shrinks from 54 MB to 11 MB while accuracy drops by less than 0.1 percentage points. This shows that most features are redundant and that feature pruning is both effective and easy to use.
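As a rough illustration of how such pruning can work (the thesis's exact criterion may differ), one can rank features by the total magnitude of their weights and drop the lowest-ranked fraction:

// Illustrative magnitude-based feature pruning (an assumption; the thesis's exact
// criterion may differ): drop the fraction `ratio` of features with the smallest
// summed absolute weight across all tags.
import java.util.Arrays;

public class FeaturePruner {
    /** Returns the sorted indices of features to keep after pruning `ratio` of them. */
    public static int[] keepIndices(float[][] weightByFeature, double ratio) {
        int n = weightByFeature.length;
        Integer[] order = new Integer[n];
        double[] mass = new double[n];
        for (int f = 0; f < n; f++) {
            order[f] = f;
            for (float w : weightByFeature[f]) mass[f] += Math.abs(w);
        }
        // Sort features by descending total |weight| and keep the top (1 - ratio).
        Arrays.sort(order, (a, b) -> Double.compare(mass[b], mass[a]));
        int keep = (int) Math.round(n * (1.0 - ratio));
        int[] kept = new int[keep];
        for (int i = 0; i < keep; i++) kept[i] = order[i];
        Arrays.sort(kept);
        return kept;
    }
}

Dropping low-magnitude features barely moves the accuracy, which is consistent with the observation above that most features contribute little.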