Java implementation of a Chinese dependency parser based on CRF sequence tagging


Updated 2025-07-19. Source: SLTechnology News&Howtos (shulou), Development


This article describes in detail a Java implementation of a Chinese dependency parser based on CRF sequence tagging.

This is a Chinese dependency parser based on CRF. The feature functions of the internal CRF model are stored in a double-array trie (DoubleArrayTrie), and decoding uses a specialized backward Viterbi algorithm. Compared with the maximum entropy dependency parser implementation, parsing speed has doubled, reaching 1262.8655 sent/s.

Open source project

The code from this article has been integrated into the open source project HanLP, and the latest version, hanlp 1.7, has been released.

Introduction to CRF

CRF is a commonly used model for sequence tagging. It can use more features than an HMM and is more resistant to the label bias problem than an MEMM. The training tool most often used in production is CRF++. For details on using CRF++ and its model format, see "CRF++ Model format description".

CRF training

Corpus

As in the implementation of the maximum entropy dependency parser, 20,000 sentences from Tsinghua University's semantic dependency network corpus are used as the training set.

Preprocessing

A dependency relation consists of three elements: the start point, the end point, and the relation name. This CRF model ignores the relation name for now (another model fills it in later; see below).

According to dependency grammar theory, two main factors determine the dependency between two words: direction and distance. We therefore define the class tag in the following form:

[+|-]dPOS

Here [+|-] indicates direction: + means that the head word appears after the dependent word in the sentence, and - means that the head word appears before the dependent word. POS is the part-of-speech category of the head word, and d is the distance.
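The tag scheme above can be sketched in Java as follows. This is a minimal illustration, not the HanLP code. It assumes that d counts the words carrying the head's part of speech between the dependent and the head, with the head itself included, which is how the legality rule in the decoding section reads. The class name DependencyTagEncoder, the ROOT tag for the root word, and the use of -1 as the root's head index are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: converts head indices into [+|-]dPOS tags (names are hypothetical). */
final class DependencyTagEncoder {
    /** Hypothetical tag for the word that has no head inside the sentence. */
    static final String ROOT_TAG = "ROOT";

    /**
     * @param pos   part of speech of each word (0-based)
     * @param heads 0-based index of each word's head, or -1 for the root word
     * @return one tag per word
     */
    static List<String> encode(String[] pos, int[] heads) {
        List<String> tags = new ArrayList<>(pos.length);
        for (int i = 0; i < pos.length; i++) {
            int head = heads[i];
            if (head < 0) {
                tags.add(ROOT_TAG);
                continue;
            }
            if (head == i || head >= pos.length) {
                throw new IllegalArgumentException("invalid head for word " + i);
            }
            int step = head > i ? 1 : -1; // +1: head is to the right, -1: to the left
            int d = 0;
            // Assumption: d counts words with the head's POS, walking from the
            // neighbor of word i up to and including the head.
            for (int k = i + step; k != head + step; k += step) {
                if (pos[k].equals(pos[head])) {
                    d++;
                }
            }
            tags.add((step > 0 ? "+" : "-") + d + pos[head]);
        }
        return tags;
    }
}
```

For a three-word sentence with parts of speech r, v, ns, where the verb is the root and governs both other words, encode returns +1v, ROOT, -1v.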

For example, the original treebank:

After conversion:

Feature template

Training parameters

crf_learn -f 3 -c 4.0 -p 3 template.txt train.txt model -t

My experimental setup (machine performance) is limited: each iteration takes 5 minutes, so I could only set the maximum number of iterations to 100. After this painful training run, the resulting model performs poorly, with a serr (sentence error rate) as high as 50%, so for now it is used only to test the algorithm.

Decoding

The standard Viterbi algorithm assumes that every tag is legal at every position, but in this CRF model the tags are also constrained by the sentence. For example, the tag of the last word cannot be +nPOS; it must be negative. More generally, a tag [+|-]nPOS on any word requires that at least n words with part of speech POS appear after it (or before it, when the sign is negative). I therefore rewrote the CRF's Viterbi tagging algorithm; the code is as follows:

Note this line in the code above:

if (!isLegal(j, i, table)) continue;

It ensures that only legal tags are considered.
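The legality constraint can be sketched as follows. This is not the isLegal(j, i, table) method referred to above, whose body and parameters are not shown here. It is a hypothetical stand-in with its own signature, assuming tags of the form sign, digits, POS (for example +1v), a hypothetical ROOT tag for the root word, and d counted over words that carry the given part of speech. The same scan also recovers the head index that a legal tag points to.

```java
/** Sketch of a tag legality check; class and method signatures are hypothetical. */
final class DependencyTagDecoder {
    /**
     * Finds the head that the tag points to from word i.
     *
     * @return 0-based head index, or -1 if the sentence has no such word
     */
    static int resolveHead(int i, String tag, String[] pos) {
        if (tag.length() < 3 || (tag.charAt(0) != '+' && tag.charAt(0) != '-')) {
            return -1; // not of the form [+|-]dPOS
        }
        int step = tag.charAt(0) == '+' ? 1 : -1;
        int p = 1;
        while (p < tag.length() && Character.isDigit(tag.charAt(p))) {
            p++;
        }
        if (p == 1 || p == tag.length()) {
            return -1; // missing distance or missing POS
        }
        int d = Integer.parseInt(tag.substring(1, p));
        String headPos = tag.substring(p);
        // Walk away from word i and stop at the d-th word whose POS matches.
        for (int k = i + step; k >= 0 && k < pos.length; k += step) {
            if (pos[k].equals(headPos) && --d == 0) {
                return k;
            }
        }
        return -1;
    }

    /** A tag is legal if it is the (hypothetical) ROOT tag or resolves to a word. */
    static boolean isLegal(int i, String tag, String[] pos) {
        return "ROOT".equals(tag) || resolveHead(i, tag, pos) >= 0;
    }
}
```

Inside a Viterbi loop, a check of this kind would be applied to every candidate tag at every position, so that illegal tags never enter the lattice. For instance, any + tag on the last word fails because no words follow it.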

The result of this step:

Post-processing

Once each word's head is known, we still need the specific name of the dependency relation. I counted the probabilities of word and part-of-speech combinations in the treebank and call the result the 2-gram model. This model takes the words at both ends of a dependency and outputs the most likely relation name.
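A count-based model of this kind might look like the sketch below. It is an illustration under stated assumptions, not the HanLP implementation: the class RelationBigramModel, the key layout, and the back-off from the word pair to the part-of-speech pair are all hypothetical choices, and the article does not say how the original model combines words and parts of speech.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: picks the most frequent relation name for a (dependent, head) pair. */
final class RelationBigramModel {
    // key ("w:" word pair or "p:" POS pair) -> relation name -> count
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Records one dependency arc observed in the treebank. */
    void add(String depWord, String depPos, String headWord, String headPos, String relation) {
        increment("w:" + depWord + "@" + headWord, relation);
        increment("p:" + depPos + "@" + headPos, relation);
    }

    /**
     * Returns the most frequent relation for the word pair, backing off to the
     * POS pair (an assumption), and finally to the given fallback.
     */
    String predict(String depWord, String depPos, String headWord, String headPos, String fallback) {
        String best = argmax(counts.get("w:" + depWord + "@" + headWord));
        if (best == null) {
            best = argmax(counts.get("p:" + depPos + "@" + headPos));
        }
        return best != null ? best : fallback;
    }

    private void increment(String key, String relation) {
        counts.computeIfAbsent(key, k -> new HashMap<>()).merge(relation, 1, Integer::sum);
    }

    private static String argmax(Map<String, Integer> relationCounts) {
        if (relationCounts == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : relationCounts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

Picking the highest count is equivalent to picking the highest relative frequency for a given pair, so no explicit normalization is needed at prediction time.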

The final result

Convert to CoNLL format output:
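A CoNLL-X row has ten tab-separated columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, and PDEPREL. The conversion step can be sketched roughly as follows. The class ConllWriter and its array-based input are hypothetical, the word form is reused as the lemma, and heads are assumed to be 0-based with -1 for the root, so that the root's HEAD column becomes 0.

```java
/** Sketch: renders a parsed sentence as CoNLL-X rows (names are hypothetical). */
final class ConllWriter {
    /**
     * @param words     word forms
     * @param pos       part of speech of each word
     * @param heads     0-based head index of each word, -1 for the root word
     * @param relations relation name of each word's arc to its head
     */
    static String toConll(String[] words, String[] pos, int[] heads, String[] relations) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            sb.append(i + 1).append('\t')         // ID, 1-based
              .append(words[i]).append('\t')      // FORM
              .append(words[i]).append('\t')      // LEMMA (form reused, an assumption)
              .append(pos[i]).append('\t')        // CPOSTAG
              .append(pos[i]).append('\t')        // POSTAG
              .append("_\t")                      // FEATS, unused
              .append(heads[i] + 1).append('\t')  // HEAD, 0 means root
              .append(relations[i]).append('\t')  // DEPREL
              .append("_\t_\n");                  // PHEAD, PDEPREL, unused
        }
        return sb.toString();
    }
}
```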


