How to Parse the Chinese Word Segmentation Algorithm in the HanLP Source Code

Updated 2025-02-14. From: SLTechnology News&Howtos

Shulou (Shulou.com), 05/31

Many newcomers are unclear about how the Chinese word segmentation algorithm in the HanLP source code works. This article walks through it in detail for anyone who needs it.

A word graph is a graph made up of all the words that could occur in a sentence. If a word A can be followed by a word B, there is an edge E between A and B. A word may have more than one successor and more than one predecessor; together these words and edges form what I call a word graph.

The word graph is modelled as a sparse two-dimensional matrix: taking the starting position of each word as its row and the ending position as its column gives the matrix. For example, take the sentence glossed here as "what he said is true".
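As a rough sketch of the row and column idea above, the following Java code records every dictionary word found in a sentence as a (row, col) cell. The toy dictionary, the ASCII sentence, and the names `WordGraphMatrixSketch`, `Cell`, and `build` are assumptions made for illustration; positions are 0-based with an exclusive end, which may differ from the numbering used in the original figures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class WordGraphMatrixSketch {
    /** One non-empty cell of the sparse matrix: a word covering positions [row, col). */
    static final class Cell {
        final int row;      // start position of the word
        final int col;      // end position of the word (exclusive)
        final String word;

        Cell(int row, int col, String word) {
            this.row = row;
            this.col = col;
            this.word = word;
        }
    }

    /** Records every substring of the sentence that appears in the dictionary. */
    static List<Cell> build(String sentence, Set<String> dictionary) {
        List<Cell> cells = new ArrayList<>();
        for (int row = 0; row < sentence.length(); row++) {
            for (int col = row + 1; col <= sentence.length(); col++) {
                String candidate = sentence.substring(row, col);
                if (dictionary.contains(candidate)) {
                    cells.add(new Cell(row, col, candidate));
                }
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Hypothetical toy dictionary; a real segmenter would look words up in its core dictionary.
        Set<String> dictionary = new HashSet<>(Arrays.asList("a", "ab", "b", "bc", "c"));
        for (Cell cell : build("abc", dictionary)) {
            System.out.println(cell.word + " row=" + cell.row + " col=" + cell.col);
        }
    }
}
```

With an exclusive end position, a word in column n is followed exactly by the words in row n, which is the relationship both storage methods rely on.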

There are two ways to store a word graph: the DynamicArray method and the fast offset method. The HanLP code uses the second.

1. DynamicArray (two-dimensional array) method

In the word graph, rows and columns are related as follows: every word in column n (col = n) can be followed by any word in row n (row = n). For example, the word "indeed", whose col = 5, needs its smoothed value computed against two words (both rendered as "real" in this translation), which are both in row 5. However, traversal and insertion have to compare col and row values one entry at a time, so the complexity is O(N).
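The cost of this approach can be seen in a minimal sketch, assuming the cells are kept in one list sorted by (row, col). The class `DynamicArraySketch` and its methods are hypothetical names for illustration, not the original implementation; both insertion and successor lookup walk the list, hence O(N).

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;

final class DynamicArraySketch {
    static final class Cell {
        final int row;
        final int col;
        final String word;

        Cell(int row, int col, String word) {
            this.row = row;
            this.col = col;
            this.word = word;
        }
    }

    // All cells of the sparse matrix, kept sorted by (row, col).
    private final LinkedList<Cell> cells = new LinkedList<>();

    /** Inserts a cell at its sorted position; the linear scan makes this O(N). */
    void insert(int row, int col, String word) {
        ListIterator<Cell> it = cells.listIterator();
        while (it.hasNext()) {
            Cell current = it.next();
            if (current.row > row || (current.row == row && current.col > col)) {
                it.previous(); // step back so the new cell goes before 'current'
                break;
            }
        }
        it.add(new Cell(row, col, word));
    }

    /** Finds every word whose row equals the given col; again a full scan, O(N). */
    List<String> successorsOf(int col) {
        List<String> result = new ArrayList<>();
        for (Cell cell : cells) {
            if (cell.row == col) {
                result.add(cell.word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        DynamicArraySketch graph = new DynamicArraySketch();
        graph.insert(1, 2, "b");
        graph.insert(0, 2, "ab");
        graph.insert(2, 3, "c");
        graph.insert(0, 1, "a");
        graph.insert(1, 3, "bc");
        System.out.println(graph.successorsOf(1)); // words that can follow "a"
    }
}
```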

2. Fast offset method

The word graph is a one-dimensional array in which each element is a singly linked list.

"indeed" is on line 4 and has length 2; 4 + 2 = 6, so the two words on line 6 ("real" / "real" in this translation) are the successors of "indeed".

This method is also very fast: both insertion and lookup take O(1) time.
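The offset rule can be sketched in a few lines of Java. This is an illustration under assumptions rather than HanLP's actual class: `FastOffsetSketch` and its methods are hypothetical names, and the strings are ASCII placeholders standing in for the words in the example (a two-character word on line 4, two words on line 6).

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class FastOffsetSketch {
    // lines.get(i) is the linked list of words that start on line i.
    private final List<LinkedList<String>> lines = new ArrayList<>();

    FastOffsetSketch(int lineCount) {
        for (int i = 0; i < lineCount; i++) {
            lines.add(new LinkedList<>());
        }
    }

    /** O(1): index into the array, then append to that line's list. */
    void add(int line, String word) {
        lines.get(line).addLast(word);
    }

    /** O(1): the successors are whatever sits on line (line + word length). */
    List<String> successorsOf(int line, String word) {
        return lines.get(line + word.length());
    }

    public static void main(String[] args) {
        FastOffsetSketch graph = new FastOffsetSketch(9);
        graph.add(4, "XY");  // placeholder for a two-character word on line 4
        graph.add(6, "P");   // placeholders for the two words on line 6
        graph.add(6, "PQ");
        System.out.println(graph.successorsOf(4, "XY")); // 4 + 2 = 6, so line 6
    }
}
```

No comparison of row and column values is needed: the start line plus the word length is the index of the successor list.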

HanLP core dictionary:

Shortest path algorithm: Viterbi (dynamic programming)

frequency: frequency of the word in the core dictionary

nTwoWordsFreq: co-occurrence frequency of the two words

int MAX_FREQUENCY = 25146057

double dTemp = (double) 1 / MAX_FREQUENCY + 0.00001

dSmoothingPara = 0.1
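The text lists these quantities but not the formula that combines them. The sketch below shows one plausible way to turn them into an edge weight, as the negative log of a smoothed bigram probability. The formula, the class name `EdgeWeightSketch`, and the method `weight` are assumptions for illustration; only the three constants come from the list above.

```java
final class EdgeWeightSketch {
    static final int MAX_FREQUENCY = 25146057;
    static final double dTemp = (double) 1 / MAX_FREQUENCY + 0.00001;
    static final double dSmoothingPara = 0.1;

    /**
     * Assumed cost of the edge from a word to its successor.
     *
     * @param frequency     frequency of the preceding word in the core dictionary
     * @param nTwoWordsFreq co-occurrence frequency of the two words
     */
    static double weight(int frequency, int nTwoWordsFreq) {
        if (frequency == 0) {
            frequency = 1; // avoid division by zero for unseen words
        }
        // Unigram share: how common the preceding word is overall.
        double unigram = dSmoothingPara * frequency / MAX_FREQUENCY;
        // Bigram share: how often the pair occurs, with dTemp as a floor for unseen pairs.
        double bigram = (1 - dSmoothingPara)
                * ((1 - dTemp) * nTwoWordsFreq / frequency + dTemp);
        // A higher probability gives a lower cost, so the shortest path is the likeliest one.
        return -Math.log(unigram + bigram);
    }

    public static void main(String[] args) {
        System.out.println(weight(1000, 500)); // frequent pair
        System.out.println(weight(1000, 0));   // pair never seen together
    }
}
```

Under this assumed formula, a pair that co-occurs often gets a small weight and an unseen pair gets a large one, which is what a shortest-path search needs.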

Viterbi shortest path over the directed graph

1. The calculation runs from top to bottom. Each node's predecessor is updated according to the computed weight, so that every node ends up with a single predecessor (dynamic programming).

2. After the calculation, start from the last node and follow each node's predecessor in turn, collecting the terms along the way. The result, in reverse order from the last word back to the first: Li, in, indeed, said, he
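The two steps above can be sketched as a forward pass that keeps one predecessor per node, followed by a backward walk from the end node. This is a minimal illustration, not HanLP's code: `ViterbiSketch`, `Node`, `Weight`, the sentinel words "^" and "$", the ASCII toy graph, and the demo weights are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ViterbiSketch {
    static final class Node {
        final String word;
        Node from;                               // best predecessor found so far
        double cost = Double.POSITIVE_INFINITY;  // cheapest known cost from the start node

        Node(String word) {
            this.word = word;
        }
    }

    /** Hypothetical edge-weight callback; any non-negative cost function fits here. */
    interface Weight {
        double between(Node from, Node to);
    }

    /** Builds one line of the word graph from the words that start at that position. */
    static List<Node> line(String... words) {
        List<Node> nodes = new ArrayList<>();
        for (String word : words) {
            nodes.add(new Node(word));
        }
        return nodes;
    }

    /** lines.get(0) holds the start node and the last line holds the end node. */
    static List<String> shortestPath(List<List<Node>> lines, Weight weight) {
        lines.get(0).get(0).cost = 0.0;
        // Step 1: go line by line, replacing a node's predecessor whenever a cheaper one appears.
        for (int i = 0; i < lines.size() - 1; i++) {
            for (Node node : lines.get(i)) {
                for (Node to : lines.get(i + node.word.length())) {
                    double candidate = node.cost + weight.between(node, to);
                    if (candidate < to.cost) {
                        to.cost = candidate;
                        to.from = node;
                    }
                }
            }
        }
        // Step 2: follow predecessors back from the end node, skipping both sentinels.
        List<String> path = new ArrayList<>();
        Node end = lines.get(lines.size() - 1).get(0);
        for (Node n = end.from; n != null && n.from != null; n = n.from) {
            path.add(n.word);
        }
        Collections.reverse(path); // the walk yields the words last to first
        return path;
    }

    static List<String> demo() {
        // Toy graph for the string "abc": "^" and "$" are one-character start and end sentinels.
        List<List<Node>> lines = Arrays.asList(
                line("^"), line("a", "ab"), line("b", "bc"), line("c"), line("$"));
        // Demo weights: every edge costs 1, except edges into "bc", which cost 5.
        return shortestPath(lines, (from, to) -> to.word.equals("bc") ? 5.0 : 1.0);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because each node stores only its best predecessor, the backward walk is unambiguous, and reversing it gives the segmentation in reading order.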
