How to identify and analyze people's names in HanLP

2025-05-08 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how HanLP recognizes person names, walking through the relevant source code and the algorithm behind it step by step.

Word segmentation

The article "HMM and Word Segmentation, Part-of-Speech Tagging, Named Entity Recognition" states:

Word segmentation: given a sequence of characters, find the most likely sequence of labels (a sequence of boundary labels: [word end] or [not word end]). Jieba currently segments text using BMES tags: B (beginning), M (middle), E (end), S (single-character word).

Word segmentation is likewise solved with the Viterbi algorithm, a dynamic-programming method. For details, see the article "Principles of Word Segmentation in Text Mining".
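To make the BMES scheme concrete, here is a minimal Java sketch that turns a tag sequence back into words. The class name BmesDemo and the sample tagging are assumptions made for illustration; they are not part of Jieba or HanLP, and the sketch only covers reading the tags, not computing them.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy BMES decoder: turns a per-character tag sequence into words. */
class BmesDemo {
    /**
     * @param text the sentence, one tag per character
     * @param tags a string over {B, M, E, S} with the same length as text
     */
    static List<String> decode(String text, String tags) {
        if (text.length() != tags.length()) {
            throw new IllegalArgumentException("one tag per character is required");
        }
        List<String> words = new ArrayList<>();
        int wordStart = 0;
        for (int i = 0; i < text.length(); i++) {
            char tag = tags.charAt(i);
            if (tag == 'B') {
                wordStart = i;                                // a multi-character word begins here
            } else if (tag == 'E') {
                words.add(text.substring(wordStart, i + 1));  // close the word opened at B
            } else if (tag == 'S') {
                words.add(text.substring(i, i + 1));          // single-character word
            }
            // 'M' simply continues the current word
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical tagging chosen by hand for this illustration.
        System.out.println(decode("张学友的歌", "SBESS"));
    }
}
```

In a real segmenter the tag string would come from Viterbi decoding; here it is supplied by hand so that only the tag-to-word step is shown.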

Role observation

Take the sentence "唱首张学友的歌情已逝" ("Sing a Jacky Cheung song, Love Is Gone") as an example.

Begin with the start vertex 始##始 (start##start): it is assigned the roles NR.A and NR.K, each with a default frequency of 1.

iterator.next();
tagList.add(new EnumItem<NR>(NR.A, NR.K)); // 始##始 (start##start): A K

The first word, "唱首" ("sing a"), is not in nr.txt, so EnumItem<NR> nrEnumItem = PersonDictionary.dictionary.get(vertex.realWord); returns null, and a role is guessed from the word's own part of speech:

switch (vertex.guessNature()) {
    case nr:
        // ... (handling elided)
    case nnt:
        // ... (handling elided)
    default: {
        nrEnumItem = new EnumItem<NR>(NR.A, PersonDictionary.transformMatrixDictionary.getTotalFrequency(NR.A));
    }
}

Because the attribute of "唱首" is "nz 16" (part of speech nz with frequency 16), which is neither nr nor nnt, it gets the default role NR.A, with a frequency equal to the total frequency of role NR.A in nr.tr.txt.

At this point the role list holds two entries: the start vertex with roles A and K, and the first word with role A.

Next is the vertex "Zhang" (张). Because "Zhang" is in nr.txt, PersonDictionary.dictionary.get(vertex.realWord) returns an EnumItem object, which is added directly to the role list:

EnumItem<NR> nrEnumItem = PersonDictionary.dictionary.get(vertex.realWord);
tagList.add(nrEnumItem);

[Figure: the role list after adding "Zhang" (张)]

[Figure: the role list for the whole sentence "唱首张学友的歌情已逝" ("Sing a Jacky Cheung song, Love Is Gone")]

At this point, role observation is complete.

To sum up, role observation first splits the sentence into words with the word segmentation algorithm, then looks up each word in the person-name dictionary (PersonDictionary).

If the word is in the name dictionary (nr.txt), its roles are recorded. All roles are defined in com.hankcs.hanlp.corpus.tag.NR.java.

If the word is not in the name dictionary, a role is guessed from the word's attribute. During guessing, some words may already be tagged nr or nnt in the core dictionary, in which case they receive special split handling. In all other cases the word is assigned NR.A, with a frequency equal to NR.A's total frequency in the transition matrix.
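The lookup-with-fallback logic summarized above can be sketched roughly as follows. This is a simplified illustration, not HanLP's code: the class RoleObservationDemo, the tiny dictionary, and every frequency in it are made up, roles are plain strings instead of the NR enum, and the special handling of nr and nnt words is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy role observation: dictionary lookup with a default-role fallback. */
class RoleObservationDemo {
    // Hypothetical stand-in for nr.txt: word -> (role -> frequency). Numbers are made up.
    static final Map<String, Map<String, Integer>> PERSON_DICT = new LinkedHashMap<>();
    // Hypothetical stand-in for the total frequency of role A in nr.tr.txt.
    static final int TOTAL_FREQUENCY_OF_A = 1000;

    static {
        Map<String, Integer> zhang = new LinkedHashMap<>();
        zhang.put("B", 50);
        zhang.put("K", 5);
        PERSON_DICT.put("张", zhang);
    }

    /** Returns one role map per word, in sentence order. */
    static List<Map<String, Integer>> observe(List<String> words) {
        List<Map<String, Integer>> roleTagList = new ArrayList<>();
        for (String word : words) {
            Map<String, Integer> roles = PERSON_DICT.get(word);
            if (roles == null) {
                // Not in the name dictionary: fall back to role A with A's total frequency.
                roles = new LinkedHashMap<>();
                roles.put("A", TOTAL_FREQUENCY_OF_A);
            }
            roleTagList.add(roles);
        }
        return roleTagList;
    }

    public static void main(String[] args) {
        System.out.println(observe(List.of("唱首", "张")));
    }
}
```

The point of the sketch is only the branching: a dictionary hit contributes all of the word's recorded roles, while a miss contributes the single default role.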

Viterbi Algorithm (Dynamic Programming) for Optimal Path

In the figure above, each word is labeled with role tags, and a single word can carry several of them. We need to choose the best role path through these words, that is, the path with the lowest total cost. For details, see the referenced article explaining the Viterbi algorithm for hidden Markov models.

List<NR> nrList = viterbiComputeSimply(roleTagList);
// some code ...
return Viterbi.computeEnumSimply(roleTagList, PersonDictionary.transformMatrixDictionary);

This process is Viterbi decoding of the hidden state sequence. Here, the HMM five-tuple is:

Hidden state set: the person-name role tags defined in com.hankcs.hanlp.corpus.tag.NR.java

Observation set: the segmented words held in the elements of tagList (equivalent to the word segmentation result)

Transition probability matrix: generated from the nr.tr.txt file

Emission probability: the frequency with which a word carries a given name tag (hidden state), divided by the total frequency of that tag:

Math.log((item.getFrequency(cur) + 1e-8) / transformMatrixDictionary.getTotalFrequency(cur))

Initial state (始##始, start##start) and end state (末##末, end##end)

The core dynamic-programming code that decodes the hidden states is as follows:

for (E cur : item.labelMap.keySet()) {
    double now = transformMatrixDictionary.transititon_probability[pre.ordinal()][cur.ordinal()]
            - Math.log((item.getFrequency(cur) + 1e-8) / transformMatrixDictionary.getTotalFrequency(cur));
    if (perfect_cost > now) {
        perfect_cost = now;
        perfect_tag = cur;
    }
}

transformMatrixDictionary.transititon_probability[pre.ordinal()][cur.ordinal()] is the transition term from the previous hidden state pre to the current hidden state cur. Math.log((item.getFrequency(cur) + 1e-8) / transformMatrixDictionary.getTotalFrequency(cur)) is the log emission probability of the current hidden state. Subtracting the second from the first gives a cost, stored in the double variable now; the for loop then picks the hidden state perfect_tag with the lowest cost (minimum perfect_cost) for the current observation.
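The scoring loop can be tried out on toy data with the self-contained sketch below. Everything in it is an assumption made for illustration: the three-role enum, the cost matrix, and the frequencies are invented, and the matrix is assumed to hold negative log probabilities so that a smaller value means a more likely transition. Like the loop quoted above, it carries forward only the single best previous role at each step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy version of the role decoding loop, with made-up numbers. */
class GreedyDecodeDemo {
    enum Role { A, B, K }

    // Hypothetical transition costs (assumed negative log probabilities), indexed [pre][cur].
    static final double[][] TRANSITION_COST = {
        /* from A */ {0.5, 2.0, 1.0},
        /* from B */ {1.5, 3.0, 2.5},
        /* from K */ {2.0, 0.3, 3.0},
    };
    // Hypothetical total frequency of each role, indexed by ordinal.
    static final int[] TOTAL_FREQUENCY = {1000, 200, 300};

    /** Each observation maps candidate roles to the word's frequency under that role. */
    static List<Role> decode(List<Map<Role, Integer>> roleTagList) {
        List<Role> path = new ArrayList<>();
        // The first vertex is fixed: take its first candidate role.
        Role pre = roleTagList.get(0).keySet().iterator().next();
        path.add(pre);
        for (int i = 1; i < roleTagList.size(); i++) {
            double perfectCost = Double.MAX_VALUE;
            Role perfectTag = pre;
            for (Map.Entry<Role, Integer> e : roleTagList.get(i).entrySet()) {
                Role cur = e.getKey();
                double logEmission = Math.log((e.getValue() + 1e-8) / TOTAL_FREQUENCY[cur.ordinal()]);
                double now = TRANSITION_COST[pre.ordinal()][cur.ordinal()] - logEmission;
                if (perfectCost > now) {
                    perfectCost = now;
                    perfectTag = cur;
                }
            }
            pre = perfectTag;
            path.add(pre);
        }
        return path;
    }

    /** A three-vertex sample: start vertex, an ordinary word, and a surname-like word. */
    static List<Map<Role, Integer>> sample() {
        Map<Role, Integer> start = new LinkedHashMap<>();
        start.put(Role.K, 1);
        start.put(Role.A, 1);
        Map<Role, Integer> other = new LinkedHashMap<>();
        other.put(Role.A, 1000);
        Map<Role, Integer> surname = new LinkedHashMap<>();
        surname.put(Role.B, 50);
        surname.put(Role.K, 5);
        return List.of(start, other, surname);
    }

    public static void main(String[] args) {
        System.out.println(decode(sample()));
    }
}
```

With these invented numbers the decoded path is K, A, B: for the last word, role B costs about 3.39 and role K about 5.09, so B is kept.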

For why the transition and emission probabilities are calculated with this formula, see the paper "Research on Automatic Recognition of Chinese Names Based on Role Labeling".

For the example above, the optimal hidden state sequence (optimal path) K->A->K->Z->L->E->A->A is obtained:

nrList = {LinkedList@1065} size = 8
"K" 始##始 (start##start)
"A" 唱首 (sing a)
"K" 张 (Zhang)
"Z" 学友 (Xueyou)
"L" 的 (possessive particle)
"E" 歌 (song)
"A" 情已逝 (Love Is Gone)
"A" 末##末 (end##end)

Each entry pairs a hidden state with its observed state, for example:

Hidden state --- Observed state

"K" --- 始##始 (start##start)

Maximum matching

With the optimal hidden sequence KAKZLEAA in hand, the next step is the "maximum matching" process.

PersonDictionary.parsePattern(nrList, pWordSegResult, wordNetOptimum, wordNetAll);

Before maximum matching, a "pattern split" takes place. The exact meaning of each hidden state is defined in com.hankcs.hanlp.corpus.tag.NR.java. For example, if the optimal hidden sequence contains a 'U' or a 'V',

U  Ppf  The preceding context and the surname form a word  这里【有关】天培的壮烈

V  Pnw  The last character of a three-character name and the following context form a word  龚学平等领导, 邓颖【超生】前

it is split:

switch (nr) {
    case U:
        // split into K B
    case V:
        // split as appropriate
}

After splitting, the new hidden sequence (pattern) is obtained:

String pattern = sbPattern.toString();
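As a rough illustration of the U case only, the sketch below splits a U-tagged word into a context part (K) and a surname part (B) and rebuilds the pattern string. It is not HanLP's implementation: the class PatternSplitDemo is hypothetical, the split point is assumed to fall before the last character of the word, the role string in main is invented, and the V case is not handled at all.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy pattern split: a word tagged U is split into a K part and a B part. */
class PatternSplitDemo {
    /** Holds the rewritten words together with the rewritten role pattern. */
    static final class Result {
        final List<String> words = new ArrayList<>();
        final StringBuilder pattern = new StringBuilder();
    }

    static Result split(List<String> words, String roles) {
        if (words.size() != roles.length()) {
            throw new IllegalArgumentException("one role per word is required");
        }
        Result result = new Result();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            char role = roles.charAt(i);
            if (role == 'U' && word.length() >= 2) {
                // Assumption: the surname is the last character of the word.
                int cut = word.length() - 1;
                result.words.add(word.substring(0, cut)); // preceding context, role K
                result.words.add(word.substring(cut));    // surname, role B
                result.pattern.append("KB");
            } else {
                // Every other role is copied through unchanged.
                result.words.add(word);
                result.pattern.append(role);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical role string for a hand-segmented fragment.
        Result r = split(List.of("这里", "有关", "天", "培"), "KUCD");
        System.out.println(r.words + " " + r.pattern);
    }
}
```

After the split, words and pattern letters line up one to one again, which is what the matching step that follows relies on.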

Next, an Aho-Corasick automaton performs maximum pattern matching, and the matches are stored in the "optimal word net". You can also add custom recognition rules here for specific applications.

trie.parseText(pattern, new AhoCorasickDoubleArrayTrie.IHit<NRPattern>() {
    // ...
    wordNetOptimum.insert(offset, new Vertex(Predefine.TAG_PEOPLE, name, ATTRIBUTE, WORD_ID), wordNetAll);
});
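The idea of matching name patterns against the role string can be shown with a much simpler stand-in. The sketch below uses a naive left-to-right longest-match scan rather than an Aho-Corasick automaton, and its three patterns, the class PatternMatchDemo, and the role string in main are all assumptions chosen for illustration, not the library's actual pattern list or output.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy maximum matching of name patterns over a role string. */
class PatternMatchDemo {
    // Illustrative patterns only (surname followed by given-name roles).
    static final List<String> NAME_PATTERNS = List.of("BCD", "BE", "BZ");

    /** Scans left to right, preferring the longest pattern at each position. */
    static List<String> match(List<String> words, String pattern) {
        if (words.size() != pattern.length()) {
            throw new IllegalArgumentException("one role letter per word is required");
        }
        List<String> names = new ArrayList<>();
        int i = 0;
        while (i < pattern.length()) {
            String best = null;
            for (String p : NAME_PATTERNS) {
                if (pattern.startsWith(p, i) && (best == null || p.length() > best.length())) {
                    best = p;
                }
            }
            if (best == null) {
                i++; // no name starts at this word
                continue;
            }
            // Each role letter corresponds to one word, so join the covered words.
            names.add(String.join("", words.subList(i, i + best.length())));
            i += best.length();
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical role string in which the surname is tagged B and the given name Z.
        System.out.println(match(List.of("唱首", "张", "学友", "的", "歌"), "ABZLE"));
    }
}
```

An automaton reports every pattern occurrence in a single pass over the role string, which matters when many patterns overlap; the naive scan is used here only to keep the example short.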

After the recognized names are saved to the optimal word net, the Viterbi segmentation algorithm is run once more on that net to produce the final segmentation result.

if (wordNetOptimum.size() != preSize) {
    vertexList = viterbi(wordNetOptimum);
    if (HanLP.Config.DEBUG) {
        System.out.printf("wordNetOptimum: \n%s\n", wordNetOptimum);
    }
}

The name recognition source code essentially implements what the paper describes. For a given sentence, three steps are performed:

Role observation

Viterbi decoding of the hidden states (finding the role tag of each segmented word)

Maximum matching of role tags (possibly with some post-processing)

Finally, the Viterbi algorithm is run once more for segmentation, and the resulting segmentation is the final recognition result.
