2025-01-19 Update From: SLTechnology News & Howtos > Internet Technology

Shulou (Shulou.com) 06/02 Report --
Before looking at the source code, it helps to first read the paper "Research on automatic recognition of Chinese names based on role tagging".
For questions about name recognition, you can refer to the following issues:

- Name recognition issue #387
- Organization name recognition error
- The process of Chinese entity recognition in the cascaded HMM
HanLP reference blog posts:

- Part-of-speech tagging
- Organization name recognition based on the cascaded HMM-Viterbi role tagging model
- Word segmentation
The post "HMM in word segmentation, part-of-speech tagging, and named entity recognition" puts it this way:

Word segmentation: given a character sequence, find the most likely label sequence (word-boundary labels: word-end vs. non-word-end). Jieba, for instance, segments with BMES tags: B (begin of word), M (middle), E (end), S (single-character word).

Word segmentation is likewise solved with the dynamic programming of the Viterbi algorithm; for details see "The word segmentation principle of text mining".
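To make the BMES scheme concrete, here is a minimal sketch (plain Java, not HanLP or jieba code; the class and method names are made up) that derives character-level BMES tags from an already-segmented sentence:

```java
// Illustrative sketch (not HanLP or jieba code): derive character-level
// BMES tags from an already-segmented sentence.
// B = beginning of a word, M = middle, E = end, S = single-character word.
import java.util.Arrays;
import java.util.List;

public class BmesDemo {
    public static String toBmes(List<String> words) {
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            int n = w.length();
            if (n == 1) {
                sb.append('S');                 // single-character word
            } else {
                sb.append('B');                 // first character of the word
                for (int i = 1; i < n - 1; i++) sb.append('M');
                sb.append('E');                 // last character of the word
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // a 3-character word, then two 1-character words -> BME, S, S
        System.out.println(toBmes(Arrays.asList("abc", "d", "e"))); // BMESS
    }
}
```

Decoding then amounts to choosing, among all valid BMES sequences, the one with the highest probability under the model.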
Role observation
Take "sing a Zhang Xueyou song, the love is gone" as the example sentence.

Start from the begin-of-sentence vertex. Its roles are tagged NR.A and NR.K, each with a default frequency of 1.

For the first word, "sing a", which does not exist in nr.txt, EnumItem nrEnumItem = PersonDictionary.dictionary.get(vertex.realWord); returns null, so a role tag is guessed from the word's own part of speech:

Because the Attribute of "sing a" is nz 16, not nr or nnt, it is assigned the role NR.A by default, with frequency equal to the total frequency of the NR.A role in nr.tr.txt.
At this point, the list of roles is as follows:
Next is the vertex "Zhang". Because "Zhang" is in nr.txt, PersonDictionary.dictionary.get(vertex.realWord) returns an EnumItem object, which is added directly to the role list:

The role list after adding "Zhang" is as follows:

The role list for the whole sentence "sing a Zhang Xueyou song, the love is gone" is as follows:
At this point, the role observation part is complete.
To sum up, role observation for a sentence works as follows: first split the sentence into words with the segmentation algorithm, then look up each word in the person-name dictionary (PersonDictionary).

- If the word is in the person-name dictionary (nr.txt), record its roles; all roles are defined in com.hankcs.hanlp.corpus.tag.NR.java.
- If the word is not in the person-name dictionary, guess a role from the word's Attribute. Some words are already tagged nr or nnt in the core dictionary; all other words are tagged with the role NR.A, with frequency equal to the total frequency of NR.A in the transition matrix.
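The two branches above can be sketched as follows. This is a hypothetical, simplified stand-in (the class, map layout, and frequencies are illustrative, not HanLP's API); roles are represented as single-letter strings matching the tags in com.hankcs.hanlp.corpus.tag.NR:

```java
// Hypothetical sketch of role observation for one word: dictionary hit
// returns the recorded roles; a miss falls back to the default role "A".
import java.util.HashMap;
import java.util.Map;

public class RoleObservation {
    // toy person-name dictionary (stand-in for nr.txt): word -> role -> frequency
    public static final Map<String, Map<String, Integer>> NR_DICT = new HashMap<>();
    static {
        Map<String, Integer> zhang = new HashMap<>();
        zhang.put("B", 1000); // e.g. surname role, toy frequency
        zhang.put("K", 30);
        NR_DICT.put("Zhang", zhang);
    }

    public static Map<String, Integer> observe(String word, int totalFreqOfA) {
        Map<String, Integer> roles = NR_DICT.get(word);
        if (roles != null) return roles;          // in the dictionary: use recorded roles
        Map<String, Integer> guess = new HashMap<>();
        guess.put("A", totalFreqOfA);             // unknown word: default role A
        return guess;
    }

    public static void main(String[] args) {
        System.out.println(observe("Zhang", 100000)); // recorded roles B and K
        System.out.println(observe("song", 100000));  // fallback role A
    }
}
```

The real implementation additionally checks the word's part of speech (nr/nnt) before falling back to NR.A, as described above.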
Solving the optimal path with the Viterbi algorithm (dynamic programming)
As the role list above shows, each word carries one or more candidate role tags, so we need to choose the most likely (minimum-cost) role path through them. For a detailed explanation of the Viterbi algorithm, refer to "Hidden Markov Model".
List nrList = viterbiComputeSimply(roleTagList);
// some code...
return Viterbi.computeEnumSimply(roleTagList, PersonDictionary.transformMatrixDictionary);
In fact, this process is Viterbi decoding of the hidden state sequence. The HMM five-tuple here is:

- Hidden state set: the person-name role tags defined in com.hankcs.hanlp.corpus.tag.NR.java
- Observation state set: the elements of each tagList produced by segmentation (i.e., the segmented words)
- Transition probability matrix: generated from the nr.tr.txt file
- Emission probability: the number of times a role tag (hidden state) occurs, divided by the total number of occurrences of all tags:
  Math.log((item.getFrequency(cur) + 1e-8) / transformMatrixDictionary.getTotalFrequency(cur))
- Initial state (the begin-of-sentence marker) and final state (the end-of-sentence marker)
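The emission term can be read as a standalone cost function. A sketch (a hypothetical helper, not HanLP's API), where the 1e-8 smoothing keeps an unseen tag frequency of 0 from producing log(0):

```java
// Sketch: the emission term above as a standalone cost (negative log
// probability). The 1e-8 smoothing avoids log(0) for unseen frequencies.
public class EmissionCost {
    public static double cost(int tagFrequency, int totalTagFrequency) {
        return -Math.log((tagFrequency + 1e-8) / totalTagFrequency);
    }

    public static void main(String[] args) {
        // 50 occurrences out of 1000 -> -ln(0.05), roughly 3.0
        System.out.println(cost(50, 1000));
    }
}
```

A rarer tag yields a larger cost, which is why the decoding loop below minimizes rather than maximizes.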
The core code of the Viterbi decoding (dynamic programming over the hidden states) is as follows:
for (E cur : item.labelMap.keySet())
{
    double now = transformMatrixDictionary.transititon_probability[pre.ordinal()][cur.ordinal()]
               - Math.log((item.getFrequency(cur) + 1e-8) / transformMatrixDictionary.getTotalFrequency(cur));
    if (perfect_cost > now)
    {
        perfect_cost = now;
        perfect_tag = cur;
    }
}
transformMatrixDictionary.transititon_probability[pre.ordinal()][cur.ordinal()] is the transition cost from the previous hidden state pre to the current hidden state cur, and Math.log((item.getFrequency(cur) + 1e-8) / transformMatrixDictionary.getTotalFrequency(cur)) is the log emission probability of the current hidden state. Since the transition entry is stored as a negative log probability (a cost), subtracting the log emission probability adds the two costs together into the variable now; the for loop then keeps the most likely hidden state perfect_tag for the current observation, i.e. the one with the minimum perfect_cost.
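The loop above picks, for each word, the tag minimizing the combined cost given the previously chosen tag. A toy re-implementation over plain arrays (hypothetical names, made-up numbers; a simplified stand-in, not HanLP's code):

```java
// Simplified greedy decoding over role tags: at each step pick the tag
// minimizing transitionCost[pre][cur] + emissionCost[t][cur], where both
// terms are negative log probabilities, so smaller sum = more likely.
import java.util.Arrays;

public class GreedyRoleDecode {
    // toy model with 2 roles {0, 1}; entries are -log transition probabilities
    static final double[][] TRANS = { {0.5, 1.2}, {0.9, 0.4} };

    /** emissions[t][r] = -log emission probability of role r at step t. */
    public static int[] decode(double[][] emissions) {
        int[] path = new int[emissions.length];
        int pre = 0; // assume the begin marker corresponds to role 0
        for (int t = 0; t < emissions.length; t++) {
            double perfectCost = Double.MAX_VALUE;
            int perfectTag = 0;
            for (int cur = 0; cur < TRANS[pre].length; cur++) {
                double now = TRANS[pre][cur] + emissions[t][cur]; // total cost
                if (perfectCost > now) {
                    perfectCost = now;
                    perfectTag = cur;
                }
            }
            path[t] = perfectTag;
            pre = perfectTag; // commit to the best tag before the next word
        }
        return path;
    }

    public static void main(String[] args) {
        double[][] em = { {0.1, 2.0}, {1.5, 0.2} };
        System.out.println(Arrays.toString(decode(em))); // [0, 1]
    }
}
```

Note that this "simply" variant commits to one tag per step; a full Viterbi would keep the best cost for every candidate tag at each step and backtrack at the end.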
For why the transition and emission probabilities are computed this way, refer to the paper "Research on automatic recognition of Chinese names based on role tagging".
In the example above, the optimal hidden state sequence (optimal path) is K -> A -> K -> Z -> L -> E -> A -> A:

nrList = {LinkedList@1065} size = 8
 "K" begin marker
 "A" sing a
 "K" Zhang
 "Z" Xueyou
 "L"
 "E" song
 "A" love is gone
 "A" end marker
That is, each entry pairs a hidden state with its observed state, for example:

hidden state    observed state
"K"             begin marker
Maximum matching
With the optimal hidden sequence KAKZLEAA in hand, the next step is the "maximum matching" processing.
PersonDictionary.parsePattern(nrList, pWordSegResult, wordNetOptimum, wordNetAll);
Before the maximum matching, a pattern split is performed. The concrete meaning of each hidden state is defined in com.hankcs.hanlp.corpus.tag.NR.java. For example, if the optimal hidden sequence contains a 'U' or a 'V':

- U (Ppf): the word preceding the name and the surname merge into one token (e.g. "here, [relating to] Tianpei's heroism", where the bracketed word absorbs the surname)
- V (Pnw): the last character of the name and the following word merge into one token (e.g. "Gong Xue[ping and] other leaders"; "before Deng Ying[chao's] death")

such sequences undergo "split processing":
switch (nr)
{
    case U:
        // split into K B
    case V:
        // split depending on the context
}
After the split, a new hidden sequence (the pattern) is obtained:

String pattern = sbPattern.toString();
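The split step can be pictured as a rewrite of the role string. In the sketch below, U -> K B follows the comment above, while the V expansion is only an illustrative guess, since HanLP splits V differently depending on context:

```java
// Sketch of the split step: rewrite the role string, expanding combined
// roles. U -> K B is stated above; V -> E L here is an assumed example,
// as the real split for V depends on the surrounding tags.
public class SplitPattern {
    public static String split(String roles) {
        StringBuilder sb = new StringBuilder();
        for (char c : roles.toCharArray()) {
            switch (c) {
                case 'U': sb.append("KB"); break; // context word + surname
                case 'V': sb.append("EL"); break; // name end + following word (assumed)
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(split("AUCDA")); // AKBCDA
    }
}
```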
Next, maximum pattern matching is performed with an Aho-Corasick (AC) automaton, and the matches are written into the "optimal word net" (wordNetOptimum). Here you can also add custom recognition rules for a specific application.
trie.parseText(pattern, new AhoCorasickDoubleArrayTrie.IHit() {
    // ...
    wordNetOptimum.insert(offset, new Vertex(Predefine.TAG_PEOPLE, name, ATTRIBUTE, WORD_ID), wordNetAll);
});
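The effect of the trie scan can be approximated with plain substring search. A hedged sketch (the pattern set below is illustrative only; HanLP's real patterns live in its NRPattern enum and are matched with an AhoCorasickDoubleArrayTrie):

```java
// Sketch: find name-pattern matches in the decoded role string by plain
// substring search and report character offsets. Illustrative patterns.
import java.util.ArrayList;
import java.util.List;

public class PatternScan {
    static final String[] PATTERNS = { "BCD", "BE", "BZ" }; // toy name patterns

    public static List<int[]> scan(String roleString) {
        List<int[]> hits = new ArrayList<>(); // each hit: {begin, end} offsets
        for (String p : PATTERNS) {
            int at = roleString.indexOf(p);
            while (at >= 0) {
                hits.add(new int[]{ at, at + p.length() });
                at = roleString.indexOf(p, at + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        for (int[] h : scan("ABCDA")) {
            System.out.println(h[0] + ".." + h[1]); // "BCD" matches at 1..4
        }
    }
}
```

The offsets of each match tell the recognizer which span of words forms a candidate person name, which is what gets inserted into the optimal word net.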
The recognized names are saved into the optimal word net; a Viterbi segmentation over this optimal word net then yields the final, refined segmentation result.
if (wordNetOptimum.size() != preSize)
{
    vertexList = viterbi(wordNetOptimum);
    if (HanLP.Config.DEBUG)
    {
        System.out.printf("Segmentation:\n%s\n", wordNetOptimum);
    }
}
Summary
The person-name recognition in the source code follows the paper closely. For a given sentence, there are three steps:

Role observation

Viterbi decoding of the hidden states (solving for the role tag of each segmented word)

Maximum matching over the role tags (where some post-processing rules can be applied)

Finally, the Viterbi segmentation algorithm is run again to obtain the refined segmentation, which is the final recognition result.
This article does not cover the detailed Viterbi segmentation algorithm or the generation of the transition matrix; those will be covered later. Reading the source gave me a deeper understanding of hidden Markov models and of how the theory is realized, step by step, in code. As I am still a beginner, my understanding of the source may be shallow or off in places; criticism and corrections are welcome.
For a simple example of dynamic programming, see the application of dynamic programming to the Fibonacci sequence problem.
The article comes from hapjin's blog.