
What is named entity recognition in pyhanlp

2025-03-28, from SLTechnology News&Howtos (Shulou)


This article introduces named entity recognition in pyhanlp, the Python interface to HanLP. It is a practical feature, so the examples below show how to enable each recognizer and what the output looks like.

Named entity recognition is a very important part of word segmentation. Discovering new words matters too; that topic is covered in the "keyword extraction, phrase extraction and automatic summary, new word recognition" case and the ones after it.

We start with a simple example that shows the effect of named entity recognition, then move on to the main content. The sample sentences and printed results appear here in English translation; the segmenters themselves operate on the original Chinese text.

A simple demonstration example

from pyhanlp import *

"""
HanLP enables named entity recognition
"""

# Example of person name recognition with the CRF segmenter
CRFnewSegment = HanLP.newSegment("crf")
term_list = CRFnewSegment.seg("what Tian Feng of Yizhi Society wants to say is that this is just an example of hanlp named entity recognition")
print(term_list)

print("\n= named entity open and close comparison test =\n")

sentences = [
    "Keiko Kitagawa starred in Fast and Furious 3 directed by Lin Yibin.",
    "Lin Chi-ling appeared on the Internet: are you sure it's not a knot?",
    "Kiyama Qianguang and Kondo Park drink and enjoy flowers in Guishan Park.",
]

# Settings can be made globally through HanLP.Config, but a given segmenter may not support every feature,
# and some segmenters recognize certain named entities on their own.
HanLP.Config.japaneseNameRecognize = False

viterbiNewSegment = HanLP.newSegment("viterbi").enableJapaneseNameRecognize(True)
CRFnewSegment_new = HanLP.newSegment("crf").enableJapaneseNameRecognize(True)

# segSentence
# CRFnewSegment_2.seg2sentence(sentences)
for sentence in sentences:
    print("crf:", CRFnewSegment.seg(sentence))
    print("crf_new:", CRFnewSegment_new.seg(sentence))
    print("viterbi:", viterbiNewSegment.seg(sentence))

[Yizhi Society / n, / u, Tian Feng / nr, want / v, say / v, / u, is / v, this / r, only / d, is / v, a / m, hanlp name / vn, entity / n, identify / v, / u, example / n]

= named entity open and close comparison test =

crf: [Beichuan / ns, Jingzi / n, starring / v, le / u, Lin Yibin / nr, director / n, / u, "/ w, speed / n, and / c, passion / n, 3Gao," / w]

crf_new: [Beichuan / ns, Jingzi / n, starring / v, le / u, Lin Yibin / nr, director / n, / u, "/ w, speed / n, and / c, passion / n, 3Gao," / w]

viterbi: [Kitagawa Kyoko / nrj, starring / v, / ule, Lin Yibin / nr, director / nnt, / ude1, "/ w, speed / n, and / cc, passion / n, 3choc," / w]

crf: [Lin Zhiling / nr, appearance / v, netizens / n,: / w, confirm / v, not / d, is / v, Potomo / n, knot / n,? / w]

crf_new: [Lin Zhiling / nr, appearance / v, netizens / n,: / w, confirm / v, not / d, is / v, Potomo / n, knot / n,? / w]

viterbi: [Lin Zhiling / nr, debut / vi, netizens / n,: / w, confirm / v, not / c, Potomo knot / nrj,? / w]

crf: [tortoise / v, mountain / n, thousand / m, Guang / Q, and / c, Kondo / a, park / n, in / p, Guishan Park / ns, Li / f, drink / v, wine / n, reward / v, flowers / n]

crf_new: [tortoise / v, mountain / n, thousand / m, Guang / Q, and / c, Kondo / a, park / n, in / p, Guishan Park / ns, Li / f, drink / v, wine / n, reward / v, flowers / n]

viterbi: [Kameyama Chihiro / nrj, and / cc, Kondo Park / nrj, in / p, Guishan / nz, Park / n, Li / f, drinking / vi, Flower appreciation / nz]
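The printed results mark entities only through their nature tags (nr, nrj, nrf, ns, nt), so a small helper can pull them out of a term list. The sketch below is an illustration, not part of pyhanlp: extract_entities, ENTITY_TAGS and FakeTerm are hypothetical names, and it assumes each term exposes .word and .nature and that str(term.nature) yields the tag text.

```python
from collections import namedtuple

# Tags as they appear in the printed results: nr (Chinese name), nrf (transliterated name),
# nrj (Japanese name), ns (place name), nt (organization name).
ENTITY_TAGS = {"nr": "person", "nrf": "person", "nrj": "person", "ns": "place", "nt": "organization"}


def extract_entities(term_list, entity_tags=ENTITY_TAGS):
    """Return (word, entity_type) pairs for terms whose nature tag is in entity_tags."""
    entities = []
    for term in term_list:
        tag = str(term.nature)  # assumed to give the tag text for strings and Nature objects alike
        if tag in entity_tags:
            entities.append((str(term.word), entity_tags[tag]))
    return entities


# Hypothetical stand-in for a HanLP term so the sketch runs without a JVM.
FakeTerm = namedtuple("FakeTerm", ["word", "nature"])
terms = [FakeTerm("Lin Zhiling", "nr"), FakeTerm("debut", "vi"), FakeTerm("netizens", "n")]
print(extract_entities(terms))
```

With pyhanlp installed, the same function could be given the result of a segmenter's seg() call instead of the stand-in list.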

Main content

Chinese name recognition

Description

At present, Chinese name recognition is enabled by default in essentially every segmenter, including the one behind the HanLP.segment() interface. Users do not need to turn it on manually; the code below enables it explicitly only for emphasis.

Recognition has a certain false-positive rate. For example, if the phrase "critical year" (关键年) is wrongly recognized as a person's name, you can rule that out by adding the entry "关键年 A 1" to data/dictionary/person/nr.txt, or by registering the phrase as a new word in a custom dictionary.

If you solve a problem this way, please submit a pull request to the HanLP project. Dictionaries are valuable assets too.

NLP users are advised to use the perceptron or CRF lexical analyzer for higher accuracy.
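The algorithm reference for this section describes name recognition as HMM-Viterbi role tagging. The sketch below is a toy version of that idea under stated assumptions: the three roles (A for ordinary context, B for surname, C for given-name token), the probabilities, and the romanized tokens are all made up for illustration, whereas HanLP's real role set and statistics come from a trained corpus.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable state sequence for the observations (log-space Viterbi)."""
    def lp(p):
        return math.log(max(p, floor))  # floor avoids log(0) for impossible events

    # best[t][s] = (log-probability of the best path ending in s at step t, previous state)
    best = [{s: (lp(start_p.get(s, 0)) + lp(emit_p[s].get(observations[0], 0)), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max((best[-1][p][0] + lp(trans_p[p].get(s, 0)), p) for p in states)
            column[s] = (score + lp(emit_p[s].get(obs, 0)), prev)
        best.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


def extract_names(tokens, roles):
    """Join each surname (B) with the given-name tokens (C) that follow it."""
    names, current = [], []
    for token, role in zip(tokens, roles):
        if role == "C" and current:
            current.append(token)
            continue
        if len(current) > 1:
            names.append(" ".join(current))
        current = [token] if role == "B" else []
    if len(current) > 1:
        names.append(" ".join(current))
    return names


# Hypothetical model parameters, chosen only to make the example readable.
states = ["A", "B", "C"]
start_p = {"A": 0.8, "B": 0.2, "C": 0.0}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"C": 1.0}, "C": {"A": 0.5, "C": 0.5}}
emit_p = {"A": {"mayor": 0.5, "speech": 0.5}, "B": {"Zhuang": 1.0}, "C": {"Mu": 0.5, "Di": 0.5}}

tokens = ["mayor", "Zhuang", "Mu", "Di", "speech"]
roles = viterbi(tokens, states, start_p, trans_p, emit_p)
print(roles, extract_names(tokens, roles))
```

The decoder tags each token with a role first and only then assembles names from the role sequence, which is the two-step shape that role tagging implies.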

Algorithm details

"HMM-Viterbi Role Tagging for Chinese Name Recognition in Practice"

# Chinese name recognition
def demo_chinese_name_recognition(sentences):
    segment = HanLP.newSegment().enableNameRecognize(True)
    for sentence in sentences:
        term_list = segment.seg(sentence)
        print(term_list)
        print([i.word for i in term_list])

sentences = [
    "before the signing ceremony, Qin Guangrong, Li Jiheng, Qiu he and others met with the entrepreneurs who took part in the signing.",
    "Wu Dajing set a world record and won the championship, and the Chinese delegation won the first gold medal in Pyeongchang.",
    "District Mayor Zhuang Mudi's New year's speech",
    "Zhu Lilun: both sides of the strait hope to jointly create a win-win situation. Zhu's historical meeting will be held soon.",
    "the richest man in Shaanxi, Wu Yijian, was taken away and intersected with your wife.",
    "Eight-year-old Catherine Kroll (Feng Fujuan), like many Chinese-American children, began to learn the violin at an early age, VOA reported on April 28. Is her mother a tiger mother?",
    "Catherine and Lucy (Lu Ruiyuan) are a little different from their brothers.",
    "Wang Guoqiang, Gao Feng, Wang Yang, Zhang Chaoyang, Han Han, Xiao Si",
    "Zhang Hao and Hu Health have been demobilized and returned home.",
    "Boss Wang and Xiao Li are married.",
    "the screenwriter Shao Junlin and Ji Daoqing said",
    "there are stories about Guan Tianpei here.",
    "Gong Xueping and other leaders said that Deng Yingchao put an end to over-birth.",
]
demo_chinese_name_recognition(sentences)

# CRFnewSegment is the CRF segmenter created in the first example
print("\n= Chinese names are on by default =\n")
print(CRFnewSegment.seg(sentences[0]))

[sign / vi, ceremony / n, former / f, / w, Qin Guangrong / nr, / w, Li Jiheng / nr, / w, Qiu he / nr, etc. / udeng, together / d, meet / v, / ule, participate / v, sign / vi, / ude1, entrepreneur / nnt. / w]

['signing contract', 'ceremony', 'Qian', 'Qin Guangrong', 'Li Jiheng', 'Qiu he', 'etc.', 'together', 'meet', 'participate', 'sign a contract', 'entrepreneur']

[Wu Dajing / nr, Chuang / v, World / n, record / n, win the championship / vi, / w, China / ns, delegation / n, Pyeongchang / ns, first / Q, Jin / b]

['Wu Dajing', 'Chuang', 'World', 'record', 'win the championship', 'China', 'delegation', 'Pyeongchang', 'head', 'Jin']

[district Chief / nnt, Zhuang Mudi / nr, New year / t, speech / vi]

['district governor', 'Zhuang Mudi', 'New year', 'speech']

[Zhu Lilun / nr,: / w, both sides / n, both sides / d, hope / v, co-creation / v, win-win / n, / w, Xi / v, Zhu / ag, history / n, meeting / vn, in / vi]

['Zhu Lilun',':', 'cross-strait','du', 'hope','co-creation', 'win-win','Xi', 'Zhu', 'history', 'meeting', 'coming']

[Shaanxi / ns, richest man / n, Wu Yijian / nr, taken away / v, / w, and / cc, Ling Jie / nr, wife / n, have / vyou, intersection / v]

['Shaanxi', 'richest man','Wu Yijian','be taken away', 'with', 'Ling Jie', 'wife', 'you', 'intersection']

[according to / p, VOA / n, Radio / nis, website / n, April / t, 28pm, Japan / b, report / v, / ude1, Catherine / nr, / w, g / Q, Rohr / nr, (/ w, Fengfujuan / nr,) / w, and / cc, many / m, Chinese / n, American / nsf, children / n, same / uyy, / w, young / z, age / n, just / d, start / v, learn / v, violin / n, le / ule,. / w, she / rr, / ude1, mother / n, is / vshi, bit / Q, tiger mother / nz,? / w]

['according to', 'Voice of America', 'Radio', 'website', 'April', '28th', 'Day', 'report', 'Catherine', 'Ke', 'Rohr', '(', 'Fengfujuan', ')', 'and', 'many', 'Chinese', 'American', 'Children', ',', 'young', 'age', 'just', 'start', 'learn', 'violin', 'la', '.', 'she', 'Di', 'Mom', 'Yes', 'bit', 'Tiger Mother', 'Mu', '?']

[Catherine / nr, / cc, Lucy / nr, (/ w, Lu Ruiyuan / nr,) / w, / w, with / p, they / rr, / ude1, brothers / n, people / k, there are / vyou, some / m, different / a. / w]

['Catherine', 'and', 'Lucy', '(', 'Lu Ruiyuan', ')', 'follow', 'they', 'brothers', 'we', 'have', 'some', 'different', '.']

[Wang Guoqiang / nr, / w, Feng / n, / w, Wang Yang / n, / w, Zhang Chaoyang / nr, Light / n, uzhe, head / n, / w, Han Han / nr, / w, Xiao / a, four / m]

['Wang Guoqiang', 'Feng', 'Wang Yang', 'Zhang Chaoyang', 'Guangguang', 'Zhe', 'Tou', ',', 'Han Han', 'Xiao', 'four']

[Zhang Hao / nr, and / cc, Hu Health / nr, demobilized / v, going home / vi, / ule]

['Zhang Hao', 'and','Hu Health', 'demobilized','go home',']

[boss Wang / nr, and / cc, Xiao Li / nr, married / vi, / ule]

['Boss Wang', 'and', 'Xiao Li', 'married', 'got married']

[screenwriter / nnt, Shao Junlin / nr, and / cc, Ji Daoqing / nr, say / v]

['screenwriter', 'Shao Junlin', 'and','Ji Daoqing', 'Shuo']

[here / rzs, / vyou, Guan Tianpei / nr, / ude1, related / vn, deeds / n]

['here', 'you', 'Guan Tianpei','of', 'about', 'deeds']

[Gong Xueping / nr, et al / udeng, Leader / n, said / v, / w, Deng Yingchao / nr, living / t, stop / v, Chaosheng / vi]

['Gong Xueping', 'et al', 'Leader', 'say', 'Deng Yingchao', 'alive', 'put an end to', 'Chaosheng']

= Chinese names are on by default =

[signing / vn, ceremony / n, former / f, / w, Qin Guangrong / nr, / w, Li Jiheng / nr, / w, Qiuhe / nr, etc. / u, meeting / v, / u, participating / v, signing / v, / u, entrepreneur / n, etc. / w]

Transliterated name recognition

Description

At present, segmenters enable transliterated name recognition by default, so users do not need to turn it on manually; the code below enables it explicitly only for emphasis.

Algorithm details

"Transliterated and Japanese Name Recognition under the Cascaded Hidden Markov Model"

# Transliterated name recognition
sentences = [
    "A bucket of ice water fell, Microsoft's Bill Gates, Facebook's Zuckerberg and Sandberg, Amazon's Bezos, Apple's Cook are all willing to wet into the camera, these Silicon Valley tech people, moths sacrifice performance, in fact, it's all for charity.",
    "the longest name in the world is Janson Joey Alexander Biki Karisler Duff Elliot Fox Iverumo Marni Myers Patterson Thompson Wallace Preston.",
]

segment = HanLP.newSegment().enableTranslatedNameRecognize(True)
for sentence in sentences:
    term_list = segment.seg(sentence)
    print(term_list)

# CRFnewSegment is the CRF segmenter created in the first example
print("\n= transliteration of person's name is enabled by default =\n")
print(CRFnewSegment.seg(sentences[0]))

[bucket / nz, ice water / n, Dangtou / vi, fall / v, / w, Microsoft / ntc, / ude1, Bill Gates / nrf, / w, Facebook/nx, / ude1, Zuckerberg / nr, and / p, Sandberg / nrf, / w, Amazon / nrf, / ude1, Bezos / nrf, / w, Apple / nf, / ude1, Cook / nr, all / d, regardless of / v, wet / nz, camera / nz, / w, these / rz, Silicon Valley / ns, / ude1, technology / n, people / n, / w, moths / n, fire / vn, like / vg, ground / ude2, sacrifice / v, performance / vn, / w, actually / d, all / a, for / p, charity / a. / w]

[world / n, upper / f, longest / d, / ude1, name / n, is / vshi, Jensen / nr, / w, Joey / nr, / w, Alexander / nr, / w, Biki / nr, / w, Kallis / nr, le / v, / w Duff Elliott Fox Iverumo Marni Myers Patterson Thompson Wallace Preston / nrf. / w]

= transliteration of person's name is enabled by default =

[bucket / m, ice water / n, Dangtou / d, fall / v, / w, Microsoft / a, / u, Bill Gates / n, / w, Facebook/l, / u, Zuckerberg / n, and / p, Sandberg / n, / w, Amazon / nr, / u, Bezos / nr, / w, Apple / n, / u, Cook / nr, all / d, regardless of / v, wet / n, camera / v, / w, these / r, Silicon Valley / n, / u, technology / n, people / n, / w, moths / v, fire like / v, ground / u, sacrifice / v, performance / v, / w, actually / d, all / d, for / p, charity / a. / w]

Japanese name recognition

Description

At present, the standard segmenter turns Japanese name recognition off by default, and users need to enable it manually. Japanese names appear relatively rarely in text, yet recognizing them costs performance.
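Whether that performance cost matters for a given workload can be checked by timing a segmenter with and without the recognizer. The helper below is a minimal sketch with a hypothetical name (time_segmenter); it accepts any callable that takes a sentence, so it runs here with a stand-in and makes no claim about HanLP's actual timings.

```python
import time


def time_segmenter(segment_fn, sentences, repeat=100):
    """Return the average seconds per sentence spent in segment_fn."""
    start = time.perf_counter()
    for _ in range(repeat):
        for sentence in sentences:
            segment_fn(sentence)
    elapsed = time.perf_counter() - start
    return elapsed / (repeat * len(sentences))


# Stand-in callable so the sketch runs without HanLP. With pyhanlp installed, one could
# instead pass the seg method of a segmenter built with enableJapaneseNameRecognize(True)
# and compare it against one built without it.
baseline = time_segmenter(str.split, ["a b c", "d e f"])
print(f"{baseline:.2e} seconds per sentence")
```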

Algorithm details

"Transliterated and Japanese Name Recognition under the Cascaded Hidden Markov Model"

# Japanese name recognition
def demo_japanese_name_recognition(sentences):
    segment = HanLP.newSegment().enableJapaneseNameRecognize(True)
    for sentence in sentences:
        term_list = segment.seg(sentence)
        print(term_list)
        print([i.word for i in term_list])

sentences = [
    "Keiko Kitagawa starred in Fast and Furious 3 directed by Lin Yibin.",
    "Lin Chi-ling appeared on the Internet: are you sure it's not a knot?",
    "Kiyama Qianguang and Kondo Park drink and enjoy flowers in Guishan Park.",
]
demo_japanese_name_recognition(sentences)

# CRFnewSegment is the CRF segmenter created in the first example
print("\n= Japanese name Standard Separator is not turned on by default =\n")
print(CRFnewSegment.seg(sentences[0]))

[Kitagawa Jingzi / nrj, starring / v, le / ule, Lin Yibin / nr, director / nnt, / ude1, "/ w, speed / n, and / cc, passion / n, 3choc," / w]

['Kitagawa Kingko', 'take part in acting', 'Lin Yibin', 'director','of', 'speed', 'and', 'passion', '3Qing,'']

[Lin Zhiling / nr, debut / vi, netizens / n,: / w, confirm / v, not / c, Potomo knot / nrj,? / w]

['Lin Zhiling', 'appearance', 'netizens', ':', 'sure', 'no', 'Potomo knot', '?']

[Qiguyama / nrj, and / cc, Kondo Park / nrj, in / p, Guishan / nz, Park / n, Li / f, drinking / vi, Flower appreciation / nz]

['Kiyama Qianguang', 'and', 'Kondo Park','in', 'Guishan', 'Park','Li', 'drink', 'enjoy the flowers']

= Japanese name Standard Separator is not turned on by default =

[Beichuan / ns, Jingzi / n, starring / v, le / u, Lin Yibin / nr, director / n, / u, "/ w, speed / n, and / c, passion / n, 3choc," / w]

Numeral and quantifier recognition

Description

The demo below uses StandardTokenizer, which does not merge a numeral with the quantifier that follows it unless the feature is switched on with enableNumberQuantifierRecognize(True). Once it is enabled, the numeral and its quantifier come back as a single term tagged mq, as the output shows.

# Numeral and quantifier recognition
sentences = [
    "what does the 19 yuan package include?",
    "9999 roses",
    "Don't even give me a hundred dollars.",
    "9012345678 ants",
    "three hundred grams of milk * 2",
    'ChinaJoy "Anti-pornography" rules: a fine of more than 2cm for chest exposure',
]

StandardTokenizer = JClass("com.hankcs.hanlp.tokenizer.StandardTokenizer")
StandardTokenizer.SEGMENT.enableNumberQuantifierRecognize(True)
for sentence in sentences:
    print(StandardTokenizer.segment(sentence))

# CRFnewSegment is the CRF segmenter created in the first example
print("\n= demo numerals and quantifiers are not turned on by default =\n")
CRFnewSegment.enableNumberQuantifierRecognize(True)
print(CRFnewSegment.seg(sentences[0]))

[19 yuan / mq, package / n, including / v, what / ry]

[9999 / mq, rose / n]

[100 yuan / mq, du / d, no / d, give / p, I / rr]

[9012345678 / mq, ants / n]

[milk / nf, 300g / mq, * / w, 2ga]

[ChinaJoy/nx, "/ w, anti-pornography / vi," / w, rules / n, exposure / v, chest / ng, super / v, 2cm / mq, fine / vi]

= demo numerals and quantifiers are not turned on by default =

[19 / m, yuan / Q, package / n, including / v, what / r]
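The last result keeps the numeral (m) and the quantifier (q) as separate terms, while the StandardTokenizer results merge them into one mq term. If a segmenter returns them unmerged, the same effect can be approximated afterwards. The function below is a hypothetical post-processing sketch over (word, tag) pairs, not HanLP's own implementation; the sample tokens are the assumed Chinese originals of "19 yuan package include what".

```python
def merge_number_quantifier(terms):
    """Merge each numeral (tag 'm') with an immediately following quantifier (tag 'q') into one 'mq' term."""
    merged = []
    i = 0
    while i < len(terms):
        word, tag = terms[i]
        if tag == "m" and i + 1 < len(terms) and terms[i + 1][1] == "q":
            merged.append((word + terms[i + 1][0], "mq"))
            i += 2  # skip the quantifier that was just absorbed
        else:
            merged.append((word, tag))
            i += 1
    return merged


# 19 / yuan / package / include / what
terms = [("19", "m"), ("元", "q"), ("套餐", "n"), ("包括", "v"), ("什么", "r")]
print(merge_number_quantifier(terms))
```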

Organization name recognition

Description

At present, the segmenter turns organization name recognition off by default, and users need to enable it manually because it costs performance. In practice, common organization names are already included in the core dictionary and the user-defined dictionary.

HanLP's purpose is not to show off dynamic recognition. In a production environment, whatever can be solved with a dictionary should be solved with a dictionary; that is the most efficient and stable approach.

Users with high requirements for named entity recognition are advised to use the perceptron lexical analyzer.
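To make the dictionary-first advice concrete, here is a minimal sketch of dictionary lookup by forward maximum matching. It is a textbook technique used for illustration only and is not how HanLP stores or searches its dictionaries; the function name, the two-entry dictionary, and the placeholder tag "x" for unknown characters are all assumptions.

```python
def forward_max_match(text, dictionary):
    """Greedy longest-match segmentation: take the longest dictionary entry at each position, else one character."""
    max_len = max(map(len, dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                # Characters not covered by the dictionary get the placeholder tag "x".
                tokens.append((piece, dictionary.get(piece, "x")))
                i += size
                break
    return tokens


# Hypothetical dictionary: Shanghai (place) and Linyuan Technology Co., Ltd. (organization).
org_dict = {"上海": "ns", "林原科技有限公司": "nt"}
print(forward_max_match("我在上海林原科技有限公司兼职", org_dict))
```

Because the organization name is a dictionary entry, it is returned as a single nt term without any statistical recognition step, which is why dictionary fixes are fast and predictable.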

Algorithm details

"Organization Name Recognition under the Cascaded HMM-Viterbi Role Tagging Model"

# Organization name recognition
sentences = [
    "I work part-time in Shanghai Linyuan Technology Co., Ltd.",
    "I often eat at the Taichuan Wedding Banquet Restaurant.",
    "occasionally go to the Kaiyuan Mediterranean Cinema to see a movie.",
]

Segment = JClass("com.hankcs.hanlp.seg.Segment")
Term = JClass("com.hankcs.hanlp.seg.common.Term")

segment = HanLP.newSegment().enableOrganizationRecognize(True)
for sentence in sentences:
    term_list = segment.seg(sentence)
    print(term_list)

# CRFnewSegment is the CRF segmenter created in the first example
print("\n= Organization name Standard Separator has all been turned off =\n")
print(CRFnewSegment.seg(sentences[0]))

# A CRF segmenter with organization recognition enabled can be created the same way
segment = HanLP.newSegment('crf').enableOrganizationRecognize(True)

[I / rr, in / p, Shanghai / ns, Linyuan Technology Co., Ltd. / nt, part-time / vn, work / vn, / w]

[I / rr, often / d, in / p, Taichuan Wedding Banquet Restaurant / nt, Dinner / vi, / w]

[occasionally / d, go / vf, Kaiyuan Mediterranean Cinema / nt, watch / v, movie / n,. / w]

= Organization name Standard Separator has all been turned off =

[I / r, in / p, Shanghai Linyuan Technology Co., Ltd. / nt, part-time / vn, work / vn, / w]

Place name recognition

Description

At present, the standard segmenter turns place name recognition off by default, and users need to enable it manually because it costs performance. In practice, most place names are already included in the core dictionary and the user-defined dictionary.

In a production environment, whatever can be solved with a dictionary should be solved with a dictionary; that is the most efficient and stable approach.

Users with high requirements for named entity recognition are advised to use the perceptron lexical analyzer.

Algorithm details

"HMM-Viterbi Role Tagging for Place Name Recognition in Practice"

# Place name recognition
def demo_place_recognition(sentences):
    segment = HanLP.newSegment().enablePlaceRecognize(True)
    for sentence in sentences:
        term_list = segment.seg(sentence)
        print(term_list)
        print([i.word for i in term_list])

sentences = ["Lanxiang donated excavators to Heiniugou Village, Honghe Town, Pengyang County, Guyuan City, Ningxia"]
demo_place_recognition(sentences)

# CRFnewSegment is the CRF segmenter created in the first example
print("\n= place name is turned on by default =\n")
print(CRFnewSegment.seg(sentences[0]))

[Lanxiang / nr, to / p, Ningxia / ns, Guyuan City / ns, Pengyang County / ns, Honghe Town / ns, Heiniugou Village / ns, donation / v, / ule, excavator / n]

['Lanxiang','to', 'Ningxia', 'Guyuan City', 'Pengyang County', 'Honghe Town', 'Heiniugou Village', 'donation','le', 'excavator']

= place name is turned on by default =

[Lanxiang / v, to / v, Ningxia / ns, Guyuan City / ns, Pengyang County / ns, Honghe Town / ns, Heiniugou Village / ns, donation / v, le / u, excavator / n]

URL recognition

URL recognition appears in the demos, although the original author does not mention it in the documentation. It picks URLs out of text automatically. Testing suggests that the other segmenters do not enable it by default, and the config has no option for turning it on, so it appears to be a separate class. If you need it, one option is to extract the URLs with URLTokenizer first and then add them to the user dictionary. Another is to handle URLs with other tools or a custom function.

# URL recognition
text = '''HanLP's project address is https://github.com/hankcs/HanLP,
the release address is https://github.com/hankcs/HanLP/releases.
I sometimes post some news on www.hankcs.com,
my Weibo is http://weibo.com/hankcs/ and I push hankcs.com news there at the same time.
I heard that .China domain names are open for application, but I did not apply for hankcs.China, because of poverty.
'''

Nature = SafeJClass("com.hankcs.hanlp.corpus.tag.Nature")
Term = SafeJClass("com.hankcs.hanlp.seg.common.Term")
URLTokenizer = SafeJClass("com.hankcs.hanlp.tokenizer.URLTokenizer")

term_list = URLTokenizer.segment(text)
print(term_list)

# Terms tagged xu are the recognized URLs
for term in term_list:
    if term.nature == Nature.xu:
        print(term.word)

[HanLP/nx, / ude1, project / n, address / n, is / vshi, https://github.com/hankcs/HanLP/xu, / w

/ w, / w, publish / v, address / n, is / vshi, https://github.com/hankcs/HanLP/releases/xu, / w

/ w, / w, I / rr, sometimes / d, will / v, on / p, www.hankcs.com/xu, above / f, publish / v, some / m, messages / n, / w

/ w, / w, I / rr, / ude1, Weibo / n, is / vshi, http://weibo.com/hankcs//xu, / w, will / v, synchronization / vd, push / nz, hankcs.com/xu, / ude1, news / n. / w

/ w, / w, heard / v,. / w, China / ns, domain name / n, open / v, application / v, / ule, / w, but / c, I / rr, and / cc, no / v, application / v, hankcs. China / xu, / w, because / c, poor / a,. / w

/ w, / w]

https://github.com/hankcs/HanLP

https://github.com/hankcs/HanLP/releases

www.hankcs.com

http://weibo.com/hankcs/

hankcs.com

hankcs.China
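For the "custom function" route mentioned in this section, a regular expression is often enough. The sketch below is one possible approach and not what URLTokenizer does internally; the pattern and the name extract_urls are assumptions. It only catches URLs that start with http://, https:// or www., so bare domains such as hankcs.com are missed.

```python
import re

# Scheme or "www." prefix followed by characters that are legal in a URL.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def extract_urls(text):
    """Return URLs starting with http://, https:// or www., minus trailing punctuation."""
    return [match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(text)]


sample = "The project is at https://github.com/hankcs/HanLP, and news is posted on www.hankcs.com."
print(extract_urls(sample))
```

The extracted strings could then be added to the user dictionary, as suggested above, so that a regular segmenter keeps each URL as one term.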

That is how named entity recognition works in pyhanlp. Several of these recognizers are likely to come up in day-to-day work, so it is worth knowing which ones are on by default and which must be enabled manually.
