Shulou (Shulou.com) 06/02 report. Updated 2025-04-06.
This article explains how to use HanLP to improve Chinese word segmentation in Elasticsearch.
By default, Elasticsearch segments Chinese text character by character, which does not meet the needs of word-based search. There is an official SmartCN Chinese analysis plug-in, and the IK analysis plug-in is also widely used. Here, however, we use HanLP, a natural language processing toolkit, for Chinese word segmentation.
Elasticsearch
Elasticsearch's default segmentation of Chinese text is poor. Take a 13-character Chinese company name as the sample text:
GET /_analyze?pretty
{"text": ["广州亿速云计算科技有限公司"]}
Output:
{"tokens": [
  {"token": "广", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0},
  {"token": "州", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1},
  {"token": "亿", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2},
  {"token": "速", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3},
  {"token": "云", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4},
  {"token": "计", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5},
  {"token": "算", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6},
  {"token": "科", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7},
  {"token": "技", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8},
  {"token": "有", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9},
  {"token": "限", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10},
  {"token": "公", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11},
  {"token": "司", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 12}
]}
As you can see, the default analyzer splits the text character by character: 13 characters produce 13 single-character tokens.
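To make the behavior above concrete, here is a minimal Python sketch that reproduces the shape of that output for a string made up entirely of Chinese characters. It is not how Elasticsearch is implemented; the function name `char_tokens` is hypothetical, and the sample string is assumed to be the 13-character company name implied by the token offsets.

```python
def char_tokens(text):
    """Emit one token per character, mimicking the shape of the _analyze output.

    Illustration only: this is a hypothetical helper, not Elasticsearch code,
    and it is only meaningful for text made up entirely of Chinese characters.
    """
    tokens = []
    for position, ch in enumerate(text):
        tokens.append({
            "token": ch,
            "start_offset": position,    # offset of the character in the input
            "end_offset": position + 1,  # every token is exactly one character long
            "position": position,
        })
    return tokens


# Assumed sample text: the 13-character company name from the example.
tokens = char_tokens("广州亿速云计算科技有限公司")
print(len(tokens))
print([t["token"] for t in tokens])
```

Because every token is a single character, a search for a word such as 科技 can only be matched through its individual characters, which is why a real word segmenter is needed.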
Elasticsearch-hanlp
HanLP
HanLP is an excellent natural language processing toolkit implemented in Java, with the following features:
Chinese word segmentation
Part of speech tagging
Named entity recognition
Keyword extraction
Automatic summarization
Phrase extraction
Pinyin conversion
Simplified-Traditional Chinese conversion
Text recommendation
Dependency parsing
Corpus tool
After installing the elasticsearch-hanlp plug-in (see: https://github.com/hualongdata/hanlp-ext/tree/master/es-plugin), let's look at the segmentation result for the same text.
GET /_analyze?pretty
{"analyzer": "hanlp", "text": ["广州亿速云计算科技有限公司"]}
Output:
{"tokens": [
  {"token": "广州", "start_offset": 0, "end_offset": 2, "type": "ns", "position": 0},
  {"token": "亿速云", "start_offset": 2, "end_offset": 5, "type": "nr", "position": 1},
  {"token": "计算", "start_offset": 5, "end_offset": 7, "type": "nr", "position": 2},
  {"token": "科技", "start_offset": 7, "end_offset": 9, "type": "n", "position": 3},
  {"token": "有限公司", "start_offset": 9, "end_offset": 13, "type": "nis", "position": 4}
]}
With the hanlp analyzer, the 13 characters are grouped into five word-level tokens, and the type field carries HanLP's part-of-speech tag for each token.
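The same request can also be sent from a script. The following is a minimal Python sketch using only the standard library. It assumes an Elasticsearch node at http://localhost:9200 with the plug-in installed and the analyzer registered under the name "hanlp", as in the request above. The helper names (`build_analyze_request`, `extract_tokens`, `analyze`) are hypothetical and not part of any library.

```python
import json
import urllib.request

# Assumption: a local node with the elasticsearch-hanlp plug-in installed.
ES_URL = "http://localhost:9200"


def build_analyze_request(text, analyzer="hanlp", base_url=ES_URL):
    """Build a POST request for the _analyze API (hypothetical helper)."""
    body = json.dumps({"analyzer": analyzer, "text": [text]}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response):
    """Reduce a parsed _analyze response to (token, type) pairs."""
    return [(t["token"], t["type"]) for t in response["tokens"]]


def analyze(text, analyzer="hanlp", base_url=ES_URL):
    """Send the request and return (token, type) pairs; needs a running node."""
    request = build_analyze_request(text, analyzer, base_url)
    with urllib.request.urlopen(request) as resp:
        return extract_tokens(json.load(resp))
```

Calling `analyze("广州亿速云计算科技有限公司")` against such a node should return the tokens together with their part-of-speech tags. Keeping request building and response parsing in separate functions means both can be checked without a running cluster.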