How to Use HanLP to Enhance Elasticsearch Word Segmentation

Updated: 2025-04-06. Source: SLTechnology News&Howtos (shulou)

This article shows how to use HanLP to improve Chinese word segmentation in Elasticsearch.

By default, Elasticsearch splits Chinese text into individual characters, which falls well short of what word-level search requires. There is an official SmartCN Chinese analysis plugin, and the IK analysis plugin is also widely used. Here, however, we use HanLP, a natural language processing toolkit, for Chinese word segmentation.

Elasticsearch

The default word segmentation of Elasticsearch is terrible for Chinese text.

GET /_analyze?pretty
{"text": ["广州亿速云计算科技有限公司"]}

Output:

{
  "tokens": [
    {"token": "广", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0},
    {"token": "州", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1},
    {"token": "亿", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2},
    {"token": "速", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3},
    {"token": "云", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4},
    {"token": "计", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5},
    {"token": "算", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6},
    {"token": "科", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7},
    {"token": "技", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8},
    {"token": "有", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9},
    {"token": "限", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10},
    {"token": "公", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11},
    {"token": "司", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 12}
  ]
}

As you can see, the default analyzer does not segment by word at all: it emits one token per character.
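The same request can also be issued from a script. The following is a minimal sketch, not part of the original article: the function names `analyze` and `token_texts` are illustrative, and the base URL `http://localhost:9200` assumes a local, unsecured Elasticsearch node.

```python
import json
import urllib.request


def analyze(text, analyzer=None, base_url="http://localhost:9200"):
    """Call the _analyze API and return the parsed JSON response.

    base_url is an assumption (a local node without authentication).
    When analyzer is None, Elasticsearch falls back to its default analyzer.
    """
    body = {"text": [text]}
    if analyzer is not None:
        body["analyzer"] = analyzer
    request = urllib.request.Request(
        f"{base_url}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def token_texts(analyze_response):
    """Extract just the token strings from an _analyze response."""
    return [token["token"] for token in analyze_response["tokens"]]


# Offline example using the first two tokens of a default-analyzer response.
sample = {
    "tokens": [
        {"token": "广", "start_offset": 0, "end_offset": 1, "position": 0},
        {"token": "州", "start_offset": 1, "end_offset": 2, "position": 1},
    ]
}
print(token_texts(sample))  # ['广', '州']
```

With a running node, `token_texts(analyze("..."))` would list the tokens for any input string, which makes it easy to compare analyzers side by side.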

Elasticsearch-hanlp

HanLP

HanLP is an excellent natural language processing toolkit implemented in Java, with the following features:

Chinese word segmentation

Part-of-speech tagging

Named entity recognition

Keyword extraction

Automatic summarization

Phrase extraction

Pinyin conversion

Simplified/Traditional Chinese conversion

Text recommendation

Dependency parsing

Corpus tools

After installing the elasticsearch-hanlp plugin (see: https://github.com/hualongdata/hanlp-ext/tree/master/es-plugin), let's look at the segmentation result.

GET /_analyze?pretty
{"analyzer": "hanlp", "text": ["广州亿速云计算科技有限公司"]}

Output:

{
  "tokens": [
    {"token": "广州", "start_offset": 0, "end_offset": 2, "type": "ns", "position": 0},
    {"token": "亿速云", "start_offset": 2, "end_offset": 5, "type": "nr", "position": 1},
    {"token": "计算", "start_offset": 5, "end_offset": 7, "type": "nr", "position": 2},
    {"token": "科技", "start_offset": 7, "end_offset": 9, "type": "n", "position": 3},
    {"token": "有限公司", "start_offset": 9, "end_offset": 13, "type": "nis", "position": 4}
  ]
}

With the hanlp analyzer, the text is segmented into five meaningful words, such as 广州 (Guangzhou) and 有限公司 (limited company), instead of 13 single characters.
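To see how start_offset and end_offset relate to the input, the offsets can be used to slice the original string. This is an illustrative sketch: it assumes the analyzed text is the 13-character company name 广州亿速云计算科技有限公司, and `slice_tokens` is a hypothetical helper name, not something the plugin provides.

```python
# Assumed input text (13 characters), matching offsets 0..13 in the output.
TEXT = "广州亿速云计算科技有限公司"

# (start_offset, end_offset, type) triples taken from the hanlp output.
HANLP_SPANS = [
    (0, 2, "ns"),
    (2, 5, "nr"),
    (5, 7, "nr"),
    (7, 9, "n"),
    (9, 13, "nis"),
]


def slice_tokens(text, spans):
    """Recover each token by slicing text[start:end].

    Elasticsearch reports offsets in UTF-16 code units; for characters in the
    Basic Multilingual Plane (as here) these match Python string indices.
    """
    return [text[start:end] for start, end, _type in spans]


def covers_whole_text(text, spans):
    """Check that the spans are contiguous and cover the text end to end."""
    position = 0
    for start, end, _type in spans:
        if start != position:
            return False
        position = end
    return position == len(text)


print(slice_tokens(TEXT, HANLP_SPANS))
# ['广州', '亿速云', '计算', '科技', '有限公司']
print(covers_whole_text(TEXT, HANLP_SPANS))  # True
```

The same check applied to the default analyzer's output would yield 13 one-character slices, which is the difference the two responses illustrate.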
