How to use the IK Analyzer Chinese word segmentation plugin in Elasticsearch

2025-04-05 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

Many newcomers are unclear about how to use the IK Analyzer Chinese word segmentation plugin in Elasticsearch. To help with that, the steps below walk through installation, configuration, and testing in detail; anyone who needs this can follow along, and hopefully you will get something out of it.

Installing the Chinese word segmentation plugin

The plugin version must match the Elasticsearch version exactly.

Downloads for each plugin version:

https://github.com/medcl/elasticsearch-analysis-ik/releases

Install it with the script bundled with Elasticsearch:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.3.0/elasticsearch-analysis-ik-7.3.0.zip

The plugin's jar files are installed under elasticsearch-7.3.0/plugins/analysis-ik.

The plugin's configuration files live under elasticsearch-7.3.0/config/analysis-ik, which contains a number of dictionaries. If you want to extend them with custom dictionaries for your own business, edit the IKAnalyzer.cfg.xml file in that directory.

For example:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">custom/mydict.dic</entry>
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <entry key="remote_ext_dict">http://10.0.11.1:10002/elasticsearch/myDict</entry>
    <entry key="remote_ext_stopwords">http://10.0.11.1:10002/elasticsearch/stopWordDict</entry>
</properties>

Extension dictionaries can be configured locally or hosted on a remote server.

The custom directory lives alongside the IKAnalyzer.cfg.xml file. Note that extension dictionary files must be plain text in UTF-8 encoding.
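A local extension dictionary such as custom/mydict.dic is simply a UTF-8 text file with one word per line; the entries below are hypothetical examples, not taken from the plugin's bundled dictionaries:

```
蓝瘦香菇
洪荒之力
区块链
```

Words listed here are treated as indivisible terms by the analyzer.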

When the dictionaries are hosted remotely, updates take effect without restarting the service, but some headers must be set in the HTTP response.

The HTTP response must return two headers, Last-Modified and ETag, both strings; whenever either one changes, the plugin fetches the new words and updates its dictionary.

The response body is one word per line, with \n as the line separator.
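These two requirements can be sketched as a minimal, self-contained endpoint using only the JDK's built-in com.sun.net.httpserver. The word list, version string, port, and path below are hypothetical placeholders (they merely mirror the remote_ext_dict URL style used in this article); a real service would load them from a database, as in the controller shown later.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class RemoteDictServer {

    // The body format the IK plugin expects: one word per line, '\n'-separated.
    static String dictBody(List<String> words) {
        return String.join("\n", words) + "\n";
    }

    public static void main(String[] args) throws Exception {
        List<String> words = List.of("网络流行词", "自定义词"); // hypothetical entries
        String version = "1"; // bump whenever the word list changes

        HttpServer server = HttpServer.create(new InetSocketAddress(10002), 0);
        server.createContext("/elasticsearch/myDict", exchange -> {
            byte[] body = dictBody(words).getBytes(StandardCharsets.UTF_8);
            // When either header's value changes, the plugin re-fetches the list.
            exchange.getResponseHeaders().set("Last-Modified", version);
            exchange.getResponseHeaders().set("ETag", version);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}
```

Because the version is sent as a header rather than recomputed from the body, the plugin's periodic polling stays cheap: it only downloads the word list when the version actually changes.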

After modifying IKAnalyzer.cfg.xml itself, the service must be restarted.

// Create the index
PUT /full_text_test

// Add the mapping
POST /full_text_test/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_smart"
    }
  }
}

// Add a document
POST /full_text_test/_doc/1
{
  "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}
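With this mapping in place, the content field is indexed with ik_max_word and queries against it are analyzed with ik_smart. A match query can then hit the document through any of the indexed terms; the query text below is chosen for illustration:

```
POST /full_text_test/_search
{
  "query": {
    "match": {
      "content": "洛杉矶"
    }
  }
}
```

Using the fine-grained analyzer at index time and the coarse-grained one at search time is a common pairing: it maximizes recall in the index while keeping query-side tokenization close to user intent.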

Testing the segmentation results

ik_max_word: splits the text at the finest granularity.

ik_smart: splits the text at the coarsest granularity.

POST /full_text_test/_analyze
{
  "text": ["中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"],
  "tokenizer": "ik_max_word"
}

Result:

{
  "tokens" : [
    { "token" : "中国", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "驻", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "洛杉矶", "start_offset" : 3, "end_offset" : 6, "type" : "CN_WORD", "position" : 2 },
    { "token" : "领事馆", "start_offset" : 6, "end_offset" : 9, "type" : "CN_WORD", "position" : 3 },
    { "token" : "领事", "start_offset" : 6, "end_offset" : 8, "type" : "CN_WORD", "position" : 4 },
    { "token" : "馆", "start_offset" : 8, "end_offset" : 9, "type" : "CN_CHAR", "position" : 5 },
    { "token" : "遭", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 6 },
    { "token" : "亚裔", "start_offset" : 10, "end_offset" : 12, "type" : "CN_WORD", "position" : 7 },
    { "token" : "男子", "start_offset" : 12, "end_offset" : 14, "type" : "CN_WORD", "position" : 8 },
    { "token" : "子枪", "start_offset" : 13, "end_offset" : 15, "type" : "CN_WORD", "position" : 9 },
    { "token" : "枪击", "start_offset" : 14, "end_offset" : 16, "type" : "CN_WORD", "position" : 10 },
    { "token" : "嫌犯", "start_offset" : 17, "end_offset" : 19, "type" : "CN_WORD", "position" : 11 },
    { "token" : "已", "start_offset" : 19, "end_offset" : 20, "type" : "CN_CHAR", "position" : 12 },
    { "token" : "自首", "start_offset" : 20, "end_offset" : 22, "type" : "CN_WORD", "position" : 13 }
  ]
}

POST /full_text_test/_analyze
{
  "text": ["中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"],
  "tokenizer": "ik_smart"
}

Result:

{
  "tokens" : [
    { "token" : "中国", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "驻", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "洛杉矶", "start_offset" : 3, "end_offset" : 6, "type" : "CN_WORD", "position" : 2 },
    { "token" : "领事馆", "start_offset" : 6, "end_offset" : 9, "type" : "CN_WORD", "position" : 3 },
    { "token" : "遭", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 4 },
    { "token" : "亚裔", "start_offset" : 10, "end_offset" : 12, "type" : "CN_WORD", "position" : 5 },
    { "token" : "男子", "start_offset" : 12, "end_offset" : 14, "type" : "CN_WORD", "position" : 6 },
    { "token" : "枪击", "start_offset" : 14, "end_offset" : 16, "type" : "CN_WORD", "position" : 7 },
    { "token" : "嫌犯", "start_offset" : 17, "end_offset" : 19, "type" : "CN_WORD", "position" : 8 },
    { "token" : "已", "start_offset" : 19, "end_offset" : 20, "type" : "CN_CHAR", "position" : 9 },
    { "token" : "自首", "start_offset" : 20, "end_offset" : 22, "type" : "CN_WORD", "position" : 10 }
  ]
}

To make the dictionary easy to extend at any time, implement a dictionary table managed from a database.

/**
 * elasticsearch ik-analysis remote dictionary
 * 1. The HTTP response must return two headers, Last-Modified and ETag, both
 *    strings; whenever either one changes, the plugin fetches the new words
 *    and updates its dictionary.
 * 2. The response body is one word per line, with \n as the line separator.
 */
@RequestMapping("myDict")
public String myDict(HttpServletResponse response) {
    // Query the current dictionary version from the database
    String version = esDictVersionMapper.selectById(1).getVersion();
    // Put the dictionary version into the response header
    response.setHeader("Last-Modified", version);
    StringBuilder sb = new StringBuilder();
    // Fetch every row of the extension dictionary table in MySQL, one word per line
    esDictMapper.selectList(null).forEach(item -> sb.append(item.getWord()).append("\n"));
    return sb.toString();
}

Common problems

Question 1: "analyzer [ik_max_word] not found for field [content]"

Workaround: this usually means the IK plugin is missing on some nodes of the cluster. After installing IK on all Elasticsearch nodes and restarting them, the problem is resolved.
