How to use ik for Chinese word segmentation in ElasticSearch
This article explains how to use ik for Chinese word segmentation in ElasticSearch. The approach introduced here is simple and practical; let's walk through it step by step.
Full-text search and exact matching
ElasticSearch supports both full-text search and exact matching on textual data, but the corresponding field type must be set in advance:
keyword type: the value is stored as-is, without word segmentation, and supports both exact queries and segmented match queries;
text type: the value is segmented into tokens when stored, and likewise supports both exact queries and segmented match queries.
For example, create an index named article and configure a mapping for its two fields: the article content is set to type text and the article title to type keyword.
When a document is stored, ElasticSearch segments the article content field, obtains the resulting tokens and saves them, while the article title is saved as its original value without segmentation.
The right half of the figure above shows the storage process for the two types, keyword and text. The left half shows the two corresponding query methods in ElasticSearch:
term query, i.e. exact query: the input is not segmented; the lookup is performed directly on the input term;
match query, i.e. segmented match query: the input is segmented first, and then each resulting token is looked up one by one.
For example, suppose there are two articles, one whose title and content are both 程序员 (programmer) and another whose title and content are 程序 (program); their inverted indexes in ElasticSearch are stored as shown below (assuming a particular tokenizer is used).
Querying the two fields with term and match queries then yields the results shown on the right side of the figure.
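To make the difference concrete, here is a minimal, self-contained sketch. It is not ElasticSearch code: it uses a toy inverted index and a deliberately naive single-character tokenizer as a stand-in for a real Chinese analyzer, purely to contrast a term-style lookup with a match-style lookup.

    import java.util.*;

    // A toy inverted index: token -> ids of the documents that contain that token.
    public class TermVsMatch {
        static Map<String, Set<Integer>> invertedIndex = new HashMap<>();

        // Deliberately naive "tokenizer": one token per character (a stand-in for ik).
        static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            for (char c : text.toCharArray()) tokens.add(String.valueOf(c));
            return tokens;
        }

        // text-style indexing: store every token produced from the field value.
        static void indexAsText(int docId, String content) {
            for (String token : tokenize(content))
                invertedIndex.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
        }

        // term query: look the input up as-is, with no segmentation.
        static Set<Integer> termQuery(String input) {
            return invertedIndex.getOrDefault(input, Collections.emptySet());
        }

        // match query: segment the input first, then look up each token and merge the hits.
        static Set<Integer> matchQuery(String input) {
            Set<Integer> hits = new HashSet<>();
            for (String token : tokenize(input)) hits.addAll(termQuery(token));
            return hits;
        }

        public static void main(String[] args) {
            indexAsText(1, "程序员"); // document 1: "programmer"
            indexAsText(2, "程序");   // document 2: "program"
            System.out.println(termQuery("程序员"));  // [] -- no stored token equals the whole input
            System.out.println(matchQuery("程序员")); // [1, 2] -- the input's tokens hit both documents
        }
    }

With a keyword field, by contrast, the whole value is stored as one unsegmented token, so an exact lookup on 程序员 against the title field would succeed.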
Analyzer process
As you can see, the difference between keyword and text, and between term and match, comes down to whether word segmentation is performed. In ElasticSearch this segmentation process is collectively called text analysis: the process of converting a field from an unstructured string (text) into structured, searchable terms (keywords).
Text analysis involves more than just word segmentation; it consists of the following steps:
Character filters preprocess the raw text, for example removing unwanted characters.
A tokenizer splits the processed text into individual tokens.
Token filters further process the tokens from the previous step, for example changing tokens (lowercasing), deleting tokens (removing quantifiers), adding tokens (synonyms), merging synonyms, and so on.
The component in ElasticSearch that performs text analysis is the analyzer. Accordingly, an analyzer consists of three parts: character filters, a tokenizer, and token filters.
ElasticSearch has 3 built-in character filters, 10 built-in tokenizers, and 31 built-in token filters; corresponding third-party components can also be obtained through the plugin mechanism. Developers can combine these components into a custom analyzer to fit their needs, for example:
"analyzer": {"my_analyzer": {"type": "custom", "char_filter": ["html_strip"], "tokenizer": "standard", "filter": ["lowercase",]}}
With the configuration above, the my_analyzer analyzer works roughly as follows (a hand-rolled sketch of the pipeline follows the list):
the character filter is html_strip, which removes HTML tags and related characters;
the tokenizer is standard, ElasticSearch's default tokenizer;
the token filter is lowercase, which lowercases English words.
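As a rough illustration of the same three-stage pipeline, here is a hand-rolled sketch in plain Java. It is not the Lucene/ElasticSearch implementation; it simply chains a character filter that strips HTML tags, a tokenizer that splits on non-letter characters, and a token filter that lowercases.

    import java.util.*;

    public class ToyAnalyzer {
        // Stage 1: character filter -- strip HTML tags from the raw text (rough regex, illustration only).
        static String charFilter(String raw) {
            return raw.replaceAll("<[^>]*>", " ");
        }

        // Stage 2: tokenizer -- split the filtered text into tokens on non-letter/non-digit characters.
        static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            for (String t : text.split("[^\\p{L}\\p{N}]+")) {
                if (!t.isEmpty()) tokens.add(t);
            }
            return tokens;
        }

        // Stage 3: token filter -- lowercase every token.
        static List<String> tokenFilter(List<String> tokens) {
            List<String> out = new ArrayList<>();
            for (String t : tokens) out.add(t.toLowerCase());
            return out;
        }

        public static void main(String[] args) {
            String raw = "<p>Hello ElasticSearch Analyzer</p>";
            List<String> result = tokenFilter(tokenize(charFilter(raw)));
            System.out.println(result); // [hello, elasticsearch, analyzer]
        }
    }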
Generally speaking, the most important part of an analyzer is the tokenizer, and the quality of segmentation directly affects search accuracy and relevance. ElasticSearch's default tokenizer is not the best choice for Chinese; at present, ik is the mainstream choice for Chinese word segmentation in the industry.
How ik word segmentation works
ik is the mainstream open-source Chinese segmentation plugin for ElasticSearch. It has a built-in basic Chinese dictionary and segmentation algorithm that help developers quickly build Chinese segmentation and search features, and it also supports extension dictionaries and remote dictionaries so that developers can add new words and internet buzzwords.
ik ships with three built-in dictionaries:
main.dic: the main dictionary, containing common everyday words such as 程序员 (programmer) and 编程 (programming);
quantifier.dic: the quantifier dictionary, containing everyday measure words such as 米 (meter), 公顷 (hectare) and 小时 (hour);
stopword.dic: stop words, mainly English stop words such as a, such and that.
In addition, developers can extend these dictionaries by configuring extension dictionaries and remote dictionaries.
When ik starts with ElasticSearch, it reads the default and extension dictionaries, loads them into memory, and stores them in a trie (also called a prefix tree or dictionary tree), which makes subsequent segmentation lookups efficient.
A typical trie is shown above: each node holds one character, and concatenating the characters on the path from the root to a node yields the word that node represents. The words in the figure include 程序员 (programmer), 程门立雪 (an idiom), 编织 (weaving), 编码 (coding) and 工作 (work). A simplified sketch of such a trie appears after the loading steps below.
I. Loading the dictionaries
During initialization, ik's Dictionary singleton calls the corresponding load functions to read the dictionary files and build three tries composed of DictSegment nodes: MainDict, QuantifierDict and StopWords. Let's look at the loading and construction of the main dictionary. The loadMainDict function is relatively simple: it first creates a DictSegment object as the root of the trie, then loads the default main dictionary, the extension main dictionary and the remote main dictionary in turn to populate the tree.
Inside loadDictFile, words are read from the dictionary file line by line and handed to DictSegment's fillSegment function.
fillSegment is the core function for building the trie. Its processing logic consists of the following steps (a simplified sketch follows the list):
1. Take the character of the word at the current index.
2. Check whether that character already exists among the current node's children; if not, add it to charMap.
3. Call the lookforSegment function to find the node in the trie that represents this character, inserting a new one if it does not exist.
4. Recursively call fillSegment to process the next character.
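Here is a simplified sketch of this trie and its fillSegment-style recursive insertion. It is a stand-in for DictSegment, not ik's actual code; the class and method names are hypothetical.

    import java.util.*;

    // A simplified dictionary trie: each node represents one character of a word.
    public class DictTrie {
        static final int NO_MATCH = 0, PREFIX = 1, WORD = 2;

        Map<Character, DictTrie> children = new HashMap<>();
        boolean isWord; // true if the path from the root to this node spells a complete word

        // fillSegment-style insertion: take the character at `index`, find or create the
        // child node for it, then recurse on the next character of the word.
        void fill(String word, int index) {
            if (index == word.length()) { isWord = true; return; }
            char c = word.charAt(index);
            DictTrie child = children.computeIfAbsent(c, k -> new DictTrie());
            child.fill(word, index + 1);
        }

        // Look up a string: is it a complete word, only a prefix of longer words, or neither?
        int match(String s) {
            DictTrie node = this;
            for (char c : s.toCharArray()) {
                node = node.children.get(c);
                if (node == null) return NO_MATCH;
            }
            return node.isWord ? WORD : PREFIX;
        }

        public static void main(String[] args) {
            DictTrie root = new DictTrie();
            for (String w : new String[]{"程序员", "程序", "编码", "编程"}) root.fill(w, 0);
            System.out.println(root.match("程序")); // 2: a complete word (and also a prefix of 程序员)
            System.out.println(root.match("程"));   // 1: only a prefix
            System.out.println(root.match("码"));   // 0: not in the dictionary at all
        }
    }

The distinction between a complete-word hit and a prefix-only hit is exactly what the segmentation logic in the next section relies on: the former produces a token, the latter is kept around in case a longer word follows.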
That is roughly ik's initialization process. For further details you can read the source code directly; it is annotated with Chinese comments and fairly easy to follow.
II. Logic of word segmentation
ik implements the relevant ElasticSearch abstract classes to plug in its own segmentation logic:
IKAnalyzer extends Analyzer and provides the analyzer for Chinese word segmentation;
IKTokenizer extends Tokenizer and provides the tokenizer for Chinese word segmentation; its incrementToken method is the entry point ElasticSearch calls to have ik segment text.
incrementToken calls IKSegmenter's next method, the core of ik segmentation, to obtain the segmentation results.
As shown in the figure above, IKSegmenter contains three sub-segmenters. It iterates over every character of the input and lets the three sub-segmenters process each character in turn:
LetterSegmenter, the English segmenter, is relatively simple: it groups consecutive English letters into tokens.
CN_QuantifierSegmenter, the Chinese quantifier segmenter, checks whether the current character is a numeral or a measure word, and groups consecutive numerals and measure words into one token.
CJKSegmenter, the core segmenter, performs segmentation based on the dictionary trie described above.
Here we focus on the implementation of CJKSegmenter, whose analyze function has roughly two branches:
look up the current character in the trie; if it is a word by itself, emit a token, and if it is the prefix of a word, put it into the temporary hit list (tmpHits);
for each hit saved in previous steps, extend it with the current character and query the trie again; if the extended string is a word, emit a token.
The concrete code logic is shown above. To make it easier to follow, suppose the input word is 编码 (coding):
First the character 编 is processed.
Because tmpHits is currently empty, the character is looked up on its own.
编 is looked up directly in the trie from the earlier figure (see the matchInMainDict function for details); it matches, but it is not the end of a word, so a Hit object recording 编 and its position in the input is created and stored in tmpHits.
Next the character 码 is processed.
Because tmpHits is not empty, the saved Hit object is combined with 码 and queried in the trie (see the matchWithHit function for details); the lookup finds that 编码 is a word, so it is stored in AnalyzeContext as one of the output tokens. But since 编码 is already a leaf node with no children, it cannot be the prefix of any other word, so the corresponding Hit object is deleted.
The character 码 itself is then also looked up in the trie, to check whether it is a word or the prefix of a word on its own.
And so on, until all the characters of the input have been processed.
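A minimal sketch of this scanning flow, using a plain set of dictionary words and a precomputed set of prefixes in place of ik's trie and Hit objects (the names and structure here are illustrative only), could look like this:

    import java.util.*;

    // A toy dictionary-based scan in the spirit of CJKSegmenter: walk the input one character
    // at a time, keep candidate start positions whose substring is still a valid prefix,
    // and emit a token whenever a candidate substring is a complete dictionary word.
    public class ToySegmenter {
        public static void main(String[] args) {
            Set<String> dict = new HashSet<>(Arrays.asList("程序员", "程序", "编程", "编码", "爱"));

            // Precompute every proper prefix of the dictionary words (in ik the trie answers this).
            Set<String> prefixes = new HashSet<>();
            for (String w : dict)
                for (int i = 1; i < w.length(); i++) prefixes.add(w.substring(0, i));

            String input = "程序员爱编码";
            List<String> tokens = new ArrayList<>();
            List<Integer> pending = new ArrayList<>(); // start positions of partial hits (like tmpHits)

            for (int i = 0; i < input.length(); i++) {
                pending.add(i); // the current character may start a new candidate word
                List<Integer> stillOpen = new ArrayList<>();
                for (int start : pending) {
                    String cand = input.substring(start, i + 1);
                    if (dict.contains(cand)) tokens.add(cand);         // complete hit: emit a token
                    if (prefixes.contains(cand)) stillOpen.add(start); // still a prefix: keep it open
                }
                pending = stillOpen;
            }
            System.out.println(tokens); // [程序, 程序员, 爱, 编码] -- every dictionary hit, ik_max_word style
        }
    }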
III. Disambiguation and output of results
The steps above can sometimes produce a rather large set of segmentation results. For example, 程序员爱编程 (programmers love programming) is split into five tokens: 程序员 (programmer), 程序 (program), 员 (member), 爱 (love) and 编程 (programming). This is the output of ik's ik_max_word mode. In some scenarios, however, developers only want three tokens: 程序员, 爱 and 编程. For that they need ik's ik_smart mode, which performs disambiguation.
ik uses IKArbitrator for disambiguation, mainly through combinatorial traversal. It first extracts the non-intersecting groups of tokens from the previous stage's results; tokens "intersect" when their positions in the text overlap. For example, 程序员, 程序 and 员 intersect with one another, while 爱 and 编程 do not intersect with them. So during disambiguation, 程序员/程序/员 are handled as one group, 爱 as another and 编程 as another, and within each group the candidate segmentation with the highest priority is selected according to the following rules:
A longer total length of matched text is preferred.
Fewer tokens are preferred.
A larger path span is preferred.
A path that ends later is preferred, because statistically reverse segmentation is more likely to be correct than forward segmentation.
More evenly distributed token lengths are preferred.
A larger token position weight is preferred.
According to these rules, in the first group 程序员 clearly fits the rules better than 程序 plus 员, so the disambiguated output is 程序员 rather than 程序 and 员.
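As an illustration of how such an ordering can be expressed (a simplified analogue of the rules above, not IKArbitrator's actual implementation; the Candidate class and its fields are hypothetical, and rule 6 is omitted), consider this comparator over candidate segmentation paths:

    import java.util.*;

    // A hypothetical candidate path: one possible way of segmenting an overlapping region.
    class Candidate {
        List<String> words;
        int pathBegin, pathEnd; // character offsets this path covers in the original text

        Candidate(int begin, int end, String... ws) {
            this.pathBegin = begin;
            this.pathEnd = end;
            this.words = Arrays.asList(ws);
        }
        int payloadLength() { // total length of text actually covered by words
            int n = 0;
            for (String w : words) n += w.length();
            return n;
        }
        int lengthProduct() { // a larger product means the word lengths are more even
            int p = 1;
            for (String w : words) p *= w.length();
            return p;
        }
    }

    public class Disambiguator {
        // Order candidates so that the preferred one (per the rules above) comes first.
        static Comparator<Candidate> RULES = Comparator
            .comparingInt((Candidate c) -> c.payloadLength()).reversed()                                 // 1. longer covered text first
            .thenComparingInt((Candidate c) -> c.words.size())                                           // 2. fewer tokens first
            .thenComparing(Comparator.comparingInt((Candidate c) -> c.pathEnd - c.pathBegin).reversed()) // 3. larger span first
            .thenComparing(Comparator.comparingInt((Candidate c) -> c.pathEnd).reversed())               // 4. later end position first
            .thenComparing(Comparator.comparingInt((Candidate c) -> c.lengthProduct()).reversed());      // 5. more even word lengths first

        public static void main(String[] args) {
            // Two ways to segment the span 程序员: as one word, or as 程序 + 员.
            Candidate whole = new Candidate(0, 3, "程序员");
            Candidate split = new Candidate(0, 3, "程序", "员");
            List<Candidate> candidates = new ArrayList<>(Arrays.asList(split, whole));
            candidates.sort(RULES);
            System.out.println(candidates.get(0).words); // [程序员] wins: same coverage, fewer tokens
        }
    }

Sorting with this comparator puts 程序员 ahead of 程序 + 员 because both cover the same text but the former uses fewer tokens, matching the conclusion above.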
Finally, some positions of the input may not be covered by any token; these are output directly as single characters (see AnalyzeContext's outputToResult function for details). For example, in a sentence like 程序员是职业 (a programmer is a profession), a character such as 是 (is) is not produced by dictionary matching, but it still appears in the final output as a single-character token.
Postscript
The combination of ElasticSearch and ik is currently a mainstream Chinese search stack. Understanding the basic flow and principles of its search and segmentation helps developers build Chinese search features faster, or customize the segmentation strategy for their own needs.
At this point, you should have a deeper understanding of how ElasticSearch uses ik for Chinese word segmentation. The best next step is to try it out in practice.