This article explains how Elasticsearch's inverted index and analysis (word segmentation) work. The ideas are simple and practical; let's walk through them.
1 Inverted index
1.1 A book's table of contents vs. its index
The table of contents is like a forward index: given a page number, you find the content on that page.
The index at the back of the book is like an inverted index: given a keyword, you find the page numbers where it appears.
1.2 Search engines
Forward index
Maps a document Id to the document's content and the words it contains
Inverted index
Maps a word to the Ids of the documents that contain it
(Figure: forward index on the left, inverted index on the right)
Inverted index query flow
Example: query for documents containing "search engine"
Look up "search engine" in the inverted index; the matching document Ids are 1 and 3
Fetch the complete content of documents 1 and 3 through the forward index
Return the final result
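A minimal sketch of this flow as ES requests, assuming a hypothetical index named docs with a text field content (the index name, Ids, and documents are all illustrative):

PUT docs/_doc/1
{ "content": "how a search engine builds its index" }

PUT docs/_doc/2
{ "content": "cooking recipes for beginners" }

PUT docs/_doc/3
{ "content": "search engine ranking basics" }

POST docs/_search
{
  "query": { "match_phrase": { "content": "search engine" } }
}
// the inverted index resolves "search engine" to document Ids 1 and 3;
// the forward index (the stored _source) then supplies their full content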
1.3 What makes up an inverted index
1.3.1 Term Dictionary
A core component of the inverted index
Records all the terms that appear in the documents; the term set is generally large
Records, for each term, the association to its posting list
The term dictionary is generally implemented as a B+ tree
1.3.2 Posting List
Records the set of documents that contain each term; it is composed of posting entries (Postings).
A posting entry mainly contains the following information:
Document Id, used to fetch the original document
Term frequency (TF, Term Frequency), the number of times the term appears in the document, used later for relevance scoring
Position (Position)
The term's position(s) within the document's token stream, used for phrase queries (Phrase Query)
Offset (Offset)
The term's start and end character offsets in the document, used for highlighting
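These posting details can be inspected for a stored document via the _termvectors API; a minimal sketch, assuming an index test_index that holds document 1 with a text field content:

GET test_index/_termvectors/1
{
  "fields": ["content"],
  "term_statistics": true,
  "positions": true,
  "offsets": true
}
// for each term the response reports term_freq (TF),
// its positions, and its start/end offsets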
Case
Taking "search engine" as an example:
(Figure: the term dictionary combined with its posting list into one structure)
ES stores documents as JSON; a document contains multiple fields, and each field has its own inverted index.
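For example, indexing a document with two fields builds two separate inverted indexes (index and field names are assumptions for illustration):

PUT products/_doc/1
{
  "title": "wireless mouse",
  "description": "a compact wireless mouse with a USB receiver"
}
// the terms of "title" and the terms of "description"
// go into two independent inverted indexes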
2 Analysis (word segmentation)
Word segmentation is the process of converting text into a series of terms; in ES it is also called text analysis (Analysis).
2.1 Analyzer
The analyzer is the ES component dedicated to analysis. It is composed of the following parts:
2.1.1 Character Filters
Process the raw text before the Tokenizer runs, e.g. adding, deleting, or replacing characters (such as stripping HTML tags). Built-in character filters:
HTML Strip removes HTML tags and decodes HTML entities
Mapping performs character replacements
Pattern Replace performs regular-expression replacements
Character filters affect the position and offset information computed by the subsequent tokenizer.
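A minimal sketch of the Mapping character filter defined inline in _analyze (the emoticon mappings are assumptions for illustration):

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [ ":) => happy", ":( => sad" ]
    }
  ],
  "text": "good morning :)"
}
// yields: good, morning, happy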
2.1.2 Tokenizer
Splits the raw text into terms according to certain rules. Built-in tokenizers:
Standard splits on word boundaries
Letter splits on non-letter characters
Whitespace splits on whitespace
UAX URL Email splits like standard, but keeps emails and URLs unsplit
NGram and Edge NGram perform n-gram splitting, for partial-word matching
Path Hierarchy splits along file-path separators
Example:
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/one/two/three"
}
// yields: /one, /one/two, /one/two/three
2.1.3 Token Filters
Reprocess the words processed by tokenizer, such as turning to lowercase, deletion or addition, built-in:
Lowercase converts all term to lowercase
Stop Delete stop words
Conjunctive segmentation of NGram and Edge NGram
Term with synonyms added to Synonym
Example
// there can be multiple filters
POST _analyze
{
  "text": "a Hello world!",
  "tokenizer": "standard",
  "filter": [
    "stop",       // removes the stop word "a"
    "lowercase",  // lowercases the terms
    {
      "type": "ngram",
      "min_gram": 4,
      "max_gram": 4
    }
  ]
}
// yields: hell, ello, worl, orld
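The Synonym filter from the list above can also be tried inline; a minimal sketch, with the synonym pair an assumption:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "synonym",
      "synonyms": [ "es, elasticsearch" ]
    }
  ],
  "text": "es"
}
// emits both "es" and "elasticsearch" at the same position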
The analyzer invokes its components in this order: Character Filters → Tokenizer → Token Filters.
3 Analyze API
ES provides an API for testing analysis, so you can verify how text will be segmented. The endpoint is _analyze:
3.1 Specify an analyzer
Request
POST _ analyze {"analyzer": "standard", # word splitter "text": "JavaEdge official account" # test text}
Response
{"tokens": [{"token": "java", # participle result "start_offset": 1, # start offset "end_offset": 5, # end offset "type": "," position ": 0 # participle position}, {" token ":" edge " "start_offset": 6, "end_offset": 10, "type": ", position": 1}]}
3.2 Specify a field of an index
POST test_index/_analyze
{
  "field": "username",   # field to test
  "text": "hello world"  # text to test
}
3.3 Custom analyzer
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],  # custom filter chain
  "text": "hello world"
}
With the standard tokenizer alone, the original case is kept; adding the lowercase filter produces lowercased terms.
4 Built-in analyzers
Standard Analyzer
The default analyzer; splits on word boundaries, supports multiple languages, lowercases terms
Simple Analyzer
Splits on non-letter characters, lowercases terms
Whitespace Analyzer
Splits on whitespace
Stop Analyzer
Like Simple Analyzer, but additionally removes stop words (Stop Words are function words such as the, an, this)
Keyword Analyzer
Does not segment at all; outputs the whole input as a single term
Pattern Analyzer
Defines the delimiter with a regular expression; the default is \W+, i.e. non-word characters act as delimiters
Language Analyzers
Analyzers for 30+ common languages
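A quick way to compare the built-in analyzers is to run the same text through each; for example:

POST _analyze
{ "analyzer": "stop", "text": "The quick fox" }
// yields: quick, fox  (lowercased, stop word "the" removed)

POST _analyze
{ "analyzer": "whitespace", "text": "The quick fox" }
// yields: The, quick, fox  (case preserved, nothing removed)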
5 Chinese word segmentation
Chinese word segmentation splits a sequence of Chinese characters into individual words. In English, spaces serve as natural delimiters between words, whereas Chinese has no formal delimiter. Chinese is also deeply context-dependent: read differently, the same characters segment very differently.
For example, 乒乓球拍卖完了 can be read as:
乒乓球拍 / 卖 / 完了 (the table-tennis rackets are sold out)
乒乓球 / 拍卖 / 完了 (the table-tennis auction has finished)
The following are common Chinese word segmentation systems used with ES:
IK
Segments both Chinese and English, supports custom dictionaries and hot updates of the segmentation dictionaries
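With the IK plugin installed it registers the analyzers ik_smart and ik_max_word, which can be tested via _analyze (a sketch, assuming the plugin is present):

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "乒乓球拍卖完了"
}
// ik_max_word emits every plausible word; ik_smart returns the coarsest split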
Jieba
The most popular Chinese word segmentation library for Python; supports segmentation and part-of-speech tagging, traditional Chinese, custom dictionaries, and parallel segmentation
The following are segmentation systems based on natural language processing:
HanLP
A Java toolkit built from a suite of models and algorithms; supports index-oriented segmentation, traditional Chinese, ultra-fast dictionary matching ("extreme speed" mode), CRF-based segmentation, N-shortest-path segmentation, and more, implementing many classic segmentation methods. Its stated goal is to bring natural language processing into production use.
https://github.com/hankcs/HanLP
THULAC
THU Lexical Analyzer for Chinese, a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Lab at Tsinghua University; it provides Chinese word segmentation and part-of-speech tagging.
https://github.com/microbun/elasticsearch-thulac-plugin
6 Custom analyzers
When the built-in analyzers cannot meet your needs, you can define a custom analyzer by specifying its Character Filters, Tokenizer, and Token Filters. A custom analyzer is declared in the index settings, as in the following example:
Define the custom analyzer:
// custom analyzer
PUT test_index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_customer_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["html_strip"],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}

// Test the component combination:
POST test_index/_analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "Is this a box?"
}
// with the keyword tokenizer the whole input comes back as a single term:
// Is this a box? (any HTML would have been stripped first)
7 When analyzers are applied
Analysis happens at two points:
Index time (when creating or updating a document)
The document's text fields are analyzed into terms as they are indexed.
Index-time analysis is configured through the analyzer property of each field in the index mapping; when no analyzer is specified, the default standard is used.
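A minimal sketch of configuring an index-time analyzer in the mapping (index and field names are assumptions):

PUT blog
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}
// text indexed into "title" is split on whitespace rather than by standard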
Search time (when querying)
The query string itself is analyzed. The search-time analyzer can be specified in two ways (see the sketch after this list):
Pass an analyzer in the query itself
Set search_analyzer in the index mapping
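A sketch of both options (index, field, and query text are assumptions):

// 1. per-request, via "analyzer" inside the query clause
POST blog/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Hello World",
        "analyzer": "standard"
      }
    }
  }
}

// 2. per-field default, via search_analyzer in the mapping
PUT blog2
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace",
        "search_analyzer": "standard"
      }
    }
  }
}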
Best practices for analysis
Decide explicitly whether a field needs analysis; for fields that do not, setting type to keyword saves space and improves write performance (see the sketch after this list)
Make good use of the _analyze API to inspect exactly how text in a document is segmented
Test hands-on
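For the first point, a sketch of mapping a field that should not be analyzed (index and field names are assumptions):

PUT logs
{
  "mappings": {
    "properties": {
      "status_code": { "type": "keyword" }
    }
  }
}
// "status_code" is stored as a single unanalyzed term:
// exact-match lookups work, and no analysis cost is paid at write time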
That covers Elasticsearch inverted indexes and analysis; the best way to make it stick is to try the examples yourself.