
How to Understand Elasticsearch Inverted Indexes and Word Segmentation


This article mainly explains how to understand Elasticsearch inverted indexes and word segmentation. The approach introduced here is simple, fast, and practical. Interested readers may wish to take a look and follow along!

1 Inverted Index

1.1 A book's table of contents vs. its index

The forward index is like the table of contents: you go from a page number to its content.

The inverted index is like the index page at the back of the book: you look up a keyword to find the page numbers where it appears.

1.2 Search engines

Forward index

Document Id => the document's content and the words it contains

Inverted index

Word => the Ids of the documents containing that word

[Figure: forward index on the left, inverted index on the right]

Inverted index query process

To query the documents containing "search engine" (a sketch against a live cluster follows below):

Look up "search engine" in the inverted index; the matching document Ids are 1 and 3.

Fetch the full content of documents 1 and 3 through the forward index.

Return the final result.
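A minimal sketch of this flow via the REST API, assuming a hypothetical index test_index with a content field (index name and documents invented for illustration):

PUT test_index/_doc/1
{ "content": "how a search engine works" }

PUT test_index/_doc/3
{ "content": "search engine indexing basics" }

// the phrase query consults the inverted index for "search engine"
GET test_index/_search
{
  "query": { "match_phrase": { "content": "search engine" } }
}
// hits: documents 1 and 3, whose full _source is fetched via the forward index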

1.3 Components of an inverted index

1.3.1 Term Dictionary

An important component of the inverted index

Records all the words (terms) appearing in the documents, so it is generally large

Records, for each word, the association to its posting list

The term dictionary is generally implemented as a B+ Tree

1.3.2 Posting List

Records the set of documents corresponding to each word; it is composed of inverted index entries (Postings).

An inverted index entry (Posting) mainly contains the following information (an illustrative example follows below):

Document Id, used to fetch the original document

Word frequency (TF, Term Frequency): the number of times the word appears in the document, used for subsequent relevance scoring

Position: the word's position(s) within the document's token stream, used for phrase queries (Phrase Query)

Offset: the start and end character offsets of the word in the document, used for highlighting
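As a hypothetical illustration (all values invented), the posting list for the term "search engine" might look like this:

Document Id | TF | Position | Offset
1           | 1  | [0]      | <0, 13>
3           | 1  | [2]      | <10, 23>

Each row is one posting: document 1 contains the term once at position 0, document 3 contains it once at position 2.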

Example

[Figure: the structure of a term dictionary combined with its posting lists, taking "search engine" as the example term]

ES stores documents in JSON format. A document contains multiple fields, and each field has its own inverted index.
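For instance (document contents invented for illustration), indexing the following document builds one inverted index for username and another for job:

PUT test_index/_doc/2
{
  "username": "tom",
  "job": "search engineer"
}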

2 Word Segmentation

Word segmentation is the process of converting text into a series of words (tokens); it is also known as text analysis, and in ES it is called Analysis.

2.1 Analyzer

The analyzer is the ES component dedicated to word segmentation. It is composed of the following parts:

2.1.1 Character Filters

Character filters process the raw text before the Tokenizer runs, for example adding, deleting, or replacing characters, such as removing special HTML tags. Built-in character filters:

HTML Strip: removes HTML tags and converts HTML entities

Mapping: performs character replacements

Pattern Replace: performs regular-expression match and replace

Character filters affect the position and offset information computed by the subsequent tokenizer.
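A minimal sketch of the Mapping character filter via the _analyze API (the replacement rule and text are invented for illustration):

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    { "type": "mapping", "mappings": [":) => happy"] }
  ],
  "text": "I am :)"
}
// tokens: I, am, happy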

2.1.2 Tokenizer

The tokenizer cuts the text into words according to certain rules. Built-in tokenizers:

Standard: splits on word boundaries

Letter: splits on non-letter characters

Whitespace: splits on whitespace

UAX URL Email: splits like Standard, but does not split email addresses and URLs

NGram and Edge NGram: n-gram splitting

Path Hierarchy: splits on file-path separators

Example:

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/one/two/three"
}
// tokens: /one, /one/two, /one/two/three

2.1.3 Token Filters

Token filters reprocess the words produced by the tokenizer, for example lowercasing, deleting, or adding words. Built-in token filters:

Lowercase: converts all terms to lowercase

Stop: deletes stop words

NGram and Edge NGram: n-gram splitting

Synonym: adds synonym terms

Example

// there can be multiple filters
POST _analyze
{
  "text": "a Hello world!",
  "tokenizer": "standard",
  "filter": [
    "stop",       // removes the stop word "a"
    "lowercase",  // lowercases the remaining tokens
    {
      "type": "ngram",
      "min_gram": 4,
      "max_gram": 4
    }
  ]
}
// tokens: hell, ello, worl, orld

The three components are invoked in order: Character Filters, then the Tokenizer, then Token Filters.

3 Analyze API

ES provides an API for testing word segmentation so you can verify an analyzer's output. The endpoint is _analyze:

3.1 Specifying the analyzer

Request

POST _analyze
{
  "analyzer": "standard",              # analyzer to test
  "text": "JavaEdge official account"  # text to analyze
}

Response

{"tokens": [{"token": "java", # participle result "start_offset": 1, # start offset "end_offset": 5, # end offset "type": "," position ": 0 # participle position}, {" token ":" edge " "start_offset": 6, "end_offset": 10, "type": ", position": 1}]}

3.2 Specifying a field in the index

POST test_index/_analyze
{
  "field": "username",   # analyze with this field's analyzer
  "text": "hello world"  # text to analyze
}

3.3 Specifying a custom combination

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],  # custom combination of tokenizer and filters
  "text": "hello world"
}

Where the default analyzer earlier preserved capitalization, this custom combination lowercases its tokens.

4 Built-in Analyzers

Standard Analyzer

The default analyzer: splits on word boundaries, supports multiple languages, lowercases tokens

Simple Analyzer

Splits on non-letter characters, lowercases tokens

Whitespace Analyzer

Splits on whitespace

Stop Analyzer

Stop words are function words such as the, an, this, and so on. This analyzer is the Simple Analyzer plus stop-word removal.

Keyword Analyzer

Does no segmentation; outputs the input directly as a single term

Pattern Analyzer

Customizes the delimiter through a regular expression. The default is \W+, i.e. non-word characters act as delimiters.

Language Analyzer

Provides analyzers for 30+ common languages
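As a quick sketch of the differences, the same text run through two of these analyzers via _analyze:

POST _analyze
{ "analyzer": "stop", "text": "The 2 QUICK Brown-Foxes" }
// tokens: quick, brown, foxes  (splits on non-letters, lowercases, drops "the")

POST _analyze
{ "analyzer": "whitespace", "text": "The 2 QUICK Brown-Foxes" }
// tokens: The, 2, QUICK, Brown-Foxes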

5 Chinese Word Segmentation

Chinese word segmentation divides a sequence of Chinese characters into individual words. In English, spaces serve as natural delimiters between words, but Chinese words have no formal delimiter. Chinese is also highly context-dependent: the same character sequence can yield very different segmentations.

For example, 乒乓球拍卖完了 can be segmented as:

乒乓球拍 / 卖 / 完了 (the table-tennis paddles are sold out)

乒乓球 / 拍卖 / 完了 (the table-tennis auction has ended)

The following are common Chinese word segmentation systems for ES:

IK

Segments both Chinese and English; supports custom dictionaries and hot updates of the segmentation dictionary
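A sketch of IK in use, assuming the IK plugin is installed (it provides the analyzers ik_smart and ik_max_word; the actual tokens depend on IK's dictionary):

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "乒乓球拍卖完了"
}
// one plausible segmentation: 乒乓球 / 拍卖 / 完了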

Jieba

The most popular Chinese word segmentation library for Python; supports segmentation with part-of-speech tagging, traditional Chinese, custom dictionaries, and parallel segmentation.

The following are segmentation systems based on natural language processing:

HanLP

A Java toolkit composed of a series of models and algorithms. It supports index-oriented segmentation, traditional Chinese, simple dictionary-matching segmentation (extreme-speed mode), CRF-based segmentation, N-shortest-path segmentation, and more, implementing many classical segmentation methods. Its goal is to popularize natural language processing in production environments.

https://github.com/hankcs/HanLP

THULAC

THU Lexical Analyzer for Chinese: a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Lab at Tsinghua University, providing Chinese word segmentation and part-of-speech tagging.

https://github.com/microbun/elasticsearch-thulac-plugin

6 Custom Analyzers

When the built-in analyzers cannot meet your needs, you can define a custom analyzer by combining Character Filters, a Tokenizer, and Token Filters. A custom analyzer is declared in the index settings, as in the following example.

Define the custom analyzer as follows:

// define a custom analyzer
PUT test_index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_customer_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["html_strip"],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}

// test the effect of the custom analyzer
POST test_index_name/_analyze
{
  "analyzer": "my_customer_analyzer",
  "text": "<b>Is this a box?</b>"
}
// tokens: is, this, a, box

7 Notes on Using Word Segmentation

Word segmentation is applied at two points in time:

When creating or updating a document (Index Time)

The document's fields are segmented into words.

Index-time segmentation is configured through the analyzer property of each field in the index mapping; when no analyzer is specified, the default standard analyzer is used. See the mapping sketch below.
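A minimal mapping sketch (index and field names invented for illustration) setting a per-field index-time analyzer:

PUT blog_index
{
  "mappings": {
    "properties": {
      "title":   { "type": "text", "analyzer": "whitespace" },  // index-time analyzer for this field
      "content": { "type": "text" }                             // no analyzer set, so standard is used
    }
  }
}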

When querying (Search Time)

The query string is segmented. The search-time analyzer can be specified in two ways (both sketched below):

Set analyzer directly in the query

Set search_analyzer on the field in the index mapping
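Sketches of both options (index and field names invented for illustration):

// option 1: per-query analyzer
GET blog_index/_search
{
  "query": {
    "match": {
      "title": { "query": "Hello World", "analyzer": "standard" }
    }
  }
}

// option 2: search_analyzer in the mapping
PUT blog_index2
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace",
        "search_analyzer": "standard"
      }
    }
  }
}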

Best practices for word segmentation

Be explicit about whether a field needs segmentation; for fields that do not, set the type to keyword, which saves space and improves write performance (see the sketch after this list)

Make good use of the _analyze API to inspect how a document is actually segmented

Do plenty of hands-on testing
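A sketch of the keyword best practice (field names invented for illustration); exact-value fields skip analysis entirely:

PUT blog_index3
{
  "mappings": {
    "properties": {
      "status":  { "type": "keyword" },  // exact value, not analyzed
      "content": { "type": "text" }      // full text, analyzed
    }
  }
}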

At this point, I believe you have a deeper understanding of Elasticsearch inverted indexes and word segmentation. You might as well try it out in practice and keep learning!
