Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

What is the inverted index structure in Elasticsearch

2025-01-19 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/02 Report--

This article mainly explains "what is the inverted index structure in Elasticsearch". Interested friends may wish to have a look at it. The method introduced in this paper is simple, fast and practical. Let's let the editor take you to learn "what is the inverted index structure in Elasticsearch"?

Inverted index (Inverted Index) is also called reverse index, and if there is a reverse index, there must be a forward index. Generally speaking, forward indexes find value through key, and reverse indexes find key through value.

Let's first recall how we inserted an index record:

Curl-X PUT "localhost:9200/user/_doc/1"-H 'Content-Type: application/json'-d' {"name": "Jack", "gender": 1, "age": 20}'

In fact, it is to directly PUT a JSON object, which has multiple fields, and while inserting these data into the index, Elasticsearch also indexes these fields-inverted index, because the core function of Elasticsearch is search.

So, what does an inverted index look like?

The inverted index consists of two parts:

Word dictionary

Inverted list

Its structure is as follows:

Word dictionaries have two data structures: B + tree and Hash table.

The B+ tree is the same as the primary key index data structure in the Mysql index structure, so we will not introduce it here.

The key of the hash table is the hash value of the word, and the value is the linked list of words, because there will be conflicts in the hash algorithm, so the value here is a collection in which the pointers to the inverted list of the same hash worthy words and changed words are stored.

Inverted list

Inverted list properties:

Record a list of documents that have appeared a word

At the same time, the occurrence times and offset positions of words in all documents are recorded.

Invert the data structure of table elements:

(DocID;TF;)

Where:

DocID: the document ID in which a word appears

TF (Term Frequency): the number of times a word appears in this document

POS: the position of the word in the document

Give an example

There is a single document below

-content document 1 Baidu's annual goal document 2AI Technology and Ecology Department's annual goal document 3AI market's annual goal

The inverted index they generated

Word ID word inverted document Frequency list (DocID;TF;) 1 goal 3 (1x;), (2x 1;), (3x 1;) 2 year 3 (1x 1;), (2x 1;), (3x 1;) 3AI2 (2x 1;), (3x x 1;) 4 Technology 1 (2x 1;) 5 Ecology 1 (2x 1;) 6 Market 1 (3x 1;)

For example, the word "year", the word ID is 2, has appeared in three documents, so the inverted document frequency is 3, and there are also three elements in the inverted index: (1), (2), (3), (3). Explain with the first element (1 / 10 / 1;), which means that "year" appears once in a document with an ID of 1, in the second word.

First of all, to clarify a few concepts, to this end, to give an example:

Suppose you have an user index that has four fields: name,gender,age,address. If you draw it, it will look like this, just like a relational database.

Term (word): after a piece of text is analyzed by the analyzer, it will output a list of words, one by one is called Term (literally translated as words)

Term Dictionary (word dictionary): as the name implies, it maintains a Term, which can be understood as a collection of Term

Term Index (word index): in order to find a word faster, we index the word

Posting List (inverted list): an inverted list records a list of all documents in which a word has appeared and the location information of the word in that document. Each record is called an inverted Posting. According to the inverted list, you can know which documents contain a word. (PS: the actual inverted list does not only store the document ID, but also some other information, such as word frequency (number of Term occurrences), offset (offset), etc., which can be thought of as objects in Java)

(PS: if you compare a modern Chinese dictionary, Term is equivalent to words, Term Dictionary is equivalent to the Chinese dictionary itself, and Term Index is equivalent to the directory index of the dictionary.)

We know that every document has an ID, and if the insert time is not specified, the Elasticsearch will automatically generate one, so there is no need to say much about the ID field.

In the above example, the index built by Elasticsearch is roughly as follows:

Name field:

Gender field:

Elasticsearch sets up an inverted index for each field. For example, "Zhang San", "Beijing" and 22 above are all Term, and [1pr 3] is Posting List. A Posting list is an array that stores all the document ID that conforms to a certain Term.

As long as you know the document ID, you can quickly find the document. However, how to quickly find this Term through our given keywords?

Indexing, of course, for Terms, the best is the B-Tree index (PS:MySQL is the best example of a B-tree index).

First, let's recall what the index in the MyISAM storage engine looks like:

The process of finding Term is roughly the same as recording ID in MyISAM.

In MyISAM, the index is separated from the data. The address of the record can be found through the index, and then the record can be found.

In the inverted index, the location of Term in Term Dictionary can be found through Term index, and then Posting List can be found. With the inverted list, the document can be found according to ID.

(PS: to put it this way, if you compare MyISAM, Term Index is equivalent to index file and Term Dictionary is equivalent to data file.)

PS: actually, there are three steps in front of us. We can look at Term Index and Term Dictionary as one step, that is, to find Term. Therefore, the inverted index can be understood in this way: the corresponding inverted list is found by the words, and the document record can be found according to the inverted items in the inverted list.

The actual term index is a trie tree:

The example is a trie tree containing "A", "to", "tea", "ted", "ten", "I", "in", and "inn". This tree will not contain all the term, it will contain some prefixes of term. With term index, you can quickly locate an offset in term dictionary, and then look it up sequentially from that location. Coupled with some compression techniques (search Lucene Finite State Transducers) the size of the term index can be only a tenth of the size of all term, making it possible to cache the entire term index in memory. On the whole, this is the effect.

These are three posting list. We now need to merge them with the relationship of AND to get the intersection of posting list. First select the shortest posting list, and then traverse from small to large. The traversal process can skip some elements, for example, when we traverse to the green 13, we can skip the blue 3, because 3 is smaller than 13.

The whole process is as follows

Next-> 2Advance (2)-> 13Advance (13)-> 13Already on 13Advance (13)-> 13 matchmakers next-> 17Advance (17)-> 22Advance (22)-> 98Advance (98)-> 98Advance (98)-> 98 matchmakers!

The final intersection is [13998], which takes much faster time than traversing three posting list in its entirety. But the premise is that each list needs to indicate the operation Advance and quickly move to the location. What kind of list can do a frog jump forward with Advance like this? Skip list:

Consider the frequent occurrence of term (the value of the so-called low cardinality), such as male or female in gender. If there were 1 million documents, there would be 500000 int values in the posting list of male sex. Compressing with Frame of Reference coding can greatly reduce disk footprint. This optimization is very important for reducing the size of the index. Of course, there is a similar posting list thing in mysql b-tree, which has not been compressed in this way.

Because the coding of this Frame of Reference has a decompression cost. With skip list, in addition to skipping the cost of traversal, you can also skip the process of decompressing these compressed block, thus saving cpu.

Using bitset to merge

Bitset is a very intuitive data structure, corresponding to posting list, such as:

[1,3,4,7,10]

The corresponding bitset is:

[1,0,1,1,0,0,1,0,0,1]

Each document is sorted by document id corresponding to one of the bit. Bitset itself has the feature of compression, which can represent 8 documents with one byte. So only 125000 byte are needed for 1 million documents. But given that there may be billions of documents, keeping bitset in memory is still a luxury. And each filter consumes a bitset, for example, if age=18 is cached, it is a bitset,18.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 246

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report