An Introduction to Elasticsearch and an Analysis of Its Indexing Principles
I recently took part in designing a system that uses Elasticsearch as the underlying data framework to provide real-time statistical queries over a large amount of data (hundreds of millions of records). I spent some time learning Elasticsearch's basic theory and have organized my notes here, hoping they will help anyone who is interested in, or wants to learn about, Elasticsearch. If you find anything incorrect or questionable, please point it out so we can discuss it, learn, and improve together.
Introduction
Elasticsearch is a distributed, scalable, real-time search and analytics engine built on top of the full-text search library Apache Lucene (TM). Elasticsearch is much more than Lucene, however; beyond full-text search it can also:
Act as a distributed real-time document store in which every field is indexed and searchable.
Serve as a distributed search engine with real-time analytics.
Scale out to hundreds of servers and handle petabytes of structured or unstructured data.
Basic concept
First, let's talk about how Elasticsearch stores data. Elasticsearch is a document-oriented database: a single piece of data is a document, serialized as JSON. For example, here is a user document:
{"name": "John", "sex": "Male", "age": 25, "birthDate": "1990-05-01", "about": "I love to go rock climbing", "interests": ["sports", "music"]}
With a database such as MySQL, you would naturally create a User table with the corresponding fields. In Elasticsearch, this piece of data is a document, the document belongs to a type (here, a User type), and many types live inside an index. Here is a rough mapping between Elasticsearch terms and relational database terms:
Relational database ⇒ Databases ⇒ Tables ⇒ Rows ⇒ Columns
Elasticsearch ⇒ Indices ⇒ Types ⇒ Documents ⇒ Fields
An Elasticsearch cluster can contain multiple indices (databases), each index can contain multiple types (tables), each type contains many documents (rows), and each document contains many fields (columns). You can interact with Elasticsearch either through its Java API or directly over HTTP with its RESTful API. For example, to insert a record we can simply send an HTTP request:
```
PUT /megacorp/employee/1
{
    "name":      "John",
    "sex":       "Male",
    "age":       25,
    "about":     "I love to go rock climbing",
    "interests": ["sports", "music"]
}
```
Updates and queries work in a similar way; the details can be found in the Elasticsearch Definitive Guide.
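As a concrete sketch of what such requests look like from code, the snippet below uses Python's `requests` library against an assumed local node at `localhost:9200`; the type-based URL follows the example above, which matches older Elasticsearch versions (newer versions drop custom types):

```python
import requests

BASE = "http://localhost:9200"   # assumed local Elasticsearch node

doc = {
    "name": "John", "sex": "Male", "age": 25,
    "about": "I love to go rock climbing",
    "interests": ["sports", "music"],
}

# Index (insert) the document with id 1, mirroring the PUT request above.
requests.put(f"{BASE}/megacorp/employee/1", json=doc)

# Fetch it back by id.
print(requests.get(f"{BASE}/megacorp/employee/1").json())

# Full-text search on the "about" field.
query = {"query": {"match": {"about": "rock climbing"}}}
print(requests.post(f"{BASE}/megacorp/employee/_search", json=query).json())
```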
Indexes
The most important capability Elasticsearch provides is its powerful indexing. The InfoQ article "The Secret of Time Series Databases (2): Indexing" explains this very well; this section combines that article with my own understanding in the hope of making it easier to follow.
The essence of the Elasticsearch index:
Everything in it is designed to improve search performance.
The flip side is that improving search performance inevitably sacrifices something else, such as insert and update performance; otherwise no other database would survive. We saw earlier that inserting a record into Elasticsearch is just PUTting a JSON object with several fields, such as name, sex, age, about and interests in the example above. While inserting this data, Elasticsearch quietly builds an inverted index for each of these fields, because search is Elasticsearch's core function.
How does Elasticsearch achieve fast indexing?
The InfoQ article says Elasticsearch is faster than relational databases because it uses inverted indexes rather than B-Tree indexes. Why is that?
What is a B-Tree index?
Back in college, we learned that searching a binary tree takes O(log N) time, and inserting a new node does not require moving all existing nodes, so a tree structure gives good performance for both insertion and lookup. Building on this, and taking into account how disks are read (sequential reads vs. random reads), traditional relational databases use data structures such as B-Tree/B+Tree:
To improve query efficiency and reduce the number of disk seeks, multiple values are stored as an array within one contiguous block, so several values are read in a single disk access and the height of the tree is reduced.
What is an inverted index?
Continuing with the example above, suppose we have the following records (the about and interests fields are omitted for simplicity):
| ID | Name | Age | Sex    |
|----|------|-----|--------|
| 1  | Kate | 24  | Female |
| 2  | John | 24  | Male   |
| 3  | Bill | 29  | Male   |
ID is the document id generated by Elasticsearch. The indexes that Elasticsearch builds from this data look like this:
Name:
| Term | Posting List |
|------|--------------|
| Kate | 1            |
| John | 2            |
| Bill | 3            |
Age:
| Term | Posting List |
|------|--------------|
| 24   | [1,2]        |
| 29   | 3            |
Sex:
| Term   | Posting List |
|--------|--------------|
| Female | 1            |
| Male   | [2,3]        |
Posting List
Elasticsearch builds an inverted index for each field. Values such as Kate, John, 24 and Female are called terms, and [1,2] is a posting list: an array of ints that stores the ids of all documents containing a given term.
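As a toy illustration (plain Python dictionaries, nothing like Lucene's real on-disk format), the following builds per-field inverted indexes equivalent to the tables above:

```python
# Toy inverted index: field -> term -> sorted list of document ids (posting list).
from collections import defaultdict

docs = {
    1: {"name": "Kate", "age": 24, "sex": "Female"},
    2: {"name": "John", "age": 24, "sex": "Male"},
    3: {"name": "Bill", "age": 29, "sex": "Male"},
}

index = defaultdict(lambda: defaultdict(list))
for doc_id in sorted(docs):
    for field, value in docs[doc_id].items():
        index[field][value].append(doc_id)      # posting lists stay sorted by doc id

print(index["age"][24])      # [1, 2]  -> posting list for term 24 on field "age"
print(index["sex"]["Male"])  # [2, 3]
```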
But don't think that's the whole story; the interesting part is just beginning.
With posting lists it looks as if searches can already be answered quickly. For example, to find students with age = 24, the eager Xiao Ming immediately raises his hand: "I know, it's the documents with ids 1 and 2." But what if there are tens of millions of records? And what if we want to search by name?
Term Dictionary
To find a given term quickly, Elasticsearch sorts all the terms and then uses binary search, which takes O(log N) time, just like looking a word up in a dictionary. This sorted list is the Term Dictionary. At this point it looks much like what a traditional database does with a B-Tree, so why is it faster than a B-Tree query?
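A minimal sketch of that lookup, assuming the terms are simply kept in a sorted Python list:

```python
# Toy term dictionary: terms kept sorted so a term can be found by binary search
# in O(log N) comparisons, which is the lookup described above.
import bisect

terms = sorted(["Bill", "John", "Kate"])            # the sorted term dictionary
postings = {"Bill": [3], "John": [2], "Kate": [1]}  # term -> posting list

def lookup(term):
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[terms[i]]
    return []

print(lookup("John"))   # [2]
print(lookup("Alice"))  # []
```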
Term Index
A B-Tree improves query performance by reducing the number of disk seeks. Elasticsearch follows the same idea and tries to find terms directly in memory without touching the disk. But if there are too many terms, the term dictionary becomes too large to keep in memory, so there is a Term Index. Like the thumb index of a dictionary that tells you on which page words starting with "A" begin, the term index can be thought of as a tree:
This tree does not contain every term; it contains only some prefixes of terms. Through the term index you can quickly locate an offset in the term dictionary, and then scan sequentially from that position.
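Here is a toy sketch of the "locate a starting offset, then scan" idea, using a plain prefix-to-offset map instead of the real tree/FST structure (terms are made up):

```python
# Toy term index: map a prefix to the position in the sorted term dictionary
# where terms with that prefix start, then scan sequentially from there.
term_dictionary = ["apple", "apply", "banana", "band", "cat"]  # sorted terms

term_index = {}
for pos, term in enumerate(term_dictionary):
    term_index.setdefault(term[0], pos)   # first position of each starting letter

def find(term):
    start = term_index.get(term[0])
    if start is None:
        return -1
    for pos in range(start, len(term_dictionary)):
        if term_dictionary[pos] == term:
            return pos                     # offset into the term dictionary
        if term_dictionary[pos][0] != term[0]:
            break                          # left the prefix's block, give up
    return -1

print(find("band"))   # 3
print(find("dog"))    # -1
```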
In memory, the term index is actually stored as an FST (finite state transducer). An FST is a finite-state machine that maps a term (a byte sequence) to an arbitrary output. In an FST diagram:

⭕️ represents a state.
--> represents a state transition; the letters/numbers above it are the transition label and its weight.
A word is broken into individual letters represented by the ⭕️ states and --> transitions, and weights of 0 are not shown. Where a state branches, the weight is marked, and the weights along a complete path add up to the word's ordinal number.

Storing terms as bytes in an FST compresses them very effectively, which shrinks the term index enough to fit in memory, but the trade-off is that lookups consume more CPU.
The rest is even more exciting. Tired students can have a cup of coffee.
Compression technique
In addition to compressing the term index with an FST as described above, Elasticsearch also has tricks for compressing posting lists.
After his coffee, Xiao Ming raises his hand again: "Doesn't the posting list already store only document ids? Does it still need compression?"
Well, let's return to the original example. Suppose Elasticsearch needs to index students' gender (at which point the traditional relational database is already crying in the bathroom): with tens of millions of students and only two genders in the world, each posting list would contain millions of document ids. How does Elasticsearch compress these document ids effectively?
Frame Of Reference
Delta (incremental) encoding compression: turn large numbers into small ones and store them using only as many bytes as needed.
First of all, Elasticsearch requires that posting lists be ordered (a willful-sounding requirement, but one worth meeting to improve search performance). One benefit of ordering is that the list becomes easy to compress: instead of storing each document id in full, you can store the small differences between adjacent ids and pack them into fewer bytes.
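A minimal sketch of the delta-encoding idea on a made-up posting list (real Lucene additionally splits the list into fixed-size blocks and bit-packs each block):

```python
# Toy Frame Of Reference: a sorted posting list is stored as deltas, and a block
# only needs enough bits for its largest delta.
def delta_encode(posting_list):
    deltas, prev = [], 0
    for doc_id in posting_list:
        deltas.append(doc_id - prev)   # difference from the previous id
        prev = doc_id
    return deltas

posting_list = [73, 300, 302, 332, 343, 372]
deltas = delta_encode(posting_list)
print(deltas)                       # [73, 227, 2, 30, 11, 29]
print(max(deltas).bit_length())     # 8 bits per value instead of a full 32-bit int
```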
Posting lists can also be kept in memory as Roaring bitmaps (this is how Elasticsearch caches filter results): the list is split into blocks, with ids 0-65535 in the first block, 65536-131071 in the second, and so on, so each block only needs to store the low 16 bits of its ids. Seeing this, the careful Xiao Ming raises his hand again: "Why use 65535 as the boundary?"
In a programmer's world, besides 1024, 65535 is another classic value: it equals 2^16 - 1, the largest number that can be represented in two bytes, i.e. the storage unit of a short. Note the rule "If a block has more than 4096 values, encode as a bitset, and otherwise as a simple array using 2 bytes per value": a large block is stored as a bitset, while a small block doesn't mind spending 2 bytes per value and is conveniently stored as a short[].
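A toy sketch of that blocking rule (real Roaring bitmap containers are more involved, and the ids here are made up):

```python
# Toy Roaring-bitmap-style blocking: split each document id into a high 16-bit
# block key and a low 16-bit value, then pick the container per block:
# more than 4096 values -> bitset, otherwise a plain array of 2-byte values.
from collections import defaultdict

def roaring_blocks(posting_list, threshold=4096):
    blocks = defaultdict(list)
    for doc_id in sorted(posting_list):
        blocks[doc_id >> 16].append(doc_id & 0xFFFF)   # high bits pick the block
    containers = {}
    for key, values in blocks.items():
        if len(values) > threshold:
            bits = 0
            for v in values:
                bits |= 1 << v                         # dense block: bitset
            containers[key] = ("bitset", bits)
        else:
            containers[key] = ("array", values)        # sparse block: short[]
    return containers

print(roaring_blocks([1, 2, 70000, 70001]))
# {0: ('array', [1, 2]), 1: ('array', [4464, 4465])}
```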
So why use 4096 to distinguish large chunks from small ones?
Personal understanding: it is said that in the programmer's binary world, 4096 * 2 bytes = 8192 bytes, small enough that a single disk seek followed by one sequential read can fetch the whole small block; a larger block would need additional reads.

Joint index

Everything above is about single-field indexes. When a query combines conditions on several indexed fields, how does the inverted index still answer it quickly?

Use the skip list data structure to compute the intersection ("AND") quickly, or
Use the bitsets mentioned above and AND them bit by bit.

With skip lists, for each id in the shortest posting list you check whether it also exists in the other posting lists, and the ids found in all of them form the intersection.
If bitsets are used, it is very intuitive: simply AND them bitwise, and the result is the final intersection.
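A toy sketch of both approaches on the gender/age example (a plain merge walk stands in for the skip list, whose real benefit is being able to jump forward instead of advancing one id at a time):

```python
# Two toy ways to intersect posting lists for a multi-field query.

# 1) Walk the sorted lists together; a skip list lets the "advance to at least
#    this doc id" step jump ahead instead of scanning one id at a time.
def intersect_sorted(a, b):
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# 2) Represent each posting list as a bitset and AND them.
def to_bitset(posting_list):
    bits = 0
    for doc_id in posting_list:
        bits |= 1 << doc_id
    return bits

age_24 = [1, 2]
male = [2, 3]
print(intersect_sorted(age_24, male))            # [2]
print(bin(to_bitset(age_24) & to_bitset(male)))  # 0b100 -> doc id 2
```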
Summary and thinking
Elasticsearch's indexing approach:
Keep as much of the disk's data in memory as possible, minimize random disk reads (while still exploiting the disk's sequential-read characteristics), and combine a variety of clever compression algorithms to use memory with extreme frugality.
Therefore, keep the following in mind when using Elasticsearch for indexing:
Declare explicitly any field that does not need to be indexed, because by default every field is indexed automatically.
Likewise, declare explicitly any string field that does not need analysis, because by default string fields are analyzed as well (see the mapping sketch at the end of this section).
Choosing regular ids matters: highly random ids (such as Java's UUIDs) are bad for queries.
On the last point, I think several factors are involved:
One factor (perhaps not the most important): the compression algorithms described above compress large numbers of ids in posting lists, so ids that are sequential, share a common prefix, or are otherwise regular compress better.
Another factor, probably the one that affects query performance most: the final step of fetching document data from disk using the ids in the posting list. Elasticsearch stores data in segments, and how efficiently a segment can be located from an id directly affects the final query performance. If ids are regular, segments that cannot contain a given id can be skipped quickly, reducing unnecessary disk reads. For more information, refer to the article on how to choose an efficient global id scheme (the comments there are also excellent).
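To make the first two caveats concrete, here is a minimal sketch of creating an index with an explicit mapping, again with Python's `requests` against an assumed local Elasticsearch 7.x node (the index and field names are invented for illustration, and mapping syntax varies across versions):

```python
import requests

BASE = "http://localhost:9200"   # assumed local node, as before

mapping = {
    "mappings": {
        "properties": {
            # Stored and returned with the document, but not indexed or searchable.
            "raw_payload": {"type": "text", "index": False},
            # Indexed as a single exact value, i.e. not analyzed into tokens.
            "status_code": {"type": "keyword"},
            # Regular full-text field: analyzed and indexed (the default behaviour).
            "about": {"type": "text"},
        }
    }
}

# Create the index with this mapping instead of relying on dynamic mapping.
print(requests.put(f"{BASE}/my-index", json=mapping).json())
```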