
A summary of entry-level Elasticsearch knowledge points


This article is a summary of entry-level Elasticsearch knowledge points. Many people have questions when they first start with Elasticsearch, so I have gone through various materials and organized them into simple, easy-to-follow explanations. I hope it helps clear up your doubts; follow along and let's learn together!


Let me introduce you to some of the current mainstream database storage methods:

Row storage: data in the same row is physically stored together

Common row-oriented database systems are MySQL, PostgreSQL, and MS SQL Server.

Storage structure: (diagram not reproduced here)

Query efficiency of a row-storage database in certain scenarios: (diagram not reproduced here)

Column storage: values from different columns are stored separately, and data from the same column is stored together

Common columnar databases include Vertica, Paraccel (Actian Matrix, Amazon Redshift), Sybase IQ, Exasol, Infobright, InfiniDB, MonetDB (VectorWise, Actian Vector), LucidDB, SAP HANA, Google Dremel, Google PowerDrill, Druid, kdb+, HBase, and ClickHouse.

Query efficiency of a column-storage database in certain scenarios: (diagram not reproduced here)
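To make the difference concrete, here is a minimal Python sketch of the two layouts; the table, field names, and values are invented purely for illustration.

```python
# Toy illustration of row storage vs. column storage (made-up data).
rows = [
    {"id": 1, "name": "aobing", "size": 24},
    {"id": 2, "name": "egg", "size": 30},
]

# Row storage: all fields of one record sit next to each other.
row_layout = [tuple(r.values()) for r in rows]
# -> [(1, 'aobing', 24), (2, 'egg', 30)]

# Column storage: all values of one column sit next to each other.
column_layout = {field: [r[field] for r in rows] for field in rows[0]}
# -> {'id': [1, 2], 'name': ['aobing', 'egg'], 'size': [24, 30]}

# An analytical query such as "average size" only needs to touch one column.
print(sum(column_layout["size"]) / len(column_layout["size"]))
```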

Recently I have been working with ClickHouse, which is a column-store database. The main reasons it is so fast come down to the following three points:

input/output

For analytical queries, you usually need to read only a small number of columns of a table. In a columnar database you can read only the data you need. For example, if you only need to read 5 out of 100 columns, this will help you reduce I/O consumption by at least 20 times.

Since data is read in large blocks, it is very easy to compress, and data stored by column compresses even better. This further reduces the I/O volume.

Because of the lower I/O, more data can be held in the system cache.

Note: these two storage models are compared here to show their efficiency in particular scenarios, and also to pave the way for the later discussion of ES's speed and data structures. In practice, databases such as ClickHouse only suit certain scenarios; most workloads still call for a row-oriented database.

If you are interested, I may write a separate post about ClickHouse later (although I am still studying it myself).

Next, let's talk about another storage structure:

document

In fact, ES is somewhat similar to a columnar document store, as you will see later.

{ "name": "name" "size": 24 "sex': "male" }

I introduced these common storage structures above mainly to illustrate the scenarios ES is suited to, as well as some of its advantages. We all know that databases use indexes and are very fast because of them; so how does ES store its data, and what does its index look like?

inverted index

Inverted, as the name implies, means finding the key through the value, which is the opposite of the traditional lookup of a value by its key.

For example, using the data above, ES creates inverted indexes like the following (the example tables are not reproduced here):

Name inverted index

Size inverted index

Sex inverted index

As you can see, every inverted index is built on the concepts of Term and Posting List. A posting list is an array of ints storing the ids of all documents that contain a given term.

So how do we find the key from the value? For example, if I want to find everyone whose sex is male, the posting list of the Sex inverted index tells me the matching documents are ids 1 and 3; then, through the terms of the Name index, I can see that id 1 is aobing, id 3 is egg, and so on.
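Here is a toy Python sketch of that lookup. The field names follow the example above, the document with id 2 is invented to round out the data, and real Lucene segments are of course far more sophisticated than a nested dict.

```python
# Toy inverted index: field -> term -> posting list (sorted doc ids).
docs = {
    1: {"name": "aobing", "sex": "male"},
    2: {"name": "tom", "sex": "female"},   # made-up extra document
    3: {"name": "egg", "sex": "male"},
}

inverted = {}
for doc_id, doc in docs.items():
    for field, term in doc.items():
        inverted.setdefault(field, {}).setdefault(term, []).append(doc_id)

# "Find everyone whose sex is male": the posting list answers it directly...
male_ids = inverted["sex"]["male"]            # [1, 3]
# ...and each id can then be resolved back to the full document.
print([docs[i]["name"] for i in male_ids])    # ['aobing', 'egg']
```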

ES queries are famously fast, but so far it looks as though looking terms up one by one would not be fast at all. So what is going on?

This leads to the next two concepts, Term Dictionary and Term Index.

Term Dictionary: this one is easy to understand. As I said above, the index is made up of many Terms; to make any Term easy to find, ES keeps all Terms sorted so they can be located with binary search.

Term Index: this exists to optimize the Term Dictionary. Think about it: with so many Terms, sorting alone is not enough. To be fast they would have to sit in memory, but ES data volumes are often huge, so what about keeping them on disk? Disk seeks are slow, so how do we reduce the seek overhead on disk? That is what the Term Index is for.

It works just like the Xinhua Dictionary: an index page tells you where the entries for each initial start, and within that the entries are sorted by pinyin.

This is the relationship between the three. There is a very classic diagram of it (not reproduced here) that basically everyone who studies ES has seen.

The Term Index stores prefixes and their mappings into the dictionary, which greatly reduces the number of random reads on disk.
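A very rough Python sketch of the idea, using a plain first-letter prefix map instead of the FST that Lucene actually uses: the small in-memory Term Index narrows the search range, and binary search over the sorted Term Dictionary does the rest.

```python
import bisect

# Term Dictionary: the full, sorted list of terms (conceptually on disk).
term_dictionary = sorted(["aobing", "apple", "banana", "egg", "elastic", "male"])

# Term Index: tiny in-memory map from a prefix (here: first letter)
# to the position where terms with that prefix start.
term_index = {}
for pos, term in enumerate(term_dictionary):
    term_index.setdefault(term[0], pos)

def lookup(term):
    start = term_index.get(term[0], 0)                          # jump near the right block
    pos = bisect.bisect_left(term_dictionary, term, lo=start)   # then binary search
    if pos < len(term_dictionary) and term_dictionary[pos] == term:
        return pos
    return -1

print(lookup("egg"), lookup("banana"), lookup("missing"))
```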

Smart compression

Isn't this design clever? ES retrieval is much faster than MySQL: if you look at how MySQL uses its index, it plays much the same role as the Term Dictionary, but ES adds the extra Term Index layer of filtering on top, so it needs fewer random reads.

Another point I want to mention is how the Term Index is stored on disk. I covered this in an earlier article, having stepped into its pitfalls at the time; given the length of this post, I will only introduce it briefly here.

FST can be understood as a compression technique, the simplest way to compress the terms byte by byte. I said above that the Term Index cannot simply be loaded into memory as-is, but what if we compress it?

I will not expand on the details here; the paper linked below explains them very thoroughly. Since this is a general introduction, I will cover the details of the cluster and its compression in a later article.

Link: cs.nyu.edu/~mohri/pub/fla.pdf

Here are some of the concepts that I think are important in ES:

Near Real Time (NRT)

ES writes data into a memory buffer first (data still in the buffer cannot be searched), and by default the buffer is refreshed into the OS cache every second.

In the operating system, disk files have something called the OS cache, the operating-system cache: before data is written to a disk file, it first enters the OS cache, a memory cache at the operating-system level.

Once the data in the buffer has been refreshed into the OS cache, it can be searched. The default is to refresh every 1 second, which is why ES is near real time: newly written data takes up to 1 second to become visible.

Why is it designed like this?

Let's see what would happen if we didn't:

If every write went straight to the hard disk, it would be very resource-intensive, and with reads hitting the disk right after writes, concurrency would be very hard to scale. Imagine tens of thousands of writes per second (QPS) while queries also have to go to disk: a disaster-level scenario.

So how does ES do it?

Data is written to the buffer and then refreshed to the OS cache every second, at which point it can be searched. That one-second gap is why ES is called near real time rather than real time. This design reduces disk pressure, writing and querying do not interfere with each other, and concurrency can scale up.
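Below is a hedged sketch of that behaviour through the REST API with Python's requests library: a document written just now may not appear in a search until the next refresh, which you can either wait for or force with the _refresh endpoint. The host and index name are assumptions.

```python
import requests

base = "http://localhost:9200/people"   # placeholder host and index

# Write a document; it first lands in the in-memory buffer.
requests.put(f"{base}/_doc/2", json={"name": "egg", "sex": "male"})

query = {"query": {"match": {"name": "egg"}}}

# Searching immediately may miss it: the buffer has not been refreshed yet.
print(requests.get(f"{base}/_search", json=query).json()["hits"]["total"])

# Force a refresh (normally you just wait up to 1 second; the interval is
# controlled by the index setting "refresh_interval").
requests.post(f"{base}/_refresh")

print(requests.get(f"{base}/_search", json=query).json()["hits"]["total"])
```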

Word segmentation

Text analysis is the process of converting full text into a series of words (terms/tokens); it is also called word segmentation.

When a document is indexed, an inverted index entry may be created for each Term. The inverted-indexing process is: the Analyzer splits the document into Terms, and each Term points to the set of documents that contain it.

Word segmentation is a core function of ES, but the default analyzer is not very friendly to Chinese. For example, if I search for 中国 (China), it may match 中 ("middle") and 国 ("country") separately, but 中国 is one word and should not be split like that.

Machine-learning algorithms can now be used for word segmentation, and there are Chinese word-segmentation plug-ins such as the ik analyzer.

The built-in analyzer is easier to use for English text.
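A small sketch of how to see this with the _analyze API: the built-in standard analyzer breaks 中国 into single characters, while ik_smart (only available if the ik plugin is installed) keeps it as one token. The host is an assumption.

```python
import requests

def analyze(analyzer, text):
    resp = requests.post(
        "http://localhost:9200/_analyze",          # placeholder host
        json={"analyzer": analyzer, "text": text},
    )
    return [t["token"] for t in resp.json()["tokens"]]

print(analyze("standard", "中国"))   # ['中', '国']  - split character by character
print(analyze("ik_smart", "中国"))   # ['中国']      - needs the ik plugin installed
```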

split brain

The split-brain problem can occur in any clustered deployment. Suppose an ES cluster has two nodes: node 1 is the master node serving requests, and node 2 holds the replica shards.

Now the two nodes lose contact because of a network problem. What happens? The master node still sees itself as master and keeps serving requests. The replica node finds that there is no master, elects itself master, and also starts serving requests; since the real master is unreachable, it is more or less forced into the role (doge).

From the caller's point of view it is hard to notice the difference unless you compare the data. I have run into split brain in production before; it showed up as user feedback that searching for a keyword would sometimes find a product and sometimes not, because the requests landed on different nodes.

So, under normal circumstances, how do we prevent this? elasticsearch.yml has a setting, discovery.zen.minimum_master_nodes, which determines how many nodes must be able to communicate during master election. The default is 1, and the rule of thumb is to set it to (number of cluster nodes / 2) + 1.

If your cluster has three nodes, the parameter is set to 3/2 + 1 = 2. If one node goes down, the other two can still communicate, so a master can be elected. With the same three-node cluster and the parameter still at 2, a node that is cut off and can only see itself will not be able to elect a master.

There is a drawback, though: with only two nodes, losing one means the service is effectively unavailable, so it is best to run a cluster of at least three nodes.
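For a three-node cluster, the relevant elasticsearch.yml setting would look roughly like this (it applies to versions before 7.0; from 7.0 onwards Elasticsearch manages the election quorum automatically):

```yaml
# quorum = master-eligible nodes / 2 + 1 = 3 / 2 + 1 = 2
discovery.zen.minimum_master_nodes: 2
```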

Elasticsearch's master election is based on the Bully election algorithm. Simply put, in the Bully algorithm each node has a number, and only the surviving node with the largest number can become the master. The algorithm works as follows:

When any process P finds that the master is not responding to the request, it initiates an election. The election process is as follows:

(1) Process P sends an election message to every process with a higher number than its own;

(2) If no one responds, P wins the election and becomes master;

(3) If a process with a higher number responds, that process takes over the election and P's part is done.

At any moment, a process can only receive an election message from a process with a lower number than its own. When such a message arrives, the receiver sends an OK message back to tell the sender that it is alive and is taking over the election.

Eventually all processes but one give up; that process becomes the new coordinator and sends a victory message to every other process announcing that a new coordinator has been chosen.
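The core rule is simple enough to sketch in a few lines of Python; the node ids and the set of surviving nodes below are made up for illustration.

```python
def bully_election(node_ids, alive):
    """Return the new master: the highest-numbered node that is still alive."""
    candidates = [n for n in node_ids if n in alive]
    return max(candidates) if candidates else None

nodes = [1, 2, 3, 4, 5]
print(bully_election(nodes, alive={1, 2, 3}))        # 4 and 5 are down -> 3 wins
print(bully_election(nodes, alive={1, 2, 3, 4, 5}))  # everyone is up   -> 5 wins
```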

ELK

In fact, ES is usually mentioned together with its two ELK siblings, so to wrap up I will briefly talk about the other two.

L is Logstash, an open source data collection engine with real-time pipeline capabilities. Logstash can dynamically unify data from different data sources and standardize data to destinations of your choice.

Logstash pipelines have two required elements: input and output, and one optional element: filters. The input plug-in consumes data from the data source, the filter plug-in modifies the data according to your expectations, and the output plug-in writes the data to the destination.
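A minimal pipeline configuration might look like the sketch below; the file path, grok pattern, index name, and Elasticsearch address are all assumptions for illustration.

```
input {
  file { path => "/var/log/app/*.log" }                             # input plug-in
}
filter {
  grok { match => { "message" => "%{COMBINEDAPACHELOG}" } }         # optional filter
}
output {
  elasticsearch { hosts => ["localhost:9200"] index => "app-logs" } # output plug-in
}
```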

Kibana is an open-source analytics and visualization platform for Elasticsearch, designed to interactively search and view data stored in Elasticsearch indices. Kibana enables advanced data analysis and presentation through a variety of charts.

Kibana makes massive amounts of data easier to understand. It is simple to operate, and the browser-based user interface can quickly create dashboards to display Elasticsearch query dynamics in real time.

Setting up Kibana is simple: no coding or additional infrastructure is required, and you can go from installing Kibana to monitoring an Elasticsearch index within minutes.

That concludes this summary of entry-level Elasticsearch knowledge points, and I hope it has cleared up some of your doubts. Theory sticks best when paired with practice, so go and try it out! If you want to keep learning, stay tuned for more practical articles.
