What are the basic principles of Elasticsearch?


This article introduces the basic principles of Elasticsearch. These are questions that come up constantly in real-world work, so I hope you read it carefully and come away with something useful!

A search engine retrieves data, so let's start with the data around us. Data generally falls into two categories:

Structured data

Unstructured data

Structured data: also called row data, it is data expressed and stored in a logical two-dimensional table structure, strictly following format and length specifications, and is mainly stored and managed by relational databases. It refers to data with a fixed format or limited length, such as database records and metadata.

Unstructured data: also called full-text data, it has a variable length or no fixed format and is not well suited to a two-dimensional database table. It includes office documents of all formats, XML, HTML, Word documents, e-mail, all kinds of reports, pictures, audio, video, and so on.

Note: for a finer distinction, XML and HTML can be classified as semi-structured data. Because they have their own specific tag formats, they can be processed as structured data when needed, or their plain text can be extracted and treated as unstructured data.

Matching these two kinds of data, search is also divided into two kinds:

Structured data search

Unstructured data search

Structured data has a specific structure, so we can generally store and search it in the two-dimensional tables (Table) of a relational database (MySQL, Oracle, etc.), and we can also build indexes on it.

There are two main ways to search unstructured data, that is, full-text data:

Sequential scanning

Full-text retrieval

Sequential scanning: as the name suggests, it searches for specific keywords by scanning through the text from start to finish.

For example, given a newspaper, find every place where the word "peace" appears. You have no choice but to scan the newspaper from beginning to end and mark each spot where the keyword shows up.

This method is undoubtedly the most time-consuming and least efficient. If the newspaper's typeface is small and there are many pages, or even many newspapers, your eyes will be just about done for by the time you finish scanning.

Full-text search: sequential scanning of unstructured data is slow. Can we optimize it? Can't we find a way to give our unstructured data some structure?

We extract part of the information from the unstructured data and reorganize it so that it has a certain structure, and then search this structured data, thereby achieving relatively fast search.

This approach constitutes the basic idea of full-text retrieval. The part of the information that is extracted from the unstructured data and then reorganized is called the index.

The main workload of this approach is the creation of the index in the early stage, but it is fast and efficient for the later search.

Let's talk about Lucene first.

After this brief look at the kinds of data in our lives, we know that SQL retrieval in a relational database cannot handle such unstructured data.

Processing this kind of unstructured data relies on full-text retrieval, and the best open-source full-text retrieval engine toolkit on the market is Apache Lucene.

But Lucene is only a toolkit; it is not a complete full-text search engine. The purpose of Lucene is to provide software developers with a simple, easy-to-use toolkit to conveniently implement full-text retrieval in a target system, or to build a complete full-text retrieval engine on top of it.

At present, the main open-source full-text search engines based on Lucene are Solr and Elasticsearch.

Solr and Elasticsearch are both mature full-text search engines, and their functionality and performance are basically the same.

But ES is distributed and easy to install and use out of the box, whereas Solr's distribution relies on third parties, for example using ZooKeeper for distributed coordination and management.

Both Solr and Elasticsearch depend on Lucene, and Lucene can implement full-text search mainly because it implements the inverted index query structure.

How should we understand the inverted index? Suppose there are three documents with the following contents:

Java is the best programming language.

PHP is the best programming language.

Javascript is the best programming language.

To create an inverted index, we first split the content field of each document into separate words (which we call terms, or Term) with a tokenizer, create a sorted list of all the unique terms, and then list the documents in which each term appears.

The results are as follows:

Term        | Doc_1 | Doc_2 | Doc_3
------------|-------|-------|------
Java        |   X   |       |
is          |   X   |   X   |   X
the         |   X   |   X   |   X
best        |   X   |   X   |   X
programming |   X   |   X   |   X
language    |   X   |   X   |   X
PHP         |       |   X   |
Javascript  |       |       |   X

This structure consists of a list of all the unique words in the documents, and each word is associated with a list of the documents in which it appears.

A structure in which the attribute value is used to locate the records is called an inverted index. The files that store the inverted index are called inverted files.

Expressed as a diagram, the structure of the inverted index looks like the following figure:

There are mainly the following core terms to understand:

Term: the smallest unit of storage and query in an index. For English it is a word; for Chinese it usually refers to a word obtained after word segmentation.

Term Dictionary: also called the dictionary, it is the collection of terms. The usual index unit of a search engine is a word, and the term dictionary is the set of strings formed by all the words that appear in the document collection. Each entry in the term dictionary records some information about the word itself and a pointer to its "inverted list".

Posting List (inverted list): a document usually consists of many words, and the inverted list records in which documents a word appears and at which positions.

Each record is called a posting. The inverted list records not only the document IDs but also information such as term frequency.

Inverted File: the inverted lists of all words are usually stored sequentially in a file on disk. This file is called the inverted file, and it is the physical file that stores the inverted index.

From the figure above, we can see that the inverted index consists of two main parts:

The term dictionary

The inverted file

The term dictionary and the inverted list are two important data structures in Lucene and the cornerstone of fast retrieval. They are stored separately: the term dictionary resides in memory, while the inverted file is stored on disk.

Core concepts of ES

With that groundwork laid, we can now formally introduce today's protagonist: Elasticsearch.

ES is an open-source search engine written in Java. It uses Lucene internally for indexing and searching. By wrapping Lucene, it hides Lucene's complexity and instead exposes a simple, consistent set of RESTful APIs.

However, Elasticsearch is more than just Lucene, and more than just a full-text search engine.

It can be accurately described as follows:

A distributed real-time document storage where each field can be indexed and searched.

A distributed real-time analysis search engine.

Capable of scaling out to hundreds of service nodes and supporting PB-level structured or unstructured data.

The official website introduces Elasticsearch as a distributed, scalable, near-real-time search and data analysis engine.

Let's look at some core concepts to see how Elasticsearch can be distributed, scalable, and searched in near real time.

Cluster (Cluster)

Building an ES cluster is very simple: it does not depend on third-party coordination components, and cluster management is implemented internally.

An ES cluster consists of one or more Elasticsearch nodes. A node joins a cluster by being configured with the same cluster.name, whose default value is "elasticsearch".

Make sure that different cluster names are used in different environments, otherwise nodes will end up joining the wrong cluster.

A running Elasticsearch instance is a node (Node). A node's name is set through node.name; if it is not set, a random universally unique identifier is assigned as the name at startup.
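As a minimal sketch, with illustrative names, the two settings above live in config/elasticsearch.yml:

cluster.name: my-es-cluster   # nodes that share this cluster.name join the same cluster
node.name: node-1             # optional; a random identifier is generated if omitted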

① Discovery mechanism

This raises a question: how does ES connect different nodes that share the same cluster.name setting into the same cluster? The answer is Zen Discovery.

Zen Discovery is the built-in default discovery module for Elasticsearch (the discovery module is responsible for discovering nodes in the cluster and electing Master nodes).

It provides unicast and file-based discovery and can be extended to support cloud environments and other forms of discovery through plug-ins.

Zen Discovery is integrated with other modules; for example, all communication between nodes goes through the Transport module. A node uses the discovery mechanism to find other nodes by pinging them.

Elasticsearch is configured by default to use unicast discovery to prevent nodes from inadvertently joining a cluster. Only nodes running on the same machine automatically form a cluster.

If the nodes in the cluster are running on different machines and using unicast, you can provide Elasticsearch with a list of nodes it should try to connect to.

When a node contacts a member of the unicast list, it obtains the status of all the nodes in the cluster, then contacts the Master node and joins the cluster.

This means the unicast list does not need to include all the nodes in the cluster; it only needs enough nodes that a new node can reach one of them and talk to it.

If you use the master-candidate nodes as the unicast list, you only need to list those three nodes. This configuration goes in the elasticsearch.yml file:

discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]

After a node starts, it pings first. If discovery.zen.ping.unicast.hosts is set, it pings the hosts in that setting; otherwise it tries to ping several ports on localhost.

Elasticsearch supports running multiple nodes on the same host. The ping response contains each node's basic information and the node it currently considers to be the Master.

The election starts from the Masters recognized by the individual nodes. The rule is simple: sort the candidates by node ID in dictionary order and take the first one. If no node has a Master it recognizes, elect from all nodes using the same rule.

The constraint here is discovery.zen.minimum_master_nodes: if the number of nodes does not reach this minimum, the above process loops until there are enough nodes to start an election.

The election always produces a Master; if there is only one local node, it elects itself.

If the current node is the Master, it waits until the number of nodes reaches discovery.zen.minimum_master_nodes before providing services.

If the current node is not the Master, it tries to join the Master. Elasticsearch calls this whole process of service discovery and election Zen Discovery.

Because it supports clusters of any size (1 to N), it cannot restrict the number of nodes to be odd as ZooKeeper does, so it cannot use a majority-vote mechanism to pick the master; instead it adopts a rule.

As long as all nodes follow the same rule and the information they see is consistent, the master they select must also be consistent.

The trouble with distributed systems, however, is that information can be asymmetric, which easily leads to the split-brain problem.

Most solutions simply set a quorum value and require that the number of available nodes be greater than the quorum (usually more than half of the nodes) before service can be provided.

In Elasticsearch, the configuration of this Quorum is discovery.zen.minimum_master_nodes.

② Node roles

Each node can be either a candidate master node or a data node. This is set in the configuration file ../config/elasticsearch.yml; both settings default to true.

node.master: true   # whether the node is a candidate master node
node.data: true     # whether the node is a data node

Data nodes are responsible for storing data and for the related operations, such as adding, deleting, updating, querying, and aggregating data, so data nodes (Data nodes) have high machine requirements and consume a lot of CPU, memory, and I/O.

Usually as the cluster expands, more data nodes need to be added to improve performance and availability.

A candidate master node can be elected as the master node (Master node). In the cluster, only candidate master nodes have the right to vote and to be elected; other nodes do not participate in the election.

The master node is responsible for creating and deleting indexes, tracking which nodes are part of the cluster, deciding which shards are allocated to which nodes, tracking the status of the nodes in the cluster, and so on. A stable master node is very important to the health of the cluster.

A node can be both a candidate master node and a data node, but data-node work consumes a lot of CPU and memory.

So if a node is both a data node and the master node, the data workload may affect the master role and therefore the state of the entire cluster.

Therefore, to keep the cluster healthy, we should separate and isolate the node roles in an Elasticsearch cluster. A few machines with lower specifications can be used as the group of candidate master nodes.

The master node and the other nodes check on each other via ping: the master pings all other nodes to determine whether any node has died, and the other nodes ping the master to determine whether it is still available.

Although node roles are distinguished, a user request can be sent to any node; that node takes care of distributing the request, collecting the results, and so on, without the master node having to forward anything.

Such a node can be called a coordinating node. Coordinating nodes do not need to be specified or configured; any node in the cluster can act as one.

③ Split-brain

If, due to the network or other reasons, multiple Master nodes are elected in the cluster and data updates become inconsistent, the phenomenon is called split-brain: different nodes in the cluster disagree on the choice of Master, resulting in multiple competing Masters.

The split-brain problem may be caused by the following reasons:

Network problems: network delays within the cluster mean that some nodes cannot reach the Master and assume it has died, so they elect a new Master, mark the shards and replicas on the old Master as red, and allocate new primary shards.

Node load: the master node acts as both Master and Data node. Under heavy traffic it may stop responding (a "false death") and cause widespread delays. When the other nodes get no response from the master node, they assume it has died and re-elect a master.

Memory reclamation: the master node acts as both Master and Data node. When the ES process on the data node occupies a large amount of memory, it triggers large-scale JVM garbage collection, which causes the ES process to stop responding.

To avoid split-brain, we can address these causes with the following optimization measures:

Appropriately increase the response timeout to reduce misjudgments. The parameter discovery.zen.ping_timeout sets the node-status response timeout; the default is 3s and it can be increased appropriately.

If the Master does not respond within that time, the node judges it to have died. Adjusting the parameter (for example discovery.zen.ping_timeout: 6, i.e. 6s) can appropriately reduce misjudgments.

Trigger elections properly. We need to set discovery.zen.minimum_master_nodes in the configuration file of the cluster's candidate master nodes.

This parameter indicates how many candidate master nodes must participate in electing a master node. The default value is 1, and the officially recommended value is (master_eligible_nodes / 2) + 1, where master_eligible_nodes is the number of candidate master nodes.

This not only prevents split-brain but also maximizes the cluster's availability, because as long as at least discovery.zen.minimum_master_nodes candidate nodes survive, an election can proceed normally.

When fewer nodes than this survive, no election can be triggered and the cluster cannot be used, but it also will not end up with chaotic shard allocation.

Role separation. That is, separating candidate master nodes from data nodes as described above reduces the burden on the master node, prevents false death of the master node, and reduces the chance of wrongly judging that the master has "died". A configuration sketch combining these measures follows below.
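A minimal elasticsearch.yml sketch for a dedicated master-candidate node, combining the measures above (the values are illustrative and should be adapted to your cluster):

discovery.zen.ping_timeout: 6s           # raised from the 3s default to tolerate slow responses
discovery.zen.minimum_master_nodes: 2    # (master_eligible_nodes / 2) + 1, e.g. 2 for 3 candidates
node.master: true                        # this node is a candidate master node
node.data: false                         # role separation: it stores no data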

Sharding (Shards)

ES supports PB-level full-text search. When the amount of data in a single index is too large, ES splits the data in the index horizontally into different blocks; each such block is called a shard.

This is similar to MySQL's splitting of databases and tables, except that MySQL needs third-party components for it while ES implements the functionality itself.

When writing data into a multi-shard index, routing determines which shard is written to, so the number of shards must be specified when the index is created, and once the number of shards is determined it cannot be modified.

The number of shards and the number of replicas described below can be configured through Settings when the index is created. By default, ES creates 5 primary shards for an index and one replica for each primary shard.

PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}

Sharding is how ES scales the capacity and performance of an index. Each shard is a Lucene index file in its own right, and each primary shard can have zero or more replicas.

Replicas (Replicas)

A replica is a copy of a shard. Each primary shard has one or more replica shards. When the primary shard fails, a replica can take over operations such as data queries.

A primary shard and its replica shards are never placed on the same node, so the maximum number of replica shards is N - 1 (where N is the number of nodes).

Requests that create, index, or delete documents are all write operations. A write must complete on the primary shard before it can be copied to the relevant replica shards.

To improve write performance, ES replicates to the replicas concurrently, and to resolve data conflicts during concurrent writes, ES uses optimistic locking: each document has a _version number that is incremented whenever the document is modified.
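As a hedged sketch of how the _version number enables optimistic concurrency control (the index name and document ID are illustrative; in ES 6.x the version query parameter performs this check, while newer versions use if_seq_no / if_primary_term instead):

PUT /my_index/_doc/1?version=2
{
  "title": "updated title"
}

If the document's current _version is no longer 2, ES rejects the write with a version conflict instead of silently overwriting the newer data.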

Once all the replica shards report success, success is reported to the coordinating node, and the coordinating node reports success to the client.

As you can see from the figure above, to achieve high availability the Master node avoids placing a primary shard and its replicas on the same node.

Suppose node Node1 goes down or its network becomes unavailable; then primary shard S0 on that node is unavailable too.

Fortunately the other two nodes are still working properly. ES re-elects a new master node, and all the data of S0 that we need exists on those two nodes.

We promote a replica of S0 to primary shard, and this promotion happens instantly. At this point the cluster status will be Yellow.

Why is the cluster status Yellow instead of Green? Although we have both primary shards, we also configured each primary shard to have two replica shards, and right now each has only one. That is why the cluster cannot be Green.

If we also shut down Node2, our program can still run without losing any data, because Node3 keeps a copy of every shard.

If we restart Node1, the cluster can redistribute the missing replica shards, and the cluster state will return to normal.

If Node1 still has its previous shards, it will try to reuse them; the shards on Node1 are now replica shards rather than primary shards. If any data changed in the meantime, only the changed data files need to be copied over from the primary shard.

Summary:

Sharding the data increases the volume of data that can be handled and makes horizontal scaling easy; replicating the shards improves the stability of the cluster and increases concurrency.

Replication is multiplication: the more replicas, the more resources are consumed, but the safer the data. Sharding is division: the more shards, the less data each single shard holds and the more scattered it is.

The more replicas, the higher the availability of the cluster, but since every replica shard is also a Lucene index file, it consumes file handles, memory, and CPU.

Data synchronization between shards also consumes network bandwidth, so more shards and replicas are not always better.

Mapping (Mapping)

Mapping defines how ES stores and tokenizes the fields in an index and whether they are stored, much like a Schema in a database: it describes the fields a document may have and the data type of each field.

However, a relational database requires field types to be specified when the table is created, whereas ES can either leave types unspecified and guess them dynamically, or have them specified when the index is created.

A mapping that identifies field types automatically from the data format is called a dynamic mapping (Dynamic Mapping). A mapping in which we define the field types when creating the index is called a static mapping or explicit mapping (Explicit Mapping).

Before discussing dynamic and static mappings, let's look at the field data types in ES, and then explain why an index should be created with a static mapping rather than relying on dynamic mapping.

The main field data types in ES (v6.8) are as follows:

Text: used to index full-text values, such as an e-mail body or a product description. These fields are analyzed: they are passed through an analyzer that converts the string into a list of individual terms before indexing.

The analysis process allows Elasticsearch to search for individual words within each full-text field. Text fields are not used for sorting and are rarely used for aggregations.

Keyword: used to index structured content, such as e-mail addresses, hostnames, status codes, zip codes, or tags. Keyword fields are commonly used for filtering, sorting, and aggregations, and can only be searched by their exact value.

From this understanding of field types, we know that some fields must be defined explicitly. For example, whether a field is Text or Keyword makes a big difference, a date field may need its format specified, some fields need a specific analyzer, and so on.

Dynamic mapping cannot get all of this right; automatic recognition often differs from what we expect.

Therefore, when creating an index, the complete form should specify the number of shards and replicas as well as the Mapping definition, as follows:

PUT my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title":   { "type": "text" },
        "name":    { "type": "text" },
        "age":     { "type": "integer" },
        "created": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}

Basic use of ES

The first thing to consider when deciding to use Elasticsearch is the version. Excluding 0.x and 1.x, Elasticsearch currently has the following commonly used, stable major versions: 2.x, 5.x, 6.x, and 7.x (current).

You may notice there is no 3.x or 4.x: ES jumped directly from 2.4.6 to 5.0.0. This was done to unify the versions of the ELK (Elasticsearch, Logstash, Kibana) technology stack and avoid confusing users.

When Elasticsearch was at 2.x (the last 2.x release was on July 25, 2017), Kibana was already at 4.x (Kibana 4.6.5 was also released on July 25, 2017).

The next major version of Kibana was bound to be 5.x, so Elasticsearch released its next major version directly as 5.0.0.

After the unification, we can pick a version of Elasticsearch and then simply choose the same version of Kibana, without worrying about version incompatibility.

Elasticsearch is built with Java, so in addition to keeping the ELK stack versions consistent, we also need to pay attention to the JDK version when choosing an Elasticsearch version.

Each major version depends on a different JDK version; version 7.2 already supports JDK 11.

Installation and use

① Download and decompress Elasticsearch. It can be used directly after decompression, without installation. The decompressed directory contains the following:

bin: executables, including the startup command, the plug-in installation command, etc.

config: configuration file directory.

data: data storage directory.

lib: directory of dependency packages.

logs: log file directory.

modules: module library, such as the modules for x-pack.

plugins: plug-in directory.

② Run bin/elasticsearch in the installation directory to start ES.

③ ES runs on port 9200 by default. Request curl http://localhost:9200/ or enter http://localhost:9200 in a browser to get a JSON object containing information about the current node, cluster, version, and so on.

{
  "name": "U7fp3O9",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "-Rj8jGQvRIelGd9ckicUOA",
  "version": {
    "number": "6.8.1",
    "build_flavor": "default",
    "build_type": "zip",
    "build_hash": "1fad4e1",
    "build_date": "2019-06-18T13:16:52.517138Z",
    "build_snapshot": false,
    "lucene_version": "7.7.0",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}

Cluster health status

To check the health of the cluster, we can run GET /_cluster/health in the Kibana console, which returns the following information:

{
  "cluster_name": "wujiajian",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 9,
  "active_shards": 9,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 5,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 64.28571428571429
}

Cluster status is indicated by green, yellow, and red:

Green: the cluster is healthy, everything is fully functional, and all shards and replicas are working normally.

Yellow: warning status; all primary shards are working normally, but at least one replica is not. The cluster still works, but high availability is affected to some extent.

Red: the cluster is not working properly. Some shard and all of its replicas are unavailable. The cluster can still serve queries, but the results returned will be inaccurate, write requests routed to that shard will fail, and data loss may eventually result.

When the cluster status is red, it keeps serving search requests from the shards that are available, but you need to repair the unassigned shards as soon as possible.

Principles of how ES works

After covering the basic concepts and basic operations of ES, we may still have many questions:

How do they work internally?

How are primary shards and replica shards kept in sync?

What is the process of creating an index?

How does ES distribute index data to different shards? And how is the index data stored?

Why is ES a near-real-time search engine while the CRUD (create-read-update-delete) operation of a document is real-time?

And how does Elasticsearch ensure that updates are persisted without losing data in the event of a power outage?

And why does deleting a document not immediately free up space?

With these questions, let's move on to the following content.

How index writing works

The figure below depicts a cluster of 3 nodes with 12 shards in total: 4 primary shards (S0, S1, S2, S3) and 8 replica shards (R0, R1, R2, R3), two replicas for each primary. Node 1 is the master node (Master node) responsible for the state of the entire cluster.

Writes can only go to a primary shard and are then synchronized to its replica shards. There are four primary shards; by what rule does ES decide which shard a given piece of data is written to?

Why is this piece of index data written to S0 and not to S1 or S2? Why is that piece written to S3 instead of S0?

First of all, it certainly cannot be random, otherwise we would not know where to look when we want to retrieve the document later.

In fact, this process is based on the following formula:

shard = hash(routing) % number_of_primary_shards

Routing is a variable value; by default it is the document's _id, but it can also be set to a custom value.

Routing is passed through a hash function to produce a number, which is then divided by number_of_primary_shards (the number of primary shards) to get the remainder.

This remainder, which lies between 0 and number_of_primary_shards - 1, is the shard where the document lives.

This explains why the number of primary shards must be fixed when the index is created and can never change: if the number changed, all previously computed routing values would become invalid and documents could never be found again.

Because every node in the ES cluster knows, through the formula above, where every document in the cluster is stored, every node is able to handle read and write requests.

After a write request is sent to a node, that node acts as the coordinating node mentioned earlier. The coordinating node uses the routing formula to calculate which shard to write to and then forwards the request to the node holding that shard's primary.

Suppose the routing formula gives shard = hash(routing) % 4 = 0 for this piece of data.

The specific process is as follows:

The client sends a write request to node ES1 (the coordinating node), and the routing formula yields 0, so the data should be written to primary shard S0.

Node ES1 forwards the request to ES3, the node where primary shard S0 is located. ES3 accepts the request and writes it to disk.

The data is then replicated concurrently to the two replica shards R0, with data conflicts controlled through optimistic concurrency. Once all replica shards report success, node ES3 reports success to the coordinating node, and the coordinating node reports success to the client.

Storage principle

The write process described above happens in ES memory. After the data has been assigned to specific shards and replicas, it must eventually be stored on disk so that it will not be lost in the event of a power failure.

The storage path can be set in the configuration file ../config/elasticsearch.yml; by default it is the data folder of the installation directory.

The default value is not recommended, because if ES is upgraded, all the data could be lost:

path.data: /path/to/data    # index data
path.logs: /path/to/logs    # log files

① Segmented storage

Indexed documents are stored on disk in the form of segments. What is a segment? If an index file is split into multiple sub-files, each sub-file is a segment. Each segment is itself an inverted index, and segments are immutable: once index data has been written to disk, it can no longer be modified.

With segmented storage at the bottom layer, locks are almost entirely avoided during reads and writes, which greatly improves read and write performance.

After segments are written to disk, a commit point is generated. A commit point is a file that records all the segments available after the commit.

Once a segment has a commit point, it is read-only and has lost write permission. Conversely, while a segment is still in memory it only has write permission and cannot be read, which means it cannot be searched yet.

The concept of segments arose because, in early full-text retrieval, one large inverted index was built for the entire document collection and written to disk.

If the index needed updating, a whole new index had to be created to replace the original. This approach is inefficient when the data volume is large, and because the cost of creating an index is high, the data could not be updated too frequently, so freshness could not be guaranteed.

Index files are stored as segments and cannot be modified, so how are additions, updates, and deletions handled?

Addition is easy to handle: since the data is new, a new segment is simply added for the new documents.

Deletion: because segments cannot be modified, a delete does not remove the document from the old segment. Instead, a .del file is added that lists which documents in which segments have been deleted.

A document marked as deleted can still be matched by a query, but it is removed from the result set before the final results are returned.

Update: the old segment cannot be modified to reflect a document update, so an update is effectively a delete plus an add. The old document is marked as deleted in the .del file, and the new version of the document is indexed into a new segment.

Both versions of the document may be matched by a query, but the deleted old version is removed before the result set is returned.

Making segments immutable has both advantages and disadvantages. The main advantages are as follows:

No locks are needed. If the index is never updated, there is no need to worry about multiple processes modifying the data at the same time.

Once an index has been read into the kernel's file system cache, it stays there because of its immutability. As long as the file system cache has enough space, most read requests go straight to memory without hitting the disk, which gives a significant performance boost.

Other caches, such as the filter cache, remain valid for the lifetime of the index. They do not need to be rebuilt every time the data changes, because the data does not change.

Writing a single large inverted index allows the data to be compressed, reducing disk I/O and the amount of index that needs to be cached in memory.

The disadvantages of segment immutability are as follows:

When old data is deleted, it is not removed immediately but only marked as deleted in a .del file. The old data is only physically removed when segments are merged, which wastes a lot of space.

If a piece of data is updated frequently, each update adding a new version and marking the old one, a lot of space is wasted.

Every time data is added, a new segment is needed to store it. When the number of segments grows too large, the consumption of server resources such as file handles becomes very high.

Query results must cover all segments and then exclude the old data that has been marked as deleted, which adds to the query burden.

② Delayed write strategy

Having covered the storage format, what is the process of writing the index to disk? Is it written straight to disk by calling fsync directly?

The answer is obviously no: writing directly to disk would mean the disk I/O cost seriously hurts performance.

Then, whenever a large amount of data is written, ES would stall and queries could not respond quickly. If that were the case, ES could hardly be called a near-real-time full-text search engine.

To improve write performance, ES does not add a segment to disk for every new piece of data; instead it adopts a delayed write strategy.

Whenever new data arrives, it is first written to memory; between memory and disk sits the file system cache.

When the default interval (1 second) elapses or a certain amount of data has accumulated in memory, a Refresh is triggered: the in-memory data is turned into a new segment and cached in the file system cache, and only later is it flushed to disk and a commit point generated.

The memory here is ES's JVM memory, while the file system cache uses the operating system's memory.

New data keeps being written to memory, but data in memory is not yet stored as segments and therefore cannot be searched.

When the data is flushed from memory into the file system cache, a new segment is generated and opened for search, without waiting for it to be flushed to disk.

In Elasticsearch, this lightweight process of writing and opening a new segment is called Refresh (that is, flushing memory to the file system cache).

By default, each shard is automatically refreshed once per second. This is why we say Elasticsearch is near-real-time: changes to a document are not visible to search immediately, but become visible within one second.

We can also trigger a refresh manually: POST /_refresh refreshes all indexes, and POST /nba/_refresh refreshes the specified index.

Tips: although a refresh is a much lighter operation than a commit, it still carries a performance cost. Manual refresh is useful when writing tests, but do not manually refresh after every single document in a production environment. Also, not every use case needs a refresh every second.

Perhaps you are using Elasticsearch to index a large number of log files, and you want to optimize indexing speed rather than near-real-time search.

In that case, you can increase the refresh_interval value in the index Settings, for example refresh_interval: "30s", to reduce the refresh frequency of the index. Be careful to include a time unit; otherwise the value is interpreted as milliseconds. Setting refresh_interval to -1 turns off automatic refresh for the index.
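A minimal sketch of adjusting the refresh interval on an existing index (the index name is illustrative):

PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}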

Although the delayed write strategy reduces the number of times data is written to disk and improves overall write capacity, the file system cache is still memory, belonging to the operating system, and data in memory is still at risk of being lost in a power failure or other abnormal situation.

To avoid data loss, Elasticsearch adds a transaction log (Translog), which records all data that has not yet been persisted to disk.

With the transaction log added, the whole write process looks like the figure above:

After a new document is indexed, it is first written to memory, but to guard against data loss a copy of the data is also appended to the transaction log.

New documents keep being written to memory and recorded in the transaction log. At this point the new data cannot yet be retrieved or queried.

When the default refresh interval elapses or a certain amount of data has accumulated in memory, a Refresh is triggered: the in-memory data is flushed into the file system cache as a new segment and the memory is cleared. Although the new segment has not yet been committed to disk, it can already serve document retrieval; it just cannot be modified.

As new documents keep being indexed, a Flush is triggered when the translog grows beyond 512 MB or more than 30 minutes have passed.

All the data in memory is written to a new segment and into the file system cache, the data in the file system cache is flushed to disk via fsync, a commit point is generated, the old log file is deleted, and a new empty log is created.

This way, after a power failure or a restart, ES not only loads the persisted segments according to the commit point but also replays the records in the Translog to re-persist the data that had not yet been persisted, thereby avoiding the possibility of data loss.
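A hedged sketch of the translog-related index settings involved in this process (the index name and values are illustrative, and exact defaults vary by version; async durability trades a small window of possible data loss for higher throughput):

PUT /my_index/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5s",
  "index.translog.flush_threshold_size": "512mb"
}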

③ Segment merging

Because the automatic refresh process creates a new segment every second, the number of segments surges in a short time, and too many segments cause trouble.

Each segment consumes file handles, memory, and CPU cycles. More importantly, every search request must check each segment in turn and then merge the results, so the more segments there are, the slower the search.

Elasticsearch solves this by periodically merging segments in the background: small segments are merged into large segments, which in turn are merged into even larger segments.

When segments are merged, the old deleted documents are purged from the file system; deleted documents are not copied into the new, larger segment. Indexing and searching are not interrupted during the merge.

Segment merging happens automatically while indexing and searching. The merge process picks a small number of segments of similar size and merges them into a larger segment in the background; the segments involved can be either uncommitted or committed.

After the merge, the old segments are deleted, the new segment is flushed to disk, a new commit point is written that includes the new segment and excludes the old, smaller ones, and the new segment is opened for search.

Segment merging is computationally expensive and eats a lot of disk I/O. It can slow down the write rate and, if left unchecked, can hurt search performance.

By default, Elasticsearch puts resource limits on the merge process, so searches still have enough resources to perform well.
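As a hedged sketch, an index that is no longer being written to can be merged down to fewer segments manually with the force-merge API (the index name is illustrative; avoid running this on indexes that are still receiving writes):

POST /my_index/_forcemerge?max_num_segments=1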

Performance optimization

Storage devices

Disks are usually the bottleneck on modern servers. Elasticsearch uses disks heavily, and the more throughput your disks can handle, the more stable your nodes will be.

Here are some tips for optimizing disk I/O:

Use SSDs. As mentioned elsewhere, they are far better than mechanical disks.

Use RAID 0. Striped RAID increases disk I/O, at the obvious cost of total failure when a single drive dies. Do not use mirrored or parity RAID, because replicas already provide that protection.

Alternatively, use multiple disks and let Elasticsearch stripe data across them by configuring multiple path.data directories.

Do not use remotely mounted storage such as NFS or SMB/CIFS. The latency it introduces is the enemy of performance.

If you are on EC2, beware of EBS. Even SSD-backed EBS is usually slower than local instance storage.

Internal index optimization

To find a Term quickly, Elasticsearch first sorts all the terms and then locates a Term by binary search, with time complexity O(log N), just like looking something up in a dictionary. This is the Term Dictionary.

This looks similar to the way traditional databases use a B-Tree. But if there are too many terms, the Term Dictionary becomes very large and keeping it all in memory is unrealistic, hence the Term Index.

Like the index page of a dictionary, which tells you which terms start with "A" and which page they are on, the Term Index can be understood as a tree.

This tree does not contain all the terms; it contains only some prefixes of terms. Through the Term Index you can quickly locate an offset in the Term Dictionary and then scan sequentially from that position.

The Term Index is compressed with an FST and kept in memory, storing all the terms in byte form. This compression effectively reduces storage space so that the Term Index fits into memory, but it also requires more CPU resources for lookups.

The inverted lists stored on disk are also compressed to reduce storage space.

Adjust configuration parameters

It is recommended to adjust the configuration parameters as follows:

Give each document an ordered ID with a compression-friendly sequential pattern, and avoid random IDs such as UUID-4, which compress poorly and significantly slow Lucene down.

Disable Doc Values for index fields that do not need aggregation or sorting. Doc Values are an ordered, document => field value mapping.

For fields that do not need full-text (fuzzy) retrieval, use the Keyword type instead of the Text type, which avoids analyzing the text before indexing.

If your search results do not need to be accurate in near real time, consider changing index.refresh_interval to 30s for each index.

If you are doing a bulk import, you can turn off refresh by setting this value to -1, and you can also disable replicas by setting index.number_of_replicas: 0. Don't forget to restore both settings when the import is finished, as in the sketch below.
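A minimal sketch of the bulk-import settings mentioned above (the index name is illustrative): the first request is issued before the import, and the second restores normal settings afterwards.

PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0
  }
}

PUT /my_index/_settings
{
  "index": {
    "refresh_interval": "1s",
    "number_of_replicas": 1
  }
}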

Avoid deep paging queries; it is recommended to use Scroll for paging instead. An ordinary paged query creates an empty priority queue of size from + size, and each shard returns from + size results; by default only the document IDs and the scores are sent to the coordinating node.

If there are N shards, the coordinating node re-sorts the (from + size) × N results and then picks out the documents to return. When from is very large, this sorting becomes heavy and consumes a lot of CPU. A scroll query sketch is shown below.
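A hedged sketch of a scroll query (the index name, keep-alive time, and page size are illustrative): the first request opens the scroll, and each subsequent request passes back the scroll_id it returned to fetch the next batch.

GET /my_index/_search?scroll=1m
{
  "size": 1000,
  "query": { "match_all": {} }
}

GET /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<scroll_id returned by the previous request>"
}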

Reduce the number of mapped fields; only map fields that need to be searched, aggregated, or sorted. Other fields can be kept in another store such as HBase: after getting the hits from ES, fetch the remaining fields from HBase.

Specify a routing value when indexing and when querying, so that a query can go straight to the specific shard, improving query efficiency. When choosing a routing key, pay attention to keeping the data evenly distributed. A routing sketch is shown below.
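A minimal sketch of custom routing (the index name, document ID, and routing value are illustrative); the same routing value must be supplied at query time so that the request hits only the matching shard:

PUT /my_index/_doc/1?routing=user123
{
  "title": "hello"
}

GET /my_index/_search?routing=user123
{
  "query": { "match": { "title": "hello" } }
}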

JVM tuning

The recommendations for JVM tuning are as follows:

Make sure the minimum heap size (Xms) and the maximum heap size (Xmx) are the same, to prevent the program from resizing the heap at runtime.

By default, Elasticsearch is installed with a 1 GB heap. It can be configured in the ../config/jvm.options file, but it is best to use no more than 50% of physical memory and no more than 32 GB.
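A minimal sketch of the heap settings in config/jvm.options (the 8 GB value is illustrative and should be sized to the machine, keeping Xms and Xmx equal):

-Xms8g
-Xmx8g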

GC uses CMS by default, which suffers from stop-the-world pauses, so you can consider switching to the G1 collector.

ES relies heavily on the file system cache (Filesystem Cache) for fast search. In general, you should make sure that at least half of the physically available memory is left to the file system cache.

This concludes "What are the basic principles of Elasticsearch". Thank you for reading. If you want to learn more about the industry, follow this site, where the editor will keep publishing practical, high-quality articles for you!
