2025-01-18 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains what Elasticsearch nodes, clusters, shards, and replicas are.
Elasticsearch is a Lucene-based search server. It provides a distributed, multi-tenant full-text search engine with a RESTful web interface. Developed in Java and released as open source under the Apache license, it is a popular enterprise search engine.
Elasticsearch distributed
The Elasticsearch distributed feature includes the following points:
High availability
What is high availability? Start from the CAP theorem, which is the foundation of distributed systems and names their three key properties:
Consistency
Availability
Partition tolerance
What is high availability (High Availability)? High availability, HA for short, is a property of a system: the service keeps running with acceptable performance for as much of the time as possible, and the time during which the service is unavailable is minimized.
The measure of high availability is whether the overall system and its services remain usable when one or more servers go down. For example, some well-known websites guarantee "four nines" of availability, that is, more than 99.99% uptime; the remaining 0.01% is the downtime budget.
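The "nines" arithmetic above is easy to sketch. A minimal illustration (the function name is mine, not from any library):

```python
# Downtime budget implied by an availability target.
# "Four nines" (99.99%) leaves only about 52.6 minutes per year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of allowed downtime per year for a given availability ratio."""
    return MINUTES_PER_YEAR * (1 - availability)

print(round(downtime_minutes_per_year(0.9999), 1))  # 52.6
print(round(downtime_minutes_per_year(0.999), 1))   # 525.6 ("three nines")
```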
Elasticsearch embodies the following two points in terms of high availability:
Service availability: some nodes may stop serving without affecting the service as a whole.
Data availability: some nodes may be lost without any data ultimately being lost.
Scalable
As the company's business grows, Elasticsearch faces two challenges:
The volume of searchable data grows from millions to hundreds of millions of documents.
Search QPS (queries per second) also soars.
You may then need to redistribute the existing and incremental data from, say, 10 nodes to 100 nodes. Elasticsearch can scale out to hundreds (or even thousands) of server nodes while handling petabytes of data. Elasticsearch was built for scalability: growing from a small cluster to a large one is almost completely automatic, which is what horizontal scaling means.
Elasticsearch distributed feature
From the scalability above, the benefits of Elasticsearch's distributed design are obvious:
Storage can be scaled horizontally, trading horizontal space for time.
Some nodes can stop serving while the cluster as a whole keeps providing service normally.
Elasticsearch automatically performs the distribution-related work in the background:
Automatically distributing documents across shards, which may live on multiple nodes
Balancing shards across cluster nodes and load-balancing index and search operations
Replicating each shard to provide data redundancy and prevent data loss from hardware failure
Seamlessly integrating new nodes and redistributing shards when the cluster is expanded
Elasticsearch cluster knowledge points are as follows:
Different clusters are distinguished by name; the default cluster name is "elasticsearch".
The cluster name (cluster.name) can be changed in the configuration file or on the command line with -E cluster.name=user-es-cluster. A cluster consists of one or more nodes.
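For illustration, the same setting can live in elasticsearch.yml (the values below are examples, not defaults):

```yaml
# elasticsearch.yml -- example cluster and node naming
cluster.name: user-es-cluster
node.name: node01
```

The command-line equivalent is bin/elasticsearch -E cluster.name=user-es-cluster -E node.name=node01.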
Elasticsearch Node & Cluster
An Elasticsearch cluster is a distributed cluster composed of multiple nodes. So what is a node?
A node (Node) is an instance of the Elasticsearch application. Since the Elasticsearch source code is written in Java, a node is a Java process.
As with Spring applications, a server or local machine can run multiple nodes as long as they use different ports. In production, however, one server usually runs a single Elasticsearch node. Also note:
Each node has a name, specified with -E node.name=node01 or in the configuration file.
When a node starts successfully, it is assigned a UID, which is saved in the data directory.
You can check the health status of the cluster with the _cluster/health API:
Green: all primary and replica shards are normal.
Yellow: primary shards are normal, but some replica shards are not.
Red: some primary shard is abnormal, for example because a shard has outgrown the disk.
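The color logic above can be summarized in a small sketch (an illustration of the rules as described, not Elasticsearch's actual implementation):

```python
# Sketch of the cluster-health colors described above:
# red    -> at least one primary shard is not allocated
# yellow -> primaries are fine, but some replica is not allocated
# green  -> all primaries and replicas are allocated

def cluster_health(primaries_ok: bool, replicas_ok: bool) -> str:
    if not primaries_ok:
        return "red"
    if not replicas_ok:
        return "yellow"
    return "green"

print(cluster_health(True, False))  # yellow
```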
As shown in the figure, there is a Master node alongside the other nodes. So how many types of nodes are there?
Master-eligible Node and Master Node
When Elasticsearch starts, a node is by default a Master-eligible Node; by participating in the election process it can become the Master Node. The election mechanism itself deserves a separate article. What is the Master Node for?
The Master Node is responsible for maintaining and synchronizing the cluster state, which includes:
All node information
All indexes, together with their Mapping and Setting information
All shard routing information
Only the Master node may modify this information, which guarantees the consistency of the cluster state.
Data Node and Coordinating Node
Data Node, also known as a data node, stores the data and plays a vital role in scaling data horizontally.
Coordinating Node: receives client requests (REST clients and others), routes each request to the appropriate nodes, and finally merges the results. By default, every node takes on the role of a coordinating node.
Other node types
There are other node types that are less common but worth knowing:
Hot & Warm Node: Data Nodes with different hardware profiles, used to build a hot-warm architecture and reduce operation and deployment costs
Machine Learning Node: a node responsible for machine-learning jobs
Tribe Node: responsible for connecting different clusters; supports Cross Cluster Search across clusters
Node roles are configured as follows (in a development environment a single node usually takes all of these roles):
Master-eligible node: configured via node.master, default true
Data node: configured via node.data, default true
Ingest node: configured via node.ingest, default true
Coordinating node: every node is a coordinating node by default; a dedicated coordinating node sets all the other role settings to false
Machine learning node: configured via node.ml, default true (requires x-pack)
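Putting the flags together, a dedicated master-eligible node in the boolean style described above might be configured like this (illustrative example):

```yaml
# elasticsearch.yml -- dedicated master-eligible node (boolean role style)
node.master: true
node.data: false
node.ingest: false
node.ml: false
```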
Primary shards and replicas
Look at the figure again: the three nodes are Node1, Node2, and Node3, and Node3 holds a primary shard P0 and a replica R2. So what is a primary shard?
Primary shards solve the problem of scaling data horizontally. For example, the layout in the figure above distributes the data across all nodes:
A node may hold one or more primary shards, or none at all
The number of primary shards is fixed when the index is created and cannot be changed later, except through a Reindex operation
Replicas back up the data and improve its availability; a replica shard is a copy of a primary shard
The number of replica shards can be adjusted dynamically
Increasing the number of replicas can, to a certain extent, improve read throughput and availability.
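The resulting shard count is simple arithmetic; a quick sketch (the function name is mine):

```python
# Every primary shard is copied `replicas` times, so an index stores
# primaries * (1 + replicas) shards across the cluster. With the
# settings example in this article (8 primaries, 1 replica) that is 16.

def total_shards(primaries: int, replicas: int) -> int:
    """Total shards an index occupies in the cluster."""
    return primaries * (1 + replicas)

print(total_shards(8, 1))  # 16
```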
How do you view the shard configuration of an index in an Elasticsearch cluster? It can be read from the index settings, where number_of_shards is the number of primary shards and number_of_replicas is the number of replicas per primary:
{
  "my_index": {
    "settings": {
      "index": {
        "number_of_shards": "8",
        "number_of_replicas": "1"
      }
    }
  }
}
Practical advice: in a production environment the shard settings are very important and call for careful capacity assessment and planning.
Estimate the number of shards from the expected data volume. If the number is too small, the index cannot be scaled horizontally later, and a single shard holding too much data makes operations on it seriously time-consuming.
If the number of shards is too large, resources are wasted and performance declines; it can also affect relevance scoring and search accuracy.
Per index, the number of shards does not need to be large. How do you evaluate it? For example, if an index holds 100 GB of data and you configure 10 shards, each shard averages 10 GB, which is a manageable size. Plan the shard count from the estimated data volume; if you later need to change the number of primary shards, you must migrate the index with an operation such as reindex.
Summary
Only when you understand the search scenario, such as how much data there is, how much is written, and whether the workload is write-heavy or query-heavy, can you settle the configuration:
Disk: SSD is recommended, and the JVM heap (Xmx) should not exceed 30 GB. Set at least 1 replica shard, and keep a single primary shard under about 30 GB of storage; from these figures you can calculate the number of shards to set.
When the disks in the cluster are almost full and you add machines, all newly created indexes may be allocated to the new nodes, which eventually leads to uneven data distribution. So monitor the cluster: at around 70% disk usage, consider deleting data or adding nodes, and limit the maximum number of shards per node, to mitigate the problem.
If a shard grows too large, there is no way to recover it quickly, so try to keep each shard under 40 GB.
Thank you for reading. That is the content of "what are Elasticsearch nodes, clusters, shards and replicas". Hopefully this article has given you a deeper understanding of the topic; the specifics still need to be verified in practice.
© 2024 shulou.com SLNews company. All rights reserved.