

What is the so-called split-brain problem in an Elasticsearch cluster?


This article explains in detail the so-called split-brain problem in an Elasticsearch cluster. I hope you will have a solid understanding of the relevant concepts after reading it.

The so-called split-brain problem (analogous to schizophrenia) occurs when different nodes in the same cluster hold different views of the cluster's state.

Today an extremely slow query appeared in our Elasticsearch cluster. I checked the cluster status with the following command:

curl -XGET 'es-1:9200/_cluster/health'

The overall cluster state was red, and the originally nine-node cluster showed only four nodes in the response. Moreover, when I sent the same request to different nodes, I found that although every node reported an overall red status, they did not agree on the number of available nodes.

Normally, all nodes in a cluster should agree on which node is the master, so the status information they return should also be consistent. Inconsistent results indicate that some nodes have elected a different master node, which is exactly the so-called split-brain problem. In this state, nodes lose the correct view of the cluster, and the cluster cannot work properly.
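To confirm the disagreement, you can ask each node directly which master it has elected. A minimal check, assuming the _cat API is available in your Elasticsearch version (es-1 is the node from the command above; es-2 stands in for any other node in the cluster):

# ask two different nodes which node they currently consider master
curl -XGET 'es-1:9200/_cat/master?v'
curl -XGET 'es-2:9200/_cat/master?v'

If the responses name different master nodes, the cluster has split into partitions, each following its own master.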

Possible causes:

1. Network: since the nodes communicate over an internal network, it is unlikely that network problems caused some nodes to believe the master had died and to elect a new one. Checking the Ganglia cluster monitoring showed no abnormal internal network traffic, so this cause can be ruled out.

2. Node load: the master node and the data nodes are mixed together, so when the workload on a worker node is high (and it was indeed high), the corresponding ES instance stops responding. If that server also happens to be the master, other nodes will decide that the master has failed and elect a new one. In addition, because the ES process on a data node occupies a large amount of memory, large-scale garbage collection can also cause the ES process to stop responding. This cause is therefore the most likely one.
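If node load is the suspect, the JVM statistics of the data nodes give a quick confirmation. A minimal check via the node stats API, assuming es-1 is one of the heavily loaded data nodes:

# inspect heap usage and garbage-collection times of the node's JVM
curl -XGET 'es-1:9200/_nodes/stats/jvm?pretty'

A heap that stays near its limit, together with long old-generation collection times, matches the pattern of an ES process that periodically stops responding.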

Ways to deal with the problem:

1. Following the analysis above, the inferred cause is that node load makes the master process stop responding, which in turn leads some nodes to disagree about which node is the master. An intuitive solution is therefore to separate the master nodes from the data nodes. To do this, we added three servers to the ES cluster whose only role is to act as master nodes; they do not store data or serve searches, so they run as relatively lightweight processes. Their role can be restricted with the following configuration:

node.master: true

node.data: false

Of course, the other nodes should then no longer be allowed to act as master, which is done by reversing the configuration above (see the sketch after the discovery settings below). This separates the master nodes from the data nodes. In addition, so that newly added nodes can quickly locate the master, you can change the data nodes' master discovery from the default multicast to unicast:

discovery.zen.ping.multicast.enabled: false

discovery.zen.ping.unicast.hosts: ["master1", "master2", "master3"]
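Putting the two pieces together, the configuration of an ordinary data node would then look roughly like the following sketch (master1 to master3 are the three dedicated master servers added above; adjust the host names to your own environment):

# data node: stores data and serves searches, but never becomes master
node.master: false
node.data: true
# locate the cluster through the dedicated master nodes instead of multicast
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["master1", "master2", "master3"]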

2. There are also two intuitive parameters that can reduce the likelihood of the split-brain problem:

discovery.zen.ping_timeout (default: 3 seconds): by default, a node considers the master dead if it does not reply within 3 seconds. Increasing this value gives the master more time to respond and reduces misjudgments to a certain extent.

discovery.zen.minimum_master_nodes (default: 1): this parameter controls how many master-eligible nodes a node must see before it may operate as part of the cluster. The officially recommended value is (N / 2) + 1, where N is the number of master-eligible nodes (in our case N is 3, so this parameter is set to 2). Note that with only 2 master-eligible nodes this setting is problematic: once one of them goes down, the remaining node can never see 2 master-eligible servers and the cluster cannot elect a master.
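For the three dedicated master nodes added above, N is 3, so (3 / 2) + 1 = 2 with integer division. A sketch of the corresponding settings (the 10-second timeout is an illustrative value, not one taken from the original setup):

# wait longer for the master's ping response before declaring it dead
discovery.zen.ping_timeout: 10s
# require a quorum of 2 of the 3 master-eligible nodes before electing a master
discovery.zen.minimum_master_nodes: 2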

The above measures can only reduce the likelihood of this phenomenon; they cannot eliminate it completely. That is all on the so-called split-brain problem in Elasticsearch clusters; I hope the content above is helpful to you. If you think the article is good, please share it so that more people can see it.

