
How to Solve the Split-Brain Problem in an Elasticsearch Cluster


This article explains how to solve the split-brain problem in an Elasticsearch cluster. The content is fairly detailed; interested readers can use it as a reference and will hopefully find it helpful.

# The Elasticsearch cluster split-brain problem

Normally, every node in the cluster should agree on which node is the elected master, so the cluster state reported by each node should be identical. If different nodes report different cluster states, they disagree about which node is master, which is the so-called split-brain problem. A split-brain leaves nodes without a correct view of the cluster and prevents the cluster from working properly.

Possible causes:

Network: the cluster communicates over an internal network, so the chance that a network problem made some nodes believe the master was dead and elect a new one is low. Checking the Ganglia cluster monitoring showed no abnormal internal network traffic, so this cause can be ruled out.

Node load: the master-eligible nodes and the data nodes were mixed together. When the workload on a data node is heavy (and ours indeed was), the corresponding ES instance can stop responding; if that server is also acting as the master, other nodes will conclude that the master has failed and elect a new one. In addition, because the ES process on a data node occupies a large amount of memory, large garbage-collection pauses can also make the ES process unresponsive. This is therefore the most likely cause.

Ways to deal with the problem:

Based on the analysis above, we inferred that node load caused the master process to stop responding, which in turn led different nodes to disagree about which node is master. An intuitive solution is therefore to separate the master-eligible nodes from the data nodes. To do this, we added three servers to the ES cluster whose only role is to be master-eligible; they do not store data or serve searches, so they run as relatively lightweight processes. Their role can be restricted with the following configuration:

node.master: true
node.data: false

The other nodes should no longer be master-eligible, which is done by reversing the configuration above. This separates the master-eligible nodes from the data nodes. To let the newly added nodes locate the master quickly, you can also change the data nodes' master discovery from the default multicast to unicast:

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["master1", "master2", "master3"]

discovery.zen.ping_timeout (default 3 seconds): by default, a node decides the master is dead if it does not reply within 3 seconds. Increasing this value gives the master more time to respond and reduces misjudgment to some extent.

discovery.zen.minimum_master_nodes (default 1): this parameter controls how many master-eligible nodes a node must be able to see before it can operate as part of the cluster. The officially recommended value is (N / 2) + 1, where N is the number of master-eligible nodes. In our case N = 3, so this parameter is set to 2. Note that with only 2 master-eligible nodes, setting it to 2 is problematic: once one of them goes down, the surviving node can never see 2 master-eligible nodes and the cluster cannot elect a master.
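As a quick sanity check of the quorum formula, here is a minimal Python sketch (the helper name quorum is just for illustration) that computes the recommended discovery.zen.minimum_master_nodes value for a given number of master-eligible nodes:

```python
def quorum(master_eligible_nodes: int) -> int:
    """Recommended discovery.zen.minimum_master_nodes: floor(N / 2) + 1."""
    return master_eligible_nodes // 2 + 1

print(quorum(3))  # -> 2, the value used for the three dedicated master nodes here
print(quorum(2))  # -> 2, which is why a two-node setup cannot survive losing a node
```

Putting these settings together with unicast discovery, the configuration used for this cluster was: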

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping_timeout: 120s
discovery.zen.minimum_master_nodes: 2
client.transport.ping_timeout: 60s
discovery.zen.ping.unicast.hosts: ["10.0.31.2", "10.0.33.2"]
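With zen discovery, discovery.zen.minimum_master_nodes can also be updated at runtime through the cluster update settings API, which is convenient after adding or removing master-eligible servers. A minimal sketch, assuming a node is reachable at http://localhost:9200 (adjust the host for your cluster):

```python
import json
import urllib.request

# Assumed endpoint for illustration; point this at any node in your cluster.
ES_SETTINGS_URL = "http://localhost:9200/_cluster/settings"

body = json.dumps({
    "persistent": {
        # Quorum for three master-eligible nodes: floor(3 / 2) + 1 = 2.
        "discovery.zen.minimum_master_nodes": 2,
    }
}).encode("utf-8")

request = urllib.request.Request(
    ES_SETTINGS_URL,
    data=body,
    method="PUT",
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(response.status, response.read().decode("utf-8"))
```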

Can we really rest easy now? In fact, the problem can still occur. Elasticsearch's issue tracker discusses a special case (#2488): even with minimum_master_nodes set to the correct value, a split-brain can still happen.

How can you identify this problem? It is important to detect split-brain in your cluster as early as possible. A relatively easy approach is to fetch the /_nodes response from each node regularly; it returns a status report of all nodes in the cluster. If two nodes return different views of the cluster, that is a warning sign that a split-brain has occurred.
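A minimal sketch of such a check in Python, assuming each node's HTTP port (9200) is reachable; the host list is illustrative and should be replaced with your own nodes. It fetches /_nodes from every node and compares the set of node IDs each one reports:

```python
import json
import urllib.request

# Illustrative host list; replace with the nodes of your own cluster.
HOSTS = ["10.0.31.2", "10.0.33.2"]

def nodes_seen_by(host: str) -> frozenset:
    """Return the set of node IDs that `host` believes are in the cluster."""
    with urllib.request.urlopen(f"http://{host}:9200/_nodes", timeout=10) as resp:
        info = json.load(resp)
    return frozenset(info["nodes"].keys())

views = {host: nodes_seen_by(host) for host in HOSTS}

if len(set(views.values())) > 1:
    print("WARNING: nodes disagree about cluster membership (possible split-brain)")
    for host, node_ids in views.items():
        print(f"  {host} sees {len(node_ids)} node(s): {sorted(node_ids)}")
else:
    print("All queried nodes report the same cluster membership.")
```

Running a check like this periodically (for example from cron) and alerting when the views diverge gives an early warning before clients start seeing inconsistent data.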

For an ES node to be fully functional, the cluster it belongs to must have an active master node. ES 1.4.0.Beta1 added a new setting that blocks cluster operations when there is no master: discovery.zen.no_master_block.

When there is no active master node in the cluster, this setting specifies which operations (read, write) should be rejected, that is, blocked. It accepts two values, all and write, and the default is write.

This setting does not affect the basic APIs (such as the cluster state, node info, and node status APIs); those requests can be executed on any node and are never blocked.

PS: despite this issue, the advantages of ES (it works out of the box, clusters natively, tolerates faults automatically, and scales well) are why we still choose it for full-text search.

That covers how to solve the split-brain problem in an Elasticsearch cluster. I hope the content above is helpful; if you found the article useful, feel free to share it with others.
