A Pitfall-Avoidance Guide to Zookeeper Cluster Operation and Maintenance

2025-02-23 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 05/31 Report

This article shares a practical pitfall-avoidance guide to Zookeeper cluster operation and maintenance. We hope you take something away from it.

This is the first installment of the "Monitoring" series.

Monitoring, which helps judge service health, locate service problems, and see into a system's internal state, is a very important part of operations work. This series shares Jingdong Cloud's best practices in service monitoring.

In this issue, we focus on Zookeeper cluster monitoring.

Zookeeper (hereafter ZK) is an open-source distributed coordination service for applications, an open-source implementation of Google's Chubby service, and an important component of open-source software such as Hadoop and HBase. Through a set of ZK failure cases, this article introduces some of ZK's important monitoring metrics. Enjoy:

Service failure cases

Capacity issues:

After some followers fell out of sync, the abnormal followers were restarted manually, but they still could not rejoin the cluster. Suspecting a cluster-level problem, the operators restarted the entire cluster, which then failed to reach a normal state: with no leader, the service was paralyzed. In hindsight, the snapshot had reached GB scale, while the initLimit default allowed only 20 s (10 ticks of 2000 ms); after restarting, a follower could not synchronize GB-level data within 20 s and was kicked out of the cluster, and the blanket restart aggravated the problem, collapsing the cluster as a whole. Service was finally restored by manually copying the pre-failure leader's snapshot to all nodes and increasing the synchronization-related parameters in zoo.cfg.
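The fix involved raising the synchronization windows in zoo.cfg. A minimal sketch of the relevant parameters follows; the values are illustrative (not the defaults) and should be tuned to your snapshot size and network, since both limits are expressed as multiples of tickTime:

```ini
# zoo.cfg -- illustrative values, not defaults
tickTime=2000    # length of one tick, in milliseconds
initLimit=300    # time a follower may take for initial sync: 300 * 2000 ms = 10 min
syncLimit=10     # max lag allowed between leader and follower: 10 * 2000 ms = 20 s
```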

In this case, the oversized snapshot was the main cause of the failure. We need to tune the initLimit and syncLimit parameters, regulate how the business uses ZK so that it is not treated as a general-purpose file store, and add monitoring of snapshot size (zk_approximate_data_size), alerting when it exceeds 1 GB. Similarly, too many znodes severely degrade cluster performance, so the cluster's znode count (zk_znode_count) also needs monitoring, alerting above 100,000 nodes.
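The alerting rule just described can be sketched as follows. The thresholds and metric names come from this article; the input is assumed to be a dict of metrics parsed from the mntr four-letter command, and the function name is illustrative:

```python
GB = 1024 ** 3  # 1 GB snapshot-size threshold from the text

def capacity_alerts(metrics):
    """Return alert strings for the two capacity metrics discussed above.

    `metrics` is assumed to be a dict parsed from ZK's 'mntr' output, e.g.
    {"zk_approximate_data_size": 123456, "zk_znode_count": 42}.
    """
    alerts = []
    if metrics.get("zk_approximate_data_size", 0) > GB:
        alerts.append("snapshot size over 1GB: %d bytes"
                      % metrics["zk_approximate_data_size"])
    if metrics.get("zk_znode_count", 0) > 100_000:
        alerts.append("znode count over 100000: %d" % metrics["zk_znode_count"])
    return alerts
```

In a real deployment this check would run against the mntr output of every cluster member, not just one node.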

Resource issues:

The ZK cluster was deployed on the same batch of physical machines as Hadoop. When Hadoop's computing load increased, it saturated the machines' CPUs, the co-located ZK cluster could no longer respond to external requests, and all Hadoop services depending on ZK crashed. Besides CPU, ZK also depends on local disk space, disk I/O capacity, the network, and so on. Accordingly, it is recommended to deploy ZK clusters on dedicated machines rather than co-locating them with other workloads, and to monitor CPU/MEM/NET/IO on the ZK hosts so their resources are not starved.

Traffic issues:

A distributed system shipped a new feature. Its client showed no problems during a gradual rollout over several days, so the client was then fully rolled out in one day. All clients began requesting the ZK cluster on a fixed schedule; the cluster could not handle such a large request volume and crashed outright, and the client had to be rolled back in full. Although the cluster was configured so that the leader did not accept client requests, and the maximum number of concurrent connections per IP was limited, this could not prevent the cluster from collapsing under the flood of requests.

In this case, if traffic-related monitoring had been added early, such as ZK connection counts (zk_num_alive_connections) and ZK node traffic (zk_packets_received/zk_packets_sent), the sudden surge in cluster traffic could have been sensed in advance.

Service anomalies:

Follower failures were not handled in time, so the number of failed followers in a single cluster exceeded what the cluster could tolerate, and the cluster collapsed completely. The faulty followers then had to be repaired immediately, but hardware failures meant they could not be restored quickly; meanwhile, most business clients connected directly by IP and could not be switched quickly either. Cluster pressure remained high, so even forcing a switch to standalone mode would have required rate limiting. However it was handled, the service was degraded for a long time.

In this case, had follower-related monitoring such as zk_followers / zk_synced_followers and zk_server_state been added early, alerts would have fired as soon as the first followers failed, the service could have been restored promptly, and this tragedy would not have happened.

Capacity issues:

The ZK cluster's file-handle limit was left at the system default of 10240, but the actual load was far higher, so ZK could not handle some new requests, and locating the problem cost extra time and effort. Once found, it was solved by raising the file-handle limit of the account running ZK and restarting the service.

In this case, the problem could have been avoided if monitoring of zk_open_file_descriptor_count/zk_max_file_descriptor_count had been added early. Many open-source systems run into file-handle limits, which have caused major failures time and again, so they deserve careful attention.

Isolation issues:

A single ZK cluster provided coordination service nationwide, so when it failed, the service became unavailable in every region of the country. The cluster should be split, with a separate cluster deployed per region, limiting the blast radius of a failure to a single region. Monitoring is neither the main problem nor the solution here; this case is included to give a more complete picture of ZK cluster failures.

Operation and maintenance dashboard

Collection item screening

The failure cases above illustrate the importance of some core metrics. Next, we systematically organize and summarize ZK monitoring according to Google's SRE monitoring methodology:

Black box monitoring

Cluster function

Create / delete / read nodes

Note: periodically create and delete nodes under /zookeeper_monitor to verify that this capability is available

Suggestion: create a dedicated /zookeeper_monitor node instead of using business nodes, to avoid mutual interference

Experience: probe from at least 3 simulated client nodes, so that all ZK servers are covered

Read / update content

Note: periodically read and update content under the /zookeeper_monitor node

Suggestion: write a timestamp as the content to make write latency easy to judge

White box monitoring

Collection mode

Mode 1: Zookeeper four-letter command mntr

Mode 2: JMX interface
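As a sketch of Mode 1, the snippet below sends mntr over a raw socket and parses the tab-separated reply into a dict. The host and port are assumptions for a local server; note that on ZK 3.5+ the command must first be whitelisted in zoo.cfg, e.g. 4lw.commands.whitelist=mntr,stat:

```python
import socket

def fetch_mntr(host="127.0.0.1", port=2181, timeout=3.0):
    """Send the 'mntr' four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(b"mntr")
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode()

def parse_mntr(text):
    """Parse 'key<TAB>value' lines into a dict; numeric values become int."""
    metrics = {}
    for line in text.splitlines():
        if "\t" not in line:
            continue
        key, _, value = line.partition("\t")
        try:
            metrics[key] = int(value)
        except ValueError:
            metrics[key] = value  # e.g. zk_version, zk_server_state
    return metrics
```

A typical use is `parse_mntr(fetch_mntr(host))` run against every node in the cluster on each collection cycle.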

Errors

zk_server_state

Note: a cluster must have exactly one leader. With no leader the cluster cannot work properly; two or more leaders indicate split-brain, which leads to data inconsistency.

Importance: high

zk_followers / zk_synced_followers

Note: if these two values are not equal, some followers are abnormal and must be handled immediately. Many avoidable accidents have been caused by too many follower failures in a single cluster.

Importance: high
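A minimal sketch combining the two checks above: count leaders across the zk_server_state values collected from every node, and compare zk_followers with zk_synced_followers as reported by the leader. The function name and input shapes are illustrative:

```python
def cluster_role_alerts(states, followers=None, synced_followers=None):
    """Check leader count and follower sync status for one ZK cluster.

    `states` is assumed to be the list of zk_server_state values gathered
    from every node via 'mntr', e.g. ["leader", "follower", "follower"].
    `followers` / `synced_followers` are the leader's mntr counters.
    """
    alerts = []
    leaders = states.count("leader")
    if leaders == 0:
        alerts.append("no leader: cluster cannot serve writes")
    elif leaders > 1:
        alerts.append("split-brain: %d leaders observed" % leaders)
    # A mismatch between the two counters means some followers are out of sync.
    if (followers is not None and synced_followers is not None
            and followers != synced_followers):
        alerts.append("%d of %d followers not synced"
                      % (followers - synced_followers, followers))
    return alerts
```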

zk_outstanding_requests

Note: normally this value should stay at 0; there should be no queued, unprocessed requests.

Importance: high

zk_pending_syncs

Note: normally this value should stay at 0; there should be no pending syncs.

Importance: high

Capacity

zk_znode_count

Note: the more znodes, the greater the pressure on the cluster and the more sharply performance degrades.

Importance: high

Experience: do not exceed 1 million

Suggestion: when there are too many znodes, consider splitting the cluster along data center / region / business-line dimensions.

zk_approximate_data_size

Note: when the snapshot is too large, a restarted ZK node cannot synchronize the full snapshot within the initLimit window and therefore cannot rejoin the cluster.

Importance: high

Experience: do not exceed 1 GB

Suggestion: do not use ZK as a file storage system

zk_open_file_descriptor_count / zk_max_file_descriptor_count

Description: when these two values are equal, the cluster can no longer accept or process new requests

Importance: high

Suggestion: modify /etc/security/limits.conf to raise the file-handle limit of the account running ZK to 1 million
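A sketch of that limits.conf change, assuming ZK runs under an account named "zookeeper" (adjust the account name to your deployment; the service must be restarted under a fresh login session to pick up the new limit, and services managed by systemd need LimitNOFILE in the unit file instead):

```ini
# /etc/security/limits.conf
zookeeper  soft  nofile  1000000
zookeeper  hard  nofile  1000000
```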

zk_watch_count

Note: a large number of watches increases the notification pressure on ZK when data changes.

Importance: medium

Traffic

zk_packets_received / zk_packets_sent

Description: the number of packets received/sent by a ZK node; each node's value differs, and the cluster-wide total is obtained by summing.

Suggestion: these are cumulative counters, so take the difference between two executions 1 s apart to obtain the rate

Importance: medium
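The differencing suggestion above can be sketched as follows; the inputs are assumed to be two metric dicts parsed from mntr, sampled interval_s apart, and the function name is illustrative:

```python
def packet_rates(sample1, sample2, interval_s=1.0):
    """Compute per-second packet rates from two 'mntr' samples.

    zk_packets_received / zk_packets_sent are cumulative counters, so the
    difference over the sampling interval gives the traffic rate.
    """
    rates = {}
    for key in ("zk_packets_received", "zk_packets_sent"):
        if key in sample1 and key in sample2:
            rates[key] = (sample2[key] - sample1[key]) / interval_s
    return rates
```

Note that these counters reset when the node restarts, so a production collector should discard negative differences.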

zk_num_alive_connections

Description: the number of client connections on a ZK node; each node's value differs, and the cluster-wide total is obtained by summing.

Suggestion: this is a gauge of current connections; sample it directly and watch for abrupt changes

Importance: medium

Latency

zk_avg_latency / zk_max_latency / zk_min_latency

Note: watch for drastic changes in average latency; if the business has explicit latency requirements, set a concrete threshold.

Other monitoring

Process monitoring (JVM monitoring)

Port monitoring

Log monitoring

Host monitoring

TIPS

Zookeeper four-letter commands

mntr

stat

That is what the Zookeeper cluster operation and maintenance guide looks like. We believe it covers knowledge points you may see or use in your daily work, and we hope you have learned something from this article.
