2025-02-23 Update From: SLTechnology News&Howtos
This article shares a practical guide to Zookeeper cluster operation and maintenance. We hope you find something useful in it.
This is part of the "Monitoring" series.
Monitoring is a very important part of operation and maintenance work: it lets you judge the health of a service, locate problems, and see into the internal state of the system. This series shares JD Cloud's best practices in service monitoring.
In this issue, we focus on Zookeeper cluster monitoring.
Zookeeper (hereafter ZK) is an open-source distributed coordination service for applications, an open-source implementation of Google's Chubby, and a core dependency of open-source software such as Hadoop and HBase. Through real ZK monitoring cases, this article introduces the important monitoring indicators of ZK. Enjoy:
Service failure cases
Capacity issues:
A follower fell out of sync, and restarting it manually did not help: the follower still could not rejoin the cluster. Suspecting a cluster-wide problem, the operators restarted the entire cluster; after the restart the cluster could not return to a normal state, no leader could be elected, and the service was down. In hindsight, the snapshot had grown to the GB level, while initLimit was left at the common default of 10 ticks (20s with tickTime=2000ms). After a restart, a follower could not synchronize GB-level data within 20s and was kicked out of the cluster, and the full-cluster restart turned this into a total collapse. Service was finally restored by manually copying the pre-failure leader's snapshot to all nodes and increasing the synchronization-related parameters in zoo.cfg.
In this case, the oversized snapshot was the root cause of the failure. We need to tune the initLimit and syncLimit parameters, regulate how the business uses ZK (avoid using ZK as a general-purpose file store), and add monitoring of the snapshot size (zk_approximate_data_size), alarming when it exceeds 1GB. Relatedly, too many znodes severely degrade cluster performance, so the znode count (zk_znode_count) of the ZK cluster should also be monitored, alarming above 100,000 nodes.
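As a rough illustration (not JD Cloud's actual tooling), the two capacity alarms above can be expressed over parsed `mntr` output. The helper names and threshold defaults are assumptions matching the thresholds suggested in this case:

```python
# Hypothetical sketch: parse `mntr` output and alarm on the capacity
# thresholds suggested above (snapshot > 1 GB, znodes > 100,000).
# The alert thresholds are the article's suggestions, not ZK defaults.

def parse_mntr(text):
    """Parse the tab-separated key/value lines emitted by the mntr command."""
    metrics = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("\t")
        value = value.strip()
        # Numeric metrics become ints; strings (e.g. zk_version) stay as-is.
        metrics[key] = int(value) if value.lstrip("-").isdigit() else value
    return metrics

def capacity_alarms(metrics,
                    max_data_bytes=1 * 1024 ** 3,  # 1 GB snapshot ceiling
                    max_znodes=100_000):
    alarms = []
    if metrics.get("zk_approximate_data_size", 0) > max_data_bytes:
        alarms.append("snapshot size exceeds 1GB")
    if metrics.get("zk_znode_count", 0) > max_znodes:
        alarms.append("znode count exceeds 100000")
    return alarms
```

For example, an `mntr` response reporting 150,000 znodes and a 50 MB snapshot would trigger only the znode alarm.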
Resource issues:
A ZK cluster was deployed on the same physical machines as Hadoop. When Hadoop computing tasks increased, the machines' CPUs were saturated, the co-located ZK cluster could not respond to external requests, and every Hadoop service that depended on ZK went down with it. Besides CPU, ZK also depends on local disk space, disk IO capacity, network bandwidth, and so on. Given this, it is recommended to deploy ZK clusters on dedicated machines rather than co-locating them, and to monitor CPU/MEM/NET/IO on the ZK hosts to detect resource contention early.
Traffic issues:
A distributed system shipped a new client feature. A gradual rollout over a few days surfaced no problems, so the client was rolled out in full on a single day. All clients then requested the ZK cluster on the same schedule, the cluster could not handle that volume of requests, and it crashed outright; the client had to be rolled back in full. Although the cluster was configured so the leader did not serve client requests, and the maximum number of concurrent connections per client IP was limited, neither measure prevented the crash under that flood of requests.
In this case, if traffic-related monitoring had been added early, such as ZK node connection counts (zk_num_alive_connections) and ZK node traffic (zk_packets_received / zk_packets_sent), the sudden surge in cluster traffic could have been detected in advance.
Service exception:
Follower failures were not handled in time, so the number of failed followers in a single cluster exceeded what the cluster could tolerate, and the cluster went down completely. The failed followers had to be repaired immediately, but some could not be restored quickly due to hardware faults, and most business clients connected directly to node IPs, so they could not be repointed quickly either. The cluster was still under heavy load; even forcing the remaining node into standalone mode required rate limiting. However it was handled, the service was degraded for an extended period.
In this case, if follower-related monitoring such as zk_followers / zk_synced_followers and zk_server_state had been added early, the failed followers would have been repaired at the first alarm and this outage would not have happened.
Capacity issues:
The ZK cluster's file-handle limit was left at the system default of 10240, but the actual load required far more, so ZK could not handle some new requests, and locating the problem was costly and slow. Once found, it was solved by raising the file-handle limit for the account running ZK and restarting the service.
In this case, the problem could have been avoided if zk_open_file_descriptor_count / zk_max_file_descriptor_count monitoring had been added early. File-handle exhaustion trips up much open-source software and has caused major failures in many systems, so it deserves caution.
Isolation problems:
A single ZK cluster provided coordination service across all regions, so when it failed, the service became unavailable nationwide. The fix is to split the ZK cluster and deploy a separate cluster per region, limiting the blast radius of any failure to a single region. Monitoring is not the main issue or the solution here; this case is included to give a more complete picture of ZK cluster failures.
Operation and maintenance dashboard
Collection item screening
The failure cases above illustrate the importance of a few core indicators. Next, we systematically organize and summarize ZK monitoring following Google's SRE monitoring methodology:
Black box monitoring
Cluster function
Create / delete / read nodes
Note: periodically create and delete nodes under the /zookeeper_monitor node to verify that this capability is available
Suggestion: create a dedicated /zookeeper_monitor node rather than using business nodes, to avoid mutual interference
Experience: issue simulated user requests against at least 3 nodes, so that all nodes of the ZK cluster are covered
Read / update content
Note: periodically read and update content under the /zookeeper_monitor node
Suggestion: write a timestamp as the content, which makes it easy to judge write latency
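The timestamp suggestion can be sketched with two small helpers. Only the payload and delay arithmetic is shown; the actual znode write and read would go through whichever ZK client library you use, and clock skew between writer and reader is ignored for simplicity:

```python
import time

# Minimal sketch of the timestamp probe suggested above: the prober writes
# the current time into the /zookeeper_monitor node, and a reader compares
# the value it gets back against its own clock to estimate write-to-read delay.
# Function names and the millisecond encoding are illustrative choices.

def make_probe_payload(now=None):
    """Value to store in the monitor znode: epoch milliseconds as bytes."""
    now = time.time() if now is None else now
    return str(int(now * 1000)).encode()

def write_delay_ms(payload, now=None):
    """Estimated delay between when the payload was written and now."""
    now = time.time() if now is None else now
    written_ms = int(payload.decode())
    return int(now * 1000) - written_ms
```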
White box monitoring
Collection mode
Method 1: the ZooKeeper four-letter command mntr
Method 2: the JMX interface
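For method 1, `mntr` can be collected over the client port using only the standard library. Note that on newer ZooKeeper versions the command must be whitelisted via `4lw.commands.whitelist`; the host and port below are illustrative defaults:

```python
import socket

# Minimal mntr collector over the ZooKeeper client port. Assumes the
# four-letter command is enabled on the server. Host/port are examples.

def fetch_mntr(host="127.0.0.1", port=2181, timeout=3.0):
    """Send the mntr four-letter command and return the raw response text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()

def mntr_to_dict(text):
    """Split each 'key<TAB>value' response line into a dict of strings."""
    return dict(line.split("\t", 1)
                for line in text.strip().splitlines() if "\t" in line)
```

The same key/value shape comes back from the JMX interface as MBean attributes, so downstream alarm logic can be shared between the two collection methods.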
Error
zk_server_state
Note: a cluster must have exactly one leader. With no leader the cluster cannot work properly; two or more leaders indicate split brain, which leads to data inconsistency.
Importance: high
zk_followers / zk_synced_followers
Note: if these two values differ, some followers are abnormal and must be handled immediately. Many serious incidents start from too many failed followers in a single cluster.
Importance: high
zk_outstanding_requests
Note: this value should normally stay at 0; there should be no queued, unprocessed requests.
Importance: high
zk_pending_syncs
Note: this value should normally stay at 0; there should be no unsynchronized data.
Importance: high
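The four error indicators above can be tied together in a single health check, sketched here under the assumption that metrics have already been collected from every node; the input shape (a list of per-node metric dicts) and the alarm strings are illustrative:

```python
# Sketch of the "Error" checks described above. Field names match mntr
# output; missing fields default to healthy values.

def check_cluster_errors(node_metrics):
    alarms = []
    leaders = [m for m in node_metrics
               if m.get("zk_server_state") == "leader"]
    if len(leaders) == 0:
        alarms.append("no leader: cluster cannot serve writes")
    elif len(leaders) > 1:
        alarms.append("multiple leaders: possible split brain")
    # Follower counts are reported by the leader only.
    for m in leaders:
        followers = int(m.get("zk_followers", 0))
        synced = int(m.get("zk_synced_followers", 0))
        if followers != synced:
            alarms.append(f"{followers - synced} follower(s) out of sync")
    for m in node_metrics:
        if int(m.get("zk_outstanding_requests", 0)) > 0:
            alarms.append("outstanding requests queued on a node")
        if int(m.get("zk_pending_syncs", 0)) > 0:
            alarms.append("pending syncs on the leader")
    return alarms
```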
Capacity
zk_znode_count
Note: the more znodes, the greater the load on the cluster, and performance drops sharply beyond a point.
Importance: high
Experience: keep it under 1 million
Suggestion: when there are too many nodes, consider splitting clusters by data center / region / business line.
zk_approximate_data_size
Note: when the snapshot is too large, a restarted ZK node cannot synchronize the full snapshot within the initLimit time and therefore cannot rejoin the cluster.
Importance: high
Experience: keep it under 1GB
Suggestion: do not use ZK as a file storage system
zk_open_file_descriptor_count / zk_max_file_descriptor_count
Description: when these two values are equal, the cluster cannot accept or process new requests
Importance: high
Suggestion: modify /etc/security/limits.conf to raise the file-handle limit for the account running ZK (for example, to 1 million)
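A corresponding alarm can warn before the two values meet; the 85% headroom threshold below is an assumption to tune for your environment:

```python
# Hypothetical file-descriptor headroom alarm: fire before open handles
# reach the process limit. Field names match mntr output.

def fd_alarm(metrics, threshold=0.85):
    used = int(metrics.get("zk_open_file_descriptor_count", 0))
    limit = int(metrics.get("zk_max_file_descriptor_count", 0))
    if limit <= 0:
        return None  # metric missing or malformed; nothing to judge
    ratio = used / limit
    return f"fd usage at {ratio:.0%} of limit" if ratio >= threshold else None
```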
zk_watch_count
Note: a large number of watches increases the notification load on ZK whenever data changes.
Importance: medium
Flow
zk_packets_received / zk_packets_sent
Description: the number of packets received/sent by a ZK node; the value differs per node, and the cluster-wide value is the sum across nodes.
Suggestion: sample twice, 1s apart, and take the difference to get a per-second rate
Importance: medium
zk_num_alive_connections
Description: the number of client connections on a ZK node; the value differs per node, and the cluster-wide value is the sum across nodes.
Suggestion: sample twice, 1s apart, and take the difference to detect sudden changes
Importance: medium
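The sample-twice suggestion reduces to a small counter-rate helper, sketched here; treating a negative delta as a server restart is a simplifying assumption:

```python
# Per-second rate from two samples of a cumulative counter such as
# zk_packets_received. A negative delta usually means the server
# restarted and the counter reset, so it is clamped to zero here.

def counter_rate(prev_value, curr_value, interval_s=1.0):
    """Rate of change per second between two counter samples."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    delta = curr_value - prev_value
    return max(delta, 0) / interval_s
```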
Delay time
zk_avg_latency / zk_max_latency / zk_min_latency
Note: watch for drastic changes in average latency; if the business has explicit latency requirements, set a specific threshold.
Other monitoring
Process monitoring (JVM monitoring)
Port monitoring
Log monitoring
Host monitoring
TIPS
ZooKeeper four-letter commands
mntr
stat
That concludes the Zookeeper cluster operation and maintenance guide. We hope the cases and indicators above prove useful in your daily work.