
How to check whether a machine is down

2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to detect whether a machine is down and why the problem is harder than it looks. It should be a useful reference for anyone working on clustered systems.

Downtime detection is needed in scenarios such as the following:

1. When a worker machine goes down, the master node must detect the failure and migrate the services it was running to other nodes in the cluster.

2. When the master node goes down, its backup node (commonly called the slave) must detect the failure and take over as master so that service continues.

Downtime detection must be reliable. In a large cluster, machines can fail in many ways: power outages, disk failures, hangs caused by overload (a machine that looks dead but is not), and so on. The hang case is the dangerous one: if the master node decides that the machine is down and migrates its services to other nodes while the hung machine still believes it may serve, several nodes end up serving the same data, which leads to data inconsistency.

First of all, it must be clear that, in theory, it is impossible to reliably determine whether another machine is down. Interested readers can refer to the paper by Fischer, Lynch, and Paterson on the impossibility of distributed consensus with one faulty process. The intuition is as follows: machine A sends heartbeats to machine B. If B does not respond, A cannot tell whether B is down or merely too busy to answer. And because the clocks of A and B may not be synchronized, B cannot determine how long it has gone without a heartbeat from A, so it cannot know when it must stop serving. Therefore A has no way either to establish that B is down or to force B to stop serving.
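The ambiguity described above can be made concrete with a small simulation. This is only an illustrative sketch: the function `verdict`, the 3-second timeout, and the reply timestamps are all made up for the example. It shows that at the moment A has to decide, a crashed B and a stalled B have produced exactly the same history.

```python
def verdict(reply_times, timeout, now):
    """A's view of B, given the local times at which B's replies reached A."""
    seen = [t for t in reply_times if t <= now]   # A cannot see future replies
    last = max(seen, default=float("-inf"))
    # Silence only justifies suspicion: B may be crashed, overloaded,
    # or cut off by the network.
    return "suspected" if now - last > timeout else "alive"


# Scenario 1: B crashed shortly after t=10, so no reply ever follows.
crashed = [2.0, 6.0, 10.0]
# Scenario 2: B is alive but stalled; its next reply only arrives at t=20.
stalled = [2.0, 6.0, 10.0, 20.0]

# At t=15 both scenarios look identical to A.
at_15 = (verdict(crashed, 3.0, 15.0), verdict(stalled, 3.0, 15.0))
# At t=21 the stalled machine turns out to have been alive all along.
at_21 = (verdict(crashed, 3.0, 21.0), verdict(stalled, 3.0, 21.0))
print(at_15, at_21)
```

If A had migrated B's services at t=15, the second scenario would have ended with two machines serving the same data.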

Of course, in engineering practice clocks are synchronized between machines, so we can assume that the local clocks of A and B differ only slightly, for example by less than 0.5 seconds. Under that assumption, downtime can be detected with the Lease mechanism. A Lease is an authorization with a timeout. Suppose the master node needs to detect whether a worker node is down. The master grants a Lease to the worker, and the worker may serve only while it holds an unexpired Lease; otherwise it takes itself offline and stops serving. When its Lease is about to expire, the worker asks the master for a new one (commonly called renewLease). The master periodically checks whether every worker's Lease is still valid. If a worker's Lease has expired, the master can migrate that worker's services to other machines in the cluster, and the worker itself will already have stopped serving because it sees that its own Lease has expired. Note that because the clocks of the master and the worker may differ and there is network delay, the master must use a longer timeout than the worker: if the worker's Lease timeout is 12 seconds, the master might wait 13 seconds before concluding that the worker has stopped serving, which avoids data inconsistency.
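A minimal sketch of the two sides of a Lease follows, using the 12-second and 13-second figures from the paragraph above. The class and method names are hypothetical, time is passed in explicitly so the example is deterministic, and counting the worker's Lease from the moment it sent the request is an assumption made here to keep the worker on the conservative side.

```python
from dataclasses import dataclass, field

WORKER_LEASE_SECONDS = 12.0   # lease length as seen by the worker
MASTER_MARGIN_SECONDS = 1.0   # extra wait on the master for clock skew and network delay


@dataclass
class Worker:
    lease_expiry: float = 0.0

    def on_lease_granted(self, request_sent_at: float) -> None:
        # Assumption: count from when the request was sent (worker clock),
        # so the worker never believes the lease outlives the master's view.
        self.lease_expiry = request_sent_at + WORKER_LEASE_SECONDS

    def may_serve(self, now: float) -> bool:
        return now < self.lease_expiry


@dataclass
class Master:
    expiry: dict = field(default_factory=dict)

    def grant(self, worker_id: str, now: float) -> None:
        # The master waits longer than the worker before declaring expiry.
        self.expiry[worker_id] = now + WORKER_LEASE_SECONDS + MASTER_MARGIN_SECONDS

    def expired_workers(self, now: float) -> list:
        return [w for w, t in self.expiry.items() if now >= t]


master, worker = Master(), Worker()
worker.on_lease_granted(request_sent_at=0.0)
master.grant("w1", now=0.0)

# Between t=12 and t=13 the worker has already stopped serving, but the
# master has not yet migrated anything, so nobody serves the data twice.
print(worker.may_serve(12.5), master.expired_workers(12.5))
print(worker.may_serve(13.5), master.expired_workers(13.5))
```

The one-second gap is the safety margin: the service is briefly unavailable, which is the price paid for never having two machines serve the same data.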

Leader election among peer nodes has the same downtime detection problem. For example, if the master node goes down, the backup node must detect this and promote itself to master to continue serving. MySQL deployments often use a Heartbeat + DRBD (Distributed Replicated Block Device) + MySQL high-availability scheme, which is said to achieve three nines of availability. The primary and standby nodes exchange heartbeats through Heartbeat. When the serving primary fails, Heartbeat on the standby notices that the primary has no heartbeat (for example, the primary cannot be pinged), and the standby automatically takes over the virtual IP and promotes itself to primary to provide MySQL read and write service. Because heartbeats cannot reliably detect that the primary is down, this scheme suffers from the well-known split-brain problem: several nodes in the cluster may act as primary at the same time. Solving it requires an arbiter, such as a fencing node in the Heartbeat + DRBD scheme that isolates the faulty node from the cluster, or a distributed lock service such as Chubby or its open-source counterpart ZooKeeper. Primary election with a distributed lock service works roughly as follows: the primary and standby nodes compete for a lock in the lock service (Chubby, for example), and the node that obtains the lock serves for the lock's validity period (the Lease period). When the primary's lock Lease is about to expire, the primary asks to extend it. Under normal conditions the lock service always gives priority to the primary's request; when the primary fails, the standby can obtain the lock and take over as primary.
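The lock-based election can be sketched with an in-memory stand-in for the lock service. `LeaseLock` and `try_acquire` are hypothetical names invented for this example and are not the API of Chubby or ZooKeeper; the 10-second lock Lease is likewise an assumption.

```python
class LeaseLock:
    """In-memory stand-in for a distributed lock with a lease (illustrative only)."""

    def __init__(self, lease_seconds: float):
        self.lease_seconds = lease_seconds
        self.holder = None
        self.expiry = 0.0

    def try_acquire(self, node: str, now: float) -> bool:
        # The lock is available if nobody holds it or the holder's lease ran out.
        # The current holder may call this again to extend its lease.
        if self.holder is None or now >= self.expiry or self.holder == node:
            self.holder = node
            self.expiry = now + self.lease_seconds
            return True
        return False


lock = LeaseLock(lease_seconds=10.0)
events = [
    lock.try_acquire("primary", now=0.0),    # primary wins the lock
    lock.try_acquire("standby", now=5.0),    # standby is refused
    lock.try_acquire("primary", now=8.0),    # primary extends to t=18
    lock.try_acquire("standby", now=17.0),   # still refused
    # The primary crashes and stops extending; the lease runs out at t=18.
    lock.try_acquire("standby", now=18.0),   # standby takes over
]
print(events, lock.holder)
```

Because only the lock holder serves, and only within the lock's validity period, the two nodes cannot both act as primary, provided each node stops serving once its own view of the Lease has expired.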

There is one more problem. Detecting worker failure from the master with the Lease mechanism is reliable. But if the master itself goes down and nothing is done, every worker in the cluster stops serving because it cannot renew its Lease. This is an inherent weakness of designs with a master node: a single design or coding error can have serious consequences. The usual remedy is a mechanism called the Grace Period. When a worker's Lease times out it stops serving, but it does not restart or go offline right away. Instead it enters a dangerous state (called Jeopardy) that lasts for one Grace Period, for example 45 seconds. If the master restarts within the Grace Period, the worker re-establishes contact with it and switches back to the normal state to continue serving.
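The worker's states can be written as a small function of time. This is a sketch under assumptions: the names `State` and `worker_state` are invented here, the 45-second Grace Period comes from the example above, and the 12-second Lease is carried over from the earlier Lease discussion.

```python
from enum import Enum

GRACE_PERIOD_SECONDS = 45.0


class State(Enum):
    NORMAL = "normal"       # lease valid, serving
    JEOPARDY = "jeopardy"   # lease expired, service paused, waiting for the master
    OFFLINE = "offline"     # grace period also expired, worker gives up


def worker_state(now: float, lease_expiry: float,
                 grace: float = GRACE_PERIOD_SECONDS) -> State:
    if now < lease_expiry:
        return State.NORMAL
    if now < lease_expiry + grace:
        return State.JEOPARDY
    return State.OFFLINE


lease_expiry = 12.0                      # lease granted at t=0 for 12 seconds
print(worker_state(5.0, lease_expiry))   # still within the lease
print(worker_state(30.0, lease_expiry))  # master unreachable, within the grace period
print(worker_state(57.0, lease_expiry))  # 12 + 45 seconds have passed

# If the master comes back at t=30 and grants a fresh 12-second lease,
# the worker returns to the normal state without a restart.
renewed_expiry = 30.0 + 12.0
print(worker_state(31.0, renewed_expiry))
```

Note that the worker does not serve while in Jeopardy. The Grace Period only spares it a restart or removal from the cluster, so a master that recovers quickly finds its workers ready to resume.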

For a deeper understanding of downtime detection and election, read and think through the Paxos-related papers, such as Paxos Made Simple, The Part-Time Parliament, Paxos Made Live, Paxos Made Practical, and the Chubby paper. If you have any questions, you are welcome to discuss.
