Introduction to High Availability of Linux Clusters

Updated 2025-07-15. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/01 report

This article introduces high availability for Linux clusters: how availability is measured and classified, and how the cluster manager and the nodes are each made highly available.

The reliability of a computer system is measured by the mean time to failure (MTTF), that is, how long the system runs normally, on average, before a failure occurs. The higher the reliability of the system, the longer the MTTF. Maintainability is measured by the mean time to repair (MTTR), that is, the average time it takes to repair the system and return it to normal operation after a failure. The better the maintainability of the system, the shorter the MTTR. The availability of a computer system is defined as MTTF / (MTTF + MTTR) * 100%. In other words, availability is the percentage of time that the system remains up and running.
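The definition above can be written as a small function. This is only a minimal sketch: the function name and the choice of hours as the unit are assumptions made for illustration, and any time unit works as long as both arguments use the same one.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return availability as a percentage: MTTF / (MTTF + MTTR) * 100."""
    if mttf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTTF and MTTR must be non-negative")
    if mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF + MTTR must be greater than zero")
    return mttf_hours / (mttf_hours + mttr_hours) * 100.0


# Hypothetical system: fails on average every 999 hours, takes 1 hour to repair.
example = availability(999.0, 1.0)  # roughly 99.9 (percent)
```

Note that halving the MTTR improves availability just as effectively as doubling the MTTF, which is why fast automatic recovery matters as much as reliable hardware.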

The computer industry usually classifies the availability of computer systems by the number of nines in the availability figure, as shown in the following table.

Availability classification                     Availability level (%)   Annual downtime
Fault-tolerant availability                     99.9999                  < 1 min
Extremely high availability                     99.999                   5 min
Availability with automatic failure recovery    99.99                    53 min
High availability                               99.9                     8.8 h
Commodity availability                          99                       87.6 h
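The annual downtime for a given availability level is simply the unavailable fraction of a year. The sketch below shows the arithmetic; the function names are illustrative, and a year is assumed to be 365 days (8760 hours), ignoring leap years.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; assumption: leap years are ignored


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the expected downtime per year, in hours."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the expected downtime per year, in minutes."""
    return annual_downtime_hours(availability_percent) * 60.0


three_nines = annual_downtime_hours(99.9)      # about 8.8 hours
five_nines = annual_downtime_minutes(99.999)   # about 5 minutes
```

Each additional nine cuts the permitted downtime by a factor of ten.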

The availability of a system can be greatly improved through hardware redundancy or through software. Hardware redundancy keeps multiple redundant components in the system, such as hard disks and network cables, so that when a working component fails, service continues on a redundant one. The software approach monitors the running status of the machines in the cluster and, when a machine fails, starts a standby machine to take over the work of the failed one.
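A rough way to see why redundancy helps is to compute the availability of several redundant components, where service continues as long as at least one of them works. The sketch below assumes the components fail independently of each other, which real hardware only approximates; the function name is illustrative.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability (as a fraction, 0..1) of `copies` redundant components.

    Assumption: failures are independent, so the group is down only when
    every copy is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("at least one component is required")
    return 1.0 - (1.0 - component_availability) ** copies


# Two components that are each 99% available give roughly 99.99% together.
pair = redundant_availability(0.99, 2)
```

Under this assumption, a second copy of a 99% component adds two nines, which is why a backup machine is such a common way to reach a higher availability class.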

In general, both the cluster manager and the nodes must be made highly available. Eddie, Linux Virtual Server, Turbolinux, Piranha, and Ultramonkey all adopt a high availability solution similar to the one shown in Figure 1.

Figure 1. Schematic diagram of the high availability solution

High availability of cluster manager

To mask failures of the cluster manager, a backup machine is set up for it. Both the primary manager and the backup manager run heartbeat programs that monitor each other's health by exchanging messages such as "I'm alive". When the backup manager does not receive such a message within a certain period of time, it activates the fake program, takes over from the primary manager, and continues to provide services. When the backup manager again receives an "I'm alive" message from the primary manager, it deactivates the fake program and releases the IP address, so that the primary manager resumes managing the cluster.
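The takeover logic on the backup manager can be sketched as a small state machine. This is not the code of the heartbeat or fake programs; the class, its method names, and the timeout value are hypothetical, and the places where the real system would start or stop fake and claim or release the IP address are marked only with comments.

```python
from dataclasses import dataclass


@dataclass
class BackupManager:
    timeout: float               # seconds of silence tolerated (hypothetical value)
    last_heartbeat: float = 0.0  # time of the last "I'm alive" message
    active: bool = False         # True while the backup is serving in place of the primary

    def on_heartbeat(self, now: float) -> None:
        """The primary (master) manager said "I'm alive"."""
        self.last_heartbeat = now
        if self.active:
            # Stands in for deactivating fake and releasing the IP address.
            self.active = False

    def check(self, now: float) -> None:
        """Take over if no heartbeat arrived within the timeout."""
        if not self.active and now - self.last_heartbeat > self.timeout:
            # Stands in for activating fake and taking over the service.
            self.active = True


backup = BackupManager(timeout=3.0)
backup.on_heartbeat(0.0)
backup.check(2.0)         # still within the timeout: backup stays passive
backup.check(5.0)         # 5 s of silence: backup takes over
backup.on_heartbeat(6.0)  # primary is back: backup steps down
```

Passing the current time in as an argument, instead of reading the clock inside the class, keeps the sketch deterministic and easy to test.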

High availability of nodes

High availability of the nodes is achieved by constantly monitoring the status of each node and of the applications running on it. When a node is found to have failed, the system is reconfigured and the workload is handed over to the nodes that are still running normally. As shown in Figure 1, the system monitors the health of the real servers in the cluster, and of the services on them, by running the mon daemon on the cluster manager. For example, fping.monitor checks at regular intervals whether a real server is still running, http.monitor checks HTTP services, ftp.monitor checks FTP services, and so on. If a real server, or a service on it, is found to have failed, all rules for that server are deleted from the cluster manager. Conversely, once the server is found to be able to provide services again, the corresponding rules are added back. In this way, the cluster manager automatically masks failures of the servers and of the services running on them, and returns the servers to the cluster once they are running normally again.
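The remove-and-restore behaviour described above can be sketched as one reconciliation pass over the real servers. This is not mon's configuration or API; the function, the server names, and the health-check callables are hypothetical stand-ins for monitors such as fping.monitor or http.monitor, and the set of server names stands in for the forwarding rules held by the cluster manager.

```python
from typing import Callable, Dict, Set


def reconcile(checks: Dict[str, Callable[[], bool]], in_service: Set[str]) -> Set[str]:
    """Return the servers that should have rules after one monitoring pass.

    `checks` maps a server name to a health check that returns True when
    the server and its service respond.
    """
    updated = set(in_service)
    for server, is_healthy in checks.items():
        if is_healthy():
            updated.add(server)      # stands in for adding the rules back
        else:
            updated.discard(server)  # stands in for deleting the rules
    return updated


# Hypothetical cluster: rs2 is down on the first pass and recovers on the second.
state = {"rs1": True, "rs2": False}
checks = {name: (lambda n=name: state[n]) for name in state}

rules = reconcile(checks, {"rs1", "rs2"})  # rs2 is removed
state["rs2"] = True
rules = reconcile(checks, rules)           # rs2 is restored
```

In a real deployment this pass would run at a regular interval, and a server would typically be restored only after its check has succeeded consistently, to avoid flapping.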
