How to do a health check for a highly available system


2025-04-16 Update. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/03 report


I. Preface

As living standards rise, people pay more and more attention to their health. Many people have had a physical examination, and companies commonly offer an annual one as a benefit, so the idea is familiar to almost everyone.

With the rapid development of the Internet, competition among homogeneous products keeps growing, and user experience is an important differentiator. Besides product design, the technical level also shapes user experience, mainly through the availability and response speed of the service. Improving both requires corresponding means, and the health check is an important prerequisite for keeping a service available and responsive.

II. What is a health check

A medical health examination uses medical methods to assess a person's health and to detect early signs of disease and health risks. A system health check uses technical means to detect whether objects such as networks, hosts, applications, and services are healthy or available.

III. Why do you need a health check?

Internet products place high demands on user experience, yet technical problems such as slow responses or unavailable services often interrupt the business and reduce revenue. The company's brand and reputation also suffer.

Many factors can make a service slow or unavailable: server hardware may be damaged, an optical fiber may be dug up, a surge of requests may drive database CPU load or disk I/O too high, or a colleague may have planted a latent bug so that a newly released feature runs out of memory (OOM) the first time it runs.

What should we do to keep the system highly available? A common answer is to add redundant nodes so that no single node is a point of failure, and that is indeed a standard technique. An important premise, however, is being able to find the problem node, so that it can be removed or its traffic switched to healthy nodes.

Finding the problem node is exactly what the system health check has to do.

IV. How to do a health check

Before discussing how to do a health check, you first need to decide what to check. The object can be a network connection, a small functional component, a process, a service cluster, or a whole data center unit. To achieve high availability, first work out at which level it is needed and which objects may be single points of failure, so that the objects to check are clear.

So, how do you do a health check? There are usually two ways: active and passive.

4.1 Active mode

In active mode the checker takes the initiative and periodically sends health check requests. The content or format of the request is usually designed specifically for the check, and the checked object returns a response after a simple self-test. For example:

check interval=3000 rise=2 fall=5 timeout=1000 type=http;
check_http_send "HEAD /check.do HTTP/1.0\r\n\r\n";
check_http_expect_alive http_2xx http_3xx;

This configuration sends a check request to the http://(ip:port)/check.do interface of the backend web server every 3000 milliseconds (3 seconds), with a timeout of 1000 milliseconds per request. If the number of consecutive failures reaches fall=5, the server is considered down. If the number of consecutive successes reaches rise=2, the server is considered up and healthy. The response status code must be 2xx or 3xx for a check to count as a success.
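The rise/fall counting described above can be sketched as a small state machine. This is a minimal illustration only, not the code of any particular load balancer; the class name RiseFallTracker and its methods are hypothetical.

```java
/** Sketch: derives up/down state from consecutive probe results (rise/fall style). */
final class RiseFallTracker {
    private final int rise;   // consecutive successes needed to mark the server up
    private final int fall;   // consecutive failures needed to mark the server down
    private boolean up;
    private int successes;
    private int failures;

    RiseFallTracker(int rise, int fall, boolean initiallyUp) {
        this.rise = rise;
        this.fall = fall;
        this.up = initiallyUp;
    }

    /** Records one probe result and returns the resulting state. */
    boolean record(boolean probeSucceeded) {
        if (probeSucceeded) {
            successes++;
            failures = 0;               // a success breaks any failure streak
            if (!up && successes >= rise) {
                up = true;
            }
        } else {
            failures++;
            successes = 0;              // a failure breaks any success streak
            if (up && failures >= fall) {
                up = false;
            }
        }
        return up;
    }

    boolean isUp() {
        return up;
    }

    public static void main(String[] args) {
        RiseFallTracker tracker = new RiseFallTracker(2, 5, true);
        for (int i = 0; i < 5; i++) {
            tracker.record(false);
        }
        System.out.println("up after 5 failures: " + tracker.isUp());
        tracker.record(true);
        tracker.record(true);
        System.out.println("up after 2 successes: " + tracker.isUp());
    }
}
```

Requiring several consecutive results before changing state keeps a single lost or slow probe from flapping the server in and out of rotation.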

4.2 Passive mode

A passive health check does not use a dedicated health check request. Instead, it uses the outcome of normal connections or business requests as the indicator of the object's health. For example, the passive health check configuration of the official open source version of nginx:

server 127.0.0.1:8080 max_fails=3 fail_timeout=30s;

Nginx bases this check on the attempts it makes while serving real traffic. If three attempts to communicate with the backend fail within 30 seconds, the backend web service is considered unavailable, and it stays marked unavailable for the same fail_timeout period of 30 seconds.
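The max_fails and fail_timeout idea can be modelled roughly as follows. This is a simplified sliding-window sketch under stated assumptions, not nginx's actual implementation; the class PassiveFailureTracker is hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch: marks a backend unavailable after maxFails failures within failTimeoutMillis. */
final class PassiveFailureTracker {
    private final int maxFails;
    private final long failTimeoutMillis;
    private final Deque<Long> failureTimes = new ArrayDeque<>();
    private long unavailableUntil = Long.MIN_VALUE;

    PassiveFailureTracker(int maxFails, long failTimeoutMillis) {
        this.maxFails = maxFails;
        this.failTimeoutMillis = failTimeoutMillis;
    }

    /** Called when a real business request to the backend fails. */
    void recordFailure(long nowMillis) {
        failureTimes.addLast(nowMillis);
        // Forget failures that are older than the window.
        while (!failureTimes.isEmpty()
                && nowMillis - failureTimes.peekFirst() >= failTimeoutMillis) {
            failureTimes.removeFirst();
        }
        if (failureTimes.size() >= maxFails) {
            // Take the backend out of rotation for one fail_timeout period.
            unavailableUntil = nowMillis + failTimeoutMillis;
            failureTimes.clear();
        }
    }

    boolean isAvailable(long nowMillis) {
        return nowMillis >= unavailableUntil;
    }

    public static void main(String[] args) {
        PassiveFailureTracker tracker = new PassiveFailureTracker(3, 30_000L);
        tracker.recordFailure(0L);
        tracker.recordFailure(10_000L);
        tracker.recordFailure(20_000L);
        System.out.println("available at 20s: " + tracker.isAvailable(20_000L));
        System.out.println("available at 50s: " + tracker.isAvailable(50_000L));
    }
}
```

No extra probe traffic is needed with this approach, but a failure is only noticed after real requests have already been affected.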

4.3 Eliminating single points of failure

As mentioned above, high availability requires eliminating single points of failure. The simplest and most direct solution is to provide a standby service node. When a scheduled heartbeat health check finds that the primary service node is down, the standby node takes over its work and the client switches its request traffic to the standby node.

The health check between the primary and standby nodes runs over a dedicated heartbeat line. Because of a network partition or similar problems, the two nodes may be unable to receive each other's heartbeats. The standby node then believes the primary is down, and the primary believes the standby is down. In fact both nodes are working normally and clients can reach both of them, so both accept writes ("double write"). This phenomenon is known in the industry as "split-brain".

Split-brain leads to inconsistent data and affects the correctness of the business. Introducing third-party arbitration can effectively avoid it.

4.4 Third-party arbitration

Since the primary and standby nodes cannot confirm each other's survival on their own, a third-party arbitration node can make the decision when there is a dispute. The arbitration node is generally implemented with a highly available service such as ZooKeeper.
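One way to picture arbitration is a lease held by at most one node at a time. The sketch below is an in-memory stand-in for the arbiter, written only to illustrate the idea; it does not use the ZooKeeper API, and the class LeaseArbiter and its method are hypothetical.

```java
/** Sketch: an arbiter that lets at most one node hold the "primary" lease at a time. */
final class LeaseArbiter {
    private final long leaseMillis;
    private String holder;      // node currently allowed to act as primary, or null
    private long expiresAt;     // time at which the current lease lapses

    LeaseArbiter(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /**
     * Grants the lease if it is free or expired, or renews it for the current holder.
     * Returns true if nodeId may act as primary.
     */
    synchronized boolean tryAcquire(String nodeId, long nowMillis) {
        boolean free = holder == null || nowMillis >= expiresAt;
        if (free || holder.equals(nodeId)) {
            holder = nodeId;
            expiresAt = nowMillis + leaseMillis;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        LeaseArbiter arbiter = new LeaseArbiter(10_000L);
        System.out.println("A acquires at 0s: " + arbiter.tryAcquire("A", 0L));
        System.out.println("B acquires at 5s: " + arbiter.tryAcquire("B", 5_000L));
        System.out.println("B acquires at 10s: " + arbiter.tryAcquire("B", 10_000L));
    }
}
```

Because both nodes ask the same arbiter, a broken heartbeat line between them no longer lets both promote themselves. A node that cannot reach the arbiter should stop accepting writes once its lease expires.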

V. Health check examples

5.1 Network equipment

Keepalived is service software that ensures the high availability of clusters. Its function is similar to Heartbeat, and it is used to prevent single points of failure. It generally does not appear alone but works together with load balancing technologies (such as LVS, HAProxy, and Nginx) to make the cluster highly available.

Its health check covers two aspects. One is the health check between Keepalived components, carried out through VRRP heartbeat messages.

[Figure: health check between Keepalived components through VRRP heartbeat messages]

The other is the health check between the Keepalived component and the local load balancer component, which is configured as follows:

vrrp_script check_nginx_running {
    script "/usr/local/bin/check_running"   # script that performs the check
    interval 10                              # seconds between script runs
    weight -10                               # priority adjustment applied when the script fails
}

Here the health check of the application itself is implemented by a custom script.

The health check between Keepalived components uses the VRRP protocol. If the primary server goes down, the standby server is elected as the new primary through VRRP and takes over the virtual IP from the old primary, which achieves high availability.

VRRP messages are carried directly in IP packets, so the protocol is independent of the upper-layer protocols whose traffic it protects. Network devices such as switches, routers, and firewalls commonly use VRRP for failover between primary and standby.

When a network device fails, the VRRP mechanism selects a new device to carry the data traffic, which keeps network communication reliable.

5.2 Network connection

Mobile devices connect to the Internet through NAT. Push notifications for a mobile App require a long-lived connection to the server, but most mobile network operators remove a connection's entry from the NAT table when no data has been exchanged for a period of time, which interrupts the connection. To keep the connection "healthy", the App and the server periodically exchange Ping/Pong heartbeat messages after the connection is established.

The above is a connection health check at the application layer. The operating system also supports a connection health check at the network level, namely TCP Keepalive. After a connection has been idle for a period of time, TCP Keepalive sends an empty probe segment, so that the connection is not dropped by intermediate network devices such as firewalls and a dead peer can be detected. Linux configures the idle time before the first probe, the interval between probes, and the number of unanswered probes after which the connection is considered dead, using the following three parameters:

net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9

5.3 Hosts and processes

The reachability between hosts can be identified by the Ping command, which uses the ICMP protocol and identifies the network connectivity of the entire path from the client to the target host. Ping is usually used to manually test whether a host is up and connected to the network.

ICMP is a network layer protocol and has nothing to do with any specific process, so ping cannot tell whether a process exists. A process, however, has a listening port and process information, so you can detect it by probing its port (for example with telnet) or with the ps command. A process may be killed or exit abnormally, for example because of insufficient memory. A script scheduled with cron can detect this and restart the process automatically. This solution is very effective for improving the availability of legacy applications that can only be deployed as a single instance.
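The port probe mentioned above (what a manual telnet test does) can be sketched with a plain TCP connect. The class PortProbe and the host and port in main are hypothetical; a successful connect only shows that something is listening on the port, not that the application behind it is fully healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Sketch: checks whether a TCP port accepts connections. */
final class PortProbe {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isPortOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical target: a local service expected on port 8080.
        boolean open = isPortOpen("127.0.0.1", 8080, 1000);
        System.out.println("port 8080 open: " + open);
    }
}
```

A cron-driven wrapper could call such a probe once a minute and restart the process when it returns false.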

5.4 Middleware: RocketMQ

NameServer is the routing center of RocketMQ. The service status and routing information of Producer cluster, Broker cluster and Consumer cluster are maintained in NameServer. When a new Consumer joins the cluster, it not only reports its own information, but also obtains the address, Topic, queue and other information of each Broker, so that it can know which Broker and queue its consumed Topic messages are stored on.

Multiple NameServer instances can be deployed. They are independent of each other and do not communicate with one another. Producer, Broker, and Consumer services are given the list of NameServer instances at startup and register their information with all of them at the same time, which achieves high availability.

Each Broker node maintains a long-lived TCP connection to every NameServer and sends a heartbeat message every 30 seconds to tell the NameServer that it is still alive. Each NameServer checks the last heartbeat time of every Broker every 10 seconds. If a Broker has not sent a heartbeat for more than 120 seconds, it is considered down: its network connection is closed and it is removed from the routing information.
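The heartbeat bookkeeping described above can be sketched as a table of last-seen times plus a periodic scan. This is an illustration of the mechanism only, not RocketMQ source code; the class HeartbeatRegistry and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: remembers the last heartbeat per broker and evicts brokers that went silent. */
final class HeartbeatRegistry {
    private final long expiryMillis;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    HeartbeatRegistry(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    /** Called whenever a heartbeat arrives (every 30 s per broker in the text above). */
    void onHeartbeat(String brokerId, long nowMillis) {
        lastHeartbeat.put(brokerId, nowMillis);
    }

    /** Called periodically (every 10 s in the text above); returns the evicted brokers. */
    List<String> scanAndEvict(long nowMillis) {
        List<String> evicted = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > expiryMillis) {
                evicted.add(entry.getKey());
                it.remove();    // drop the broker from the routing table
            }
        }
        return evicted;
    }

    boolean isRegistered(String brokerId) {
        return lastHeartbeat.containsKey(brokerId);
    }

    public static void main(String[] args) {
        HeartbeatRegistry registry = new HeartbeatRegistry(120_000L);
        registry.onHeartbeat("broker-a", 0L);
        registry.onHeartbeat("broker-b", 100_000L);
        System.out.println("evicted at 130s: " + registry.scanAndEvict(130_000L));
    }
}
```

With a 30-second heartbeat and a 120-second expiry, a broker has to miss several heartbeats in a row before it is removed, which tolerates brief network hiccups.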

5.5 Application layer: Spring Boot Actuator

A service instance or process can report that it is alive to other services through regular heartbeats, but a heartbeat alone does not reflect its health. For example, a service that can no longer write data because the disk is full can still respond to heartbeats. So can a service that depends on Redis when Redis cannot be reached, and a service whose functions depend on a distributed storage service that is unavailable. Many things therefore need to be considered to decide whether a service instance is not just alive but "healthy". Spring Boot Actuator addresses this problem well: it can reflect the health of the entire service, including the health of the subsystems it depends on.

Spring Boot Actuator is a subproject of Spring Boot. It provides endpoints that external applications can access and interact with. Actuator includes many functions, such as health checks, auditing, and metrics collection, which help us monitor and manage Spring Boot applications. Health is one of these endpoints. It provides basic health information about a Spring Boot application, so that cloud services or Kubernetes (K8s) can check the application's health regularly and respond to anomalies in time.

If a microservice application uses resources such as MySQL, Amazon S3, Elasticsearch, and DynamoDB, its health check result should include the health status of all these subsystems.

Actuator's health check is implemented through the HealthIndicator interface. The interface has a single health() method, whose return value is a Health object.

@FunctionalInterface
public interface HealthIndicator {

    /**
     * Return an indication of health.
     * @return the health
     */
    Health health();

}

The Health object has two fields: status and details. By default, status has four values: UNKNOWN, UP, DOWN, and OUT_OF_SERVICE, and users can define additional ones. Details is a key-value structure in which users can return any data they want.

@JsonInclude(Include.NON_EMPTY)
public final class Health extends HealthComponent {
    private final Status status;
    private final Map<String, Object> details;
    ...
}

Actuator ships with many commonly used HealthIndicator implementations, such as DiskSpaceHealthIndicator, DataSourceHealthIndicator, and RedisHealthIndicator.

Users can also implement their own indicator to suit their situation, for example:

@Override
public Health health() {
    int errorCode = check(); // perform some specific health check
    if (errorCode != 0) {
        return Health.down().withDetail("Error Code", errorCode).build();
    }
    return Health.up().build();
}

By default, the health endpoint is enabled and exposed over HTTP. You can query the health status of the application through http://localhost:8080/actuator/health, which returns {"status": "UP"}, a summary status. Detailed health information can be enabled with the configuration item management.endpoint.health.show-details=always, after which the response also contains the status and details of each component.

The summary status is computed by HealthAggregator: the health statuses of all subsystems are sorted in the order DOWN, OUT_OF_SERVICE, UP, UNKNOWN, and the first one is taken.

For example, if ehCache is UP, MySQL is UNKNOWN, and diskSpace is OUT_OF_SERVICE, the sorted order is OUT_OF_SERVICE, UP, UNKNOWN. The first one is OUT_OF_SERVICE, so the service as a whole is reported as out of service.
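The aggregation rule can be sketched in a few lines of plain Java. This is a standalone illustration of the ordering described above, not Spring Boot's own class; StatusAggregationSketch is a hypothetical name, statuses are modelled as plain strings, and returning UNKNOWN when no known status is present is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

/** Sketch: picks the summary status by severity order DOWN, OUT_OF_SERVICE, UP, UNKNOWN. */
final class StatusAggregationSketch {
    // Earlier entries are more severe and win over later ones.
    private static final List<String> ORDER =
            Arrays.asList("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    static String aggregate(Collection<String> statuses) {
        String summary = "UNKNOWN";     // assumption: fallback when nothing is known
        int bestRank = ORDER.size();
        for (String status : statuses) {
            int rank = ORDER.indexOf(status);
            if (rank >= 0 && rank < bestRank) {
                summary = status;
                bestRank = rank;
            }
        }
        return summary;
    }

    public static void main(String[] args) {
        // ehCache is UP, MySQL is UNKNOWN, diskSpace is OUT_OF_SERVICE.
        List<String> statuses = Arrays.asList("UP", "UNKNOWN", "OUT_OF_SERVICE");
        System.out.println(aggregate(statuses));
    }
}
```

Taking the most severe status means one unhealthy dependency is enough to flag the whole service, which is usually what a load balancer or orchestrator needs to know.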
