
How to improve the availability of micro-service architecture

2025-03-29 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

Many newcomers are not clear about how to improve the availability of a microservice architecture. To help solve this problem, the editor explains it in detail below; readers with this need are welcome to learn from it, and hopefully you will gain something.

The industry usually measures system availability by the number of nines; for example, 99.99% means roughly one hour of unavailability per year. No service is 100% available, which means a failure may still occur while the service is running. When a monolithic architecture, with all functions centralized and running in a single application, is split into many independent microservices, the overall risk of failure can be reduced. However, microservices depend heavily on one another, and as their number grows the dependency relationships become more and more complex. Every microservice can fail, and if interdependencies are not well isolated to prevent chain reactions of failures, the result may be worse than with the monolith.

Assume there are 100 microservices and each microservice has only one possible failure. Since each service can independently be either healthy or failed, there are 2^100 different failure combinations, and in reality each microservice may have more than one failure mode. When a microservice fails, how do we ensure that the microservices depending on it do not become unavailable, that the system automatically degrades and removes the faulty microservice, and that the fault does not spread to the whole system? Effectively guaranteeing the availability of the microservice architecture thus becomes a challenge.
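As a quick sanity check of that figure: each of the 100 services can independently be either healthy or failed, so the number of failure combinations is

```latex
2^{100} \approx 1.27 \times 10^{30}
```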

Suppose a user request relies on the collaboration of five microservices (in the Kubernetes (K8S) container framework, a pod is a group of containers that together provide the same function).

Initially every dependent Service is normal. Now suppose one Service fails; there are three possible situations:

The request succeeds. A node in Service C becomes unavailable because of a network exception or downtime, but the failed node is replaced by a highly available node, so Service C is still available and the request is not affected.

The request succeeds. Suppose Service D fails, but that Service is not critical and processing can continue. For example, after a user registers, a service is called to send the user a registration-success email. If that Service is unavailable, the user's registration is not affected and still succeeds; the email can be sent once the service is restored, with only a delay. In this case Service A is unaffected and remains available.

The request fails. For example, the faulty node is Service E, and the fault is a code-level logic error, so none of its highly available nodes can serve the request. In this case Service E must be isolated as a dependency; otherwise Service A may be affected by Service E and become unavailable. Something must be done to ensure that Service A is not affected and remains available.

The availability of a microservice architecture can be improved with the following strategies:

1) Failover

The most basic principle for improving service availability is to eliminate single points of failure by building a cluster behind load balancing, with all cluster nodes stateless and completely equivalent. This is the first case described above: when a node fails, the load balancer sends user requests to an available node. To the user, the node failure is imperceptible; the request is transparently transferred to an available node for execution.
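A minimal sketch of this failover behavior, with a hypothetical Node interface (a real system would usually delegate this to a load balancer or service mesh): pick the next healthy node round-robin and, if a call fails, mark that node unhealthy and retry on another.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal round-robin failover sketch: pick the next healthy node,
// and if a call fails, mark the node unhealthy and retry on another one.
public class FailoverClient {

    public interface Node {
        boolean isHealthy();
        void markUnhealthy();
        String call(String request) throws Exception; // remote call, may fail
    }

    private final List<Node> nodes;
    private final AtomicInteger cursor = new AtomicInteger();

    public FailoverClient(List<Node> nodes) {
        this.nodes = nodes;
    }

    public String execute(String request) throws Exception {
        Exception last = null;
        // Try at most one full round over the cluster.
        for (int attempt = 0; attempt < nodes.size(); attempt++) {
            Node node = nodes.get(Math.floorMod(cursor.getAndIncrement(), nodes.size()));
            if (!node.isHealthy()) {
                continue; // skip nodes already known to be down
            }
            try {
                return node.call(request); // success: transparent to the caller
            } catch (Exception e) {
                node.markUnhealthy();      // take the failed node out of rotation
                last = e;                  // fall through and try the next node
            }
        }
        throw new IllegalStateException("no available node", last);
    }
}
```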

2) Asynchronous call

Asynchronous invocation is important for preventing a single service failure from failing an entire request, as in the second case above. With synchronous invocation, an exception in the mail service would prevent the other two services from executing, and user registration would fail. With asynchronous invocation, Service A sends the registration information to a message queue and immediately returns a successful registration response. Even though the mail service is unavailable, services such as writing to the database and activating permissions still execute normally, so even if the email cannot be sent, the other services are unaffected and the registration completes successfully.
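A minimal sketch of the asynchronous pattern, with hypothetical method names; an in-process BlockingQueue stands in here for a real message broker such as Kafka or RabbitMQ:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Registration returns as soon as the critical steps succeed; the email
// is published to a queue and delivered later by a separate consumer.
public class RegistrationService {

    record EmailTask(String userEmail) {}

    private final BlockingQueue<EmailTask> mailQueue = new LinkedBlockingQueue<>();

    public boolean register(String user, String email) {
        saveUser(user);                        // critical: write to the database
        activatePermissions(user);             // critical: permission activation
        mailQueue.offer(new EmailTask(email)); // non-critical: queued for later
        return true;                           // respond immediately; email may lag behind
    }

    // Runs in a separate consumer thread; a mail-service outage only delays delivery.
    public void mailWorker() throws InterruptedException {
        while (true) {
            EmailTask task = mailQueue.take();
            trySendEmail(task.userEmail());    // would be retried/requeued on failure
        }
    }

    private void saveUser(String user) { /* ... */ }
    private void activatePermissions(String user) { /* ... */ }
    private void trySendEmail(String to) { /* ... */ }
}
```

Because `register` returns as soon as the database write and permission activation succeed, an outage in the mail service only delays the email; it never fails the registration.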

3) Dependency isolation

A user request arrives at Service A, which allocates thread resources to call other Services remotely over the network. Suppose the call to Service E fails: the threads in Service A that call Service E may respond slowly or hang, and threads are a system resource. If they are not released quickly, under high concurrency the thread resources will be exhausted and Service A will become unavailable, even though the other services it depends on are still fine.

Service A's resources are limited; suppose it allocates 400 threads at startup. If those 400 threads cannot be released in time because calls to Service E misbehave, for example blocking or responding slowly, then all 400 threads end up stuck on calls to Service E. Service A now has no idle threads to accept new user requests and will hang or freeze. So to avoid being dragged down by a dependency, Service A must ensure that its thread resources cannot be exhausted by the services it calls. The book "Release It!" summarizes two very important methods: setting timeouts and using circuit breakers.

Set timeout

After a timeout is set on the service call, once a thread's call exceeds the configured time an exception is thrown and the connection is closed, so the calling thread does not block indefinitely on the failing service and starve the pool of threads that could accept new user requests. This prevents Service A from being dragged down and made unavailable by an abnormal call to Service E. Therefore, whenever you call an external dependency over the network, you must set a timeout.
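A minimal sketch of a call-site timeout, assuming a 2-second budget for illustration: the caller waits a bounded time for the dependency and frees its thread instead of blocking indefinitely.

```java
import java.util.concurrent.*;

// Bound the time spent waiting on a dependency so the calling thread
// is released even if the remote service hangs.
public class TimeoutCall {

    private final ExecutorService pool = Executors.newCachedThreadPool();

    public String callWithTimeout(Callable<String> remoteCall) {
        Future<String> future = pool.submit(remoteCall);
        try {
            // Wait at most 2 seconds; a hung dependency only costs us 2 seconds.
            return future.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);               // interrupt the stuck call
            throw new RuntimeException("dependency timed out", e);
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("dependency call failed", e);
        }
    }
}
```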

Use circuit breaker

Circuit breakers are familiar to everyone: a household breaker trips when the current overloads or short-circuits. If it did not trip, current would keep flowing, the wires would heat up, and a fire could start. With a breaker, the circuit is automatically cut when overloaded, avoiding a larger disaster. The same applies in a program: when calls to a dependency are known to be timing out in large numbers, letting new requests through only produces more timeouts, fails to obtain the expected results, consumes existing resources, increases the load, and can make the service itself unavailable.

A circuit breaker avoids this waste of resources. It sits between your own service and the dependency and, through monitoring, tracks the status of calls in real time. When timeouts or failures reach a threshold (for example 50% of requests timing out, or 20 consecutive failures), the breaker opens and subsequent requests immediately return failure instead of waiting. Then, after a time interval (for example 30 seconds), it tries to close the breaker again, for instance once trial requests show no more timeouts (0% timing out), to see whether the dependency is back in service.
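A minimal circuit-breaker sketch using the thresholds above as assumptions (for example 20 consecutive failures to open, 30 seconds before a trial call); production systems would normally use a library such as Resilience4j or Hystrix rather than hand-rolling this.

```java
import java.util.function.Supplier;

// Minimal circuit-breaker sketch: open after N consecutive failures,
// fail fast while open, and allow a trial call after a cool-down period.
public class CircuitBreaker {

    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;     // e.g. 20 consecutive failures
    private final long openIntervalMillis;  // e.g. 30_000 ms before retrying

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long openIntervalMillis) {
        this.failureThreshold = failureThreshold;
        this.openIntervalMillis = openIntervalMillis;
    }

    public synchronized <T> T call(Supplier<T> dependency) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < openIntervalMillis) {
                throw new IllegalStateException("circuit open: failing fast");
            }
            state = State.HALF_OPEN; // cool-down elapsed: let one trial call through
        }
        try {
            T result = dependency.get();
            consecutiveFailures = 0;
            state = State.CLOSED;    // trial (or normal) call succeeded: close again
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;  // trip the breaker
                openedAt = System.currentTimeMillis();
            }
            throw e;
        }
    }
}
```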

When a service depends on multiple services and one non-core dependency becomes unavailable, setting timeouts and using a circuit breaker ensure that Service A does not fail itself when calling the faulty Service E. In most cases the service keeps running healthily, and dependency isolation is achieved.

4) Rate limiting

During peak access periods, heavy concurrency may degrade performance, and in severe cases a large backlog of queued requests can bring the service down. To keep the application available, you can reject low-priority calls so that high-priority requests succeed rather than having every call fail, and give each dependent service its own small thread pool. If the pool is full, the call is rejected immediately; no queue is used by default, which speeds up failure detection. The result is that some users get through and some fail, but a failed user who retries can usually be served. This keeps the service available instead of letting it become completely unavailable.
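A minimal sketch of the per-dependency pool, with pool size and fallback as illustrative assumptions: a small, non-queuing thread pool that rejects overflow calls immediately rather than letting them pile up.

```java
import java.util.concurrent.*;

// Bulkhead-style limiting: each dependency gets a small thread pool with no
// queue, so excess calls are rejected immediately instead of piling up.
public class DependencyPool {

    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            10, 10,                      // at most 10 concurrent calls to this dependency
            0L, TimeUnit.MILLISECONDS,
            new SynchronousQueue<>(),    // no queuing: hand off or reject
            new ThreadPoolExecutor.AbortPolicy()); // reject with an exception when full

    public <T> T call(Callable<T> remoteCall, T fallback) {
        try {
            Future<T> future = pool.submit(remoteCall);
            return future.get(2, TimeUnit.SECONDS);   // combine with a timeout
        } catch (RejectedExecutionException e) {
            return fallback;  // pool full: fail fast, caller can retry later
        } catch (Exception e) {
            return fallback;  // timeout or call failure: degrade gracefully
        }
    }
}
```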

Although the measures above improve system availability, systems are complex, a simple fix can have unimaginable consequences, and systems are dynamic; some are released several or even dozens of times a day. Under these conditions failures are still inevitable. There is more that can be done to improve availability than acting as a firefighter when the system fails in the middle of the night or during a holiday. For example, some enterprises regularly hold emergency drills in the production environment.

In the past, failures were injected manually during low business peaks to test the effectiveness of high-availability solutions across hosts, networks, applications, storage, and other layers; now teams are gradually running fault drills during normal production hours to check the system's availability. The problem is that recovery may be immediate during a drill, yet when a real failure occurs it can still take a long time to recover. There are three reasons: first, drills follow known scenarios; second, the scope of a drill is usually limited to switching highly available nodes or switching to the disaster-recovery system; third, drills are manual, require everyone's participation, and therefore are not held frequently. But the system is dynamic: being highly available this time does not mean it will be highly available next week or next month.

When a monolith becomes a microservice architecture, application-layer drills become complex; as mentioned earlier, if each of 100 services has only one possible failure, there may be 2^100 different scenarios. An automated fault-testing method is therefore needed to avoid the limits of manually operated drills after moving to microservices.

Netflix proposed an automated fault-testing scheme to improve the availability of its microservice architecture. The tests run in the production environment, and the ultimate goal is that when a real fault occurs, production does not stop serving: the system gracefully degrades and removes the faulty components without human intervention. Their view is that testing only in a test environment cannot exercise the business pressure, business scenarios, environment configuration, network performance, and hardware performance of the real production environment, so when a failure actually happens in production the mitigation may turn out not to work. The tests are run only during working hours, so engineers can be alerted and respond in time.

In the paper "Lineage-driven Fault Injection", Peter Alvaro describes an algorithm called Molly, which Netflix combined with its own Failure Injection Testing (FIT) framework to implement safe, automated fault-injection testing. Molly starts from a fault-free state of the system and then tries to answer: how did the system reach its current fault-free state? A brief example illustrates the principle. First, the tracing system is used to draw a tree of all the microservices that each request passes through.

(A or R or P or B)

At the start, all four nodes in the request tree above (A, R, P, and B) are needed and all are healthy. The algorithm then works backward from this correct output: it randomly selects a node, injects a fault there, and finds and builds the logical chain that supports the correct result. When a fault is injected into a node, there are three possible outcomes:

The request fails: a node has been found whose failure breaks the request, so this fault can be removed from future experiments.

The request succeeds, but the failed node is not critical.

The request succeeds because the failure is absorbed by a highly available replacement node.

In this example a fault is first injected into Ratings, yet the request succeeds, indicating that a Ratings failure does not affect use of the service. This node is therefore excluded and the request tree is redrawn.

(A or P or B) and (A or P or B or R)

At this point you can see that the request can be satisfied through (A or P or B) or through (A or P or B or R). A fault is then injected into Playlist, and the request still succeeds because it is forwarded to a standby node for execution, so a new fallback node becomes reachable.

(A or PF or B) and (A or P or B) and (A or P or B or R)

The formula can now be updated to show that the request has been satisfied in three ways: (A or PF or B) and (A or P or B) and (A or P or B or R). Testing continues in this fashion until all the correct outputs have been traversed and no further failure-causing nodes can be found.

Molly does not prescribe how to search the space, so the implementation enumerates all candidate scenarios and then randomly selects among the smallest solution sets. For example, the final set of failure combinations to try might be [{A}, {PF}, {B}, {P, PF}, {R, A}, {R, B} ...]. All single-node fault injections are tried first, then the two-node combinations, and so on.
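A rough sketch of that search order, assuming each experiment simply reports whether the request survived a given set of injected faults (an illustration of the enumeration idea, not Netflix's actual implementation):

```java
import java.util.*;
import java.util.function.Predicate;

// Enumerate candidate failure sets by size (singletons first, then pairs, ...),
// skipping any candidate that contains a combination already known to break the request.
public class FailureSearch {

    public static List<Set<String>> search(List<String> services,
                                           Predicate<Set<String>> requestSurvives,
                                           int maxSize) {
        List<Set<String>> knownBreaking = new ArrayList<>();
        for (int size = 1; size <= maxSize; size++) {
            for (Set<String> candidate : combinations(services, size)) {
                boolean alreadyCovered = knownBreaking.stream().anyMatch(candidate::containsAll);
                if (alreadyCovered) continue;           // a smaller failing set already explains this
                if (!requestSurvives.test(candidate)) { // inject faults, observe the request
                    knownBreaking.add(candidate);       // record it and prune its supersets later
                }
            }
        }
        return knownBreaking;
    }

    // All size-k subsets of the service list.
    private static List<Set<String>> combinations(List<String> items, int k) {
        List<Set<String>> out = new ArrayList<>();
        build(items, k, 0, new ArrayDeque<>(), out);
        return out;
    }

    private static void build(List<String> items, int k, int start,
                              Deque<String> current, List<Set<String>> out) {
        if (current.size() == k) {
            out.add(new HashSet<>(current));
            return;
        }
        for (int i = start; i < items.size(); i++) {
            current.push(items.get(i));
            build(items, k, i + 1, current, out);
            current.pop();
        }
    }
}
```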

The purpose of this testing is to find and fix faults before they affect a large number of members. When fault testing is run in production, causing problems for a large number of users is unacceptable, so tests are confined to a specified scope. Two key concepts are involved: the failure scope and the injection points. The failure scope limits the potential impact of a fault test to a controllable range, which can be as small as a particular user or device or as large as 1% of all users. Injection points are the components of the system where faults are planned to be injected, such as the RPC layer, the cache layer, or the persistence layer.

A fault-simulation test begins with the FIT service injecting fault-simulation metadata at Zuul (the edge gateway service); if a request falls within the failure scope, the fault is injected. The fault might be a delay in a service call or a failure to reach the persistence layer. Each injection point the request touches checks whether the request's context names it as the component to be failed, and if so, it simulates the fault at that point.
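A schematic sketch of an injection-point check, with invented context fields and fault names; it illustrates the idea rather than the real FIT API:

```java
import java.util.Map;
import java.util.Set;

// Schematic injection point: the request context carries fault metadata injected
// at the edge gateway, and each layer checks whether it is the targeted component.
public class InjectionPoint {

    private final String componentName; // e.g. "cache-layer" or "persistence-layer"

    public InjectionPoint(String componentName) {
        this.componentName = componentName;
    }

    /** Request context: which users are in the failure scope and which components to fail. */
    public record RequestContext(String userId,
                                 Set<String> failureScopeUsers,
                                 Map<String, String> faultsByComponent) {}

    public void maybeInjectFault(RequestContext ctx) throws InterruptedException {
        if (!ctx.failureScopeUsers().contains(ctx.userId())) {
            return; // request is outside the failure scope: never affected
        }
        String fault = ctx.faultsByComponent().get(componentName);
        if (fault == null) {
            return; // this component is not an injection target for this test
        }
        switch (fault) {
            case "delay" -> Thread.sleep(2000);  // simulate a slow call
            case "error" -> throw new RuntimeException("simulated failure at " + componentName);
            default -> { /* unknown fault type: ignore */ }
        }
    }
}
```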

Was the content above helpful to you? If you would like to learn more about this topic or read more related articles, please follow the industry information channel. Thank you for your support.
