What are the fault-tolerant isolation methods in a microservice architecture


This article explains the fault-tolerant isolation methods used in a microservice architecture. The explanation is kept simple and clear, so it should be easy to follow. Let's dive in.

1. What are the availability risks in a microservice architecture?

Before discussing "fault-tolerant isolation", let's look at the common availability risks in a microservice architecture; once we understand them, we know what we need to avoid and isolate.

We can analyze these risks by the scale of the deployment:

Single-machine availability risk:

This one is easy to understand: it is the availability risk caused by the failure of a single machine on which a microservice is deployed. The probability of this risk is very high, because an individual machine runs into all kinds of operational failures, such as disk failures and power outages, and these happen from time to time. Although the probability is high, the damage is limited, because most services are deployed on more than one machine. We only need to monitor properly, detect the fault, remove the faulty machine from the service cluster in time, and bring it back into the cluster after it has been repaired.

Single data center availability risk:

The probability of this risk is much lower than that of a single machine failure, but it is not zero; in practice it does happen. The most common example is the fiber-optic cable leading into the data center being dug up and cut. Not long ago this happened to the data center hosting Alipay: its fiber was cut.

With cities all over the country busily building bridges, roads, and housing to drive GDP, it is hardly surprising that a few underground fiber-optic cables get dug up along the way.

If all of our services are deployed in a single data center and that data center goes down, there is nothing we can do. Fortunately, most large and medium-sized projects deploy across multiple data centers, for example active-active within the same city or multi-site active-active across regions. Once one data center fails and becomes unavailable, we immediately switch routes and redirect its traffic to the other data centers.

Cross-data-center cluster availability risk:

Now that the services are clustered across data centers, in theory there should be no availability problem. But that only covers the physical level: if there is a bug in our code, or user traffic surges for some special reason beyond what the service can bear, the system can still become unavailable even with cross-data-center clusters. However, if we have put "fault-tolerant isolation" schemes in place in advance, such as rate limiting and circuit breaking, we can still keep access working for some services or some users.

2. What are the methods of "fault-tolerant isolation"?

We have now seen the availability risks a microservice architecture may face, and why "fault-tolerant isolation" matters. Let's look at the common "fault-tolerant isolation" methods:

Timeout:

This is the simplest fault-tolerance method. When one service calls another, set an explicit timeout; once this threshold is exceeded and the "dependent service" has still not returned data, the "caller" actively gives up, so that it is not dragged down by the failure of the "dependent service".
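As a minimal sketch of the idea, the caller below runs the dependent call on a thread pool and waits at most 500 ms for it; the service method, timeout value, and fallback string are all illustrative assumptions, not a specific framework's API:

```java
import java.util.concurrent.*;

public class TimeoutCallDemo {
    private static final ExecutorService pool = Executors.newFixedThreadPool(4);

    // Illustrative dependent call; in practice this would be an HTTP/RPC client call.
    static String callDependentService() throws InterruptedException {
        Thread.sleep(3000); // simulate a slow dependency
        return "real result";
    }

    public static void main(String[] args) throws Exception {
        Future<String> future = pool.submit(TimeoutCallDemo::callDependentService);
        try {
            // Give up if the dependent service does not answer within 500 ms.
            System.out.println(future.get(500, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true);                   // stop waiting on the slow dependency
            System.out.println("fallback result"); // caller gives up instead of hanging
        } finally {
            pool.shutdown();
        }
    }
}
```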

Rate limiting:

As the name implies, this limits the maximum traffic. The maximum concurrency the system can handle is bounded; when too many requests arrive at the same time, the extra ones have to queue up or be rejected. It is the same idea as queuing to buy tickets at a scenic spot or queuing at a shopping mall.
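One simple way to bound concurrency is a semaphore-based limiter, sketched below; the class and handler interface are made up for illustration, and rejected requests could just as well be queued instead of refused:

```java
import java.util.concurrent.Semaphore;

// A minimal concurrency limiter: at most maxConcurrent requests are served at once;
// the rest are rejected immediately (they could also be queued instead).
public class ConcurrencyLimiter {
    private final Semaphore permits;

    public ConcurrencyLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    public String handle(RequestHandler handler) {
        if (!permits.tryAcquire()) {
            return "rejected: system busy, please try again later";
        }
        try {
            return handler.run();   // real business logic runs while a permit is held
        } finally {
            permits.release();
        }
    }

    public interface RequestHandler {
        String run();
    }
}
```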

Degradation:

This is similar to rate limiting: the traffic is too heavy for the service to keep up. In this case we can degrade the less important functional modules and stop serving them, freeing resources for the core functions. We can also tier the users and prioritize requests from important users, such as paying VIP users.
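A minimal sketch of the idea follows, assuming a product page where recommendations are the non-core module and a flag is flipped by monitoring or operations; the class, methods, and VIP model are all hypothetical:

```java
// Degradation sketch: under pressure, non-core features are switched off for
// ordinary users while VIP users keep the full experience.
public class DegradationDemo {
    private volatile boolean degraded = false;   // toggled by monitoring/operations

    public void setDegraded(boolean degraded) {
        this.degraded = degraded;
    }

    public String productPage(String userId, boolean isVip) {
        String core = loadProductInfo(userId);               // core function: always served
        if (degraded && !isVip) {
            return core;                                      // drop recommendations for ordinary users
        }
        return core + " | " + loadRecommendations(userId);   // full page for VIPs or normal load
    }

    private String loadProductInfo(String userId)     { return "product-info"; }
    private String loadRecommendations(String userId) { return "recommendations"; }
}
```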

Delayed processing:

This method sets up a traffic buffer pool: all requests first enter the buffer pool, and the real service processor takes requests out of the pool and processes them in order. This reduces the pressure on the back-end service, but the user experiences some delay.
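A minimal sketch of such a buffer pool, using a bounded blocking queue with a single worker draining it; the queue size, single worker, and string-typed requests are simplifying assumptions:

```java
import java.util.concurrent.*;

// Buffer-pool sketch: incoming requests are queued, and a worker drains the
// queue at its own pace, smoothing load spikes at the cost of some delay.
public class BufferedProcessor {
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>(10_000);

    public BufferedProcessor() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String request = buffer.take();   // wait for the next buffered request
                    process(request);                 // real service logic goes here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Called by the front end: returns immediately, so the user sees a delay
    // in processing rather than an error.
    public boolean submit(String request) {
        return buffer.offer(request);   // false if the buffer itself is full
    }

    private void process(String request) {
        System.out.println("processing " + request);
    }
}
```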

Circuit breaking:

As the name suggests, it works like a fuse in an electrical circuit. When the traffic is too high or the error rate is too high, the breaker trips, the link is cut, and the service is no longer called. When traffic returns to normal, or the back-end service stabilizes, the breaker automatically closes again and the service can be called normally. This is a very good way to protect back-end microservices.

There is a very important concept in circuit-breaking technology: the circuit breaker. You can refer to the following figure:

A circuit breaker is essentially a state machine with three states: Closed (the normal state), Open (when the back-end service fails, the link is cut and no calls go through), and Half-Open (a small portion of traffic is allowed through as a trial; if the service looks healthy the breaker returns to Closed, and if it is still failing it goes back to Open).
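A minimal sketch of this state machine is shown below. It is not the Hystrix implementation; the failure threshold, cool-down period, and method names are illustrative assumptions:

```java
// CLOSED -> OPEN when recent failures exceed a threshold,
// OPEN -> HALF_OPEN after a cool-down period,
// HALF_OPEN -> CLOSED on a successful trial call, back to OPEN on failure.
public class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failureCount = 0;
    private long openedAt = 0;

    private static final int FAILURE_THRESHOLD = 5;
    private static final long COOL_DOWN_MS = 10_000;

    public synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= COOL_DOWN_MS) {
                state = State.HALF_OPEN;   // let a trial request through
                return true;
            }
            return false;                  // short-circuit: fail fast
        }
        return true;                       // CLOSED or HALF_OPEN
    }

    public synchronized void recordSuccess() {
        failureCount = 0;
        state = State.CLOSED;
    }

    public synchronized void recordFailure() {
        failureCount++;
        if (state == State.HALF_OPEN || failureCount >= FAILURE_THRESHOLD) {
            state = State.OPEN;
            openedAt = System.currentTimeMillis();
        }
    }
}
```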

3. Applications of "fault-tolerant isolation"

The most famous framework for fault-tolerant isolation, or circuit breaking, is Hystrix. Hystrix was open-sourced by Netflix and is widely used in the industry.

The following is the flow chart showing how Hystrix works:

This is the flow of the newer version, which is considerably more complex than the earlier one; without an explanation, it can be hard to follow.

The steps are numbered 1-9 in the figure, and the flow can be followed in that order.

When we use Hystrix, the request is wrapped in a HystrixCommand; that is step 1. Step 2 starts executing the request: Hystrix supports synchronous execution (the .execute method in the figure), asynchronous execution (the .queue method in the figure), and reactive execution (the .observe method in the figure). Step 3 checks the cache: if the result is already cached, it is returned directly. If it is not cached, step 4 checks whether the circuit breaker is open. If it is open, i.e. the call is short-circuited, the request fails immediately and jumps to step 8, where the failure response is decided: either a fallback is implemented and its result is returned, or no fallback logic exists and the error is thrown directly.

If the circuit breaker is not open, the request continues to step 5, which checks whether the thread pool / queue is full; if it is full, the flow also jumps to step 8. If not, step 6 executes the remote call and checks whether it succeeded. If the call fails or throws an exception, the flow goes to step 8; if it succeeds, step 9 returns the result normally.
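As a small illustration of step 1 and step 2, the sketch below wraps a remote call in a HystrixCommand with a fallback; it assumes the Hystrix core library is on the classpath, and the command name, userId parameter, and return values are placeholders:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class GetUserCommand extends HystrixCommand<String> {
    private final String userId;

    public GetUserCommand(String userId) {
        super(HystrixCommandGroupKey.Factory.asKey("UserService")); // step 1: wrap the request
        this.userId = userId;
    }

    @Override
    protected String run() {
        // Step 6: the actual remote call; a timeout or exception here
        // routes the request to getFallback().
        return "user-" + userId;
    }

    @Override
    protected String getFallback() {
        // Step 8: the failure path (circuit open, pool full, timeout, or error).
        return "default-user";
    }
}

// Step 2, the three execution styles:
//   String s = new GetUserCommand("42").execute();                 // synchronous
//   java.util.concurrent.Future<String> f = new GetUserCommand("42").queue(); // asynchronous
//   rx.Observable<String> o = new GetUserCommand("42").observe();  // reactive
```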

Step 7 in the figure is a very powerful module: it collects all kinds of metrics throughout the Hystrix flow, which are used to monitor the system and drive the circuit-breaker decisions.

In addition, the way the Hystrix circuit breaker is implemented is also important. Here is a schematic diagram of the Hystrix circuit breaker:

Hystrix implements the circuit breaker with a sliding time window algorithm: statistics are collected in per-second buckets, with 10 buckets in total. Every second a new bucket is created and the window slides forward, discarding the oldest bucket.

Each bucket records the status of the service calls made in that second: the number of calls, the number of successes, and so on. The circuit breaker aggregates these 10 buckets to decide whether it should open or close.
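The sketch below shows the general idea of per-second sliding buckets, not Hystrix's actual implementation; the bucket fields, ring-buffer indexing, and error-rate method are illustrative assumptions:

```java
// Ten one-second buckets form the window; the oldest bucket is reused as time moves on.
public class SlidingWindowMetrics {
    private static final int BUCKET_COUNT = 10;

    private static class Bucket {
        long second;      // which second this bucket covers
        int success;
        int failure;
    }

    private final Bucket[] buckets = new Bucket[BUCKET_COUNT];

    public SlidingWindowMetrics() {
        for (int i = 0; i < BUCKET_COUNT; i++) buckets[i] = new Bucket();
    }

    private Bucket currentBucket() {
        long now = System.currentTimeMillis() / 1000;
        Bucket b = buckets[(int) (now % BUCKET_COUNT)];
        if (b.second != now) {          // bucket is stale: reuse it for the new second
            b.second = now;
            b.success = 0;
            b.failure = 0;
        }
        return b;
    }

    public synchronized void recordSuccess() { currentBucket().success++; }
    public synchronized void recordFailure() { currentBucket().failure++; }

    // Aggregate the live buckets; the breaker opens when this rate crosses a threshold.
    public synchronized double errorRate() {
        long now = System.currentTimeMillis() / 1000;
        int success = 0, failure = 0;
        for (Bucket b : buckets) {
            if (now - b.second < BUCKET_COUNT) {   // only count buckets inside the window
                success += b.success;
                failure += b.failure;
            }
        }
        int total = success + failure;
        return total == 0 ? 0.0 : (double) failure / total;
    }
}
```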

Thank you for reading. That covers the fault-tolerant isolation methods of a microservice architecture; after studying this article, you should have a deeper understanding of the topic. How to apply them in a specific system still needs to be verified in practice.
