Distributed systems concerns: the "compensation" mechanism and best practices that 99% of people can understand


If this is your second time reading one of my articles, you are welcome to scan the QR code at the end of the article and subscribe to my personal account (Cross-border Architect).

This article is about 4,229 words long and should take roughly 11 minutes to read.

This article wraps up the second topic in this series, "high availability", which follows "data consistency".

In the previous articles, Brother Z talked about what "high availability" means, how to do "load balancing", and the "three musketeers of high availability" (circuit breaking, rate limiting, and degradation; links to those articles are at the end). This time, let's talk about how to digest the "internal injuries" through a "compensation" mechanism while keeping the system highly available to the outside.

I. What is the significance of the "compensation" mechanism?

Take the shopping scene of e-commerce as an example:

Client -> shopping cart microservice -> order microservice -> payment microservice.

This kind of call chain is very common.

So why do we need to consider compensation mechanisms?

As mentioned in previous articles, a single cross-machine call may pass through DNS servers, network cards, switches, routers, load balancers, and other devices, none of which is always stable; if any link in the data-transmission path goes wrong, problems will occur.

In a distributed scenario, one complete piece of business is made up of multiple cross-machine calls, so the probability of something going wrong compounds with every additional call.
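
A rough back-of-the-envelope illustration (assuming failures at each hop are independent): if each cross-machine call succeeds with probability p, a request that spans n such calls succeeds with probability p^n. With p = 99.9% and n = 5, that is already only about 0.999^5 ≈ 99.5%, so the chance of at least one link failing grows quickly as the chain gets longer.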

However, these problems do not necessarily mean the system truly cannot handle the request, so we should try to digest these exceptions automatically wherever possible.

You may ask: I have seen "compensation", "transaction compensation", and "retry" before; what is the relationship between them?

In fact, you do not need to worry too much about the names; their purpose is the same: once an operation runs into an exception, eliminate the "inconsistent" state it caused through some internal mechanism.

Digression: in Brother Z's view, no matter what approach is used, as long as the problem is solved by an extra step, it can be understood as "compensation", so "transaction compensation" and "retry" are both subsets of "compensation". The former is a reverse operation, while the latter is a forward operation.

Judging from the outcome, though, the two mean different things. "Transaction compensation" means "give up": the current operation is bound to fail.

▲ transaction compensation

"retry" still has a chance to deal with success. These two methods are suitable for different scenarios.

▲ retry

Because "compensation" is already an additional process, since we can take this additional process, it shows that timeliness is not the first consideration, so the core point of compensation is: it is better to be slow than wrong.

Therefore, do not settle hastily on a compensation implementation; it needs to be evaluated carefully. Mistakes cannot be avoided 100%, but keeping this mindset can more or less reduce how often they occur.

II. How should "compensation" be done?

The mainstream ways to do "compensation" are the "transaction compensation" and "retry" mentioned earlier; below they will be called "rollback" and "retry".

Let's talk about rollback first; it is logically simpler than "retry".

"Roll back"

Brother Z divides rollback into two modes: "explicit rollback" (calling a reverse interface) and "implicit rollback" (no reverse interface needed).

The most common is "explicit rollback". The plan boils down to two things:

First, determine the step at which the failure occurred and its state, in order to determine the scope of the rollback. A business process is usually defined at design time, so the rollback scope is easy to determine. The one thing to note is that if not all of the services involved in the process provide a "rollback interface", then the services that do provide one should be placed earlier when orchestrating the flow; that way there is still a chance to roll back when a later service fails.

Second, be able to provide the business data needed by the "rollback" operation. The more data you provide when rolling back, the more robust the program can be, because the service receiving the "rollback" call can validate the business data, such as checking whether the account matches and whether the amount is the same.

Since the data structure and size of this intermediate state are not fixed, Brother Z suggests serializing the relevant data into JSON and storing it in a NoSQL store.
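
To make "explicit rollback" a bit more concrete, here is a minimal sketch. The step names, the reverse-interface callback, and the JSON context are all hypothetical; the point is simply "record the business data of each completed step, and on failure call the reverse interfaces of the completed steps in reverse order".

using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical sketch: each completed step saves the data its reverse interface will need,
// serialized as JSON so the intermediate state can be kept in a NoSQL store.
class RollbackStep
{
    public string StepName { get; set; }     // e.g. "CreateOrder", "DeductStock"
    public string ContextJson { get; set; }  // business data the reverse interface can validate
}

class Orchestrator
{
    private readonly Stack<RollbackStep> _completed = new Stack<RollbackStep>();

    public void RecordCompleted(string stepName, object context)
    {
        _completed.Push(new RollbackStep
        {
            StepName = stepName,
            ContextJson = JsonSerializer.Serialize(context)
        });
    }

    // On failure, call the reverse interface of each completed step, newest first.
    public void RollbackAll(Action<RollbackStep> callReverseInterface)
    {
        while (_completed.Count > 0)
        {
            callReverseInterface(_completed.Pop());
        }
    }
}

Note how this matches the two points above: the stack of completed steps defines the rollback scope, and the JSON context carries the business data the reverse interface can use for validation.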

"implicit rollback" uses relatively few scenarios. It means that you don't need to do any extra processing for this rollback action, and there are mechanisms like "preemption" and "timeout failure" within downstream services. For example:

In an e-commerce scenario, the goods in an order reserve stock while waiting up to 15 minutes for the user to pay. If no payment arrives, the reserved inventory is released.
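
A minimal sketch of that "implicit rollback" behaviour, with the 15-minute window and the in-memory dictionary as illustrative assumptions (a real system would persist the reservations):

using System;
using System.Collections.Concurrent;

class StockReservation
{
    // orderId -> (quantity, expiry time)
    private readonly ConcurrentDictionary<string, (int Qty, DateTime ExpireAt)> _reserved =
        new ConcurrentDictionary<string, (int Qty, DateTime ExpireAt)>();

    public void Reserve(string orderId, int qty) =>
        _reserved[orderId] = (qty, DateTime.UtcNow.AddMinutes(15));

    public void ConfirmPaid(string orderId) => _reserved.TryRemove(orderId, out _);

    // Called periodically by a background job: releasing expired reservations
    // is the "implicit rollback"; the caller never invokes a reverse interface.
    public void ReleaseExpired(Action<string, int> addBackToStock)
    {
        foreach (var kv in _reserved)
        {
            if (kv.Value.ExpireAt <= DateTime.UtcNow && _reserved.TryRemove(kv.Key, out var r))
            {
                addBackToStock(kv.Key, r.Qty);
            }
        }
    }
}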

The "retry" discussed next offers many more ways to play, but it is also easier to fall into its pits.

"retry"

The biggest advantage of "retry" is that the business system does not need to provide a "reverse interface", which is particularly good for long-term development cost; after all, the business changes every day. Therefore, wherever possible, "retry" should be preferred.

However, fewer scenarios suit "retry" than "rollback", so the first step is to determine whether the current scenario is suitable for retrying. For example:

When the downstream system returns a temporary state such as "request timeout" or "rate limited", we can consider retrying.

If it returns a business error that clearly cannot be continued, such as "insufficient balance" or "no permission", there is no need to retry.

When some middleware or the RPC framework returns HTTP 503, 404, and so on, and there is no expectation of when it will recover, there is no need to retry either.
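
Putting those three judgments into code, a minimal sketch could look like this (the status codes and error strings are illustrative, not an exhaustive list):

static class RetryDecision
{
    // Temporary states are worth retrying; clear business failures and
    // "no idea when it will recover" states are not.
    public static bool IsRetryable(int httpStatus, string businessError)
    {
        if (businessError == "InsufficientBalance" || businessError == "NoPermission")
            return false;   // business error that clearly cannot be continued
        if (httpStatus == 503 || httpStatus == 404)
            return false;   // no expectation of when the dependency will recover
        if (httpStatus == 408 || httpStatus == 429)
            return true;    // request timeout / rate limited: temporary, worth retrying
        return false;       // when in doubt, fail fast instead of retrying blindly
    }
}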

If we are sure we want to retry, we also need to choose an appropriate retry strategy. The mainstream "retry strategies" are mainly the following.

Strategy 1: Retry immediately. Sometimes the failure is temporary, caused by something like a network packet collision or a traffic spike on a hardware component. In that case it is appropriate to retry the operation immediately. However, there should be no more than one immediate retry; if the immediate retry fails, switch to a different strategy.

Strategy 2: Fixed interval. The application waits the same amount of time between attempts, for example retrying the operation every 3 seconds. (The specific numbers in all the sample code below are for reference only.)

Strategies 1 and 2 are mostly used for interactive operations in front-end systems.
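
A minimal sketch combining strategies 1 and 2: one immediate retry, then a fixed interval, with the attempt cap and the 3-second figure chosen purely for illustration:

using System;
using System.Threading;

static class SimpleRetry
{
    // Strategy 1 + 2: the first retry is immediate, subsequent retries wait a fixed
    // 3 seconds, up to maxAttempts attempts in total.
    public static T Run<T>(Func<T> action, int maxAttempts = 4)
    {
        for (int attempt = 1; ; attempt++)
        {
            try { return action(); }
            catch (Exception) when (attempt < maxAttempts)
            {
                if (attempt > 1)                        // attempt 1 failed -> retry immediately
                    Thread.Sleep(TimeSpan.FromSeconds(3));
            }
        }
    }
}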

Strategy 3: Incremental interval. The interval between retries increases each time, for example 0 seconds the first time, 3 seconds the second time, then 6, 9, 12, 15 seconds.

return (retryCount - 1) * incrementInterval

The more times a request has failed, the lower the priority of its retry, which makes way for newer retry requests.

Strategy 4: Exponential interval. The retry interval grows exponentially. It shares the same goal as the incremental interval, deprioritizing requests that have failed more often, but the growth is steeper.

return 2 ^ retryCount

Strategy 5: Full jitter. On top of the exponential growth, add randomness. (The exponential growth can also be replaced with incremental growth.) It suits scenarios where a large number of retry requests generated at the same moment need to have their pressure spread out.

return random(0, 2 ^ retryCount)

Strategy 6: Equal jitter. A middle ground between "exponential interval" and "full jitter" that reduces the effect of randomness. The applicable scenarios are the same as for "full jitter".

var baseNum = 2 ^ retryCount;
return baseNum + random(0, baseNum)

The intervals produced by strategies 3, 4, 5, and 6 grow roughly as described above (in the original figure, the X axis is the number of retries).
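
To make strategies 3 to 6 easier to compare, here is a small sketch that turns the pseudo-formulas above into code (the 3-second increment and base 2 are the same reference numbers used above; random(0, x) is written with System.Random):

using System;

static class BackoffIntervals
{
    static readonly Random Rand = new Random();

    // Strategy 3: incremental interval, e.g. 0, 3, 6, 9 ... seconds.
    public static double Incremental(int retryCount, double incrementInterval = 3) =>
        (retryCount - 1) * incrementInterval;

    // Strategy 4: exponential interval, 2 ^ retryCount seconds.
    public static double Exponential(int retryCount) =>
        Math.Pow(2, retryCount);

    // Strategy 5: full jitter, random(0, 2 ^ retryCount).
    public static double FullJitter(int retryCount) =>
        Rand.NextDouble() * Math.Pow(2, retryCount);

    // Strategy 6: equal jitter, baseNum + random(0, baseNum).
    public static double EqualJitter(int retryCount)
    {
        var baseNum = Math.Pow(2, retryCount);
        return baseNum + Rand.NextDouble() * baseNum;
    }
}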

Why does "retry" have pitfalls?

As mentioned earlier, to keep development costs down, a "retry" usually reuses the interface that is called in the normal flow. That is when the question of "idempotency" has to be raised.

If the technical solution chosen to implement "retry" cannot guarantee 100% that a retry will never be initiated more than once, then "idempotency" must be considered. Even if the solution does guarantee that, it is still worth considering "idempotency" to guard against the unexpected.

Idempotency: no matter how many times the program is called repeatedly, the state of the program (all related data changes) is the same as after a single call; that is what guarantees idempotency.

This means the operation can be repeated or retried as needed without causing unintended effects. For non-idempotent operations, the program may have to track whether the operation has already been performed.

Therefore, once a function supports "retry", every interface on the whole call chain needs to consider idempotency: business data must not be cumulatively increased or decreased because the same service was called multiple times.

Meeting "idempotency" means finding ways to identify duplicate requests and filter them out. The idea is:

Define a unique identifier for each request.

During the "retry", determine whether the request has already been executed or is currently being executed; if so, discard it.

First, we can use a globally unique ID generator or an ID-generation service (further reading: "A necessary cure in distributed systems: globally unique ID generation"). Or, simply and crudely, use the Guid, UUID, and similar utilities that come with the standard class library.

Then, through the RPC framework, assign this unique identifier field to each request on the client side that initiates the call.

Second, on the server side we can use AOP to hook in checks before and after the actual processing logic.

The general code idea is as follows.

// [before method execution]
if (isExistLog(requestId)) {           // 1. Has this request been received before? (written at step 3)
    var lastResult = getLastResult();  // 2. Has the previous request finished processing? (written at step 4)
    if (lastResult == null) {
        var result = waitResult();     // suspend and wait for the in-flight processing to complete
        return result;
    } else {
        return lastResult;
    }
} else {
    log(requestId);                    // 3. record that this request has been received
}

// do something...

// [after method execution]
logResult(requestId, result);          // 4. record the result as well

If the "compensation" work is done through MQ, this can be handled directly in the SDK that wraps the MQ: assign the globally unique identifier on the producer side and deduplicate by that identifier on the consumer side.
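
A minimal sketch of that producer/consumer split. The Message shape and the in-memory "seen" set are assumptions for illustration; a real SDK would persist the deduplication state (for example in Redis or via a database unique key):

using System;
using System.Collections.Concurrent;

// Producer side: the wrapping SDK stamps every message with a globally unique id.
class Message
{
    public string RequestId { get; } = Guid.NewGuid().ToString("N");
    public string Body { get; set; }
}

// Consumer side: deduplicate by that id before invoking the business handler.
class DedupConsumer
{
    private readonly ConcurrentDictionary<string, byte> _seen =
        new ConcurrentDictionary<string, byte>();

    public void Handle(Message msg, Action<string> businessHandler)
    {
        if (!_seen.TryAdd(msg.RequestId, 0))
            return;                    // duplicate delivery: drop it
        businessHandler(msg.Body);     // first delivery: run the real logic
    }
}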

III. Best practices of "retry"

Let's go over some best practices Brother Z has accumulated (key points ahead), all about "retry", which is indeed the most commonly used solution at work.

"retry" is especially suitable for being "degraded" under high load, and of course it should also be affected by "current limiting" and "circuit breaker" mechanism. When the "spear" of "retry" is used with the "shield" of "current limit" and "fuse", the effect is the best.

The cost-benefit ratio of adding a compensation mechanism needs to be weighed. For unimportant problems, you should "fail fast" rather than "retry".

It is important to note that an overly aggressive retry strategy, such as intervals that are too short or too many retries, can adversely affect downstream services.

Be sure to define a termination strategy for retries (a small sketch follows this list of practices).

When the rollback process is difficult or costly, a long interval and a large number of retries are acceptable, as in the "saga" pattern often mentioned in DDD. The premise, however, is that other operations are not blocked because scarce resources are being held or locked: in a flow of steps 1, 2, 3, 4, 5, steps 3, 4, and 5 must not be stuck just because step 2 has not finished.
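
As promised above, a minimal sketch of a retry termination strategy, combining a cap on attempts with an overall time budget (both numbers are illustrative assumptions):

using System;

// Stop retrying after maxAttempts, or once the overall deadline has passed,
// whichever comes first.
class RetryBudget
{
    private readonly int _maxAttempts;
    private readonly DateTime _deadline;
    private int _attempts;

    public RetryBudget(int maxAttempts = 5, TimeSpan? totalBudget = null)
    {
        _maxAttempts = maxAttempts;
        _deadline = DateTime.UtcNow + (totalBudget ?? TimeSpan.FromMinutes(1));
    }

    // Call this before each retry; when it returns false, give up
    // (and fall back to rollback, alerting, or manual handling).
    public bool CanRetry() =>
        ++_attempts <= _maxAttempts && DateTime.UtcNow < _deadline;
}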

IV. Summary

In this article, we first talked about the significance of "compensation" and two ways to implement it: "rollback" and "retry".

Then we reminded you to pay attention to idempotency when doing a "retry", and Brother Z also gave a solution for it.

Finally, several best practices for "retry" summarized by Brother Z were shared.

I hope it will be helpful to you.

Question:

Have you ever had to do "compensation" manually yourself? Feel free to share your war stories ~

Brother Z himself has stayed up past midnight more than once cleaning up the mess caused by an "accident".
