
How to solve request invocation failures under a microservice architecture

2025-03-26 Update From: SLTechnology News&Howtos shulou


Shulou (Shulou.com) 06/03 report --

This article introduces how to solve request invocation failures under a microservice architecture. Many people run into this problem in real projects, so let's walk through how to deal with these situations. I hope you read it carefully and come away with something useful!

Sources of uncertainty introduced by microservices

Compared with a monolithic architecture, service invocation under a microservice architecture changes from local calls within the same machine to remote calls between different machines, which introduces the following uncertainties:

The call is executed by the service provider. Even if the service consumer itself is healthy, the provider may fail for reasons such as CPU, network I/O, disk, memory, or network-card problems, or because of issues in its own program execution such as GC pauses.

The call takes place between two machines and is transmitted over the network, which is beyond either side's control; packet loss, latency, or jitter may cause the call to fail.

Therefore, service invocation failures need special handling.

Timeout

Under microservices, a single user request may fan out into service calls across multiple systems, and a problem in any one of those calls may cause the user request to fail.

A problem in one system affects every service consumer that calls its services, which can lead to a service avalanche.

Therefore, a timeout should be set for each service call, to prevent a dependency that never returns a result from blocking the service consumer.

Choosing the timeout value

If it is too short, some calls may be cut off before they have had time to complete.

If it is too long, the service consumer may be blocked for too long and dragged down by the slow dependency.

A practical choice is to base the timeout on the provider's actual online service level, for example the number of milliseconds within which 99.9% or 99.99% of calls return.
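As a minimal sketch of setting a per-call timeout, the example below uses Java's built-in java.net.http.HttpClient; the endpoint URL and the 100 ms budget are illustrative assumptions, not values from the article.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class TimeoutCallExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(50))   // budget for establishing the connection
                .build();

        // Illustrative endpoint; replace with the real provider address.
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://provider.example/api/orders/1"))
                .timeout(Duration.ofMillis(100))         // overall per-call timeout, e.g. roughly the provider's P999
                .build();

        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("status = " + response.statusCode());
        } catch (HttpTimeoutException e) {
            // The provider did not answer within 100 ms: fail fast instead of blocking the consumer.
            System.out.println("call timed out, fail fast");
        }
    }
}
```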

Retry

Although setting a timeout stops the loss in time, the call still ends up failing. In most cases the failure is caused by a network problem or by an individual service provider node, so retrying against another node may well succeed.

If the failure probability of one service call is 1%, the probability of two consecutive calls both failing is 0.01%, so the failure rate drops to 1% of the original. For this reason a retry count is usually configured alongside the timeout.

For example, if the timeout of a service call is set to 100 ms and the retry count to 1, then when the call exceeds 100 ms the service consumer immediately initiates a second call instead of continuing to wait for the first one to return.
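A minimal sketch of combining a timeout with a retry count, using a plain ExecutorService; the 100 ms budget and single retry mirror the example above, and callProvider() is a hypothetical stand-in for the real remote call.

```java
import java.util.concurrent.*;

public class TimeoutRetryExample {
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Hypothetical stand-in for the real remote call to the service provider.
    static String callProvider() throws Exception {
        Thread.sleep(ThreadLocalRandom.current().nextInt(200)); // simulate variable latency
        return "ok";
    }

    // Try the call up to (1 + retries) times, giving each attempt timeoutMs to finish.
    static String callWithRetry(long timeoutMs, int retries) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= retries; attempt++) {
            Future<String> future = POOL.submit(TimeoutRetryExample::callProvider);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | ExecutionException e) {
                future.cancel(true); // stop waiting for the slow attempt and try again
                last = e;
            }
        }
        throw last; // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callWithRetry(100, 1)); // 100 ms timeout, 1 retry
        POOL.shutdown();
    }
}
```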

Double send

If the probability of one call failing is 1%, the probability of two simultaneous calls both failing is 0.01%. So a simple way to improve the success rate is for the service consumer to issue two calls at the same time for every request.

Of the two calls, whichever result returns first is used, so the average response time is also faster than that of a single call.

This is called "double send".

However, calling this way puts twice the pressure on the back-end service and doubles the resource consumption, so a "reckless" double send is not advisable.

The smarter form of double send is the backup request (Backup Requests).

After the service consumer initiates a service call, if no result has returned within a given period of time, the consumer immediately initiates a second call.

Note that this trigger time is usually much shorter than the timeout. For example, if the timeout is set at P999, the backup-request time might be P99 or P90: if a call has not returned within the P99 or P90 time, it is very likely a slow request, and initiating another call is, in theory, faster than continuing to wait.

In real online services, P999 can be much larger than P99 and P90 because of long-tail request latencies.

For example, if a service's P999 is 1 s while its P99 is only 200 ms and its P90 only 50 ms, then with the backup-request time set at P90 the wait before the second request is only 50 ms.

A backup request should also have a maximum retry ratio, to avoid the situation where a server-side problem pushes most response times past P90: the request volume would then nearly double and put even more pressure on the service provider.

A maximum ratio of around 15% retains most of the benefit of backup requests while not putting too much extra pressure on the service provider.
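A rough sketch of the backup-request idea using CompletableFuture; callProvider(), the 50 ms backup delay (the P90 from the example), and the 15% cap are illustrative assumptions, not a production implementation.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

public class BackupRequestExample {
    private static final ExecutorService POOL = Executors.newCachedThreadPool();
    private static final AtomicLong total = new AtomicLong();
    private static final AtomicLong backups = new AtomicLong();
    private static final double MAX_BACKUP_RATIO = 0.15; // cap backup requests at ~15% of traffic

    // Hypothetical stand-in for the real remote call.
    static String callProvider() {
        try {
            Thread.sleep(ThreadLocalRandom.current().nextInt(200)); // simulate variable latency
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "ok";
    }

    static String callWithBackup(long backupDelayMs) throws Exception {
        total.incrementAndGet();
        CompletableFuture<String> first = CompletableFuture.supplyAsync(BackupRequestExample::callProvider, POOL);
        try {
            // Fast path: the first call returns within the backup delay (e.g. the provider's P90).
            return first.get(backupDelayMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException slow) {
            // Only send a backup if we are still under the allowed ratio.
            if (backups.get() + 1 > total.get() * MAX_BACKUP_RATIO) {
                return first.get(); // over budget: keep waiting for the first call
            }
            backups.incrementAndGet();
            CompletableFuture<String> second = CompletableFuture.supplyAsync(BackupRequestExample::callProvider, POOL);
            // Use whichever of the two calls finishes first.
            return (String) CompletableFuture.anyOf(first, second).get();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callWithBackup(50)); // backup delay = P90 = 50 ms
        POOL.shutdown();
    }
}
```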

Circuit breaking

The measures above work well for occasional exceptions on the service provider, but if the provider fails and cannot recover quickly, they do not improve the success rate of service invocation, and the extra pressure from retries only makes the failure worse.

The service consumer therefore needs to detect provider failures, stop sending requests for a short period to give the provider time to recover, and resume calls only after the provider has recovered.

Principle

Each service call on the client side is wrapped in a circuit breaker, which monitors every call.

If the number of failed calls reaches a threshold within a given period of time, the circuit breaker trips, and subsequent calls return directly without a request being sent to the service provider.

Once the circuit breaker has tripped, how are calls resumed after the service provider recovers?

A Hystrix circuit breaker has three states: closed, open, and half-open.

Closed state

Under normal circumstances the circuit breaker is closed, and occasional call failures do not affect it.

Open state

When the number of service invocation failures reaches the threshold, the circuit breaker will be in the open state, and the subsequent service invocation will return directly without initiating a request to the service provider.

Half-open state

While the circuit breaker is open, it periodically enters the half-open state and sends a probe call to the service provider to check whether the provider has recovered.

If the call is successful, the circuit breaker is closed.

If it fails, the circuit breaker remains open and waits for the next cycle to re-enter the half-open state.
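As a toy sketch of these three states and their transitions; the failure-count threshold and sleep window are illustrative, and Hystrix itself decides based on the sliding-window failure rate described below rather than a simple counter.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy three-state circuit breaker: closed -> open -> half-open -> closed/open.
public class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that trip the breaker
    private final long sleepWindowMs;     // how long to stay open before probing
    private final AtomicInteger failures = new AtomicInteger();
    private volatile State state = State.CLOSED;
    private volatile long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMs) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMs = sleepWindowMs;
    }

    // Called before each request: may a call go out to the provider?
    synchronized boolean allowRequest() {
        if (state == State.OPEN && System.currentTimeMillis() - openedAt >= sleepWindowMs) {
            state = State.HALF_OPEN;  // let exactly this one probe call through
            return true;
        }
        return state == State.CLOSED; // open, or half-open with a probe in flight: reject
    }

    // Report the outcome of each call.
    synchronized void onSuccess() {
        failures.set(0);
        state = State.CLOSED;         // a successful probe (or normal call) closes the breaker
    }

    synchronized void onFailure() {
        if (state == State.HALF_OPEN || failures.incrementAndGet() >= failureThreshold) {
            state = State.OPEN;       // trip (or stay) open and start a new sleep window
            openedAt = System.currentTimeMillis();
        }
    }
}
```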

Hystrix wraps each service call in a HystrixCommand and records the outcome of every call in real time: success, failure, timeout, or thread-pool rejection.

When the failure rate of service invocation is higher than the threshold within a period of time, the circuit breaker of Hystrix will enter the open state, and the new service call will return directly and will not initiate a call to the service provider.

After the configured interval, the Hystrix circuit breaker enters the half-open state and new service calls are again sent to the service provider. If the failure rate over the following period is still above the threshold, the circuit breaker re-enters the open state; otherwise it resets to the closed state.

The failure-rate threshold that opens the circuit breaker is controlled by the following parameter:

HystrixCommandProperties.circuitBreakerErrorThresholdPercentage()

The interval after which an open circuit breaker enters the half-open state is controlled by the following parameter:

HystrixCommandProperties.circuitBreakerSleepWindowInMilliseconds()
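A minimal HystrixCommand sketch showing where these two properties are typically configured via HystrixCommandProperties.Setter; the group name "OrderService", the 100 ms / 50% / 5000 ms values, and the body of run() are illustrative assumptions.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

public class GetOrderCommand extends HystrixCommand<String> {

    public GetOrderCommand() {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("OrderService"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(100)               // per-call timeout
                        .withCircuitBreakerErrorThresholdPercentage(50)        // failure rate that opens the breaker
                        .withCircuitBreakerSleepWindowInMilliseconds(5000)));  // wait before entering half-open
    }

    @Override
    protected String run() throws Exception {
        // The actual remote call to the service provider goes here (illustrative).
        return "order-data";
    }

    @Override
    protected String getFallback() {
        // Returned when the call fails, times out, or the circuit breaker is open.
        return "fallback-order-data";
    }
}
```

A call is then made with new GetOrderCommand().execute(); while the breaker is open, the fallback is returned immediately without reaching the provider.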

The key to the circuit breaker is counting the failure rate of service calls over a recent period of time, which Hystrix does with a sliding window algorithm.

By default the sliding window contains 10 buckets, each covering 1 s, and each bucket records the number of successful, failed, timed-out, and rejected service calls in that second. When a new second begins, the window slides forward: the oldest bucket is discarded and a new bucket is added.

At any moment, Hystrix uses the failure rate of all service calls in the sliding window to decide the circuit breaker's state: the sum of the failed, timed-out, and thread-rejected calls recorded in the 10 buckets, divided by the total number of calls, is the window's failure rate.
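A rough sketch of such a 10-bucket, 1-second-per-bucket failure-rate counter; this is an illustration of the idea, not Hystrix's actual implementation.

```java
// Illustrative 10-bucket sliding window for call outcomes (not Hystrix's real code).
public class SlidingWindowCounter {
    private static final int BUCKETS = 10;          // 10 buckets ...
    private static final long BUCKET_MILLIS = 1000; // ... of 1 s each

    private final long[] bucketStart = new long[BUCKETS];
    private final long[] success = new long[BUCKETS];
    private final long[] failure = new long[BUCKETS]; // failures + timeouts + thread rejections

    // Map the current time to a bucket slot, resetting the slot if its second has passed.
    private int index(long now) {
        int i = (int) ((now / BUCKET_MILLIS) % BUCKETS);
        long start = (now / BUCKET_MILLIS) * BUCKET_MILLIS;
        if (bucketStart[i] != start) { // a new second: discard the stale bucket and reuse its slot
            bucketStart[i] = start;
            success[i] = 0;
            failure[i] = 0;
        }
        return i;
    }

    public synchronized void recordSuccess() { success[index(System.currentTimeMillis())]++; }

    public synchronized void recordFailure() { failure[index(System.currentTimeMillis())]++; }

    // Failure rate over the last 10 seconds: failures / total calls in live buckets.
    public synchronized double failureRate() {
        long now = System.currentTimeMillis();
        long ok = 0, bad = 0;
        for (int i = 0; i < BUCKETS; i++) {
            if (now - bucketStart[i] < BUCKETS * BUCKET_MILLIS) { // only count buckets inside the window
                ok += success[i];
                bad += failure[i];
            }
        }
        long total = ok + bad;
        return total == 0 ? 0.0 : (double) bad / total;
    }
}
```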

This is the end of "how to solve request invocation failures under a microservice architecture". Thank you for reading. If you want to learn more about the industry, follow this site; the editor will keep publishing practical articles for you!
