This article introduces how microservice overload protection works. Overload and the cascading failures it triggers are a dilemma many of us run into in practice, so let's walk through how to deal with them, using go-zero's implementation as the example. I hope you read it carefully and get something out of it!
In microservices, cascading failures occur easily because of the interdependence between services: the failure of one service in a call chain can bring down other parts of the system. For example, when one instance of a service fails due to overload, its load shifts onto the remaining instances, which then fail one after another like dominoes. This cascading failure is the so-called avalanche.
Concretely, suppose service A depends on service C, service C depends on service D, and service D depends on service E. When service E is overloaded, its responses slow down or it becomes unavailable. Caller D then accumulates a large number of timed-out connections whose resources are never released, so D becomes overloaded, then C, until the whole system avalanches.
Exhaustion of any resource can lead to high latency, high error rates, or responses that don't match expectations; that is what is supposed to happen when a resource runs out, and once load keeps climbing into overload, no server can stay completely healthy forever. In our work, overload caused by insufficient CPU is the most common case. If CPU resources cannot keep up with the request load, then generally speaking all requests slow down. Excessive CPU load causes a series of side effects, including the following:
An increase in the number of in-flight requests being processed
Request queues gradually fill up, which means higher latency and more memory consumed by the queues
Threads get stuck and cannot process requests
CPU deadlocks, or requests hang
RPC calls time out
The CPU cache hit rate drops
This shows that the importance of preventing server overload is self-evident. Common strategies for preventing server overload include the following:
Provide degraded results (a sketch follows this list)
Actively reject requests when overloaded
Have the caller actively reject requests
Load test in advance and plan capacity reasonably
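To make the first strategy concrete, here is a minimal sketch of serving a degraded result, independent of go-zero. The names fetchRecommendations and defaultRecommendations are illustrative, not real APIs:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var defaultRecommendations = []string{"top-seller-1", "top-seller-2"}

// fetchRecommendations stands in for a downstream RPC that may fail under load.
func fetchRecommendations(ctx context.Context, uid int64) ([]string, error) {
	return nil, errors.New("rpc: deadline exceeded") // simulate an overloaded dependency
}

// getRecommendations degrades gracefully: on error it serves a cheap static
// default instead of propagating the failure up the call chain.
func getRecommendations(ctx context.Context, uid int64) []string {
	ctx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
	defer cancel()

	items, err := fetchRecommendations(ctx, uid)
	if err != nil {
		return defaultRecommendations // degraded but usable result
	}
	return items
}

func main() {
	fmt.Println(getRecommendations(context.Background(), 42))
}
```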
What we mainly discuss today is the second strategy: actively rejecting requests when overloaded, which I will refer to simply as "overload protection" from here on. The general principle of overload protection is that when the server detects it is already overloaded, it actively rejects requests instead of processing them; the usual practice is to fail fast with an error.
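Before diving into go-zero, here is a minimal sketch of the idea in plain Go: an http middleware that fails fast with 503 when a CPU gauge crosses a threshold. The cpuPermille gauge and whatever samples it are assumed to exist elsewhere; go-zero's real implementation is far more adaptive, as we will see:

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// cpuPermille is a gauge of current CPU usage in per-mille (0-1000). How it
// gets updated is out of scope here; assume a background goroutine samples
// the OS and stores a smoothed value, similar to what go-zero does.
var cpuPermille int64

const cpuThreshold = 900 // reject when CPU usage reaches 90%

// withShedding wraps a handler and fails fast with 503 instead of queueing
// more work when the server is overloaded.
func withShedding(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.LoadInt64(&cpuPermille) >= cpuThreshold {
			http.Error(w, "service overloaded", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	})
	http.ListenAndServe(":8080", withShedding(mux))
}
```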
Overload protection is built into many microservice frameworks. This article mainly analyzes the overload protection in go-zero. Let's first get a feel for how it behaves through an example.
First, use the officially recommended goctl to generate an api service and an rpc service. Generating the services is straightforward and not covered here; refer to the official documentation. My environment consists of two servers: the api service runs locally, and the rpc service runs on a remote server.
The remote server has a single-core CPU. First, use the stress tool to simulate rising server load and saturate the CPU:
```
stress -c 1 -t 1000
```
At this point, check the server load with the uptime tool; the -d flag of watch highlights changes between refreshes. The load average is now greater than the number of CPU cores, which means the server is overloaded:
```
watch -d uptime
19:47:45 up 5 days, 21:55, 3 users, load average: 1.26, 1.31, 1.44
```
Now send requests to the api service, which internally depends on the rpc service. Looking at the rpc service's stat-level logs, we can see the CPU is quite high (go-zero reports CPU usage in per-mille, so 986 means 98.6%):
"level": "stat", "content": "(rpc) shedding_stat [1m], cpu: 986, total: 4, pass: 2, drop: 2"
Logs for requests dropped by overload protection are printed as well; you can see that overload protection has kicked in and requests are being actively dropped:
```
adaptiveshedder.go:185 dropreq, cpu: 990, maxPass: 87, minRt: 1.00, hot: true, flying: 2, avgFlying: 2.07
```
At this point, the caller receives a "service overloaded" error.
The experiment above shows that overload protection triggers when server load is too high, avoiding the cascading failures that lead to an avalanche. Next, let's analyze how it works from the source code. go-zero builds overload protection into both its http and rpc frameworks; the code lives under go-zero/rest/handler/sheddinghandler.go and go-zero/zrpc/internal/serverinterceptors/sheddinginterceptor.go respectively. Here we analyze the rpc side. When the server starts, a shedder is created (code path: go-zero/zrpc/server.go:119). Then, for each incoming request, the Allow method decides whether overload protection should trigger; if err is not nil, the server is overloaded and an error is returned directly:
```go
promise, err = shedder.Allow()
if err != nil {
	metrics.AddDrop()
	sheddingStat.IncrementDrop()
	return
}
```
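Note that Allow also returns a promise: the caller is expected to report the request's outcome back to the shedder so its statistics stay honest. A simplified sketch of this contract, adapted from the rpc interceptor (details may differ between go-zero versions, and the module path was github.com/tal-tech/go-zero in older releases):

```go
package shedding

import (
	"context"
	"errors"

	"github.com/zeromicro/go-zero/core/load"
)

// handleWithShedding shows the caller-side contract around Allow: if Allow
// errors, reject outright; otherwise resolve the promise with Pass or Fail.
func handleWithShedding(shedder load.Shedder, handle func() error) error {
	promise, err := shedder.Allow()
	if err != nil {
		return err // overloaded: fail fast without doing any work
	}

	err = handle()
	if errors.Is(err, context.DeadlineExceeded) {
		promise.Fail() // timed-out requests don't feed the pass/rt windows
	} else {
		promise.Pass() // records the response time and a pass
	}

	return err
}
```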
The code that implements overload protection lives at go-zero/core/load/adaptiveshedder.go. The implementation uses sliding windows to smooth out spikes, applies a cool-down period to avoid jitter, and starts rejecting requests when CPU > 90%. The implementation of Allow is as follows:
```go
func (as *adaptiveShedder) Allow() (Promise, error) {
	if as.shouldDrop() {
		as.dropTime.Set(timex.Now())
		as.droppedRecently.Set(true)

		return nil, ErrServiceOverloaded // return overload error
	}

	as.addFlying(1) // flying + 1

	return &promise{
		start:   timex.Now(),
		shedder: as,
	}, nil
}
```
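Where do flying and avgFlying come from? The promise decrements the in-flight counter when the request finishes, and a successful Pass also records the response time; addFlying maintains avgFlying as an exponential moving average of flying, updated only on completion so it lags slightly and stays smooth. Roughly, per the go-zero source at the time of writing (details may vary by version):

```go
func (p *promise) Fail() {
	p.shedder.addFlying(-1) // finished, but don't count toward pass/rt stats
}

func (p *promise) Pass() {
	rt := float64(timex.Since(p.start)) / float64(time.Millisecond)
	p.shedder.addFlying(-1)
	p.shedder.rtWindow.Add(math.Ceil(rt)) // feeds minRt
	p.shedder.passCounter.Add(1)          // feeds maxPass
}

func (as *adaptiveShedder) addFlying(delta int64) {
	flying := atomic.AddInt64(&as.flying, delta)
	// only update avgFlying when a request finishes, so the average
	// lags flying a little and is smoother
	if delta < 0 {
		as.avgFlyingLock.Lock()
		as.avgFlying = as.avgFlying*flyingBeta + float64(flying)*(1-flyingBeta)
		as.avgFlyingLock.Unlock()
	}
}
```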
The shouldDrop implementation is as follows. This function checks whether the conditions for triggering overload protection are met and, if so, logs the dropped request:
```go
func (as *adaptiveShedder) shouldDrop() bool {
	if as.systemOverloaded() || as.stillHot() {
		if as.highThru() {
			flying := atomic.LoadInt64(&as.flying)
			as.avgFlyingLock.Lock()
			avgFlying := as.avgFlying
			as.avgFlyingLock.Unlock()
			msg := fmt.Sprintf(
				"dropreq, cpu: %d, maxPass: %d, minRt: %.2f, hot: %t, flying: %d, avgFlying: %.2f",
				stat.CpuUsage(), as.maxPass(), as.minRt(), as.stillHot(), flying, avgFlying)
			logx.Error(msg)
			stat.Report(msg)
			return true
		}
	}

	return false
}
```
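Of the three checks, systemOverloaded is the CPU condition analyzed below, stillHot checks whether a request was dropped within a short cool-down window (one second by default), and highThru compares the in-flight count against the estimated capacity maxPass * windows * minRt / 1000, essentially a Little's-law bound derived from the sliding windows. Roughly, per the go-zero source (may differ slightly between versions):

```go
func (as *adaptiveShedder) highThru() bool {
	as.avgFlyingLock.Lock()
	avgFlying := as.avgFlying
	as.avgFlyingLock.Unlock()

	maxFlight := as.maxFlight()
	// drop only if both the smoothed and the instantaneous in-flight
	// counts exceed the estimated capacity
	return int64(avgFlying) > maxFlight && atomic.LoadInt64(&as.flying) > maxFlight
}

func (as *adaptiveShedder) maxFlight() int64 {
	// windows = buckets per second
	// maxQPS = maxPass * windows
	// minRt = minimum average response time in milliseconds
	// capacity = maxQPS * minRt / milliseconds_per_second
	return int64(math.Max(1, float64(as.maxPass()*as.windows)*(as.minRt()/1e3)))
}
```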
systemOverloaded determines whether CPU usage has reached the preset threshold, which defaults to 90%:
```go
systemOverloadChecker = func(cpuThreshold int64) bool {
	return stat.CpuUsage() >= cpuThreshold
}
```
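Since go-zero measures CPU in per-mille, the stat logs above show values like 986 and 990 against a default threshold of 900 (90%). If you need a different threshold, it can be passed as an option when building a shedder. A small hedged example (again, older versions live under github.com/tal-tech/go-zero):

```go
package main

import (
	"fmt"

	"github.com/zeromicro/go-zero/core/load"
)

func main() {
	// Start shedding at 80% CPU instead of the default 90%;
	// the threshold is expressed in per-mille (0-1000).
	shedder := load.NewAdaptiveShedder(load.WithCpuThreshold(800))

	promise, err := shedder.Allow()
	if err != nil {
		fmt.Println("dropped:", err) // load.ErrServiceOverloaded
		return
	}
	defer promise.Pass()

	fmt.Println("request admitted")
}
```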
The CPU load statistics code, from go-zero's core/stat/usage.go, looks roughly like this. Usage is sampled every 250ms (cpuRefreshInterval) and a statistics log line is printed once per minute (allRefreshInterval):
```go
func init() {
	go func() {
		cpuTicker := time.NewTicker(cpuRefreshInterval)
		defer cpuTicker.Stop()
		allTicker := time.NewTicker(allRefreshInterval)
		defer allTicker.Stop()

		for {
			select {
			case <-cpuTicker.C:
				threading.RunSafe(func() {
					curUsage := internal.RefreshCpu()
					prevUsage := atomic.LoadInt64(&cpuUsage)
					// cpuUsage = cpuUsage*beta + curUsage*(1-beta)
					usage := int64(float64(prevUsage)*beta + float64(curUsage)*(1-beta))
					atomic.StoreInt64(&cpuUsage, usage)
				})
			case <-allTicker.C:
				printUsage()
			}
		}
	}()
}
```
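One detail worth calling out: the sampled usage is folded into cpuUsage with an exponential moving average. Assuming beta is 0.95, as in the go-zero source at the time of writing, each new sample contributes only 5%, so the effective averaging window is roughly 1 / (1 - 0.95) = 20 samples, about 5 seconds at the 250ms sampling interval. That smoothing is why the reported CPU value, and therefore the shedding decision, does not flap on every momentary spike.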