The last big move to keep your system "strong": "downgrade"

You may already know something about downgrading. Even so, a careful read of this article may give you something new.

The previous two articles covered "circuit breaking" (How to "protect yourself" in a system full of thunder? This is the first move) and "rate limiting" (Want to go through customs and "limit the flow"? Just this one). This time we turn to the remaining member of the "three musketeers of high availability": "downgrading", also known as service degradation.

I don't know how many readers have worked with Alibaba's open platform. Alibaba issues an announcement like this before every big promotion.

▲ Announcement for the Double 12 promotion in 2018

These adjustments are "downgrading" work. The purpose is to free up more resources for core programs so that the availability of the core business is maximized, which is why some non-core business has to be downgraded.

I. What is "downgrading"?

The purpose of downgrading can be summed up in one sentence: get the maximum benefit out of limited resources.

What does maximizing the benefit mean? Take the following example:

Brother Z wants to buy three things: A for 3000 yuan, B for 700 yuan, and C for 1200 yuan. Their importance to Brother Z is A > B > C, but he has only 3000 yuan. How should he spend it to get the most out of his money? He should buy A.

According to the 80/20 rule, 80% of a system's benefit is generated by the core 20% of its functions, while the remaining 20% of the benefit takes 80% of the resources to achieve.

This means that if the system already spends 100% of its resources to do 100% of its work, it cannot withstand a threefold increase in traffic, which would need 300% of the resources. If we want the system to stay up and keep working without adding resources, we have to stop spending 80% of the resources on the remaining 20% of the benefit. Each request then costs only the 20% that the core functions need, so in theory the same 100% of resources can carry five times the original traffic. The side effect is that 80% of the functionality is lost.
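The arithmetic above can be checked with a short sketch. It assumes, as the 80/20 reasoning does, that the core functions account for 20% of the resources each request consumes. The function name and parameter are illustrative only.

```python
def traffic_multiplier(core_resource_share: float) -> float:
    """How many times the original traffic the same resources can carry
    when every request only pays for the core functions."""
    if not 0 < core_resource_share <= 1:
        raise ValueError("core_resource_share must be in (0, 1]")
    return 1 / core_resource_share


# Full functionality: each request costs the whole per-request budget.
full = traffic_multiplier(1.0)
# Downgraded: only the core functions run, assumed to cost 20% of the budget.
core_only = traffic_multiplier(0.2)
print(full, core_only)
```

With a 20% share the multiplier comes out at about five, which is where the "five times the original traffic" figure comes from.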

Of course, a real scenario would never go so far as to downgrade 80% of the features. After all, we still have to consider the user experience.

Take a typical e-commerce example: what matters most during a big promotion? Conversion, that is, making money. So if the "comments" feature takes up a lot of resources at that moment, what do you do with it? We can temporarily close the entry for submitting comments, turn off pagination, and so on, so that the order-placing flow has more resources available.

Downgrade schemes commonly take the following three forms.

Sacrifice user experience

To reduce fetching of "cold data", disable pagination of lists.

To slow down the rate of incoming traffic, add a CAPTCHA.

To keep "big queries" from wasting resources, tighten the filter requirements (disable fuzzy queries, make some conditions mandatory, and so on).

Replace personalized dynamic data with generic static data.

Even more blunt: put up a page saying "XX function is temporarily turned off during XX time".

Solutions like these degrade the user experience to some extent, but during certain periods some features are simply not must-haves. Trading them for the protection of the system is a good deal.

Sacrifice functional integrity

Some features are "defensive". If you are willing to run unprotected for a while, turning them off also saves considerable resources.

For example, temporarily turn off "risk control", or skip some of the "are the conditions met" checks (such as checking whether the user has enough points when an item paid for with points is added to the shopping cart), to release the resources those verification steps consume.

Or stop collecting logs at the info and warning levels and collect only the error and fatal levels.
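As a minimal sketch of the logging case, here is how the threshold could be raised with Python's standard logging module. The logger name "orders" and the helper function are made up for illustration; a real system would more likely flip this through its configuration center.

```python
import logging


def apply_log_downgrade(logger: logging.Logger, degraded: bool) -> None:
    """Collect only ERROR and above while degraded, INFO and above otherwise."""
    logger.setLevel(logging.ERROR if degraded else logging.INFO)


logger = logging.getLogger("orders")  # hypothetical logger name
apply_log_downgrade(logger, degraded=True)
logger.info("cart updated")      # dropped while degraded
logger.error("payment failed")   # still collected
```

Calling apply_log_downgrade(logger, degraded=False) restores the normal level once the pressure is gone.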

Sacrifice timeliness

We intuitively expect to see the effect of an action immediately after it happens. But as a previous article (Distributed system concern-data consistency (part I)) explained, nothing is truly "real-time" once network transmission is involved. Even so, we put a lot of effort into reflecting processed results in the relevant places as soon as possible, for example by synchronizing inventory promptly.

If the timeliness requirement can be relaxed temporarily during a special period (say, from taking effect within 3 seconds to within 30 seconds), that is also a good option.

For example, a product page that used to show exactly how much inventory is left can be adjusted to show just "in stock".

Operations that already run asynchronously can also be slowed down, or even suspended for a while, such as issuing points and coupons.
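One way to picture the "3 seconds to 30 seconds" trade is a cached value whose allowed age grows while the system is degraded. The sketch below is only an illustration under that assumption. StockView, its parameters, and the injected clock are all invented names, not something from the original system.

```python
import time
from typing import Callable, Optional, Tuple


class StockView:
    """Serves a cached stock count; the cache may be older while degraded."""

    def __init__(self, load_stock: Callable[[], int],
                 normal_ttl: float = 3.0, degraded_ttl: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load_stock
        self._ttl = {False: normal_ttl, True: degraded_ttl}
        self._clock = clock
        self._cached: Optional[Tuple[float, int]] = None  # (loaded_at, count)

    def get(self, degraded: bool) -> int:
        now = self._clock()
        # Reload only when nothing is cached or the cached value is too old
        # for the current mode.
        if self._cached is None or now - self._cached[0] >= self._ttl[degraded]:
            self._cached = (now, self._load())
        return self._cached[1]
```

The expensive load_stock call then runs at most once every 30 seconds in degraded mode instead of once every 3 seconds.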

With all that said, how do we put downgrading into practice?

II. How to "downgrade"

It breaks down into two steps: grading and sequencing, then carrying out the downgrade.

Grading and sequencing

As in the earlier example, we first have to determine the "importance" of each feature. This decides at what point it can be given up so that the remaining features stay available.

This is similar to defining log levels. For example, we can define five levels from 1 to 5, with level 1 the highest, to be protected at all costs, and level 5 the lowest, the first to be downgraded.

Once the system is under too much pressure, downgrade the level 5 functions first. If that is not enough, downgrade level 4, then level 3, and so on.

But in practice, levels alone are not enough. Suppose 100 functions are defined as level 4. Do you downgrade them all together when level 4 has to go? Obviously that granularity is too coarse.

If "grading" is cutting the cake horizontally, "ordering" is cutting it again vertically.

Within each level we can also define sequence numbers, for example 1 to 9, where 9 is the first to be downgraded.

The number of upstream programs or functions that each program supports can then serve as a reference. Take two level 5 programs, one supporting 5 upstream functions and the other supporting 10. The former should clearly get the larger sequence number and be downgraded first.

Of course, counting supported functions is only a generic, business-agnostic approach. To refine it, you also need to analyze the weight of each function, because different functions differ in relative importance. (For further reading, see the Analytic Hierarchy Process, or AHP for short.)
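The grading and sequencing rules described so far can be sketched as a sort. This is only an illustration of the generic approach: lowest level first, and within a level the program with fewer upstream functions first. The Feature class and the sample features are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Feature:
    name: str
    level: int           # 1 = most important, 5 = downgraded first
    upstream_count: int  # number of upstream functions this feature supports


def downgrade_order(features: List[Feature]) -> List[Feature]:
    """Return the features in the order they would be downgraded."""
    # Larger level number goes first; within a level, fewer upstream
    # functions goes first.
    return sorted(features, key=lambda f: (-f.level, f.upstream_count))


features = [
    Feature("order", 1, 12),
    Feature("recommendations", 5, 10),
    Feature("comments", 5, 5),
    Feature("points", 4, 3),
]
for feature in downgrade_order(features):
    print(feature.name)
```

A weighted scheme such as AHP would replace the simple upstream_count key with a computed importance score.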

By the way, one thing needs special attention when grading and sequencing: a downstream program that a program depends on must not be graded lower than the program itself.

Why? Because once the dependency is downgraded, every upstream program it supports becomes unavailable, and then it no longer matters how high those upstream programs are graded.
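This rule is easy to check mechanically. The sketch below assumes the grading is kept as a name-to-level mapping and the dependencies as a name-to-list mapping. Both structures and all program names are invented for the example.

```python
from typing import Dict, List, Tuple


def find_level_violations(levels: Dict[str, int],
                          depends_on: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (program, dependency) pairs where the dependency is graded as
    less important (larger level number) than the program that needs it."""
    violations = []
    for program, dependencies in depends_on.items():
        for dependency in dependencies:
            if levels[dependency] > levels[program]:
                violations.append((program, dependency))
    return violations


levels = {"checkout": 1, "inventory": 1, "coupon": 3, "recommend": 5}
depends_on = {"checkout": ["inventory", "coupon"], "recommend": []}
print(find_level_violations(levels, depends_on))
```

Here "coupon" is flagged because it is level 3 while "checkout", which depends on it, is level 1. Either "coupon" has to be raised to level 1 or "checkout" has to be able to work without it.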

At this point the "deployment of troops" is complete. The next step is to "carry out the operation".

Implementing the downgrade

The first step is to define a trigger mechanism. As with circuit breaking and rate limiting, when to trigger a "downgrade" depends on strategies defined in advance. This part is much the same as in the previous two articles: it comes down to the timeout rate, the error rate, or the system's resource consumption, so it is not repeated here.
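For readers who have not seen the earlier articles, a trigger based on the error rate might look roughly like the sketch below. It is not taken from those articles. The class name, the window of the last 100 calls, and the 50% threshold are all assumptions chosen for illustration.

```python
from collections import deque


class ErrorRateTrigger:
    """Signals a downgrade when the failure ratio over the most recent
    calls reaches a threshold."""

    def __init__(self, window: int = 100, threshold: float = 0.5) -> None:
        self._results = deque(maxlen=window)  # True = success, False = failure
        self._threshold = threshold

    def record(self, ok: bool) -> None:
        self._results.append(ok)

    def should_downgrade(self) -> bool:
        if not self._results:
            return False
        failures = sum(1 for ok in self._results if not ok)
        return failures / len(self._results) >= self._threshold
```

The same shape works for timeout rate or resource usage: record a sample, compare the recent ratio with a threshold fixed in advance.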

Once the program finds that the downgrade condition has been met and it has entered "downgrade mode", how should it handle requests?

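// Global variables
int _runLevel = 3; // level the system is currently running at; default 5
int _runIndex = 7; // sequence number the system is currently running at; default 9 (the maximum)

// Below is an example for a function with level = 4 and index = 8.
// The level is compared first and the sequence number only breaks ties,
// so a level 5 function is downgraded here whatever its sequence number.
if (myLevel > _runLevel || (myLevel == _runLevel && myIndex > _runIndex)) {
    // enter degraded mode
} else {
    // do something...
}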

A digression: doing the above if check through AOP plus annotations (attributes) is a neat approach.
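In a language with decorators, the same idea can be sketched as follows. This is an illustration only, not the author's implementation. The run_state dictionary, the downgradable decorator, and load_comments are hypothetical names, and the check assumes the level is compared first with the sequence number breaking ties.

```python
import functools
from typing import Any, Callable

# Current run position; the defaults (5, 9) mean nothing is downgraded.
run_state = {"level": 5, "index": 9}


def is_downgraded(level: int, index: int) -> bool:
    """Level is compared first; the sequence number only breaks ties."""
    return (level, index) > (run_state["level"], run_state["index"])


def downgradable(level: int, index: int, fallback: Callable[..., Any]):
    """Decorator: call `fallback` instead of the function while it is downgraded."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if is_downgraded(level, index):
                return fallback(*args, **kwargs)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@downgradable(level=4, index=8, fallback=lambda product_id: [])
def load_comments(product_id: int) -> list:
    return [f"comment for {product_id}"]  # stand-in for a real query
```

The business function stays free of downgrade logic; lowering run_state to level 3, index 7 makes load_comments return the empty fallback list.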

There are many ways to handle a request in downgrade mode, but one point deserves emphasis: the downgrade strategy should be as simple as possible. Because of diminishing marginal returns, complicating things to handle rare situations costs more than it gains.

As for implementation, on the front end the more common approaches are:

Set Cache-Control in the HTTP response so that subsequent requests are served directly from the browser cache.

Skip loading the data that the page would normally load asynchronously.

Disable some of the action buttons, or simply show a "temporarily closed" notice.

Switch the URL of a dynamic page to a static page through a proxy.

Except for disabling buttons, most of these can be handled in the access layer, such as nginx, which avoids intruding on business code.
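To make the Cache-Control item concrete, here is a small sketch of choosing the header by mode. In practice this decision would more likely sit in the access layer than in application code. The function name and the 60 second lifetime are assumptions.

```python
from typing import Dict


def response_headers(degraded: bool, max_age_seconds: int = 60) -> Dict[str, str]:
    """Pick a Cache-Control header for a page depending on the current mode."""
    if degraded:
        # Let browsers reuse their copy for a while instead of asking again.
        return {"Cache-Control": f"public, max-age={max_age_seconds}"}
    # Normal mode: the browser must revalidate before reusing its copy.
    return {"Cache-Control": "no-cache"}


print(response_headers(degraded=True))
```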

On the back end, for "read" operations, the "// enter degraded mode" branch can be written as follows:

For a method with no return value: return immediately or throw an exception.

For a method with a return value: return locally mocked data or throw an exception.

If the back end uses middleware (an RPC framework, an MQ proxy, and so on), it is best to handle this directly in the middleware, which usually provides a fallback interface to implement. This also avoids intruding on business code.

Finally, let's look at "write" operations on the back end.

Caches are regulars in large systems. As a system grows, the pursuit of better performance and lower cost inevitably adds complexity in the form of multi-level caching. The result is a layered chain: local cache --> distributed cache --> DB / source service.

The usual code might look like this:

if (writeDatabase(data) == true) {
    if (writeDistributedCache(data) == true) {
        writeLocalCache(data);
        return success;
    } else {
        rollbackDatabase(data);
        return fail;
    }
} else {
    return fail;
}

During a period of high load we can relax the consistency requirement and turn the time-consuming step of persisting the data into an asynchronous operation.

if (writeDistributedCache(data) == true) {
    writeLocalCache(data);
    pushMessage(data); // the message can go through a centralized MQ or be written directly to local disk
    return success;
} else {
    return fail;
}

If possible, this can go even further: synchronization to the distributed cache can also be made asynchronous.

writeLocalCache(data);
pushMessage(data); // the message can go through a centralized MQ or be written directly to local disk
return success;
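To show where the deferred database write ends up, here is a runnable sketch of the middle variant, with caches written synchronously and the database written later. Plain dictionaries and an in-process queue stand in for the real cache clients, database, and MQ. All names are made up for the example.

```python
import queue
from typing import Dict

local_cache: Dict[str, str] = {}
distributed_cache: Dict[str, str] = {}   # stand-in for a remote cache client
database: Dict[str, str] = {}            # stand-in for the real database
pending_writes: "queue.Queue[tuple]" = queue.Queue()  # stand-in for MQ or a local log


def write_degraded(key: str, value: str) -> bool:
    """Degraded write path: update the caches now, persist later."""
    distributed_cache[key] = value
    local_cache[key] = value
    pending_writes.put((key, value))
    return True


def drain_pending_writes() -> int:
    """Consumer side: persist queued writes and return how many were written."""
    written = 0
    while True:
        try:
            key, value = pending_writes.get_nowait()
        except queue.Empty:
            return written
        database[key] = value
        written += 1
```

The caller gets its answer as soon as the caches are updated. The price is a window in which the database lags behind, which is exactly the consistency being traded away.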

The database is the last bastion of the system. In very extreme cases, we can disable some "write" operations in the database access framework and give all resources to "reads", so that from the outside the system is at least "still standing", even though many operations return failure (when there really is no other way, just hold on).
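A switch like that could be sketched in a data access layer as follows. DataAccess and WritesDisabledError are hypothetical, and a dictionary stands in for the database.

```python
from typing import Dict, Optional


class WritesDisabledError(RuntimeError):
    """Raised when a write is attempted while writes are switched off."""


class DataAccess:
    """Hypothetical data access layer with a switch that rejects writes."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}  # stand-in for the real database
        self.writes_enabled = True

    def read(self, key: str) -> Optional[str]:
        return self._rows.get(key)

    def write(self, key: str, value: str) -> None:
        if not self.writes_enabled:
            raise WritesDisabledError("writes are disabled during downgrade")
        self._rows[key] = value


dao = DataAccess()
dao.write("order-1", "paid")
dao.writes_enabled = False    # extreme downgrade: reads only
print(dao.read("order-1"))    # reads keep working
```

Because the switch lives in the access layer, callers only need to handle one well-defined failure instead of each feature carrying its own check.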

III. Summary

So far we have covered the idea behind downgrading and some of the most common ways to implement it, but doing downgrading well is a long road.

From the perspective of planning, if functions or programs are downgraded one at a time, then in theory 10 function points can be ordered in P(10, 10) = 10! = 3628800 ways.
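The count is simply the number of permutations of the functions. A quick check, with a helper name chosen for illustration:

```python
import math


def downgrade_orderings(function_count: int) -> int:
    """Number of distinct orders in which the functions can be downgraded one by one."""
    return math.factorial(function_count)


for n in (3, 5, 10):
    print(n, downgrade_orderings(n))
```

The number grows so fast that trying every ordering is out of the question, which is why grading and sequencing by rule matters.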

From a practical perspective, traffic is unpredictable. A feature may need to be treated as level 2 this time but as level 3 next time.

So this is a process that needs continuous polishing and tuning over a long time.

Finally, I hope this recent "three musketeers of high availability" series can serve as a starting point for understanding "high availability". Bookmark it for your own protection first (sharing it is of course welcome too), and feel free to discuss it with me later.

Question:

Have you ever had to "downgrade" by changing the code on the spot? Feel free to vent ~

How to "protect yourself" in a system full of thunder? This is the first move.

Want to go through customs and "limit the flow"? Just this one.

Distributed system concern-data consistency (part I)

Author: Zachary

Source: https://www.cnblogs.com/Zachary-Fan/p/degradation.html

▶ About the author: Zhang Fan (Zachary, personal WeChat account: Zachary-ZF). Committed to polishing every article to a high, original standard.

