
What is the essence of server high availability

2025-07-19 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)05/31 Report--


First, what is high availability?

High availability is the ability to control risk.

High availability is a risk-oriented design that enables a system to control risk and thereby deliver higher availability.

Second, why be highly available?

For a company, "why be highly available" can be read as "why does the company want its systems to be highly available". Taking the company as the object of analysis: from the inside, it consists of people, software (things) and hardware (things); from the outside, it faces customers, shareholders and society; and from its own point of view, there is the company itself.

The premise of high availability: nothing is 100% reliable

Everything changes (the only constant is change).

No change is 100% reliable.

Conclusion: nothing is 100% reliable.

Internal cause: people and things are not 100% reliable.

At the human level: people make mistakes.

At the software level: software can have bugs.

At the hardware level: hardware can break.

From the point of view of probability, as long as there are enough changes, the probability that at least one of them goes wrong approaches 1.
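As a rough illustration of that last point, assume every change fails independently with the same small probability p. That independence is a simplifying assumption made here, not a claim from the article. The probability that at least one of n changes fails is then 1 - (1 - p)^n, which the hypothetical helper `failure_probability` computes:

```python
def failure_probability(p_single: float, n_changes: int) -> float:
    """Probability that at least one of n independent changes fails.

    Assumes every change fails with the same probability p_single
    and that failures are independent of each other.
    """
    return 1.0 - (1.0 - p_single) ** n_changes


# Even a 0.1% failure rate per change adds up as changes accumulate.
for n in (1, 100, 1_000, 10_000):
    print(n, round(failure_probability(0.001, n), 4))
```

The value grows monotonically with n and gets arbitrarily close to 1, which is the sense in which "nothing is 100% reliable" becomes a practical certainty at scale.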

External cause: without high availability, the external impact is large

At the customer level: without high availability, customer service may be interrupted.

At the shareholder level: without high availability, the stock price may fall.

At the social level: without high availability, social order may be affected.

Root cause (essence): risk control

From the point of view of the company itself: control risk, protect the company's value, and avoid damage to its foundations.

Third, how to achieve high availability

How to achieve high availability is, in essence, how to control risk.

1 Risk-related concepts

Risk: the possibility that harm will occur in the future; it has not actually occurred yet. Denoted r.

Failure: harm that has occurred or is occurring; it is the result of a risk becoming reality.

Risk probability: the probability that a risk turns into a failure. It expresses how easily the risk is triggered as a failure. Denoted P(r).

Failure impact scope: the harm caused by a failure per unit of time. Denoted R(r).

Failure duration: how long a failure lasts. Denoted T(r).

Failure impact surface: the failure impact scope accumulated over the failure duration (scope multiplied by time). It expresses the total damage done by a failure. Denoted F(r).

Risk expectation: the sum, over all risks, of each risk's probability of turning into a failure multiplied by its failure impact surface. It expresses the potential harm of the risks. Denoted E(r).

2 Formula for risk expectation

From the definitions in the previous section, the formula for risk expectation can be derived as follows:

$$E(r) = \sum_{i=1}^{n} P(r_i) \cdot F(r_i) = \sum_{i=1}^{n} P(r_i) \cdot R(r_i) \cdot T(r_i)$$

Here r_i denotes the i-th risk. Risk expectation decreases as the number of risks n decreases and as the P, R and T of each risk decrease; this is referred to as the nPRT formula.

Note: if you want to quote this formula, please indicate the source.
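A minimal sketch of how the nPRT formula could be evaluated, assuming that risk expectation is the sum over all risks of P × R × T, as the definitions in section 1 suggest. The `Risk` class, its field names and the sample numbers are all hypothetical and chosen only for illustration:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Risk:
    name: str
    p: float  # P(r): probability that the risk turns into a failure
    r: float  # R(r): harm per unit of time, e.g. fraction of users affected
    t: float  # T(r): failure duration, e.g. in minutes

    @property
    def impact_surface(self) -> float:
        """F(r) = R(r) * T(r): total damage if the failure happens."""
        return self.r * self.t


def risk_expectation(risks: Iterable[Risk]) -> float:
    """E(r): sum of P(r_i) * F(r_i) over all risks."""
    return sum(x.p * x.impact_surface for x in risks)


# Illustrative numbers only, not measurements.
risks = [
    Risk("bad config push", p=0.10, r=1.0, t=60),
    Risk("single shard outage", p=0.01, r=0.5, t=30),
]
print(risk_expectation(risks))
```

Removing a risk from the list lowers n, and lowering any of `p`, `r` or `t` for a risk lowers its term, which mirrors the four factors discussed next.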

3 Four factors for controlling risk (nPRT)

Reduce the number of risks, n

Stay away from the risk at the source, so that there is no connection at all with whatever carries the risk. The risk probability is then 0, and it no longer matters how large or small the failure impact surface would have been.

For example: during major festivals a site-wide change freeze is enforced, so the number of changes drops significantly. This is a typical case of reducing the number of risks.

For example: system A does not depend on Oracle at all, so system A does not need to care about any Oracle-related risk. Even if the President of the United States suddenly announced that Oracle was banned from use in China with immediate effect, system A would not be affected.

For example: with COVID-19 spreading and human-to-human transmission a serious concern, if you choose not to go to work today, you do not have to worry about being infected by passers-by or colleagues today.

Reduce the probability of a risk turning into a failure (i.e., make it harder for the risk to become a failure), P

Treat the risk as the object and set up checkpoints layer by layer, raising the threshold for the risk to turn into a failure. Do not let tragedies such as "one accidental extra space or character brings the system down" happen easily.

For example: person B wants to make a change to system C. Person B can be required to pass a change certification exam, to test the change offline (or in simulation), and to have the change go through code review (CR). System C can provide a way to preview the effect of the change (similar to a monitor mode or trial run). In case person B intends to sabotage the system with a malicious change, review by a party other than person B can be added, and system C can add error-prevention design to protect itself, and so on.

For example: with COVID-19, wearing a mask, washing hands frequently and ventilating more reduce the probability of infection.

Reduce the scope of failure impact, R

Split the whole into N small units that are isolated from one another, so that a problem in one unit affects only that unit: small and beautiful.

For example: distributed architecture. A failure in a centralized system affects the whole system, whereas a failure of one part of a distributed system affects only that part, roughly 1/N of the whole.

For example: with COVID-19, grid-based management, restrictions on movement between provinces or cities, and nucleic acid testing plus 14 days of isolation for cross-province travel effectively controlled the spread.

Shorten the duration of failure impact, T

The duration of a failure's impact is determined by the time to detect the failure and the time to stop the bleeding, so failures must be detected early and the bleeding stopped early.

Detection can be divided into warning before the event (early warning) and alerting after the event. Warn ahead of time as much as possible to buy time for stopping the bleeding, or even to nip the risk in the bud.

Ways to stop the bleeding include switching, rollback, capacity expansion, degradation or rate limiting, bug fixing, and so on. When a failure occurs, the first priority is to stop the bleeding quickly (for example by switching, rollback or capacity expansion); do not spend time locating the root cause first. When the bleeding cannot be stopped quickly, the second priority is to bleed less, for example through degradation and rate limiting.

Efficiency of stopping the bleeding: automatic vs. manual; one-click vs. multi-step. Use automation instead of manual operation wherever possible; where manual operation is unavoidable, make it one-click as far as possible to speed up stopping the bleeding.

For example: for capacity watermarks, an early-warning line can be drawn ahead of the alarm line, so that the warning comes early and can be handled calmly.

For example: in a distributed application cluster, when any application server has a problem, the load balancer uses heartbeat checks to automatically remove the faulty server and forwards requests to the other (hot-standby) redundant servers.

For example: with COVID-19, because every life is unique, there is no way to switch, no way to roll back, and no way to degrade (for humanitarian reasons), so the only option is slow treatment.
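The load-balancer example above (heartbeat check, automatic removal, forwarding to the remaining servers) can be sketched in a few lines. This is a toy model under stated assumptions, not how any particular load balancer works: the `LoadBalancer` class, the `is_alive` probe callback and the server names are all hypothetical.

```python
import itertools
from typing import Callable, Iterable, List


class LoadBalancer:
    """Round-robin balancer that drops servers failing a heartbeat probe."""

    def __init__(self, servers: Iterable[str], is_alive: Callable[[str], bool]):
        self.servers: List[str] = list(servers)
        self.is_alive = is_alive          # hypothetical heartbeat probe
        self.healthy: List[str] = list(self.servers)
        self._counter = itertools.count()

    def heartbeat_check(self) -> None:
        """Re-probe every server; recovered servers are added back automatically."""
        self.healthy = [s for s in self.servers if self.is_alive(s)]

    def pick(self) -> str:
        """Return the next healthy server in round-robin order."""
        if not self.healthy:
            raise RuntimeError("no healthy server available")
        return self.healthy[next(self._counter) % len(self.healthy)]


down = {"app-2"}  # simulate one faulty application server
lb = LoadBalancer(["app-1", "app-2", "app-3"], lambda s: s not in down)
lb.heartbeat_check()
print([lb.pick() for _ in range(4)])
```

Because the switch is automatic, T is bounded mainly by the heartbeat interval rather than by how fast a person can react.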

4 Seven core principles of highly available architecture design

Based on the nPRT formula, there are seven core principles for highly available architecture design:

Less-dependence principle: if a dependency can be avoided, avoid it; the fewer the better (n)

Since nothing is 100% reliable, when two things are related they influence each other and each is a risk to the other: a problem in one may affect the other. "Dependency" here refers to this relationship.

For example: a system depends on three relational databases at the same time, Oracle, MySQL and OB. Applying the less-dependence principle, it is changed to depend only on the most mature and stable one, OB, and no longer on Oracle or MySQL.

When is depending on more things appropriate?

When introducing a dependency (n becomes larger) reduces one or more of P, R and T and makes E(r) decline overall.

For example: to address the risk of the DB, a distributed cache is introduced; the system remains available as long as the two do not go down at the same time.

Weak-dependence principle: if a dependency is unavoidable, make it as weak as possible; the weaker the better (P)

If thing a strongly depends on thing b, then once b goes wrong, a goes wrong too: when one falls, both fall.

Therefore, any strong dependency should be turned into a weak dependency wherever possible, which directly reduces the probability of problems.

For example: after a transaction succeeds, the core transaction link needs to grant points and benefits to the user, so the core transaction system has to depend on the points-and-benefits system. A good approach is a weak, asynchronous dependency, so that when the points-and-benefits system is unavailable, the core transaction link is most likely unaffected.
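One way to picture the weak, asynchronous dependency in the transaction example is a queue between the two systems. The sketch below is only an assumption about how such a hand-off might look: `complete_transaction`, `drain_points` and the in-process `pending_points` queue are hypothetical names, and a real system would more likely use a durable message queue.

```python
import queue
from typing import Callable, Dict

# In-process stand-in for a durable message queue (assumption for the sketch).
pending_points: "queue.Queue[tuple]" = queue.Queue()


def complete_transaction(user_id: str, amount: int) -> Dict[str, object]:
    """Core link: succeeds without calling the points system directly."""
    order = {"user_id": user_id, "amount": amount, "status": "SUCCESS"}
    pending_points.put((user_id, amount))  # asynchronous hand-off
    return order


def drain_points(grant: Callable[[str, int], None]) -> int:
    """Worker: grant queued points; re-queue items whose grant call failed.

    Returns the number of items that failed and were kept for a later retry.
    """
    failed = []
    while not pending_points.empty():
        item = pending_points.get_nowait()
        try:
            grant(*item)
        except Exception:
            failed.append(item)  # points system unavailable: retry later
    for item in failed:
        pending_points.put(item)
    return len(failed)
```

With this shape, an outage of the points system delays the points but does not fail the payment; the transaction only depends on being able to enqueue the task.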

Dispersion principle: do not put all your eggs in one basket; spread the risk (R)

Break the whole up into N parts; avoid having a single part that is the whole, otherwise the scope of impact is 100% when something goes wrong.

For example: all transaction data is stored in one table of one database; if that database fails, all transactions are affected.

For example: you put all your money into a single stock; if that stock is LeTV, the result is miserable.

Equilibrium principle: spread the risk evenly and avoid imbalance (R)

Ideally each of the N parts is the same size; avoid one part being too large, otherwise a problem in that part has too large an impact.

For example: application xx runs on a cluster of 1,000 servers, but because of a bug in the traffic-routing component, all traffic is directed to just over 100 of them. The load becomes severely unbalanced and those servers collapse because they cannot carry it. Major failures of this kind have occurred many times.

For example: you buy 10 stocks with all your money, but one of them accounts for 99%; if that stock is LeTV, the result is miserable.

Isolation principle: keep the risk from spreading; do not let it be magnified (R)

Each part is isolated from the others; avoid a problem in one part affecting the others and widening the scope of impact.

For example: transaction data is split into 10 databases and 100 tables, but they are all deployed on the same physical machine; if a heavy SQL query on one table saturates the network card, all 10 databases and 100 tables are affected.

For example: you split all your money equally across 10 stocks, 10% each, but all 10 are LeEco stocks.

For example: the ancient Battle of Chibi is a classic negative example. Chaining the ships together with iron destroyed their isolation, and a single fire burned an army of 800,000.

Isolation has levels: the higher the isolation level, the harder it is for risk to spread and the stronger the disaster recovery capability.

For example: an application cluster consists of N servers, which may be deployed on the same physical machine, on different physical machines in the same machine room, in different data centers in the same city, or in different cities. Different deployments mean different disaster recovery capabilities.

For example: humanity is made up of countless people living on different continents of the same Earth, which means humanity has no isolation at the planetary level; if the Earth suffers a devastating blow, humanity has no disaster recovery.

The isolation principle is extremely important and is the premise of the previous four principles. Without good isolation, those four principles are fragile: risk spreads easily and undermines their effect. A large number of real system failures are caused by poor isolation, such as offline workloads affecting online services, pre-release environments affecting production, one bad SQL statement affecting an entire database (or an entire cluster), and so on.

Dispersion, equilibrium and isolation are the three core principles for controlling the scope of risk impact. Break the whole up into N parts, keep each part balanced and isolated from the others, and when one part has a problem the scope of impact is 1/N.
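To make the 1/N figure concrete, the sketch below spreads keys across N shards with a hash and measures what fraction of keys a single failed shard would affect. The function names, the choice of SHA-256 and the `order-…` keys are assumptions for illustration; the article does not prescribe a sharding scheme.

```python
import hashlib
from typing import Sequence


def shard_for(key: str, n_shards: int) -> int:
    """Map a key to a shard with a stable hash, so the split stays balanced."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def impact_fraction(keys: Sequence[str], n_shards: int, failed_shard: int) -> float:
    """Fraction of keys that live on the failed shard."""
    hit = sum(1 for k in keys if shard_for(k, n_shards) == failed_shard)
    return hit / len(keys)


keys = [f"order-{i}" for i in range(10_000)]
print(impact_fraction(keys, n_shards=1, failed_shard=0))   # everything on one shard
print(impact_fraction(keys, n_shards=10, failed_shard=3))  # close to 1/10
```

The hash provides dispersion and equilibrium only. The 1/N bound also needs isolation: if all shards share one machine or one network card, a single fault can still reach every shard.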

No-single-point principle: have redundancy or other versions, so there is always a way out (T)

The ways to stop the bleeding quickly are switching, rollback, capacity expansion and so on. Rollback and capacity expansion are special kinds of switching: rollback switches to a particular version, and capacity expansion switches traffic to newly added machines.

Switching requires somewhere to switch to, so there must be no single point (here this means a strongly depended-on single point; a weak dependency can be degraded instead). There must be redundant backups or other versions; a single point limits the overall reliability.

Suppose the reliability of a single point is 99.99%. Raising it to 99.9999% is very difficult. But if you depend on two instances instead of one (it does not matter if one goes down, as long as both do not go down at the same time), the overall reliability improves qualitatively.
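The arithmetic behind this claim can be checked with a short calculation. It assumes the instances fail independently and that the service is up as long as at least one instance is up; both are idealizations, and the helper names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` redundant instances with independent failures.

    The service counts as available while at least one instance is up.
    """
    return 1.0 - (1.0 - single) ** replicas


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes in a 365-day year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for n in (1, 2):
    a = combined_availability(0.9999, n)
    print(f"{n} instance(s): availability={a:.8f}, "
          f"downtime={downtime_minutes_per_year(a):.4f} min/year")
```

Under these assumptions, two 99.99% instances already exceed the 99.9999% target. Correlated failures (shared power, shared network, a shared bug) would lower the real figure, which is where the isolation principle comes in.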

A single point makes it impossible to stop the bleeding quickly and prolongs the whole mitigation, so removing single points is very important. A single point here refers not only to system nodes but also to people, such as the people who subscribe to alarms and the people who respond to emergencies.

For (important) data nodes, the no-single-point principle must be met; otherwise, in extreme cases, data may be lost permanently and can never be recovered. Once (important) data nodes meet the no-single-point principle, ensuring data consistency matters more than availability.

For example: a merchant supports only one payment channel. This is a typical single point; if that payment channel fails, customers cannot pay.

For example: a family's entire income depends on the father's salary; if the father falls ill, there is no income.

The difference between the no-single-point principle and the dispersion principle:

When nodes are stateless, breaking the whole up into N parts gives parts that have the same function and are redundant to one another. In this case the dispersion principle is equivalent to the no-single-point principle, and satisfying one of them is enough.

When nodes are stateful, breaking the whole up into N parts gives parts that all differ and are not redundant to one another, so each part needs its own redundancy. In this case both the dispersion principle and the no-single-point principle must be met.

Self-protection principle: bleed less; sacrifice one part to protect another (P&R&T)

External input is not 100% reliable: sometimes it is an unintentional error, sometimes even malicious damage. Design error prevention for external input and give yourself an extra layer of protection.

In extreme cases it may not be possible to stop the bleeding (quickly). Then consider bleeding less, sacrificing one part to protect another, for example through rate limiting and degradation.

For example: during peak periods, many functions are usually degraded in advance and rate limiting is applied at the same time, mainly to protect the transaction and payment experience of the vast majority of users at the peak.

For example: when a person loses too much blood or is in too much pain, the body goes into shock, which is also a typical self-protection mechanism.
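Rate limiting (the "current limit" mentioned above) is one of the simplest self-protection mechanisms to sketch. The token bucket below is a common technique chosen here for illustration; the article does not say which algorithm is used, and the class name and parameters are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock             # injectable so the limiter can be tested
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be shed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


limiter = TokenBucket(rate=100, capacity=200)
accepted = sum(1 for _ in range(1000) if limiter.allow())
print(f"accepted {accepted} of 1000 burst requests")
```

Requests for which `allow()` returns False are the part that is sacrificed so that the requests already admitted keep working.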

Fourth, where are the software risks?

The methods of risk control were introduced above. Returning to the field of software systems: where are the risks?

Taking the software system as the object: from the inside, it consists of the computing system and the storage system; from the outside, it involves people, hardware, upstream systems and downstream systems; and there is also the (implicit) dimension of time.

Because every object is made up of other objects, each object can be broken down further (in theory, indefinitely). The decomposition above is chosen mainly to keep things simple to understand.

1 Sources of software system risk

Risk comes from (harmful) change, and the risk to an object comes from (harmful) changes in all the objects related to it. The sources of software system risk can therefore be divided into the following seven categories:

Computing system changes: running slowly, running incorrectly

The load on the server resources (CPU, MEM, IO, etc.), application resources (number of RPC threads, number of DB connections, etc.) and business resources (business IDs used up, insufficient balance, insufficient business quota, etc.) that the system depends on affects the risk expectation of the computing system's operation.

Storage system changes: running slowly, running incorrectly, data errors

The load on the server resources (CPU, MEM, IO, etc.), storage resources (concurrency, etc.) and data resources (single-database capacity, single-table capacity, etc.) that the system depends on, together with data consistency, affects the risk expectation of the storage system's operation.

Human changes: change errors

The number of people making changes, their production safety awareness and proficiency, the number of changes, the way changes are made, and so on all affect the risk expectation of change.

Because of the large number of people and changes, change has become the number one source of failures at Ant Group, which is why the "three axes of change" are so well known.

The correct order of the "three axes of change" is "grayscale release, monitoring, emergency response"; grayscale release addresses R, while monitoring and emergency response address T.

Think about it: if you could add one more axe to the three axes of change, what do you think it should be?

Hardware changes: damage

The quantity, quality, service life and maintenance of hardware affect the risk expectation of the hardware, and hardware damage makes the software systems running on it unavailable.

Upstream changes: requests grow larger

Requests can be viewed at three levels: network traffic (the aggregate of countless API requests), API requests (each made up of requests for many KEYs), and KEY requests.

Excessive network traffic causes network congestion and affects all requests passing through that network channel.

Excessive API requests overload the corresponding service cluster, affect all API requests on the machines of that service, and may even spread further.

Excessive requests for a single KEY (commonly known as a "hot KEY") overload a single machine, affect all KEY requests on that machine, and may even spread further.

Therefore, when preparing for a big promotion, pay attention not only to the capacity of the core APIs but also to network traffic and hot KEYs.
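Hot KEYs are easier to handle when they are spotted before they overload a machine. The sketch below counts requests per key in a sample window and flags keys above a share threshold. The function name, the 20% default and the sample keys are assumptions; production systems usually rely on sampling or approximate counters rather than exact counts.

```python
from collections import Counter
from typing import Iterable, List


def hot_keys(requests: Iterable[str], threshold_ratio: float = 0.2) -> List[str]:
    """Return keys whose share of the sampled requests is at least threshold_ratio.

    Keys are returned from most to least requested.
    """
    counts = Counter(requests)
    total = sum(counts.values())
    if total == 0:
        return []
    return [k for k, c in counts.most_common() if c / total >= threshold_ratio]


# Illustrative window of 100 requests dominated by one key.
window = ["sku-1"] * 80 + ["sku-2"] * 15 + ["sku-3"] * 5
print(hot_keys(window, threshold_ratio=0.5))
```

Once a hot key is identified, the nPRT levers apply as usual, for example spreading it across replicas to reduce R, or rate limiting it to protect the other keys on the same machine.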

Downstream changes: slow responses, response errors

The number, tier and availability of downstream services affect the risk expectation of downstream services. A slow downstream response may slow down the upstream, and a downstream response error may affect the upstream's results.

Time changes: time expiration

Time expiration is often overlooked, yet it tends to be sudden and globally destructive. Once an expiration triggers a failure, the response is very passive, so expirations must be identified in advance and warned about early. Examples include key expiration, certificate expiration, fee expiration, and crossing time zones, years, months or days.

For example: in December 2018, a certificate expiration caused an outage of about four hours at the Japanese operator SoftBank, affecting around 30 million users.
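Expiration risks lend themselves to simple early-warning checks. The sketch below classifies an expiry timestamp as OK, WARN or EXPIRED; the function name, the status labels and the 30-day warning window are assumptions for illustration, and reading the actual expiry date of a key or certificate is left out.

```python
from datetime import datetime, timedelta, timezone


def expiry_status(expires_at: datetime, now: datetime, warn_days: int = 30) -> str:
    """Classify an expiry time relative to `now`.

    Both datetimes must be timezone-aware so that time-zone differences
    cannot shift the result.
    """
    remaining = expires_at - now
    if remaining <= timedelta(0):
        return "EXPIRED"
    if remaining <= timedelta(days=warn_days):
        return "WARN"
    return "OK"


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(expiry_status(datetime(2025, 6, 1, tzinfo=timezone.utc), now))
print(expiry_status(datetime(2025, 1, 15, tzinfo=timezone.utc), now))
print(expiry_status(datetime(2024, 12, 31, tzinfo=timezone.utc), now))
```

Running such a check on a schedule turns a sudden, global failure into an early warning, which shortens T or avoids the failure altogether.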

Each of the risk categories above can be analyzed and handled one by one using the nPRT formula.

2 Number of risks: one begets two, two begets three, three begets all things

Everything is both made up of other things and a part of other things, and this goes on without end; one begets two, two begets three, three begets all things, and the number of risks is endless.

Looking inward, things can be infinitely small. When a problem at atomic granularity spreads, it may also affect the availability of a software system, just as a novel coronavirus about 100 nm in size can affect the availability of the human body.

Looking outward, things can be infinitely large. If the solar system were destroyed, the availability of software systems would naturally cease to exist.

Although risks are endless, as long as we understand them better and control them according to sound ideas and principles, we can reduce the risk expectation.

On awe:

Our understanding of the world is limited, and that limited understanding takes away much of the fear and awe we ought to have.

What we really should be in awe of is not the penalty rules but the things we do not know, and the things we do not know that we do not know.
