The focus of distributed systems: a first look at "high availability"


Updated: 2025-04-17. From: SLTechnology News&Howtos (shulou)

Shulou (Shulou.com), 06/03 report

This article is 2042 words long; the recommended reading time is 6 minutes. Terms in quotation marks are highlighted only the first time they appear.

Ahem. With this article I am officially starting on what I consider the second most important topic in distributed systems: "high availability".

The main purpose of this article is to clarify the definition of "high availability" and to identify which links in a distributed system need it, laying the foundation for the strategies and solutions discussed later. If you have more than a year of hands-on experience with distributed systems, feel free to skip this article.

Tip: the "high" in "high XX" is relative. The closer a system comes to meeting your expectations, the "higher" it is.

First, the role of "high availability"

First of all, let us agree on what "high availability" means.

A rough analogy: a child in the one-child era is like a "single application". If something happens to that child, the parents lose their only child, the family line is broken, and the family is "unavailable". The two-child policy, by contrast, improves "availability" by using redundancy to reduce the probability of this outcome.

For "high availability", the professional explanation is:

"High availability" refers to improving the availability of systems and applications by minimizing downtime caused by routine maintenance operations (planned) and sudden system crashes (unplanned).

Source: Baidu Encyclopedia

In short, the more users can remain unaware of whatever happens (even an earthquake or a flood) and keep using the system normally, the more "highly available" it is.

Why talk about "high availability" after "data consistency"? My understanding is that the key to distributed systems is redundancy, and the biggest enemy of redundancy is "data consistency". Redundancy breaks the original bottlenecks and opens up new possibilities, such as higher availability and higher performance. Of these, "high availability" is the most important. As the definition quoted above suggests, a single application hits its ceiling very quickly when it comes to minimizing downtime. It is hard to keep one computer running forever: the operating system has to be updated now and then, hardware fails without warning, and even the optical fiber to the machine room can be cut. Whenever that happens, the system is "unavailable".

Therefore, I think the value of "high availability" ranks above the other benefits we get from building distributed systems, such as "high performance". To a certain extent, "high performance" can reach an expected level by optimizing a single application, whereas "high availability" can only be achieved by relying on a distributed system.

Second, how to measure "high availability"

Generally speaking, the measure we talk about most for high availability is the Service Level Agreement, or SLA. Strictly speaking, though, an SLA is a contract between a network service provider and a customer. It defines terms such as the type of service, the quality of service and customer payment, and besides "effective working time" it covers other concepts such as bandwidth, ready for service date (RFSD), mean time between failures (MTBF), mean time to restore service (MTRS) and mean time to repair (MTTR). SLAs were originally used mostly for services provided by infrastructure operators such as telecom carriers, to agree on what level and what kind of bandwidth service a user is entitled to.
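To connect two of the terms above to the availability figure discussed next, here is a minimal sketch. It assumes the common textbook relation availability = MTBF / (MTBF + MTTR), where MTBF is read as the mean uptime between failures; the function name and the sample numbers are made up for illustration and do not come from any SLA.

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as the uptime share of one failure/repair cycle.

    Assumption: mtbf_hours is the mean uptime between failures and
    mttr_hours is the mean time needed to repair one failure.
    """
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("times must be non-negative")
    cycle = mtbf_hours + mttr_hours
    if cycle == 0:
        raise ValueError("at least one of the two times must be positive")
    return mtbf_hours / cycle


if __name__ == "__main__":
    # Hypothetical service: fails on average every 999 hours, repaired in 1 hour.
    print(f"{availability_from_mtbf(999.0, 1.0):.3%}")
```

The relation also shows the two levers available: make failures rarer (raise MTBF) or make recovery faster (lower MTTR).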

The complete definition of an SLA is much more complicated; software systems mainly borrow the "effective working time" part. If a system can always provide service, its availability is 100%, but that exists only in theory. If the system is unable to provide service for 1 out of every 100 time units it runs, its availability is 99%. Here is a commonly seen table:

▲ Image from the Internet; copyright belongs to the original author.
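The arithmetic behind the 99% example can be sketched in a few lines. This is only an illustration: the function names are made up, a year is assumed to have 365 days, and the availability levels in the loop are commonly quoted "nines" rather than values taken from the original table.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def measured_availability(total_units: float, down_units: float) -> float:
    """Availability in percent: the share of time the system could serve."""
    if total_units <= 0 or not 0 <= down_units <= total_units:
        raise ValueError("need total_units > 0 and 0 <= down_units <= total_units")
    return 100 * (total_units - down_units) / total_units


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget implied by an availability target over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_hours * (1 - availability_percent / 100)


if __name__ == "__main__":
    # The example from the text: 1 unavailable unit out of 100.
    print(measured_availability(100, 1))
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Each extra "nine" cuts the yearly downtime budget by a factor of ten, which is why the higher levels are so much harder to reach.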

Nowadays our lives depend more and more on mobile Internet applications. Suppose Alipay went down for a few hours; that would be a big deal. You could not pay, transfer money, or repay your credit card. Wouldn't you panic?

On the other hand, a more opportunistic reading is possible: as long as I can guarantee that the system is available whenever you actually use it, I can also advertise it as "highly available". This is why, before the Internet became popular, the internal C/S (client/server) information systems of many enterprises were considered to work normally. Banks, for example, would update their systems outside business hours, so for the tellers at the service window the system was never unavailable, because they did not need it at that time.

Third, the essence of "high availability"

To do "high availability" can be summed up in one sentence:

Faster fault detection, faster fault isolation.

Any work that helps these two points is what we have to do.
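As one possible illustration of these two points, the sketch below detects failures with heartbeat timeouts and isolates overdue nodes. The class, its methods and the timeout value are hypothetical; timestamps are passed in explicitly so the behavior is easy to follow and test.

```python
from typing import Dict, List, Set


class HeartbeatMonitor:
    """Toy failure detector: a node is isolated once its heartbeat is overdue."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}
        self._isolated: Set[str] = set()

    def record_heartbeat(self, node: str, now: float) -> None:
        """Remember when the node last reported in; a fresh heartbeat un-isolates it."""
        self._last_seen[node] = now
        self._isolated.discard(node)

    def sweep(self, now: float) -> Set[str]:
        """Fault detection and isolation: mark every overdue node as isolated."""
        for node, seen in self._last_seen.items():
            if now - seen > self.timeout_seconds:
                self._isolated.add(node)
        return set(self._isolated)

    def healthy_nodes(self) -> List[str]:
        """Nodes that may still receive traffic, in sorted order."""
        return sorted(n for n in self._last_seen if n not in self._isolated)


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=3.0)
    monitor.record_heartbeat("node-a", now=0.0)
    monitor.record_heartbeat("node-b", now=0.0)
    monitor.record_heartbeat("node-a", now=4.0)  # node-b stays silent
    print(monitor.sweep(now=5.0))
    print(monitor.healthy_nodes())
```

A shorter timeout detects faults faster but makes false alarms more likely when the network is merely slow, so the value is a trade-off rather than a constant.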

Everything has priorities, and the first priority for high availability is "load balancing".

As mentioned many times in previous articles, the key to distributed systems is redundancy, and it is "load balancing" that turns this redundancy into "high availability". It is therefore the most basic, first step toward "high availability"; other measures build on it.

The role of "load balancing" is to act as a "connector", so that upstream and downstream are "connected" in the way we expect. So we first need to understand the full picture of upstream and downstream, and work out where we want to do "load balancing".
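A minimal sketch of how load balancing turns redundant instances into availability: requests rotate over the instances, and any instance marked as down is skipped. Round robin is only one of several possible strategies, and the class and backend names here are hypothetical.

```python
from typing import List, Set


class RoundRobinBalancer:
    """Toy round-robin balancer that skips backends marked as down."""

    def __init__(self, backends: List[str]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._down: Set[str] = set()
        self._next = 0  # index of the next backend to try

    def mark_down(self, backend: str) -> None:
        self._down.add(backend)

    def mark_up(self, backend: str) -> None:
        self._down.discard(backend)

    def pick(self) -> str:
        """Return the next healthy backend, trying each backend at most once."""
        for _ in range(len(self._backends)):
            candidate = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if candidate not in self._down:
                return candidate
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
    print([balancer.pick() for _ in range(4)])
    balancer.mark_down("app-2")  # simulate a failed instance
    print([balancer.pick() for _ in range(3)])
```

As long as at least one redundant instance is healthy, callers keep getting an answer; without the balancer, the extra instances would not help the caller at all.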

Distributed systems come in many architectures, but in essence they are all layered architectures like the one shown above. The red dots in the picture mark the places where we need to do "load balancing". As you can see, they sit at the connection between every two adjacent layers.

When "load balancing" is actually implemented for these connections, the network layer has to be taken into account, because the practices differ at different layers of the network, as shown in the following picture.

The mainstream choices are layer 4 and layer 7 load balancing. The former works at the transport layer and mainly involves protocols such as TCP and UDP, while the latter works at the application layer and mainly involves protocols such as HTTP, HTTPS and FTP.

There are many solutions for implementing "load balancing", both hardware-based and software-based. The more mature ones include F5 (layer 4 and layer 7), LVS (layer 4) and Nginx (layer 7).

In recent years, with the rise of Service Mesh, a new generation of "load balancing" solutions has emerged, such as Envoy, Istio, Linkerd and Ribbon. Interested readers can study them on their own.
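The difference between the two layers comes down to what the balancer can see when it decides. The sketch below is a deliberate simplification under stated assumptions: the layer 4 function only knows the client address and port, so it hashes them, while the layer 7 function has a parsed HTTP path, so it can route by URL prefix. Function names, backend names and the routing table are all hypothetical, and real products offer many more algorithms.

```python
import hashlib
from typing import Dict, List


def pick_layer4(backends: List[str], src_ip: str, src_port: int) -> str:
    """Layer 4 view: only addresses and ports are visible, so hash the connection."""
    if not backends:
        raise ValueError("at least one backend is required")
    key = f"{src_ip}:{src_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


def pick_layer7(routes: Dict[str, List[str]], path: str,
                default_pool: List[str]) -> List[str]:
    """Layer 7 view: the HTTP path is visible, so choose a pool by longest prefix."""
    best = ""
    for prefix in routes:
        if path.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return routes[best] if best else default_pool


if __name__ == "__main__":
    transport_backends = ["10.0.0.1:8080", "10.0.0.2:8080"]
    # The same client address and port always map to the same backend.
    print(pick_layer4(transport_backends, "203.0.113.7", 51234))

    http_routes = {"/api": ["api-1", "api-2"], "/static": ["cdn-1"]}
    print(pick_layer7(http_routes, "/api/users", default_pool=["web-1"]))
    print(pick_layer7(http_routes, "/index.html", default_pool=["web-1"]))
```

Layer 4 balancing does less work per request because it never parses the payload; layer 7 balancing costs more but can make content-aware decisions.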

Fourth, conclusion

This article is only the first step. The next one will cover the strategies of "load balancing", with pictures to do the talking.

Author: Zachary (personal × × number: Zachary-ZF)

Official account (first published): Cross-border Architect.
