How to design a high-availability architecture for the web


2025-07-01 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/01 Report

This article introduces "how to do web high availability architecture design". Many people run into exactly these difficulties when handling real cases, so the sections below explain how to deal with them.

Define the target

Since our goal is high availability, we first need to clarify what high availability means and break it down into sub-goals that can be quantified. As I understand it, the goal breaks down into three parts:

1. Maintain high business stability

System stability is the fundamental purpose of high availability. Put simply, the system stays available, does not go down for no reason, and keeps working normally under heavy load.

2. Support fast fault location

In practical engineering, no service is free of failures, so we must be able to discover and locate faults quickly. Ideally, the alerting mechanism pinpoints the cause of a fault before external users notice it, helping engineers deal with the problem as soon as possible and preventing further impact on the business.

3. Support rapid business recovery

This requires a few more words on the difference between "restoring business" and "solving the problem", which are two different ways of responding after a failure occurs in production. Simply put, "restoring business" means temporarily setting aside the root cause of the production failure and first finding a quick workaround that gets the business running again. Many engineers fall into a habit when handling production failures: first find the cause, then change the code to fix it, test, release to production, and only then does the business function work normally again. The time cost of this process is very high, and the business suffers for the whole duration of the failure.

For example, suppose the service on one machine responds very slowly, causing requests to time out. Possible causes include network bandwidth problems, disk problems, insufficient CPU or memory, an infinite loop in the application, growing JVM garbage collection pauses, and so on. It is hard to investigate that many possibilities within a few minutes, but we can restore business without knowing the real cause. The simplest approach is to take this machine offline and let the traffic be distributed to the other machines or to newly added ones.
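The "take the slow machine offline" recovery can be sketched as a tiny round-robin pool. This is a minimal illustration under stated assumptions: `RoundRobinPool`, `report_latency` and the 1000 ms threshold are hypothetical, and in practice the same effect comes from removing the node from your load balancer or service registry.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backend:
    name: str
    online: bool = True


class RoundRobinPool:
    """Hypothetical round-robin balancer that only routes to online machines."""

    def __init__(self, names: List[str]) -> None:
        self.backends = [Backend(n) for n in names]
        self._counter = 0

    def take_offline(self, name: str) -> None:
        # Restore business first: stop routing to the machine without
        # needing to know why it is slow.
        for backend in self.backends:
            if backend.name == name:
                backend.online = False

    def add(self, name: str) -> None:
        # Newly added capacity joins the rotation immediately.
        self.backends.append(Backend(name))

    def report_latency(self, name: str, latency_ms: float, threshold_ms: float = 1000.0) -> None:
        # Simplified health check: a single slow response removes the machine.
        if latency_ms > threshold_ms:
            self.take_offline(name)

    def pick(self) -> str:
        online = [b for b in self.backends if b.online]
        if not online:
            raise RuntimeError("no online backends")
        choice = online[self._counter % len(online)]
        self._counter += 1
        return choice.name
```

For example, on a pool of `web-1`, `web-2` and `web-3`, calling `report_latency("web-2", 5000)` makes `pick()` alternate between `web-1` and `web-3`. A real health check would look at several samples before ejecting a node, but the point stands: traffic moves away from the bad machine while the root cause is still unknown.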

Now that we have broken the goal down into these three sub-goals, the architecture described below is organized around them.
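To make the "quantified" part of the goal concrete, one common convention (not something the original prescribes) is to state stability as an availability target and convert it into a yearly downtime budget. The sketch assumes a 365-day year; the name `allowed_downtime_minutes` is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def allowed_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget implied by an availability target such as 0.999."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} -> {allowed_downtime_minutes(target):.1f} min/year")
```

For example, 99.9% leaves about 525.6 minutes (roughly 8.8 hours) per year, while 99.99% leaves about 52.6 minutes. Budgets that small are why fast fault location and fast recovery matter as much as raw stability.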

Service classification + service degradation

Service classification: classify services according to business requirements into core services and general services, so that the two do not affect each other; the resources behind them (servers, cache, database, MQ) are kept separate. Service classification corresponds to our sub-goal 1.

There are two key points here:

1. Extract the core services. In our mutual fund business, for example, users, orders and payments are core services, while message push and marketing (coupons and points) are general services. The stability of core services is tied directly to business KPIs: customers must use them, and once a core service has a problem, customers cannot purchase products, whereas a problem in a general service does not affect transactions in the short term. From another angle, the code of general services changes frequently in day-to-day work, which adds risk. Therefore, keeping the core services working is our first priority.

2. Separate the resources of services at different levels, including servers, cache, MQ and DB, because as long as services share resources, a general service can affect a core service. The simplest example: if services share one Redis instance and a flood of message requests uses up the Redis connections, the quality of the core services suffers.
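Both key points can be sketched together: a tier map that classifies the services named above, and a separate, bounded connection budget per tier (the bulkhead pattern), so that a flood of general-service requests cannot exhaust the connections the core services need. This is a minimal in-process illustration; `SERVICE_TIERS`, `TierPool` and the pool sizes are hypothetical. In production the isolation would come from giving each tier its own Redis, MQ and DB instances, and the semaphore here only stands in for a connection pool.

```python
import threading

# Hypothetical tier map based on the mutual fund example in the text.
SERVICE_TIERS = {
    "user": "core",
    "order": "core",
    "payment": "core",
    "message_push": "general",
    "marketing": "general",
}


class TierPool:
    """Bounded connection budget for one tier (stand-in for a real Redis pool)."""

    def __init__(self, tier: str, max_connections: int) -> None:
        self.tier = tier
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self) -> bool:
        # Non-blocking: fail fast instead of waiting when the tier is exhausted.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


# Each tier owns its own pool, so tiers never compete for the same connections.
POOLS = {
    "core": TierPool("core", max_connections=50),
    "general": TierPool("general", max_connections=20),
}


def pool_for(service: str) -> TierPool:
    return POOLS[SERVICE_TIERS[service]]
```

With these sizes, message push can hold at most 20 connections; the 21st attempt fails fast, while `user` still gets a connection from the core pool.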

Service degradation: when a failure occurs, general services can be degraded directly to protect the core services. Service degradation corresponds to our sub-goal 3.

Even after splitting services into core and general, services call each other in many business scenarios, so they can still affect each other. For example, pushing a message (a general service) first requires querying user information (a core service), so a large volume of messages puts heavy pressure on the core business system. In this case, shutting down the non-core service keeps the core services unaffected.
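A degradation switch for the message-push example can be sketched as a flag that the general service checks before it calls the core user service. `DegradeSwitches` and `push_message` are hypothetical names; a real system would back the flags with dynamic configuration rather than an in-memory dict.

```python
from typing import Callable, Dict


class DegradeSwitches:
    """Hypothetical in-memory degradation flags, one per non-core feature."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}

    def set(self, feature: str, degraded: bool) -> None:
        self._flags[feature] = degraded

    def is_degraded(self, feature: str) -> bool:
        return self._flags.get(feature, False)


switches = DegradeSwitches()


def push_message(user_id: int, text: str, lookup_user: Callable[[int], dict]) -> str:
    # Check the switch before calling the core user service, so a degraded
    # push feature adds no load to the core system.
    if switches.is_degraded("message_push"):
        return "skipped: message push degraded"
    user = lookup_user(user_id)  # call into the core user service
    return f"sent to {user['name']}: {text}"
```

The order of the check is the design choice that matters: degrading after the user lookup would still put the load on the core service.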

In addition, the best way to perform service degradation is through dynamic configuration, rather than manually editing static configuration in production or releasing a new version, both of which are error-prone and inefficient. Here I recommend a configuration service such as Alibaba's Diamond. Once a service is integrated with Diamond, you modify the configuration in the Diamond admin console and a Groovy script synchronizes the latest configuration to the service directly, so the degradation can take effect without even restarting the service.
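The push-style configuration flow can be approximated in-process as follows. This is not Diamond's client API, which is not reproduced here; `DynamicConfig`, `publish` and the key `degrade.message_push` are hypothetical, and `publish` stands in for the update that a config center would push to the service.

```python
import threading
from typing import Callable, Dict, List, Optional


class DynamicConfig:
    """Hypothetical config holder that notifies listeners when a key changes."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values: Dict[str, str] = {}
        self._listeners: Dict[str, List[Callable[[str], None]]] = {}

    def get(self, key: str, default: Optional[str] = None) -> Optional[str]:
        with self._lock:
            return self._values.get(key, default)

    def subscribe(self, key: str, callback: Callable[[str], None]) -> None:
        with self._lock:
            self._listeners.setdefault(key, []).append(callback)

    def publish(self, key: str, value: str) -> None:
        # Stand-in for a push from the config center: store, then notify
        # listeners outside the lock so callbacks cannot deadlock on it.
        with self._lock:
            self._values[key] = value
            callbacks = list(self._listeners.get(key, []))
        for callback in callbacks:
            callback(value)


# The running service keeps a live flag that the listener flips in place.
degraded = {"message_push": False}


def on_push_switch(value: str) -> None:
    degraded["message_push"] = value.strip().lower() == "true"


config = DynamicConfig()
config.subscribe("degrade.message_push", on_push_switch)
```

Because the listener updates the flag in place, the running service picks up the new value without a restart, which is the property the text is after.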

Establish hierarchical monitoring

Layered monitoring corresponds to our sub-goal 2: it collects all the information relevant to analyzing and locating faults. It is divided into five layers, described below:

Network layer

Analyze network access. For example, if a large number of external requests saturates the bandwidth of the external NIC, you need to determine immediately whether the traffic is legitimate. If it is high-frequency access driven by a marketing campaign, upgrade the bandwidth; if it is an external attack, consider protective measures such as traffic scrubbing.
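Detecting a saturated NIC comes down to comparing the traffic rate with the link capacity. A minimal sketch, assuming you sample a cumulative byte counter twice (on Linux, for example, `/sys/class/net/<iface>/statistics/rx_bytes`) and know the link speed; the 90% threshold is an arbitrary example. Deciding whether the traffic comes from a campaign or an attack still needs a human or further analysis.

```python
def bandwidth_utilization(bytes_before: int, bytes_after: int,
                          interval_s: float, link_mbps: float) -> float:
    """Fraction of link capacity used between two cumulative byte-counter samples."""
    if interval_s <= 0 or link_mbps <= 0:
        raise ValueError("interval and link speed must be positive")
    bits_per_second = (bytes_after - bytes_before) * 8 / interval_s
    return bits_per_second / (link_mbps * 1_000_000)


def needs_attention(utilization: float, threshold: float = 0.9) -> bool:
    # Flag near-saturated links so someone can classify the traffic.
    return utilization >= threshold
```

For instance, 1,187,500,000 bytes received in 10 seconds on a 1000 Mbps link is 95% utilization, which this check would flag.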

Interface layer

Collect access data for exposed interfaces, including execution time, returned status codes and call counts. Pay particular attention to the most frequently called APIs: use their access data to decide whether the service needs to scale out, and to detect abnormal external access such as automated bot traffic. If an interface returns a large number of error codes, find out as soon as possible why its calls are failing.
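Per-interface statistics like these can be aggregated from access-log records. A minimal sketch, assuming each record carries the path, HTTP status and elapsed time, and treating only 5xx responses as errors (an assumption; your definition of an error code may differ); the 5% error-rate threshold is illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class AccessRecord:
    path: str
    status: int
    elapsed_ms: float


@dataclass
class EndpointStats:
    calls: int = 0
    errors: int = 0
    total_ms: float = 0.0

    @property
    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0

    @property
    def avg_ms(self) -> float:
        return self.total_ms / self.calls if self.calls else 0.0


def aggregate(records: Iterable[AccessRecord]) -> Dict[str, EndpointStats]:
    stats: Dict[str, EndpointStats] = defaultdict(EndpointStats)
    for record in records:
        s = stats[record.path]
        s.calls += 1
        s.total_ms += record.elapsed_ms
        if record.status >= 500:  # assumption: only server errors count
            s.errors += 1
    return dict(stats)


def failing_endpoints(stats: Dict[str, EndpointStats], max_error_rate: float = 0.05) -> List[str]:
    """Endpoints above the error-rate threshold, most-called first."""
    bad = [path for path, s in stats.items() if s.error_rate > max_error_rate]
    return sorted(bad, key=lambda path: stats[path].calls, reverse=True)
```

Sorting by call count puts the busiest failing interfaces first, which matches the text's advice to watch the most frequently accessed APIs.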

Business layer

Collect and analyze the performance of core and general services and the calls between them, for example, whether a service is throwing a large number of exceptions or whether Dubbo service calls are timing out.
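Counting exceptions or RPC timeouts over a sliding time window is one simple way to turn this layer's data into an alert. The sketch is generic; `WindowCounter` and its thresholds are hypothetical, and in a Dubbo service you would feed it from wherever timeouts are caught (for example a custom filter), which is not shown.

```python
import time
from collections import deque
from typing import Deque, Optional


class WindowCounter:
    """Counts events (e.g. exceptions or RPC timeouts) in the last `window_s` seconds."""

    def __init__(self, window_s: float, threshold: int) -> None:
        self.window_s = window_s
        self.threshold = threshold
        self._events: Deque[float] = deque()

    def record(self, now: Optional[float] = None) -> bool:
        """Record one event; return True if the window count reaches the threshold."""
        now = time.monotonic() if now is None else now
        self._events.append(now)
        # Drop events that have fallen out of the window.
        while self._events and self._events[0] <= now - self.window_s:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

A window avoids alerting on a few scattered timeouts over a long period while still firing quickly on a burst.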

Middleware layer

The middleware layer covers the middleware that services depend on, such as containers, caches and message queues. Each kind of middleware needs different metrics; for Redis, for example, they include the number of connections, the number of requests, RDB and AOF persistence status, I/O frequency and cache hit rate.
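Cache hit rate can be derived from the `keyspace_hits` and `keyspace_misses` fields of the Redis `INFO` command. The parser below works on raw `INFO` text; clients such as redis-py already return a parsed dict from `info()`, which `cache_hit_rate` also accepts. These counters are cumulative since the server started, so for a recent hit rate you would diff two snapshots. The function names are illustrative.

```python
from typing import Dict, Mapping, Union


def parse_redis_info(text: str) -> Dict[str, str]:
    """Parse `field:value` lines of Redis INFO output, skipping section headers."""
    info: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


def cache_hit_rate(info: Mapping[str, Union[str, int]]) -> float:
    # Works with string values (raw INFO text) or ints (pre-parsed dicts).
    hits = int(info.get("keyspace_hits", 0))
    misses = int(info.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0
```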

System layer

The system layer covers operating system status; the collected information includes CPU utilization, memory utilization, NIC traffic, number of connections, and so on.
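On Linux, CPU utilization is usually computed from two samples of the aggregate `cpu` line in `/proc/stat`: the share of elapsed time not spent idle or waiting on I/O. A minimal sketch, assuming a Linux host and the standard field order (user, nice, system, idle, iowait, irq, softirq, steal, ...); the helper names are illustrative.

```python
from typing import List


def read_cpu_line(path: str = "/proc/stat") -> str:
    # The first line of /proc/stat is the aggregate "cpu" line.
    with open(path) as f:
        return f.readline()


def parse_cpu_line(line: str) -> List[int]:
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Keep user..steal (first 8 fields); guest time is already counted in user/nice.
    return [int(x) for x in parts[1:9]]


def cpu_utilization(before: str, after: str) -> float:
    b, a = parse_cpu_line(before), parse_cpu_line(after)
    delta = [y - x for x, y in zip(b, a)]
    total = sum(delta)
    idle = delta[3] + delta[4]  # idle + iowait
    return (total - idle) / total if total else 0.0
```

In practice you would call `read_cpu_line()` twice a few seconds apart and pass both lines to `cpu_utilization`.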