Where should the operation and maintenance of large-scale systems start? Total quality management

A large-scale Internet system means a large number of users, many business modules, many servers, and a large amount of resources. When you take over such an application, where should the operation and maintenance work start?

First, look at the goal of operation and maintenance: pursue a higher SLA at a lower cost. Everything else serves this goal. A higher SLA means higher quality of service, which in practice means the availability and performance of the online service. If availability is three nines, can you reach four nines, and hold four nines over the long term? If the average response time is 100 ms, can it be optimized to 80 ms? Lower cost means asking whether, at the same SLA, the service can run on fewer servers and a leaner architecture. All systems and automation tools exist to serve this goal.
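
To make the availability targets above concrete, here is a minimal sketch (in Python, not from the original article) that turns an availability figure into a yearly downtime budget; the arithmetic is the only content, and the function name is just for illustration.

    SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 seconds

    def downtime_budget_hours(availability: float) -> float:
        """Maximum allowed downtime per year, in hours, for a given availability."""
        return SECONDS_PER_YEAR * (1 - availability) / 3600

    for label, sla in [("three nines", 0.999), ("four nines", 0.9999)]:
        print(f"{label} ({sla:.2%}): about {downtime_budget_hours(sla):.1f} hours of downtime per year")
    # three nines allows roughly 8.8 hours per year; four nines only about 0.9 hours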

SLA and cost are not independent. In general, a higher SLA implies higher cost: move a service from 10 servers to 100 and its performance and quality will certainly improve. The two indicators are therefore a balance of opposites. When SLA and cost conflict, for the sake of stability our usual approach is to stabilize service quality first and optimize cost afterwards: spend the money to steady the quality, then slowly work out why so many servers and resources are needed. After all, if service quality is gone and users leave, everything else is worth nothing.

A higher SLA really means fewer online failures. If we use that as the starting point to map out the whole of operation and maintenance work, the stages to grasp become before, during, and after a fault. Before the fault we should invest as much as possible so that fewer problems turn into faults; once a problem does become a fault, we need a way to stop the loss quickly; and after the fault we review it and form improvement measures to prevent similar problems from recurring, feeding them back into the pre-fault work in a cycle. To make this more vivid, a figure is drawn to show it.

The figure above divides operation and maintenance work into eight parts from the point of view of how faults arise. Each part may be supported by many systems and processes, and together they form the entire operation and maintenance service system; our daily tools and systems exist to make each link more efficient. Note that one key to operating a large-scale system is standardization of every kind, because batch operation requires uniform operation. Now let us take these eight parts apart one by one.

1. Before the fault. Goal: reduce the number of problems that turn into faults

① Focus on changes

Faults do not occur for no reason; many of them arise from changes, and one of the biggest sources of change is iterative releases. Judging from each year's historical fault data, a large share of faults are caused by online changes, so we must strictly control changes and control quality before going online.

Managing changes mainly means keeping "road killers" off the release path: do unit testing, integration testing, and grayscale release properly before going fully online, so that nothing slips through. From a cost point of view, releasing and then rolling back is also the most expensive path: it affects users and then has to be redone.

Some change-induced faults cannot be found immediately after release. For example, a full GC problem in a Java program may not surface until a day after going online. This is where supporting systems help: for example, disallowing releases in the days before major holidays, and requiring management approval for emergency releases outside working hours (a sketch follows). Raising the cost of changes at abnormal times helps developers and operators cultivate a healthy awe of the online service.
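
As one illustration of such a supporting system, the sketch below gates releases on a freeze window and working hours; the dates, hours, and approval flag are hypothetical examples, not the article's actual policy.

    from datetime import datetime, date

    # Example freeze window around a major holiday and example working hours.
    FREEZE_WINDOWS = [(date(2025, 1, 26), date(2025, 2, 4))]
    WORK_HOURS = range(10, 19)  # 10:00-18:59

    def deploy_allowed(now: datetime, approved_by_boss: bool = False) -> bool:
        """Return True if an online change may proceed at `now`."""
        in_freeze = any(start <= now.date() <= end for start, end in FREEZE_WINDOWS)
        off_hours = now.hour not in WORK_HOURS
        if in_freeze or off_hours:
            return approved_by_boss  # emergency releases need explicit sign-off
        return True

    print(deploy_allowed(datetime(2025, 1, 28, 15, 0)))        # False: inside the freeze window
    print(deploy_allowed(datetime(2025, 1, 28, 15, 0), True))  # True: approved emergency release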

② Focus on capacity

Capacity management relates to both quality and cost, and very often it is vague or simply missing. The typical symptoms: capacity expansion is considered only after a capacity failure has already happened and CPU or memory alarms fire, and capacity reduction is considered only when the company wants to cut cost and looks at machine utilization. Both are passive and backed by no quantitative data. Capacity management without quantitative data is like a Chinese chef cooking by feel, a bit more salt, a bit less vinegar, based on experience; it is essentially unmanageable and highly dependent on one person's experience.

Capacity management also matters a great deal for cost. With capacity data, combined with the current user load, we know whether the current number of servers is reasonable, how much is wasted, and whether expansion is needed. Capacity data also exposes many performance problems. For example, a 32-core, 128 GB machine that only handles 10 QPS is obviously unreasonable and needs optimization.

In practice, capacity must be measured by business indicators rather than machine indicators. For example, if the code quality is poor, the machine's CPU and memory may already be quite high at only 10 QPS; should capacity be expanded then? Business indicators generally mean QPS, the number of online users, the number of long-lived connections, and so on. In principle we first obtain single-machine capacity data for each hardware configuration, then derive the capacity of the cluster, then of the module, and then of the entire product. The most common tools for measuring capacity are stress testing and full-link stress testing, used according to the scenario.

First make sure capacity data exists at all, then make it more and more accurate, and keep it dynamically updated as the code, architecture, and traffic model change. A sketch of the single-machine-to-cluster calculation follows.
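
A minimal sketch of that calculation, assuming capacity is expressed in business QPS taken from single-machine stress tests; the machine specs and numbers below are illustrative, not real measurements.

    def cluster_capacity(machines: dict[str, int], per_machine_qps: dict[str, int]) -> int:
        """Sum stress-tested QPS over every machine configuration in the cluster."""
        return sum(count * per_machine_qps[spec] for spec, count in machines.items())

    per_machine_qps = {"8c32g": 1200, "32c128g": 4000}  # from stress tests (example numbers)
    machines = {"8c32g": 20, "32c128g": 4}              # current cluster composition (example)

    capacity = cluster_capacity(machines, per_machine_qps)
    peak_qps = 18000                                    # observed business peak (example)
    print(f"capacity={capacity} QPS, utilization at peak={peak_qps / capacity:.0%}")
    # Utilization far below target suggests the cluster can shrink; far above, expand.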

③ Focus on disaster recovery

Disaster recovery and cost are also a contradictory pair. Disaster recovery across multiple data centers is certainly expensive, and doing without it is certainly cheap; but if a data center goes down and cannot serve, the business has nowhere to switch to and really collapses. So disaster recovery has to be sized reasonably according to the importance of the business.

Disaster recovery means redundancy. With reasonable disaster recovery, the business can be switched quickly when a failure occurs and availability for users is preserved. Broadly, disaster recovery is divided into hot standby and cold standby. Cold standby means resources are prepared but only used when a failure occurs, which wastes a lot of resources, so wherever hot standby is possible, prefer it over cold standby.

There are many ways to do hot standby, and many businesses are now service-oriented. The most common hot-standby scheme is a load balancer that schedules traffic automatically based on heartbeat health checks on the service. Another is switching via DNS resolution when, say, a China Mobile, China Unicom, or China Telecom link fails.
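
A simplified sketch of the heartbeat health check behind this kind of hot standby: probe each backend and keep only the healthy ones in the serving pool. The backend URLs and timeout are invented for illustration; a real load balancer performs this check internally.

    import urllib.request

    BACKENDS = ["http://10.0.0.11:8080/health", "http://10.0.0.12:8080/health"]

    def healthy(url: str, timeout: float = 1.0) -> bool:
        """One heartbeat probe: the backend is healthy if it answers 200 in time."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    serving_pool = [u for u in BACKENDS if healthy(u)]
    print("backends kept in rotation:", serving_pool)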

The most typical cold-standby scheme is keepalived, which provides a cold standby for a service by means of VIP (virtual IP) drift; the principle will not be expanded here.

Disaster recovery greatly reduces the pressure on the business when a failure occurs: switching the service away first and investigating the fault afterwards is much easier. It is even better if the switchover can happen automatically (which could be called fault self-healing).

④ Focus on inspection

The point of inspection is to find potential problems and to expose and solve issues before they turn into faults. It can be done manually or by a system. After each inspection, an inspection report is sent, and changes in certain core indicators are highlighted and notified according to the attributes of the service.

Whether inspection is manual or automated, the core indicators of each module must be sorted out and assembled into a bird's-eye dashboard of those indicators. First, it lets you inspect the state of the business when you arrive at the office each day; second, when a fault occurs, it lets you quickly see how each indicator has changed and locate the fault.

The inspection dashboard for each module generally includes QPS, errors, latency, errors from external dependencies, and machine metrics, which makes it easy to spot a problem at a glance, find the root cause quickly, and determine the scope of impact. A hypothetical automated patrol over such indicators is sketched below.
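
The sketch pulls the core indicators per module and flags anything past a simple threshold; module names, metric values, and thresholds are all made up for illustration.

    THRESHOLDS = {"error_rate": 0.01, "p99_latency_ms": 300, "cpu_util": 0.85}

    modules = {
        "login":  {"qps": 5200, "error_rate": 0.002, "p99_latency_ms": 120, "cpu_util": 0.55},
        "search": {"qps": 8100, "error_rate": 0.030, "p99_latency_ms": 410, "cpu_util": 0.92},
    }

    def patrol_report(modules: dict) -> list[str]:
        """Return one warning line per metric that exceeds its threshold."""
        warnings = []
        for name, metrics in modules.items():
            for metric, limit in THRESHOLDS.items():
                if metrics.get(metric, 0) > limit:
                    warnings.append(f"[{name}] {metric}={metrics[metric]} exceeds {limit}")
        return warnings

    for line in patrol_report(modules):
        print(line)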

2. During the fault. Goal: find the problem quickly and stop the loss quickly

If the work above has been done fully and with quality but a fault still occurs, the next question is how to handle it. Handling a fault has three key links.

⑤ Focus on alarms

Leaving aside special cases such as missing alarm coverage, an alarm should be the first event of a fault: the on-call engineer receives the alarm, judges the impact, and dispatches it to the right people to handle until the fault is resolved.

Alarms must therefore be well managed: grade alarms by the impact of the incident, and carry as much judgment information as possible in alarm SMS messages and e-mails. Done well, alarms can even offer a preliminary hint at the likely fault.

The construction of alarms must focus on three things: accurate, few, and fast. The information in an alarm must be accurate; the number of alarms must be small, meaning only effective alarms remain after convergence; and alarms must arrive quickly and be highly actionable. A sketch of grading and convergence follows.
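
A minimal sketch of "few" (convergence) and "accurate" (grading), assuming alarms are keyed by module and metric; the five-minute window and severity rules are illustrative, not the article's.

    from datetime import datetime, timedelta

    CONVERGE_WINDOW = timedelta(minutes=5)
    _last_sent: dict[tuple[str, str], datetime] = {}

    def severity(error_rate: float) -> str:
        """Grade the alarm by impact (example thresholds)."""
        return "P0" if error_rate > 0.05 else "P1" if error_rate > 0.01 else "P2"

    def maybe_alert(module: str, metric: str, error_rate: float, now: datetime) -> str | None:
        """Return an alarm text, or None if a similar alarm was sent within the window."""
        key = (module, metric)
        last = _last_sent.get(key)
        if last is not None and now - last < CONVERGE_WINDOW:
            return None  # converged: duplicate within the window
        _last_sent[key] = now
        return f"[{severity(error_rate)}] {module}/{metric} error_rate={error_rate:.2%} at {now:%H:%M}"

    now = datetime(2025, 1, 19, 10, 0)
    print(maybe_alert("search", "error_rate", 0.07, now))                         # sent as P0
    print(maybe_alert("search", "error_rate", 0.07, now + timedelta(minutes=2)))  # None, converged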

⑥ Focus on fault location

After the on-call engineer receives the alarm and makes an initial judgment, the alarm is dispatched to a handler, and fault location begins.

Fault location depends on a detailed understanding of the business architecture and on experience with the online system, and is generally done with the help of monitoring. This is when the core-indicator dashboard built during the inspection stage comes in handy: from the dashboard and core indicators you can usually form a hypothesis, then confirm the root cause through log analysis or by logging in to the servers to inspect detailed logs.

Location also has priorities: first find the faulty module and the scope of the fault; once found, switch the business away according to the preplan and only then investigate the cause, so as to reduce the impact on online users.

⑦ Focus on preplans

A preplan is the operation that stops the loss in time, to keep the business stable, once the faulty module and the fault scope have been located.

To stop the loss and reduce the impact as soon as possible after a failure, it is very important to think through the likely faults in advance and prepare plans for them. Preplans fall into disaster-recovery plans, which keep the service essentially intact, and degradation plans, which may affect part of the user experience but do not affect the main function (a sketch follows). If there is no plan, all you can do is ride it out.
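
A hypothetical sketch of a degradation switch: when a non-core dependency fails, the preplan flips a switch so the main function keeps serving. The feature names and page structure are invented for illustration.

    DEGRADE_SWITCHES = {"recommendations": False}

    def fetch_recommendations(user_id: int) -> list[str]:
        """Stand-in for a call to the real (possibly failing) dependency."""
        return ["item-a", "item-b"]

    def get_homepage(user_id: int) -> dict:
        """Serve the core page; optional blocks are skipped when degraded."""
        page = {"user_id": user_id, "core_content": "latest items"}  # main function always served
        if not DEGRADE_SWITCHES["recommendations"]:
            page["recommendations"] = fetch_recommendations(user_id)
        return page

    print(get_homepage(42))                      # normal: includes recommendations
    DEGRADE_SWITCHES["recommendations"] = True   # flipped by the preplan when the dependency fails
    print(get_homepage(42))                      # degraded: core content still served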

3. After the fault. Goal: eliminate similar faults

⑧ Fault management

Fault management is the follow-up to the whole fault. Setting aside the question of accountability, its purpose is to prevent similar failures from recurring. Generally, all parties involved review the entire process in a fault-review meeting, and the resulting document is the fault report. In my view its two most important parts are the cause of the failure (was it a natural disaster or a man-made one? has the root cause actually been found?) and the follow-up improvement measures.

There is a saying in management: a good system that is not implemented is as good as none. When it comes to carrying out improvement measures, it is all too common for the pain to be forgotten once the scar heals and for the measures to quietly change, so similar breakdowns repeat. In this process we must make sure the agreed improvement measures are completed fully and with quality.
