The "Angel in White" of creating programs (part I)


About the author: a senior data architect at Baidu, responsible for research on intelligent operations (O&M) algorithms and strategies, committed to using the power of algorithms and data to solve O&M problems.

Overview

When people get sick, they see a doctor; when a program gets "sick", it sees an operations (O&M) engineer. A doctor orders a battery of tests and then analyzes the results to find the cause. Likewise, when analyzing a system failure, an O&M engineer examines the collected monitoring metrics and then determines the root cause.

Reading metrics is not hard in itself. Once a trend chart is drawn (the curve of a metric's value over time), an experienced engineer can easily spot whether something is wrong and infer the cause of the fault. However, just as discussing toxicity without mentioning the dose is meaningless, so it is with checking metrics. If there are only a few metrics to look at, a dashboard makes everything visible at a glance, but what if there are thousands? When investigators raided a corrupt "big tiger", several bill-counting machines reportedly burnt out; an engineer's brain would not fare much better if asked to check that many metrics by hand. So we still have to rely on a "robot".

Wait, how can a robot know what is abnormal and what is not? And even if it can pick out the abnormal metrics, how does the engineer learn the root cause from them? Our "robot doctor" must therefore be able both to identify abnormal metrics and to organize the identified anomalies into reports that engineers can easily understand.

The traditional way

Figure 1: Module call graph

When diagnosing faults manually, engineers usually troubleshoot the system according to the module call graph they hold in mind (Figure 1). In many cases, a failure is noticed because many failed requests appear on the upstream front-end module (A in Figure 1). The engineer then works downward from A. Since A calls module B, B's metrics are checked next; if they look abnormal, B is suspected of causing the failure. Then B's direct downstream module C is checked, and so on. In this process, suspicion is passed down along the modules' call relationships until it can be passed no further. In the example in Figure 1, suspicion finally lands on the head of the unlucky G, which has no downstream module.

In essence, this is a process of modules passing the blame downstream. Real scenarios are a bit more complicated, of course: one cannot keep pushing suspicion downstream merely because some anomaly exists there; the degree of the anomaly must also be weighed. For example, if the anomaly on the unlucky G is much milder than that on E, the root cause is more likely to lie in E.
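The article gives no code for this heuristic; the sketch below is one way to express it, assuming a hypothetical call_graph mapping and anomaly_degree scores (not data from the article): walk downstream from the front-end module, always moving to the most anomalous downstream module, and stop when no downstream module looks abnormal.

```python
# A minimal sketch of the manual "pass the blame downstream" heuristic.
# call_graph and anomaly_degree are hypothetical example data, not from the article.
call_graph = {            # module -> modules it calls directly
    "A": ["B"],
    "B": ["C"],
    "C": ["E", "G"],
    "E": [],
    "G": [],
}
anomaly_degree = {"A": 0.9, "B": 0.8, "C": 0.7, "E": 0.75, "G": 0.1}

def locate_root_cause(front_end: str) -> str:
    """Follow the most anomalous downstream module until suspicion stops."""
    current = front_end
    while True:
        downstream = call_graph.get(current, [])
        # Keep only downstream modules that look abnormal at all.
        suspects = [m for m in downstream if anomaly_degree.get(m, 0.0) > 0.5]
        if not suspects:
            return current          # suspicion can be passed no further
        # Move to the downstream module with the highest anomaly degree.
        current = max(suspects, key=lambda m: anomaly_degree[m])

print(locate_root_cause("A"))       # -> "E": G is far less abnormal than E
```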

Once the root-cause module is found, analyzing the root cause itself becomes much easier, so locating the root-cause module is a crucial step in fault diagnosis.

The above process can be turned into a tool very directly:

Build a page that displays the module call graph;

Have engineers configure golden metrics for each module, along with thresholds for those golden metrics;

Mark on the call graph the modules whose golden metrics are abnormal, together with their possible paths to the front-end module.

By configuring golden metrics and their thresholds, the tool settles which metrics to examine and how to judge anomalies; by presenting the judgment results on the module call graph, it settles how to organize the results. Together these address the two core problems: anomaly judgment and result organization.
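As a rough illustration of that tool logic (not the authors' actual implementation), the sketch below checks each module's golden metric against its configured threshold and marks the abnormal modules together with their possible call paths from the front-end module; all data structures here are hypothetical examples.

```python
# Rough sketch with hypothetical example data: flag modules whose golden metric
# exceeds its configured threshold, then mark each flagged module's possible
# call paths from the front-end module so they can be highlighted on the page.
call_graph = {"A": ["B"], "B": ["C"], "C": ["E", "G"], "E": [], "G": []}
thresholds = {"A": 0.01, "B": 0.01, "C": 0.05, "E": 0.05, "G": 0.05}   # e.g. error-rate limits
golden_values = {"A": 0.20, "B": 0.15, "C": 0.12, "E": 0.30, "G": 0.01}

abnormal = {m for m, v in golden_values.items() if v > thresholds[m]}

def paths_from_front_end(target, node="A", path=None):
    """Enumerate call paths from the front-end module (A here) down to `target`."""
    path = (path or []) + [node]
    if node == target:
        return [path]
    found = []
    for callee in call_graph.get(node, []):
        found.extend(paths_from_front_end(target, callee, path))
    return found

for module in sorted(abnormal):
    for p in paths_from_front_end(module):
        print(" -> ".join(p))   # these nodes and edges would be highlighted
```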

However, the traditional method runs into many problems in practice:

A living system keeps evolving, and the modules' call relationships change with it. To keep the graph in the tool from going stale, it must be constantly synchronized with the real system, and any engineer who has tried to map out a system knows that this is not easy. If every module in the system communicates through a unified RPC middleware, the call graph can be mined from RPC trace logs, but "legacy code" usually stops you halfway.

Each golden metric usually covers only some of the fault types, so whenever a new kind of fault appears, new golden metrics have to be added. The configuration work, especially threshold tuning, therefore never ends. Moreover, with more metrics it becomes easy to end up with "the whole map glowing red": when most modules are marked abnormal, the tool is useless.

To guarantee performance and availability, large systems often deploy mirrored copies across several data centers. Since most failures occur in only one data center's copy, engineers need to know not only which module is the root cause but also which data center it is in. Each data center then needs its own call graph, and the engineer has to inspect them one by one.

The ideal effect

Diagnostic tools built in the traditional way are at best semi-automatic, and their use is subject to many limitations, so we want to build a truly automatic, intelligent tool.

First, we hope the new tool will not rely too heavily on golden metrics, so that metric configuration can be reduced. This in turn means that a fully automated tool must be able to scan all metrics on all modules so that nothing is missed. Anomaly judgment therefore can no longer depend on manually set thresholds; it must be essentially unsupervised. In addition, since different metrics have very different semantics, the anomaly-judgment algorithm must be flexible enough to adapt to the characteristics of each metric.

Second, we want the tool not to depend too heavily on the call graph, which means we need a new way to organize and present the results. In fact, the call graph is not strictly necessary. When using the traditional diagnosis method, we noticed that some engineers often set the call graph aside and checked modules directly in order of how abnormal their golden metrics were. That is because the golden metrics these engineers were responsible for are representative and easy to understand, and, more importantly, the degree of abnormality of golden metrics can be compared across modules.

Therefore, we can build a diagnostic tool that produces a recommendation report of candidate root-cause modules; the report's content must be easy to understand, and the recommendation order must be sufficiently accurate.

Automatic troubleshooting of instance metrics

Taking instance metrics as an example, we introduce how to implement a metric-troubleshooting tool that achieves the ideal effect described above. The overall flow of the tool is shown in Figure 2.

Figure 2: Overall flow of automatic troubleshooting of instance metrics

In the first step, every collected metric is given an anomaly score by an anomaly detection algorithm. Comparing the anomaly scores of two metrics tells us which one is more abnormal. The core of this step is finding a way to quantify the abnormality of each metric such that the resulting scores are comparable across different metrics on different instances.
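The article does not specify the detection algorithm at this point (the summary only says probability and statistics are used), so the following is a minimal sketch under that assumption: a robust z-score based on the median and MAD of each metric's recent history, which yields scale-free scores that can be compared across metrics with different units and magnitudes.

```python
import numpy as np

def anomaly_score(history: np.ndarray, latest: float) -> float:
    """Robust z-score of the latest value against the metric's recent history.

    A hypothetical stand-in for the article's statistical scoring: using the
    median and MAD makes the score scale-free, hence comparable across
    metrics with different units and magnitudes.
    """
    median = np.median(history)
    mad = np.median(np.abs(history - median))
    if mad == 0:                       # flat history: any deviation is suspicious
        return 0.0 if latest == median else float("inf")
    return abs(latest - median) / (1.4826 * mad)

# Hypothetical usage: a latency metric that suddenly jumps.
history = np.array([10.2, 9.8, 10.1, 10.0, 9.9, 10.3])
print(anomaly_score(history, 25.0))    # large score -> strongly abnormal
```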

In the second step, the anomaly scores are grouped by the instance they belong to, and each group forms a vector: each instance corresponds to one vector, and each element of the vector is the anomaly score of one metric. Vectors with similar patterns are then grouped by a clustering algorithm into several digests, which makes the analysis results easier for engineers to understand.
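The article does not name the clustering algorithm; as an assumed stand-in, the sketch below clusters the per-instance score vectors with scikit-learn's KMeans, and instances assigned to the same cluster form one digest.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical anomaly-score vectors: rows = instances, columns = metrics
# (e.g. cpu, latency, error_rate). Real systems would have far more of both.
instances = ["inst-1", "inst-2", "inst-3", "inst-4"]
scores = np.array([
    [5.1, 4.8, 6.0],   # inst-1: broadly abnormal
    [5.0, 4.9, 5.8],   # inst-2: similar pattern to inst-1
    [0.1, 0.2, 0.1],   # inst-3: healthy
    [0.0, 0.3, 0.2],   # inst-4: healthy
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)

digests = {}
for inst, label in zip(instances, labels):
    digests.setdefault(int(label), []).append(inst)
print(digests)   # two digests: the abnormal pair and the healthy pair
```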

In the third step, the instances contained in each digest, together with the metrics' anomaly scores, are ranked to form a recommendation report.
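As a final illustrative sketch (the ranking criterion here, the mean anomaly score, is our assumption, not the article's formula), the digests can be ordered by the average score of their member instances, and the metrics inside each digest ordered likewise, to produce the recommendation report.

```python
import numpy as np

def build_report(digests, scores, instances, metric_names):
    """Rank digests by their mean anomaly score, and metrics within each digest.

    `digests` maps a digest id to its member instance names; `scores` is the
    instances-by-metrics score matrix from the previous steps. Using the mean
    score as the ranking criterion is an illustrative choice.
    """
    row = {name: i for i, name in enumerate(instances)}
    report = []
    for digest_id, members in digests.items():
        block = scores[[row[m] for m in members]]            # score rows of this digest
        metric_rank = sorted(zip(metric_names, block.mean(axis=0)),
                             key=lambda kv: kv[1], reverse=True)
        report.append((float(block.mean()), digest_id, members, metric_rank))
    report.sort(key=lambda entry: entry[0], reverse=True)    # most anomalous digest first
    return report

# Hypothetical inputs, mirroring the clustering sketch above.
instances = ["inst-1", "inst-2", "inst-3", "inst-4"]
metric_names = ["cpu", "latency", "error_rate"]
scores = np.array([[5.1, 4.8, 6.0],
                   [5.0, 4.9, 5.8],
                   [0.1, 0.2, 0.1],
                   [0.0, 0.3, 0.2]])
digests = {0: ["inst-1", "inst-2"], 1: ["inst-3", "inst-4"]}

for mean_score, digest_id, members, metric_rank in build_report(
        digests, scores, instances, metric_names):
    print(f"digest {digest_id} (score {mean_score:.2f}): {members} -> {metric_rank}")
```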

Summary

This article introduced a tool that automatically investigates monitoring metrics when a service failure occurs. In the first step, each metric's anomaly score is estimated with probability and statistics. In the second step, instances with similar anomaly patterns are clustered together to form digests. In the third step, the digests most likely to contain the root cause are recommended to engineers by ranking.

The operations scenario is characterized by huge amounts of data but very few labels, and producing labels is expensive and error-prone. In the following parts, we will describe in detail how probability and statistics, unsupervised learning, and supervised learning can be used to solve this problem. Stay tuned.
