Behind 35 Million Yuan in Savings: How Can Operations Strike a Balance Between Cost and Efficiency?

The 360 operations development team launched its AIOps project at the beginning of the year. After continuous exploration and practice, AIOps has saved the company 35 million yuan.

This article shares how we worked on both operations cost and efficiency to save resources and improve productivity.

It is based on a talk given by Ji Xinpu, machine learning engineer on the 360 operations development team, in the 168th session of the dbaplus community.

Preface

Today we will share our exploration and hands-on experience in AIOps (intelligent operations) over the past few years.

Here is an outline of this talk:

Background introduction

360's thoughts on AIOps

AIOps practice plan

Experience and summary

I. Background

With the explosive growth of Internet software and hardware, new architectures emerge one after another, and operations staff would need to be on duty 24/7 to guarantee system reliability and stability. That is obviously impossible.

So, facing this unprecedented pressure, is there a "machine brain" that can take over, or at least lighten, some of the work of operations staff, greatly reducing their workload and improving operational efficiency? And how do we build such a "machine brain"?

Many operations scenarios boil down to regular patterns, which can be distilled and summarized into a knowledge base of manual experience. Beyond human experience, can AI algorithms analyze historical data and produce machine-generated rules?

The answer, of course, is yes. If AI algorithms plus human experience can be applied to operations to replace part of the manual decision-making, operations will advance from the ordinary automation stage to the intelligent stage.

In recent years, many companies have made attempts in the AIOps field. Our company's AIOps likewise went through preliminary preparation, from initial standardization to later refined, data-driven operations. In 2018 the AIOps project team was formally established; after nearly a year of development it has achieved good results in many single-point applications, and we aim to close the loop on selected scenarios by the end of this year.

II. 360's Thoughts on AIOps

Familiar AIOps scenarios abound: anomaly detection, root cause analysis, fault self-healing, capacity prediction, and so on. Based on the real scenarios on our platforms and the industry's practical experience with AIOps, we divide AIOps into three areas: cost, efficiency, and stability.

On the cost side, AI algorithms drive resource savings and intelligent scheduling, improving resource utilization; on the efficiency side, AI algorithms proactively discover, analyze, and solve problems, genuinely saving manpower and improving efficiency.

So how do you get started with AIOps? We believe AIOps needs three kinds of people: operations staff, operations developers, and machine learning engineers. All three are indispensable; without any one of them, the project will be abandoned halfway.

The above covers our understanding of AIOps; what follows is the purely hands-on part. We will introduce our AIOps best practices along two broad directions and five specific projects.

III. AIOps Practice Plan

1. Foundations

Data accumulation

As the saying goes, "even a clever woman cannot cook without rice." Before starting AIOps, you need to prepare a large amount of data: machine-level metrics, network data, log data, even process data. Our dedicated big data engineers have been collecting data for more than two years, laying a solid foundation for the subsequent data analysis and machine learning models.

Here is a summary of the data we have collected so far:

Capacity estimation

With historical data in hand, we can start analyzing it.

First, let us introduce a scenario called capacity estimation. Predicting important monitoring metrics lets us spot trends in time and provides a scientific basis for later decisions.

Monitoring metric samples are time series. By analyzing a metric's series, we can predict its future values. By volatility, series can be divided into mildly and violently fluctuating; by periodicity, into periodic and non-periodic; and there are many other possible criteria. Clearly, different time series call for different prediction models.

In predicting time series we tried the following models in succession and drew some lessons from them:

Many time series are periodic, so we also built a periodicity detection model that judges whether a series is periodic. On top of periodicity detection, we can then predict different time series according to their periodic characteristics.
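The article does not show the detector itself; as a minimal sketch, periodicity can be tested with autocorrelation at a few candidate lags (the candidate periods and threshold below are illustrative assumptions, not 360's actual model):

```python
import numpy as np

def detect_period(series, candidate_periods=(24, 168), threshold=0.5):
    """Guess whether an hourly series is periodic via autocorrelation.

    candidate_periods: lags to test, e.g. 24 (daily) and 168 (weekly)
    threshold: minimum autocorrelation to call the series periodic
    """
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = (x * x).sum()
    best_period, best_acf = None, 0.0
    for lag in candidate_periods:
        if lag >= len(x) or denom == 0:
            continue
        acf = (x[:-lag] * x[lag:]).sum() / denom  # autocorrelation at this lag
        if acf > best_acf:
            best_period, best_acf = lag, acf
    return (best_period, best_acf) if best_acf >= threshold else (None, best_acf)
```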

Many prediction models have already been summarized in the literature; we used the following models in our project. You can choose among them based on time cost and accuracy. We plan to open source all of the above prediction methods in the near future, so please stay tuned:
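The slide listing the models is not reproduced here. As one illustrative option for a periodic series (not necessarily one of the models 360 used), a seasonal Holt-Winters forecast with statsmodels might look like this:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Toy hourly metric with a daily (24-point) cycle plus noise
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=24 * 14, freq="h")
values = 50 + 20 * np.sin(2 * np.pi * np.arange(len(idx)) / 24) + rng.normal(0, 2, len(idx))
series = pd.Series(values, index=idx)

# Additive trend and additive daily seasonality
model = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=24).fit()
print(model.forecast(24))  # prediction for the next 24 hours
```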

Host classification

In real projects we often run into classification tasks: judging whether a machine is idle from the features of its host monitoring metrics, for example, or judging a machine's type (CPU-, disk-, or memory-intensive) from the same features.

Machine learning offers many classification algorithms, such as SVMs and decision trees, any of which can complete the task. We only need some preprocessing and feature engineering, after which we can use off-the-shelf classification models in Python, so we will not go into detail here.
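As a minimal sketch of that off-the-shelf workflow with scikit-learn, using synthetic data and hypothetical per-host features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
# Hypothetical features per host: [cpu_p95, mem_p95, net_p95, disk_p95, conn_p95]
busy = rng.uniform(0.4, 1.0, size=(200, 5))
idle = rng.uniform(0.0, 0.2, size=(200, 5))
X = np.vstack([busy, idle])
y = np.array([0] * 200 + [1] * 200)  # 0 = busy, 1 = idle (from manual/user labeling)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```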

2. Projects

With the basic modules of capacity estimation and host classification in place, we built a resource recovery system and then an intelligent scheduling system on the cost side, both with good results.

Resource recovery

Resource recovery means discovering idle machines in time and notifying the business to recycle them, thereby improving resource utilization.

Our resource recovery system consists of three modules: capacity estimation, host classification, and notification. The capacity estimation module predicts and quantitatively analyzes five important metrics (CPU utilization, memory utilization, NIC traffic, disk utilization, and number of connections) to produce five features. The classifier then uses these five features to identify a list of idle machines, and finally the owners of the corresponding businesses are notified.
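A hedged sketch of how the three modules might be glued together; every callable and name here (forecast, notify, owner_of) is a hypothetical placeholder, not 360's actual interface:

```python
import numpy as np

def idle_machine_report(hosts, forecast, classifier, notify, owner_of):
    """Hypothetical glue for the three-module pipeline.

    forecast(host, metric) -> predicted series for the coming period
    classifier             -> trained idle-vs-busy model (see sketch above)
    notify(owner, message) -> e.g. a work order or IM message
    """
    metrics = ("cpu", "mem", "net", "disk", "conn")  # the five key indicators
    idle = []
    for host in hosts:
        # Squash each forecast series into one quantitative feature,
        # e.g. its 95th percentile over the prediction window.
        features = [np.percentile(forecast(host, m), 95) for m in metrics]
        if classifier.predict([features])[0] == 1:   # 1 = idle
            idle.append(host)
            notify(owner_of(host), f"{host} appears idle; please recycle it")
    return idle
```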

In AIOps we often face a shortage of negative samples: abnormal scenarios are rare, and user labeling is relatively expensive.

In host classification we generated samples in two ways, manual labeling and user labeling, which solved the shortage of negative samples. The chart below shows the effect of resource recovery in Q2, which so far is quite good:

MySQL Intelligent Scheduling System

Our online MySQL machines suffer from serious waste, as in the following scenario: as soon as any one resource is under high load, the machine is written off as unavailable. But think about it: high memory usage alone does not make a machine unusable. We can schedule instances with high CPU usage but low memory usage onto such a machine and put its resources to full use.

To match different types of machines with different types of instances properly, both instances and machines must be classified. In this project a BP (back-propagation) neural network is used for instance classification; the input is 7 important instance metrics and the output is one of 4 categories (low-consumption, compute-intensive, storage-intensive, and comprehensive).

Machine classification uses a decision tree model; the input is 5 machine metrics and the output uses the same categories as for instances. All samples were labeled manually, about 1,000 in total.
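With scikit-learn, the BP neural network (a multi-layer perceptron trained by back-propagation) and the decision tree could be sketched like this; the synthetic data, layer sizes, and hyperparameters are assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

CATEGORIES = ["low-consumption", "compute", "storage", "comprehensive"]

rng = np.random.default_rng(1)
# Placeholder for the ~1,000 manually labeled samples mentioned in the article
X_inst = rng.random((1000, 7))      # 7 instance metrics
y_inst = rng.integers(0, 4, 1000)   # 4 categories
X_mach = rng.random((1000, 5))      # 5 machine metrics
y_mach = rng.integers(0, 4, 1000)

# BP neural network = MLP trained with back-propagation
inst_clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=1)
inst_clf.fit(X_inst, y_inst)

mach_clf = DecisionTreeClassifier(max_depth=6, random_state=1).fit(X_mach, y_mach)
print(CATEGORIES[inst_clf.predict(X_inst[:1])[0]])
```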

Once machines and instances are classified, scheduling can begin. The scheduling process weighs a number of constraints (a sketch of a feasibility check follows this list):

Keep the number of migrations as small as possible

Avoid master switchovers wherever possible

Ensure the stability of masters and high-capacity ports

Limit the number of masters per machine (no more than 5) and the total number of instances on each machine

Instances of the same port must not land on the same machine

Never schedule onto blacklisted machines
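The article gives only the constraints, not the scheduler itself. A minimal sketch of a per-move feasibility check for the hard constraints might look like this (the data layout and the max_instances cap are assumptions; the soft goals of minimizing migrations and master switchovers would live in the surrounding search, not in this check):

```python
def can_place(machine, instance, current, blacklist, max_masters=5, max_instances=20):
    """Hard-constraint check for moving one instance onto one machine.

    current:  list of instance dicts already on `machine`
    instance: dict with hypothetical keys "port" and "is_master"
    max_instances is an assumed knob; the article only fixes the master cap at 5.
    """
    if machine in blacklist:
        return False                                  # skip blacklisted machines
    if any(i["port"] == instance["port"] for i in current):
        return False                                  # same port never co-located
    if instance["is_master"]:
        masters = sum(1 for i in current if i["is_master"])
        if masters >= max_masters:
            return False                              # at most 5 masters per machine
    if len(current) >= max_instances:
        return False                                  # cap total instances per machine
    return True
```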

We tested these rules on the instances of one data center: 45 port migrations brought 14 of the 30 heavily loaded machines back into a usable state.

Cost has been a major direction of our work this year. Beyond the two projects above, we also use time-shared computing to save further resources. This year's goal is to save the company 50 million yuan in costs; so far we have saved 35 million, short of the target, so we need to keep pushing.

That covers the cost side; the following sections describe the efficiency projects.

Anomaly detection

Anomaly detection is the most common AIOps scenario, and there are many algorithms for it: the common statistical 3σ rule, which flags points by their offset from the mean; ordinary regression methods, which fit a curve and flag new points that deviate too far from the fit; and even CNN and RNN models applied to outlier detection.

Our company uses LVS heavily, and to catch sudden surges and drops in traffic we need an anomaly detection algorithm.

Analyzing the time-series plots of LVS traffic, we found that some curves are periodic and some are not, some are spiky and some are smooth, so a universal detection algorithm is needed to handle these varied scenarios.

In reality negative samples are scarce, so we adopt unsupervised models. We also borrow the idea of a voting mechanism to offset the bias that any single method occasionally shows.

In this project we used more than five detection algorithms, including statistical year-over-year comparison, curve fitting, and Zhou Zhihua's isolation forest model. They are applied to a time series together: if more than half of the algorithms consider a point anomalous, we treat it as an outlier.
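A minimal majority-vote sketch with three of the detectors named above (the real system used more than five; the thresholds and polynomial degree here are assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def three_sigma_vote(history, point):
    mu, sigma = np.mean(history), np.std(history)
    return sigma > 0 and abs(point - mu) > 3 * sigma

def iforest_vote(history, point):
    clf = IsolationForest(random_state=0).fit(np.reshape(history, (-1, 1)))
    return clf.predict([[point]])[0] == -1            # -1 marks an outlier

def curve_fit_vote(history, point, deg=3, k=3.0):
    x = np.arange(len(history))
    fit = np.poly1d(np.polyfit(x, history, deg))      # fitted trend curve
    resid = np.asarray(history) - fit(x)
    return abs(point - fit(len(history))) > k * np.std(resid)

def is_outlier(history, point):
    votes = [three_sigma_vote(history, point),
             iforest_vote(history, point),
             curve_fit_vote(history, point)]
    return sum(votes) > len(votes) / 2                # majority vote wins

history = [10, 11, 9, 10, 12, 10, 11, 10, 9, 11] * 5
print(is_outlier(history, 40))  # a strong spike should win the vote
```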

Tracking nearly half a year of online LVS traffic data, the detection accuracy exceeded 95%, a good result.

Alarm convergence

To keep systems reliable, operations staff set up many monitoring items so they can learn about system status in time. When a monitoring item exceeds its threshold, some indicator in the system has a problem that staff must handle. If every alarm is sent directly, without filtering, it piles pressure on the operations staff; and as alarm volume grows, alarm fatigue sets in and the alarms lose their effect.

We analyzed historical alarms and found many patterns. If algorithms are used to analyze the relationships among alarm items, combined with manual experience, the number of alarms can be greatly reduced.

Manual experience needs no introduction; below we describe how an algorithm can uncover the latent relationships among alarm items.

We use the Apriori algorithm, a standard tool for association analysis in machine learning, to analyze historical alarms. The model mines frequent itemsets to find relationships of the form A→B. Applied to alarms: if alarm A has already been issued, alarm B need not be, cutting such alarm pairs in half. The figure below shows our analysis of the past 30 days of alarm data, which yielded 20+ association rules:
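A small sketch of mining such rules with the mlxtend library, using toy alarm windows; the real windowing, support, and confidence thresholds are not given in the article:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Each transaction = the set of alarm types fired in one time window (toy data)
windows = [
    ["host.alive", "cpu.idle", "mem.swapused.percent"],
    ["host.alive", "cpu.idle"],
    ["df.bytes.used.percent", "disk.io.util"],
    ["host.alive", "cpu.idle", "disk.io.util"],
]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(windows).transform(windows), columns=te.columns_)

# Frequent itemsets, then confidence for every A -> B pair among them
freq = apriori(df, min_support=0.5, use_colnames=True)
support = {frozenset(s): v for s, v in zip(freq["itemsets"], freq["support"])}
for pair in (s for s in support if len(s) == 2):
    for a in pair:
        (b,) = pair - {a}
        conf = support[pair] / support[frozenset([a])]
        if conf >= 0.9:  # A almost always accompanies B, so B can be suppressed
            print(f"if {a} fires, suppress {b} (confidence={conf:.2f})")
```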

We maintain an online rule base drawn from two sources: rules from algorithmic analysis and rules summarized by hand. While applying these rules, we also fold in business severity ratings to consolidate business alarms to some degree. Over half a year of tracking, this rule base reduced alarm volume by 60%-80%.

Root cause analysis of alarm events

The previous section described ways to reduce alarms, but in reality alarms are unavoidable. Once an alarm fires, quickly locating the specific problem becomes the key step. So how can a model locate the problem?

Statistical analysis shows that the six categories of alarms occurring most frequently on our systems are:

Host alive (host.alive);

Disk space usage (df.bytes.used.percent);

Disk partition read-only (sys.disk.rw);

CPU utilization (cpu.idle);

Memory usage (mem.swapused.percent);

Disk I/O utilization (disk.io.util).

After an alarm fires, operations staff must log in to the machine or the monitoring system to see which monitoring items or processes misbehaved during the incident window. Heavy, regular work of this kind is especially well suited to a model. This section introduces a model that helps operations staff narrow the scope of alarm troubleshooting and locate problems quickly.

This project analyzes data along two dimensions:

One is the event dimension, which focuses on the six categories of alarm events above;

The other is the metric dimension, which focuses on machine-level monitoring items (about 200 of them).

So, after an event occurs, how do we find the metrics related to it? We proceed in three steps (a sketch follows the list):

1) For each event, use the method from the paper "Correlating Events with Time Series for Incident Diagnosis" (SIGKDD 2014) to find which metrics correlate with the event's occurrence. This screens the metrics and reduces dimensionality.

2) For the metrics selected in step 1, compute the information gain ratio of each and keep the top k (we use k = 5) as the final influencing metrics.

3) Finally, use XGBoost to classify on the influencing metrics and verify the effect.
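A hedged sketch of steps 2 and 3 on toy data: a hand-rolled information gain ratio over discretized metrics, then an XGBoost classifier on the top 5. The bin count and hyperparameters are assumptions, and step 1's correlation screen is assumed to have already produced X:

```python
import numpy as np
from xgboost import XGBClassifier

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(feature, labels, bins=10):
    """Information gain ratio of one (discretized) metric w.r.t. event labels."""
    binned = np.digitize(feature, np.histogram_bin_edges(feature, bins))
    h_y = entropy(labels)
    h_y_given_x = sum(
        (binned == b).mean() * entropy(labels[binned == b])
        for b in np.unique(binned)
    )
    split_info = entropy(binned)
    return (h_y - h_y_given_x) / split_info if split_info > 0 else 0.0

# X: candidate metrics surviving the SIGKDD-2014 screen; y: event occurred or not
rng = np.random.default_rng(2)
X, y = rng.random((500, 20)), rng.integers(0, 2, 500)

top5 = np.argsort([gain_ratio(X[:, j], y) for j in range(X.shape[1])])[-5:]
clf = XGBClassifier(n_estimators=100, max_depth=3).fit(X[:, top5], y)
print("top-5 metric indices:", top5)
```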

The figure below shows the analysis results for these six categories of alarms. Taking the top 5 metrics most relevant to each alarm event yields reasonably good accuracy:

For example, the next time a "host.alive" alarm fires, it is most likely caused by 'cpu.idle', 'net.if.total.bits.sum', 'mem.memused.percent', 'mem.swapused.percent', and 'ss.closed', which shortens troubleshooting.

IV. Experience and Summary

After nearly a year of effort, we have achieved good results in several single-point applications. Here is a preview of what comes next:

Locating alarms down to the process level;

Open-sourcing components (capacity estimation, anomaly detection, and correlation analysis of alarm events);

Putting chatbots into operation.

In future work we will string these single points together in concrete scenarios, forming a true closed loop from discovering anomalies, through analyzing them, to finally solving them.

That is all for this share. Thank you for listening!

Live playback: https://m.qlchat.com/topic/details?topicId=2000002350036659&tracePage=liveCenter

HULK First-line Technology Talk

A technology-sharing public account created by the 360 cloud platform team, covering cloud computing, databases, big data, monitoring, front-end, automated testing, and many other fields. With solid technical accumulation and rich first-line practical experience, it brings you the most valuable technology sharing.

Original link: mp.weixin.qq.com/s/8ZvBhrnEr89CcqIwhG6YNg
