This article discusses how to design an alarm system. It goes into some detail; interested readers are welcome to use it as a reference, and I hope it is helpful.
The nature of alarm
Alarm object
Indicators and strategies for monitoring
Theory and reality
Anomaly detection
Smoothness detection based on curve
Time periodicity based on absolute value
Time periodicity based on amplitude
Abnormal judgment based on curve rebound
Summary of core points
The nature of alarm
Few system alarms are properly designed. Good alarm design is genuinely hard.
How do you know the alarms you receive are bad ones? How many times have you dismissed an alarm the moment you received it? Are you drowning in these useless notifications all day?
The most common alarm rule is: alert when CPU usage exceeds 90%. In most situations this does not produce a high-quality alarm.
A high-quality alarm works like this: every time you receive one, you can immediately assess its scope of impact, and every alarm demands a tiered response from you. In short, every alarm should be actionable.
The nature of alarms can be summed up as follows.
A service should be designed to run unattended: even if the entire operations team went on holiday, the service should keep running 24/7 by itself.
The essence of an alarm is to "use a person as a service". When something cannot yet be handled programmatically, an alarm notification pulls a human into the system to set it right.
An alarm is like a service call. If an alarm fires but the person receiving it does not need to do anything, then it is effectively a DDoS attack on the operations staff's quality of life.
Most of the time, what an alarm asks a person to do really can be automated. For example, if a server goes down, bring up another one.
In a small system, the service may simply stop for a while until someone manually swaps in a cold standby machine.
In a larger system, with many servers failing routinely, hot standby is required and the system switches to the standby machine automatically.
In an even larger system, failover is so frequent that decommissioning failed machines and maintaining the standby pool become a management burden in themselves, so these steps are wired into other operations workflows to form a fully automated pipeline.
Different stages of a business simply call for different implementation strategies. When the business is small, using flesh and blood as the machine is sometimes the more economical choice. Of course, life is a little unfair to the person being used as a robot.
Alarm object
There are two types of alarm objects:
Business rule monitoring
System reliability monitoring
Business rule monitoring is easiest to explain with game examples.
For example, in DNF, a character with a given set of equipment has an upper bound on single-hit damage output; if that bound is exceeded, someone is cheating.
For example, in a Dou Dizhu (Fight the Landlord) game, there is a limit to how long a winning streak can plausibly run and how high a daily win rate can plausibly be. If someone exceeds the average by too much, they may be cheating.
Business rule monitoring does not check whether hardware or software is working properly. It checks whether the software behaves according to the business rules and whether there are loopholes. It can also be understood as monitoring "correctness".
System reliability monitoring is the most common form of monitoring, such as discovering whether the server is down, whether the service is overloaded, and so on.
For most backend services, the system can be modeled abstractly as requests arriving, queuing, being processed, and returning responses.
What indicators can be collected for such a system?
Number of requests, request arrival rate
Normal response ratio
Number of error responses
Response delay
Queue length, queue time
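As a side illustration (not from the original article), these per-window indicators could be grouped into a small record; every name below is hypothetical:

    from dataclasses import dataclass

    @dataclass
    class ServiceMetrics:
        """Basic indicators collected for one time window (hypothetical names)."""
        request_count: int        # number of requests that arrived in the window
        ok_count: int             # responses that completed successfully
        error_count: int          # responses that ended in error
        latency_ms_p99: float     # 99th-percentile response delay
        queue_length: int         # requests still waiting at the end of the window
        queue_time_ms_avg: float  # average time spent waiting in the queue

        @property
        def success_ratio(self) -> float:
            # Proportion of requests answered correctly; treat an idle window as healthy.
            return self.ok_count / self.request_count if self.request_count else 1.0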
In reality, almost no system runs in isolation. Instead it looks like this:
a DB depends on underlying CPU, memory, disk and other resources; an HTTP service depends on the DB service beneath it; an application depends on several underlying RPC services.
As a result, a few more indicators appear:
Consumption of resource A (e.g. CPU utilization)
Consumption of resource B (e.g. memory allocation and release)
Consumption of resource C (e.g. number of packets sent on the network)
...
Generally, this dependency structure can be divided into four layers:
Product strategy and marketing: they determine the rate at which requests arrive in the first place
Application layer (or, more crudely, the web layer): the glue at the top
Service layer: the DB, various RPC services, and further layers of nested services
Hardware layer: CPU, memory, disk, network
Because of this layered dependency, the resource consumption of one layer becomes the request volume of the layer below it.
For example, how much DB resource the HTTP service consumes is exactly the number of requests the DB service has to handle. Whether the DB is busy depends on the HTTP service's requests; whether the HTTP service is busy depends on how many people open the client; and how many people open the client depends on product strategy and marketing campaigns.
Because of this hierarchy, tracking a single indicator in isolation, such as the absolute number of requests, makes it hard to tell whether the service at that layer has actually failed.
There are many layers, and each layer has many indicators that could be collected. So which indicators should be collected, and what alarm strategy should be applied to them?
It was mentioned earlier that alarms must be actionable, but in practice that principle alone is hard to operationalize. At the very least, here are a few things you should not do:
Do not let ease of collection decide which indicators you alarm on. CPU usage is often the easiest thing to collect, but it may not be the thing most worth alarming on.
Do not give operators the alarm they ask for; give them what they actually need. Most of the time, what people tell you is their proposed solution. An operator tells you to alert when the CPU utilization of the db process exceeds x%; that is the solution they have in mind. What they really want to know is whether the db service is behaving abnormally, and CPU usage above x% may not be the best indicator of that.
Blindly collecting whatever indicators are easy to obtain and slapping arbitrary thresholds on them is the root cause of most poor-quality alarms.
Indicators and strategies for monitoring
So which indicators should be collected? In my view, most system reliability monitoring comes down to three questions:
Is the work getting done? Does the system keep completing the work assigned to it?
Is the user having a good experience?
Where is the problem or bottleneck?
The most crucial question is the first one: is the work getting done. For a database, we could collect:
CPU utilization
Network bandwidth usage
Number of DB requests
Number of DB responses
Number of DB error responses
DB request delay
Clearly, to answer whether the DB has completed its assigned work, these two indicators matter most:
the absolute number of DB requests
the proportion of DB requests answered correctly
These two indicators say far more than any CPU usage figure. And not just for the DB: services at every layer can express how well they are working through request volume and the proportion of correct responses.
For example, the number of HTTP requests (compared with the number of correct HTTP responses), or the number of app launches (compared with the number of online users recorded by the server), and so on.
Why doesn't CPU usage tell the story? Most of the time we do not care about the CPU itself; we care about the services that use the CPU as a resource. CPU usage is merely the request volume placed on one particular resource.
A concept related to request volume is saturation (the upper limit of capacity). Once the limit is reached, requests start to queue, latency grows, and the error rate begins to rise.
So can CPU usage indicate that limit? CPU utilization tops out at 100%, so isn't it reasonable to start alarming at 90%? After all, 100% CPU is almost equivalent to the DB being unable to handle requests properly.
This practice of using low-level resource utilization to judge whether the upper limit has been reached has two fundamental flaws:
You do not know to what extent the upper-level service can actually make use of the underlying resource.
The saturation point of the underlying resource may not be easy to measure.
Concretely, it is unknown whether the DB can really drive the CPU to 100%. If requests block on locks or sleep, the CPU may never reach 100%, and 90% may already be the practical limit.
Modern CPUs are also multi-core. If a request can only be processed on a single core, and processing hops between cores, no single core will ever stay at 100%.
For CPU there is at least a nominal 100% ceiling. But many dependencies are not hardware. If you run a login service, it depends on a DB; how many of your particular mix of SQL statements that DB can handle per second is hard to measure, and there is certainly no absolute MB/s limit to compare against the way there is for a disk.
Measuring underlying resource usage has one more drawback: you cannot enumerate every resource you depend on. So rather than indirectly monitoring whether the upper-level service is healthy through its underlying resources, it is better to directly measure whether the work is getting done.
For the second question, is the user having a good experience, the indicators to collect are:
Average queue time and average total response delay
99th/95th/90th percentile queue time and 99th/95th/90th percentile response delay
The "user" here is not necessarily a human or a player; it may be the upper-level caller of the service, i.e. another system.
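As a small aside, here is a minimal sketch of how such percentiles could be computed from a window of raw latency samples; the function name and the nearest-rank method are my own choices, and real systems typically use streaming histograms or sketches instead:

    def latency_percentiles(samples_ms, percentiles=(90, 95, 99)):
        """Return the requested latency percentiles from a list of samples (ms).

        Uses the nearest-rank method; assumes the window of samples fits in memory.
        """
        if not samples_ms:
            return {p: None for p in percentiles}
        ordered = sorted(samples_ms)
        result = {}
        for p in percentiles:
            # nearest-rank index: ceil(p/100 * N) - 1, clamped to a valid range
            rank = max(0, min(len(ordered) - 1, -(-p * len(ordered) // 100) - 1))
            result[p] = ordered[rank]
        return result

    # Example: percentiles of a small window of response delays
    print(latency_percentiles([12, 15, 11, 230, 14, 13, 16, 12, 400, 15]))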
The third question is what is usually called fault localization. Done manually, it usually means getting an alarm, logging into a terminal (SecureCRT or the like), and typing commands until you find the cause.
What a system should do is not run a pile of commands when something breaks, but rather:
Have every layer alarm on itself.
Have the top-level service's alarm trigger an automatic localization routine.
Correlate alarms using the service dependency graph and the approximate time window, and thereby find the problem or the bottleneck.
Of course, reality is messier. Many causes and effects feed back into each other; whether two alarms are two symptoms of one cause, or one is the cause of the other, is often genuinely hard to untangle.
From the algorithmic point of view, alarming on request success rate or average response delay is easy. Static thresholds are looked down on as too simple, yet most alarms are handled perfectly well with static thresholds.
Theory and reality
Does alarming need sophisticated algorithms?
My view: if the right indicators are collected, no complex algorithm is needed; static thresholds are enough. But there are at least three situations where algorithms become necessary:
The error count cannot be collected directly: error logs must be classified automatically.
The request success rate cannot be collected directly: anomaly detection must run on the absolute number of requests or responses.
Only a total is available and the breakdown by component cannot be collected: the contribution of each factor has to be fitted algorithmically.
These three cases are really the same theme: things get much more complicated when you cannot directly obtain the indicators the alarm needs.
An analogy: Kepler-452b, the "Earth twin" recently announced by NASA. If our probes could travel 1,400 light-years, finding it would be easy.
It is precisely because the data cannot be obtained directly that scientists must infer such distant planets from the dips in brightness caused by a planet passing in front of its star (the so-called transit method).
Why might the needed indicators be hard to collect? One reason is that collection itself can be very expensive.
For example, obtaining the CPU consumed by each individual MySQL query: tracing every request is out of the question, so an algorithm has to help. Worth a look is this talk from VividCortex:
http://www.youtube.com/watch?v=szAfGjwLO8k
More often, though, the difficulty is a communication problem caused by the split between development and operations. The indicators operations needs require developers to add instrumentation, and developers do not always know what operations wants to alarm on. Most of the time this degenerates into alarming on whatever indicators happen to exist.
For example, even if the number of error responses is not collected, errors are usually logged, so from how fast the error log scrolls you can roughly tell whether something is wrong.
That introduces the very hard problem of log classification: which logs represent normal operation, which represent exceptions, and what kind of exception?
A company that does this well is Sumo Logic:
https://www.sumologic.com/
Why are these opsdev companies (a dig at devops) so keen on algorithms?
For them the benefit is obvious: the fewer changes the customer has to make, the lower the onboarding cost and the broader the reachable market.
But is mining massive logs with machine algorithms really the best way to answer "is the work getting done"?
Clearly not. It is using a cannon to swat a mosquito. Logs exist to help solve problems; once you have a huge volume of logs, making good use of them becomes a problem in itself.
The third situation is when the request success rate cannot be collected and only the absolute volume of successfully processed requests is available. If that is all there is to alarm on, a simple static threshold cannot be applied.
For latency, you can usually set an upper bound that the business finds acceptable. For success rate, you can set an acceptable lower bound. But for absolute throughput there is no static threshold you can simply compare against to decide normal versus abnormal.
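For illustration only, such static thresholds amount to a pair of comparisons; the function name and the bound values below are made up:

    # Hypothetical static-threshold checks; the bounds are illustrative only.
    LATENCY_P99_UPPER_MS = 500.0   # business-acceptable upper bound on p99 latency
    SUCCESS_RATE_LOWER = 0.995     # business-acceptable lower bound on success rate

    def static_threshold_alarms(latency_p99_ms: float, ok: int, total: int) -> list:
        """Return a list of alarm messages for one measurement window."""
        alarms = []
        if latency_p99_ms > LATENCY_P99_UPPER_MS:
            alarms.append(f"p99 latency {latency_p99_ms:.0f}ms exceeds {LATENCY_P99_UPPER_MS:.0f}ms")
        success_rate = ok / total if total else 1.0
        if success_rate < SUCCESS_RATE_LOWER:
            alarms.append(f"success rate {success_rate:.3%} below {SUCCESS_RATE_LOWER:.3%}")
        return alarms

    print(static_threshold_alarms(820.0, ok=9893, total=10000))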
Before discussing how to do this, two more points are worth emphasizing:
The volume of successful processing is not the best indicator of "is the work getting done". If you are willing to go to the trouble of building an algorithm, you would do better to just collect the success-rate indicator directly.
The volume of successful processing also depends on the request volume, and the request volume mostly depends on the upper-level service.
Suppose you are a DBA and you notice that the number of requests the DB processes per second has plummeted. Does that mean something is wrong with the DB? Or with the app?
It might be neither: at the very top sit product and marketing.
If you find that today's registrations for some business are lower than a few days ago, does that mean the registration service is broken?
Maybe the product is simply so bad that nobody comes to play the game anymore. Or maybe marketing stopped handing out gold coins and the players lost their enthusiasm.
Anomaly detection
With only the request count, no saturation limit to compare against, no success rate, and no failure rate, how do you detect anomalies?
In a typical monitoring graph, the yellow line is yesterday's values and the green line is today's; most service monitoring graphs look like this. Four observations can be drawn:
Curve smoothness: a failure usually breaks the recent trend, so the curve visibly stops being smooth.
Time periodicity of absolute values: the two curves almost coincide.
Time periodicity of fluctuations: even when the two curves do not coincide, the direction and amplitude of their fluctuations are similar at the same time of day.
A pit of considerable length: when the curve climbs back into its historical range, you can usually confirm that the preceding dip really was a failure.
Expanding on these four intuitions yields a variety of algorithms, complex or simple. The algorithms discussed below are all very simple and require no advanced mathematics.
Smoothness detection based on curve
This kind of detection looks at a recent time window, say one hour. The curve follows some trend, and a new data point that breaks the trend makes the curve non-smooth.
In other words, this detection exploits the temporal dependence of a time series: the value at time T depends strongly on the value at T-1.
Intuitively, if many people are logged in at 8:00, the probability that many are logged in at 8:01 is very high, because whatever attracted them has strong inertia.
But the fact that many people logged in on July 1 says much less about August 1; over that horizon the inertia is far weaker.
To alarm on recent trend, you must fit the trend of the curve.
There are two ways to fit it: moving average or regression. The two methods have different biases (tendencies).
A common moving-average scheme is the exponentially weighted moving average (EWMA). Its calculation is very simple:
s(t) = alpha * x(t) + (1 - alpha) * s(t-1)
where x(t) is the actual value and s(t) is the EWMA average. That is, the new average is the previous average corrected by the current actual value.
How strong the correction is depends on the decay factor alpha. Visually, alpha controls how closely the EWMA curve follows the actual curve, i.e. how smooth it is.
With the average in hand you can compute the variance, and multiplying the standard deviation by some factor gives a tolerance band around the average. If the actual value falls outside the band, an alarm can be raised.
Above the upper bound, there may be a sudden surge of users; below the lower bound, the marketing campaign may have ended and users left quickly, or a fiber cut knocked the players offline.
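A minimal sketch of such an EWMA-with-band detector follows; the alpha value, the k multiplier, and the way the residual variance is tracked are illustrative choices rather than anything prescribed by the article:

    # Minimal EWMA-band anomaly check; alpha and the k*sigma multiplier are
    # illustrative choices, not values prescribed by the article.
    def ewma_band_alarm(points, alpha=0.3, k=3.0):
        """Yield (index, value, lower, upper, is_anomaly) for each point after the first.

        s(t) = alpha * x(t) + (1 - alpha) * s(t-1); the variance of the residuals
        x(t) - s(t-1) is tracked the same way, and the band is mean +/- k * std.
        """
        s = points[0]          # EWMA mean, seeded with the first observation
        var = 0.0              # EWMA of squared residuals
        for i, x in enumerate(points[1:], start=1):
            std = var ** 0.5
            lower, upper = s - k * std, s + k * std
            is_anomaly = std > 0 and not (lower <= x <= upper)
            yield i, x, lower, upper, is_anomaly
            resid = x - s                              # x(t) - s(t-1)
            s = alpha * x + (1 - alpha) * s            # update the mean
            var = alpha * (resid ** 2) + (1 - alpha) * var

    # Example: a steady curve with one sudden drop
    series = [100, 102, 101, 99, 103, 100, 102, 60, 98, 101]
    for i, x, lo, hi, bad in ewma_band_alarm(series):
        if bad:
            print(f"point {i}: value {x} outside [{lo:.1f}, {hi:.1f}]")

Running this flags the drop to 60, and it also demonstrates the drawback discussed below: after the outlier, the inflated variance widens the band and masks subsequent points.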
For more detail on the EWMA algorithm, see Baron Schwartz's slides:
http://www.slideshare.net/vividcortex/statistical-anomaly-detection
A moving average assumes the curve tends toward its history: if the curve has upward momentum, it expects the next point to start coming back down.
Regression assumes the curve tends toward its future: if the curve has upward momentum, it expects the next point to keep that upward momentum.
There are more elaborate models that blend moving average and regression. But no matter which algorithm you use, the past 10 minutes cannot predict the next 10 minutes accurately; if they could, the stock-market genius would have been born long ago.
A moving average may mask a drop caused by a failure (because its bias is toward falling back).
Regression may flag "not rising as fast as expected" as a failure (because its bias is toward continuing the rise).
Another drawback of computing variance from the recent trend is that if the first few points swing wildly, the variance is inflated and subsequent failures are masked, so consecutive failing points cannot be detected.
In fact the algorithm has no notion of what "normal" is; it treats the recent past as normal. If the last few minutes were themselves a failure, the failed curve becomes the baseline.
In practice, the advantages of this smoothness-based approach are:
It needs little data, only recent history, and does not rely on periodicity.
It is very sensitive: if history fluctuates very little, the variance is small and the tolerated band is tight.
The disadvantages are equally significant:
It is too sensitive and false-alarms easily. Because outliers inflate the variance, a strategy such as "alarm only after three consecutive out-of-band points" is hard to apply.
The business curve may have its own legitimate sharp rises and drops.
The best way to use it is therefore not to alarm on a single curve. Combine several related curves: if their smoothness breaks at the same time, in a way that contradicts the normal business relationship (for example online users drop while login requests rise), you can conclude there is a business failure.
Time periodicity based on absolute value
The different colors in such a chart represent curves from different dates.
Many monitoring curves have this one-day periodicity (lowest around 4 a.m., highest around 11 a.m., and so on). The simplest algorithm that exploits the periodicity is:
min(14 days of history) * 0.6
That is, take the pointwise minimum over the last 14 days of the curve. How is the minimum obtained?
For 12:05 there are 14 historical points, one per day; take their minimum. For 12:06 there are another 14 points; take their minimum. Doing this for every minute yields a one-day reference curve.
Then multiply that curve by 0.6. If today's curve drops below this reference line, alarm.
This is really an upgraded version of static-threshold alarming: dynamic thresholds.
A traditional static threshold is a number someone picked by gut feeling from past experience. With this algorithm, the historical values at the same time of day are used to compute the lowest value that is plausibly still normal. And the threshold is no longer a single number but one per time point: with one point per minute, there are 1,440 lower-bound thresholds in a day.
Of course, the 0.6 should be tuned for the actual situation. A more serious problem is that if the 14-day history contains a deployment outage or a failure, the minimum is polluted by it.
In other words, history should not be assumed to be normal; outliers should be removed before the computation. A pragmatic approximation is to take the second-smallest value instead of the minimum.
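A rough sketch of this dynamic lower bound, assuming history is stored as 14 per-minute day-curves; the data layout and function names are my own:

    # Sketch of the "min over 14 days * 0.6" dynamic lower bound, with the
    # second-smallest value used as a cheap guard against outliers in history.
    # Assumes `history` is a list of day-curves, each a list of per-minute values.
    def dynamic_lower_bound(history, factor=0.6):
        """Return one lower-bound threshold per minute of the day."""
        minutes = len(history[0])
        bounds = []
        for m in range(minutes):
            same_minute = sorted(day[m] for day in history)
            second_smallest = same_minute[1] if len(same_minute) > 1 else same_minute[0]
            bounds.append(second_smallest * factor)
        return bounds

    def below_baseline(today, bounds):
        """Indices of minutes where today's curve dropped below the dynamic bound."""
        return [m for m, (v, b) in enumerate(zip(today, bounds)) if v < b]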
To make the alarm more precise, you can accumulate the difference between the actual curve and the reference curve, i.e. the area by which the curve falls below the reference, and alarm when that area exceeds some value.
A deep drop accumulates enough area within a few points; a shallow drop also triggers eventually, once enough points have accumulated.
In plain language: if it drops a lot at once, it is probably a failure; and if it stays below normal for a long stretch, there is probably also a problem.
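Continuing the sketch above, the accumulated-area refinement might look like this; the budget value and the choice to reset the accumulator once the curve recovers are assumptions on my part:

    # Follow-up sketch: accumulate the area below the dynamic baseline computed
    # above and alarm once it exceeds a budget. `area_budget` is illustrative.
    def area_below_baseline_alarm(today, bounds, area_budget=500.0):
        """Return the first minute at which the accumulated shortfall exceeds the budget."""
        area = 0.0
        for m, (v, b) in enumerate(zip(today, bounds)):
            shortfall = b - v
            if shortfall > 0:
                area += shortfall          # deep drops add area quickly
            else:
                area = 0.0                 # one possible policy: reset once the curve recovers
            if area > area_budget:
                return m
        return None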
Advantages:
The calculation is simple.
It reliably catches major failures: when it alarms, something big is wrong, and you can pick up the phone right away.
Disadvantages:
It relies on periodic historical data, involves a fair amount of computation, and cannot alarm on newly onboarded curves.
It is very insensitive; small dips go undetected.
Time periodicity based on amplitude
Sometimes the curve is periodic, but the curves from two different cycles do not lie on top of each other.
For example, the overall trend of the curve may be upward, so when the curves from two cycles are overlaid, one sits higher than the other. In that case, alarming on absolute values runs into trouble.
Say today is October 1, the first day of the holiday. The historical curves of the past 14 days are bound to be much lower than today's curve. If a failure occurs today and the curve drops, it may still be well above those 14 days of history. How can such a failure be detected?
The intuition is that although the two curves are at different heights, they "look alike". How do you exploit that resemblance? Through the amplitude of change.
Instead of using x(t) directly, use x(t) - x(t-1); that is, turn the absolute value into a rate of change. You can use this rate directly, or use (x(t) - x(t-1)) / x(t-1), the rate relative to the absolute value.
For example, if 1,000 people were online at time t-1 and 900 at time t, then 10% of users dropped off. Is that drop large or small compared with the same time of day in history? From there, proceed as before.
Two tricks help in practice. First, the difference can be x(t) - x(t-1) or something wider such as x(t) - x(t-5); the larger the span, the more gradual the declines it can detect.
Second, compute both x(t) - x(t-2) and x(t-1) - x(t-3); only if both are abnormal is it treated as a real anomaly, which guards against a single corrupted data point.
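A possible sketch of this amplitude-based check, combining the relative change, the comparison with history, and the double-difference trick; the offsets, the tolerance multiplier, and the drop-only condition are all illustrative assumptions:

    # Sketch of amplitude-based detection: compare today's rate of change with the
    # rate of change at the same time of day in history. Assumes `history` is a
    # non-empty list of day-curves with the same per-minute layout as `today`.
    def amplitude_anomaly(today, history, t, span=1, tolerance=3.0):
        """Return True if the drop at minute t is abnormally large versus history."""
        if t - span - 2 < 0:
            return False

        def rel_change(curve, idx, off):
            prev = curve[idx - off]
            return (curve[idx] - prev) / prev if prev else 0.0

        # Two overlapping differences so one corrupted point cannot trigger alone.
        d_now = rel_change(today, t, span + 1)        # with span=1: x(t)   - x(t-2), relative
        d_prev = rel_change(today, t - 1, span + 1)   # with span=1: x(t-1) - x(t-3), relative

        # Distribution of the same relative change at the same minute in history.
        hist_changes = [rel_change(day, t, span + 1) for day in history]
        mean = sum(hist_changes) / len(hist_changes)
        std = (sum((c - mean) ** 2 for c in hist_changes) / len(hist_changes)) ** 0.5

        def abnormal(d):
            return std > 0 and d < mean - tolerance * std   # only drops alarm here
        return abnormal(d_now) and abnormal(d_prev)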
Advantages:
It is more sensitive than the absolute-value approach.
Using time periodicity avoids false alarms from the business curve's own regular steep drops.
Disadvantages:
It requires the original curve to be smooth.
The periodic steep drops must occur at exactly the same time of day, otherwise it false-alarms.
Expressed as a percentage, it false-alarms easily during off-peak periods.
A steep drop does not necessarily mean a failure; rises and falls caused by fluctuations in upper-level services happen all the time.
This detection approach has real strengths but also plenty of weaknesses, so a few patches are needed to make it workable.
To avoid percentage-based false alarms during off-peak hours, add a lower bound on the absolute amplitude.
From the business point of view, a fluctuation whose relative ratio is large but whose absolute impact is small is acceptable. For the rise-then-fall problem, detect the rise and suppress alarms for a period after it.
Abnormal judgment based on curve rebound
Looking at the second figure, we are more confident of a failure than with the first. Why?
Because the second figure shows a clear rebound. The algorithm can do the same thing as the human eye: wait a few more points, and if the curve has rebounded, you can judge with much more confidence that a failure really occurred.
However, detection based on recovery is of limited use as an "alarm" in the usual sense.
The purpose of an alarm is to pull a person in to help the curve recover. If the curve has already recovered, isn't alarming after the fact pointless?
The value of this detection lies in letting the machine confirm alarms after the fact.
When you want to measure false-positive and false-negative rates, running this algorithm as a second perspective exposes many problems in the primary algorithm.
At the same time, it lets you build a sample library of historical failures semi-automatically, and that library can become the training set for more sophisticated machine-learning algorithms.
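A rough sketch of such an after-the-fact confirmation, under the assumption that a historical baseline curve is available; the window and slack values are made up:

    # Rebound-based confirmation: a dip is confirmed as a real failure only if the
    # curve later climbs back toward its historical baseline.
    def confirm_dip_by_rebound(values, baseline, dip_start, max_wait=30, slack=0.9):
        """Return True if a dip beginning at index dip_start recovered within max_wait points.

        values: the observed curve; baseline: the expected (historical) curve.
        A point counts as 'recovered' when it is back above slack * baseline.
        """
        for i in range(dip_start + 1, min(len(values), dip_start + 1 + max_wait)):
            if values[i] >= slack * baseline[i]:
                return True    # curve rebounded into the historical range: real dip, now over
        return False           # never rebounded within the window: cannot confirm yet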
Core points:
High-quality alarms come from being actionable.
Do not let ease of collection decide which indicators you alarm on.
Do not alarm on what people literally ask for; alarm on what actually tells you the service is in trouble. Be especially wary of CPU-usage alarms.
Is the work getting done: request volume + success rate.
Is the user having a good experience: response delay.
As long as the right indicators are collected, most alarms need no complex algorithm.
Algorithm-based anomaly detection: the algorithms are not difficult, and where they are genuinely needed they are doable.
That is all on how to design an alarm system. I hope the content above is of some help to you. If you found the article useful, feel free to share it so more people can see it.