
An example of an elegant alarm processing system

2025-04-08 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

Operations engineers all know that operations work is inseparable from monitoring software such as Zabbix and Nagios. Such tools are now very mature at monitoring and data collection, but none of them offers a complete solution for alarm handling; for example, high-value alarms are often buried in a flood of other alarms.

This article does not discuss tuning the configuration of the monitoring system. It only discusses what we should do after the monitoring system has raised an alarm according to its own logic.

Pain points in alarm handling

Alarm storms: high-value alarms get buried in a flood of alarms.

After an alarm fires, nobody claims it, so people have to coordinate in the work IM group.

Planned operations work inevitably triggers some alarms, which confuses colleagues who do not know the work is under way.

After a large batch of alarms has recovered, it is hard for operators to tell at a glance which alarms are still unrecovered.

When MySQL raises a slow-query alarm, the DBA has to log in to the database to investigate.

Some alarms are low priority and could wait until daytime, yet they are sent immediately at night.

The same alarm is reported again and again.

Background

Yunji Xingchuang is a comprehensive cloud service provider responsible for monitoring both public and private clouds. Our R&D team has built a fairly complete OpenStack monitoring system that uses a variety of monitoring tools. Because Yunji Xingchuang's operations team and customers are spread across the country, the monitoring systems are physically scattered as well.

In the public cloud scenario, alarms must be sent to different operations engineers, business operations staff and managers according to physical location or application type. In the private cloud scenario, alarms must also be pushed to the relevant customer. At present we use WeChat as the main alarm channel, supplemented by SMS.

Advantages and disadvantages of using WeChat

Advantages of using WeChat:

Essentially free of charge.

Messages can combine pictures and text, and the size limit is fairly generous.

Interaction between the WeChat client and the server side is convenient.

Disadvantages of using WeChat:

Availability depends on Tencent's servers:

For this reason, we added dedicated monitoring of the WeChat server interface; when the interface fails, an alarm is sent by SMS instead.

The client must stay online, and there are no delivery receipts:

Therefore, the system provides a summary table function (described later).
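
The SMS fallback described above for WeChat interface failures can be sketched roughly as follows. This is only an illustration, not the project's code: `send_with_fallback` and the `Sender` type are hypothetical names, and the two senders stand in for whatever WeChat and SMS clients are actually used.

```python
from typing import Callable

# Hypothetical sender type: takes the message text and raises on failure.
Sender = Callable[[str], None]


def send_with_fallback(message: str, primary: Sender, fallback: Sender) -> str:
    """Try the primary channel (e.g. WeChat); on any error use the fallback (e.g. SMS).

    Returns the name of the channel that carried the message.
    """
    try:
        primary(message)
        return "primary"
    except Exception as exc:
        # Prefix the reason so the reader knows the primary channel is down.
        fallback(f"[primary channel failed: {exc}] {message}")
        return "fallback"
```

Keeping the senders as plain callables means the same function works whether the primary channel is WeChat, email or anything else.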

Three elements of an excellent alarm processing system

Send the alarm to the right person at the right time.

Provide as much information as possible, so that the responder can tell what went wrong without turning on a computer.

Reduce the cost of the communication between people that surrounds each alarm.

Implementation plan

Architecture Overview

Alarm classification

General alarm: sent to the on-duty operations engineer according to the duty schedule; low-level alarms are delayed and sent to the developers of the corresponding application.

ELK log alarm: users can view it on WeChat.
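
As a rough sketch of the routing rule for general alarms, the function below decides who receives an alarm and when. The severity scale, the 09:00 to 18:00 daytime window and all names (`Alarm`, `route`, `LOW_SEVERITY_MAX`) are assumptions made for illustration; the article does not specify them.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Tuple


@dataclass
class Alarm:
    host: str
    message: str
    severity: int  # assumption: 1 = lowest, 5 = disaster


# Assumed values, not taken from the article.
DAY_START = time(9, 0)
DAY_END = time(18, 0)
LOW_SEVERITY_MAX = 2


def route(alarm: Alarm, now: datetime, on_duty: str, app_owner: str) -> Tuple[str, datetime]:
    """Return (recipient, send_at) for an alarm raised at `now`."""
    if alarm.severity > LOW_SEVERITY_MAX:
        # Normal and high-level alarms go to the on-duty engineer at once.
        return on_duty, now
    if DAY_START <= now.time() < DAY_END:
        # Low-level alarm during the day: deliver to the developer right away.
        return app_owner, now
    # Low-level alarm at night: hold it until the next daytime window opens.
    day = now.date() if now.time() < DAY_START else now.date() + timedelta(days=1)
    return app_owner, datetime.combine(day, DAY_START)
```

For example, a severity-1 alarm raised at 23:00 would be held for the application owner until 09:00 the next morning, while a severity-4 alarm at the same time goes straight to the on-duty engineer.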

Receive alarm: confirmation, feedback and summary

Alarm confirmation: when a user taps the confirm button, the relevant people receive a confirmation message.

Alarm processing result feedback

Summary table: provides a batch confirmation function.
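
The confirmation, feedback and summary features above could be modelled along these lines. This is a minimal in-memory sketch under assumed names (`AlarmBoard`, `AlarmRecord`); the real system sits behind WeChat and presumably persists its state.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AlarmRecord:
    alarm_id: str
    message: str
    recovered: bool = False
    confirmed_by: Optional[str] = None
    feedback: Optional[str] = None


class AlarmBoard:
    """Tracks which alarms are still open and who has claimed them."""

    def __init__(self) -> None:
        self._records: Dict[str, AlarmRecord] = {}

    def add(self, alarm_id: str, message: str) -> None:
        self._records[alarm_id] = AlarmRecord(alarm_id, message)

    def recover(self, alarm_id: str) -> None:
        self._records[alarm_id].recovered = True

    def confirm(self, alarm_id: str, user: str, feedback: Optional[str] = None) -> str:
        """Mark an alarm as claimed and return the notice to broadcast."""
        record = self._records[alarm_id]
        record.confirmed_by = user
        record.feedback = feedback
        return f"{user} confirmed {alarm_id}: {record.message}"

    def summary(self) -> List[AlarmRecord]:
        """Summary table: every alarm that has not recovered yet."""
        return [r for r in self._records.values() if not r.recovered]

    def confirm_all(self, user: str) -> List[str]:
        """Batch confirmation: claim every open alarm nobody has confirmed."""
        return [
            self.confirm(r.alarm_id, user)
            for r in self.summary()
            if r.confirmed_by is None
        ]
```

Because `summary()` lists only unrecovered alarms, it also answers the question of which alarms are still open after a storm has mostly cleared.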

Alarm convergence

Compound alarm convergence based on keywords, hostname and tag.
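
One simple way to converge alarms on keyword, hostname and tag is to group them by a composite key and send one message per group. The sketch below assumes alarms arrive as dictionaries with `message`, `hostname` and `tag` fields and uses a made-up keyword list; the project's actual rules may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical keyword list; a real deployment would configure its own.
KEYWORDS = ("disk", "cpu", "mysql")

Key = Tuple[str, str, str]


def convergence_key(alarm: dict) -> Key:
    """Build the (keyword, hostname, tag) key used to merge similar alarms."""
    text = alarm["message"].lower()
    keyword = next((k for k in KEYWORDS if k in text), "other")
    return keyword, alarm["hostname"], alarm.get("tag", "")


def converge(alarms: Iterable[dict]) -> Dict[Key, List[dict]]:
    """Group alarms that share the same convergence key."""
    groups: Dict[Key, List[dict]] = defaultdict(list)
    for alarm in alarms:
        groups[convergence_key(alarm)].append(alarm)
    return dict(groups)


def render(groups: Dict[Key, List[dict]]) -> List[str]:
    """One summary line per group instead of one message per alarm."""
    return [
        f"{keyword}/{host}/{tag}: {len(items)} alarm(s)"
        for (keyword, host, tag), items in groups.items()
    ]
```

With this grouping, two disk alarms from the same host and tag produce a single line reporting two alarms rather than two separate pushes.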

Alarm escalation

If an alarm is neither confirmed nor automatically resolved within a set period, it is escalated.
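
The timeout rule above might be implemented as a periodic check like the one below. The 15-minute interval and the three-step chain are assumptions chosen for the example, not values from the article.

```python
from datetime import datetime, timedelta

# Assumed escalation policy for illustration only.
ESCALATION_CHAIN = ["on-duty engineer", "team lead", "manager"]
ESCALATE_AFTER = timedelta(minutes=15)


def escalation_target(raised_at: datetime, now: datetime,
                      confirmed: bool, recovered: bool) -> str:
    """Return who should currently be notified about an alarm.

    An alarm that has been confirmed or has recovered never escalates.
    Otherwise it moves one step up the chain per elapsed interval,
    stopping at the last entry.
    """
    if confirmed or recovered:
        return ESCALATION_CHAIN[0]
    steps = max(0, (now - raised_at) // ESCALATE_AFTER)
    return ESCALATION_CHAIN[min(steps, len(ESCALATION_CHAIN) - 1)]
```

Computing the target from timestamps, rather than storing a counter, keeps the check stateless, so it can be re-run safely on every polling cycle.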

WeChat and SMS: two platforms

All WeChat interfaces are encrypted to prevent unauthorized users from accessing them or following the official account. The SMS platform is used mainly for disaster-level alarms, WeChat API interface alarms, and system availability alarms.

Results of using the system

Yunji Xingchuang's previous alarm scheme was email plus SMS. Once an alarm fired, the operations chat group filled with discussion about it, and alarm storms often occurred and clogged the SMS sending platform. Since the system went into use, essentially all of that communication happens inside the system. Thanks to the rich additional information attached to each alarm, second-line operations engineers need to boot a computer and log in to systems less often when handling faults.

Research and development process

Development of this system took about half a year, and it essentially grew up alongside Yunji Xingchuang. The initial motivation was that, as national policy cracked down on telecom fraud, SMS sending platforms became harder and harder to use. That led to the idea of replacing SMS with the widely used WeChat.

The first version simply pushed Zabbix alarm messages unchanged. As the public cloud grew, the number of alarms kept rising; the number of private cloud customers also grew, the people who needed to receive alarms became more scattered, and the communication cost around each alarm climbed.

The system's features were therefore developed around the pain points our operations engineers ran into when handling alarms. After half a year of development, we have turned operations alarms into alarms that can be actively managed.

Future development

The alarm system is associated with the work order system and CMDB

Rapid realization of fault root cause location

Alarm ranking analysis report

(Note: the screenshots in this article come from operations testing in the pre-release environment.)

Finally, the code is hosted on GitHub:

https://github.com/superbigsea/zabbix-wechat
