2025-04-02 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This is based on personal experience; I hope it helps.
Alarms + message channel + self-healing: optimizing monitoring and alarming
1. Alarm categories. Alarms can be divided into grey alarms, blue alarms (important), and red alarms (high risk), for example when using Zabbix.
2. Each alarm category has its own alarm group. An SRE must reply within 15 minutes for a yellow alarm and within 10 minutes for a red alarm. The backend timestamps each alarm and each reply automatically and computes the time between them. If nobody has responded when the time limit expires, the relevant person is notified by an automatic phone call; if there is still no reply within 5 minutes of that call, the backend records a timeout. Daily and weekly statistics, by product line or by person, report the total number of alarms and the number of timeouts, and feed into performance assessment. The API of the messaging tool (for example DingTalk) can be used for secondary development to record whether the user has replied and to trigger the automatic phone notification.
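The response-time tracking described above can be sketched as follows. This is only an illustration: the `Alarm` class, the `response_status` function, and the status names are hypothetical, and the deadlines are the 15, 10, and 5 minute values quoted in the text. A real setup would read timestamps from the alarm database and call the messaging tool's API for the phone notification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Reply deadlines quoted in the text; the level names are labels only.
REPLY_DEADLINE = {
    "yellow": timedelta(minutes=15),
    "red": timedelta(minutes=10),
}
# Grace period after the automatic phone call, also from the text.
CALL_GRACE = timedelta(minutes=5)


@dataclass
class Alarm:
    level: str
    raised_at: datetime
    replied_at: Optional[datetime] = None


def response_status(alarm: Alarm, now: datetime) -> str:
    """Classify an alarm as 'replied', 'waiting', 'call' or 'timeout'."""
    deadline = alarm.raised_at + REPLY_DEADLINE[alarm.level]
    cutoff = deadline + CALL_GRACE
    if alarm.replied_at is not None and alarm.replied_at <= cutoff:
        return "replied"
    if now <= deadline:
        return "waiting"
    if now <= cutoff:
        return "call"     # deadline passed: trigger the automatic phone call
    return "timeout"      # no reply within 5 minutes of the call


def response_delay(alarm: Alarm) -> Optional[timedelta]:
    """Time between the alarm and the reply, or None if nobody replied."""
    if alarm.replied_at is None:
        return None
    return alarm.replied_at - alarm.raised_at
```

Running `response_status` periodically over open alarms gives the escalation steps, and `response_delay` gives the per-alarm figure that the daily and weekly statistics aggregate.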
3. For some types of anomaly, the alarm is followed by automatic self-healing. Alternatively, routine inspection checks are followed by automatic self-healing.
All alarm messages and reply records are stored in the database automatically. Replies are classified, for example as update, CPU, memory, fault, system bug, and so on. When an alarm fires and is sent to the relevant groups through the message center, the message is followed by the number of times that alarm occurred in the last 1 day, 3 days, and 7 days. It also shows how people replied to the same alarm before, for example "update 60%, memory 30%", together with the most frequently recommended action.
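A minimal sketch of that enrichment step, assuming the alarm timestamps and the reply categories have already been loaded from the database. The function names and the message layout are hypothetical; only the 1/3/7 day windows and the percentage breakdown come from the text.

```python
from collections import Counter
from datetime import datetime, timedelta


def occurrence_counts(timestamps, now, windows=(1, 3, 7)):
    """Count how often an alarm fired in the last N days, for each window."""
    return {
        days: sum(1 for ts in timestamps if now - timedelta(days=days) <= ts <= now)
        for days in windows
    }


def reply_breakdown(categories):
    """Percentage share of each reply category, most common first."""
    total = len(categories)
    if total == 0:
        return []
    return [(cat, round(100 * n / total)) for cat, n in Counter(categories).most_common()]


def enrich(message, timestamps, categories, now):
    """Append occurrence counts and the reply history to an alarm message."""
    counts = occurrence_counts(timestamps, now)
    count_text = ", ".join(f"{days}d: {n}" for days, n in counts.items())
    history = ", ".join(f"{cat} {pct}%" for cat, pct in reply_breakdown(categories))
    return f"{message} | occurrences {count_text} | previous replies: {history}"
```

The first entry of `reply_breakdown` is the most common past classification, which is one simple way to pick the action to recommend.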
4. Response statistics by alarm category. After classification, check the daily and weekly rankings. If update-related alarms dominate, then during each update the alarms related to that update are suppressed through the message channel interface (5 minutes by default, or a custom duration such as 10 minutes). The alarms are then not sent to the relevant alarm groups during the update, but monitoring tools such as Zabbix continue to display them. This reduces the alarms caused by updates.
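The suppression window can be sketched like this. The `SuppressionWindows` class and the idea of keying windows by a tag such as a service name are assumptions for illustration; the text only says that suppression happens in the message channel, so the monitoring tool itself is untouched.

```python
from datetime import datetime, timedelta


class SuppressionWindows:
    """Tracks, per alarm tag, until when notifications are muted."""

    def __init__(self):
        self._muted_until = {}

    def mute(self, tag, start, minutes=5):
        """Open a window for `tag`; 5 minutes by default, as in the text."""
        self._muted_until[tag] = start + timedelta(minutes=minutes)

    def should_notify(self, tag, at):
        """True if an alarm with this tag may be forwarded to the alarm group."""
        until = self._muted_until.get(tag)
        return until is None or at >= until
```

A deployment script would call `mute` when the update starts, and the message channel would check `should_notify` before forwarding each alarm. Nothing is dropped at the monitoring layer, so the alarm still appears there.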
5. Some problems are concealed by automatic self-healing.
"If the CPU_IDLE alarm fires frequently on a machine, we can adjust the current monitoring policy. For example, where the alarm used to fire after 5 occurrences within 5 minutes, it can now be changed to 20 occurrences within 10 minutes; or the alarm policy can simply be deleted; or the alarm can be switched from SMS to email; or various similar measures. But why this machine raises a CPU_IDLE alarm in the first place, nobody pays attention to, let alone solves."
Keep daily and weekly statistics of the name and number of self-healing actions, broken down by person and by the department's business lines. If a business line triggers a lot of self-healing, the program needs to be optimized, or some other underlying problem fixed, to bring the number down. If the number of similar alarms on a particular machine rises suddenly, that may be a sign of trouble. Where there are many alarm types, an alarm dashboard that plots a curve for each type also helps to spot problems.
Later, once there are fewer alarms, go back and follow up on why the self-healing was triggered. For example, if a disk alarm keeps firing and is handled automatically each time, is it because a surge of debug logs or exception logs at certain times is causing the frequent cleanup?
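As a rough sketch of these statistics: the ranking below counts self-healing actions per business line, and the spike check flags a day whose count is far above the recent average. The function names, the factor of 3, and the minimum count of 5 are assumptions chosen for illustration, not values from the text.

```python
from collections import Counter
from statistics import mean


def healing_ranking(events):
    """events: iterable of (business_line, action_name); most frequent first."""
    return Counter(events).most_common()


def is_spike(daily_counts, factor=3.0, min_count=5):
    """Flag the latest day if it exceeds `factor` times the mean of earlier days."""
    if len(daily_counts) < 2:
        return False  # not enough history to compare against
    *history, today = daily_counts
    if today < min_count:
        return False  # ignore noise on very small numbers
    baseline = max(mean(history), 1.0)
    return today > factor * baseline
```

The same check can run per machine and per alarm type; anything it flags is a candidate for the manual follow-up described above.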
Reference:
https://www.infoq.cn/article/1AofGj2SvqrjW3BKwXlN?utm_source=infoq&utm_medium=article&utm_campaign=newinfoq&utm_content=language2019&utm_term=701
Get rid of invalid alarms? A summary of ten years' experience in operations monitoring and alarm optimization