Example Analysis of Prometheus Alarm Problems

2025-07-12 Update From: SLTechnology News&Howtos, Shulou (Shulou.com), 06/01 Report

This article presents a short example analysis of Prometheus alerting problems and the configuration settings behind them.

Analysis of problems

While operating and maintaining Prometheus recently, I found that alerts sometimes did not fire when they should have, sometimes fired when they should not have, and sometimes arrived noticeably late. To find the specific causes, I consulted several sources, including the official documentation. I hope the findings help you when you use Prometheus.

Let's first look at some important default settings of Prometheus and Alertmanager, as given in the official documentation:

# prometheus
global:
  # How frequently to scrape targets by default.
  # (Interval between fetching monitoring data from a target.)
  [ scrape_interval: <duration> | default = 1m ]

  # How long until a scrape request times out.
  # (Timeout for fetching data from a target.)
  [ scrape_timeout: <duration> | default = 10s ]

  # How frequently to evaluate rules.
  # (Interval between alert rule evaluations.)
  [ evaluation_interval: <duration> | default = 1m ]

# alertmanager
# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
# (Waiting time before the first notification for a group is sent.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
# (Interval before notifying about other new alerts in the same group.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more.)
# (Interval between repeated notifications for the same alert.)
[ repeat_interval: <duration> | default = 4h ]
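To reason about these settings numerically, it helps to convert the duration strings to seconds. The following is a minimal Python sketch, not Prometheus code: `parse_duration` and `DEFAULT_SECONDS` are hypothetical names introduced here, and the parser only approximates the duration syntax (for example, it does not enforce the ordering of units).

```python
import re

# Milliseconds per unit; a year is treated as 365 days and a week as 7 days.
_UNIT_MS = {
    "ms": 1,
    "s": 1000,
    "m": 60 * 1000,
    "h": 3600 * 1000,
    "d": 86400 * 1000,
    "w": 7 * 86400 * 1000,
    "y": 365 * 86400 * 1000,
}
# "ms" is listed first so that it is tried before "m" and "s".
_TOKEN = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def parse_duration(text):
    """Convert a duration string such as '30s', '5m' or '1h30m' to seconds."""
    if not text:
        raise ValueError("empty duration")
    pos = 0
    total_ms = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total_ms += int(match.group(1)) * _UNIT_MS[match.group(2)]
        pos = match.end()
    return total_ms / 1000


# Default values copied from the listing above.
DEFAULTS = {
    "scrape_interval": "1m",
    "scrape_timeout": "10s",
    "evaluation_interval": "1m",
    "group_wait": "30s",
    "group_interval": "5m",
    "repeat_interval": "4h",
}
DEFAULT_SECONDS = {name: parse_duration(value) for name, value in DEFAULTS.items()}

for name, seconds in DEFAULT_SECONDS.items():
    print(f"{name}: {seconds:g}s")
```

Working in seconds makes it easier to add the individual waiting times together when estimating how long an alert takes to reach you.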

With this configuration in mind, let's walk through the whole alerting process and locate the problems along the way.

According to the figure and configuration above, after Prometheus scrapes data, it evaluates the alert rules against it. If a rule's expression is true, the alert enters the pending state; once the condition has held for the duration configured by for, the alert enters the firing state. Prometheus then pushes the alert to Alertmanager, which sends the notification after group_wait.
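The state transitions described above can be sketched as a small state machine. This is a simplified Python model written for illustration, not the Prometheus implementation; `AlertState` and `evaluate` are hypothetical names, and the model assumes one evaluation per `evaluation_interval` with the `for` duration given in seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    state: str = "inactive"
    active_since: Optional[float] = None  # time of the first "true" evaluation


def evaluate(alert, condition_true, now, for_seconds):
    """Apply one rule evaluation at time `now` and return the new state."""
    if not condition_true:
        # The expression is false: the alert is reset.
        alert.state = "inactive"
        alert.active_since = None
        return alert.state
    if alert.active_since is None:
        alert.active_since = now
    # The alert fires once the condition has held for at least `for_seconds`.
    if now - alert.active_since >= for_seconds:
        alert.state = "firing"
    else:
        alert.state = "pending"
    return alert.state


# Evaluate every 60 s with a "for" duration of 120 s.
alert = AlertState()
results = [False, True, True, True, False]
states = [
    evaluate(alert, result, now=index * 60, for_seconds=120)
    for index, result in enumerate(results)
]
print(states)
# → ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

Note that a single false evaluation resets the alert, so the condition has to hold at every evaluation point within the `for` window before the alert fires.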

Delayed or overly frequent alerts

Following the alerting process above, once an alert reaches Alertmanager, a larger group_wait means a longer wait before the first notification for a new group is sent, which delays the alert. Conversely, if group_wait is too small, alerts that belong together are less likely to be batched into one notification, so notifications arrive more often. The value therefore needs to be tuned to the specific scenario.
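To see how the individual settings add up, the following Python sketch gives a rough worst-case estimate of the time between a problem starting and the first notification. It is an approximation under stated assumptions, not an official formula: `worst_case_delay` is a hypothetical helper, all values are in seconds, the alert is assumed to start a new group, and scrape duration, network latency and notification delivery time are ignored.

```python
import math


def worst_case_delay(scrape_interval, evaluation_interval, for_duration, group_wait):
    """Rough upper bound, in seconds, from problem start to first notification."""
    # The alert can only start firing at an evaluation point, so the "for"
    # duration is rounded up to a whole number of evaluation intervals.
    hold = math.ceil(for_duration / evaluation_interval) * evaluation_interval
    return (
        scrape_interval          # wait for the next scrape to record the problem
        + evaluation_interval    # wait for the next rule evaluation (pending)
        + hold                   # condition must keep holding (firing)
        + group_wait             # Alertmanager waits before the first notification
    )


# Defaults from the listing above, with an assumed rule using "for: 5m".
print(worst_case_delay(60, 60, 300, 30))  # → 450
# Same rule with group_wait raised to 5 minutes.
print(worst_case_delay(60, 60, 300, 300))  # → 720
```

With the default intervals, most of the delay comes from the `for` duration and the scrape and evaluation intervals; group_wait is only one of several contributions.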

Alerts that fire when they should not

Prometheus pulls data from a target once every scrape_interval and evaluates the alert rules once every evaluation_interval, so a rule only sees the data at discrete points in time. Between two of those points the target's value may have returned to normal, but the evaluation skips over that recovery: each evaluation during the for window still sees the expression as true, the duration is reached, the alert fires, and a notification is sent. In Grafana, however, the data looks normal and it seems that no alert should have been sent. This is because Grafana, using Prometheus as its data source, runs range queries, which return many more data points than the sparse points seen by alert evaluation.
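The effect of sparse evaluation can be reproduced with a small simulation. This is an illustrative Python sketch with made-up numbers, not real monitoring data: it assumes a 15 s scrape interval, a 60 s evaluation interval, a rule of the form "value > 80 for 2m", and a metric that happens to spike exactly at the evaluation points. `latest_sample` and `simulate` are hypothetical helpers.

```python
def latest_sample(samples, t):
    """Return the value of the most recent sample taken at or before time t."""
    candidates = [(ts, value) for ts, value in samples if ts <= t]
    return max(candidates, key=lambda pair: pair[0])[1]


def simulate(samples, threshold, evaluation_interval, for_seconds, end):
    """Evaluate 'value > threshold' at each evaluation point up to `end`."""
    active_since = None
    history = []
    for t in range(0, end + 1, evaluation_interval):
        value = latest_sample(samples, t)
        if value > threshold:
            if active_since is None:
                active_since = t
            state = "firing" if t - active_since >= for_seconds else "pending"
        else:
            active_since = None
            state = "inactive"
        history.append((t, value, state))
    return history


# One sample every 15 s: 90 at t = 0, 60, 120 and 50 everywhere else.
samples = [(t, 90 if t % 60 == 0 else 50) for t in range(0, 121, 15)]

history = simulate(samples, threshold=80, evaluation_interval=60,
                   for_seconds=120, end=120)
above = sum(1 for _, value in samples if value > 80)

for t, value, state in history:
    print(f"t={t:>3}s value={value} state={state}")
print(f"{above} of {len(samples)} samples are above the threshold")
```

In this constructed case the rule sees only the three high samples and fires at t=120s, while a graph of all nine samples shows the value below the threshold most of the time.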
