
How to Implement Operations Alerting for a Monitoring Platform Based on Prometheus and Grafana


This article introduces how to implement operations alerting for a monitoring platform based on Prometheus and Grafana. Many people run into difficulties with this in practice, so let's walk through how to handle these situations step by step. I hope you read it carefully and get something out of it!

Alerting methods

Grafana

Newer versions of Grafana provide built-in alerting that can be configured directly on dashboard panels. However, it is not very flexible: it does not support template variables, and many downloaded dashboards cannot use alerts at all, so we choose Alertmanager instead of Grafana alerting.

Alertmanager

Compared with Grafana's graphical interface, Alertmanager relies on configuration files. The configuration is a little tedious, but it is more powerful and flexible. Next, we implement alert notification step by step.

Alert types

The following two Alertmanager receiver types are mainly used:

Mail receiver (email_config)

Webhook receiver (webhook_config), which POSTs parameters to the configured URL in the following format:

{
  "version": "2",
  "status": "<resolved|firing>",
  "alerts": [
    {
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>"
    }
  ]
}
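For reference, a minimal sketch of how such a webhook receiver could be declared in alertmanager.yml (the receiver name and URL below are placeholders, not part of this setup):

receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://192.168.249.131:8060/alert'   # placeholder endpoint that accepts the POST body above
        send_resolved: true                        # also notify when the alert resolves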

"this time the alarm is mainly carried out by e-mail. "

Implementation steps

Download

Download the latest release of Alertmanager from GitHub, upload it to the server, and extract it:

tar -zxvf alertmanager-0.19.0.linux-amd64.tar.gz
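If the server has direct internet access, the archive can also be fetched in place; for example, assuming the standard GitHub release URL for the 0.19.0 version used here:

wget https://github.com/prometheus/alertmanager/releases/download/v0.19.0/alertmanager-0.19.0.linux-amd64.tar.gz
tar -zxvf alertmanager-0.19.0.linux-amd64.tar.gz
cd alertmanager-0.19.0.linux-amd64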

Configure Alertmanager

vi alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'mail.163.com:25'     # SMTP server and sending port
  smtp_from: 'xxx@163.com'
  smtp_auth_username: 'xxx@163.com'     # email account
  smtp_auth_password: 'xxxxxx'          # email password
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s         # how long to wait before sending the first notification for a new alert group
  group_interval: 10s     # how long to wait before sending notifications about new alerts added to an existing group
  repeat_interval: 1h     # how often to resend notifications for alerts that are still firing; for email, do not set this too low, or the SMTP server will reject mail for being sent too frequently
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'xxx@xxx.com'

After editing, verify that the file is correct with ./amtool check-config alertmanager.yml.

Once the verification passes, start Alertmanager: nohup ./alertmanager & (for the first start you can run it in the foreground without nohup, which makes it easier to check the logs).

We define only one route, which means that every alert generated by Prometheus is handled by the single receiver named email after it reaches Alertmanager. In practice, alerts of different severity levels are usually handled differently, so more sub-routes can be defined under route, as sketched below; you can look up the detailed routing rules yourself.
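For illustration only, a sketch of what such sub-routes could look like (the severity label and the second receiver named 'webhook' are assumptions, not defined in the configuration above):

route:
  group_by: ['alertname']
  receiver: 'email'              # default receiver
  routes:
    - match:
        severity: critical       # assumed label attached by the alert rules
      receiver: 'webhook'        # assumed additional receiver defined under receivers
    - match:
        team: node
      receiver: 'email'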

Configure Prometheus

Create a rules folder under the Prometheus installation directory to hold all alert rule files, then add the following to prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.249.131:9093']
rule_files:
  - rules/*.yml

Create the alert rule file service_down.yml in the rules folder; it sends an email when a server goes offline.

groups:
  - name: ServiceStatus
    rules:
      - alert: ServiceStatusAlert
        expr: up == 0
        for: 2m
        labels:
          team: node
        annotations:
          summary: "Instance {{ $labels.instance }} has been down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."
          value: "{{ $value }}"
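Before restarting Prometheus, the rule file can optionally be validated with promtool, which ships with Prometheus:

./promtool check rules rules/service_down.yml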

"configuration details"

alert: the name of the alert rule.

expr: the trigger condition, a PromQL expression used to decide whether the time series currently satisfies the condition.

for: evaluation wait time (optional); the alert is sent only after the trigger condition has held for this duration. While waiting, the alert is in the PENDING state; once the period has elapsed it becomes FIRING.

labels: custom labels, allowing the user to attach an additional set of labels to the alert.

annotations: an additional set of information, such as text describing the alert in detail; when the alert fires, the annotation contents are sent to Alertmanager as parameters.
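Following the same pattern, here is a sketch of a second rule that alerts on high memory usage (the rule name and the 80% threshold are illustrative assumptions; the expression is one of the node_exporter queries listed at the end of this article):

groups:
  - name: MemoryStatus
    rules:
      - alert: HighMemoryUsage                  # illustrative rule name
        expr: ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) * 100 > 80   # assumed 80% threshold
        for: 5m
        labels:
          team: node
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is above 80%"
          description: "{{ $labels.instance }} memory usage is {{ $value }}%."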

Restart Prometheus after the configuration is complete, then visit the Prometheus web UI to view the alert configuration.

Test

Stop node_exporter; the alert email arrives after about 2 minutes. Alertmanager's alert content also supports template configuration, so you can render it with a nicer-looking template if you are interested.
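As a hedged sketch of what a template-based email could look like (the template path, template name, and its contents are assumptions):

# alertmanager.yml
templates:
  - 'templates/*.tmpl'           # assumed location of custom template files
receivers:
  - name: 'email'
    email_configs:
      - to: 'xxx@xxx.com'
        html: '{{ template "email.custom.html" . }}'   # assumed template name defined in templates/*.tmpl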

Some useful node_exporter query expressions

CPU usage (in percent)

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory used (in bytes)

node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes - node_memory_Slab_bytes

Memory usage (in percent)

((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes - node_memory_Slab_bytes) / node_memory_MemTotal_bytes) * 100

Memory usage of server1 (in percent)

((node_memory_MemTotal_bytes{instance="server1"} - node_memory_MemAvailable_bytes{instance="server1"}) / node_memory_MemTotal_bytes{instance="server1"}) * 100

Disk usage of server2 (in percent)

((node_filesystem_size_bytes{fstype=~"xfs|ext4", instance="server2"} - node_filesystem_free_bytes{fstype=~"xfs|ext4", instance="server2"}) / node_filesystem_size_bytes{fstype=~"xfs|ext4", instance="server2"}) * 100

Uptime (in seconds)

time() - node_boot_time_seconds

Uptime of server1 (in seconds)

time() - node_boot_time_seconds{instance="server1"}

Network outbound traffic (in bytes/sec)

irate(node_network_transmit_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*"}[5m]) > 0

Network outbound traffic of server1 (in bytes/sec)

irate(node_network_transmit_bytes_total{instance="server1", device!~"lo|bond[0-9]|cbr[0-9]|veth.*"}[5m]) > 0

Network inbound traffic (in bytes/sec)

irate(node_network_receive_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*"}[5m]) > 0

Network inbound traffic of server1 (in bytes/sec)

irate(node_network_receive_bytes_total{instance="server1", device!~"lo|bond[0-9]|cbr[0-9]|veth.*"}[5m]) > 0

Disk read speed (in bytes/sec)

irate(node_disk_read_bytes_total{device=~"sd.*"}[5m])

This is the end of "How to Implement Operations Alerting for a Monitoring Platform Based on Prometheus and Grafana". Thank you for reading. If you want to learn more, follow the site for more practical articles.
