This article explains how to implement operations alerting for a monitoring platform based on Prometheus and Grafana. Many people run into trouble with this in practice, so let's work through a real example step by step; I hope you read it carefully and get something out of it!
Alerting options
Grafana
Newer versions of Grafana already provide alert configuration that can be set directly on a dashboard panel. However, it is not very flexible: it does not support template variables, and many downloaded dashboards cannot use alerts at all, so we use Alertmanager instead of Grafana alerts.
Alertmanager
Compared with Grafana's graphical interface, Alertmanager relies on configuration files. Configuring it is a little more tedious, but it is more powerful and flexible. Below we implement alert notification step by step.
Alert types
The following two Alertmanager receiver types are mainly used:

Mail receiver (email_config)
Webhook receiver (webhook_config), which sends a payload in the following format to the configured url via HTTP POST:

{
  "version": "<string>",
  "status": "<resolved|firing>",
  "alerts": [
    {
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>"
    }
  ]
}
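For reference, a webhook receiver would be declared in alertmanager.yml roughly as follows; the receiver name and url here are hypothetical placeholders, not part of this article's setup:

receivers:
- name: 'ops-webhook'                           # hypothetical receiver name
  webhook_configs:
  - url: 'http://192.168.249.131:8080/alert'    # hypothetical service that accepts the POST payload above
    send_resolved: true                         # also notify when the alert recovers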
"this time the alarm is mainly carried out by e-mail. "
Implementation steps
Download
Download the latest Alertmanager release from GitHub, upload it to the server, and extract it:

tar -zxvf alertmanager-0.19.0.linux-amd64.tar.gz
Configure Alertmanager
vi alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'mail.163.com:25'   # SMTP server and port used to send mail
  smtp_from: 'xxx@163.com'
  smtp_auth_username: 'xxx@163.com'   # mailbox account
  smtp_auth_password: 'xxxxxx'        # mailbox password
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s       # how long to wait before sending the first notification for a new group of alerts
  group_interval: 10s   # how long to wait before sending a notification about new alerts added to an existing group
  repeat_interval: 1h   # how often to resend an alert that is still firing; for email do not set this too low, otherwise the SMTP server will reject mail that is sent too frequently
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: 'xxx@xxx.com'
After editing, run ./amtool check-config alertmanager.yml to verify that the file is correct.
Once the check passes, start Alertmanager with nohup ./alertmanager & (for the first start you can run it in the foreground without nohup, which makes it easier to check the log).
We define only one route here, which means every alert that Prometheus sends to Alertmanager is delivered through the receiver named email. In practice, alerts of different severities are usually handled differently, so you can define additional sub-routes under route; look up the routing configuration rules for the details.
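For example, a minimal sketch of a route with a sub-route that sends critical alerts to a separate receiver; the severity label and the critical-email receiver are assumptions for illustration and would have to be defined in your own rules and receivers:

route:
  receiver: 'email'              # default receiver for everything no sub-route matches
  group_by: ['alertname']
  routes:
  - match:
      severity: 'critical'       # assumed label set by the alert rules
    receiver: 'critical-email'   # assumed extra receiver defined under receivers
    repeat_interval: 30m         # resend critical alerts more often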
Configure Prometheus
Create a rules folder under the Prometheus installation directory to hold all the alert rule files, then add the following to the Prometheus configuration:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['192.168.249.131:9093']
rule_files:
- rules/*.yml
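For context, a sketch of how these sections might sit in prometheus.yml, assuming node_exporter is already configured as a scrape target; the job name and target addresses are illustrative:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['192.168.249.131:9093']   # Alertmanager address
rule_files:
- rules/*.yml                             # the alert rule files created below
scrape_configs:
- job_name: 'node'                        # illustrative job name
  static_configs:
  - targets: ['192.168.249.131:9100']     # node_exporter address, illustrative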
Create the alert rule file service_down.yml under the rules folder; it sends an email when a server goes offline.
groups:
- name: ServiceStatus
  rules:
  - alert: ServiceStatusAlert
    expr: up == 0
    for: 2m
    labels:
      team: node
    annotations:
      summary: "Instance {{ $labels.instance }} has been down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."
      value: "{{ $value }}"
"configuration details"
alert: the name of the alert rule.
expr: the alert trigger condition, a PromQL expression used to determine whether a time series meets the condition.
for: the evaluation wait time; optional. The alert fires only after the trigger condition has held for this duration. While waiting, the alert is in the PENDING state; afterwards it switches to FIRING.
labels: custom labels that let you attach an extra set of labels to the alert.
annotations: an extra set of information, such as text describing the alert in detail; when the alert fires, the annotations are sent to Alertmanager along with it. (See the sketch after this list for how these fields fit together.)
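Putting these fields together, a sketch of another rule, a hypothetical high-memory alert; the 90% threshold and the severity label are illustrative and not part of this article's setup:

groups:
- name: NodeMemory
  rules:
  - alert: HighMemoryUsage
    # fires when memory usage stays above 90% for 5 minutes (threshold is illustrative)
    expr: ((node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes) * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} memory usage is above 90%"
      value: "{{ $value }}"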
Restart Prometheus after the configuration is complete, then open the Prometheus web UI to check the alert configuration.
Test
Stop node_exporter, and the alert email arrives about 2 minutes later (screenshot omitted). Alertmanager's alert content also supports templates, so you can render the notification with a nicer template if you are interested.
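As a pointer, a custom email template would be wired into alertmanager.yml roughly like this; the template path and the template name email.custom.html are hypothetical and would be defined in your own .tmpl file:

templates:
- '/usr/local/alertmanager/templates/*.tmpl'      # hypothetical path to Go template files
receivers:
- name: 'email'
  email_configs:
  - to: 'xxx@xxx.com'
    html: '{{ template "email.custom.html" . }}'  # hypothetical template name defined in the .tmpl file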
Some node_exporter query expressions
CPU usage (percent)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory used (bytes)
node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes - node_memory_Slab_bytes

Memory usage (percent)
((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes - node_memory_Slab_bytes) / node_memory_MemTotal_bytes) * 100

Memory usage of server1 (percent)
((node_memory_MemTotal_bytes{instance="server1"} - node_memory_MemAvailable_bytes{instance="server1"}) / node_memory_MemTotal_bytes{instance="server1"}) * 100

Disk usage of server2 (percent)
((node_filesystem_size_bytes{fstype=~"xfs|ext4", instance="server2"} - node_filesystem_free_bytes{fstype=~"xfs|ext4", instance="server2"}) / node_filesystem_size_bytes{fstype=~"xfs|ext4", instance="server2"}) * 100

Uptime (seconds)
time() - node_boot_time_seconds

Uptime of server1 (seconds)
time() - node_boot_time_seconds{instance="server1"}

Network outbound traffic (bytes/sec)
irate(node_network_transmit_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*"}[5m]) > 0

Network outbound traffic of server1 (bytes/sec)
irate(node_network_transmit_bytes_total{instance="server1", device!~"lo|bond[0-9]|cbr[0-9]|veth.*"}[5m]) > 0

Network inbound traffic (bytes/sec)
irate(node_network_receive_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*"}[5m]) > 0

Network inbound traffic of server1 (bytes/sec)
irate(node_network_receive_bytes_total{instance="server1", device!~"lo|bond[0-9]|cbr[0-9]|veth.*"}[5m]) > 0

Disk read speed (bytes/sec)
irate(node_disk_read_bytes_total{device=~"sd.*"}[5m])
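Any of these expressions can be dropped into a rule file under rules/ in the same way as service_down.yml. For example, a sketch of a CPU alert; the 80% threshold is illustrative:

groups:
- name: NodeCPU
  rules:
  - alert: HighCpuUsage
    # fires when CPU usage stays above 80% for 5 minutes (threshold is illustrative)
    expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      team: node
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage is above 80%"
      value: "{{ $value }}"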
This concludes "How to implement operations alerting for a monitoring platform based on Prometheus and Grafana". Thank you for reading!