2025-01-27 Update From: SLTechnology News&Howtos shulou
Shulou(Shulou.com)06/03 Report--
Prometheus machine: 172.27.143.155
Alertmanager machine: 172.27.143.150
Alerting with Prometheus is split into two parts. Alerting rules in the Prometheus server send alerts to Alertmanager.
Alertmanager then manages those alerts: it silences, inhibits, and aggregates them, and sends notifications through channels such as email, PagerDuty, and HipChat.
The main steps to set up alarms and notifications are:
Set up and configure Alertmanager
Configure Prometheus and Alertmanager communication
Create alarm rules in Prometheus
Alertmanager handles alerts sent by client applications, such as Prometheus servers.
It deduplicates and groups the alerts and routes them to the correct receiver integration, such as email, PagerDuty, or OpsGenie. It also handles silencing and inhibition of alerts.
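As an illustration of grouping and inhibition, the fragment below sketches how these behaviours are typically expressed in alertmanager.yml. It is not taken from the article: the receiver name, the timings, and the Critical severity level are assumptions, and the source_match/target_match form is the one used by Alertmanager 0.20.

```yaml
# Illustrative fragment only; names, timings, and the Critical level are assumptions.
route:
  receiver: default                      # hypothetical receiver name
  group_by: ['alertname', 'instance']    # alerts sharing these labels are batched into one notification
  group_wait: 30s                        # delay before the first notification for a new group
  group_interval: 5m                     # delay before notifying about alerts added to an existing group
  repeat_interval: 4h                    # delay before re-sending a notification that is still firing

inhibit_rules:
  # While a Critical alert is firing, mute Warning alerts that share
  # the same alertname and instance.
  - source_match:
      severity: 'Critical'
    target_match:
      severity: 'Warning'
    equal: ['alertname', 'instance']

receivers:
  - name: default
```

Deduplication needs no configuration; silences are created at runtime through the Alertmanager web UI or amtool rather than in this file.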
First, the Prometheus and Grafana services are already set up on the .155 machine.
Next, configure the Alertmanager service
1. wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
2. tar zxf alertmanager-0.20.0.linux-amd64.tar.gz
3. mv alertmanager-0.20.0.linux-amd64 /usr/local/alertmanager
4. vim alertmanager.yml
5. vim /etc/alertmanager/template/wechat.tmpl
{{ define "wechat.default.message" }}
{{ range $i, $alert := .Alerts }}
= Monitoring alarm =
Alarm status: {{ .Status }}
Alarm level: {{ $alert.Labels.severity }}
Alarm type: {{ $alert.Labels.alertname }}
Alarm app: {{ $alert.Annotations.summary }}
Alarm host: {{ $alert.Labels.instance }}
Alarm details: {{ $alert.Annotations.description }}
Trigger threshold: {{ $alert.Annotations.value }}
Alarm time: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
= End =
{{ end }}
{{ end }}
6. Start the service once the configuration is complete
nohup ./alertmanager &
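Step 4 edits alertmanager.yml, but the article does not show its contents. The following is a minimal sketch of what such a file might look like for WeChat Work notifications using the template from step 5. The corp ID, secret, agent ID, and party ID are placeholders, and the receiver name and grouping are assumptions rather than the author's actual settings.

```yaml
# Minimal sketch of alertmanager.yml for WeChat notifications.
# All YOUR_* values are placeholders; receiver name and grouping are assumptions.
global:
  resolve_timeout: 5m
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: 'YOUR_CORP_ID'
  wechat_api_secret: 'YOUR_APP_SECRET'

# Load the custom template created in step 5.
templates:
  - '/etc/alertmanager/template/*.tmpl'

route:
  receiver: wechat
  group_by: ['alertname']

receivers:
  - name: wechat
    wechat_configs:
      - send_resolved: true            # also notify when an alert recovers
        agent_id: 'YOUR_AGENT_ID'
        to_party: 'YOUR_PARTY_ID'
        message: '{{ template "wechat.default.message" . }}'
```

Before starting the service, the file can be checked with the amtool binary shipped in the same tarball: ./amtool check-config alertmanager.yml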
Next, configure the Prometheus service
Modify the configuration file (prometheus.yml) so that Prometheus can reach Alertmanager and load the alerting rules
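The modified prometheus.yml is not shown in the article. A minimal sketch of the relevant sections could look like the following. The Alertmanager address comes from the machine list above; port 9093 is the Alertmanager default and 9100 the node_exporter default. The job name Host and the appname label match what the rules below refer to, while the intervals, the label value, and the rules/*.yml glob are assumptions.

```yaml
# Minimal sketch of prometheus.yml; intervals, label value, and rule path are assumptions.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Tell Prometheus where to send alerts.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['172.27.143.150:9093']

# Load every rule file from the rules directory (relative to this file).
rule_files:
  - 'rules/*.yml'

scrape_configs:
  - job_name: 'Host'
    static_configs:
      - targets: ['172.27.143.155:9100']
        labels:
          appname: 'example-app'       # hypothetical value; used by the summary annotation
```

The result can be validated with promtool check config prometheus.yml, which also checks the rule files it references.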
Create a rules directory.
It holds two files: one with the host monitoring rules and one with the container monitoring rules.
1) cat host_sys.yml
# Note: several thresholds in the expressions (for example > 2 for memory) are lower
# than the percentages quoted in the descriptions, presumably so that the alerts fire
# during testing. Align the two before real use.
# Note: bytes per second divided by 1048576 is MiB/s, so the network rules use MiB/s as the unit.
groups:
- name: Host
  rules:
  - alert: Memory Usage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 2
    for: 1m
    labels:
      name: Memory
      severity: Warning
    annotations:
      summary: "{{ $labels.appname }}"
      description: "Host memory usage exceeds 80%."
      value: "{{ $value }}"
  - alert: CPU Usage
    expr: sum(avg without (cpu) (irate(node_cpu_seconds_total{mode!="idle"}[5m]))) by (instance, appname) > 0.05
    for: 1m
    labels:
      name: CPU
      severity: Warning
    annotations:
      summary: "{{ $labels.appname }}"
      description: "Host CPU utilization exceeds 65%."
      value: "{{ $value }}"
  - alert: HostLoad
    expr: node_load5 > 4
    for: 1m
    labels:
      name: Load
      severity: Warning
    annotations:
      summary: "{{ $labels.appname }}"
      description: "Host 5-minute load average exceeds 4."
      value: "{{ $value }}"
  - alert: Filesystem Usage
    expr: 1 - (node_filesystem_free_bytes / node_filesystem_size_bytes) > 0.3
    for: 1m
    labels:
      name: Disk
      severity: Warning
    annotations:
      summary: "{{ $labels.appname }}"
      description: "Host [{{ $labels.mountpoint }}] partition usage exceeds 80%."
      value: "{{ $value }}%"
  - alert: Diskio writes
    expr: irate(node_disk_writes_completed_total{job=~"Host"}[1m]) > 50
    for: 1m
    labels:
      name: Diskio
      severity: Warning
    annotations:
      summary: "{{ $labels.appname }}"
      description: "Write IOPS on host disk [{{ $labels.device }}] over the last minute is high."
      value: "{{ $value }} iops"
  - alert: Diskio reads
    expr: irate(node_disk_reads_completed_total{job=~"Host"}[1m]) > 5
    for: 1m
    labels:
      name: Diskio
      severity: Warning
    annotations:
      summary: "{{ $labels.appname }}"
      description: "Read IOPS on host disk [{{ $labels.device }}] over the last minute is high."
      value: "{{ $value }} iops"
  - alert: Network_receive
    expr: irate(node_network_receive_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*|virbr.*|ovs-system"}[5m]) / 1048576 > 5
    for: 1m
    labels:
      name: Network_receive
      severity: Warning
    annotations:
      summary: "{{ $labels.appname }}"
      description: "Receive traffic on host NIC [{{ $labels.device }}] exceeds 5 MiB/s over 5 minutes."
      value: "{{ $value }} MiB/s"
  - alert: Network_transmit
    expr: irate(node_network_transmit_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*|virbr.*|ovs-system"}[5m]) / 1048576 > 5
    for: 1m
    labels:
      name: Network_transmit
      severity: Warning
    annotations:
      summary: "{{ $labels.appname }}"
      description: "Transmit traffic on host NIC [{{ $labels.device }}] exceeds 5 MiB/s over 5 minutes."
      value: "{{ $value }} MiB/s"
2) cat container_sys.yml
# Note: bytes per second divided by 1048576 is MiB/s, so the network rules use MiB/s as the unit.
# The network metrics carry an interface label (not device), so the descriptions reference it.
groups:
- name: Container
  rules:
  - alert: CPU Usage
    expr: (sum by (name, instance) (rate(container_cpu_usage_seconds_total{image!=""}[5m])) * 100) > 80
    for: 1m
    labels:
      name: CPU
      severity: Warning
    annotations:
      summary: "{{ $labels.name }}"
      description: "Container CPU usage exceeds 80%."
      value: "{{ $value }}%"
  - alert: Memory Usage
    expr: (container_memory_usage_bytes{name=~".+"} - container_memory_cache{name=~".+"}) / container_spec_memory_limit_bytes{name=~".+"} * 100 > 80
    for: 1m
    labels:
      name: Memory
      severity: Warning
    annotations:
      summary: "{{ $labels.name }}"
      description: "Container memory usage exceeds 80%."
      value: "{{ $value }}%"
  - alert: Network_receive
    expr: irate(container_network_receive_bytes_total{name=~".+",interface=~"eth.+"}[5m]) / 1048576 > 5
    for: 1m
    labels:
      name: Network_receive
      severity: Warning
    annotations:
      summary: "{{ $labels.name }}"
      description: "Receive traffic on container NIC [{{ $labels.interface }}] exceeds 5 MiB/s over 5 minutes."
      value: "{{ $value }} MiB/s"
  - alert: Network_transmit
    expr: irate(container_network_transmit_bytes_total{name=~".+",interface=~"eth.+"}[5m]) / 1048576 > 5
    for: 1m
    labels:
      name: Network_transmit
      severity: Warning
    annotations:
      summary: "{{ $labels.name }}"
      description: "Transmit traffic on container NIC [{{ $labels.interface }}] exceeds 5 MiB/s over 5 minutes."
      value: "{{ $value }} MiB/s"
Restart the Prometheus service after the configuration is complete.
Wait about a minute (the rules use for: 1m), then check that the alert notification arrives.
Once the condition clears, a recovery notification follows.
Host and container alerting is now complete.