Table of contents:

1. Prometheus architecture
2. K8S monitoring indicators and implementation ideas
3. Deploying Prometheus on the K8S platform
4. Configuration analysis based on K8S service discovery
5. Deploying Grafana
6. Monitoring K8S cluster Pods, Nodes, and resource objects
7. Using Grafana to visually display Prometheus monitoring data
8. Alarm rules and alarm notifications
When it comes to monitoring today, the first choice is Prometheus + Grafana; many large companies, such as RBM, 360, and NetEase, basically use this monitoring stack.
What is Prometheus?
Prometheus is a monitoring system originally built at SoundCloud, a cloud-based audio company, by engineers who had previously worked at Google. It has been a community open source project since 2012, with a very active developer and user community. To emphasize open source and independent maintenance, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016, becoming the second hosted project after Kubernetes. The project has developed quickly and has risen together with Kubernetes itself.
Official website: https://prometheus.io
GitHub: https://github.com/prometheus
Composition and structure of Prometheus
Next, let's walk through the official architecture diagram.
The leftmost block is collection: what gets scraped. Targets fall into two categories. Most are long-running services, such as web services that run continuously and expose metrics. Others are short-lived jobs, such as cron-style tasks that run briefly and then exit; these push their metrics to Pushgateway, which holds them so they can still be collected.
The middle block is Prometheus itself. Internally it has a time series database, TSDB. Prometheus can display data through its own built-in UI, but that UI is fairly basic, which is why the open source Grafana is usually used alongside it. Once the monitored endpoints expose their metrics, Prometheus actively scrapes them, stores them in its TSDB, and serves them to the Web UI, Grafana, or API clients. Queries are written in PromQL, which plays roughly the same role for Prometheus as SQL does for MySQL: it is the language used to query the data.
Above that, in the middle, is service discovery. When you have many monitored endpoints, listing them all by hand is not realistic, so you need a way to discover new nodes automatically, or to add a whole batch of nodes to monitoring at once. Prometheus has built-in Kubernetes service discovery: it connects to the Kubernetes API to discover which applications and Pods you have deployed, and everything they expose can be monitored. This built-in support is why Kubernetes and Prometheus work so well together.
In the upper right corner is alerting. Alerting is implemented by a separate component, Alertmanager: when Prometheus evaluates a rule and a threshold is crossed, it fires the alert to Alertmanager, which handles the alert-related processing and then sends the notification to a receiver such as email, WeCom, or DingTalk. That covers the five blocks of the overall architecture.
Summary:
Prometheus Server: collects metrics, stores time series data, and provides the query interface
Client Library: client libraries for many languages; for example, a web site written in Java can integrate the Java client to expose its own metrics, though business-specific metrics usually require development work
Push Gateway: short-term holding point for metrics, mainly used by temporary or batch jobs
Exporters: collect existing metrics from third-party services and expose them in Prometheus format; roughly an agent on the collection side
Alertmanager: handles alerting
Web UI: a simple built-in web console
Data model
Prometheus stores all data as time series. Samples with the same metric name and the same set of labels belong to the same series. Each time series is uniquely identified by its metric name plus a set of key-value pairs (labels). When querying with PromQL, you also filter on these labels.
Time series format:

<metric name>{<label name>=<label value>, ...}

That is, a metric name followed by any number of label name/value pairs in curly braces.

Example: api_http_requests_total{method="POST", handler="/messages"}
Here api_http_requests_total is the metric name, and the labels record things like the HTTP method (POST, GET) and the requested resource (for example /messages or an API path). You can attach as many labels as you like, for example the request protocol or other HTTP header fields; anything you want to slice on can be labelled this way.
Jobs and instances
Instance: a single target that can be scraped is called an instance. If you have used Zabbix, this corresponds to what is usually called a host or monitored endpoint; in Prometheus it is an instance.
Job: a collection of instances with the same purpose is called a job; it is a logical grouping of your targets. For example, if you have three web services, you write one job and put the three instances under it, as in the sketch below.
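As a minimal sketch (the job name and target addresses below are made up, not from the repository), this is how one job grouping three instances looks in a static scrape configuration:

scrape_configs:
  - job_name: web                  # one logical group
    static_configs:
      - targets:                   # each target is one instance
          - 192.168.30.31:9100     # hypothetical instance 1
          - 192.168.30.32:9100     # hypothetical instance 2
          - 192.168.30.33:9100     # hypothetical instance 3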
2. K8S monitoring indicators
Monitoring of Kubernetes itself:
Node resource utilization: a typical production environment has dozens or hundreds of nodes to monitor
Number of nodes: monitoring the node count matters because, together with resource utilization, it lets you evaluate capacity: how many projects a node can run, what state the overall resource usage is in, and roughly how many resources would be needed to run one more project.
Number of Pods per node: how many Pods each node can run. By default a node can run up to 110 Pods, but in practice it is rarely that many; for example, on a machine with 128 GB of memory and 32 CPU cores, a Java project allocated 2 GB per Pod gives roughly 50-60 Pods. On an average machine a node runs a few dozen Pods and rarely more than a hundred.
Resource object status: statistics on the state of resource objects such as Pods, Services, Deployments, and Jobs (a sketch of such queries follows below).
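As a rough sketch of what such statistics look like in PromQL, the queries below use kube-state-metrics series; metric names can differ slightly between versions, so treat this as an illustration rather than the repository's actual rules:

groups:
  - name: cluster-capacity.rules
    rules:
      # total number of nodes known to the cluster
      - record: cluster:node_count
        expr: count(kube_node_info)
      # number of Pods currently placed on each node
      - record: node:pod_count
        expr: count(kube_pod_info) by (node)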
Pod monitoring
Number of Pods per project: how many Pods each project runs and roughly how much they use, so you can evaluate how many resources the project consumes overall and how much each Pod takes up.
Container resource utilization: how much CPU and memory each container consumes
Application metrics: metrics exposed by the application itself. These are usually hard for operations to obtain on their own, so before monitoring you need developers to expose them. Client libraries exist for many languages; developers integrate one, expose the indicators the application cares about, and those get pulled into monitoring. Without cooperation from development this is very difficult for operations to do, short of writing an external probe yourself (for example in shell or Python), which really only works if the application already provides an API.
Prometheus Monitoring K8S Architecture
To monitor node resources, run node_exporter on each node. It is a collector for Linux hosts; once running it exposes the node's CPU, memory, network I/O, and other host-level metrics.
To monitor containers, Kubernetes provides cAdvisor inside the kubelet, which collects Pod and container metrics. It is built in and does not need to be deployed separately; you only need to know how to reach the cAdvisor endpoint.
To monitor Kubernetes resource objects, deploy the kube-state-metrics service; it periodically reads object state from the API server and exposes it for Prometheus to scrape. Alerts are sent to receivers via Alertmanager, and everything is visualized with Grafana.
Service discovery: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config
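To give an idea of how targets opt in to this discovery, here is a minimal sketch of a Pod carrying the prometheus.io annotations that the kubernetes-pods job in the configuration below looks for; the application name, image, and port are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: demo-app                        # hypothetical application
  labels:
    app: demo-app
  annotations:
    prometheus.io/scrape: "true"        # opt this Pod into scraping
    prometheus.io/port: "8080"          # port where /metrics is served
    prometheus.io/path: "/metrics"      # metrics path (this is also the default)
spec:
  containers:
    - name: demo-app
      image: demo-app:latest            # hypothetical image
      ports:
        - containerPort: 8080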
Deploy Prometheus+Grafana in K8S
Some copies of these documents have broken YAML formatting, which can cause deployment problems, so it is recommended to pull the manifests from my code repository. Add your public key to the repository first, otherwise the clone will fail.
git clone git@gitee.com:zhaocheng172/prometheus.git
[root@k8s-master prometheus-k8s]# ls
alertmanager-configmap.yaml         OWNERS
alertmanager-deployment.yaml        prometheus-configmap.yaml
alertmanager-pvc.yaml               prometheus-rbac.yaml
alertmanager-service.yaml           prometheus-rules.yaml
grafana.yaml                        prometheus-service.yaml
kube-state-metrics-deployment.yaml  prometheus-statefulset-static-pv.yaml
kube-state-metrics-rbac.yaml        prometheus-statefulset.yaml
kube-state-metrics-service.yaml     README.md
node_exporter.sh
First create the RBAC objects, because the Prometheus server references the ServiceAccount: Prometheus connects to the Kubernetes API to pull a lot of metrics, and the ClusterRole it is bound to grants read-only access (get, list, watch), so it can view but not modify anything.
[root@k8s-master prometheus-k8s]# cat prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: kube-system

[root@k8s-master prometheus-k8s]# kubectl create -f prometheus-rbac.yaml
Now create a configmap
rule_files:
- /etc/config/rules/*.rules

This is the directory where the alarm rules live; the ConfigMap is mounted into the Prometheus Pod so that the main process reads these rule files.

scrape_configs:
- job_name: prometheus
  static_configs:
  - targets:
    - localhost:9090

The scrape_configs section configures the monitored targets, grouped by job_name. The job above is Prometheus monitoring itself. Below it there is also a job for the nodes: node_exporter runs on each node and listens on its own port, and the target there is changed to point at the node to be monitored.
scrape_interval: 30s sets how often the targets of that job are scraped, and the alerting section at the end points Prometheus at the Alertmanager service it should send alerts to.

[root@k8s-master prometheus-k8s]# kubectl create -f prometheus-configmap.yaml
[root@k8s-master prometheus-k8s]# cat prometheus-configmap.yaml
# Prometheus configuration format https://prometheus.io/docs/prometheus/latest/configuration/configuration/
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  prometheus.yml: |
    rule_files:
    - /etc/config/rules/*.rules

    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090

    - job_name: kubernetes-nodes
      scrape_interval: 30s
      static_configs:
      - targets:
        - 192.168.30.22:9100   # node_exporter default port

    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-kubelet
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    - job_name: kubernetes-services
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_probe
      - source_labels:
        - __address__
        target_label: __param_target
      - replacement: blackbox
        target_label: __address__
      - source_labels:
        - __param_target
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name

    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager:80"]
Next configure the rules. This is the alerting configuration, split into two groups: general rules that apply to all instances (if an instance goes down, an alert is fired; the instance is the agent on the monitored side) and node rules that watch the CPU, memory, and disk utilization of each node. The alert threshold is written as a PromQL expression that Prometheus evaluates against the collected data; if the comparison is true, the alert fires, as in the example below, and is pushed to Alertmanager, which handles delivering the alert message.
expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
[root@k8s-master prometheus-k8s]# kubectl create -f prometheus-rules.yaml
[root@k8s-master prometheus-k8s]# cat prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} stopped working"
          description: "{{ $labels.instance }} job {{ $labels.job }} has been down for more than 5 minutes."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is too high"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is greater than 80% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is too high"
          description: "{{ $labels.instance }} memory usage is greater than 80% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU utilization is too high"
          description: "{{ $labels.instance }} CPU usage greater than 60% (current value: {{ $value }})"
Next deploy the StatefulSet.
The Pod has two containers. prometheus-server-configmap-reload watches the config volume and reloads the Prometheus configuration when it changes; prometheus-server is the main Prometheus process. The /data directory is persisted, the configuration file comes from a ConfigMap, and the alarm rules are also mounted from a ConfigMap. The volumeClaimTemplates use the managed-nfs-storage storage class to dynamically create a PV.

[root@k8s-master prometheus-k8s]# cat prometheus-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    k8s-app: prometheus
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v2.2.1
spec:
  serviceName: "prometheus"
  replicas: 1
  podManagementPolicy: "Parallel"
  updateStrategy:
    type: "RollingUpdate"
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: prometheus
      initContainers:
      - name: "init-chown-data"
        image: "busybox:latest"
        imagePullPolicy: "IfNotPresent"
        command: ["chown", "-R", "65534:65534", "/data"]
        volumeMounts:
        - name: prometheus-data
          mountPath: /data
          subPath: ""
      containers:
      - name: prometheus-server-configmap-reload
        image: "jimmidyson/configmap-reload:v0.1"
        imagePullPolicy: "IfNotPresent"
        args:
        - --volume-dir=/etc/config
        - --webhook-url=http://localhost:9090/-/reload
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
          readOnly: true
        resources:
          limits:
            cpu: 10m
            memory: 10Mi
          requests:
            cpu: 10m
            memory: 10Mi
      - name: prometheus-server
        image: "prom/prometheus:v2.2.1"
        imagePullPolicy: "IfNotPresent"
        args:
        - --config.file=/etc/config/prometheus.yml
        - --storage.tsdb.path=/data
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --web.console.templates=/etc/prometheus/consoles
        - --web.enable-lifecycle
        ports:
        - containerPort: 9090
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 30
          timeoutSeconds: 30
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          timeoutSeconds: 30
        # based on 10 running nodes with 30 pods each
        resources:
          limits:
            cpu: 200m
            memory: 1000Mi
          requests:
            cpu: 200m
            memory: 1000Mi
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        - name: prometheus-data
          mountPath: /data
          subPath: ""
        - name: prometheus-rules
          mountPath: /etc/config/rules
      terminationGracePeriodSeconds: 300
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-config
      - name: prometheus-rules
        configMap:
          name: prometheus-rules
  volumeClaimTemplates:
  - metadata:
      name: prometheus-data
    spec:
      storageClassName: managed-nfs-storage
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: "16Gi"
Here I already have NFS set up with a dynamic PVC provisioner (NFS is used as the network storage), so that part is not demonstrated again; see my earlier blog post. The PVC has already been created at this point.
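For reference only, a minimal sketch of the kind of StorageClass the volumeClaimTemplates above expect; the provisioner name depends entirely on how the nfs-client provisioner was installed, so it is an assumption here, not the actual manifest from that earlier post:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-nfs-storage
provisioner: fuseim.pri/ifs        # assumed nfs-client provisioner name; must match your deployment
parameters:
  archiveOnDelete: "false"         # discard data when the PVC is removed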
[root@k8s-master prometheus-k8s]# kubectl get pod -n kube-system
NAME                                 READY   STATUS    RESTARTS   AGE
coredns-bccdc95cf-kqxwv              1/1     Running   3          2d4h
coredns-bccdc95cf-nwkbp              1/1     Running   3          2d4h
etcd-k8s-master                      1/1     Running   2          2d4h
kube-apiserver-k8s-master            1/1     Running   2          2d4h
kube-controller-manager-k8s-master   1/1     Running   5          2d4h
kube-flannel-ds-amd64-dc5z9          1/1     Running   1          2d4h
kube-flannel-ds-amd64-jm2jz          1/1     Running   1          2d4h
kube-flannel-ds-amd64-z6tt2          1/1     Running   1          2d4h
kube-proxy-9ltx7                     1/1     Running   2          2d4h
kube-proxy-lnzrj                     1/1     Running   1          2d4h
kube-proxy-v7dqm                     1/1     Running   1          2d4h
kube-scheduler-k8s-master            1/1     Running   5          2d4h
prometheus-0                         2/2     Running   0          3m3s
Now look at the Service. It uses the NodePort type with port 9090; you could also expose it with an ingress instead.
[root@k8s-master prometheus-k8s]# cat prometheus-service.yaml
kind: Service
apiVersion: v1
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/name: "Prometheus"
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  type: NodePort
  ports:
  - name: http
    port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    k8s-app: prometheus
Now it can be accessed on the randomly assigned NodePort, 32276 in this case; Prometheus has been deployed successfully.
[root@k8s-master prometheus-k8s]# kubectl create -f prometheus-service.yaml
[root@k8s-master prometheus-k8s]# kubectl get svc -n kube-system
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns     ClusterIP   10.1.0.10    <none>        53/UDP,53/TCP,9153/TCP   2d4h
prometheus   NodePort    10.1.58.1    <none>        9090:32276/TCP           22s
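If you prefer the ingress option mentioned above instead of the NodePort, a minimal sketch looks like this; it assumes an ingress controller is already running in the cluster, the hostname is made up, and the apiVersion should match what your cluster supports:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: prometheus
  namespace: kube-system
spec:
  rules:
    - host: prometheus.example.com        # hypothetical hostname
      http:
        paths:
          - path: /
            backend:
              serviceName: prometheus     # the Service defined above
              servicePort: 9090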
The built-in UI is very simple and hardly meets enterprise requirements; it is mainly used here for debugging. It is where you write PromQL expressions to query data, much like SQL for MySQL. Under Status you can verify the configuration: the config file shows the alert thresholds we added, the node job, and the Alertmanager address, and Rules shows the two groups we planned, the general rules and the node rules, which mainly cover memory, disk, and CPU.
Now, to look at CPU utilization, Grafana is normally used for the display.
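For example, per-node CPU utilization can be queried with the same expression the alert rules above use; as a sketch, it could also be saved as a recording rule (the rule and group names here are made up):

groups:
  - name: node-utilization.rules
    rules:
      # percentage of CPU in use per instance, derived from the idle counter
      - record: instance:node_cpu_utilization:percent
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)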
Deploy Grafana on K8S platform
Grafana is also deployed as a StatefulSet, with its PV created dynamically as well. The Service exposes NodePort 30007.
[root@k8s-master prometheus-k8s]# cat grafana.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: grafana
  namespace: kube-system
spec:
  serviceName: "grafana"
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana
        ports:
        - containerPort: 3000
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
        - name: grafana-data
          mountPath: /var/lib/grafana
          subPath: grafana
      securityContext:
        fsGroup: 472
        runAsUser: 472
  volumeClaimTemplates:
  - metadata:
      name: grafana-data
    spec:
      storageClassName: managed-nfs-storage
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-system
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 3000
    nodePort: 30007
  selector:
    app: grafana
The default username and password are both admin.
First add Prometheus as the data source: add a data source and select Prometheus.
Then fill in the URL, which can be either the address you use to reach the Prometheus UI page or the Service address inside the cluster.
[root@k8s-master prometheus-k8s]# kubectl get svc -n kube-system
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE
grafana      NodePort    10.1.246.143   <none>        80:30007/TCP             11m
kube-dns     ClusterIP   10.1.0.10      <none>        53/UDP,53/TCP,9153/TCP   2d5h
prometheus   NodePort    10.1.58.1      <none>        9090:32276/TCP           40m
After saving, check that the data source is listed; there is now one data source.
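If you prefer not to click through the UI, Grafana can also pick the data source up from a provisioning file; a minimal sketch, assuming Grafana and Prometheus run in the same namespace so the plain Service name resolves:

# e.g. mounted at /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090     # in-cluster Service address; a NodePort URL also works
    isDefault: true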
6. Monitor Pod, Node and resource objects in K8S cluster
Pod
The kubelet on each node uses the metrics interface provided by its embedded cAdvisor to obtain performance metrics for all Pods and containers on that node.
In other words, the kubelet exposes two interface addresses:
http://NodeIP:10255/metrics/cadvisor (the read-only port)
https://NodeIP:10250/metrics/cadvisor (the kubelet's authenticated API; with the right authorization it allows full access)
You can check this on a node: port 10250 handles the kubelet's authenticated API and also serves the cAdvisor metrics. When we deployed Prometheus it already started collecting cAdvisor data, because the Prometheus configuration file already defines how to scrape it (the kubernetes-nodes-cadvisor job).
[root@k8s-node1 ~]# netstat -antp | grep 10250
tcp6    0    0 :::10250                :::*                    LISTEN      107557/kubelet
tcp6    0    0 192.168.30.22:10250     192.168.30.23:58692     ESTABLISHED 107557/kubelet
tcp6    0    0 192.168.30.22:10250     192.168.30.23:46555     ESTABLISHED 107557/kubelet
Node
Use the node_exporter collector to collect node resource utilization.
https://github.com/prometheus/node_exporter
Documentation: https://prometheus.io/docs/guides/node-exporter/
Resource objects
kube-state-metrics collects the status information of the various resource objects in K8s.
https://github.com/kubernetes/kube-state-metrics
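The repository above already ships kube-state-metrics manifests; as a rough sketch of how its Service is typically annotated so that the kubernetes-service-endpoints job scrapes it (the port and selector label here are assumptions, check the actual kube-state-metrics-service.yaml):

apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"    # picked up by the kubernetes-service-endpoints job
spec:
  ports:
    - name: http-metrics
      port: 8080                    # assumed metrics port
      targetPort: 8080
  selector:
    k8s-app: kube-state-metrics     # assumed label on the kube-state-metrics Pods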
Now import a template so the Pod data can be viewed; templates present the data much more intuitively.
7. Use Grafana to visually display Prometheus monitoring data
Recommended templates come from the Grafana sharing center (grafana.com), where people upload dashboards they have written; you can also write and upload your own for others to use. Each template has an ID, and the ID is all you need to import it. As long as the PromQL behind the template's panels matches data you actually have, it will display it.
Cluster resource monitoring: 3119
Resource status monitoring: 6417
Node Monitoring: 9276
Now use template 3119 to show the cluster's resources: open the dashboard import screen.
Select Import template
Enter 3119 and it will automatically resolve the template's name.
Since the data is already there, you can directly view the resources of the whole cluster.
Below is a chart of network I/O: one panel for receive, one for transmit.
The following is the usage of cluster memory
Here the node has 4 GB of memory, of which 3.84 GB is recognized and 2.26 GB is in use; the CPU is dual-core with 0.11 in use. The panel on the right, the cluster file system, shows no data. Look at how its PromQL is written, copy that query into the Prometheus UI, and test whether it returns anything; usually the cause is that the query does not match any series.
Let's see how to solve this.
Compare against data that does exist and strip the query down piece by piece until it returns results. In this case the problem is the node name matching: the template was uploaded by someone else, so its label matchers have to be adapted to our own environment. Once the matching PromQL is found, change the query in Grafana, and the panel shows data.
Beyond this, other templates can also be used for additional monitoring. There are plenty on the official Grafana site (search for k8s, for example), but some do not work out of the box and need to be modified, such as the dashboards for monitoring an etcd cluster.
Node
Use the node_exporter collector to collect node resource utilization.
https://github.com/prometheus/node_exporter
Documentation: https://prometheus.io/docs/guides/node-exporter/
Note that node_exporter is not deployed as a Pod here, because running it inside the cluster does not report host disk utilization properly. The official manifests do provide an in-cluster way to run it, but since it cannot show the disk, it is instead deployed directly on each node as a daemon started from a binary on the host, which is also quite simple.
Look at the script below: it enables the systemd collector so that the startup status of selected services is monitored; if one of those daemons dies, Prometheus will pick it up. That is what the following parameters do:

--collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service
[root@k8s-node1 ~]# bash node_exporter.sh

The script (node_exporter.sh) begins with:

#!/bin/bash
wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz
tar zxf node_exporter-0.17.0.linux-amd64.tar.gz
mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter

The rest of the script sets up node_exporter as a systemd service with the flags shown above and starts it.

Next, lower the alarm threshold to 20 so that an alarm is triggered for testing:
[root@k8s-master prometheus-k8s]# vim prometheus-rules.yaml
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 20
Rebuild the Pod so the new rules are loaded; it restarts automatically, and checking Prometheus shows the change has taken effect. In production you would instead call the reload API to signal Prometheus to re-read the rules (possible here because --web.enable-lifecycle is enabled) rather than recreating the Pod; there are other articles online covering that.
[root@k8s-master prometheus-k8s]# kubectl delete pod prometheus-0 -n kube-system
In the Alerts page the rule changes colour: first pending, then red once it is firing. Whether a notification is actually sent depends on Alertmanager's processing logic, which is more involved: silences, alarm convergence (grouping), and waiting for confirmation all mean an alert is not sent the instant it is triggered.
Once the alert turns pink (firing) it has already been pushed to Alertmanager, and only in that state is the alarm message actually sent out.
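That grouping and waiting behaviour is controlled by the route section of the Alertmanager configuration (alertmanager-configmap.yaml in the repository). The sketch below shows the kind of settings involved; the ConfigMap name, SMTP server, and addresses are placeholders, not the repository's actual values:

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config          # assumed name; match the one referenced by the Alertmanager Deployment
  namespace: kube-system
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.example.com:25'      # placeholder SMTP server
      smtp_from: 'alertmanager@example.com'      # placeholder sender
    route:
      group_by: ['alertname']    # alerts with the same name are batched into one notification
      group_wait: 30s            # wait for more alerts in a new group before the first send
      group_interval: 5m         # minimum gap between notifications for the same group
      repeat_interval: 1h        # how often an unresolved alert is re-sent
      receiver: default-email
    receivers:
      - name: default-email
        email_configs:
          - to: 'ops@example.com'                # placeholder recipient

Silences on top of this are created at runtime in the Alertmanager UI rather than in the configuration file.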