2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article walks through examples of fault detection and self-healing in Kubernetes (K8s). The coverage is fairly detailed, and interested readers may find it a useful reference.
Component failure
Component failure can be thought of as a subclass of node failure, except that the source of the fault is one of the basic components of K8s itself.
DNS failure: an external domain name fails to resolve in 2 of the 6 DNS Pods. The result is a large number of online business errors caused by failed domain-name resolution.
CNI failure: the container network on a few nodes loses connectivity to the outside. Each affected node can still reach its own Pod IPs, but other nodes cannot reach the Pod IPs on the failed node. In this case the Pod-local health checks still pass and are therefore useless, so the failed instances persist and a certain proportion of business requests fail.
Using KubeNurse for Cluster Network Monitoring
Kubenurse probes the network paths to ingress, dns, apiserver, and kube-proxy.
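To make the kubenurse idea concrete, the following is a minimal DaemonSet sketch. The image tag, namespace, URLs, and service account name are illustrative assumptions; the kubenurse repository ships complete manifests (including the Service, Ingress, and RBAC objects) that should be preferred in practice.

```yaml
# Sketch: run kubenurse on every node so each node probes
# ingress, the in-cluster service path, apiserver, and its neighbours.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kubenurse
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kubenurse
  template:
    metadata:
      labels:
        app: kubenurse
    spec:
      serviceAccountName: kubenurse        # needs RBAC from the upstream manifests
      containers:
      - name: kubenurse
        image: postfinance/kubenurse:latest  # pin a concrete tag in production
        env:
        - name: KUBENURSE_INGRESS_URL        # external URL routed through ingress
          value: https://kubenurse.example.com
        - name: KUBENURSE_SERVICE_URL        # in-cluster service URL
          value: http://kubenurse.kube-system.svc.cluster.local:8080
        ports:
        - containerPort: 8080                # exposes /metrics for Prometheus
```

Because it runs as a DaemonSet, a path that fails from only some nodes (such as the partial CNI failure described above) shows up as per-node probe errors rather than being masked by healthy nodes.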
Node failure
Hardware error: CPU/memory/disk failure
Kernel problem: kernel deadlock / corrupted file systems
Container runtime error: Docker hangs (the daemon process is alive but unresponsive)
Infrastructure service failure: NTP failure
Node-problem-detector
Root cause: on Kubernetes clusters we usually only manage the stable operation of the cluster itself and its containers, yet that stability depends strongly on the stability of the underlying nodes. Node management, however, is comparatively weak in Kubernetes, probably because in the original design it was considered the business of the IaaS layer. As Kubernetes has evolved into something like an operating system, it manages more and more, and node management is being pulled into Kubernetes as well. This is the need that the node-problem-detector (NPD) project grew out of.
Kubernetes supports two reporting mechanisms:
1. NodeCondition: a permanent problem that prevents Pods from running on the node. The condition is not reset until the node is restarted.
2. Event: a temporary problem that affects the node but is meaningful for system diagnosis. NPD uses these Kubernetes reporting mechanisms to attach error information to the node object by scanning system logs (for example, the journal on CentOS).
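The two mechanisms map directly onto NPD's monitor configuration: `temporary` rules produce Events, while `permanent` rules set a NodeCondition. The fragment below follows the shape of the upstream `config/kernel-monitor.json`; the exact patterns shown are illustrative and should be checked against the NPD repository for your version.

```json
{
  "plugin": "kmsg",
  "logPath": "/dev/kmsg",
  "lookback": "5m",
  "source": "kernel-monitor",
  "conditions": [
    {
      "type": "KernelDeadlock",
      "reason": "KernelHasNoDeadlock",
      "message": "kernel has no deadlock"
    }
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "OOMKilling",
      "pattern": "Killed process \\d+ (.+) total-vm:\\d+kB.*"
    },
    {
      "type": "permanent",
      "condition": "KernelDeadlock",
      "reason": "DockerHung",
      "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
    }
  ]
}
```

A matching `temporary` rule emits a one-off Event on the node; a matching `permanent` rule flips the named condition (here `KernelDeadlock`) to true until the node is restarted.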
When a node fails, events are recorded in various host logs. These logs (kernel logs in particular) contain a lot of noise, so NPD extracts the valuable information, which can be reported to Prometheus and turned into offline events. That information can be pushed to WeCom for manual handling, or matched against the remediation method library of a self-healing system for automatic recovery. In a bare-metal K8s cluster, automatic node replacement may be impossible due to the lack of infrastructure support, so abnormal node states can only be cured through more precise automated operations.
Taking a CNI failure as an example, a possible remediation flow is as follows:
Query the remediation method library, and if a match is found, perform the corresponding remediation action.
If that does not work, try deleting the CNI Pod on the node to reset the node's routes and iptables configuration.
If that does not work, try restarting the container runtime.
Alert and ask operations staff to intervene.
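The middle two steps can be sketched as an ops script. This is cluster-bound and illustrative only: it assumes the CNI runs as a DaemonSet labeled `app=flannel` in `kube-system` (substitute your CNI's label), that the node is reachable over SSH, and that the runtime is Docker managed by systemd.

```shell
#!/bin/sh
# Usage: ./heal-cni.sh <node-name>
NODE="$1"

# Step 2: delete the CNI Pod on the failed node so the DaemonSet
# recreates it and re-applies routes and iptables rules.
kubectl -n kube-system delete pod \
  -l app=flannel \
  --field-selector spec.nodeName="$NODE"

# Step 3: if connectivity is still broken, restart the container runtime.
ssh "$NODE" 'sudo systemctl restart docker'
```

In a real self-healing system each step would be followed by a re-probe (for example, checking cross-node Pod IP reachability) before escalating to the next step or to a human.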
To try NPD in practice, you need a K8s cluster with at least one worker node. You can refer to https://github.com/kubernetes/node-problem-detector.
Main parameters:
--prometheus-address: the default bind address is 127.0.0.1. If Prometheus needs to scrape the metrics, change it.
--config.system-log-monitor: NPD launches a separate log monitor for each configuration. Example: config/kernel-monitor.json.
--config.custom-plugin-monitor: NPD launches a separate custom-plugin monitor for each configuration. Example: config/custom-plugin-monitor.json.
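These flags end up as container args in the NPD DaemonSet. The fragment below is a sketch of that wiring; the image tag and config mount paths are assumptions modeled on the upstream deployment manifest, so adjust them to your cluster.

```yaml
# Container spec fragment inside the node-problem-detector DaemonSet.
containers:
- name: node-problem-detector
  image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19
  command:
  - /node-problem-detector
  - --logtostderr
  - --prometheus-address=0.0.0.0                         # bind on all interfaces for scraping
  - --config.system-log-monitor=/config/kernel-monitor.json
  - --config.custom-plugin-monitor=/config/custom-plugin-monitor.json
  volumeMounts:
  - name: config                                         # ConfigMap with the monitor JSON files
    mountPath: /config
    readOnly: true
```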
Clone the code locally, adjust the DaemonSet in the deployment file as needed, and then do the following:
Create the ConfigMap: kubectl create -f node-problem-detector-config.yaml
Create the DaemonSet: kubectl create -f node-problem-detector.yaml
To verify that NPD captures information, you can test on a node of the cluster:
Running sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg" produces a KernelOops event visible in kubectl describe nodes x.x.x.x. Running sudo sh -c "echo 'kernel: INFO: task docker:20744 blocked for more than 120 seconds.' >> /dev/kmsg" produces a DockerHung event visible in kubectl describe nodes x.x.x.x.
Once the event alarms reach Prometheus, you can configure alerting policies and deliver notifications to WeChat (WeCom).
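A minimal alerting rule might look like the following. It assumes Prometheus scrapes NPD's metrics endpoint (NPD exposes a problem counter); the rule and label names here are a sketch, and the WeCom delivery itself would be configured separately in Alertmanager's `wechat_configs` receiver.

```yaml
# Prometheus alerting rule: fire when NPD records any new problem on a node.
groups:
- name: node-problem-detector
  rules:
  - alert: NodeProblemDetected
    expr: increase(problem_counter[5m]) > 0
    labels:
      severity: warning
    annotations:
      summary: "NPD reported {{ $labels.reason }} on {{ $labels.instance }}"
```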
This concludes the example analysis of K8s fault detection and self-healing. I hope the content above is helpful; if you found the article useful, feel free to share it.