What to do when a K8S node is abnormal

2025-04-22 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31 Report

What should you do when a K8S node becomes abnormal? This article analyzes the problem and its solutions in detail, in the hope of helping readers who face it find a simpler and more practical approach.

Significance of node health detection

While a K8S cluster is running, nodes often become unavailable for a variety of reasons, such as runtime component problems, kernel deadlocks, or insufficient resources. By default, Kubelet monitors node resource states such as PIDPressure, MemoryPressure, and DiskPressure. However, by the time Kubelet reports these states, the node may already have been unavailable for a long time, and Kubelet may already have started evicting Pods. Native K8S node health detection is therefore inadequate in some scenarios. We need to detect node problems earlier, describe node health with more fine-grained indicators, and apply matching recovery strategies, so that operation and maintenance becomes more automated and the burden on developers and operators is reduced.
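To make these default conditions concrete, the following minimal Python sketch lists the unhealthy conditions found in the JSON printed by `kubectl get node <name> -o json`. The function name and the sample data are illustrative assumptions; only the `status.conditions` layout (entries with `type` and `status`) comes from the Kubernetes Node API.

```python
import json


def unhealthy_conditions(node_json):
    """Return (type, status) pairs for conditions that signal a problem.

    "Ready" is healthy when its status is "True"; pressure-style conditions
    such as MemoryPressure are healthy when their status is "False".
    """
    node = json.loads(node_json)
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        if ctype == "Ready":
            if status != "True":
                problems.append((ctype, status))
        elif status != "False":
            # "True" means the pressure is present, "Unknown" means no recent report.
            problems.append((ctype, status))
    return problems


# Hand-written sample data, not output captured from a real cluster.
SAMPLE_NODE = json.dumps({
    "status": {
        "conditions": [
            {"type": "MemoryPressure", "status": "False"},
            {"type": "DiskPressure", "status": "True"},
            {"type": "PIDPressure", "status": "False"},
            {"type": "Ready", "status": "True"},
        ]
    }
})
```

With the sample above, only DiskPressure would be flagged. A condition that shows up here has already been reported by Kubelet, which is exactly the late signal the text describes.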

Node-Problem-Detector

NPD (Node-Problem-Detector) is an open source node health detection component from the Kubernetes community. NPD finds node anomalies by matching system logs or files against regular expressions. Based on their own operation and maintenance experience, users can configure regular expressions for log lines that indicate a problem and choose how each problem is reported. NPD parses the user's configuration file, and when a log line matches a configured regular expression, the detected anomaly can be reported as a NodeCondition, an Event, or a Prometheus metric. Besides log matching, NPD also accepts custom detection plug-ins: users can write their own scripts or executables and integrate them as NPD plug-ins, and NPD runs them periodically.
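The log-matching idea can be sketched in a few lines of Python. This is only an illustration of the technique, not NPD's code or configuration format (NPD itself is written in Go). The `Rule` fields, the example patterns, and the `match_log_lines` helper are all assumptions made for this sketch.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """One hypothetical log-matching rule; field names are illustrative."""
    kind: str                        # "temporary" -> Event, "permanent" -> NodeCondition
    reason: str                      # short machine-readable reason
    pattern: str                     # regular expression applied to each log line
    condition: Optional[str] = None  # condition type, used only by "permanent" rules


# Example rules; the patterns are assumptions, not an official rule set.
RULES = (
    Rule("temporary", "OOMKilling", r"Killed process \d+ \((?P<name>.+?)\)"),
    Rule("permanent", "DockerHung",
         r"task docker:\w+ blocked for more than \d+ seconds",
         condition="KernelDeadlock"),
)


def match_log_lines(lines, rules=RULES):
    """Return one report per (line, rule) pair whose pattern matches the line."""
    compiled = [(rule, re.compile(rule.pattern)) for rule in rules]
    reports = []
    for line in lines:
        for rule, regex in compiled:
            if regex.search(line):
                reports.append({
                    "report_as": "NodeCondition" if rule.kind == "permanent" else "Event",
                    "condition": rule.condition,
                    "reason": rule.reason,
                    "message": line.strip(),
                })
    return reports
```

Splitting rules into one-off problems (reported as Events) and lasting problems (reported as NodeConditions) mirrors the reporting choices described above. A real detector would also tail the log continuously rather than scan a fixed list of lines.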

Node Health Detection in TKE

In TKE, NPD is integrated as an extension component with enhanced capabilities, called the NodeProblemDetectorPlus (NPDPlus) extension component. Users can deploy NPDPlus on an existing cluster with one click, or choose to deploy it when creating a cluster. Drawing on the Tencent Cloud container team's long experience operating K8S clusters, a set of metrics that can reveal specific kinds of node anomalies was identified, and some of them are integrated into NPDPlus. For example, NPDPlus checks the systemd status of Kubelet and Docker from inside its container, and detects file descriptor and thread count pressure on the host. The specific indicators are as follows:

TKE uses NPDPlus to discover in advance that a node may become unavailable, rather than to report the state once the node is already unhealthy. After deploying NPDPlus in a TKE cluster, running kubectl describe node shows a number of extra Conditions. For example, FDPressure indicates whether the number of file descriptors in use on the node has reached 80% of the maximum allowed by the machine, ThreadPressure indicates whether the number of threads on the node has reached 90% of the maximum allowed by the machine, and so on. Users can monitor these Conditions and take evasive action early when an anomaly appears.
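The threshold logic behind these two Conditions can be sketched as follows. The 80% and 90% thresholds come from the text. The function names and the /proc files used as data sources are assumptions, since the article does not say how NPDPlus actually measures usage.

```python
from pathlib import Path

FD_THRESHOLD = 0.80      # from the text: 80% of the maximum file descriptors
THREAD_THRESHOLD = 0.90  # from the text: 90% of the maximum threads


def pressure_conditions(fd_used, fd_max, threads_used, threads_max):
    """Map raw usage numbers to condition-style booleans (True = under pressure)."""
    return {
        "FDPressure": fd_max > 0 and fd_used / fd_max >= FD_THRESHOLD,
        "ThreadPressure": threads_max > 0
        and threads_used / threads_max >= THREAD_THRESHOLD,
    }


def read_host_usage(proc=Path("/proc")):
    """Read current usage on a Linux host (data sources are an assumption)."""
    # /proc/sys/fs/file-nr holds: allocated handles, unused handles, system maximum.
    allocated, _unused, fd_max = (
        int(x) for x in (proc / "sys/fs/file-nr").read_text().split()
    )
    threads_max = int((proc / "sys/kernel/threads-max").read_text())
    # The fourth field of /proc/loadavg is "runnable/total" scheduling entities.
    threads_used = int((proc / "loadavg").read_text().split()[3].split("/")[1])
    return allocated, fd_max, threads_used, threads_max
```

On a Linux host, `pressure_conditions(*read_host_usage())` would evaluate both checks against live numbers. Keeping the threshold comparison separate from the /proc reads makes the comparison easy to test without a real node.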

In addition, the mechanism by which K8S currently marks a node NotReady depends on the parameter settings of kube-controller-manager. When a node's network is completely cut off, it is hard for K8S to detect the anomaly within seconds, which is unacceptable in some scenarios (such as live streaming and online meetings). For this case, NPDPlus integrates a distributed node health detection function that can determine within seconds whether a node's network is healthy and whether the node can communicate with other nodes, without relying on communication with the K8S master components. The implementation and use of this function will be described in detail in later articles.

Node self-healing

The purpose of collecting node health status is to detect node anomalies before business Pods become unavailable, so that operators or developers can repair Docker, Kubelet, or the node itself. To reduce the burden on operators, NPDPlus can carry out different self-healing actions according to the collected node status. The cluster administrator can configure a self-healing action for each node state, such as restarting Docker, restarting Kubelet, or restarting the CVM node. At the same time, to prevent a node avalanche in the cluster, self-healing actions are strictly rate limited so that nodes are not restarted on a large scale. The specific strategies are:

Only one node in the cluster is allowed to perform a self-healing action at any given time, and there must be an interval of at least 1 minute between two self-healing actions.

When a new node is added to the cluster, it is given a tolerance time of 2 minutes, to avoid mistaken self-healing caused by the instability of a node that has just joined.

If a node is still abnormal after the restart-CVM self-healing action has been triggered, the node will not perform any self-healing action for the next 3 hours.
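The three strategies above can be sketched as a small gate that every self-healing action must pass. The durations come from the text. The class, method names, and the `"restart_cvm"` action label are hypothetical, and measuring the 1-minute interval and the 3-hour cooldown from the end of an action is an assumption, since the article does not specify the reference point.

```python
from dataclasses import dataclass, field
from typing import Optional

MIN_INTERVAL = 60            # seconds between two self-healing actions
NEW_NODE_GRACE = 2 * 60      # tolerance time for newly joined nodes
REBOOT_COOLDOWN = 3 * 3600   # pause after a CVM restart that did not help


@dataclass
class HealingLimiter:
    """Hypothetical cluster-wide gate deciding whether a node may self-heal now."""
    in_progress: Optional[str] = None
    last_action_at: Optional[float] = None
    joined_at: dict = field(default_factory=dict)
    cooldown_until: dict = field(default_factory=dict)

    def node_joined(self, node, now):
        self.joined_at[node] = now

    def may_heal(self, node, now):
        if self.in_progress is not None:
            return False  # only one node heals at a time
        if self.last_action_at is not None and now - self.last_action_at < MIN_INTERVAL:
            return False  # too soon after the previous action
        if now - self.joined_at.get(node, float("-inf")) < NEW_NODE_GRACE:
            return False  # node joined too recently
        if now < self.cooldown_until.get(node, float("-inf")):
            return False  # node is cooling down after a failed CVM restart
        return True

    def start(self, node, now):
        """Claim the single healing slot; returns False if healing is not allowed."""
        if not self.may_heal(node, now):
            return False
        self.in_progress = node
        return True

    def finish(self, node, now, action, still_abnormal):
        """Release the slot and record cooldowns (assumed to start at finish time)."""
        self.in_progress = None
        self.last_action_at = now
        if action == "restart_cvm" and still_abnormal:
            self.cooldown_until[node] = now + REBOOT_COOLDOWN
```

Timestamps are passed in explicitly (in seconds) rather than read from the clock, which keeps the rules deterministic. In a real cluster this state would have to be shared across nodes, for example through the K8S API, which this sketch does not attempt.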

NPDPlus records every self-healing action it performs in the Node's Events, making it easier for cluster administrators to understand what has happened on the Node.

User's guide

On the cluster details page, click Component Management on the left side, and select NodeProblemDetectorPlus (Node Anomaly Detection Plus) in Component Management.

By configuring the NodeProblemDetectorPlus parameters, you can choose which self-healing action to perform for each node state.

Select OK and click Finish to create the component.

In the cluster's Component Management page, check that NPDPlus is shown as running successfully.

That is the answer to the question of what to do when a K8S node is abnormal. I hope the above content is of some help to you.
