Today I will talk about how to troubleshoot the abnormal NotReady status of Node nodes in a production K8S cluster. Many people may not be familiar with this, so I have summarized the following content in the hope that you will get something out of this article.
First, a brief introduction to the article
A Node in a production K8s cluster changed to the NotReady state, which caused every container on that node to stop serving; this article records the troubleshooting that followed.
What is described here is a problem I actually solved in the project's live environment. Besides how to fix it, the more important part is how to track down the cause of the problem.
It also took quite a long time to figure out why the Node was stuck in the unavailable NotReady state.
Second, Pod status
Before analyzing the NotReady state, we first need to know which states a Pod can be in within K8s and what each state means; the different states intuitively show where the current Pod is in its creation and run cycle.
To avoid confusing the concepts of Node and Pod, let's briefly describe the relationship between them (citing an official K8S picture).
The figure shows intuitively that the outermost layer is the Node; a Node can run multiple Pods, and one layer further in, each Pod can run multiple application containers.
Therefore, the unavailability of the Node described in this article directly makes all containers on that Node unavailable.
There is no doubt that the health of a Node directly affects the health of all the containers under it, and even of the whole K8S cluster.
So how do we troubleshoot and fix the health status of a Node? No hurry, let's talk about the lifecycle states of a Pod first.
Pending: this phase indicates that the Pod has been accepted by Kubernetes, but one or more containers have not yet been created; the Pod is being scheduled by kube-scheduler.
1: the number 1 in the figure indicates that container creation begins after the Pod has been successfully scheduled, but creation can fail at this stage.
Waiting or ContainerCreating: the container enters these two states when the image pull fails or a network error occurs during container creation.
Running: this stage indicates that the container is running normally.
Failed: a container in the Pod exited with a non-zero (abnormal) status.
2: a possible state at stage 2 is CrashLoopBackOff, which means the container started normally but exited abnormally (and keeps being restarted).
Succeeded: all containers in the Pod terminated successfully and will not be restarted.
The states above are only the common ones in the Pod lifecycle; there are other states that are not listed here.
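As a quick way to see these states in practice, here is a minimal check; the namespace and Pod names are placeholders, adjust them to your own cluster:
# list Pods with their current status (Pending, Running, CrashLoopBackOff, ...)
kubectl get pods -n <namespace> -o wide
# print only the phase field of a single Pod
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.phase}'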
That is a little much to take in at once. Take a 3-second break.
But back to the point: is the status of a Pod the same thing as the status of a Node? Mm-hmm... no, not exactly the same thing.
However, when a container service is unavailable, checking the status of the Pod first is very important. So the question is: if the Node's service is unavailable, can the Pod still be accessed?
The answer is: no.
Therefore, the point of checking Pod health here is to find out what caused the Node service to become unavailable, which makes it a very important troubleshooting indicator.
Third, business review
Since my work is related to the Internet of Things, let's assume there are four servers (setting aside the performance of the servers themselves; if that were the cause, upgrading the hardware would be the better fix). One of them runs the K8S master, and the other three machines serve as Worker nodes.
Each Worker is a Node. Now we need to start images on the Node nodes; when everything is normal, every Node is in the Ready state.
But after a while, it looked like this.
This is what we mean by a Node becoming NotReady.
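To confirm the node status from the command line, the standard status query can be used:
# the STATUS column shows Ready or NotReady for every node
kubectl get nodes
# keep watching for status changes over time
kubectl get nodes -w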
Fourth, problem location and analysis
It was running fine and then became NotReady. What is NotReady?
It's been running for a while, and now you tell me it's not ready?
All right, let's find out why it isn't ready.
4.1 problem analysis
Let's go back to the question raised earlier: after the Node becomes NotReady, is the Pod container still running normally?
What is marked with a red box in the figure is the node edgenode; the Pod status is shown as Terminating, indicating that the Pod has stopped serving.
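If you want to run this check yourself, the Pods scheduled to a given node can be listed with a field selector; the node name below is a placeholder:
# list all Pods scheduled to a specific node, across all namespaces
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>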
Next let's analyze why the Node node is not available.
(1) First, check the physical environment of the server. Use the command df -m to check disk usage.
Or use the command free directly to view memory usage.
The disk is not full, which means there is enough physical space.
(2) Then look at CPU utilization with the command top -c (pressing uppercase P sorts by CPU usage in descending order).
CPU usage is also within a normal range, and nothing is unusual in terms of physical disk space or CPU performance. So why is the Node unavailable? The server itself is also running normally.
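For convenience, the physical checks from steps (1) and (2) can be run together; this is only a consolidated sketch of the same commands mentioned above:
# disk usage in MB per filesystem
df -m
# memory usage, in MB
free -m
# CPU usage per process; inside top, press uppercase P to sort by CPU in descending order
top -c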
This seems to be a bit of a dilemma. How can this be done?
(3) Don't panic, there is one more thing we can use as a basis for troubleshooting: use the kubectl describe command to view the detailed information of the Node. The complete command is:
kubectl describe node <node-name>, and for the Node in the figure the output is as follows:
Hmm, I seem to see some useful information in this output. First, look at the message Kubelet stopped posting node status, which roughly means that kubelet has stopped reporting the node status; Kubelet never posted node status means that the node status can no longer be received at all.
To check whether kubelet itself is running properly, use the command systemctl status kubelet. If the status is failed, you need to restart it. But if it is running normally, please continue reading.
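Beyond systemctl status, the kubelet logs usually contain the concrete reason; here is a minimal sketch, assuming kubelet runs as a systemd service (as in most kubeadm installs):
# current service state; look for active (running) vs failed
systemctl status kubelet
# follow recent kubelet logs
journalctl -u kubelet -f
# only the last hour, useful for narrowing down when the node went NotReady
journalctl -u kubelet --since "1 hour ago"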
The analysis seems to be getting somewhere: but why does kubelet report the node status in the first place? This brings up another piece of knowledge about Pods. Please read on patiently.
Fifth, Pod health checks and PLEG
According to the analysis above, it seems that the node status is no longer being reported, causing the Node to become unavailable; this brings us to the lifecycle and health checks of Pods.
The full name of PLEG is Pod Lifecycle Event Generator.
Simply put, it adjusts the container runtime state according to Pod-level events and writes it to the Pod cache to keep the Pod state up to date.
In the figure above you can see that kubelet checks the health status of Pods. kubelet is a daemon running on each node, and it periodically checks the health information of Pods; first, take a look at an official picture.
PLEG detects the state of the running containers, and kubelet collects that state through a polling mechanism.
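When PLEG itself falls behind, kubelet usually reports it in its own logs; a quick way to look for such messages (the exact wording of the log line varies between Kubernetes versions):
# search kubelet logs for PLEG-related messages such as "PLEG is not healthy"
journalctl -u kubelet | grep -i pleg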
At this point there seems to be a direction: the Node becoming NotReady has something to do with Pod health detection. Precisely because the default time limit was exceeded, the K8S cluster took the Node out of service.
So why was no health report received in time? Let's take a look at the default detection times in K8S.
On the cluster server, open the file /etc/kubernetes/manifests/kube-controller-manager.yaml and check these parameters:
--node-monitor-grace-period=40s (grace period before a node is marked NotReady)
--node-monitor-period=5s (node status polling interval)
The two parameters above mean that the controller manager checks the node status every 5 seconds; if no healthy node status has been received after 40 seconds, the node is set to NotReady, and by default all Pods on the node are evicted 5 minutes later.
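For reference, a minimal sketch of how these flags typically appear in the kube-controller-manager static Pod manifest; the exact file layout differs between clusters, so treat this as illustrative only:
spec:
  containers:
  - command:
    - kube-controller-manager
    - --node-monitor-period=5s
    - --node-monitor-grace-period=40s
    - --pod-eviction-timeout=5m0s
    # other flags omitted
Here --pod-eviction-timeout is the flag (available in older kube-controller-manager versions) behind the 5-minute eviction mentioned above; newer clusters handle this through taint-based eviction instead.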
The Pod eviction policy is briefly described in the official documentation: https://kubernetes.io/zh/docs/concepts/scheduling-eviction/eviction-policy/
Having kubelet poll for Pod status is actually a costly operation; as the number of Pod containers grows, it consumes a serious amount of resources.
A developer on GitHub has also shared his views on this. The original link is as follows:
https://github.com/fabric8io/kansible/blob/master/vendor/k8s.io/kubernetes/docs/proposals/pod-lifecycle-event-generator.md
So far we have analyzed almost everything; the conclusions we can draw are as follows:
As the number of Pods increases, the pressure that kubelet polling puts on the server increases and CPU resources become tight.
kubelet polling for Pod status is inevitably affected by the network.
The physical hardware resources of the Node are limited, so it cannot hold more containers.
Due to the hardware limitations at the time and the poor network environment, I only changed the two parameters above to lengthen the interval at which Pod health status is polled, and the actual effect did improve. After changing them, docker and kubelet need to be restarted:
# docker needs to be restarted
sudo systemctl restart docker
# kubelet needs to be restarted
sudo systemctl restart kubelet
However, if conditions permit, it is best to optimize in terms of hardware.
Improve the physical resources of Node nodes
Optimize K8S network environment
Sixth, common K8S commands
Finally, here are some commonly used K8S commands.
1. Query all Pods in a namespace
kubectl get pods -n <namespace>
2. Query all nodes
kubectl get nodes
3. View Pod details and logs
kubectl describe pod <pod-name> -n <namespace>
kubectl logs -f <pod-name> -n <namespace>
4. View the Pod YAML definition
kubectl get pod <pod-name> -n <namespace> -o yaml
5. Query Pods by label
kubectl get pod -l app=<label> -n <namespace>
6. Query a specific piece of information from a Pod
kubectl -n <namespace> get pods | grep <pod-name> | awk '{print $3}'
7. Delete a Pod (or by label with -l app=<label>)
kubectl delete pod <pod-name> -n <namespace>
8. Delete a deployment
kubectl delete deployment <deployment-name> -n <namespace>
9. Force delete a Pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
10. Enter a Pod container
kubectl exec -it <pod-name> -n <namespace> -- sh
11. Label a node
kubectl label node <node-name> app=<label>
12. View nodes with a specific label
kubectl get node -l "app=<label>"
13. View all node labels
kubectl get node --show-labels=true
After reading the above, do you have a better understanding of how to troubleshoot the abnormal NotReady state of Node nodes in a K8S cluster?