How to review runtime exceptions in K8S

2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article reviews a container runtime exception in a K8S cluster: how it was noticed, how it was diagnosed, and how it was fixed. It may serve as a reference if you run into a similar problem.

I. Overview

We received a K8s event alert from the production environment:

On seeing the alert, I quickly checked the service and found it running normally. However, a release was in progress and the old pod could not be destroyed, so I suspected that a node had become unavailable. Sure enough, checking the node status from the control node showed that the node was NotReady.
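A minimal sketch of that node check, assuming kubectl is configured on the control node. The helper name not_ready_nodes is ours, not part of kubectl; it only filters the output of `kubectl get nodes --no-headers`, whose second column is the node status.

```shell
# not_ready_nodes: read `kubectl get nodes --no-headers` output on stdin and
# print the name and status of every node whose status does not start with
# "Ready" (so "Ready,SchedulingDisabled" is skipped, "NotReady" is reported).
not_ready_nodes() {
    awk '$2 !~ /^Ready/ { print $1, $2 }'
}
```

Typical use would be `kubectl get nodes --no-headers | not_ready_nodes`.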

II. Inspection

The alert reported a kernel problem, but that alone should not make a node unavailable, so I looked more closely. As a first step I tainted the node so that no new pods could be scheduled onto it.

When a cluster node enters the NotReady state, the first thing to check is whether the kubelet running on that node is healthy. Here, the systemctl command showed that kubelet was running normally as a daemon managed by systemd. Viewing the kubelet log with journalctl, however, revealed the following error.
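Tainting the node could look roughly like the sketch below. The taint key and value (maintenance=true) and the helper name taint_node are assumptions for illustration; the article does not say which taint was used. Setting KUBECTL=echo prints the command instead of running it.

```shell
# taint_node NODE [KEY=VALUE]: add a NoSchedule taint to NODE so that no new
# pods are scheduled onto it. Pods already running there are not evicted.
taint_node() {
    node="$1"
    taint="${2:-maintenance=true}"   # hypothetical key/value, choose your own
    "${KUBECTL:-kubectl}" taint nodes "$node" "${taint}:NoSchedule"
}
```

For example, `taint_node ali-worker-k8s-001`. Running `kubectl cordon` on the node is another way to stop new pods from being scheduled there.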

Dec 11 19:38:45 ali-worker-k8s-001 kubelet[20140]: E1211 19:38:45.239546 20140 kubelet.go:1551] error killing pod: failed to "KillPodSandbox" for "31321cfc-1bbe-11ea-893e-00163e14447d" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
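When the log is long, it can help to summarize which pods are affected. The sketch below is one way to do that, assuming the kubelet log lines have the same shape as the one shown above; the helper name killpod_failures is ours.

```shell
# killpod_failures: read kubelet log lines on stdin and print, for each pod UID
# whose KillPodSandbox call failed, the UID followed by the number of failures.
killpod_failures() {
    sed -n 's/.*failed to "KillPodSandbox" for "\([^"]*\)".*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

It could be fed from the kubelet unit log, for example `journalctl -u kubelet --since "1 hour ago" | killpod_failures`, after `systemctl status kubelet` has confirmed that the service itself is running.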

The pod order-oms-64544b9c65-4lq5d_sec-mall could not be killed, which left docker deadlocked, so I concluded that the problem lay in containerd.

I asked an expert at Alibaba, who explained:

The shim acts as the parent process that reaps the processes in a container, just as systemd reaps system processes. If systemd hangs on Linux, a pile of defunct (zombie) processes builds up. The synchronization mechanism in older versions of the shim uses a channel with a capacity of 32, so in theory it overflows if more than 32 processes exit at the same time.

So I exec'd into the container and found that it was running many processes, well over 32. I therefore chose to upgrade containerd to solve the problem.
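The process count can be compared against that limit with a small helper. This is only a sketch: the name over_shim_limit is ours, and the default of 32 simply mirrors the channel size quoted above.

```shell
# over_shim_limit [LIMIT]: count the non-empty lines on stdin (one line per
# process) and report whether the count exceeds LIMIT (default 32).
over_shim_limit() {
    limit="${1:-32}"
    count=$(awk 'NF { n++ } END { print n + 0 }')
    if [ "$count" -gt "$limit" ]; then
        echo "$count processes: above the limit of $limit"
    else
        echo "$count processes: within the limit of $limit"
    fi
}
```

Assuming the container image ships a ps that supports `-o pid=`, it could be used as `kubectl exec <pod> -- ps -e -o pid= | over_shim_limit`, where `<pod>` stands for the affected pod.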

The specific steps are as follows:

1. Download containerd 1.2.10

    wget https://download.docker.com/linux/centos/7/x86_64/stable/Packages/containerd.io-1.2.10-3.2.el7.x86_64.rpm

2. Stop the kubelet process

    systemctl stop kubelet

3. Stop containerd

    systemctl stop containerd

4. Update the rpm version

    rpm -Uvh containerd.io-1.2.10-3.2.el7.x86_64.rpm

5. Start containerd and check the version

    systemctl start containerd
    ctr version

6. Start docker and check the container processes

    systemctl start docker
    docker ps

7. Start kubelet

    systemctl start kubelet

8. Schedule a pod to the node to verify that it works normally.

III. Another bug seen on the Internet: a systemd problem

1. What is PLEG

The error in that case clearly tells us that the container runtime is not working and that PLEG is unhealthy. Here the container runtime refers to the docker daemon: kubelet controls the lifecycle of containers by talking directly to the docker daemon. PLEG stands for pod lifecycle event generator. It is the mechanism kubelet uses to track the state of the container runtime, and it doubles as a health check of that runtime. Kubelet could do this purely by polling, but polling is costly, which is why PLEG was introduced. PLEG tries to check the container runtime in an "interrupt" style, although in practice it combines polling with the "interrupt" mechanism.

On seeing this error, we can be fairly sure that something is wrong with the container runtime. On the affected node, trying to start a new container with the docker command gets no response, which confirms that the error report is accurate.
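To make the polling half more concrete, here is a toy model of it: take two successive snapshots of container states and report what changed between them. This is only an illustration of the idea, not kubelet's implementation; the file format and the name relist_diff are assumptions.

```shell
# relist_diff OLD NEW: OLD and NEW are files holding one "container_id state"
# pair per line, i.e. two successive snapshots of the runtime. Prints one line
# per container whose state differs, as "id old -> new" ("none" means absent).
relist_diff() {
    awk '
        FILENAME == ARGV[1] { old[$1] = $2; next }   # first file: remember old states
        {
            seen[$1] = 1
            prev = ($1 in old) ? old[$1] : "none"
            if (prev != $2) print $1, prev, "->", $2
        }
        END {
            # containers present in the old snapshot but gone from the new one
            for (id in old) if (!(id in seen)) print id, old[id], "->", "none"
        }
    ' "$1" "$2" | sort
}
```

If the runtime stops answering, no new snapshot can be taken at all, which is the situation the unhealthy PLEG report reflects.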
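A hung docker command blocks the terminal, so it is safer to probe with a time limit. The sketch below assumes GNU coreutils `timeout` is available on the node; the helper name responds_within is ours.

```shell
# responds_within SECONDS CMD [ARGS...]: run CMD with a time limit and print
# "responsive", "unresponsive" (timed out) or "failed" (exited non-zero in time).
responds_within() {
    secs="$1"
    shift
    if timeout "$secs" "$@" >/dev/null 2>&1; then
        echo "responsive"
    else
        rc=$?
        if [ "$rc" -eq 124 ]; then   # 124 is the status GNU timeout returns on expiry
            echo "unresponsive"
        else
            echo "failed"
        fi
    fi
}
```

For example, `responds_within 10 docker ps` gives a quick indication of whether the docker daemon still answers.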

2. Container runtime

The container runtime includes the docker daemon, containerd, containerd-shim, and runC. The containerd component is responsible for the lifecycle management of containers on the cluster node and exposes a gRPC interface to the docker daemon.

Because the bug described in that case was a systemd problem, I also upgraded systemd. This only requires running yum update systemd.

If you run into a situation where containers cannot get a network from the network plug-in, you can delete the cni0 device with ip link del dev cni0 and let it be recreated automatically.

This concludes the review of the runtime exception in K8S. I hope it proves helpful.
