Today I will talk about how to analyze and resolve a K8s Pod eviction, a topic many people may not understand well. To make it easier to follow, I have summarized the whole process below and hope you get something out of this article.
1. Problem phenomenon and analysis
Environment description:
CentOS 7.3, Kubernetes 1.14, Docker 18.09
Symptom: kubectl get pod showed that the service Pod had been evicted and then ran into problems when it was scheduled to other nodes. The scheduling failure itself was expected, because a taint constraint in the deployment manifest means the Pod cannot be scheduled to other nodes. But why was the Pod evicted in the first place? Checking /var/log/messages revealed a large number of errors about images that could not be pulled, as shown below:
Image deletion problem:
Nov 7 06:20:49 k8work2 kubelet: E1107 06:20:49.829886 13241 remote_image.go:113] PullImage "k8s.gcr.io/kube-proxy:v1.14.2" from image service failed: rpc error: code = Unknown desc = Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp 74.125.204.82:443: connect: connection timed out
Nov 7 06:20:49 k8work2 kubelet: E1107 06:20:49.866132 13241 pod_workers.go:190] Error syncing pod 4fedf7b3-207e-11eb-90a3-2c534a095a16 ("kube-proxy-pqvvb_kube-system(4fedf7b3-207e-11eb-90a3-2c534a095a16)"), skipping: failed to "StartContainer" for "kube-proxy" with ErrImagePull: "rpc error: code = Unknown desc = Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp 74.125.204.82:443: connect: connection timed out"
These logs mean the container failed to start because its image could not be pulled; the same pull failures also appeared for coredns, the controller and other system images.
At first sight this seemed easy to explain: the cluster sits on a private network, and the images were copied onto each node during installation. Yet running docker images | grep k8s showed that all of the Kubernetes images were gone. Could someone have deleted them? Otherwise, why would they disappear for no reason?
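A quick check like the following helps confirm what is left on the node; this is only a sketch, and the image prefix and log path are assumptions based on the environment above:
docker images | grep k8s.gcr.io                                  # list whatever Kubernetes system images are still cached locally
grep -i "ErrImagePull\|PullImage" /var/log/messages | tail -n 20 # recent pull failures reported by kubelet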
So I went back through the logs and found the entries below, which confirm that the images were not deleted by a person but reclaimed by kubelet:
Nov 7 05:44:51 k8work2 kubelet: I1107 05:44:51.041315 13241 image_gc_manager.go:317] attempting to delete unused images
Nov 7 05:44:51 k8work2 kubelet: I1107 05:44:51.083785 13241 image_gc_manager.go:371] [imageGCManager]: Removing image "sha256:6858809bf669cc5da7cb6af83d0fae838284d12e1be0182f92f6bd96559873e3" to free 1231725 bytes
Why would kubelet reclaim the very images Kubernetes itself needs to run? Let's not go into the reason here; it will become clear below.
Missing manifests problem:
Nov 7 06:20:47 k8work2 kubelet: E1107 06:20:47.943073 13241 file_linux.go:61] Unable to read config path "/etc/kubernetes/manifests": path does not exist, ignoring
This message is reported a lot online: worker nodes do not run static Pods and therefore have no manifests directory, yet kubelet keeps logging this error. The noise is annoying, so I simply created the manifests folder under /etc/kubernetes/ and the message went away. This error should have nothing to do with the Pod eviction discussed in this article; I have seen the same message on other worker nodes as well.
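For reference, the quick fix on the worker node is just to create the empty directory kubelet is looking for (the path comes from the log above):
mkdir -p /etc/kubernetes/manifests   # an empty directory is enough; kubelet only needs the path to exist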
Orphaned volume problem:
Nov 7 09:32:03 k8work2 kubelet: E1107 09:32:03.431224 13241 kubelet_volumes.go:154] Orphaned pod "f6a977f4-2098-11eb-90a3-2c534a095a16" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
Go into /var/lib/kubelet/pods/ and, using the pod UID from the log, enter the corresponding directory. The container data is still in there, and information such as the pod name is still kept in the etc-hosts file.
From the error message we can infer that there is an orphaned Pod on this worker node whose data volume is still present on disk, which prevents kubelet from collecting and cleaning it up normally, so it keeps logging the error above. After confirming that the Pod was no longer running and that there was no risk of data loss, I ran rm -rf f6a977f4-2098-11eb-90a3-2c534a095a16 directly, and the error stopped appearing.
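A rough inspect-then-delete sequence looks like the following; the pod UID comes from the log above, and the final step should only run once you are sure the data is disposable:
cd /var/lib/kubelet/pods/
ls f6a977f4-2098-11eb-90a3-2c534a095a16/             # see what the orphaned pod left behind
cat f6a977f4-2098-11eb-90a3-2c534a095a16/etc-hosts   # confirm which pod this directory belonged to
rm -rf f6a977f4-2098-11eb-90a3-2c534a095a16          # remove it only after confirming the pod is gone and the data is not needed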
Eviction problem:
Nov 7 07:21:19 k8work2 kubelet: E1107 07:21:19.021705 13241 eviction_manager.go:576] eviction manager: pod es-0_log(aa41dd4c-2085-11eb-90a3-2c534a095a16) failed to evict timeout waiting to kill pod
Nov 7 07:21:22 k8work2 kubelet: I1107 07:21:22.883681 13241 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 86% which is over the high threshold (85%). Trying to free 21849563955 bytes down to the low threshold (80%).
Nov 7 07:21:22 k8work2 kubelet: E1107 07:21:22.890923 13241 kubelet.go:1278] Image garbage collection failed multiple times in a row: failed to garbage collect required amount of images. Wanted to free 21849563955 bytes, but freed 0 bytes
These logs indicate that disk pressure was too high and had exceeded the threshold, so I checked the disk with df -h. Sure enough, a service on this machine had generated a large volume of logs and was eating up the disk. But even if the disk is too full, why reclaim the images? The official documentation explains it roughly as follows:
Garbage collection is a useful feature of kubelet that cleans up unused images and containers. Kubelet performs garbage collection on containers every minute and on images every five minutes.
The image garbage collection policy considers only two factors: HighThresholdPercent and LowThresholdPercent. Disk usage above the high threshold (HighThresholdPercent) triggers garbage collection, which deletes least recently used images until disk usage drops to the low threshold (LowThresholdPercent).
The container garbage collection policy considers three user-defined variables. MinAge is the minimum age at which a container can be garbage collected. MaxPerPodContainer is the maximum number of dead containers allowed per pod. MaxContainers is the maximum total number of dead containers. These can be disabled individually by setting MinAge to zero and setting MaxPerPodContainer and MaxContainers to less than zero. Kubelet also handles containers that are unidentified, deleted, or outside the boundaries set by these parameters, and the oldest containers are usually removed first.
For K8s users, the above kubelet parameters can be adjusted. For more information on how to adjust them, please see https://kubernetes.io/zh/docs/concepts/cluster-administration/kubelet-garbage-collection/.
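As a rough illustration only, these policies map onto kubelet flags such as the ones below. The values shown are the 1.14 defaults, not recommendations, and the file path assumes a kubeadm-style install on CentOS where extra flags are passed via KUBELET_EXTRA_ARGS; adjust to your environment:
# /etc/sysconfig/kubelet (example sketch)
KUBELET_EXTRA_ARGS="--image-gc-high-threshold=85 --image-gc-low-threshold=80 --minimum-container-ttl-duration=0 --maximum-dead-containers-per-container=1 --maximum-dead-containers=-1"
systemctl restart kubelet   # apply the new flags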
At this point we have essentially found the reason the Pod was evicted: disk pressure exceeded the threshold. In Kubernetes' eyes this worker node was already unhealthy, so the garbage collection mechanism kicked in. Following the default collection policy, it deleted the locally cached Kubernetes images, which in a private network then caused the image pulls to fail; when the evicted Pod was rescheduled, the taint constraint prevented it from being placed on other nodes, so it failed to start.
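To double-check this chain of events on the node, a few quick checks like the following can help; this is only a sketch, and the node name, mount points and namespace are assumptions taken from the logs above:
df -h /var/lib/docker /var/log                              # confirm which filesystem crossed the 85% threshold
du -sh /var/log/* | sort -h | tail                          # find the directories actually eating the space
kubectl describe node k8work2 | grep -iA2 taint             # see the taints involved in the scheduling constraint
kubectl get events -n log --sort-by=.lastTimestamp | tail   # recent eviction and scheduling events in the namespace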
2. Problem solving process
I then located the folder that was filling the disk: it was the directory named after the Pod's PVC. After confirming that its data could be discarded, I deleted the whole PVC folder and its contents. When I restarted the Pod it stayed in the Init state, and the event log said the PVC could not be found. Looking more closely at the Pod's manifest, I saw that this is a stateful application managed by a StatefulSet (sts). As we know, a StatefulSet starts its Pods in a fixed order, gives each one a stable network identity, and writes to fixed storage; since I had wiped out the storage it depends on, the Pod warned that it could not start.
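The pod name and namespace below come from the eviction log earlier (pod es-0 in the log namespace); the commands are only a sketch of how the StatefulSet and its claims can be inspected:
kubectl get sts -n log              # confirm the workload is a StatefulSet
kubectl describe pod es-0 -n log    # the events show the missing PVC that blocks the Init state
kubectl get pvc,pv -n log           # see which claims and volumes the StatefulSet expects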
The stateful Pod could not start because it was still trying to reuse the old PVC, so the fix was simply to delete the PVC and PV and recreate them. I therefore ran kubectl delete pvc pvc_name -n log, and something interesting happened again: the PVC got stuck in Terminating and could not be deleted.
I looked up solutions online, and there are roughly two approaches:
Delete directly in etcd
Use kubectl patch
Because etcdctl is not installed locally, I used kubectl patch to solve the problem. The commands are as follows:
kubectl delete pvc pvc_name -n log
kubectl patch pvc pvc_name -p '{"metadata":{"finalizers":null}}' -n log
The deletion has to be forced through this way because native Kubernetes does not roll back a deletion once it has been requested; the object stays in Terminating until its finalizers are cleared. This is arguably an embodiment of Kubernetes' eventual-consistency principle.
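After the patch, a short check-and-recreate sequence like the following finishes the job; the names are placeholders, and whether the PVC comes back automatically depends on the StatefulSet's volumeClaimTemplates:
kubectl get pvc -n log            # the Terminating PVC should now be gone
kubectl delete pv pv_name         # also remove the released PV if its data is no longer needed
kubectl delete pod es-0 -n log    # recreating the Pod lets the StatefulSet controller recreate the PVC from volumeClaimTemplates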
After restarting the Pod, it started successfully, and the PVC and PV were recreated.
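A final check like the following confirms everything is back (again just a sketch, using the namespace from above):
kubectl get pod,pvc,pv -n log     # the Pod should be Running and the new PVC Bound to a fresh PV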
3. Summary
Through this article, we can see two points:
First, when something goes wrong with a k8s cluster, read the logs carefully. Start with Kubernetes' own event information; if that yields no clue, check the system logs, where you will normally find the relevant error.
Second, never delete data blindly: make sure you understand what is going on before deleting anything, or you will create unnecessary trouble for yourself.
Of course, to solve this kind of problem once and for all, we still need monitoring and regular inspection. Once a disk alarm fires, the system administrator can be notified immediately, or further cleanup can be triggered automatically, so the service never fails in the first place. If you have any questions, feel free to follow the official account or add me on WeChat to discuss.