How to troubleshoot the problem of missing some nodes in the process of scheduling pod by K8s Scheduler

2025-01-18 Update From: SLTechnology News&Howtos


This article explains how to troubleshoot the problem of the K8s Scheduler missing some nodes while scheduling a pod. The method introduced here is simple, fast, and practical; let's walk through it.

Problem phenomenon

On the TKE console, create a standalone cluster of version v1.18.4 (full version number < v1.18.4-tke.5). The cluster's node information is as follows: there are 3 master nodes and 1 worker node, and the worker and the masters are in different availability zones.

| node | role | label |
| --- | --- | --- |
| ss-stg-ma-01 | master | [failure-domain.beta.kubernetes.io/region=sh, failure-domain.beta.kubernetes.io/zone=200002] |
| ss-stg-ma-02 | master | [failure-domain.beta.kubernetes.io/region=sh, failure-domain.beta.kubernetes.io/zone=200002] |
| ss-stg-ma-03 | master | [failure-domain.beta.kubernetes.io/region=sh, failure-domain.beta.kubernetes.io/zone=200002] |
| ss-stg-test-01 | worker | [failure-domain.beta.kubernetes.io/region=sh, failure-domain.beta.kubernetes.io/zone=200004] |

After the cluster is created, creating a daemonset object shows one of the daemonset's pods stuck in the Pending state:

```
$ kubectl get pod -o wide
NAME          READY   STATUS    RESTARTS   AGE   NODE
debug-4m8lc   1/1     Running   1          89m   ss-stg-ma-01
debug-dn47c   0/1     Pending   0          89m
debug-lkmfs   1/1     Running   1          89m   ss-stg-ma-02
debug-qwdbc   1/1     Running   1          89m   ss-stg-test-01
```

(Note: the latest version TKE currently supports is v1.18.4-tke.8, and newly created clusters use the latest version by default.)

Problem conclusion

When the k8s scheduler schedules a pod, it synchronizes a snapshot from the scheduler's internal cache; the snapshot stores the node info the pod can be scheduled onto. The cause of the problem above (a daemonset pod instance stuck in Pending) is that some node info was lost during that synchronization, so some of the daemonset's pod instances could not be scheduled onto their designated nodes and stayed Pending.

The detailed troubleshooting process follows.

Log troubleshooting

Node information appearing in the screenshots (from the customer's production cluster):

k8s master nodes: ss-stg-ma-01, ss-stg-ma-02, ss-stg-ma-03
k8s worker node: ss-stg-test-01

1. Get the scheduler's logs

First, dynamically raise the scheduler's log level, for example directly to V(10), to try to capture relevant logs. With the raised log level, some key information was captured (the screenshot is not reproduced here).

To explain: when scheduling a pod, the scheduler may enter its preemption (preempt) phase, and the log above comes from that phase. The cluster has 4 nodes (3 master nodes and 1 worker node), but only 3 nodes appear in the log; one master node is missing. So, for now, we suspect that the scheduler's internal cache is missing node info.

2. Get the scheduler's internal cache

k8s v1.18 already supports printing the scheduler's internal cache. The printed internal cache is as follows (the screenshot is not reproduced here):

As you can see, the node info in the scheduler's internal cache is complete (3 master nodes and 1 worker node). From the logs we can draw a preliminary conclusion: the node info in the scheduler's internal cache is complete, but when a pod is scheduled, some of that node info goes missing from the snapshot.

Root cause of the problem

Before we go any further, let's familiarize ourselves with the scheduler's pod-scheduling process (shown in part below) and the nodeTree data structure.

Pod scheduling process (partial presentation)

As the figure shows, the scheduling of one pod is one Scheduler Cycle. At the start of this cycle, the first step is the snapshot update. The snapshot can be understood as the cache used within the cycle: it stores the node info that pod scheduling needs, and the snapshot update is a synchronization from nodeTree (the node information stored in the scheduler's internal cache) to the snapshot. The synchronization is implemented mainly by the nodeTree.next() function, whose logic is as follows:

```go
// next returns the name of the next node. NodeTree iterates over zones and in each zone iterates
// over nodes in a round robin fashion.
func (nt *nodeTree) next() string {
	if len(nt.zones) == 0 {
		return ""
	}
	numExhaustedZones := 0
	for {
		if nt.zoneIndex >= len(nt.zones) {
			nt.zoneIndex = 0
		}
		zone := nt.zones[nt.zoneIndex]
		nt.zoneIndex++
		// We do not check the exhausted zones before calling next() on the zone. This ensures
		// that if more nodes are added to a zone after it is exhausted, we iterate over the new nodes.
		nodeName, exhausted := nt.tree[zone].next()
		if exhausted {
			numExhaustedZones++
			if numExhaustedZones >= len(nt.zones) { // all zones are exhausted; we should reset.
				nt.resetExhausted()
			}
		} else {
			return nodeName
		}
	}
}
```

Combined with the conclusion of the troubleshooting above, we can narrow the scope of the problem further: the synchronization from nodeTree (the scheduler's internal cache) to the snapshot loses some node information.

# nodeTree data structure (for ease of understanding, this article presents it as a linked list)

The nodeTree data structure has two cursors, zoneIndex and lastIndex (one per zone), which drive the synchronization from nodeTree (the scheduler's internal cache) to snapshot.nodeInfoList. Importantly, the cursor values left over from one synchronization are used as the starting values for the next one.
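For reference, here is a minimal Go sketch of this structure. The field names follow the kube-scheduler source, but the payload is simplified to plain node names; the real tree stores richer node objects and per-zone counts, so treat this as an illustration only.

```go
package main

import "fmt"

// nodeArray holds the nodes of one zone plus the zone-level cursor.
type nodeArray struct {
	nodes     []string
	lastIndex int // zone-level cursor: next node to hand out within this zone
}

// nodeTree groups nodes by zone plus the tree-level cursor.
type nodeTree struct {
	tree      map[string]*nodeArray // zone label -> nodes in that zone
	zones     []string              // zone iteration order
	zoneIndex int                   // cursor: which zone next() visits next
}

// newClusterTree models the cluster from this article: three masters in
// zone "sh:200002" and one worker in zone "sh:200004".
func newClusterTree() *nodeTree {
	return &nodeTree{
		tree: map[string]*nodeArray{
			"sh:200002": {nodes: []string{"ss-stg-ma-01", "ss-stg-ma-02", "ss-stg-ma-03"}},
			"sh:200004": {nodes: []string{"ss-stg-test-01"}},
		},
		zones: []string{"sh:200002", "sh:200004"},
	}
}

func main() {
	nt := newClusterTree()
	// Both cursors survive from one sync to the next; nothing zeroes them
	// between two snapshot updates, which is what makes the bug possible.
	fmt.Printf("zoneIndex=%d lastIndex=%d\n", nt.zoneIndex, nt.tree["sh:200002"].lastIndex)
}
```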

# reproduce the problem and locate the root cause

When a k8s cluster is created, the master nodes join first and the worker nodes after them (that is, a worker node joins the cluster later than the master nodes).

First round of synchronization: the three master nodes are created, and then pod scheduling happens (for example, a CNI plugin deployed in the cluster as a daemonset), which triggers a synchronization from nodeTree (the scheduler's internal cache) to the snapshot. After it finishes, the two nodeTree cursors hold the following values:

nodeTree.zoneIndex = 1, nodeTree.nodeArray["sh:200002"].lastIndex = 3

Second round of synchronization: the worker node joins the cluster, and then a new daemonset is created, which triggers the second round of synchronization from nodeTree (the scheduler's internal cache) to the snapshot. The synchronization proceeds as follows:

1. zoneIndex=1, nodeArray["sh:200004"].lastIndex=0, we get ss-stg-test-01.

2. zoneIndex=2 >= len(zones), so zoneIndex=0; nodeArray["sh:200002"].lastIndex=3, the zone is exhausted.

3. zoneIndex=1, nodeArray["sh:200004"].lastIndex=1, the zone is exhausted; all zones are now exhausted, so both cursors are reset.

4. zoneIndex=0, nodeArray["sh:200002"].lastIndex=0, we get ss-stg-ma-01.

5. zoneIndex=1, nodeArray["sh:200004"].lastIndex=0, we get ss-stg-test-01.

6. zoneIndex=2 >= len(zones), so zoneIndex=0; nodeArray["sh:200002"].lastIndex=1, we get ss-stg-ma-02.

After the synchronization is complete, the snapshot.nodeInfoList of the scheduler gets the following result:

[ss-stg-test-01, ss-stg-ma-01, ss-stg-test-01, ss-stg-ma-02]

Where is ss-stg-ma-03? It was lost during the second round of synchronization (note also that ss-stg-test-01 appears twice).

Solution

From the root-cause analysis, we can see that the problem is caused by the nodeTree cursors zoneIndex and lastIndex (one per zone) retaining their values between synchronizations, so the solution is to force the cursors to reset (return to zero) at the start of every synchronization.

At this point, you should have a deeper understanding of how to troubleshoot the K8s Scheduler missing some nodes while scheduling pods. You may as well try it out in practice.
