How to Troubleshoot High Kubelet CPU Utilization

This article explains how to troubleshoot high Kubelet CPU utilization. The approach described here is simple, fast, and practical, so let's walk through it step by step.

Problem background

We found that the CPU utilization of the kubelet process on every worker node in a customer's Kubernetes cluster had been abnormally high for a long time; pidstat showed it reaching 100%. This article records the process of troubleshooting that problem.
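
As a quick way to reproduce that observation, kubelet's CPU usage can be sampled with pidstat. This is a minimal sketch, assuming the sysstat package is installed and kubelet runs as a single process on the node:

# Sample CPU usage of the kubelet process every 5 seconds, 3 times
pidstat -u -p $(pidof kubelet) 5 3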

Cluster environment

Investigation process

Using strace to trace the kubelet process

1. Since the CPU utilization of the kubelet process is abnormal, we can use strace to dynamically trace its system calls. First, use strace -c -p to summarize the time, calls, and errors of every system call made by the kubelet process over a period of time.
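
The invocation looks roughly like the following; press Ctrl-C after about a minute so strace detaches and prints the summary table (using pidof to look up the PID is just one convenient option):

# Attach to kubelet and count time, calls and errors per system call
strace -c -p $(pidof kubelet)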

As you can see from the summary output, futex returned more than 5,000 errors during the sampling window, which is not a normal number, and it accounts for 99% of the time, so we need to look more closely at the futex-related calls made by the kubelet process.

2. Since strace -c -p only shows the overall call statistics of the process, we can print a timestamp for every system call with strace -tt -p, as follows:
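
A sketch of that invocation, with an optional filter that is not part of the original write-up but narrows the output to futex calls for readability:

# Attach with microsecond timestamps; -f also follows kubelet's threads
strace -tt -f -p $(pidof kubelet)
# Optional: only show futex calls
strace -tt -f -p $(pidof kubelet) -e trace=futex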

Judging from the strace output, a large number of futex-related system calls end with "Connection timed out", returning -1 with ETIMEDOUT, which is why strace -c reported so many errors.

futex is a hybrid user-space/kernel-space synchronization mechanism. When the futex variable indicates contention, the process makes a system call to perform the corresponding operation, such as wait or wake. According to the official documentation, futex takes the following parameters:

futex(uint32_t *uaddr, int futex_op, uint32_t val, const struct timespec *timeout, /* or: uint32_t val2 */ uint32_t *uaddr2, uint32_t val3)

The official documentation explains ETIMEDOUT as follows:

ETIMEDOUT The operation in futex_op employed the timeout specified in timeout, and the timeout expired before the operation completed.

It means that the operation could not complete within the specified timeout. Here futex_op corresponds to the FUTEX_WAIT_PRIVATE and FUTEX_WAKE_PRIVATE operations seen in the output above, and the timeouts basically all occur on FUTEX_WAIT_PRIVATE.

From the system-call level alone, we can see that futex cannot go to sleep smoothly, but what the futexes are actually being used for is still unclear, so the reason for kubelet's soaring CPU usage cannot yet be determined. We need to look further at kubelet's function calls to see where execution is getting stuck.

FUTEX_PRIVATE_FLAG: this flag tells the kernel that the futex is process-private and not shared with other processes; FUTEX_WAIT_PRIVATE and FUTEX_WAKE_PRIVATE above are two operations that carry this flag.

futex reference 1: https://man7.org/linux/man-pages/man7/futex.7.html
futex reference 2: https://man7.org/linux/man-pages/man2/futex.2.html

Using the Go pprof tool to analyze kubelet function calls

In early versions of Kubernetes, debug data could be obtained directly through the debug/pprof API. This API was later removed out of security concerns; see CVE-2019-11248 (https://github.com/kubernetes/kubernetes/issues/81023). Therefore, we enable a proxy through kubectl to obtain the relevant metrics:

1. First, start the API server proxy with the kubectl proxy command:

kubectl proxy --address='0.0.0.0' --accept-hosts='^*$'

Note that if you are using a kubeconfig file copied from the Rancher UI, you need to use a context that specifies the master IP; if the cluster was installed with RKE or another tool, this can be ignored.
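
To confirm the proxy is reachable before collecting profiles, a simple check (assuming the proxy is running locally on the default port 8001) is to query the nodes API through it:

# Should return a JSON NodeList if the proxy and credentials are working
curl -s http://127.0.0.1:8001/api/v1/nodes | head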

2. Build a Golang environment. go pprof needs to run inside a Golang environment; if Golang is not installed locally, you can quickly set one up with Docker:

docker run -itd --name golang-env --net host golang bash

3. Use the go pprof tool to collect and export the metrics. Replace 127.0.0.1 with the IP of the node where kubectl proxy is listening; the default port is 8001. If the Docker environment runs on that same node, 127.0.0.1 can be used. Also replace NODENAME with the corresponding node name.

docker exec -it golang-env bash
go tool pprof -seconds=60 -raw -output=kubelet.pprof http://127.0.0.1:8001/api/v1/nodes/${NODENAME}/proxy/debug/pprof/profile
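
As an optional quick look before generating a flame graph, the same profile endpoint can be summarized directly in the terminal. This is a sketch assuming the same proxy URL and node name as above:

# Print the hottest functions without producing a flame graph
go tool pprof -top -seconds=30 http://127.0.0.1:8001/api/v1/nodes/${NODENAME}/proxy/debug/pprof/profile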

4. The raw pprof output is not easy to read directly and needs to be converted into a flame graph; the FlameGraph tool is recommended for generating the SVG.

git clone https://github.com/brendangregg/FlameGraph.git
cd FlameGraph/
./stackcollapse-go.pl kubelet.pprof > kubelet.out
./flamegraph.pl kubelet.out > kubelet.svg

Once converted to a flame graph, you can open the SVG in a browser and visually see the function calls and the proportion of time each one takes.

5. Analyze the flame graph

As you can see from the kubelet flame graph, the function with the longest call time is k8s.io/kubernetes/vendor/github.com/google/cadvisor/manager.(*containerData).housekeeping. cAdvisor is the metrics collection component built into kubelet; it is mainly responsible for real-time monitoring and performance data collection of the resources and containers on the node, including CPU usage, memory usage, network throughput, and file system usage.

Drilling deeper into the call stack, we find that k8s.io/kubernetes/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs.(*Manager).GetStats accounts for most of the time spent in k8s.io/kubernetes/vendor/github.com/google/cadvisor/manager.(*containerData).housekeeping, indicating that most of the time goes into reading the containers' cgroup state.

6. Since this function takes a long time, let's analyze what the function does.

View the source code: https://github.com/kubernetes/kubernetes/blob/ded8a1e2853aef374fc93300fe1b225f38f19d9d/vendor/github.com/opencontainers/runc/libcontainer/cgroups/fs/memory.go#L162

func (s *MemoryGroup) GetStats(path string, stats *cgroups.Stats) error {
	// Set stats from memory.stat.
	statsFile, err := os.Open(filepath.Join(path, "memory.stat"))
	if err != nil {
		if os.IsNotExist(err) {
			return nil
		}
		return err
	}
	defer statsFile.Close()

	sc := bufio.NewScanner(statsFile)
	for sc.Scan() {
		t, v, err := fscommon.GetCgroupParamKeyValue(sc.Text())
		if err != nil {
			return fmt.Errorf("failed to parse memory.stat (%q) - %v", sc.Text(), err)
		}
		stats.MemoryStats.Stats[t] = v
	}
	stats.MemoryStats.Cache = stats.MemoryStats.Stats["cache"]

	memoryUsage, err := getMemoryData(path, "")
	if err != nil {
		return err
	}
	stats.MemoryStats.Usage = memoryUsage
	swapUsage, err := getMemoryData(path, "memsw")
	if err != nil {
		return err
	}
	stats.MemoryStats.SwapUsage = swapUsage
	kernelUsage, err := getMemoryData(path, "kmem")
	if err != nil {
		return err
	}
	stats.MemoryStats.KernelUsage = kernelUsage
	kernelTCPUsage, err := getMemoryData(path, "kmem.tcp")
	if err != nil {
		return err
	}
	stats.MemoryStats.KernelTCPUsage = kernelTCPUsage

	useHierarchy := strings.Join([]string{"memory", "use_hierarchy"}, ".")
	value, err := fscommon.GetCgroupParamUint(path, useHierarchy)
	if err != nil {
		return err
	}
	if value == 1 {
		stats.MemoryStats.UseHierarchy = true
	}

	pagesByNUMA, err := getPageUsageByNUMA(path)
	if err != nil {
		return err
	}
	stats.MemoryStats.PageUsageByNUMA = pagesByNUMA

	return nil
}

As the code shows, the function reads the memory.stat file, which holds the cgroup's memory usage statistics. In other words, reading this file is what takes so long. So what happens if we read the file manually?

# time cat /sys/fs/cgroup/memory/memory.stat > /dev/null
real    0m9.065s
user    0m0.000s
sys     0m9.064s

As you can see, it took 9 seconds to read this file, which is clearly abnormal.

Based on these results, we found an issue in the cAdvisor repository (https://github.com/google/cadvisor/issues/1774) which indicates that the problem is related to slab memory caching. According to the issue, memory on affected machines is gradually consumed; /proc/meminfo shows that the memory in use is slab memory, i.e. kernel cache pages, the vast majority of which are dentry cache. From this we can conclude that when the processes in a cgroup end, the cgroup's cached objects remain in slab memory, so the cgroup cannot be released and lingers as a zombie cgroup.
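
A quick way to check whether a node fits this pattern is to look at how much memory sits in slab and which caches dominate it (a sketch assuming the slabtop utility from procps is available and the commands are run as root):

# Total slab memory and its reclaimable/unreclaimable split
grep -i -E 'slab|sreclaimable|sunreclaim' /proc/meminfo
# Largest slab caches by size; dentry near the top matches the issue described above
slabtop -o -s c | head -n 15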

In other words, every time a memory cgroup is created, the kernel allocates kernel-space memory for it, which contains the cgroup's caches (dentry and inode), i.e. the directory and file index caches, whose purpose is essentially to speed up reads. However, when all processes in the cgroup exit, this kernel-space cache is not cleaned up.

The kernel allocates memory through the buddy allocator: whenever a process requests memory, at least one memory page is allocated, i.e. at least 4 KB, and memory is likewise freed at least one page at a time. When a request is only tens or hundreds of bytes, 4 KB is far more than it needs, so Linux introduced the slab allocation mechanism to handle these small allocations. As a result, when all processes in a cgroup exit, this slab memory is not easily reclaimed, and the cached data it still holds is read into the stats, which degrades read performance.
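
One commonly used heuristic for spotting such zombie cgroups, not part of the original investigation, is to compare the kernel's count of memory cgroups with the number of cgroup directories that actually exist; a large gap suggests many offline cgroups are lingering:

# Number of memory cgroups the kernel is tracking (subsys, hierarchy, num_cgroups, enabled)
grep memory /proc/cgroups
# Number of memory cgroup directories currently present
find /sys/fs/cgroup/memory -type d | wc -l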

Solutions

1. Clear the node cache. This is only a temporary workaround: dropping the node's memory cache brings kubelet CPU utilization back down for a while, but once the cache builds up again, utilization rises again.

echo 2 > /proc/sys/vm/drop_caches
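
A small sketch for verifying the effect of the workaround: echo 2 asks the kernel to drop reclaimable slab objects (dentries and inodes), after which the earlier read test should be fast again.

sync
echo 2 > /proc/sys/vm/drop_caches
# Re-check how long the read takes once the slab caches are dropped
time cat /sys/fs/cgroup/memory/memory.stat > /dev/null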

2. Upgrade the kernel version

This is fundamentally a kernel problem. The commit on GitHub (https://github.com/torvalds/linux/commit/205b20cc5a99cdf197c32f4dbee2b09c699477f0) shows that the cgroup stats query path has been optimized in kernel 5.2 and later. For a more thorough fix, it is recommended to upgrade the kernel to a version appropriate for your operating system and environment. In addition, Red Hat has backported cgroup performance optimizations in kernel-4.18.0-176 (https://bugzilla.redhat.com/show_bug.cgi?id=1795049); CentOS 8 / RHEL 8 ship kernel 4.18 by default. If you are running RHEL 7 / CentOS 7, consider gradually migrating to the newer operating system and using kernel 4.18.0-176 or later; after all, the container-related experience on newer kernels is much better.
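
Before and after upgrading, the running kernel and distribution release can be checked as follows (assuming a RHEL/CentOS node, as discussed above):

# Currently running kernel version
uname -r
# Distribution release on RHEL/CentOS
cat /etc/redhat-release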

Kernel-related commit: https://github.com/torvalds/linux/commit/205b20cc5a99cdf197c32f4dbee2b09c699477f0
Red Hat kernel bug fix: https://bugzilla.redhat.com/show_bug.cgi?id=1795049

At this point, you should have a deeper understanding of how to troubleshoot high Kubelet CPU utilization. Give it a try in practice.
