
Containerized RDS: Locating Kubernetes Performance Problems with Flame Graphs


Containerized RDS series of articles:

Containerized RDS: "Split-Brain" under the Compute-Storage Separation Architecture

Containerized RDS: Compute-Storage Separation or Local Storage?

Containerized RDS: You Need to Understand How Data Gets Written "Bad"

Containerized RDS: Expanding Kubernetes Storage Capacity with CSI

With the help of CSI (Container Storage Interface) and a few modifications to the Kubernetes core code, the Kubernetes storage-management sub-module can be extended out-of-tree in an efficient, low-coupling way.

As described in "Containerized RDS: Expanding Kubernetes Storage Capacity with CSI", the PVC capacity-expansion (Resize) feature was added out-of-tree.

To go from an executable program to a usable programming product, you also need to design performance benchmarks that match the business requirements and then optimize the bottlenecks they uncover.

Empirical data show that a programming product costs at least three times as much as a debugged program with the same functionality. -- The Mythical Man-Month

This article shares an optimization case from that performance benchmarking:

Discover the performance bottleneck

Locate the problem component

Quickly narrow the scope and locate the problem code-path with a CPU profile and Flame Graph

Targeted optimization

| Discover the performance bottleneck

Test case:

Create 100 PVCs in batch, each with access mode RWO and a capacity of 1 GiB (a minimal client-go sketch of this step follows the expected result below)

Expected test results:

All 100 PVCs are created within 180 seconds, with no errors
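To make the test case concrete, the batch creation can be driven by a short client-go program like the sketch below. This is an illustration only: the StorageClass name qcfs-csi, the default namespace and the PVC naming scheme are assumptions made here for the example, and a recent client-go (whose Create calls take a context) is assumed.

// batch_pvc.go - a minimal sketch of the benchmark driver.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	sc := "qcfs-csi" // hypothetical StorageClass backed by the CSI driver
	for i := 0; i < 100; i++ {
		pvc := &corev1.PersistentVolumeClaim{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("bench-pvc-%03d", i)},
			Spec: corev1.PersistentVolumeClaimSpec{
				AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
				StorageClassName: &sc,
				// In client-go >= v0.29 this field's type is VolumeResourceRequirements.
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{
						corev1.ResourceStorage: resource.MustParse("1Gi"),
					},
				},
			},
		}
		// Create the PVCs back to back; provisioning itself happens asynchronously.
		if _, err := client.CoreV1().PersistentVolumeClaims("default").Create(
			context.TODO(), pvc, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}

The pass/fail criterion is then simply how long it takes all 100 PVCs to leave the Pending state and become Bound.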

All programmers are optimists, and sure enough, problems appeared exactly where they could appear: after 3,600 seconds, 95% of the PVCs were still in the Pending state. Strictly speaking, the feature is unusable in bulk-creation scenarios.

A large number of PVCs stuck in the Pending state.

| Locate the problem component

Many components are involved:

kube-apiserver

kube-controller-manager

kubelet

external-provisioner

external-attacher

csi-driver

qcfs-csi-plugin

With complex calls between components and goroutines everywhere, reading logs directly or instrumenting with debug code is like looking for a needle in a haystack, let alone locating a performance bottleneck. So the first task is to narrow things down to the problem component.

During the test we recorded the resource usage of every component and of the system. Unfortunately, nothing abnormal showed up in CPU usage, memory usage, network I/O or disk I/O.

By walking through the architecture diagram of the storage-management components:

Architecture diagram

together with the business flow, kube-controller-manager, external-provisioner and csi-driver emerge as the main suspects.

Viewing the logs with kubectl logs, suspicious entries turn up in external-provisioner:

I0728 19:19:50.504069       1 request.go:480] Throttling request took 192.714335ms, request: POST:https://10.96.0.1:443/api/v1/namespaces/default/events
I0728 19:19:50.704033       1 request.go:480] Throttling request took 190.667675ms, request: POST:https://10.96.0.1:443/api/v1/namespaces/default/events

external-provisioner's requests to kube-apiserver are triggering client-side rate limiting (throttling).
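For context, this throttling is client-side: client-go wraps every connection to kube-apiserver in a rate limiter whose defaults are conservative (QPS 5, burst 10), and the "Throttling request took ..." message in request.go is logged when a request has to wait on that limiter. The sketch below only shows where those knobs live; the values are illustrative, not what external-provisioner actually uses, and raising them would hide the symptom rather than remove the redundant requests.

// ratelimit.go - where client-go's client-side throttling is configured.
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newClient builds a clientset; rest.Config carries the client-side limits.
// The defaults are QPS=5 and Burst=10; requests beyond that wait in the
// limiter and produce the "Throttling request took ..." log lines above.
func newClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig() // assumes the component runs in-cluster
	if err != nil {
		return nil, err
	}
	cfg.QPS = 20   // illustrative only
	cfg.Burst = 40 // illustrative only
	return kubernetes.NewForConfig(cfg)
}

func main() {
	if _, err := newClient(); err != nil {
		fmt.Println("failed to build client:", err)
	}
}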

external-provisioner becomes the prime suspect.

| Locate the problem code-path

We could dive straight into debugging:

1. Read the external-provisioner code, add debug logs, and understand the logic

2. Keep narrowing the code-path

Iterate on steps 1 and 2 until the problem function is located. This is an effective method.

Or use a CPU profile:

1. Collect stack samples

2. Find the function with the highest share of CPU time during the sampling period and use it as the starting point for debugging

Compared with the first approach, this narrows the scope of the problem faster and saves more time.

With the help of the net/http/pprof package, external-provisioner was CPU-sampled for 60 seconds, which yields the following information (a minimal sketch of how the profiling endpoint is wired up follows below):

A ranking of stacks by percentage of CPU time:

The call relationships between functions and the percentage of CPU time each consumed during the sampling period:
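How the profile is obtained is not shown above, so here is a minimal sketch of the usual pattern: a blank import of net/http/pprof registers the /debug/pprof/ handlers on the default HTTP mux, and the profile can then be pulled over HTTP. The listen address localhost:8080 is an arbitrary choice for this illustration.

// pprof_server.go - expose profiling endpoints in a long-running component.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side-effect import: registers /debug/pprof/* handlers
)

func main() {
	go func() {
		// Serve the pprof endpoints on a dedicated port (illustrative).
		log.Println(http.ListenAndServe("localhost:8080", nil))
	}()

	select {} // stand-in for the component's real work
}

A 60-second CPU profile can then be collected with something like go tool pprof http://localhost:8080/debug/pprof/profile?seconds=60; inside the interactive shell, top prints the percentage ranking and web renders the call graph.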

A few words about net/http/pprof:

It provides CPU profiles and heap profiles

It estimates each stack's share of CPU time by periodically capturing (almost all) stack information during sampling, so the percentages are not 100% accurate

Sampling is not free; 100 Hz collects enough stack information without imposing too much overhead on the application

The CPU sampling frequency defaults to 100 Hz and is hard-coded into the package; setting it above 500 Hz is not recommended

There are a large number of related articles on the Internet, so I won't repeat them here.

Generate a Flame Graph from the collected CPU profile:

A few words about the Flame Graph:

It is drawn with the third-party tool go-torch

Each rectangle represents a stack; the wider it is along the X axis, the larger its share of CPU time during the sampling period, while the Y axis shows the call relationships between stacks

Stacks are sorted alphabetically from left to right

Colors are chosen randomly and carry no specific meaning

There are a large number of related articles on the Internet, so I won't repeat them here.

We can see that the functions addClaim and lockProvisionClaimOperation account for 36.23% of CPU time.

They come from the third-party module kubernetes-incubator/external-storage, which external-provisioner calls.

Therefore, any program that implements volume creation by importing the kubernetes-incubator/external-storage module can reproduce the API throttling.

Then, by adding debug logs along that code-path and working through the logic, the problem can be pinned down quickly:

When creating a volume, external-storage needs to access API resources (configmap, pvc, pv, event, secrets, storageclass and so on). To reduce the load on kube-apiserver, components should not access it directly; instead they should read from a local cache (built with an informer cache). But external-storage happens to access kube-apiserver directly. As the following figure shows, 18.84% of the sampling time is spent in list event, which is the cause of the API throttling.
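The corresponding fix on the read path is to serve those lookups from a shared informer's local cache instead of asking kube-apiserver every time. Below is a minimal sketch of the pattern with client-go's informer factory; the 30-second resync period and the PVC lister are illustrative choices, not the actual patch applied to external-storage.

// informer_cache.go - read API objects from a local, watch-driven cache
// instead of issuing a GET/LIST against kube-apiserver for every call.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One watch per resource keeps the cache fresh.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	pvcLister := factory.Core().V1().PersistentVolumeClaims().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Served from the in-memory cache, not from kube-apiserver.
	pvcs, err := pvcLister.PersistentVolumeClaims("default").List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Println("cached PVCs:", len(pvcs))
}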

Further analysis shows that the flood of list event calls comes from the Leader Election lock being too fine-grained, which leads to heavy lock contention. In a production environment a component runs multiple instances: the instance that wins the Leader Lock becomes the Leader and serves requests, while the other instances run in Slave mode. If the Leader fails, a Slave notices that the Leader Lock has not been renewed within the lease period and can acquire it to become the new Leader and take over the service. This improves the component's availability and also avoids potential data races. Normally, then, there is one lock per component instance, and a new Leader is elected only when the Leader and Slave roles switch. However, external-storage was designed to let multiple instances act as Leader and serve at the same time, which roughly means one lock per PVC: with 100 PVCs, the instances must fight over at least 100 locks.
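As for the locking side, the common pattern today is one lock per component instance rather than one lock per PVC: every instance competes for a single lease, and contention only happens when the Leader role changes. The sketch below shows that pattern with client-go's leaderelection package; the lease name external-provisioner-leader, the kube-system namespace and the timings are assumptions for illustration, not the values used by external-storage.

// leader_election.go - a single Leader Lock for the whole component.
package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname() // this instance's identity in the election

	// One Lease object is the one and only lock for this component.
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"kube-system", "external-provisioner-leader", // illustrative names
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: id},
	)
	if err != nil {
		klog.Fatal(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				klog.Info("became leader, start provisioning")
				<-ctx.Done() // the provisioning loop would run here
			},
			OnStoppedLeading: func() {
				klog.Info("lost leadership, shutting down")
			},
		},
	})
}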

Finally, the cause of the problem is located:

Lock contention leads to API throttling; throttling makes lock acquisition time out; the timeouts trigger retries, which worsen the API throttling further.

The figure below further confirms this: 8.7% of the sampling time is spent in Leader Election.

| Targeted optimization

Once the root cause is found, the fix is not difficult. The problem was later fixed by:

Adopting a shared informer cache

Coarsening the Leader Lock granularity

Regenerating the Flame Graph after the fix shows that the CPU share of the functions addClaim and lockProvisionClaimOperation has dropped to 13.95%.

The throttling keyword disappears from the external-provisioner log.

Creating 100 PVCs now takes less than 60 seconds, and all of them are created successfully with no errors.

| Conclusion

For end users, the interface keeps getting simpler; for developers, there are more and more components, build times keep growing, and with concurrency everywhere, locating problems, especially performance problems, keeps getting harder. An understanding of the architecture helps us quickly pin down the problem component, profiling tools and the Flame Graph help us quickly locate the code-path, and an understanding of the business logic points to the fix.

All programmers are optimists. Whatever the program, the outcome is never in doubt: "This time it will surely run." Or: "I just found the last bug." -- The Mythical Man-Month

| Author profile

Xiong Zhong Zhe, co-founder of Walk Technology

He has worked at Alibaba and Baidu and has more than 10 years of experience with relational databases. He is currently focused on bringing cloud-native technology to relational database services.
