
When a K8s cluster reaches ten-thousand-node scale, how does Alibaba solve the performance problems of each system component?


Author | Zeng Fansong, Senior Technical Expert, Alibaba Cloud Container Platform

This article introduces the typical problems Alibaba encountered while landing Kubernetes in a large-scale production environment and the corresponding solutions, including several performance and stability enhancements to etcd, kube-apiserver, and kube-controller. These key enhancements are what allow Alibaba's internal Kubernetes clusters with tens of thousands of nodes to run smoothly.

Background

Starting from Alibaba's earliest AI system (2013), its cluster management system has gone through several rounds of architecture evolution, culminating in the full adoption of Kubernetes in 2018. The story of that period is fascinating and deserves a share of its own. This article skips the evolution process and does not discuss why Kubernetes won both in the community and inside the company; instead, it focuses on the problems encountered when applying Kubernetes at scale and the key optimizations we made.

In Alibaba's production environment, more than ten thousand applications run in containers, the number of containers across the whole network is in the millions, and they run on more than 100,000 hosts. More than a dozen clusters support Alibaba's core e-commerce business, and the largest cluster has tens of thousands of nodes. Landing Kubernetes at this scale raises enormous challenges: how do we bring Kubernetes to a truly large-scale production level?

Rome was not built in a day. To understand the performance bottlenecks of Kubernetes, we estimated the expected size of a 10k-node cluster based on the current state of Alibaba's production clusters:

- 200,000 pods
- 1,000,000 objects

We built a large-scale cluster simulation platform based on Kubemark: each container starts multiple (50) Kubemark processes, so 200 containers with 4 cores each simulate the kubelets of a 10k-node cluster. When we ran common workloads against this simulated cluster, we found that basic operations such as Pod scheduling had very high latency, reaching an astonishing 10s, and the cluster was in a very unstable state.

When the size of a Kubernetes cluster reaches 10k nodes, every component of the system runs into performance problems, for example:

- etcd: severe read/write latency and denial of service; because of its space limits, etcd cannot store a large number of objects.
- API Server: very high latency when querying pods/nodes; concurrent queries can make the backend etcd run out of memory.
- Controller: cannot perceive the latest changes from the API Server in time, and processing latency is high; after an abnormal restart, service recovery takes several minutes.
- Scheduler: high latency and low throughput, unable to meet the day-to-day operational needs of Alibaba's business, let alone the extreme scenarios of big promotions.

Etcd improvements

To solve these problems, the Alibaba Cloud container platform made major improvements to Kubernetes performance in large-scale scenarios.

First, the etcd level: as the database that stores Kubernetes objects, etcd has a critical impact on the performance of the whole Kubernetes cluster.

In the first round of improvements, we increased the total amount of data etcd could hold by offloading data from etcd to a tair cluster. A significant drawback of this approach is the extra tair cluster: the added operational complexity poses a big challenge to data security in the cluster, and tair's data consistency model is not based on a raft replication group, so data safety is sacrificed.

In the second round, we stored different types of API Server objects in different etcd clusters. Inside etcd this corresponds to different data directories; by routing data from different directories to different backend etcd clusters, we reduced the total amount of data stored in a single etcd cluster and improved scalability.

In the third round, we studied etcd's internal implementation in depth and found that a key factor limiting etcd's scalability is the page allocation algorithm of the underlying bbolt db: as the amount of data stored in etcd grows, the linear search for "a contiguous run of n free pages" in bbolt db degrades significantly.

To solve this problem, we designed a free page management algorithm based on a segregated hashmap, which uses the length of a contiguous page run as the key and the run's starting page id as the value. Free page lookup becomes an O(1) lookup in this segregated hashmap, which greatly improves performance. When a block is released, the new algorithm tries to merge it with address-adjacent pages and updates the segregated hashmap accordingly. A more detailed analysis of the algorithm can be found in the post published on the CNCF blog:

https://www.cncf.io/blog/2019/05/09/performance-optimization-of-etcd-in-web-scale-data-scenario/
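For illustration only, here is a minimal Go sketch of the idea rather than the actual bbolt patch: free page runs are indexed by run length, so allocating n contiguous pages is a hash lookup instead of a linear scan, and freed runs are merged with address-adjacent neighbors. All type and function names here (freelistHashmap, allocate, free) are hypothetical, and details such as splitting larger runs are omitted.

```go
package main

import "fmt"

type pgid uint64

// freelistHashmap is a simplified, hypothetical version of the
// segregated-hashmap freelist: free runs are indexed by run length.
type freelistHashmap struct {
	freemaps map[uint64]map[pgid]struct{} // run length -> set of start page ids
	forward  map[pgid]uint64              // start page id -> run length (for merging)
	backward map[pgid]uint64              // end page id   -> run length (for merging)
}

func newFreelist() *freelistHashmap {
	return &freelistHashmap{
		freemaps: make(map[uint64]map[pgid]struct{}),
		forward:  make(map[pgid]uint64),
		backward: make(map[pgid]uint64),
	}
}

// allocate returns the start page of a free run of exactly n pages, or 0.
// A full implementation would also split larger runs when no exact match exists.
func (f *freelistHashmap) allocate(n uint64) pgid {
	for start := range f.freemaps[n] { // O(1) on average: pick any run of length n
		f.del(start, n)
		return start
	}
	return 0
}

// free returns a run of n pages starting at start to the freelist,
// merging it with address-adjacent free runs.
func (f *freelistHashmap) free(start pgid, n uint64) {
	// Merge with a run that ends right before this one.
	if prevLen, ok := f.backward[start-1]; ok {
		prevStart := start - pgid(prevLen)
		f.del(prevStart, prevLen)
		start, n = prevStart, n+prevLen
	}
	// Merge with a run that begins right after this one.
	if nextLen, ok := f.forward[start+pgid(n)]; ok {
		f.del(start+pgid(n), nextLen)
		n += nextLen
	}
	f.add(start, n)
}

func (f *freelistHashmap) add(start pgid, n uint64) {
	if f.freemaps[n] == nil {
		f.freemaps[n] = make(map[pgid]struct{})
	}
	f.freemaps[n][start] = struct{}{}
	f.forward[start] = n
	f.backward[start+pgid(n)-1] = n
}

func (f *freelistHashmap) del(start pgid, n uint64) {
	delete(f.freemaps[n], start)
	delete(f.forward, start)
	delete(f.backward, start+pgid(n)-1)
}

func main() {
	fl := newFreelist()
	fl.free(10, 3)              // pages 10-12 become free
	fl.free(13, 2)              // pages 13-14 free; merges into a run of 5 starting at 10
	fmt.Println(fl.allocate(5)) // prints 10
}
```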

With this algorithmic improvement, we were able to expand etcd's storage from the recommended 2GB to 100GB, greatly increasing the amount of data etcd can hold without significant read/write latency. We also worked with Google engineers to develop features such as etcd raft learner (similar to the ZooKeeper observer) and fully concurrent read to improve data safety and read/write performance. These improvements were contributed upstream and released in etcd 3.4.

API Server improvements

Efficient node heartbeats

In a Kubernetes cluster, a core problem that limits scaling to larger sizes is how to handle node heartbeats efficiently. In a typical (non-trivial) production environment, kubelet reports a heartbeat every 10 seconds, and each heartbeat request is about 15 KB (it includes dozens of images on the node and information about several volumes), which causes two major problems:

- Heartbeat requests trigger updates to node objects in etcd. In a 10k-node cluster, these updates generate nearly 1 GB/min of transaction logs (etcd records the change history).
- API Server CPU consumption is very high: node objects are large, so serialization/deserialization is expensive, and processing heartbeat requests accounts for more than 80% of API Server CPU time.

To solve this problem, Kubernetes introduced a new built-in Lease API that strips the heartbeat-related information out of the node object (the Lease in the figure above). Where kubelet originally updated the node object every 10 seconds, it now:

- Updates the Lease object every 10 seconds to indicate that the node is alive; the Node Controller judges node liveness from the state of the Lease object.
- For compatibility, updates the node object only every 60 seconds, so that the Eviction Manager and other components can keep working with their original logic.

Because the Lease object is very small, updating it is far cheaper than updating the node object. With this mechanism, Kubernetes significantly reduces API Server CPU overhead and greatly reduces the volume of transaction logs in etcd, taking the supported scale from 1,000 nodes to several thousand nodes. It is enabled by default in Kubernetes 1.14.
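To make the difference concrete, the sketch below renews a coordination.k8s.io/v1 Lease with client-go the way a lightweight heartbeat would, assuming an in-cluster config and a pre-created Lease named after the node in the kube-node-lease namespace. It is a simplified illustration, not the kubelet's actual code, and error handling is trimmed.

```go
package main

import (
	"context"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/utils/pointer"
)

// renewNodeLease updates only the tiny Lease object every interval,
// instead of writing the full (~15 KB) Node object on every heartbeat.
func renewNodeLease(ctx context.Context, cs kubernetes.Interface, nodeName string) {
	leases := cs.CoordinationV1().Leases("kube-node-lease")
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			lease, err := leases.Get(ctx, nodeName, metav1.GetOptions{})
			if err != nil {
				continue // a real kubelet would create the Lease or retry with backoff
			}
			lease.Spec = coordinationv1.LeaseSpec{
				HolderIdentity:       pointer.String(nodeName),
				LeaseDurationSeconds: pointer.Int32(40),
				RenewTime:            &metav1.MicroTime{Time: time.Now()},
			}
			_, _ = leases.Update(ctx, lease, metav1.UpdateOptions{})
		}
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	renewNodeLease(context.Background(), kubernetes.NewForConfigOrDie(cfg), "node-1")
}
```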

API Server load balancing

In a production cluster, multiple API Server nodes are usually deployed to form a highly available Kubernetes cluster for performance and availability. In practice, however, load may become unbalanced across the API Servers of an HA cluster, especially during cluster upgrades or when some nodes restart, which puts a lot of pressure on cluster stability. The original intent was to spread the pressure across API Servers through high availability, but in extreme cases all the load falls back onto a single node, increasing system response time and possibly even crashing that node and triggering an avalanche.

The following figure shows a simulated case from a stress-test cluster: in a three-node cluster, all the load after an API Server upgrade shifts to one of the API Servers, whose CPU usage is far higher than that of the other two nodes.

A natural way to solve load imbalance is to add a load balancer. As described earlier, the main load in the cluster comes from processing node heartbeats, so we add an LB between the API Server and the kubelets. There are two typical approaches:

- Add an LB on the API Server side and have every kubelet connect to the LB; this is the typical Kubernetes cluster delivered by cloud vendors.
- Add an LB on the kubelet side and let the LB choose the API Server.

Verification in the stress-test environment showed that adding an LB does not solve the problem; we had to dig into the communication mechanism inside Kubernetes. We found that, to amortize the cost of TLS connection authentication, Kubernetes clients try hard to reuse the same TLS connection: in most cases all of a client's watchers run over the same underlying TLS connection, and only when that connection breaks is a reconnection, and hence an API Server switch, triggered. As a result, once a kubelet has connected to one API Server, essentially no load switching ever happens. To solve this, we optimized in three areas:

- API Server: treat the client as untrusted and protect itself from overload. When its own load exceeds a threshold, return 429 Too Many Requests to tell the client to back off; when load exceeds a higher threshold, reject requests by closing client connections (a minimal sketch of this server-side protection follows the list).
- Client: when 429 responses are received frequently within a period of time, try to rebuild the connection and switch to another API Server; also periodically rebuild connections and switch API Servers to reshuffle the load.
- Operations: upgrade API Servers with maxSurge=3 to avoid performance jitter during the upgrade.
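The server-side part of this idea can be illustrated with a small, hypothetical Go HTTP middleware; the real kube-apiserver implements load shedding with its own inflight-limit and priority-and-fairness filters, so this is only a sketch of the two-threshold behavior described above.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// overloadProtection sheds load in two stages: above softLimit it returns
// 429 so clients back off; above hardLimit it also closes the connection so
// the client reconnects (and may land on another API Server behind the LB).
func overloadProtection(next http.Handler, softLimit, hardLimit int64) http.Handler {
	var inflight int64
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		n := atomic.AddInt64(&inflight, 1)
		defer atomic.AddInt64(&inflight, -1)

		switch {
		case n > hardLimit:
			// Force the client to drop this connection entirely.
			w.Header().Set("Connection", "close")
			http.Error(w, "overloaded", http.StatusTooManyRequests)
		case n > softLimit:
			// Ask the client to back off and retry later.
			w.Header().Set("Retry-After", "5")
			http.Error(w, "too many requests", http.StatusTooManyRequests)
		default:
			next.ServeHTTP(w, r)
		}
	})
}

func main() {
	api := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", overloadProtection(api, 400, 600))
}
```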

As shown in the monitoring chart in the lower left corner of the figure above, the enhanced version keeps the load on the API Servers roughly balanced, and when two nodes restart (the jitter in the figure) it quickly and automatically returns to a balanced state.

List-Watch & Cacher

List-Watch is the core mechanism for communication between server and client in Kubernetes: all objects in etcd and their updates are reflected into API Server memory through a Reflector, and clients in controllers and kubelets subscribe to data changes through a similar mechanism.

A core problem in the List-Watch mechanism is how to avoid losing data when the connection between client and server is broken and then re-established. In Kubernetes this is solved with a globally increasing version number, resourceVersion. As shown in the figure below, the Reflector stores the version it has synchronized up to; on reconnection it tells the server its current version (5), and the server computes the starting position of the data the client needs from the recent change history kept in memory (7).

It all seems simple and reliable, but...

Inside the API Server, each type of object is kept in a structure called storage, for example:

- Pod Storage
- Node Storage
- Configmap Storage
- ...

Each type of storage keeps a bounded queue of recent changes to its objects, to support watchers that lag slightly behind (due to retries and the like). In general, all types share a single increasing version-number space (1, 2, 3, ..., n); that is, as shown in the figure above, the version numbers of pod objects are guaranteed to increase but not to be contiguous. A client using the List-Watch mechanism may care only about a subset of pods; the most typical example is kubelet, which only cares about the pods bound to its own node. In the figure above, a kubelet cares only about the green pods (2, 5).

Because the storage queue is FIFO, older changes are evicted from the queue as pods are updated. As shown in the figure above, when none of the updates in the queue concern a given client, that client's progress stays at rv=5. If the client reconnects after 5 has been evicted, the API Server cannot tell whether any change the client needs to see occurred between 5 and the current queue minimum (7), so it returns a "too old resource version" error, forcing the client to re-list all the data. To solve this problem, Kubernetes introduced the Watch bookmark mechanism:

In short, the core idea of bookmarks is to maintain a "heartbeat" between client and server: even when there are no updates in the queue that the client cares about, the version number inside the Reflector still needs to keep advancing. As shown in the figure above, the server pushes the current latest version (rv=12) to the client at appropriate moments so that the client's version number keeps up with the server's progress. Bookmarks reduce the number of events that have to be resynchronized after an API Server restart to 3% of what it was (a performance improvement of dozens of times). This feature was developed by the Alibaba Cloud container platform and released in Kubernetes 1.15.
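Clients opt in to bookmarks through their watch options. The hedged client-go sketch below shows a watcher that advances its resourceVersion on both real Pod events and Bookmark events, so it can resume after a disconnect without a full re-list; it assumes an in-cluster config and is illustrative rather than the Reflector's actual code.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// watchPodsWithBookmarks watches pods starting at resourceVersion rv and keeps
// rv moving forward even when no relevant pod changes arrive, because the
// server periodically sends Bookmark events carrying the latest version.
func watchPodsWithBookmarks(ctx context.Context, cs kubernetes.Interface, ns, rv string) error {
	w, err := cs.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		ResourceVersion:     rv,
		AllowWatchBookmarks: true, // ask the API Server for bookmark events
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		rv = pod.ResourceVersion // advance our position in the event stream
		if ev.Type == watch.Bookmark {
			// No pod change, just a progress notification ("heartbeat").
			continue
		}
		fmt.Printf("%s pod %s/%s at rv=%s\n", ev.Type, pod.Namespace, pod.Name, rv)
	}
	return nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	_ = watchPodsWithBookmarks(context.Background(), kubernetes.NewForConfigOrDie(cfg), "default", "0")
}
```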

Cacher & Indexing

Besides List-Watch, the other client access pattern is to query the API Server directly, as shown in the figure below. To guarantee that clients read consistent data across multiple API Server nodes, the API Server serves such queries by fetching the data from etcd. From a performance perspective, this causes several problems:

- No index support: to query the pods on a node, the client must first fetch all pods in the cluster, which is a huge overhead.
- Because of etcd's request-response model, querying too much data in a single request consumes a lot of memory, so queries from the API Server to etcd limit the amount of data per request and complete large queries through pagination; the many round trips introduced by pagination significantly reduce performance.
- To guarantee consistency, the API Server queries etcd with quorum reads, which are cluster-wide and cannot be scaled out.

To solve these problems, we designed a data coordination mechanism between the API Server and etcd that lets clients obtain consistent data from the API Server's cache. The principle is shown below, and the overall workflow is as follows:

- At t0, the client queries the API Server.
- The API Server asks etcd for the current data version, rv@t0.
- The API Server waits until the data version in its Reflector reaches rv@t0.
- The API Server answers the user's request from its cache.

This approach does not break the client's consistency model (interested readers can verify this for themselves). Because user requests are answered from the cache, we can also flexibly enhance the query capabilities, for example with indexes on namespace, node name, and labels. This greatly improves the API Server's capacity for read requests: in ten-thousand-node clusters, the typical describe node latency drops from 5s to 0.3s (hitting the node-name index), and other queries such as get nodes also become several times more efficient.
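A highly simplified sketch of this coordination logic, with hypothetical types and names rather than the actual kube-apiserver cacher code: read the current revision from etcd, wait for the watch cache to catch up to it, then answer from the cache and its node-name index.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// watchCache stands in for the API Server's watch cache: it is kept up to
// date by a reflector-style watch on etcd, knows the latest revision it has
// applied, and maintains an index from node name to pods.
type watchCache struct {
	currentRV  func() uint64
	podsByNode map[string][]string
}

// etcdClient stands in for the real etcd client; we only need the current
// revision, which a quorum read returns in its response header.
type etcdClient struct {
	latestRevision func(ctx context.Context) (uint64, error)
}

// listPodsOnNode answers a "pods on node X" query from the cache without
// breaking consistency: it never returns data older than what etcd held at
// the time the request arrived (rv@t0).
func listPodsOnNode(ctx context.Context, e *etcdClient, c *watchCache, node string) ([]string, error) {
	rvT0, err := e.latestRevision(ctx) // step 2: current version in etcd
	if err != nil {
		return nil, err
	}
	for c.currentRV() < rvT0 { // step 3: wait for the cache to catch up
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(10 * time.Millisecond):
		}
	}
	return c.podsByNode[node], nil // step 4: serve from cache via the node index
}

func main() {
	cache := &watchCache{
		currentRV:  func() uint64 { return 42 },
		podsByNode: map[string][]string{"node-1": {"nginx-0", "nginx-1"}},
	}
	etcd := &etcdClient{latestRevision: func(ctx context.Context) (uint64, error) { return 42, nil }}
	pods, _ := listPodsOnNode(context.Background(), etcd, cache, "node-1")
	fmt.Println(pods)
}
```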

Controller failover

In a 10k-node production cluster, the controllers hold nearly a million objects, and the cost of fetching these objects from the API Server and deserializing them cannot be ignored: restarting a controller can take several minutes to recover, which is unacceptable for a company of Alibaba's size. To reduce the impact of component upgrades on system availability, we need to minimize the downtime that a single controller upgrade causes. We solved this with the approach shown in the following figure:

- Pre-start the standby controller's informers so that the data the controller needs is already loaded.
- When the primary controller is upgraded, it actively releases its Leader Lease, triggering the standby to take over immediately.

With this solution, controller downtime during an upgrade is reduced to seconds (< 2s). Even after an abnormal crash, the standby only has to wait for the leader lease to expire (15s by default) rather than spending minutes resynchronizing data. This enhancement significantly reduces controller MTTR and also reduces the performance impact of controller recovery on the API Server. The same scheme applies to the scheduler.
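A hedged sketch of the takeover mechanics using client-go's leader election: the standby's informers are already warm, and the primary releases its lease on graceful shutdown (ReleaseOnCancel) so the standby does not have to wait for lease expiry. The lock name, namespace, and runController are placeholders; this illustrates the pattern rather than Alibaba's actual controller code.

```go
package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runController is a placeholder for the real reconciliation loops.
func runController(ctx context.Context, factory informers.SharedInformerFactory) {
	<-ctx.Done()
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Warm up informers on the standby too, so a takeover does not have to
	// re-list and deserialize huge numbers of objects from the API Server.
	factory := informers.NewSharedInformerFactory(cs, 30*time.Second)
	factory.Core().V1().Pods().Informer()
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())

	// Hypothetical lock namespace, name, and identity.
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"kube-system", "my-controller",
		cs.CoreV1(), cs.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	)
	if err != nil {
		panic(err)
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true, // release the lease on graceful shutdown so the standby takes over at once
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { runController(ctx, factory) },
			OnStoppedLeading: func() { os.Exit(0) },
		},
	})
}
```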

Customized scheduler

For historical reasons, Alibaba's scheduler uses a self-developed architecture, and due to time constraints this share does not go into the scheduler enhancements in detail. Only two basic ideas are shared here, as shown in the following figure:

- Equivalence classes: a typical scale-out request asks for multiple containers at once, so we divide the requests in the pending queue into equivalence classes and process each class as a batch, which significantly reduces the number of Predicates/Priorities evaluations.
- Relaxed randomization: for a single scheduling request, when the cluster has many candidate nodes there is no need to evaluate all of them; once enough feasible nodes have been found, the scheduler can move on to the later stages of scheduling, trading solution accuracy for scheduling performance (a sketch follows this list).
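The relaxed-randomization idea can be sketched as follows; the feasible predicate and the numbers are hypothetical, and the upstream scheduler later exposed a similar knob as percentageOfNodesToScore. Filtering starts from a random offset and stops as soon as enough feasible nodes have been collected.

```go
package main

import (
	"fmt"
	"math/rand"
)

// findFeasibleNodes scans the node list starting at a random offset and stops
// as soon as `enough` feasible candidates are found, instead of running the
// (expensive) feasibility predicates on every node in the cluster.
func findFeasibleNodes(nodes []string, feasible func(string) bool, enough int) []string {
	candidates := make([]string, 0, enough)
	start := rand.Intn(len(nodes)) // random offset spreads placements across the cluster
	for i := 0; i < len(nodes) && len(candidates) < enough; i++ {
		node := nodes[(start+i)%len(nodes)]
		if feasible(node) {
			candidates = append(candidates, node)
		}
	}
	return candidates // scoring/prioritizing then runs only on this subset
}

func main() {
	nodes := make([]string, 10000)
	for i := range nodes {
		nodes[i] = fmt.Sprintf("node-%d", i)
	}
	// Hypothetical predicate: roughly 30% of nodes have enough free resources.
	feasible := func(n string) bool { return rand.Float32() < 0.3 }
	fmt.Println(len(findFeasibleNodes(nodes, feasible, 50))) // at most 50 candidates
}
```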

Summary

Through this series of enhancements and optimizations, Alibaba successfully applied Kubernetes to its production environment at the very large scale of 10,000 nodes in a single cluster. In particular:

- etcd storage capacity was improved by separating indexes from data and sharding data, and finally by improving the block allocation algorithm of etcd's underlying bbolt db storage engine, which greatly improved etcd's performance with large volumes of data. A single etcd cluster can now support a large-scale Kubernetes cluster, which greatly simplifies the overall system architecture.
- By landing lightweight node heartbeats, improving load balancing across API Server nodes in an HA cluster, adding bookmarks to the List-Watch mechanism, and attacking the most painful List performance bottleneck of large Kubernetes clusters with indexes and caching, it became possible to run a 10,000-node cluster stably.
- Hot standby greatly shortens the service interruption of the controller/scheduler during active-standby switchover, improving the availability of the whole cluster.
- Alibaba developed the two most effective ideas for scheduler performance optimization: equivalence class processing and the relaxed randomization algorithm.

With this series of enhancements, Alibaba successfully runs its core business on Kubernetes clusters with tens of thousands of nodes, and these clusters passed the test of Tmall's 2019 promotions.

About the author:

Zeng Fansong (nickname: chasing spirits) is a senior technical expert on the Alibaba Cloud cloud-native application platform.

He has rich experience in distributed system design and development. In the field of cluster resource scheduling, he has built a self-developed scheduling system that manages hundreds of thousands of nodes, and has deep hands-on experience in cluster resource scheduling, container resource isolation, co-location of different workloads, and related areas. He is currently responsible for the large-scale landing of Kubernetes within Alibaba, applying Kubernetes to Alibaba's core e-commerce business to improve application release efficiency and cluster resource utilization, and stably supporting the 2018 Double 11 and 2019 promotions.

"Alibaba Cloud's native Wechat official account (ID:Alicloudnative) focuses on micro-services, Serverless, containers, Service Mesh and other technology areas, focuses on cloud native popular technology trends, and large-scale cloud native landing practices, and is the technical official account that best understands cloud native developers."
