Getting started with K8s from scratch | etcd performance optimization practice

Source: SLTechnology News&Howtos (Shulou.com), 06/03 report; updated 2025-07-02

Author | Chen Xingyu (Yumu), technical expert, Alibaba Cloud basic technology middle platform

This article is compiled from lesson 17 of the "CNCF x Alibaba Cloud Native Technology Open course".

Guide: etcd is the component that container cloud platforms use to store critical metadata. Alibaba has run etcd for three years, and it again played a key role during this year's Double 11, holding up under the load of the event. Starting from the background of etcd performance, the author walks through best practices for etcd server-side performance optimization and etcd client usage, with the aim of helping you run a stable and efficient etcd cluster.

1. A brief introduction to etcd

etcd originated at CoreOS and is written in Go. It is a distributed key-value storage engine that is widely used by major companies.

The following figure shows the basic architecture of etcd.

As shown above, a cluster has three nodes: one Leader and two Followers. The nodes synchronize data with each other through the Raft algorithm, and each node persists its data in boltdb. A client can complete a request by connecting to any node.

2. Understanding etcd performance

First of all, let's look at a picture:

The figure above is a standard etcd cluster architecture diagram. An etcd cluster can be divided into several core parts, for example the Raft layer (blue) and the Storage layer (red). The Storage layer is further divided into the treeIndex layer and the underlying boltdb layer that persists key/value data. Each of these layers can cost etcd performance.

First, the Raft layer. Raft synchronizes data over the network, so network IO, specifically the RTT and bandwidth between nodes, affects etcd performance. In addition, WAL writes are bounded by disk IO write speed.

Next, the Storage layer. Disk IO fdatasync latency affects etcd performance, as does blocking on index-layer locks. In addition, boltdb Tx locks and the performance of boltdb itself have a large impact on etcd performance.

Beyond these, the kernel parameters of the host that runs etcd and the latency of the gRPC API layer also affect etcd performance.
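To get a rough feel for the fdatasync latency mentioned above, a small probe like the following can be run on the disk that would hold etcd data. This is only a sketch: the function name `measure_sync_latency`, the 4096-byte payload, and the sample count are illustrative choices, not anything taken from etcd, and it writes to the default temporary directory, which may not be the disk you care about.

```python
import os
import tempfile
import time


def measure_sync_latency(samples=20, payload=b"x" * 4096):
    """Return per-write durable-sync latencies (seconds) for a temp file."""
    # os.fdatasync is not available on every platform; fall back to os.fsync.
    sync = getattr(os, "fdatasync", os.fsync)
    latencies = []
    fd, path = tempfile.mkstemp()
    try:
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            sync(fd)  # block until the data reaches stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


if __name__ == "__main__":
    lat = sorted(measure_sync_latency())
    print(f"median sync latency: {lat[len(lat) // 2] * 1000:.3f} ms")
    print(f"max sync latency:    {lat[-1] * 1000:.3f} ms")
```

A dedicated disk benchmark will give more trustworthy numbers; the point here is only that every durable write pays this latency.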

3. etcd performance optimization: server side

Let's take a look at the performance optimization on the etcd server side.

etcd server performance optimization: hardware and deployment

On the hardware side, the server needs enough CPU and memory to keep etcd running. Second, as a database program that depends heavily on disk IO, etcd needs SSDs with very good IO latency and throughput. etcd is a distributed key/value storage system, so network conditions also matter a great deal. Finally, etcd should be deployed on dedicated hosts as far as possible, so that other programs on the host do not interfere with its performance.

See also: the configuration requirements officially recommended by etcd.

etcd server performance optimization: software

etcd's software is divided into several layers. The following briefly introduces the performance optimizations layer by layer. Readers who want to know more can visit the GitHub PRs below for the specific code changes.

The first is an optimization of etcd's in-memory index layer: it optimizes the use of internal locks to reduce wait time. In the original implementation, the internal lock around the BTree was relatively coarse-grained, which hurt etcd performance considerably. The new optimization reduces this impact and lowers latency.

For more information, please refer to the following link:

The second is an optimization for large-scale lease usage: the lease revoke and expiration algorithms were optimized, reducing the time complexity of traversing the list of expired leases from O(n) to O(log n), which resolves the problems seen when leases are used at large scale.
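One common way to reach O(log n) for expiry handling is a min-heap keyed on expiry time. The sketch below illustrates that idea only; whether it matches the actual etcd change is an assumption, and the class `LeaseExpiryQueue` and its methods are hypothetical names.

```python
import heapq


class LeaseExpiryQueue:
    """Hypothetical min-heap of (expiry_time, lease_id) pairs."""

    def __init__(self):
        self._heap = []
        self._expiry = {}  # lease_id -> current expiry time

    def grant(self, lease_id, expiry):
        # Also used for renewals: the stale heap entry is skipped lazily later.
        self._expiry[lease_id] = expiry
        heapq.heappush(self._heap, (expiry, lease_id))  # O(log n)

    def revoke(self, lease_id):
        # The heap entry stays behind and is discarded when it surfaces.
        self._expiry.pop(lease_id, None)

    def pop_expired(self, now):
        """Return IDs of leases whose expiry is <= now, earliest first."""
        expired = []
        while self._heap and self._heap[0][0] <= now:
            expiry, lease_id = heapq.heappop(self._heap)  # O(log n)
            # Skip entries made stale by a renewal or a revoke.
            if self._expiry.get(lease_id) == expiry:
                del self._expiry[lease_id]
                expired.append(lease_id)
        return expired
```

Each expiry check inspects only the top of the heap instead of walking every lease, which is where the saving over a full list scan comes from.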

For more information, please refer to the following link:

Finally, there are optimizations in how the boltdb backend is used: the backend batch size limit and batch interval can now be configured dynamically to suit different hardware and workloads, whereas these parameters used to be fixed, conservative values.
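The interplay of a batch limit and a batch interval can be pictured with a toy batcher that commits when either threshold is reached. This is a simplified model, not etcd's backend code; `BatchCommitter`, its parameters, and the injected `clock` are all hypothetical.

```python
class BatchCommitter:
    """Hypothetical write batcher: commit when either threshold is reached."""

    def __init__(self, batch_limit, batch_interval, clock):
        self.batch_limit = batch_limit        # max pending writes per commit
        self.batch_interval = batch_interval  # max seconds a batch may wait
        self._clock = clock                   # injectable time source
        self._pending = []
        self._batch_start = None              # time of first pending write
        self.commits = []                     # committed batches, for inspection

    def put(self, key, value):
        if not self._pending:
            self._batch_start = self._clock()
        self._pending.append((key, value))
        self._maybe_commit()

    def tick(self):
        """Call periodically so the interval is honored without new writes."""
        self._maybe_commit()

    def _maybe_commit(self):
        if not self._pending:
            return
        too_many = len(self._pending) >= self.batch_limit
        too_old = self._clock() - self._batch_start >= self.batch_interval
        if too_many or too_old:
            self.commits.append(self._pending)
            self._pending = []
```

A larger limit or interval means fewer, bigger disk commits at the cost of more buffered writes, which is why the right values depend on the hardware and the workload.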

There is also the fully concurrent read feature contributed by Google engineers, which optimizes the use of boltdb Tx read-write locks to improve read performance.

A new freelist allocation and recovery algorithm for etcd internal storage, based on a segregated hashmap

There are many other performance optimizations as well. Here we focus on one contributed by Alibaba, which greatly improves the performance of etcd's internal storage: a segregated hashmap-based freelist allocation and recovery algorithm for etcd internal storage.

The figure above shows the architecture of a single etcd node. Internally, boltdb persists all key/value data, so boltdb performance plays a very important role in etcd performance. Inside Alibaba, we use etcd heavily to store internal metadata. In doing so we found some boltdb performance problems, which we share here.

The figure above shows the core allocation and recovery algorithm of etcd's internal storage. Some background first: etcd stores data internally in pages whose size defaults to 4KB. In the figure, each number is a page ID; red means the page is in use, and white means it is not.

When a user deletes data, etcd does not return the storage space to the system immediately. Instead, it keeps the pages internally in a page pool so that later writes can reuse them quickly. This page pool is called the freelist. In the figure, pages 43, 45, 46, 50, and 53 are in use, and pages 42, 44, 47, 48, 49, 51, and 52 are free.

When new data needs 3 contiguous pages, the old algorithm scans from the head of the freelist and eventually returns 47 as the starting page ID. This ordinary linear scan of the internal freelist degrades quickly when the data volume is large or internal fragmentation is severe.
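The linear scan described above can be sketched as follows, using the free page IDs from the example. This is an illustration of the idea rather than boltdb's code; the function name `allocate_linear` is hypothetical, and the sketch only finds the run without removing it from the list.

```python
def allocate_linear(free_ids, n):
    """Return the first start ID of n contiguous free pages, or None.

    free_ids must be sorted ascending. The scan is O(len(free_ids)).
    """
    run_start = None
    run_len = 0
    prev = None
    for page_id in free_ids:
        if prev is not None and page_id == prev + 1:
            run_len += 1                      # extends the current run
        else:
            run_start, run_len = page_id, 1   # gap found: start a new run
        if run_len == n:
            return run_start
        prev = page_id
    return None


free_ids = [42, 44, 47, 48, 49, 51, 52]
print(allocate_linear(free_ids, 3))  # 47
```

With a heavily fragmented freelist, most of the list is walked before a long enough run turns up, which is the slowdown the text describes.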

To solve this problem, we designed and implemented a new freelist allocation and recovery algorithm based on a segregated hashmap. The hashmap uses the size of a contiguous page run as the key, and the value is the set of starting page IDs of runs of that size. When new pages are needed, a single O(1) hashmap lookup returns the starting page ID.

In the example above, when 3 contiguous pages are needed, a hashmap lookup quickly finds the starting page with ID 47.
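A minimal sketch of the size-keyed hashmap, built from the same free page IDs, might look like this. The helper names `build_size_map` and `allocate` are hypothetical, and the sketch only serves exact-size matches; a real allocator would presumably also fall back to splitting a larger run when no run of the exact size exists.

```python
from collections import defaultdict


def build_size_map(free_ids):
    """Group sorted free page IDs into runs: run size -> set of start IDs."""
    size_map = defaultdict(set)
    run_start = prev = None
    for page_id in free_ids:
        if prev is not None and page_id == prev + 1:
            prev = page_id                    # still inside the current run
            continue
        if run_start is not None:
            size_map[prev - run_start + 1].add(run_start)
        run_start = prev = page_id            # start a new run
    if run_start is not None:
        size_map[prev - run_start + 1].add(run_start)
    return size_map


def allocate(size_map, n):
    """Take a run of exactly n pages with one hashmap lookup; None if absent."""
    starts = size_map.get(n)
    if not starts:
        return None
    start = starts.pop()
    if not starts:
        del size_map[n]                       # drop empty buckets
    return start


size_map = build_size_map([42, 44, 47, 48, 49, 51, 52])
print(allocate(size_map, 3))  # 47
```

The cost of building the map is paid once; after that, each allocation is a dictionary lookup regardless of how fragmented the freelist is.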

Releasing pages is also optimized with hashmaps. For example, when pages 45 and 46 are released, they merge with the free runs before and after them into one larger contiguous run with starting page ID 44 and size 6.

To sum up, the new algorithm reduces the time complexity of allocation from O(n) to O(1) and of recovery from O(n log n) to O(1), so etcd's internal storage no longer limits read and write performance. In real scenarios, performance improved by dozens of times, and the recommended storage size for a single cluster can grow from 2GB to 100GB. The optimization is in use inside Alibaba and has been contributed to the open source community.

One more note: the software optimizations mentioned here will be released in a new version of etcd, so keep an eye out for it.
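The merge on release can be sketched with two extra hashmaps, one keyed by the start ID of each free run and one by its end ID, so that neighbors are found with direct lookups. The class `RunIndex` and its fields are hypothetical, and the size-keyed map that a full implementation would update alongside is left out to keep the sketch short.

```python
class RunIndex:
    """Hypothetical index of free runs keyed by start ID and by end ID."""

    def __init__(self):
        self.by_start = {}  # start page ID -> run size
        self.by_end = {}    # end page ID   -> run size

    def _add(self, start, size):
        self.by_start[start] = size
        self.by_end[start + size - 1] = size

    def _remove(self, start, size):
        del self.by_start[start]
        del self.by_end[start + size - 1]

    def release(self, start, size=1):
        """Free a run and merge it with adjacent free runs in O(1)."""
        # Merge backward: a free run that ends right before this one.
        prev_size = self.by_end.get(start - 1)
        if prev_size is not None:
            prev_start = start - prev_size
            self._remove(prev_start, prev_size)
            start, size = prev_start, size + prev_size
        # Merge forward: a free run that starts right after this one.
        next_start = start + size
        next_size = self.by_start.get(next_start)
        if next_size is not None:
            self._remove(next_start, next_size)
            size += next_size
        self._add(start, size)
        return start, size


idx = RunIndex()
for run_start, run_size in [(42, 1), (44, 1), (47, 3), (51, 2)]:
    idx.release(run_start, run_size)
print(idx.release(45, 2))  # (44, 6)
```

Releasing pages 45 and 46 joins the run ending at 44 with the run starting at 47, giving the run of size 6 starting at 44 from the example.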

4. etcd performance optimization: client side

Let's take a look at best practices for performance when using the etcd client.

First, let's review the APIs that the etcd server provides to clients: Put, Get, Watch, Transactions, Leases, and many other operations.

For these client operations, we summarize several best practices:

For Put operations, avoid large values and trim them as much as possible, for example when using CRDs in K8s. Second, etcd is suited to storing metadata key/values that change infrequently, so clients should avoid creating key/values that change frequently; the heartbeat data upload for new nodes in K8s follows this practice. Finally, avoid creating large numbers of leases and reuse them wherever possible. For example, in K8s event data management, events with the same TTL expiration time reuse a similar lease instead of creating a new one.

Finally, please remember one thing: maintaining best practices for client use will ensure that your etcd cluster runs stably and efficiently.

Section summary

That concludes this section. To summarize:

First, we covered the background of etcd performance and the potential bottlenecks that follow from its underlying principles; then we analyzed server-side performance optimization across hardware, deployment, and the core internal software algorithms; finally, we went over best practices for using the etcd client.

Finally, I hope this article gives you something that helps you run a stable and efficient etcd cluster.
