Getting Started with K8s from Scratch | A Hands-on Guide to Understanding etcd

Author | Zeng Fansong, Senior Technical Expert, Alibaba Cloud Container Platform

This article is compiled from Lesson 16 of the "CNCF x Alibaba Cloud Native Technology Open Course".

Introduction: etcd is a distributed, consistent key-value store used for shared configuration and service discovery. Starting from several important milestones in the project's history, this article introduces etcd's overall architecture and the basic principles behind its design, in the hope of helping you better understand and use etcd.

I. The development of the etcd project

etcd was born at CoreOS. It was originally built to solve two problems in cluster management: distributed concurrency control for OS upgrades, and the storage and distribution of configuration files. Based on these needs, etcd was designed to provide a highly available, strongly consistent store for small key-value data.

The project now belongs to the CNCF Foundation and is widely used by large Internet companies such as AWS, Google, Microsoft, and Alibaba.

CoreOS pushed the first version of the code to GitHub in June 2013.

In June 2014, a milestone arrived for the community: the release of Kubernetes v0.4. A brief introduction to Kubernetes is in order here. It is a container management platform developed by Google and contributed to the community; it distills Google's years of experience in container scheduling and cluster management, and it attracted wide attention from its inception. Kubernetes v0.4 used etcd 0.2 as its experimental core metadata storage service, and the etcd community grew rapidly from that point on.

Soon after, in February 2015, etcd released its first official stable version, 2.0. In 2.0, etcd redesigned its implementation of the Raft consensus algorithm and offered users a simple tree-structured data view. etcd 2.0 supported more than 1,000 writes per second, meeting the needs of most application scenarios at the time. After the 2.0 release, through continuous iteration and improvement, the original data storage scheme gradually became a performance bottleneck in the new era, and etcd began the design of the v3 scheme.

In January 2017, etcd released version 3.1, which essentially marked the full maturity of the v3 scheme. The v3 release provides a new set of APIs, reimplements a more efficient mechanism for consistent reads, and offers a gRPC proxy to scale out etcd's read performance. It also includes a large number of GC optimizations and major performance improvements; with v3, etcd can support more than 10,000 writes per second.

In 2018, many projects under the CNCF Foundation adopted etcd as their core data store; by incomplete statistics, more than 30 projects use etcd. In November of the same year, the etcd project itself became a CNCF incubating project. Since joining CNCF, etcd has gained more than 400 contributors, including nine project maintainers from eight companies, among them AWS, Google, and Alibaba.

In 2019, etcd is about to release a new version, 3.4, built jointly by Google, Alibaba, and other companies, which will further improve etcd's performance and stability to meet the demanding scenarios of very large-scale deployments.

II. Architecture and internal mechanisms

Overall architecture

etcd is a distributed, reliable key-value storage system used to store critical data in distributed systems. This definition is very important.

An etcd cluster usually consists of 3 or 5 nodes. The nodes coordinate with each other through the Raft consensus algorithm, which elects one node as the leader; the leader is responsible for data replication and distribution. When the leader fails, the cluster automatically elects another node as leader and re-synchronizes the data. A client only needs to connect to any one of the nodes to read and write data; internal state and data coordination are handled by etcd itself.

In the etcd architecture there is a very important concept called quorum. A quorum is defined as (n+1)/2, that is, any group of more than half of the nodes in the cluster. In a 3-node cluster, etcd can tolerate the failure of one node: as long as any two nodes are alive, etcd can continue to provide service. Similarly, a 5-node cluster can continue to provide service as long as any three nodes are alive. This is the key to the high availability of etcd clusters.

To keep providing service after some nodes fail, etcd must solve a notoriously hard problem: distributed consensus. In etcd this is handled by the Raft consensus algorithm. The algorithm itself is fairly involved and deserves a detailed discussion of its own; here is only a brief sketch to build a basic intuition. A key reason Raft works is that any two quorums must have an intersection (a common member). Therefore, as long as any quorum survives, there is always at least one node that holds all of the data the cluster has confirmed as committed. Based on this property, Raft designs a data synchronization mechanism that, after a leader term change, re-synchronizes all data committed under the previous quorum, thereby guaranteeing data consistency as the state of the whole cluster advances.

etcd's internal mechanisms are complex, but the interface it exposes to clients is simple and direct. As shown in the figure above, we can access cluster data through the client library etcd provides, or access etcd directly over HTTP (similar to using curl). Inside etcd, the data representation is also fairly simple: we can understand etcd's data store directly as an ordered map of key-value data. To make it easy for clients to subscribe to data changes, etcd also supports a watch mechanism; through watch, clients receive incremental updates of data in etcd in real time, enabling business logic such as keeping external state synchronized with etcd.

API introduction

Next, let's look at the interfaces etcd provides. Here we divide them into five groups:

- The first group is Put and Delete. As the figure above shows, put and delete are very simple: a put needs only a key and a value to write data into the cluster, and a delete needs only the key.
- The second group is query operations. etcd supports two kinds of queries: a query for a single specified key, and a query for a specified range of keys.
- The third group is data subscription. etcd provides a Watch mechanism; through watch we can subscribe in real time to incremental data updates in etcd. Watch supports either a single key or a key prefix; in practical scenarios the prefix form is what is usually used.
- The fourth group is transaction operations. etcd provides simple transaction support: a user can specify a set of conditions, perform certain operations when the conditions hold, and perform another set of operations otherwise, similar to an if-else statement in code; etcd guarantees the atomicity of the whole transaction.
- The fifth group is the Leases interface. Lease is a design pattern commonly used in distributed systems; its usage is expanded on later.
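To make the first two groups concrete, here is a minimal Go sketch using the official clientv3 package. The endpoint, key names, and values are placeholders for illustration, not anything from the course, and error handling is deliberately crude:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to a (hypothetical) local etcd endpoint.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Group 1: Put needs a key and a value; Delete needs only the key.
	if _, err := cli.Put(ctx, "/app/config", "v1"); err != nil {
		panic(err)
	}

	// Group 2: query a single key, or a range/prefix of keys.
	resp, err := cli.Get(ctx, "/app/", clientv3.WithPrefix())
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}

	if _, err := cli.Delete(ctx, "/app/config"); err != nil {
		panic(err)
	}
}
```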

Data version mechanism

To use etcd's API correctly, you must first understand the fundamentals of its internal data version numbers.

First, etcd has a concept of term, which represents the term of office of the cluster leader. Whenever a leader change occurs, the term value is incremented by 1. Leader changes happen when a node fails, when the leader node has network problems, or when the whole cluster is stopped and brought up again.

The second version number is called revision, and it represents the version of the global data. Whenever the data changes (creation, modification, or deletion), the revision is incremented by 1. In particular, revision keeps increasing monotonically across leader terms. It is this property of revision that gives every modification in the cluster a unique version number, which is what allows etcd to support MVCC on data and the Watch mechanism through revisions.

For each KeyValue pair, etcd records three versions:

- The first is create_revision, the revision at the time the KeyValue was created.
- The second is mod_revision, the revision at the time its data was last modified.
- The third is version, a counter that records how many times the KeyValue has been modified.
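These three per-key versions, plus the global revision, are visible directly in a clientv3 Get response. A small Go sketch that prints them, reusing a cli client created as in the earlier example:

```go
package main

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// printVersions prints the three per-key versions and the global revision.
// It assumes cli was created as in the earlier sketch.
func printVersions(ctx context.Context, cli *clientv3.Client, key string) error {
	resp, err := cli.Get(ctx, key)
	if err != nil {
		return err
	}
	// Header.Revision is the global data version at the time of this read.
	fmt.Println("global revision:", resp.Header.Revision)
	for _, kv := range resp.Kvs {
		fmt.Printf("%s: create_revision=%d mod_revision=%d version=%d\n",
			kv.Key, kv.CreateRevision, kv.ModRevision, kv.Version)
	}
	return nil
}
```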

We can illustrate this with a diagram:

Within a single leader term, we find that all modification operations carry the same term value (always 2 in the figure), while revision keeps increasing monotonically. After the cluster restarts, all subsequent modification operations carry term 3; within the new leader term, all term values stay at 3 without changing, while the corresponding revision values continue to increase monotonically. Zooming out, across the two leader terms term=2 and term=3, the revision values of the data still increase monotonically on a global scale.

MVCC & streaming watch

Now that you understand etcd's version number control, let's look at how etcd's multiple version numbers are used to implement concurrency control and data subscription (Watch).

etcd supports multiple modifications of the same key, each modification corresponding to a version number. In its implementation, etcd records the data of every change, which means a key has multiple historical versions inside etcd. If no version number is specified in a query, etcd returns the latest version of the key; etcd also supports specifying a version number to query historical data.

Because etcd records every change, when subscribing with watch you can create a watcher from any point in history (by specifying a revision), establishing a data pipeline between the client and etcd through which etcd pushes all data changes starting from the specified revision. The watch mechanism guarantees that any subsequent modification of the key is immediately pushed to the client through this pipeline.
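As a sketch, subscribing from a historical revision with the Go clientv3 client might look like this; the prefix and starting revision are placeholders:

```go
package main

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// watchFrom subscribes to all changes under prefix starting at a
// historical revision rev; it assumes cli from the earlier sketch.
func watchFrom(ctx context.Context, cli *clientv3.Client, prefix string, rev int64) {
	ch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(rev))
	for wresp := range ch {
		for _, ev := range wresp.Events {
			// ev.Type is PUT or DELETE; each event carries the changed KeyValue.
			fmt.Printf("%s %s = %s (mod_revision=%d)\n",
				ev.Type, ev.Kv.Key, ev.Kv.Value, ev.Kv.ModRevision)
		}
	}
}
```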

As shown in the following figure, all of etcd's data is stored in a b+tree (gray), which lives on disk and is mapped into memory via mmap to support fast access. The gray b+tree maintains the mapping from revision to value and supports querying data by revision. Because revisions increase monotonically, when we watch data after a specified revision, we only need to subscribe to changes in this b+tree.

etcd also maintains another btree (blue), which manages the mapping from key to revision. When a client queries by key, etcd first converts the key to the corresponding revision through the blue btree, and then fetches the data through the gray b+tree.

Careful readers will notice that recording every change makes the data grow continuously, which not only consumes memory and disk space but also degrades the query efficiency of the b+tree. To solve this problem, etcd runs a periodic Compaction mechanism that cleans up historical data, removing older historical versions of the same key. The net effect is that the gray b+tree still increases monotonically by revision, but with some holes in it.

Mini-transactions

Having covered the MVCC and watch mechanisms, let's move on to the mini-transaction mechanism etcd provides. etcd's transaction mechanism is fairly simple: it can essentially be understood as an if-else construct, in which the if can contain multiple conditions, as shown in the following figure:

The If contains two conditions: when Value(key1) is greater than "bar" and Version(key1) equals 2, perform the operations specified in Then: set key2 to valueX and delete key3. If the conditions do not hold, perform the alternative operation: set key2 to valueY.

etcd guarantees the atomicity of the entire transaction. That is, all the comparisons in If see a consistent view, and the multiple operations in Then are atomic as well: it will never happen that only half of the Then operations are executed.
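For illustration, the transaction described above could be written with the Go clientv3 API roughly as follows; the key names and values are the ones from the figure, and the surrounding setup is assumed:

```go
package main

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// txnExample reproduces the transaction from the figure; it assumes
// cli from the earlier sketch. Returns whether the If conditions held.
func txnExample(ctx context.Context, cli *clientv3.Client) (bool, error) {
	resp, err := cli.Txn(ctx).
		If(
			clientv3.Compare(clientv3.Value("key1"), ">", "bar"),
			clientv3.Compare(clientv3.Version("key1"), "=", 2),
		).
		Then(
			clientv3.OpPut("key2", "valueX"), // both Then ops commit atomically
			clientv3.OpDelete("key3"),
		).
		Else(
			clientv3.OpPut("key2", "valueY"),
		).
		Commit()
	if err != nil {
		return false, err
	}
	return resp.Succeeded, nil
}
```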

Through the transaction operations etcd provides, we can ensure consistent reads and writes of data under contention. For example, the Kubernetes project mentioned earlier uses etcd's transaction mechanism so that multiple Kubernetes API servers can modify the same piece of data consistently.

The concept and usage of lease

Lease is a common concept in distributed systems, representing a distributed lease. Typically, a lease mechanism is needed whenever a distributed system has to detect whether a node is alive.

The code example in the figure above first creates a 10s lease; if nothing else is done after the lease is created, it expires automatically after 10s. The keys key1 and key2 are then attached to the lease, so that etcd automatically cleans up key1 and key2 when the lease expires; key1 and key2 thus gain the ability to be deleted automatically on timeout.

If you want the lease never to expire, you need to call the KeepAlive method periodically to refresh it. For example, to detect whether a process in a distributed system is alive, you would create a lease in the process and call KeepAlive periodically from it. If all goes well, the node's lease stays valid; if the process crashes, the lease expires automatically.
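The figure itself is not reproduced here, but a minimal Go sketch of the same flow with clientv3, using the 10s TTL and the key names key1 and key2 from the text (everything else is illustrative), might look like this:

```go
package main

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// leaseExample creates a 10s lease, attaches two keys to it, and keeps
// the lease alive; it assumes cli from the earlier sketch.
func leaseExample(ctx context.Context, cli *clientv3.Client) error {
	lease, err := cli.Grant(ctx, 10) // TTL in seconds
	if err != nil {
		return err
	}
	// Both keys share one lease: when it expires, etcd deletes them both.
	if _, err := cli.Put(ctx, "key1", "v1", clientv3.WithLease(lease.ID)); err != nil {
		return err
	}
	if _, err := cli.Put(ctx, "key2", "v2", clientv3.WithLease(lease.ID)); err != nil {
		return err
	}
	// KeepAlive refreshes the lease periodically until ctx is cancelled;
	// if this process dies, the lease (and both keys) expire automatically.
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		return err
	}
	go func() {
		for range ch { // drain keep-alive responses
		}
	}()
	return nil
}
```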

Allowing multiple keys to be attached to the same lease is a clever design in etcd that significantly reduces the overhead of refreshing leases. Imagine a large number of keys that all need a similar lease mechanism: if each key had to renew its lease independently, it would put great pressure on etcd. By binding multiple keys with similar timeouts to one lease, we can aggregate the refreshes, greatly reducing the cost of lease renewal and greatly increasing the scale etcd can support, without losing flexibility.

III. Typical usage scenarios

Metadata storage

Kubernetes stores its own state in etcd and delegates the high availability of that state data to etcd. The Kubernetes system itself no longer needs to handle complex distributed state, and its architecture is greatly simplified as a result.

Service Discovery (Naming Service)

The second scenario is Service Discovery, also known as a naming service. In distributed systems, a common pattern is that a set of peer backend processes (perhaps hundreds of them) together provides a single service, such as a retrieval service or a recommendation service.

For such a backend service, to simplify operations (failed nodes are replaced at any time), the backend processes are usually scheduled by a cluster management system like Kubernetes. A caller (a user or an upstream service) then needs a service discovery mechanism to solve the routing problem. This problem can be solved efficiently with etcd, as follows:

After a backend process starts, it registers its own address into etcd, and the API gateway learns the backend addresses through etcd. When a backend process fails and is migrated, it re-registers with etcd, and the API gateway picks up the new address in time. Furthermore, using the Lease mechanism etcd provides, if a serving process crashes during operation, the API gateway can remove it from the traffic promptly and avoid call timeouts. A sketch of the registration side follows.
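Here is a hedged Go sketch of that registration side: a backend binds its address to a lease so that a crashed process drops out of the registry automatically. The /services/ prefix, the 10s TTL, and all names are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// register announces this backend's address under a (hypothetical)
// /services/ prefix, bound to a lease so a crashed process disappears
// from the registry automatically. Assumes cli from the earlier sketch.
func register(ctx context.Context, cli *clientv3.Client, name, addr string) error {
	lease, err := cli.Grant(ctx, 10)
	if err != nil {
		return err
	}
	key := fmt.Sprintf("/services/%s/%s", name, addr)
	if _, err := cli.Put(ctx, key, addr, clientv3.WithLease(lease.ID)); err != nil {
		return err
	}
	ch, err := cli.KeepAlive(ctx, lease.ID) // heartbeat while the process lives
	if err != nil {
		return err
	}
	go func() {
		for range ch {
		}
	}()
	return nil
}
```

The API gateway side can then Get the /services/<name>/ prefix with clientv3.WithPrefix() to list backends, and Watch the same prefix to track address changes in real time.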

In this architecture, the service state data is taken over by etcd, and the API gateway itself is stateless and can scale horizontally to serve more clients. At the same time, thanks to etcd's good performance, the system can support tens of thousands of backend process nodes, so this architecture can serve large enterprises.

Distributed Coordination: leader election

In distributed systems, a typical design pattern is Master+Slave. Usually the Slave nodes provide resources such as CPU, memory, disk, and network, while the Master coordinates these nodes to provide a service (such as distributed storage or distributed computing). Typical distributed storage services (HDFS) and distributed computing services (Hadoop) adopt similar patterns. This design pattern has a typical problem: the availability of the Master node. When the Master fails, the whole cluster's service is down and user requests cannot be served.

The typical solution is to start multiple Master nodes. Because the Master carries control logic, synchronizing state between multiple active Masters is very complex, so the most common approach is to elect one node as the master to provide service while the other nodes wait on standby.

Leader election for distributed processes is easy to build on the mechanisms etcd provides; for example, the compete-for-ownership logic can be implemented with transactional writes to the same key. Typically, the elected leader registers its own IP in etcd, so that the Slave nodes can fetch the current leader's address in time and the system keeps working in the familiar single-Master mode. When the leader fails, a new node can be elected through etcd; once the new leader registers its IP, the Slaves pull the new master's IP and service resumes.
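One way to sketch this in Go is with the clientv3 concurrency helper package, which wraps the lease and transactional-write details described above; the election prefix and TTL below are illustrative:

```go
package main

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

// campaign blocks until this node becomes leader, then publishes val
// (for example, its IP) under the election prefix. Assumes cli from
// the earlier sketch; names here are illustrative.
func campaign(ctx context.Context, cli *clientv3.Client, val string) error {
	// The session wraps a lease: if this process dies, its leadership
	// key expires and another candidate wins the election.
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		return err
	}
	election := concurrency.NewElection(session, "/my-service/leader")
	if err := election.Campaign(ctx, val); err != nil {
		return err
	}
	// We are now the leader; slaves can read or watch the election
	// prefix to learn the current leader's address.
	return nil
}
```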

Distributed coordination: concurrency control in distributed systems

In a distributed system, when we run certain tasks, such as upgrading the OS, upgrading software on the OS, or executing computing tasks, we usually need to limit the concurrency of the tasks, either to protect a bottlenecked backend service or for business stability. If the tasks lack a coordinating Master node, this kind of distributed coordination can be done through etcd.

In this mode, a distributed semaphore is implemented on top of etcd, and etcd's lease mechanism automatically evicts failed nodes (see the sketch below). Moreover, if a process runs for a long time, we can store some of its progress state in etcd, so that when the process fails and must be restored elsewhere, it can recover its execution state from etcd instead of redoing the whole computation, accelerating the execution of the overall task.
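As one possible sketch of such a distributed semaphore in Go: each worker tries to create one of N slot keys inside a transaction, guarded by a lease so that the slots of failed holders are freed automatically. All key names below are illustrative, and a real implementation would also need to retry or watch for freed slots:

```go
package main

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// acquireSlot tries to grab one of maxSlots semaphore slots by creating
// a slot key that does not exist yet; the key is bound to leaseID so a
// crashed holder frees its slot automatically. Assumes cli and a live
// lease (with KeepAlive) from the earlier sketches.
func acquireSlot(ctx context.Context, cli *clientv3.Client,
	maxSlots int, workerID string, leaseID clientv3.LeaseID) (string, error) {
	for i := 0; i < maxSlots; i++ {
		slot := fmt.Sprintf("/semaphore/slot-%d", i)
		// CreateRevision == 0 means the key does not exist: the slot is free.
		resp, err := cli.Txn(ctx).
			If(clientv3.Compare(clientv3.CreateRevision(slot), "=", 0)).
			Then(clientv3.OpPut(slot, workerID, clientv3.WithLease(leaseID))).
			Commit()
		if err != nil {
			return "", err
		}
		if resp.Succeeded {
			return slot, nil // this worker now holds the slot
		}
	}
	return "", fmt.Errorf("all %d slots busy", maxSlots)
}
```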

Summary

That is the end of the main content of this article; here is a brief summary:

The first part introduced how the etcd project was born and several important moments in its development. The second part introduced etcd's architecture and its basic operation interfaces and, building on an understanding of how etcd achieves high availability, showed some basic data operations and their internal implementation principles. The third part introduced three typical etcd usage scenarios and the design ideas for distributed systems in each scenario.

"Alibaba Cloud's native Wechat official account (ID:Alicloudnative) focuses on micro-services, Serverless, containers, Service Mesh and other technology areas, focuses on cloud native popular technology trends, and large-scale cloud native landing practices, and is the technical official account that best understands cloud native developers."
