Etcd series Part 1-etcd architecture and code framework 07/12 Update SLTechnology News&Howtos

Etcd series Part 1-etcd architecture and code framework

2025-07-12 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Shulou(Shulou.com)06/02 Report--

1. Introduction

As the core component of Huawei Cloud PaaS, etcd implements the functions of data persistence, cluster election, state synchronization and so on of most components of PaaS. For such an important part, only by deeply understanding its architecture design and internal working mechanism can we better learn Huawei Cloud Kubernetes container technology and laugh at the original "jianghu" of Huawei Cloud. This series will refine from the overall framework to the internal process, with a comprehensive interpretation of the code and design of etcd. This article is the first in a series of "profound and simple etcd", focusing on the architecture and code framework of etcd, and the code used below is based on the etcd v3.2.X version.

In addition, the book "Cloud Native distributed Storage Cornerstone: etcd in-depth Analysis" created by Huawei Cloud Container Service team has been officially published and is available on all major platforms. Buy books to learn more about distributed key-value storage and etcd!

2. Introduction to etcd

Etcd is a distributed key-value storage system. The underlying leader election and data backup through the Raft protocol provides highly available data storage, which can effectively deal with the problem of data loss caused by network problems and machine failures. At the same time, it can also provide service discovery, distributed lock, distributed data queue, distributed notification and coordination, cluster election and other functions. Why is etcd so important? Because etcd is the only storage implementation on the back end of Kubernetes, it is no exaggeration to say that etcd is the "heart" of Kubernetes.

2.1 Raft protocol

To understand the working principle of etcd distributed collaboration, we must mention the Raft algorithm. Raft algorithm is a consensus algorithm designed by Stanford's Diego Ongaro and John Ousterhout with the goal of Understandability. Prior to this, when it comes to consensus algorithm (Consensus Algorithm), Paxos is bound to be mentioned, but the implementation and understanding of Paxos are so complex that in the doctoral thesis of the author of the Raft algorithm, the author mentioned that they spent nearly a year studying the various interpretations of the algorithm, but still did not fully understand the algorithm. There is also a great distance between the algorithm principle and the real implementation of Paxos. The system that implements Paxos, such as Chubby, has made a lot of improvements and optimizations to Paxos, but the details are unknown. Raft protocol adopts the idea of divide-and-conquer and divides the problem of distributed cooperation into three problems:

Election: when a new cluster starts, or when an old leader fails, a new leader is elected

Log synchronization: leader must accept log entries from clients and synchronize them to all machines in the cluster

Security: ensure that as long as any node has one log entry in its state machine, it will not have another log entry on the same key.

A Raft cluster typically consists of several nodes, typically five, which can withstand the failure of two of the nodes. Each node is actually maintaining a state machine, and the node is in one of the following three states at all times.

Leader: responsible for synchronous management of logs, processing requests from clients, and keeping in touch with Follower in this heartBeat

Follower: all nodes are in Follower state at startup, respond to the log synchronization request of Leader, respond to the request of Candidate, and forward the transaction to Follower to Leader.

Candidate: responsible for voting in the election. When Raft is started, a node changes from Follower to Candidate to initiate the election, and after electing Leader, it changes from Candidate to Leader.

After the node starts, the first state is follower. In follower state, there will be a timer for election timeout (this time is obtained by adding a random time based on the configured timeout). If the heartbeat packet sent by leader is not received within this time, the node status will become candidate state, that is, the candidate will broadcast the election request in a loop. If more than half of the nodes agree to the election request, the node will be transformed into leader state. If during the election process, it is found that there is already leader or election information with a higher term value, it will automatically become a follower state. A node in the leader state automatically becomes a follower state if it finds that a leader with a higher expiration value exists.

Raft divides the time into Term (as shown in the following figure), which is an increasing integer, and a term is the period between the beginning of the election of leader and the expiration of leader. It's a bit like a presidential term, except that its time is uncertain, which means that as long as leader works well, it may become a dictator and never step down.

2.2 overall code architecture for etcd

It can be roughly divided into the following four modules:

Http: responsible for providing http access interface and http client

Raft state machine: carries on the state transition according to the accepted raft message, invokes the action in each state.

Wal log storage: persistent storage of log entries.

Kv data storage: kv data storage engine, v3 supports different back-end storage, currently using boltdb. Transaction operations are supported through boltdb.

The main changes relative to v2 and v3 are as follows:

Using grpc to communicate between peer and client

V2's store is a tree in memory, and v3 uses an abstract kvstore to support different back-end storage databases. Enhance the transaction ability.

Excluding the unit test code, the number of lines of code in etcd v2 is about 40k and the number of lines in v3 is about 70k.

2.3 typical internal processing flow

We number the various parts of the above architecture diagram so that we can find the location of the components processed by each process in the process description below.

2.3.1 message entry

After an etcd node is running, there are three channels to receive external messages. Taking the request processing of adding, deleting, modifying and querying kv data as an example, this paper introduces the working mechanism of these three channels. 1. The http call to client: it is handled by the ServeHTTP method registered with the keysHandler of the http module. The parsed message is processed by calling the Do () method of EtcdServer. (figure 2) 2. Grpc call to client: the quotaKVServer object is registered with grpc server at startup, and quotaKVServer enhances the kvServer data structure in a combined way. After parsing the grpc message, the Range, Put, DeleteRange, Txn, Compact, and other methods of kvServer are called. KvServer contains an interface for RaftKV, which is implemented by the EtcdServer structure. So the last thing is to call EtcdServer's Range, Put, DeleteRange, Txn, Compact, and so on. (figure 1) 3. Grpc messages between nodes: each EtcdServer contains a Transport structure, and there is a peers map in the Transport, and each peer encapsulates the way a node communicates with another node. Including streamReader, streamWriter, etc., used for sending and receiving messages. There are recvc and propc queues in streamReader, and streamReader will push messages to these queues after processing the received messages. It is handled by peer, and peer calls the Process method of raftNode to process the message. (figure 3, 4)

2.3.2 EtcdServer message processing

For client messages, when you call to EtcdServer processing, you usually register a waiting queue, call the Propose method of node, and then use the waiting queue blocking to wait for the message processing to be completed. The Propose method pushes a MsgProp message to the propc queue. For messages between nodes, the Process of raftNode directly calls the step method of node to push the message to the recvc or propc queue of node. As you can see, all messages from the outside world are now in the recvc queue or propc queue in the node structure. (figure 5)

2.3.3 node processing messages

When node starts, it starts a cooperative process that processes messages in various queues of node, including recvc and propc queues, of course. When you get a message from the propc and recvc queues, the Step method of the raft object is called. The raft object encapsulates the protocol data and operations of raft, and its external Step method is the step method of the real raft protocol state machine. When the message is received, the status change is made according to the protocol type and Term field, or the election request is processed accordingly. For general kv add, delete, change and query data request messages, the internal step method is called. The internal step method is a dynamically changeable method that will change with the state of the state machine. When the state machine is in the leader state, the method is stepLeader;. When the state machine is in the follower state, the method is stepFollower;. When the state machine is in the Candidate state, the method is stepCandidate. The leader state processes the MsgProp message directly. Save the log entries in the message to the local cache. Follower forwards the MsgProp message directly to leader, and the forwarding process is to push the message to raft's msgs array first. After the node processes the message, either the log entry in the cache or the message to be sent is generated. Log entries in the cache need further processing (such as synchronization and persistence), while messages need further processing to be sent. The processing process is still in the node process. NewReady is called at the beginning of the loop to encapsulate the log that needs to be further processed and the message that needs to be sent, as well as the state change information, in a Ready message. Ready messages are pushed to the readyc queue. (figure 5)

2.3.4 treatment of raftNode

The start () method of raftNode starts a separate cooperative process to process the readyc queue (figure 6). Take out the message that needs to be sent, call the Send method of transport and send it out (figure 4). Call the Save method of storage to persistently store log entries or snapshots (figure 9, 10) to update the kv cache. In addition, you need to apply the synchronized logs to the state machine, let the state machine update the status and kv storage, and notify the client waiting for the request to complete. Therefore, it is necessary to encapsulate logs, snapshots and other information that have been determined to be synchronized and push them to the applyc queue in an apply message.

2.3.5 apply processing of EtcdServer

EtcdServer processes the applyc queue and apply both snapshot and entries into the kv store (figure 8). Finally, the Trigger of applyWait is called to wake up the waiting thread of the client request and return the client request.

3. Important data structures

3.1 EtcdServer

Type EtcdServer struct {

/ / the number of snapshot currently being sent inflightSnapshots int64 / / the log indexappliedIndex uint64 that has been apply / / the log index that has been submitted, that is, leader confirms that the log indexcommittedIndex uint64 / / which has been synchronized by most members has been persisted to the indexconsistIndex consistentIndex / / configuration item Cfg * ServerConfig// of kvstore successfully started and registered with cluster, and closed the channel. Readych chan struct {} / / important data results that store the state machine information of raft. R raftNode// how many logs need to be performed snapshotsnapCount uint64// to allow the caller to block and wait for the result of the call in the case of synchronous calls. The following three results of w wait.Wait// are all designed to implement the readMu sync.RWMutexreadwaitc chan struct {} readNotifier * notifier// stop channel stop chan struct {} / / turn off the loop exit in the start function of this channel stopping chan struct {} / / etcd when it stops, which will close the channel done chan struct {} / / error channel to pass in an unrecoverable error and close the raft state machine. Errorc chan error//etcd instance idid types.ID//etcd instance properties attributes membership.Attributes// cluster information cluster * membership.RaftCluster//v2 kv storage store store.Store// for snapshotsnapshotter * snap.Snapshotter//v2 applier, for commited index apply to raft state machine applyV2 ApplierV2//v3 applier, for commited index apply to raft state machine applyV3 applierV3// to wait queue for applyV3applyV3Base applierV3//apply that strips authentication and quota functions Wait for the log apply of an index to complete the kv storage kv mvcc.ConsistentWatchableKV//v3 used by applyWait wait.WaitTime//v3 to achieve the expiration time of lessor lease.Lessor// to guard the lock stored at the backend. Change the backend storage and obtain the backend storage is to use bemu sync.Mutex// backend storage be backend.Backend// storage authentication data authStore auth.AuthStore// storage alarm data alarmStore * alarm.AlarmStore// current node status stats * stats.ServerStats//leader status lstats * stats.LeaderStats//v2 Implement the periodic task of SyncTicker * time.Ticker// compression data with expired ttl data compactor * compactor.Periodic// is used to send a remote request peerRt http.RoundTripper// to generate a request idreqIDGen * idutil.Generator// forceVersionC is used to force the version monitor loop// to detect the cluster version immediately.forceVersionC chan struct {} / / wgMu blocks concurrent waitgroup mutation while server stoppingwgMu sync.RWMutex// wg is used to wait for the go routines that depends on the server state// to exit when stopping the server .wg sync.WaitGroup// ctx is used for etcd-initiated requests that may need to be canceled// on etcd server shutdown.ctx context.Contextcancel context.CancelFuncleadTimeMu sync.RWMutexleadElectedTime time.Time

}

3.2 raftNode

RaftNode is the Raft node that maintains the steps and state transitions of the Raft state machine.

Type raftNode struct {

/ / the current state of the Cache of the latest raft index and raft term the server has seen.// These three unit64 fields must be the first elements to keep 64-bit// alignment for atomic access to the fields.// state machine. Index represents the log that has been apply to the state machine. Index,term is the term of the latest log entry. Lead is the current leader idindex uint64term uint64lead uint64//, including node, storage and other important data structures raftNodeConfig// a chan to send/receive snapshotmsgSnapC chan raftpb.Message// a chan to send out applyapplyc chan apply// a chan to send out readStatereadStateC chan raft.ReadState// utilityticker * time.Ticker// contention detectors for raft heartbeat messagetd * contention.TimeoutDetectorstopped chan struct {} done chan struct {}

}

3.3 node

Type node struct {

/ / Propose queue. Call the Propose of raftNode to plug Propose messages into the propc chan pb.Message//Message queue, and messages other than Propose messages into this queue. When the cluster node changes, the recvc chan pb.Message// cluster configuration information queue The modified information needs to be stuffed into this queue outside the confc chan pb.ConfChange// to obtain the modified cluster configuration information through this queue. Confstatec chan pb.ConfState// has prepared the information queue for apply. Readyc chan Ready// stuffed an empty object into the queue every time the apply is ready. Notification can continue to prepare Ready messages. Advancec chan struct {} / / tick information queue, used to call heartbeat tickc chan struct {} done chan struct {} stop chan struct {} status chan chan Statuslogger Logger

}

4. Summary

This article briefly introduces the raft protocol and the framework of etcd, and introduces the processing of internal and message flows in etcd. The internal mechanism of etcd will be described in detail in different topics such as heartbeat and election, data synchronization, data persistence and so on.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.