SLTechnology News & Howtos > Servers | Shulou (Shulou.com) | 2025-01-17 Update
Preface
With the increasing popularity of the Kubernetes project, etcd, the highly available and strongly consistent service-discovery store the project depends on, has drawn growing attention from developers.
In the era of cloud computing, how can services join a compute cluster quickly and transparently? How can shared configuration be discovered quickly by every node in the cluster? How do you build a service cluster that is highly available, secure, easy to deploy, and fast to respond? These are the problems that need solving, and etcd was built to address them.
Official address: https://coreos.com/etcd/
Project address: https://github.com/coreos/etcd
What is Etcd?
Etcd is a highly available key-value store, mainly used for shared configuration and service discovery. It replicates its log through the Raft consensus algorithm to guarantee strong consistency, so it can be understood as a highly available, strongly consistent service-discovery store.
In Kubernetes clusters, etcd is mainly used for configuration sharing and service discovery.
Etcd chiefly solves the problem of data consistency in distributed systems. Data in a distributed system falls into control data and application data; etcd is designed for control data, though it can also handle very small amounts of application data.
Comparison between Etcd and Zookeeper
ZooKeeper has the following disadvantages:
1. Complexity. Deploying and maintaining ZooKeeper is complex, and administrators must master a range of knowledge and skills; the Paxos family of strong-consistency algorithms has always been known for being complex and hard to understand (etcd uses the Raft protocol, while ZooKeeper uses ZAB, a Paxos-like protocol). Using ZooKeeper is also involved: a client must be installed, and official bindings are provided only for Java and C.
2. Written in Java. This is not a bias against Java, but Java tends toward heavyweight applications and pulls in many dependencies, whereas operators generally want a strongly consistent, highly available cluster to stay as simple as possible and be hard to get wrong in maintenance.
3. Slow development. The Apache Foundation's distinctive "Apache Way" has long been controversial in the open-source community, one main criticism being that the foundation's huge structure and loose management slow project development.
By contrast, etcd is:
1. Simple. It is written and deployed easily in Go, uses HTTP as its interface, and relies on the Raft algorithm to ensure strong consistency in a way users can readily understand.
2. Persistent. By default, etcd persists data as soon as it is updated.
3. Secure. Etcd supports SSL client certificate authentication.
Architecture and terminology of Etcd
Process analysis
Normally, a user request is forwarded through the HTTP Server to the Store for the actual transaction processing. If the request modifies node data, it is handed to the Raft module for state changes and log recording, then synchronized to the other etcd nodes to confirm the commit; finally the data is committed and synchronized once more.
Working principle
Etcd uses the Raft protocol to keep the state of every node in the cluster consistent. Put simply, an etcd cluster is a distributed system in which multiple nodes communicate with one another to present a single service to the outside world. Each node stores the complete data, and the Raft protocol ensures that the data each node maintains stays consistent.
Etcd is mainly divided into four parts:
HTTP Server: handles API requests from users as well as synchronization and heartbeat requests from other etcd nodes.
Store: handles the transactions behind the features etcd supports, including data indexing, node state changes, watch and feedback, and event handling and execution; it is the concrete implementation of most of the API functionality etcd offers users.
Raft: the concrete implementation of the Raft strong-consistency algorithm, and the core of etcd.
WAL: the Write Ahead Log is both etcd's data storage format and a standard way to implement a transaction log. Etcd persists data through the WAL: every change is written to the log before it is committed. A Snapshot is a state snapshot taken to keep the log from growing too large, and an Entry is an individual log record.
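As a sketch of the write-ahead idea (not etcd's actual WAL format), a store can append every update to a log file on disk before applying it, so that state can be rebuilt by replaying the log after a crash. `TinyWAL` below is a hypothetical illustration:

```python
import json
import os

class TinyWAL:
    """Minimal write-ahead log sketch: every update is appended to a
    log file on disk *before* it is applied to the in-memory state."""

    def __init__(self, path):
        self.path = path
        self.state = {}
        self._replay()

    def _replay(self):
        # Recover state by replaying every logged entry in order.
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    entry = json.loads(line)
                    self.state[entry["key"]] = entry["value"]

    def put(self, key, value):
        # 1) Persist the entry to the log first ...
        with open(self.path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2) ... only then apply it to the in-memory state.
        self.state[key] = value
```

A restarted instance pointed at the same file recovers its state purely by replaying the log, which is the property the WAL exists to provide.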
Service discovery
Service discovery is one of the most common problems in distributed systems: how do processes or services in the same distributed cluster find one another and establish connections? In essence, service discovery is about knowing whether any process in the cluster is listening on a UDP or TCP port, and being able to find and connect to it by name. Solving service discovery requires three things:
1. A strongly consistent, highly available store for service metadata. Etcd, built on the Raft algorithm, is exactly such a store.
2. A mechanism for registering services and monitoring their health. Users can register a service in etcd, attach a TTL to the registration key, and refresh it with regular heartbeats so that the service's health is monitored.
3. A mechanism for finding and connecting to services. A service registered under a given topic (directory) in etcd can be discovered under that same topic. To guarantee connectivity, an etcd instance in Proxy mode can be deployed on every service machine, ensuring that any service able to reach the etcd cluster can connect to the others.
For example, with the popularity of Docker containers, architectures composed of many cooperating microservices are increasingly common, and the need to add such services transparently and dynamically keeps growing. With the service-discovery mechanism, a directory is registered in etcd under the service's name, and the IPs of available service nodes are stored in that directory. Consumers of the service simply look up available nodes in that directory and use them.
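The register-with-TTL-and-heartbeat pattern above can be sketched with an in-memory registry; `TTLRegistry` is a hypothetical stand-in for etcd's lease mechanism, not its API:

```python
import time

class TTLRegistry:
    """In-memory sketch of service registration with per-key TTLs,
    mimicking how a service might register itself in etcd and keep
    the entry alive with heartbeats."""

    def __init__(self):
        self._entries = {}  # name -> (address, expiry timestamp)

    def register(self, name, address, ttl):
        self._entries[name] = (address, time.time() + ttl)

    def heartbeat(self, name, ttl):
        # Refresh the TTL; with etcd this would be a lease keep-alive.
        addr, _ = self._entries[name]
        self._entries[name] = (addr, time.time() + ttl)

    def discover(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        addr, expiry = entry
        if time.time() > expiry:
            # The service missed its heartbeats: drop the stale entry.
            del self._entries[name]
            return None
        return addr
```

A service that stops sending heartbeats simply disappears from discovery once its TTL lapses, which is how dead nodes are pruned without any explicit deregistration step.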
Terminology in Etcd clusters
Raft: an algorithm adopted by etcd to ensure the strong consistency of distributed systems
Node: an instance of a Raft state machine
Member: an etcd instance that manages a Node and provides services for client requests
Cluster: an etcd cluster composed of multiple Member that can work together
Peer: the address of other Member in the same cluster
Client: the client that sends the HTTP request to the etcd cluster
WAL: pre-written log, which is the log format used by etcd for persistent storage
Snapshot: a snapshot of etcd's data state, taken to keep the number of WAL files from growing too large
Proxy: a mode of etcd that provides reverse proxy services for etcd
Leader: the node, produced by an election, that handles all data commits in the Raft algorithm
Follower: a node that did not win the election; as a subordinate node it helps provide the algorithm's strong-consistency guarantee
Candidate: when a Follower has not received the Leader's heartbeat for longer than a certain period, it becomes a Candidate and starts a Leader election
Term: the period from when a node becomes Leader until the start of the next election, called a term of office
Index: a data item's sequence number; Raft locates data using Term and Index together
Raft algorithm
Raft is a consensus algorithm for managing a replicated log. It provides the same functionality and performance as the Paxos algorithm, but its structure differs from Paxos, which makes Raft easier to understand and easier to build real systems on. A consensus algorithm allows a group of machines to work as a whole and to keep working even when some members fail; because of this, consensus algorithms play an important role in building reliable large-scale software systems.
The Raft algorithm is divided into three parts: leader election, log replication, and safety.
Features of the Raft algorithm:
1. Strong leader: Raft uses a stronger form of leadership than other consensus algorithms. For example, log entries flow only from the leader to the other servers. This simplifies management of the replicated log and makes Raft easier to understand.
2. Leader election: Raft uses randomized timers to elect a leader. This adds only a small amount of machinery to the heartbeats that any consensus algorithm must implement, and it resolves conflicts simply and quickly.
3. Membership changes: Raft uses a joint-consensus approach to handle changes in cluster membership, in which the majorities of the two different configurations overlap during the transition; this lets the cluster keep operating while its membership changes.
Leader election
Raft state machine
Each node in a Raft cluster runs a role-based state machine. Specifically, Raft defines three node roles: Follower, Candidate, and Leader.
1. Leader: there is only one Leader in the cluster at a time; it is responsible for replicating log data to all Follower nodes.
2. Follower: a Follower receives logs from the Leader, answers data queries, and forwards all modification requests to the Leader.
3. Candidate: when the cluster's Leader is absent or lost, the remaining Followers convert to Candidates and start a new Leader election.
The transition between these three role states is shown in the following figure:
A Raft cluster contains several server nodes, typically five, which lets the system tolerate two node failures. At any time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all other nodes are followers. Followers are passive: they issue no requests of their own and simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects the request to the leader).
When the nodes first start, every node's Raft state machine is in the Follower state. If a Follower receives no heartbeat from the Leader within a certain period, it changes its state to Candidate and sends vote requests to the other nodes in the cluster; each Follower votes for the first vote request it receives. Once a Candidate has received votes from more than half of the nodes in the cluster, it becomes the new Leader.
The Leader node accepts and stores the data sent by users and synchronizes its log to the other Follower nodes.
A Follower only responds to requests from other servers. If it receives no messages, it becomes a Candidate and starts an election. The Candidate that wins a majority of the cluster's votes becomes Leader, and remains Leader for the duration of that term (Term), until it goes down.
The Leader keeps its position by regularly sending heartbeat data to all Followers. When the cluster's Leader fails, the Followers elect a new node to keep the whole cluster running normally.
With each successful election, the new Leader's Term value is one greater than the previous Leader's. If the cluster splits and later remerges because of network problems or other causes, more than one Leader node may appear; the node with the higher Term becomes the real Leader.
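The single election round described above can be sketched in a few lines; the timeout range, vote bookkeeping, and node names are illustrative assumptions, not etcd's actual implementation:

```python
import random

def run_election(node_ids, seed=None):
    """One-round sketch of Raft leader election: every follower picks a
    random timeout; the first to time out becomes a candidate, starts a
    new term, votes for itself, and asks the others for votes.  Each
    follower grants at most one vote per term (first request wins)."""
    rng = random.Random(seed)
    timeouts = {n: rng.uniform(150, 300) for n in node_ids}  # ms
    candidate = min(timeouts, key=timeouts.get)  # first timer to fire
    term = 1
    voted_for = {candidate: candidate}  # the candidate votes for itself
    votes = 1
    for n in node_ids:
        if n != candidate and n not in voted_for:
            voted_for[n] = candidate  # grant the first request received
            votes += 1
    # The candidate wins only with votes from a strict majority.
    leader = candidate if votes > len(node_ids) // 2 else None
    return leader, term, votes
```

With no competing candidates in this simplified round, the first node to time out collects every vote; the randomized timeouts are what make simultaneous candidacies, and hence split votes, rare in practice.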
Term (term of office) in Raft algorithm
For Term, see the following figure:
Raft divides time into terms of arbitrary length, numbered with consecutive integers. Each term begins with an election, in which one or more candidates try to become leader. If a candidate wins the election, it serves as Leader for the rest of that term. In some cases an election ends in a split vote, leaving the term with no Leader; a new election, and with it a new term, then begins immediately. Raft guarantees that there is at most one Leader in any given term.
Log replication
Log replication means that the leader turns each operation into a log entry, persists it to local disk, and then sends it to the other nodes over the network.
Once a leader is elected, it begins serving clients. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to the other servers, asking them to replicate the entry.
Raft guarantees that all committed log entries are durable and will eventually be executed by all available state machines. Once the leader has received success responses from more than half of the nodes (including itself), the entry is considered committed; the entry is applied to the state machine and the result is returned to the client.
In normal operation the logs of the leader and the followers stay consistent, so the consistency check of the AppendEntries RPC never fails. A leader crash, however, can leave the logs inconsistent (the old leader may not have fully replicated all of its entries), and a series of leader and follower crashes can make this worse. A follower's log may differ from the new leader's: the follower may be missing entries the leader has, may have extra entries the leader lacks, or both, and the missing or extra entries may span multiple terms. This leads to the next part: safety.
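The AppendEntries consistency check described above can be sketched as follows; the 0-based indexing and the tuple log format are assumptions of this sketch, not the Raft paper's exact notation:

```python
def append_entries(follower_log, prev_index, prev_term, entries):
    """Sketch of the AppendEntries consistency check.  Each log entry
    is a (term, command) pair.  The follower rejects the request unless
    its log contains an entry at prev_index whose term matches
    prev_term; on success, any conflicting suffix is discarded and the
    new entries are appended."""
    # prev_index == -1 means "append from the very start of the log".
    if prev_index >= 0:
        if prev_index >= len(follower_log):
            return False  # follower's log is too short
        if follower_log[prev_index][0] != prev_term:
            return False  # term mismatch: the logs diverge here
    # Delete any conflicting entries after prev_index, then append.
    del follower_log[prev_index + 1:]
    follower_log.extend(entries)
    return True
```

On rejection the leader retries with an earlier prev_index, walking backwards until the logs agree; this is how a diverged follower is brought back in line with the leader's log.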
Safety
So far, leader election and log replication alone cannot guarantee data consistency across nodes. Imagine a node that crashes, restarts after a while, and is then elected leader. While it was down, if more than half of the cluster's nodes survived, the cluster kept working and continued to commit logs; those committed entries never reached the crashed node. If that node were elected leader again, it would be missing some committed entries, and under the replication rules it would copy its own log to the other nodes, overwriting entries the cluster had already committed. That is clearly wrong.
Other protocols solve this by having the newly elected leader ask the other nodes, compare their data with its own, determine what the cluster has committed, and then fetch the missing data. This has obvious drawbacks: it lengthens the time the cluster needs to restore service (the cluster is unavailable during the election), and it complicates the protocol. Raft's solution is to restrict, in the election logic, which nodes may become leader, guaranteeing that the elected node already contains every entry the cluster has committed. Since the new leader already holds all committed entries, it does not need to compare data with other nodes, which simplifies the process and shortens the time to restore service.
A natural question arises: with this restriction in place, can a leader still be elected? The answer is yes, as long as more than half of the nodes are alive. A committed entry must have been persisted by a majority of the cluster, so the last entry committed by the previous leader was also persisted by a majority. When the leader dies while a majority of the cluster survives, at least one surviving node must therefore hold all committed entries.
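The voting restriction reduces to a small comparison. This is a sketch of the "at least as up-to-date" rule the paragraphs above describe, with hypothetical parameter names:

```python
def is_up_to_date(candidate_last_term, candidate_last_index,
                  voter_last_term, voter_last_index):
    """A voter grants its vote only if the candidate's log is at least
    as up-to-date as its own: the higher last term wins, and on equal
    terms the longer log (higher last index) wins."""
    if candidate_last_term != voter_last_term:
        return candidate_last_term > voter_last_term
    return candidate_last_index >= voter_last_index
```

Because committed entries live on a majority of nodes, any candidate that wins a majority of votes under this rule must hold every committed entry, which is exactly the guarantee the safety argument needs.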
Proxy node (proxy) of Etcd
Etcd extends Raft's role model with a Proxy role. A node in proxy mode runs an HTTP proxy server that forwards client requests to the other etcd nodes.
A Proxy node does not take part in Leader elections; it forwards all queries and modification requests it receives to a Follower or Leader node.
A node can be started as a Proxy with etcd's "--proxy on" flag. In a cluster that uses the node self-discovery service, a fixed number of members can be set, and members beyond that number are automatically converted to Proxy nodes.
Once a node becomes a Proxy, it no longer participates in Leader elections or Raft state changes, unless it is restarted and designated as a Follower member.
Etcd in this mode acts as a reverse proxy, forwarding client requests to an available etcd cluster. A Proxy-mode etcd can therefore be deployed as a local service on every machine; as long as those proxies keep running, service discovery remains stable and reliable.
The complete Etcd role state transition process is shown in the following figure:
What is etcd used for in the Kubernetes project, and why was it chosen?
In a Kubernetes cluster, etcd is used to store data and broadcast changes.
Kubernetes uses no conventional database; it keeps all of its key data in etcd, which keeps the overall architecture of Kubernetes very simple.
In Kubernetes, data changes constantly: users submit new tasks, Nodes are added, Nodes go down, containers die, and so on, and each of these changes the state data. After the state changes, kube-scheduler and kube-controller-manager on the Master reschedule their work, and the resulting assignments are themselves data. All of these changes must be propagated to every component promptly. Etcd has a particularly useful feature here: a client can call its API to watch a piece of data and receive a notification when that data changes. With this, each Kubernetes component only needs to watch the relevant data in etcd to know what it should do, while kube-scheduler and kube-controller-manager only need to write the latest work assignments into etcd rather than notify every component one by one.
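The watch pattern just described can be sketched with a tiny in-memory store. `WatchableStore` and its callback interface are illustrative inventions, not etcd's actual API (etcd exposes watches over its gRPC/HTTP interface):

```python
class WatchableStore:
    """Sketch of the watch pattern: components subscribe to a key
    prefix and are called back whenever a value under that prefix
    changes, instead of being notified one by one."""

    def __init__(self):
        self._data = {}
        self._watchers = []  # list of (prefix, callback) pairs

    def watch(self, prefix, callback):
        self._watchers.append((prefix, callback))

    def put(self, key, value):
        self._data[key] = value
        # Notify every watcher whose prefix matches the changed key.
        for prefix, cb in self._watchers:
            if key.startswith(prefix):
                cb(key, value)
```

A scheduler-like component would watch its own prefix and react to changes, while writers never need to know who is listening; that decoupling is the point of the pattern.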
Imagine what you would do without etcd. Fundamentally there are two ways to propagate data. One is messaging: for example, when NodeA has a new task, the Master sends a message directly to NodeA, passing through no intermediary. The other is polling: everyone writes data to the same place, and everyone watches that place to notice changes in time. The former evolves into message-queue systems such as RabbitMQ; the latter evolves into distributed stores with subscription support.
The problem with the first approach is that every pair of communicating components must maintain a long-lived connection and handle all kinds of failures, such as disconnects and transmission errors. With message-queue middleware, the problem becomes much simpler: each component maintains only a connection to the queue and handles failures in one place.
So why did Kubernetes choose etcd rather than a message queue? They are fundamentally different systems: a message queue delivers messages and does not store data (a message backlog does not count as storage, since it cannot be queried), while etcd is a distributed store (its design goal was a distributed lock, with storage along the way), a key-value store with subscription support. With a message queue, a database would still be needed to hold the state data.
Another advantage of etcd is that it uses the Raft protocol for consistency, which makes it a distributed lock that can be used for elections. If multiple kube-scheduler replicas are deployed in Kubernetes, only one may be working at a time, or their scheduling decisions would conflict. How is it guaranteed that only one kube-scheduler works? As mentioned earlier: a leader is elected through etcd.
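A minimal sketch of "election through a store", assuming a simplified create-if-absent primitive; real deployments would use etcd transactions or the client's election and lease helpers rather than this hypothetical `LockStore`:

```python
class LockStore:
    """Sketch of leader election via an atomic create-if-absent:
    whichever replica first creates the election key becomes the
    active instance; everyone else stands by."""

    def __init__(self):
        self._data = {}

    def create_if_absent(self, key, value):
        # Atomic in this single-threaded sketch; etcd provides real
        # atomicity via transactions conditioned on the key's revision.
        if key in self._data:
            return False
        self._data[key] = value
        return True

def campaign(store, instance_id):
    """Try to become the active scheduler; True means we won."""
    return store.create_if_absent("/election/kube-scheduler", instance_id)
```

In practice the key would be attached to a lease so that a crashed leader's claim expires and a standby replica can win the next campaign.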