What if the etcd3 data is inconsistent?


This article walks through a real-world case of etcd3 data inconsistency: how the problem surfaced, how it was tracked down, what the root cause turned out to be, and how to fix it and recover. I hope it helps you solve similar problems.

Background: a weird K8s rolling update failure

One day the author received feedback from a colleague that a rolling update in a test-environment K8s cluster did not take effect. Checking through kube-apiserver showed that the corresponding Deployment was already at the latest version, but Pods of that latest version had not been created.

Given this symptom, we initially suspected a bug in kube-controller-manager, but found no obvious abnormality in its logs. After raising the controller-manager log level and restarting it, it appeared that controller-manager simply had not watched this update event, and we still could not find the problem. The kube-apiserver logs likewise showed nothing obviously wrong.

So we raised the log level again and restarted kube-apiserver, and something weird happened: the previously stuck Deployment rolled out normally!

Etcd data inconsistency?

Since we could not extract any useful information from the kube-apiserver logs either, at first we could only guess that the problem was caused by an abnormal cache update in kube-apiserver. Just as we were about to dig in from that angle, the colleague reported an even weirder problem: a Pod he had just created would suddenly disappear when he listed Pods with kubectl! After testing the query repeatedly, we found that listing Pods with kubectl sometimes returned the Pod and sometimes did not. Here is the thing: the list operation of the K8s API is not served from the apiserver cache; kube-apiserver pulls the data directly from etcd and returns it to the client. Could something be wrong with etcd itself?

As we all know, etcd is a strongly consistent KV store: once a write succeeds, two subsequent reads should never return different data. Refusing to believe it, we queried the etcd cluster status and data directly with etcdctl. The results showed that all three nodes were in a normal state with the same RaftIndex, and no errors appeared in the etcd logs. The only suspicious thing was that the db sizes of the three nodes differed considerably. We then pointed the client endpoint at each node in turn and counted the keys on each node, and found that the three nodes returned different key counts; the difference between two nodes could even reach several thousand keys! Querying the just-created Pod directly with etcdctl, we found it could be read through some endpoints but not others. At this point it was almost certain that the data was inconsistent between the nodes of the etcd cluster.

Problem analysis: when in doubt, ask Google

For a strongly consistent store, data suddenly becoming inconsistent is a serious problem, and it ought to be reflected in the logs. However, perhaps because the etcd developers worry that too much logging would hurt performance, etcd logs sparingly: we checked the logs of every etcd node and found no useful error messages, and even after raising the log level we saw nothing abnormal.

As programmers in the 21st century, when we hit a strange problem we cannot immediately explain, the first reaction is to Google it; after all, a programmer who does not know how to use StackOverflow is not a good operator! Searching Google for "etcd data inconsistent" showed that we were not the only ones who had hit this problem: others had reported similar issues to the etcd community before, but had failed to provide a stable way to reproduce them.

Because this problem is serious, affects data consistency, and we currently run hundreds of etcd clusters in production, we decided to dig in to avoid running into it again.

A brief introduction to the working principle and terminology of etcd

Before we begin, in order to make it easier for readers to understand, here is a brief introduction to the common terms and basic reading and writing principles of etcd.


etcd is a strongly consistent distributed KV store. Simply put, after a write succeeds, a read from any node returns the latest value; there is no case where, after a successful write, the data cannot be read or only an old value is returned. etcd implements leader election, configuration changes, and consistent reads and writes through the Raft protocol. Below is a brief overview of etcd's write and read paths:

Write path (taking the leader node as an example):

The etcd server module of any etcd node receives a Client write request (if it is a follower node, it will first forward the request to the leader node for processing through the Raft module).

Etcd server encapsulates the request as a Raft request and then submits it to the Raft module for processing.

Leader interacts with the follower node in the cluster through the Raft protocol, copies the message to the follower node, and at the same time persists the log to WAL.

The follower node responds to the request and replies whether it agrees to the request or not.

When more than half of the nodes in the cluster ((members / 2) + 1) acknowledge receipt of the log entry, the entry is considered committed; the Raft module then notifies etcd server that the entry has been committed and can be applied.

The applierV3 module of the etcd server of each node performs Apply operations asynchronously and writes the back-end storage BoltDB through the MVCC module.

When the apply succeeds on the node the client is connected to, the result is returned to the client.

Read path:

The etcd server module of any etcd node receives a client read request (Range request).

Determine the type of read request: if it is a serializable read, go directly to the Apply process.

If it is a linearizable read, enter the Raft module.

The Raft module sends a ReadIndex request to the leader to obtain the cluster's latest committed index (CommittedIndex).

Wait until the local AppliedIndex is greater than or equal to the CommittedIndex obtained via ReadIndex, then enter the Apply process.

Apply process: look up the latest Revision of the key in the KV Index module by key name, then read the corresponding key and value from BoltDB using that Revision.
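To make the difference between the two read modes concrete, here is a minimal clientv3 sketch (the endpoint address is a placeholder): the default read is linearizable and goes through ReadIndex, while a serializable read is answered directly from the connected node's local data and may therefore be stale.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Placeholder endpoint; point this at any member of your cluster.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://node1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// Linearizable read (the default): goes through ReadIndex and waits until
	// the local AppliedIndex catches up with the cluster's CommittedIndex.
	lin, err := cli.Get(ctx, "foo")
	if err != nil {
		log.Fatal(err)
	}

	// Serializable read: answered directly from the connected node's local
	// data, skipping ReadIndex; cheaper, but possibly stale.
	ser, err := cli.Get(ctx, "foo", clientv3.WithSerializable())
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("linearizable:", lin.Kvs, "serializable:", ser.Kvs)
}
```

Note that even a linearizable read only waits for the local AppliedIndex to catch up; as discussed later, it does not verify that each entry was actually applied successfully.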

Preliminary verification

Usually, when a cluster is running normally without external changes, such a serious problem should not appear out of nowhere. Checking the change records of the faulty etcd cluster for the preceding days, we found that in a rollout the day before the failure, an unreasonable dbsize configuration had let the db fill up, so the cluster could no longer accept writes. The operations staff therefore updated the cluster's dbsize and compaction configuration and restarted etcd; after the restart, they also manually ran compact and defrag operations on etcd to reclaim db space.

From the above scenarios, we can preliminarily identify the following suspicious trigger conditions:

db full (dbsize quota exceeded)

dbsize and compaction configuration updates

Compaction operation and defrag operation

Restart etcd

A problem must be reproducible before it can be solved effectively; as the saying goes, a bug that cannot be reproduced is not a bug. Before trying to reproduce it, we went through the related issues in the etcd community and found that the common triggers reported included running compaction and defrag operations and restarting etcd nodes. So we planned to simulate these operations together and see whether the problem could be reproduced in a new environment. We created a new cluster, wrote scripts to write and delete data until the db reached a certain size, then updated and restarted the nodes in turn while triggering compaction and defrag (the compaction and defrag steps can be driven programmatically, as sketched below). However, after many attempts, we could not reproduce anything resembling the data inconsistency described above.
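For reference, the compaction and defragmentation steps of such a script can be driven through the clientv3 API roughly as follows; this is a sketch under the assumption of a reachable cluster, the probe key name is a placeholder, and node restarts still happen out of band.

```go
package chaos

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// compactAndDefrag compacts the keyspace up to the current revision and then
// defragments each endpoint's backend db.
func compactAndDefrag(ctx context.Context, cli *clientv3.Client, endpoints []string) error {
	// Any read returns the current store revision in its response header.
	resp, err := cli.Get(ctx, "compaction-probe")
	if err != nil {
		return err
	}
	// Physical compaction waits until the compacted space is actually reclaimed.
	if _, err := cli.Compact(ctx, resp.Header.Revision, clientv3.WithCompactPhysical()); err != nil {
		return err
	}
	// Defragment each member in turn to shrink its db file.
	for _, ep := range endpoints {
		if _, err := cli.Defragment(ctx, ep); err != nil {
			return err
		}
	}
	return nil
}
```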

Peeling back the layers: the first clue

Then, during subsequent tests, I noticed by accident that when the client specified different endpoints to write data, the set of nodes on which the data could later be found also differed: writing with the endpoint set to node1, all three nodes could see the data; writing via node2, node2 and node3 could see it; writing via node3, only node3 could see it. A sketch of this per-endpoint check is given below.
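A sketch of that per-endpoint visibility test (node addresses and the key are placeholders): write a key through one endpoint, then read it back through each node individually using serializable reads so that each node answers from its own data.

```go
package chaos

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// checkVisibility writes a key via writeEndpoint and then reports, for every
// endpoint, whether the key can be read back from that node's own data.
func checkVisibility(ctx context.Context, writeEndpoint string, allEndpoints []string, key, val string) error {
	w, err := clientv3.New(clientv3.Config{Endpoints: []string{writeEndpoint}, DialTimeout: 5 * time.Second})
	if err != nil {
		return err
	}
	defer w.Close()
	if _, err := w.Put(ctx, key, val); err != nil {
		return err
	}

	for _, ep := range allEndpoints {
		r, err := clientv3.New(clientv3.Config{Endpoints: []string{ep}, DialTimeout: 5 * time.Second})
		if err != nil {
			return err
		}
		// A serializable read is answered from this node's local data, so it
		// reveals whether the write actually landed on this member.
		resp, err := r.Get(ctx, key, clientv3.WithSerializable())
		r.Close()
		if err != nil {
			return err
		}
		fmt.Printf("write via %s, read via %s: found=%v\n", writeEndpoint, ep, len(resp.Kvs) > 0)
	}
	return nil
}
```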

So we made some preliminary guesses about the possibilities:

The cluster may have split, and the leader never sent the message to the follower nodes.

The leader sent the message to the followers, but the log was abnormal and contained no corresponding command.

The leader sent the message with the corresponding command, but apply failed before the operation reached the KV Index and BoltDB.

The leader sent the message with the corresponding command, apply went wrong, and the problem lies in the KV Index.

The leader sent the message with the corresponding command, apply went wrong, and the problem lies in BoltDB.

To verify these guesses, we ran a series of tests to narrow down the problem:

First, we checked the cluster information with endpoint status and found that clusterId, leader, raftTerm, and raftIndex were the same across the three nodes, while dbSize and revision differed. Identical clusterId and leader basically rule out the cluster-split guess, and identical raftTerm and raftIndex indicate that the leader has synchronized the messages to the followers, which further rules out the first guess, although whether anything went wrong with the WAL on disk is still uncertain. The differing dbSize and revision confirm that the data of the three nodes has diverged.

Second, etcd ships some dump tools, such as etcd-dump-log and etcd-dump-db. We can use etcd-dump-log to dump the contents of a specified WAL file and etcd-dump-db to dump the data in a db file, which makes it convenient to analyze the WAL and db data.

So we wrote a distinctive piece of data to node3 and then analyzed the WAL of the three nodes with etcd-dump-log. According to the earlier test, data written with the endpoint set to node3 should not be visible on the other two nodes. Yet we found that all three nodes had received the corresponding WAL entry, which means the WAL was not lost, ruling out the second guess.

Next, we analyzed the db data and found that the data written via node3 could not be found in the db files of the other two nodes; in other words, the data never made it into their db at all, rather than being written and then lost.

Since it is in WAL but not in db, it is most likely that the apply process is abnormal and the data may be discarded during apply.

Since the existing logs did not provide more useful information, we planned to add some logging to etcd to help locate the problem. When etcd performs an apply operation, it prints a trace log for every request that takes longer than 100ms. We lowered that 100ms threshold to 1ns so that every apply request would be recorded, which made it much easier to pin down the problem.

After building the new version, we replaced one of the etcd nodes and then issued write requests against different nodes. Sure enough, we found an unusual error log: "error": "auth: revision in header is old". We therefore concluded that the corresponding keys were probably not written on the node that emitted this error.

Searching the code, we found that when etcd applies an entry with authentication enabled, it checks the AuthRevision carried in the Raft request header: if the request's AuthRevision is less than the current node's AuthRevision, the request is considered stale and the apply fails.

```go
func (as *authStore) isOpPermitted(userName string, revision uint64, key, rangeEnd []byte, permTyp authpb.Permission_Type) error {
	// ...
	if revision < as.Revision() {
		return ErrAuthOldRevision
	}
	// ...
}
```

Seen this way, it was very likely that AuthRevision differed between nodes. AuthRevision is read directly from the db when etcd starts and is written back to the db promptly after every change, so we made a small modification to the etcd-dump-db tool to decode the AuthRevision stored in each node's db and compared them. The AuthRevision of the three nodes was indeed inconsistent: node1's AuthRevision was the highest and node3's the lowest. This exactly explains the earlier phenomenon: data written with the endpoint set to node1 could be found on all three nodes, while data written via node3 could only be found on node3, because a Raft request proposed by a node with a lower AuthRevision is dropped during Apply by nodes with a higher AuthRevision.

No secrets in front of the source code?

So far we could be certain that the reason newly written data could not be found through some endpoints was the inconsistent AuthRevision. Whether the data inconsistency itself was originally caused by AuthRevision, however, could not yet be determined. Why? Because AuthRevision might well be another victim: for example, the AuthRevision inconsistency and the data inconsistency might share the same root cause, with the AuthRevision inconsistency merely amplifying the data inconsistency. Still, to get closer to the truth, we first assumed that AuthRevision was the culprit behind the data inconsistency and set out to find the real cause of the inconsistent AuthRevision.

How do we find that cause? As the saying goes, there are no secrets in front of the source code, so our first thought was to analyze the code. We read through the code related to auth operations (below) and found that AuthRevision is only incremented by permission-related write operations (adding or deleting users and roles, granting permissions to roles, and so on). After AuthRevision is incremented, it is written into the backend cache together with the permission write itself; when the number of pending writes exceeds a threshold (10000 records by default) or every 100ms (the default), a flush commits them to the db. Since the persistence of AuthRevision and of operations such as creating a user happens in a single transaction, there is essentially no case where a user is created successfully while AuthRevision fails to be incremented.

```go
func (as *authStore) UserAdd(r *pb.AuthUserAddRequest) (*pb.AuthUserAddResponse, error) {
	// ...
	tx := as.be.BatchTx()
	tx.Lock()
	defer tx.Unlock() // Unlock triggers a commit when the conditions are met
	// ...
	putUser(tx, newUser)
	as.commitRevision(tx)
	return &pb.AuthUserAddResponse{}, nil
}

func (t *batchTxBuffered) Unlock() {
	if t.pending != 0 {
		t.backend.readTx.Lock() // blocks txReadBuffer for writing.
		t.buf.writeback(&t.backend.readTx.buf)
		t.backend.readTx.Unlock()
		if t.pending >= t.backend.batchLimit {
			t.commit(false)
		}
	}
	t.batchTx.Unlock()
}
```

So, given that the AuthRevision of the three nodes is inconsistent, could it be that some nodes lost permission-related write operations, so they were never written to db? If so, the contents of the authUsers and authRoles buckets in the three nodes' db files should differ. To verify this, we further modified the etcd-dump-db tool to add the ability to compare bucket contents across db files. Unfortunately, the comparison showed no difference in the authUsers and authRoles buckets among the three nodes.

Since permission-related writes were not lost, could some commands have been executed repeatedly instead? Looking at the logs around the time of the anomaly, we saw quite a few auth operations; comparing the auth-related logs of the three nodes, some nodes had more of them and some had fewer, which looked like commands being executed repeatedly. Because of log compaction we could not tell for sure whether operations were repeated or lost, but this observation gave us a strong hint for the follow-up investigation.

We kept observing and found that the AuthRevision did differ across nodes, but the differences were small and did not grow during our stress test. Since the AuthRevision gap was not widening, the newly added logs were unlikely to catch anything: the inconsistency had most likely been introduced in an instant at some point in the past. So, to find the root cause, we still needed to reproduce the AuthRevision or data inconsistency ourselves and capture the scene at the moment it happened.

The problem seemed to be back to square one, but the good news was that we had eliminated a lot of noise and could now focus on the auth operations.

Chaos engineering: a successful reproduction

Since manually simulating various scenarios had failed to reproduce the problem, we decided to build an automated stress-testing scheme to reproduce it. The main points to consider were:

How do we increase the probability of reproducing the problem?

Based on the earlier troubleshooting, the data inconsistency was very likely triggered by auth operations, so we implemented a monkey script that periodically writes random users and roles to the cluster, grants permissions to the roles, writes data, and randomly restarts nodes in the cluster, recording the time and execution log of every operation in detail. A rough sketch of such a script is shown below.
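A rough sketch of such a monkey script built on the clientv3 auth API; the user/role naming scheme, key prefix, and the node-restart hook are assumptions supplied by the caller (for example via SSH or systemctl), not part of the original script.

```go
package chaos

import (
	"context"
	"fmt"
	"math/rand"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// authMonkey periodically creates random users and roles, grants permissions,
// writes data, and occasionally restarts a random node via the supplied hook.
func authMonkey(ctx context.Context, cli *clientv3.Client, restartRandomNode func() error) error {
	for i := 0; ; i++ {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Duration(rand.Intn(5)+1) * time.Second):
		}

		user := fmt.Sprintf("monkey-user-%d", i) // placeholder naming scheme
		role := fmt.Sprintf("monkey-role-%d", i)

		if _, err := cli.UserAdd(ctx, user, "password"); err != nil {
			return err
		}
		if _, err := cli.RoleAdd(ctx, role); err != nil {
			return err
		}
		if _, err := cli.UserGrantRole(ctx, user, role); err != nil {
			return err
		}
		// Granting read/write on a key range bumps AuthRevision.
		if _, err := cli.RoleGrantPermission(ctx, role, "/monkey/", "/monkey0",
			clientv3.PermissionType(clientv3.PermReadWrite)); err != nil {
			return err
		}
		// Also write some ordinary data.
		if _, err := cli.Put(ctx, fmt.Sprintf("/monkey/%d", i), time.Now().String()); err != nil {
			return err
		}
		// Occasionally restart a random node.
		if rand.Intn(10) == 0 {
			if err := restartRandomNode(); err != nil {
				return err
			}
		}
	}
}
```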

How do we make sure that, once the problem reproduces, we can locate the root cause?

According to the previous analysis, the problem lies somewhere in the apply process, so we added detailed logging to the apply path, printing committedIndex, appliedIndex, consistentIndex, and other information for every apply operation.

If the problem does reproduce, how do we notice it as early as possible?

Because the logs are huge, only by spotting the problem right when it occurs can we narrow its scope accurately. So we implemented a simple metric-server that pulls the key count of each node once a minute, compares them, and exposes the difference as a metric, which Prometheus scrapes and Grafana displays. Whenever the difference exceeds a threshold (when a large amount of data is being written, counting each node's keys concurrently can produce small discrepancies, so some tolerance is needed), an alert is pushed to us immediately through the unified alerting platform so that we can react in time. A minimal sketch of the key-count check is shown below.
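A minimal sketch of the per-node key-count check at the core of that metric-server; the Prometheus exposition and alerting wiring are omitted and the endpoints are placeholders.

```go
package chaos

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// keyCounts returns the total number of keys reported by each endpoint, using
// a serializable, count-only range over the whole keyspace so that every node
// answers from its own local data.
func keyCounts(ctx context.Context, endpoints []string) (map[string]int64, error) {
	counts := make(map[string]int64, len(endpoints))
	for _, ep := range endpoints {
		cli, err := clientv3.New(clientv3.Config{
			Endpoints:   []string{ep}, // pin the client to a single node
			DialTimeout: 5 * time.Second,
		})
		if err != nil {
			return nil, err
		}
		resp, err := cli.Get(ctx, "",
			clientv3.WithPrefix(),       // empty key + WithPrefix covers the whole keyspace
			clientv3.WithCountOnly(),    // return only the count, not the key-values
			clientv3.WithSerializable()) // local read, no ReadIndex round-trip
		cli.Close()
		if err != nil {
			return nil, err
		}
		counts[ep] = resp.Count
	}
	return counts, nil
}

// diverged reports whether the spread between node key counts exceeds the
// tolerance used to absorb small discrepancies from concurrent writes.
func diverged(counts map[string]int64, tolerance int64) bool {
	var min, max int64 = 1<<62 - 1, -1
	for _, c := range counts {
		if c < min {
			min = c
		}
		if c > max {
			max = c
		}
	}
	return max-min > tolerance
}
```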

Once the tooling was ready, we built a new etcd cluster, deployed the stress-test program, and prepared for a long observation period. As it turned out, at noon the next day we received a WeChat alert telling us that the cluster's data had become inconsistent.

We immediately logged in to the stress-test machine, first stopped the stress-test script, and then looked at the AuthRevision of each node: the three nodes' AuthRevision values were indeed different! Using the monitoring data on the Grafana panel, we narrowed the window of the inconsistency down to about 10 minutes and focused on the logs from that window. We found that after one node restarted, its consistentIndex was smaller than before the restart. Yet the idempotency of all apply operations in etcd depends on consistentIndex: when applying an entry, etcd checks whether the entry's Index is greater than consistentIndex; if so, it sets consistentIndex to that Index and applies the entry, otherwise it assumes the request has already been applied and skips the actual apply.

```go
// applyEntryNormal applies an EntryNormal type raftpb request to the EtcdServer
func (s *EtcdServer) applyEntryNormal(e *raftpb.Entry) {
	shouldApplyV3 := false
	if e.Index > s.consistIndex.ConsistentIndex() {
		// set the consistent index of current executing entry
		s.consistIndex.setConsistentIndex(e.Index)
		shouldApplyV3 = true
	}
	// ...
	// do not re-apply applied entries.
	if !shouldApplyV3 {
		return
	}
	// ...
}
```

In other words, because consistentIndex went backwards, the idempotency guarantee that etcd relies on no longer held, and permission-related operations were applied again after the etcd restart; in total they were applied twice!

Root cause analysis

Why did consistentIndex decrease? After reading the code related to consistentIndex, we finally found the root cause: the persistence of consistentIndex piggybacks on MVCC write operations. When data is written through MVCC, saveIndex is called to persist consistentIndex to the backend, but auth-related operations read and write the backend directly without going through MVCC. As a result, if a permission-related write is performed and no data is subsequently written through MVCC, consistentIndex is not persisted during that window. If etcd is restarted at that point, the permission-related write is applied a second time; a side effect is that AuthRevision may be incremented repeatedly, which directly makes AuthRevision inconsistent across nodes, and the inconsistent AuthRevision in turn causes data inconsistency.

```go
func putUser(lg *zap.Logger, tx backend.BatchTx, user *authpb.User) {
	b, err := user.Marshal()
	// ... error handling elided ...
	tx.UnsafePut(authUsersBucketName, user.Name, b)
	// Writes directly to the backend without going through MVCC,
	// so consistentIndex is not persisted at this point.
}

func (tw *storeTxnWrite) End() {
	// only update index if the txn modifies the mvcc state.
	if len(tw.changes) != 0 {
		tw.s.saveIndex(tw.tx) // writing data through MVCC triggers persistence of consistentIndex
		tw.s.revMu.Lock()
		tw.s.currentRev++
	}
	tw.tx.Unlock()
	if len(tw.changes) != 0 {
		tw.s.revMu.Unlock()
	}
	tw.s.mu.RUnlock()
}
```

Looking back: why could the inconsistent data still be read, and why might different nodes return different data? Doesn't etcd use Raft, and shouldn't Raft guarantee strong consistency? In fact, this has to do with how etcd implements reads.

A ReadIndex read depends on the leader's CommittedIndex and the current node's AppliedIndex. During apply, etcd updates AppliedIndex regardless of whether the apply succeeds or fails. So even though the apply failed on the current node, the read path does not notice the failure when it performs its check, and as a result some nodes may be unable to read the data. Moreover, etcd supports multi-version concurrency control: the same key can have multiple versions, and a failed apply may only mean that a particular version was not written, leaving different nodes with different latest versions and therefore different read results.

Scope of impact

This bug was introduced in 2016, and every etcd3 cluster with authentication enabled is affected. In certain scenarios the data across the nodes of the cluster becomes inconsistent, while from the outside etcd still appears to read and write normally, with no obvious errors in the logs.

Trigger conditions

The etcd3 cluster is used and authentication is enabled.

A node in the etcd cluster is restarted.

Before the restart, a permission-granting operation was performed (or the same permission was added and deleted several times within a short period), and no other data was written through MVCC between that operation and the restart.

After the restart, a write request is sent to the cluster through a node that was not restarted.

Repair scheme

Once the root cause was known, the fix was clear: we only need to trigger persistence of consistentIndex after the auth operation calls commitRevision, so that consistentIndex is correct when etcd restarts and the idempotency of auth operations is preserved. We submitted PR #11652 to the etcd community with the concrete fix; it has been backported to 3.4 and 3.3 and will be included in upcoming releases.

What if the data is already inconsistent; is there any way to recover? If the etcd process has not been restarted many times, you can first find the node with the smallest authRevision; its data should be the most complete. Then use etcd's move-leader command to transfer leadership to that node, remove the other nodes from the cluster one by one, back up and delete their data directories, and add them back; each re-added node will sync a fresh copy of the latest data from the leader. In this way the other nodes' data can be brought back in line with the leader's, losing as little data as possible. The same steps can also be driven through the client API, as sketched below.
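A rough sketch of driving those recovery steps through the clientv3 Maintenance and Cluster APIs instead of etcdctl, under the assumption of placeholder endpoints and member IDs; backing up and wiping the data directory and restarting the etcd process still have to be done out of band.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://node1:2379", "http://node2:2379", "http://node3:2379"}, // placeholders
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// 1. Transfer leadership to the node with the smallest AuthRevision.
	//    The request must be handled by the current leader, and the member ID
	//    (placeholder below) can be taken from `endpoint status`.
	var targetID uint64 = 0x1234
	if _, err := cli.MoveLeader(ctx, targetID); err != nil {
		log.Fatalf("move-leader failed: %v", err)
	}

	// 2. For each of the other members: remove it, back up and wipe its data
	//    directory out of band, then add it back so it resyncs from the leader.
	var staleID uint64 = 0x5678 // placeholder member ID of a node to rebuild
	if _, err := cli.MemberRemove(ctx, staleID); err != nil {
		log.Fatalf("member remove failed: %v", err)
	}
	// ... back up and delete that node's data directory, then:
	if _, err := cli.MemberAdd(ctx, []string{"http://node2:2380"}); err != nil {
		log.Fatalf("member add failed: %v", err)
	}
	// Finally restart the etcd process on that node with the new cluster configuration.
}
```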

Upgrade recommendation

Note that upgrading carries risk of its own: although the new version fixes the problem, etcd must be restarted during the upgrade, and the restart itself may still trigger the bug. It is therefore recommended to stop permission-related write operations before upgrading to the fixed version, and to manually trigger a data write before restarting each node (see the sketch below), to avoid hitting the problem during the upgrade itself.
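The pre-restart write can be as simple as putting a throwaway key through the normal KV path, which forces consistentIndex to be persisted; a minimal sketch (the key name is a placeholder):

```go
package chaos

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// touchBeforeRestart writes one ordinary key through the normal MVCC path right
// before a node restart, forcing consistentIndex to be persisted so that the
// last auth operation cannot be re-applied after the restart.
func touchBeforeRestart(ctx context.Context, cli *clientv3.Client) error {
	_, err := cli.Put(ctx, "pre-restart-touch", time.Now().Format(time.RFC3339))
	return err
}
```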

In addition, upgrading directly across releases, for example from etcd 3.2 to etcd 3.3, is not recommended; such upgrades are risky and need careful testing and evaluation. We have previously found another inconsistency caused by the interaction of lease and auth; see issue #11689 and the related PR #11691.

That concludes this walkthrough of what to do when etcd3 data becomes inconsistent. Thanks for reading.
