In this article, we analyze the drawbacks of Kafka's HW-based backup and recovery. The content is described from a practical, professional point of view; I hope you get something out of it after reading.
We will explain the causes of these drawbacks and how Kafka introduced the concept of Leader Epoch to solve them.
There are two main problems with HW-based backup and recovery:
Data loss
Data inconsistency
The explanations below assume the broker's min.insync.replicas is set to 1, which means that as soon as the leader receives a produce request and successfully writes the message to its log, it tells the client that the write succeeded, without waiting for the other follower replicas in the ISR.
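On the producer side, the setting that gives this leader-only acknowledgement is acks=1. Below is a minimal sketch using the Kafka Java client; the broker address and topic name are placeholders chosen for illustration, not values from the article.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LeaderOnlyAckProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // acks=1: the leader answers as soon as the record is in its own log,
        // without waiting for the followers in the ISR -- the situation the
        // scenarios below assume.
        props.put("acks", "1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
        }
    }
}
```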
Data loss
The following picture is a state chart of the data loss scenario; here is what happens to cause the loss.
From the figure, the initial state is that the messages have been written to both leader A and follower B. The difference is that leader A's HW has already been updated to 1, while follower B's HW is still 0 (a follower needs two fetch requests to advance its HW, so follower B has only made one fetch request so far).
Suppose follower B never makes that second fetch request (in the figure, because it restarts), and leader A also goes down for some reason. When follower B restarts, leader A is already down, so B is naturally elected the new leader. After the election, B's HW value is 0, so it deletes the message at offset 1 and resets its LEO to 1 (the next message written will start from offset 1). At this point, the original message at offset=1 is lost on B.
When A later restarts, it also truncates its log and adjusts its HW to 0. At this point, the offset=1 message is completely lost.
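To make this sequence concrete, here is a toy model of HW-based truncation in plain Java. It is not Kafka source code; the Replica class and message names are invented for illustration, and HW is treated as the highest committed offset, matching the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of HW-based truncation: each replica keeps a log, a log end offset
// (LEO) and a high watermark (HW). On restart a replica truncates its log back
// to its own HW, which is exactly where the data loss comes from.
public class HwDataLossDemo {
    static class Replica {
        final String name;
        final List<String> log = new ArrayList<>();
        long hw = -1;                       // highest committed offset known locally
        Replica(String name) { this.name = name; }
        long leo() { return log.size(); }   // log end offset = next offset to write
        void truncateToHw() { while (leo() > hw + 1) log.remove(log.size() - 1); }
        public String toString() { return name + " log=" + log + " HW=" + hw + " LEO=" + leo(); }
    }

    public static void main(String[] args) {
        Replica a = new Replica("A");
        Replica b = new Replica("B");

        // Initial state from the figure: both replicas hold the messages at
        // offsets 0 and 1, but only leader A has advanced its HW to 1; B's HW
        // is still 0 because it has not sent the second fetch request yet.
        a.log.add("m0"); a.log.add("m1"); a.hw = 1;
        b.log.add("m0"); b.log.add("m1"); b.hw = 0;

        // B restarts before that second fetch: HW-based recovery truncates B's
        // log to its own HW, so the message at offset 1 is thrown away on B.
        b.truncateToHw();

        // A is down, B becomes leader. When A comes back it adopts the new
        // leader's HW and truncates too -- offset 1 is now lost everywhere.
        a.hw = b.hw;
        a.truncateToHw();

        System.out.println(a);   // A log=[m0] HW=0 LEO=1
        System.out.println(b);   // B log=[m0] HW=0 LEO=1
    }
}
```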
Data inconsistency
The following figure is a state diagram of the data inconsistency scenario. Let's walk through how the leader and follower change in the cluster as the inconsistency develops.
In the initial state, both leader A and follower B have successfully written the first message, and leader A has also written the second. Suppose follower B sends a fetch request to synchronize the second message: because the LEO value carried by follower B is 1, when leader A receives the fetch request it updates its remote LEO for B to 1, updates its own HW to 1, and then returns the data to follower B.
Just then, unfortunately, B crashes before receiving the response, and leader A crashes as well. During the restart, B recovers first, so B becomes the leader (its HW is updated to 0 and its LEO to 1). At this point, A has not yet come back up.
Suppose the producer now sends a new message. B writes it to its underlying log and updates its own HW to 1 and its LEO to 2. Everything looks normal. Then A restarts successfully and finds that the partition leader's HW is 1 and its own HW is also 1, so it performs no update and no log truncation. As a result, the messages stored at offset=1 on leader B and follower A are inconsistent.
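A similar toy model (again plain Java, not Kafka code, with invented names) shows how A and B end up with the same HW but different messages at offset 1, so HW-based recovery never detects the divergence.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the divergence scenario above: after both replicas crash and B
// becomes leader, A and B end up with the same HW but different contents at
// offset 1, and HW-based recovery does not notice.
public class HwDivergenceDemo {
    public static void main(String[] args) {
        List<String> logA = new ArrayList<>(List.of("m0", "m1"));  // old leader A, HW = 1
        List<String> logB = new ArrayList<>(List.of("m0"));        // follower B, HW = 0
        long hwA = 1, hwB = 0;

        // Both crash; B recovers first and becomes leader. A new message
        // arrives and, with min.insync.replicas=1, B commits it alone at offset 1.
        logB.add("m1'");
        hwB = 1;

        // A comes back as a follower. Its HW (1) equals the leader's HW (1),
        // so it truncates nothing -- and offset 1 now differs between A and B.
        System.out.println("A offset 1 = " + logA.get(1) + " (HW=" + hwA + ")");
        System.out.println("B offset 1 = " + logB.get(1) + " (HW=" + hwB + ")");
    }
}
```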
The improvement: Leader Epoch
Because of these two problems with the HW-based backup and recovery mechanism, Kafka introduced Leader Epoch in version 0.11.0.0 to solve them.
Leader Epoch maintains a separate cache on the leader broker that records (epoch, offset) key-value pairs, which are periodically written to a checkpoint file. Each time the leader changes, the epoch is incremented by 1, and the offset records the position of the first message written by the leader of that epoch. The entry is added to the cache when that leader first writes to the underlying log; after that, the cache is not updated for the same epoch.
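Below is a small sketch of how such an (epoch, start offset) cache can drive truncation. It is an illustration of the idea rather than Kafka's actual implementation: a restarting follower asks the current leader for the end offset of its last known epoch (the OffsetsForLeaderEpoch request introduced alongside this feature) and truncates there, instead of truncating to its own HW, which avoids both the data loss and the divergence described above.

```java
import java.util.TreeMap;

// Toy model of the leader-epoch cache: each entry maps an epoch to the first
// offset written under that epoch, as it would appear in the checkpoint file.
public class LeaderEpochSketch {
    private final TreeMap<Integer, Long> epochStartOffsets = new TreeMap<>();
    private long logEndOffset = 0;

    void assignEpoch(int epoch) {
        // an entry is added only when the leader writes its first record under this epoch
        epochStartOffsets.putIfAbsent(epoch, logEndOffset);
    }

    void append(int records) { logEndOffset += records; }

    // Leader-side answer to the follower's epoch query: the end offset of
    // `epoch` is the start offset of the next epoch, or the current log end
    // offset if no later epoch exists yet.
    long endOffsetFor(int epoch) {
        Integer next = epochStartOffsets.higherKey(epoch);
        return next == null ? logEndOffset : epochStartOffsets.get(next);
    }

    public static void main(String[] args) {
        LeaderEpochSketch leader = new LeaderEpochSketch();
        leader.assignEpoch(0);
        leader.append(2);            // offsets 0..1 written under epoch 0
        leader.assignEpoch(1);
        leader.append(1);            // offset 2 written under epoch 1

        // A follower whose last known epoch is 0 truncates to this offset (2),
        // keeping everything the old leader actually wrote under epoch 0
        // instead of cutting back to its own HW.
        System.out.println("truncate follower to offset " + leader.endOffsetFor(0));
    }
}
```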
That concludes the analysis of the drawbacks of Kafka's HW-based backup and recovery. If you have had similar questions, the analysis above should help clear them up. If you want to learn more, you are welcome to follow the industry information channel.