How to minimize data loss when a Kafka partition is unavailable and its leader replica is corrupted

2025-04-23 Update From: SLTechnology News&Howtos

This article shows how to minimize data loss when a Kafka partition is unavailable and its leader replica is corrupted.

It first reproduces the failure that makes a partition unavailable, and then describes the steps I took to minimize data loss.

Reproducing the failure

The following example reproduces a case where the partition is unavailable and the leader replica is corrupted:

1. Start broker0 with unclean.leader.election.enable=false.

2. Start broker1 with unclean.leader.election.enable=false.

3. Create topic-1 with one partition and a replication factor of 2.

4. Write messages to topic-1. At this point the replicas on both brokers are in the ISR, and the replica on broker0 is the leader.

5. Stop broker1. The leader of topic-1 is still the replica on broker0, and the replica on broker1 is removed from the ISR.

6. Stop broker0 and delete its log data.

7. Restart broker1. It tries to connect to the leader replica of topic-1, but broker0 is stopped, so the partition is unavailable and cannot accept writes.

8. Restart broker0. Its replica resumes the leader role. When broker1 then tries to rejoin the ISR, the leader's data has been cleared (its offset is 0), so the replica on broker1 must truncate its log to keep its offset no greater than the leader's, and all data in the partition is lost.
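The sequence above can be sketched with a small toy model. This is only an illustration of the leader/ISR/truncation logic described in the steps, not Kafka's real implementation; the class names (Replica, Partition) and the run_scenario helper are hypothetical, and the model ignores details such as epochs and high watermarks.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    broker_id: int
    log: list = field(default_factory=list)
    online: bool = True


class Partition:
    """Toy model of a single partition (an assumption-laden sketch)."""

    def __init__(self, replicas, unclean_leader_election=False):
        self.replicas = {r.broker_id: r for r in replicas}
        self.isr = [r.broker_id for r in replicas]
        self.leader = self.isr[0]
        self.unclean = unclean_leader_election

    def available(self):
        return self.replicas[self.leader].online

    def produce(self, msg):
        if not self.available():
            raise RuntimeError("partition unavailable")
        for bid in self.isr:
            self.replicas[bid].log.append(msg)

    def stop(self, broker_id):
        self.replicas[broker_id].online = False
        if broker_id != self.leader:
            if broker_id in self.isr:
                self.isr.remove(broker_id)  # stopped follower leaves the ISR
            return
        # Leadership only moves to another online ISR member, if there is one.
        candidates = [b for b in self.isr if b != broker_id and self.replicas[b].online]
        if candidates:
            self.leader = candidates[0]
            self.isr.remove(broker_id)

    def start(self, broker_id):
        self.replicas[broker_id].online = True
        if not self.available() and self.unclean:
            # Unclean election: an out-of-ISR replica may take over.
            self.leader, self.isr = broker_id, [broker_id]
        self._sync_followers()

    def _sync_followers(self):
        if not self.available():
            return
        leader_log = self.replicas[self.leader].log
        for bid, r in self.replicas.items():
            if bid == self.leader or not r.online:
                continue
            del r.log[len(leader_log):]            # truncate to the leader's offset
            r.log.extend(leader_log[len(r.log):])  # fetch what is missing
            if bid not in self.isr:
                self.isr.append(bid)


def run_scenario(unclean):
    b0, b1 = Replica(0), Replica(1)
    p = Partition([b0, b1], unclean_leader_election=unclean)
    for i in range(5):
        p.produce(f"m{i}")      # step 4: both replicas are in the ISR
    p.stop(1)                   # step 5: broker1 leaves the ISR
    p.stop(0)                   # step 6: leader goes down ...
    b0.log.clear()              # ... and its log data is deleted
    p.start(1)                  # step 7
    available_after_step7 = p.available()
    p.start(0)                  # step 8
    return available_after_step7, len(b1.log)


print(run_scenario(unclean=False))
print(run_scenario(unclean=True))
```

With unclean election disabled, the model leaves the partition unavailable after step 7 and broker1 ends up with an empty log after step 8. With it enabled, broker1 takes over and keeps its messages.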

My suggestion

Could Kafka provide an option that lets users manually set any replica of a partition as leader when the partition is unavailable?

Once unclean.leader.election.enable=false is set in the cluster, a replica outside the ISR cannot be elected leader. In the extreme case, only the leader replica remains in the ISR and the broker hosting it goes down. What if that broker's data is also corrupted? In that case, could the user be allowed to choose which replica becomes leader? Doing so still loses some data, but that is much better than losing the data of the entire partition.

My workaround

First, you need an unavailable partition whose leader replica has lost its data. To test this, repeat steps 1-8 of the failure reproduction above to get an unavailable partition (you need to add one more broker, broker2).

At this point the leader replica is on broker0, but broker0 is down, so the partition is unavailable. The replica on broker2 cannot be elected leader because it has fallen out of the ISR, and the leader replica's data has been corrupted and wiped. If you restart broker0 now, the follower replica will truncate its log and all data in the partition will be lost.

After a series of tests and experiments, I arrived at the following steps, which force the replica on broker2 to become leader and so minimize data loss:

1. Use the kafka-reassign-partitions.sh script to reassign the partitions of the topic. You can also do this from the kafka-manager console.

At this point the preferred leader has changed to the replica on broker2, but the actual leader is still the replica on broker0. Note that the preferred leader after the reassignment must be the replica that was previously kicked out of the ISR, not a replica newly created by the reassignment, because the offset of a newly created replica is 0. If the automatically generated reassignment does not meet this requirement, write the JSON file yourself and set the replica assignment manually.
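A hand-written reassignment file could be generated as in the minimal sketch below. It relies on the fact that the first broker in a partition's replicas list is its preferred leader. The helper name build_reassignment, the partition number 0, and the replica set [2, 0] are assumptions for this example; adjust them to your own topic layout.

```python
import json


def build_reassignment(topic, partition, preferred_leader, other_replicas):
    """Build a reassignment plan with the preferred leader listed first."""
    replicas = [preferred_leader] + [b for b in other_replicas if b != preferred_leader]
    return {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": partition, "replicas": replicas}
        ],
    }


# Assumed layout: topic-1 has a single partition (0) replicated on brokers 0 and 2.
plan = build_reassignment("topic-1", 0, preferred_leader=2, other_replicas=[0])
print(json.dumps(plan, indent=2))
```

The printed JSON can be saved to a file and passed to kafka-reassign-partitions.sh with --reassignment-json-file and --execute.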

2. In ZooKeeper, inspect the partition state node and modify its content:

Modify the node content: force the leader to 2 (the same as the preferred leader after the reassignment), increase leader_epoch by 1, and change the ISR list so that it contains only the new leader.
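The change to the node content can be sketched as a pure function on the state JSON. On a ZooKeeper-based cluster the state of this partition is normally stored at /brokers/topics/topic-1/partitions/0/state. The function name force_leader and the sample values (controller_epoch 1, leader_epoch 3) are made up for illustration; reading the node and writing the result back (for example with the ZooKeeper CLI) is left out here.

```python
import json


def force_leader(state_json, new_leader):
    """Return the partition state with the leader forced to new_leader."""
    state = json.loads(state_json)
    state["leader"] = new_leader   # forcibly change the leader
    state["leader_epoch"] += 1     # add 1 to leader_epoch
    state["isr"] = [new_leader]    # ISR now contains only the new leader
    return json.dumps(state, separators=(",", ":"))


# Illustrative "before" content; the real epochs will differ on your cluster.
before = '{"controller_epoch":1,"leader":0,"version":1,"leader_epoch":3,"isr":[0]}'
after = force_leader(before, 2)
print(after)
```

Keeping the edit as a small function makes it easy to double-check the new content before writing it to ZooKeeper, since a mistake here affects a live partition.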

At this point, the kafka-manager console looks like this:

However, the change still does not take effect at this point. You must restart broker0.

3. Restart broker0. The lastOffset of the partition is now the lastOffset of the replica on broker2.

This recovered 46502 messages. Another 76053 - 46502 = 29551 messages were still lost, but that is better than losing all of them!
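The arithmetic above can be checked with a few lines. The variable names are chosen for this example only; the two numbers are the ones reported in the test.

```python
total_messages = 76053   # messages in the partition before the failure
recovered = 46502        # messages kept by the replica on broker2

lost = total_messages - recovered
recovered_ratio = recovered / total_messages

print(f"lost: {lost}")
print(f"recovered share: {recovered_ratio:.1%}")
```

In this test roughly three fifths of the messages survived, compared with none if broker0 had simply been restarted.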
