
How to ensure that data written to Kafka is not lost in the event of a sudden outage


The purpose of this article is to share how to ensure that data written to Kafka is not lost during a sudden outage. The ideas here are very practical, so I hope you learn something from it. Let's take a look.

As most of us know, data written to Kafka is persisted to disk. So let's talk about how to guarantee that this data is never lost.

Leaving aside the details of how the disk writes happen, let's start with the figure below, which shows the core architecture of Kafka.

Kafka distributed storage architecture

Now here is the question: if dozens of terabytes of data are generated every day, should they all be written to the disk of a single machine? Obviously not!

So we have to store the data in a distributed way. Let's look at how Kafka does this.

Kafka has a core concept called a "Topic". For now, just think of a Topic as a data set.

For example, if you want to write a website's user behavior data to Kafka, you can create a Topic called "user_access_log_topic" and send all the user behavior data there.

Likewise, if you want to write an e-commerce site's order data to Kafka, you can create a Topic called "order_tb_topic" and write the change records of the order table to it.
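
As a minimal sketch of what this looks like with Kafka's Java producer client (the broker address and the JSON event payload here are made-up assumptions):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class UserAccessLogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key: a user id; value: one behavior event (both are example values).
            producer.send(new ProducerRecord<>("user_access_log_topic",
                    "user_42", "{\"event\":\"page_view\",\"url\":\"/home\"}"));
        } // close() flushes buffered records before returning
    }
}
```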

Now take the user behavior Topic as an example: if dozens of terabytes are written into it every day, do you think it is reasonable to put it all on one machine?

Obviously not, so Kafka has a concept called a Partition: a Topic's data set is split into multiple partitions, which you can think of as data shards, and each Partition stores part of the data, each potentially on a different machine.

In this way, a very large data set can be distributed and stored across multiple machines.
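
As a sketch of what that looks like in practice, the Topic can be created with multiple Partitions through Kafka's Java AdminClient (the broker address and the partition count of 3 are assumptions):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions let the Topic's data spread across up to 3 brokers;
            // replication factor 1 means no redundancy yet (addressed below).
            NewTopic topic = new NewTopic("user_access_log_topic", 3, (short) 1);
            admin.createTopics(List.of(topic)).all().get(); // block until created
        }
    }
}
```

By default, the producer then routes each keyed record to a partition by hashing the record key, so records with the same key always land in the same Partition.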

Kafka high availability architecture

But this raises another problem: if one machine goes down, won't the data in the Partitions on that machine be lost?

So we also need replica redundancy: each Partition can keep copies on other machines. Then if one machine goes down, only one replica of the Partition is lost, and the other copies survive.

When a Partition has multiple replicas, Kafka elects one of them as the Leader, and the other replicas become Followers.

Only the Leader Partition serves reads and writes; the Follower Partitions synchronize data from the Leader.

Once the Leader Partition goes down, one of the Follower Partitions is elected as the new Leader and takes over reads and writes, which gives us a high availability architecture.
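
This leader/follower layout is visible per partition through the AdminClient. A minimal sketch, assuming the Topic was created with a replication factor above 1 and a broker is reachable at localhost:9092:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.List;
import java.util.Properties;

public class DescribeReplication {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("user_access_log_topic"))
                    .allTopicNames().get()   // Kafka clients 3.1+; older versions use .all()
                    .get("user_access_log_topic");
            for (TopicPartitionInfo p : desc.partitions()) {
                // One Leader per partition; the remaining replicas are Followers.
                System.out.printf("partition %d: leader=%s replicas=%s%n",
                        p.partition(), p.leader(), p.replicas());
            }
        }
    }
}
```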


Kafka write data loss problem

Now, under what circumstances can Kafka lose written data? It is actually quite simple. As we know, data is written to the Leader of a Partition, and the Followers of that Partition then synchronize the data from the Leader.

But what if a piece of data has just been written to the Leader Partition, and the machine hosting that Leader suddenly goes down before the data can be synchronized to any Follower?

Suppose a piece of data has not yet been synchronized to Partition0's Follower when the machine hosting Partition0's Leader suddenly goes down.

At that point, Partition0's Follower is elected as the new Leader and keeps serving requests. But now consumers cannot read the data that was just written, can they?

Right, because that data was never synchronized to Partition0's Follower, the new Leader simply does not have it. This is exactly how written data gets lost.

What is the ISR mechanism of Kafka?

Let's set that problem aside for a moment and first look at a core mechanism in Kafka: the ISR mechanism.

Simply put, Kafka automatically maintains an ISR (in-sync replica) list for each Partition. The list always contains the Leader, plus the Followers that are in sync with the Leader.

In other words, as long as a Follower keeps its data synchronized with the Leader, it stays in the ISR list.

But if a Follower cannot fetch data from the Leader in time because of some problem of its own, it is considered "out-of-sync" and is kicked out of the ISR list.

So the key thing to understand about the ISR is this: for each Partition, Kafka automatically monitors and maintains the set of Followers that are keeping up with the Leader's data.
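
Building on the describeTopics sketch above, one hedged way to spot Followers that have fallen out of the ISR is to diff each partition's full replica set against its ISR (how quickly a lagging Follower is evicted is governed by the broker setting replica.lag.time.max.ms):

```java
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class IsrCheck {
    // Pass in the TopicDescription fetched in the earlier describeTopics sketch.
    static void printOutOfSyncFollowers(TopicDescription desc) {
        for (TopicPartitionInfo p : desc.partitions()) {
            Set<Integer> replicas = p.replicas().stream()
                    .map(Node::id).collect(Collectors.toSet());
            Set<Integer> isr = p.isr().stream()
                    .map(Node::id).collect(Collectors.toSet());

            Set<Integer> outOfSync = new HashSet<>(replicas);
            outOfSync.removeAll(isr); // replicas Kafka has kicked out of the ISR
            if (!outOfSync.isEmpty()) {
                System.out.printf("partition %d: out-of-sync brokers %s%n",
                        p.partition(), outOfSync);
            }
        }
    }
}
```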

How can data written to Kafka be guaranteed not to be lost?

So, to guarantee that data written to Kafka is not lost, the following conditions must hold:

Each Partition must have at least one Follower in its ISR list that is keeping up with the Leader.

Every write must succeed not only on the Partition's Leader but also on at least one Follower in the ISR.

If these two conditions are not met, the write keeps failing, and the producer keeps retrying until both conditions are satisfied and the write is finally acknowledged as successful.

Configure the corresponding parameters along these lines, and data written to Kafka will not be lost; a sketch of such a configuration follows.
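
Here is a minimal sketch of a producer configured along those lines. The broker address is an assumption, and the topic/broker settings appear only as comments because they live in the server or topic configuration, not in producer code:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class DurableProducerConfig {
    public static KafkaProducer<String, String> build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Condition 2: the Leader AND the Followers currently in the ISR must
        // all acknowledge a write before it counts as successful.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        // Keep retrying retriable failures instead of dropping the record.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // no duplicates across retries

        // Condition 1 is enforced on the broker/topic side, e.g.:
        //   min.insync.replicas=2                 -> at least Leader + 1 Follower in the ISR
        //   unclean.leader.election.enable=false  -> never elect an out-of-sync replica as Leader
        return new KafkaProducer<>(props);
    }
}
```

With acks=all and min.insync.replicas=2, a write is only acknowledged once the Leader and at least one in-sync Follower have it, which is exactly the two-copies guarantee described above.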

Good! Now let's analyze these requirements.

First, each Partition must have at least one Follower in its ISR list.

That is a hard requirement: if the Leader has no Follower at all, or its Followers cannot synchronize data in time, durability simply cannot be guaranteed.

Second, every write must succeed not only on the Leader but also on at least one Follower in the ISR.

This requirement ensures that every written record lands on both the Leader and a Follower, so there are always at least two copies of each piece of data.

Then, if the Leader goes down, Kafka can switch over to that Follower, which already holds the just-written data, so nothing is lost.

Conversely, if the Leader has no in-sync Follower, or the Leader goes down right after a write before the data could be synchronized to any Follower, the write must be treated as failed.

In that case, the producer keeps retrying until the cluster recovers and both conditions are met again, and only then continues writing. This is how data written to Kafka is kept from being lost.
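
Reusing the producer sketched above, the failure path could look like this: with acks=all, a write against a shrunken ISR is rejected by the broker and retried automatically, and only surfaces in the send callback once retries are exhausted (the topic and payload are example values):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasException;

public class RetryingSend {
    public static void send(KafkaProducer<String, String> producer) {
        ProducerRecord<String, String> record =
                new ProducerRecord<>("user_access_log_topic", "user_42", "{\"event\":\"click\"}");

        producer.send(record, (metadata, exception) -> {
            if (exception == null) {
                // Acknowledged by the Leader and the in-sync Followers: safe.
            } else if (exception instanceof NotEnoughReplicasException) {
                // The ISR stayed below min.insync.replicas for the whole retry
                // window (delivery.timeout.ms): park the record for replay, alert.
            } else {
                // Non-retriable failure: log and handle explicitly.
            }
        });
        producer.flush(); // block until in-flight records are acknowledged or failed
    }
}
```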

That is how to ensure that data written to Kafka is not lost in the event of a sudden outage. These are points you are likely to run into in day-to-day work, and I hope this article has taught you something new.
