How does Kafka ensure the reliability and consistency of messages?

2025-07-11 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains in detail how Kafka ensures the reliability and consistency of messages.

Kafka relies mainly on the ISR mechanism to guarantee message reliability. The following questions walk through how Kafka ensures message reliability and consistency.

What is the ISR in Kafka?

ZooKeeper stores the AR (Assigned Replicas) list, which contains all replicas of a partition, where AR = ISR + OSR.

ISR (in-sync replicas): a set of synchronized replicas that Kafka maintains dynamically. As long as the ISR has members, only a member of this set can become leader. With acks=all, every replica in the ISR must receive a message before it counts as committed. Whenever the leader fails, a follower from the ISR is elected as the new leader. A replica in the ISR that is judged to have failed or fallen behind is removed from the ISR; once it catches up with the leader's log again, it rejoins the ISR.

OSR (out-of-sync replicas): replicas that do not have to receive a message before it is acknowledged. Whether an OSR replica has copied the leader's data does not affect whether a message is committed. Followers in the OSR replicate from the leader on a best-effort basis, so their data may lag behind.
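The relationship AR = ISR + OSR can be illustrated with a small in-memory model. This is only a sketch: the `Partition` class, its method names, and the broker ids are hypothetical and do not correspond to Kafka's internal code.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy model of one partition's replica sets (broker ids are illustrative)."""
    assigned: set[int]                               # AR: every replica of the partition
    in_sync: set[int] = field(default_factory=set)   # ISR

    @property
    def out_of_sync(self) -> set[int]:
        # OSR is whatever is assigned but not currently in sync: AR = ISR + OSR
        return self.assigned - self.in_sync

    def shrink_isr(self, broker_id: int) -> None:
        """Remove a replica that fell behind or died."""
        self.in_sync.discard(broker_id)

    def expand_isr(self, broker_id: int) -> None:
        """Re-admit a replica once it has caught up with the leader."""
        if broker_id in self.assigned:
            self.in_sync.add(broker_id)


partition = Partition(assigned={1, 2, 3}, in_sync={1, 2, 3})
partition.shrink_isr(3)      # broker 3 lags behind and moves to the OSR
print(partition.in_sync, partition.out_of_sync)
partition.expand_isr(3)      # broker 3 catches up and rejoins the ISR
print(partition.in_sync, partition.out_of_sync)
```

At every point the union of the two sets equals the assigned set, which is all that AR = ISR + OSR states.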

How does Kafka control how many replicas must receive a message before the producer's write is acknowledged?

When writing to Kafka, the producer chooses how many acknowledgements to wait for through the acks setting: 0 (do not wait for any acknowledgement), 1 (wait only for the leader to write the message), or -1/all (wait for all replicas, where "all replicas" means the replicas in the ISR).

Note that acknowledgement by "all replicas" does not guarantee that every assigned replica has received the message. By default, with acks=all, the message is acknowledged as soon as all replicas currently in sync (the replicas in the ISR) have received it. Kafka's delivery guarantee can therefore be read as follows: a message that was not successfully delivered carries no guarantee, while a "successfully delivered" message will not be lost as long as at least one fully synchronized replica in the ISR survives.
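The difference between the acks levels can be sketched as a small decision function. This is a simplified model, not Kafka's producer or broker code: the function name `is_acknowledged` and its parameters are assumptions made for illustration.

```python
def is_acknowledged(acks: str, leader_written: bool,
                    isr: set[int], replicated_to: set[int]) -> bool:
    """Return True when the producer would treat the send as successful.

    acks:           "0", "1", or "all" (alias "-1")
    leader_written: the leader appended the record to its own log
    isr:            broker ids currently in the ISR (leader included)
    replicated_to:  broker ids that hold the record (leader included)
    """
    if acks == "0":
        return True                      # fire and forget: no response is awaited
    if acks == "1":
        return leader_written            # the leader's local write is enough
    if acks in ("all", "-1"):
        # every member of the ISR must hold the record; OSR members are ignored
        return leader_written and isr <= replicated_to
    raise ValueError(f"unsupported acks value: {acks!r}")


# Three assigned replicas (1, 2, 3), but broker 3 is out of sync.
print(is_acknowledged("all", True, isr={1, 2}, replicated_to={1, 2}))
```

In the example, the write is acknowledged with acks=all even though broker 3 never received it, because broker 3 is not in the ISR.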

What are the conditions for a Kafka node to be considered alive?

1. The node must maintain its session with ZooKeeper, which ZooKeeper checks through its heartbeat mechanism.

2. If the node is a follower (a replica node), it must replicate the leader without falling too far behind. "Too far behind" can be measured in two ways:

2.1 By data: the follower's log differs from the leader's by too many messages. This criterion has a drawback: when a producer suddenly sends a burst of messages and congests the network, many followers are slowed down at once, fall behind by a large number of messages, and are removed from the ISR.

2.2 By time: too much time has passed since the follower's last replication request to the leader. This threshold is set with replica.lag.time.max.ms. This criterion avoids the problem caused by the first one.
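The contrast between the two criteria can be sketched as follows. The `FollowerState` class, the function names, and the threshold values are illustrative assumptions; the time check is modeled on the description above of replica.lag.time.max.ms and is simplified (real brokers may also take into account whether the follower has caught up to the leader's log end within that window).

```python
from dataclasses import dataclass


@dataclass
class FollowerState:
    messages_behind: int       # leader's log end minus follower's log end
    ms_since_last_fetch: int   # time since the follower last fetched from the leader


def lagging_by_count(follower: FollowerState, max_messages: int) -> bool:
    """Message-count criterion: too many messages behind the leader."""
    return follower.messages_behind > max_messages


def lagging_by_time(follower: FollowerState, max_lag_ms: int) -> bool:
    """Time criterion: no fetch request for too long."""
    return follower.ms_since_last_fetch > max_lag_ms


MAX_MESSAGES = 4000    # hypothetical threshold
MAX_LAG_MS = 10_000    # hypothetical threshold

# A healthy follower during a producer burst: far behind, but still fetching.
burst = FollowerState(messages_behind=6000, ms_since_last_fetch=200)
# A stuck follower: few messages behind, but it stopped fetching.
stuck = FollowerState(messages_behind=10, ms_since_last_fetch=30_000)

print(lagging_by_count(burst, MAX_MESSAGES), lagging_by_time(burst, MAX_LAG_MS))
print(lagging_by_count(stuck, MAX_MESSAGES), lagging_by_time(stuck, MAX_LAG_MS))
```

The count criterion evicts the healthy follower during the burst, while the time criterion keeps it and flags only the follower that stopped fetching.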

How does Kafka recover a partition after a broker goes down?

Kafka has a partition recovery mechanism for restoring partitions after a failure.

Each partition records a RecoveryPoint on disk: the highest offset that has been flushed to disk. When a broker restarts after a failure, it runs loadLogs. It first reads the partition's RecoveryPoint, then finds the segment that contains the RecoveryPoint together with all later segments; these are the segments that may not have been fully flushed to disk. It then calls recover on each of those segments, which rereads the segment's messages and rebuilds its index.

Advantages:

Partition data is managed in segments, which simplifies data life-cycle management and makes expired data easy to delete.

Recovery after a crash and restart is faster, because only the segments that were not fully flushed to disk need to be recovered.
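Selecting which segments need recovery can be sketched with a binary search over segment base offsets. This is a minimal illustration, not Kafka's loadLogs implementation: the function name and the representation of segments as a sorted list of base offsets are assumptions.

```python
from bisect import bisect_right


def segments_to_recover(segment_base_offsets: list[int],
                        recovery_point: int) -> list[int]:
    """Return the base offsets of segments that must be re-scanned.

    segment_base_offsets: sorted base offset of every segment in the partition
    recovery_point:       highest offset known to be flushed to disk
    """
    if not segment_base_offsets:
        return []
    # Index of the segment whose offset range contains the recovery point.
    idx = max(bisect_right(segment_base_offsets, recovery_point) - 1, 0)
    # That segment and everything after it may hold unflushed data.
    return segment_base_offsets[idx:]


# Four segments starting at offsets 0, 1000, 2000 and 3000.
print(segments_to_recover([0, 1000, 2000, 3000], recovery_point=2350))
```

Here only the segments starting at 2000 and 3000 are re-scanned; the first two are trusted as fully flushed, which is why recovery is fast.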

What causes a replica to fall out of sync with the leader?

Slow replica: the follower cannot catch up with the leader within a certain period of time. One of the most common causes is an I/O bottleneck that makes the follower append the copied messages more slowly than it pulls them from the leader.

Stuck replica: the follower stops sending fetch requests to the leader for a certain period of time, typically because of a GC pause or because the follower has failed or died.

Newly started replica: when a user increases the replication factor of a topic, the new followers are not in the ISR until they have fully caught up with the leader's log.

A follower that lags far enough behind the leader is considered out of sync, or lagging.

As mentioned above, Kafka has two ways to decide whether a replica is lagging: the maximum number of messages by which the replica may trail the leader (replica.lag.max.messages), or the maximum time the replica may go without fetching from the partition leader (replica.lag.time.max.ms). The former detects slow replicas, while the latter detects failed or dead replicas. Note that replica.lag.max.messages was removed in Kafka 0.9.0, so newer versions rely on replica.lag.time.max.ms alone.

What if all replicas in the ISR die?

1. There are two options: keep the partition unavailable and wait for a replica from the ISR to recover (hoping that the recovered replica still has its data), or elect the first replica that comes back (which is not necessarily in the ISR) as leader. The choice is a trade-off between availability and consistency.

2. Waiting for an ISR replica suits scenarios where message loss is not acceptable, that is, where consistency matters more than availability. It can be enforced in two ways:

2.1 Set a minimum number of in-sync replicas (min.insync.replicas). If the current ISR size is at least the minimum, the partition accepts writes; if it is below the minimum, the partition rejects writes. This prevents writes from being accepted when too few replicas are in sync. The minimum only takes effect when acks=all.

2.2 Disable unclean leader election. When all replicas in the ISR are unavailable, a replica from the OSR cannot become leader; the partition simply stays unavailable until a replica from the ISR recovers and can be elected leader.

3. Electing the first available replica as leader suits scenarios where availability matters more than consistency. This was Kafka's default behavior in older versions when all ISR replicas were dead; since Kafka 0.11.0 the default is to disallow it. The behavior is controlled by unclean.leader.election.enable, and setting it to false gives the first approach.
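The two safeguards and the availability trade-off can be sketched as two small functions. The configuration names min.insync.replicas and unclean.leader.election.enable are real Kafka settings; the functions `accepts_write` and `elect_leader` and their parameters are hypothetical and only model the decision logic described above.

```python
from typing import Optional


def accepts_write(isr_size: int, min_insync_replicas: int, acks: str) -> bool:
    """With acks=all, reject writes when the ISR is smaller than the minimum."""
    if acks in ("all", "-1"):
        return isr_size >= min_insync_replicas
    return True     # the minimum is not enforced for acks=0 or acks=1


def elect_leader(assigned: list[int], isr: set[int], alive: set[int],
                 unclean_election: bool) -> Optional[int]:
    """Pick a new leader, or return None if the partition must stay offline."""
    for broker in assigned:                 # prefer a live member of the ISR
        if broker in isr and broker in alive:
            return broker
    if unclean_election:                    # availability over consistency
        for broker in assigned:
            if broker in alive:
                return broker               # may be missing committed messages
    return None                             # consistency over availability


# ISR was {1, 2}; both are down and only the out-of-sync broker 3 is alive.
print(elect_leader([1, 2, 3], isr={1, 2}, alive={3}, unclean_election=False))
print(elect_leader([1, 2, 3], isr={1, 2}, alive={3}, unclean_election=True))
```

With unclean election disabled the partition stays leaderless until broker 1 or 2 returns; with it enabled, broker 3 takes over at the risk of losing messages it never replicated.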

So how do replicas in the ISR synchronize?

A broker tracks three kinds of offsets: the base offset, the high watermark (HW), and the log end offset (LEO).

Base offset: the starting offset, that is, the offset of the first message in the replica.

HW: the replica's high watermark, marking the position of the latest committed message in the replica. The leader's HW marks the range of messages that are actually committed. Every replica has its own HW, but only the leader's HW is authoritative. In other words, the HW is updated only after the message has been replicated as required by the configured settings (after the followers have successfully copied it), which means the message should in theory not be lost and can be considered "committed".

LEO: the log end offset, the offset of the next message to be written in the replica. Note that it is the next message to be written, not the last message written. The LEO also indicates a follower's synchronization progress. So the HW marks the position up to which data has been synchronized, the LEO marks the latest position written, and only data before the HW is visible to consumers. Now let's look at what happens inside the broker from the moment it receives a message until it returns a response.

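1. The broker receives the producer's request.

2. The leader receives the message and writes it successfully; its LEO increases by 1.

3. The followers fetch the message from the leader and write it successfully; each follower's LEO increases by 1.

4. Once all replicas in the ISR have written the message, the leader's HW increases by 1.

5. The message can now be consumed, and the broker returns a success response to the producer.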
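The sequence above can be simulated with a toy log in which list indices play the role of offsets. This is a simplified sketch under the assumption that the leader's HW is the smallest LEO across the ISR; the `ReplicaLog` class and the function names are hypothetical, not Kafka's internal API.

```python
class ReplicaLog:
    """Toy replica log; list indices play the role of offsets."""

    def __init__(self) -> None:
        self.messages: list[str] = []
        self.high_watermark = 0

    @property
    def leo(self) -> int:
        # Log end offset: the offset of the next message to be written
        return len(self.messages)


def leader_append(leader: ReplicaLog, message: str) -> None:
    """The leader writes the message, so its LEO grows by one."""
    leader.messages.append(message)


def follower_fetch(leader: ReplicaLog, follower: ReplicaLog) -> None:
    """The follower copies everything it is missing and learns the leader's HW."""
    follower.messages.extend(leader.messages[follower.leo:])
    follower.high_watermark = min(follower.leo, leader.high_watermark)


def advance_hw(leader: ReplicaLog, isr_followers: list[ReplicaLog]) -> int:
    """The leader's HW is the smallest LEO among the leader and ISR followers."""
    leader.high_watermark = min(r.leo for r in [leader, *isr_followers])
    return leader.high_watermark


leader, f1, f2 = ReplicaLog(), ReplicaLog(), ReplicaLog()
leader_append(leader, "m0")
follower_fetch(leader, f1)
print(advance_hw(leader, [f1, f2]))   # f2 has not fetched yet, so the HW stays at 0
follower_fetch(leader, f2)
print(advance_hw(leader, [f1, f2]))   # every ISR replica has m0, so the HW moves to 1
```

Consumers in this model would only be allowed to read offsets below `leader.high_watermark`, which is why "m0" becomes visible only after the second fetch.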

[Figure: diagram of the above process, from producer request to HW update]

With that settled, how does Kafka choose the leader?

A common way to elect a leader is majority vote, as used for example by Redis. Kafka does not use majority vote; it defines its quorum differently.

A quorum is a voting technique commonly used in distributed systems, mainly to ensure data consistency through data redundancy. In Kafka, the quorum is implemented by the ISR: the replicas in the ISR are the ones eligible to be elected leader.

After the leader goes down, a new leader can only be chosen from the ISR. Whichever ISR replica is chosen, it has all the data before the HW, so after the leader switch consumers can continue to see the data committed before the HW.

HW truncation mechanism: when a new leader is elected, it is not guaranteed to have all of the previous leader's data, only the data before the HW. All followers must therefore truncate their logs to the HW position and then synchronize from the new leader to keep the data consistent. When the failed leader recovers and finds that the new leader's data differs from its own, it truncates its log to the HW position it had before going down and then synchronizes from the new leader. In other words, the recovered leader rejoins as a follower and synchronizes data like any other follower.
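The truncate-then-resynchronize step can be sketched with plain lists standing in for logs. This is an illustrative model of the HW truncation described above, with hypothetical function names; it assumes the replica and the new leader agree on everything before the replica's HW.

```python
def truncate_to_hw(log: list[str], high_watermark: int) -> list[str]:
    """Drop every message at or beyond the high watermark."""
    return log[:high_watermark]


def resync(replica_log: list[str], replica_hw: int,
           new_leader_log: list[str]) -> list[str]:
    """Truncate a replica to its HW, then copy the rest from the new leader."""
    kept = truncate_to_hw(replica_log, replica_hw)
    return kept + new_leader_log[len(kept):]


# The old leader wrote "c" and "d" but crashed before they were committed (HW = 2).
old_leader_log = ["a", "b", "c", "d"]
# The new leader only had "a" and "b", and has since accepted "x".
new_leader_log = ["a", "b", "x"]

print(resync(old_leader_log, replica_hw=2, new_leader_log=new_leader_log))
```

After the resync the recovered replica holds exactly the new leader's log; the uncommitted "c" and "d" are discarded, which is what keeps all replicas consistent.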
