What are the causes and solutions of data loss, duplication and disorder in Producer transmission?

2025-01-17 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article explains why data loss, duplication and disorder occur during Producer transmission and how to solve them. The coverage is fairly detailed; interested readers can use it as a reference, and I hope you find it helpful.

In a production environment, producers and Kafka clusters often run into problems such as message loss, duplication and disorder. Below we explain the causes of these problems and their solutions.

1. We know that Kafka uses a multi-replica storage mechanism to guarantee data reliability.

Suppose a Topic is split into three Partitions, namely PartitionA, PartitionB and PartitionC, and each Partition has two replicas. For PartitionA, for example, one replica is the leader and the other is a follower, and the two replicas are distributed on different machines.

In general, we configure 3 replicas for Topic data in a production environment. Even if one broker goes down, the data is not completely lost, because copies of the data still exist on other brokers.
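To make the layout above concrete, here is a minimal sketch (an illustrative model, not the Kafka API) of round-robin replica assignment, spreading 3 replicas of each of 3 partitions across 3 brokers so each partition's replicas land on distinct machines:

```python
# Illustrative model of replica placement (not real Kafka code):
# each partition's replica list starts on a different broker, so
# leaders are spread evenly and no broker holds two replicas of
# the same partition.
def assign_replicas(num_partitions, replication_factor, brokers):
    assignment = {}
    for p in range(num_partitions):
        replicas = [brokers[(p + i) % len(brokers)]
                    for i in range(replication_factor)]
        assignment[p] = replicas  # replicas[0] acts as the leader
    return assignment

layout = assign_replicas(3, 3, ["broker0", "broker1", "broker2"])
leaders = [replicas[0] for replicas in layout.values()]
```

With this layout, losing any single broker still leaves two replicas of every partition alive.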

2. How is replica data synchronized?

Among a topic's multiple replicas, only the partition leader provides read and write services. A follower just keeps pulling the latest data from the leader to its local log and does not serve reads or writes. How do we know which replicas are in sync with the leader? That is managed through the ISR.

ISR stands for "In-Sync Replicas": the set of follower replicas that are essentially up to date with the leader replica. A replica that synchronizes too slowly is kicked out of the ISR.
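The eviction rule can be sketched as follows. This is a deliberately simplified model based on lag in offsets (real Kafka has used a time-based check, replica.lag.time.max.ms, since 0.9), meant only to show the idea of "too slow gets kicked out":

```python
# Simplified ISR model (illustration only, not broker internals):
# a follower stays in the ISR while its fetched offset is within
# max_lag of the leader's log end offset.
def compute_isr(leader_end_offset, follower_offsets, max_lag):
    isr = ["leader"]  # the leader is always part of the ISR
    for name, offset in follower_offsets.items():
        if leader_end_offset - offset <= max_lag:
            isr.append(name)
    return isr

# follower1 lags by 2 and stays; follower2 lags by 20 and is evicted.
isr = compute_isr(100, {"follower1": 98, "follower2": 80}, max_lag=10)
```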

3. Now let's look at Kafka's Producer-side message delivery semantics, which are controlled by the parameter "acks" and come in three flavors:

a. acks=all:

It means that when the Producer sends a message, after the leader receives it, all the followers in the ISR list that are in sync with the leader must also replicate the message before the write is considered successful. This is the least efficient option but the most reliable.

b. acks=1:

It means that when the Producer sends a message, the write is considered successful as soon as the leader has received it and written it to its local log, regardless of whether the other followers have replicated it yet. The default value is 1.

c. acks=0:

It means the Producer fires the message off without waiting for any acknowledgment; whether it arrived is never confirmed, so messages may be lost. This is the most efficient option but the least reliable.
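The three semantics map directly onto producer configuration. A minimal sketch, using the legacy property name this article uses elsewhere (request.required.acks; the modern Java client calls it simply acks):

```properties
# request.required.acks=0    fire and forget: no acknowledgment, fastest,
#                            messages may be lost silently
# request.required.acks=1    leader-only acknowledgment (the default):
#                            lost if the leader dies before followers sync
# request.required.acks=all  leader plus every ISR follower must have the
#                            record: slowest, most reliable
request.required.acks=all
```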

4. Causes of and solutions for data loss, duplication and disorder on the Producer side

The producer can send data in synchronous or asynchronous mode, controlled by the parameter producer.type. Production systems generally use asynchronous mode to send data in batches, which improves the throughput of the Kafka system.

a. Synchronous mode (producer.type = sync):

With acks=0 there is no acknowledgment of receipt, so a network failure causes data loss. Setting acks to 0 is generally not recommended in production.

With acks=1, the ack is sent as soon as the leader alone has received the message successfully; if the leader then goes down before the replicas have synchronized, data is lost as well.

To avoid both kinds of loss above, we can set acks=all so that a produce request succeeds only once the write has reached all replicas; this is reliable but inefficient.

producer.type = sync
request.required.acks = all

b. Asynchronous mode (producer.type = async):

In asynchronous mode, data is sent through a buffer, controlled by two thresholds: a time threshold and a message-count threshold. If the buffer fills up before the data has been sent and the immediate-discard mode is configured, the risk is very high and data loss is likely. It is generally set to blocking mode instead: queue.enqueue.timeout.ms = -1 means that once the backlog reaches the limit, the background send queue blocks until space is freed.

producer.type = async
request.required.acks = 1
queue.buffering.max.ms = 6000
queue.buffering.max.messages = 10000
queue.enqueue.timeout.ms = -1
batch.num.messages = 500

3) With acks=all, after the data reaches the leader, some of the ISR replicas may have synchronized it when the leader crashes. Either follower1 or follower2 may then become the new leader; the producer receives an exception in response and resends the data, which can produce duplicates. This can be solved with idempotence and transactions, discussed later.
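The duplication mechanism, and how the idempotent producer's (producer id, sequence number) pair prevents it, can be sketched like this. It is a toy model of the broker-side dedup logic, not the real implementation:

```python
# Toy model: the write succeeds but the ack is lost, so the producer
# retries the same batch. Without idempotence the broker appends it
# again; with idempotence it drops any sequence number it has seen.
class Broker:
    def __init__(self, idempotent):
        self.log = []
        self.last_seq = -1
        self.idempotent = idempotent

    def append(self, seq, record):
        if self.idempotent and seq <= self.last_seq:
            return "duplicate-dropped"  # already persisted this sequence
        self.log.append(record)
        self.last_seq = seq
        return "ok"

plain = Broker(idempotent=False)
plain.append(0, "msg")   # write succeeds, ack lost in transit
plain.append(0, "msg")   # retry -> duplicate stored

dedup = Broker(idempotent=True)
dedup.append(0, "msg")   # write succeeds, ack lost in transit
dedup.append(0, "msg")   # retry carries the same sequence -> dropped
```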

4) With acks=all, if the ISR list is empty and unclean.leader.election.enable is true, one of the other surviving replicas is elected as the new leader, and messages can be lost here as well.

This can be prevented by setting the parameter:

unclean.leader.election.enable = false

Before 0.11.0 this parameter defaulted to true; later versions default to false. It controls whether, when no ISR replica survives at leader election time, a leader may be elected from a follower in the OSR (the out-of-sync replicas); if true, data loss is possible.

5) When sending messages asynchronously, suppose two batches are headed for the same Partition. The first batch fails to be written while the second succeeds; if the first batch then succeeds on retry, the data ends up out of order. We can prevent this by setting a parameter:

Parameter: max.in.flight.requests.per.connection

It sets the size of the in-flight request queue, 5 by default. Requests on their way out are held in this queue, including requests being sent and requests that have not yet received a response; when the queue is full, sending blocks. In other words, it is the maximum number of unacknowledged requests to a single node. Setting it to 1 means the client cannot send another request to the same broker until the previous one has been answered, which prevents message reordering.
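The reordering scenario can be simulated directly. This is a simplified model of the client's retry behavior, not its internals: batch b1 fails once and is retried, while b2 succeeds on the first try:

```python
# Toy simulation: with a wide in-flight window, b1's retry lands after
# b2; capping in-flight requests at 1 forces b1 to succeed before b2
# is ever sent, preserving order.
def send_batches(batches, max_in_flight, fail_first_attempt):
    log, pending = [], list(batches)
    while pending:
        window = pending[:max_in_flight]   # batches sent concurrently
        pending = pending[max_in_flight:]
        retries = []
        for b in window:
            if b in fail_first_attempt:
                fail_first_attempt.remove(b)  # fails once, then succeeds
                retries.append(b)
            else:
                log.append(b)                 # broker appends the batch
        pending = retries + pending           # retries go back to the front
    return log

out_of_order = send_batches(["b1", "b2"], 5, {"b1"})
in_order = send_batches(["b1", "b2"], 1, {"b1"})
```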

The analysis in section 4 shows that with the default configuration Kafka delivers at least once on both sides, which allows duplicates, and exactly once cannot be achieved through those settings alone, as if Kafka messages were bound to be either lost or duplicated. Starting with Kafka 0.11.0.0, idempotence and transaction mechanisms were introduced to solve these problems.
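From 0.11.0.0 onward, these mechanisms are enabled through producer configuration. A minimal sketch using the modern Java-client property names (the transactional.id value is a placeholder; transactions are only needed for atomic multi-partition writes, idempotence alone dedups retries):

```properties
enable.idempotence=true                   # dedup via producer id + sequence numbers
acks=all                                  # required when idempotence is enabled
max.in.flight.requests.per.connection=5   # <= 5 preserves ordering with idempotence
retries=2147483647                        # retry indefinitely; dedup makes it safe
transactional.id=my-producer-1            # placeholder; enables transactions
```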

Additional notes:

1. Kafka sometimes sends data so fast that NIC traffic on the server surges, or the disk becomes too busy, and packets are dropped. In that situation we took the following measures:

a. First, put a rate limit on the Kafka producer.

b. Second, enable the retry mechanism and lengthen the retry interval.

c. Set acks=all, i.e. all replicas in the ISR (in-sync replica list) must acknowledge before a send is considered successful.
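The three measures above correspond roughly to the following settings (modern property names; the quota value is illustrative, and broker-side quotas are normally applied per client entity with the kafka-configs tool rather than in a properties file):

```properties
# a. broker-side quota throttling this producer (bytes/sec, illustrative)
producer_byte_rate=10485760
# b. retries with a longer backoff between attempts
retries=5
retry.backoff.ms=1000
# c. full-ISR acknowledgment
acks=all
```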

That concludes this share on the causes of data loss, duplication and disorder when the Producer sends data, and on their solutions. I hope the content above is helpful and teaches you something new. If you think the article is good, share it so more people can see it.
