
A discussion of the possibility of data loss in Kafka, and why Kafka achieves high throughput


Friday, 2019-2-22

Why is high throughput one of Kafka's advantages? See the following four points.

1. When creating a topic, you can specify the number of partitions. The more partitions, the higher the throughput, but also the more resources required and the higher the risk of unavailability. After receiving a message from a producer, Kafka stores it in one of the partitions according to its balancing policy. Because each message is appended to its Partition, writes are sequential disk writes, which is very efficient (sequential disk writes have been shown to be faster than random writes to memory, and this is an important guarantee of Kafka's high throughput).
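As an illustration only (not part of the original article), here is a minimal sketch of creating a topic with a chosen number of partitions using the Java AdminClient; the topic name, partition count, replication factor and broker address are all assumed values.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.0.151:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions and replication factor 2 are illustrative values, not recommendations
            NewTopic topic = new NewTopic("order_info", 6, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}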

2. We know that data in Kafka is not kept forever. By default it is retained for 7 days (168 hours), after which it is deleted. Of course, this can be adjusted through configuration parameters.

Therefore, Kafka provides two strategies for deleting old data: one based on time, the other based on Partition file size. For example, you can configure $KAFKA_HOME/config/server.properties to let Kafka delete data older than one week, or delete old segments once a Partition file exceeds 1GB:

log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
log.cleaner.enable=false

Deleting old data in this way has no impact on Kafka's performance, so the choice of deletion policy only depends on disk capacity and specific requirements.
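Retention can also be overridden per topic at runtime rather than only in server.properties. The following is a minimal sketch with the Java AdminClient; the topic name, retention value and broker address are assumptions for illustration.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TopicRetentionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.0.151:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "order_info"); // assumed topic
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "604800000"), // 7 days, a per-topic override of the broker default
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, Collections.singletonList(setRetention))).all().get();
        }
    }
}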

The offset is controlled by the Consumer. Because the offset is tracked by the Consumer, the Kafka broker is stateless: it does not need to mark which messages have been consumed, nor does it need a locking mechanism to ensure that only one Consumer in a given Consumer Group consumes a message. This also strongly supports Kafka's high throughput.
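"Offset is controlled by the Consumer" can be seen directly in the Java client: the consumer may position itself anywhere in a partition with seek(), and the broker does not track this. A minimal sketch, with broker address, topic and offset all assumed:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.0.151:9092"); // assumed
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("order_info", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 42L); // the consumer, not the broker, decides where reading starts
            consumer.poll(Duration.ofSeconds(1));
        }
    }
}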

3. With Partitions, different messages can be written in parallel to different Partitions on different brokers, which greatly improves throughput.

4. When using the consumer high-level API, a message in a topic can only be consumed by one consumer within a given consumer group, but multiple consumer groups can consume the same message at the same time.

How to ensure reliable delivery of messages? // in other words, where can Kafka lose data?

https://blog.csdn.net/qq_36236890/article/details/81174504 // this link is very important

Kafka also has several failover mechanisms // reference link: https://www.cnblogs.com/qingyunzong/p/9004593.html

There are several possible delivery guarantees:

At most once: messages may be lost, but are never delivered more than once.

At least once: messages are never lost, but may be delivered more than once.

Exactly once: every message is delivered once and only once, which is what users want in many cases.

Possibility of losing data 1

When a Producer sends a message to the broker, once the message has been committed and replication is in place, it will not be lost. However, if the network fails after the Producer sends the data, the Producer cannot tell whether the message was committed. Although Kafka cannot determine what happened during a network failure, the Producer can generate something like a primary key and retry idempotently on failure, thereby achieving Exactly once.

// This covers the possible loss of messages, and the corresponding solution, on the path from the Producer to the Kafka broker.
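In the newer Java producer, the "primary key plus idempotent retry" idea described above is available as the enable.idempotence setting. A minimal configuration sketch (broker address assumed; this is one way to configure it, not the article's own code):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.0.151:9092"); // assumed
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // broker de-duplicates retried sends
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // required for idempotence
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.close();
    }
}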

Possibility of losing data 2

Next, we discuss the delivery guarantee semantics from the broker to the Consumer (Kafka consumer high-level API only). After the Consumer reads a message from the broker, it can commit, which stores the offset it has read in that Partition into ZooKeeper. The next time the Consumer reads from that Partition, it starts from the next message.

If the Consumer does not commit, the next read starts from the same position as after the previous commit. Of course, you can also set the Consumer to auto-commit, i.e., it commits automatically as soon as it has read the data. If we only consider the process of reading messages, Kafka provides Exactly once in this sense.

In practice, however, the application does not stop after the Consumer reads the data; it needs to process it further, and the relative order of processing and committing largely determines the delivery guarantee semantics between broker and consumer.

// This covers the possible loss of messages, and the corresponding solution, on the path from the Kafka broker to the Consumer.
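A minimal sketch of the "process first, commit afterwards" ordering discussed above, using the Java consumer with auto-commit disabled; topic, group id and broker address are assumptions. Committing only after processing gives at-least-once behavior: a crash before the commit causes re-delivery rather than loss.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.0.151:9092"); // assumed
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "group-a");                      // assumed
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");              // commit manually
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("order_info"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // do the real work first ...
                }
                consumer.commitSync(); // ... then commit: a crash before this line means re-delivery, not loss
            }
        }
    }
    private static void process(ConsumerRecord<String, String> record) { /* application logic */ }
}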

By default Kafka guarantees At least once (messages are never lost but may be redelivered). At most once can be achieved by letting the Producer commit asynchronously. Exactly once requires cooperation with an external storage system; fortunately the offsets provided by Kafka can be used for this very directly and easily.

The process by which the producer delivers a message to the broker:

1. When the Producer publishes a message to a Partition, it first finds the Leader of that Partition through ZooKeeper. Then, no matter what the Topic's Replication Factor is, the Producer sends the message only to the Partition's Leader.

2. The Leader writes the message to its local Log.

3. Each Follower pulls data from the Leader, so the order of the data stored on a Follower is the same as on the Leader.

4. After receiving the message and writing it to its Log, the Follower sends an ACK to the Leader.

5. Once the Leader has received ACKs from all Replicas in the ISR, the message is considered committed.

6. The Leader then advances the HW (high watermark) and sends an ACK to the Producer.

Tip:

To improve performance, each Follower sends an ACK to the Leader as soon as it receives the data, rather than waiting until the data has been written to its Log.

Note:

Therefore, for committed messages, Kafka can only guarantee that they are held in the memory of multiple Replicas, not that they have been persisted to disk, so it cannot fully guarantee that a message will still be consumable after an exception occurs. Consumers also read messages from the Leader, and only committed messages are exposed to Consumers. // This is possible data loss scenario 3.
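To narrow the window described in this note, the usual approach (a sketch with assumed values, not taken from the article) is to combine acks=all on the producer with a min.insync.replicas setting on the broker or topic:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducerSketch {
    public static void main(String[] args) {
        // Broker/topic side (illustrative): min.insync.replicas=2 makes an acks=all write fail
        // if fewer than 2 ISR replicas are available, instead of silently weakening the guarantee.
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.0.151:9092"); // assumed
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for every replica in the ISR before the send counts as committed
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.close();
    }
}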

What are the producer's synchronous and asynchronous modes?

Synchronous mode

If the Producer uses synchronous mode, it throws an Exception after retrying message.send.max.retries times (3 by default), and the user can choose either to stop sending subsequent data or to continue. The former blocks the data flow, while the latter loses the data that should have been sent to the Broker.

Asynchronous mode

If the Producer uses asynchronous mode, it retries message.send.max.retries times (3 by default), then logs the exception and continues sending subsequent data. This results in data loss, and the user can only discover the problem from the logs. At the same time, this Producer does not provide a callback interface for asynchronous mode.

Tip:

Kafka's replication strategy is neither purely synchronous nor purely asynchronous. Both synchronous and asynchronous producer modes can lose data.

Synchronous mode: a send finishes only once it has completed; asynchronous mode: the user does not notice lost data and can only find it in the logs.
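The missing callback noted above applies to the old Scala producer; the current Java producer does let an asynchronous send report failure through a Callback, so the application is not limited to reading logs. A minimal sketch with assumed topic and broker address:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AsyncSendSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.0.151:9092"); // assumed
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record = new ProducerRecord<>("order_info", "some-value");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    // the asynchronous send failed after all retries: log it, alert, or re-queue the message
                    System.err.println("send failed: " + exception);
                }
            });
            producer.flush();
        }
    }
}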

The guarantee of Kafka's high reliability comes from its robust replica (replication) strategy.

Reference link: https://www.cnblogs.com/qingyunzong/p/9004703.html

The Leader tracks the list of Replicas that are in sync with it, called the ISR (in-sync Replicas).

The replication mechanism of Kafka is neither complete synchronous replication nor simple asynchronous replication.

Kafka's use of the ISR strikes a good balance between not losing data and throughput. Followers can copy data from the Leader in batches, which greatly improves replication performance (bulk writes to disk) and greatly reduces the gap between Follower and Leader. // batch replication, protection against data loss 1

A message is considered committed only after all Followers in the ISR have copied it from the Leader. This prevents a situation where some data has been written to the Leader but the Leader goes down before any Follower has replicated it, which would lose data (the Consumer could not consume it). // protection against data loss 2

What information does ZooKeeper record for Kafka?

1. Broker information, such as which Kafka nodes exist.

2. The partition information of each topic and the leader of each partition.

3. Controller registration information.

4. controller_epoch information:

[zk: 192.168.0.151 (CONNECTED) 39] get /kafkagroup/controller_epoch

1 // This value is a number. It starts at 1 when the first broker in the Kafka cluster starts. Whenever the central controller broker in the cluster changes or dies, a new central controller is elected, and each time the controller changes, the controller_epoch value is incremented by 1.

cZxid = 0x1500000049
ctime = Sun Jan 27 16:33:22 CST 2019
mZxid = 0x1500000049
mtime = Sun Jan 27 16:33:22 CST 2019
pZxid = 0x1500000049
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 1
numChildren = 0

5. Topics marked for deletion, e.g. [zk: 192.168.0.151 (CONNECTED) 41] ls /kafkagroup/admin/delete_topics

6. Consumer and consumer group information (consumers, consumer groups).

How can we ensure that data consumed from a Kafka topic is completely ordered, i.e., ordered across different partitions?

1. Order is only guaranteed within a single partition; comparing positions across two different partitions is meaningless. The data inside a partition is ordered, but ordering is lost across partitions. If a topic has multiple partitions, the ordering of the data cannot be guaranteed at consumption time. In scenarios where strict message ordering is required, the number of partitions must be set to 1.

2. Each piece of data written to Kafka is written to a random partition of the topic, with one exception: if you attach a key to the data, that key determines which partition the data goes to.
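A minimal sketch of such a keyed write with the Java producer (topic, key, values and broker address are assumed): records that share a key are hashed to the same partition, so their relative order is preserved.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedSendSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.0.151:9092"); // assumed
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // All records with key "order-1001" land in the same partition, so they stay in order.
            producer.send(new ProducerRecord<>("order_info", "order-1001", "created"));
            producer.send(new ProducerRecord<>("order_info", "order-1001", "paid"));
        }
    }
}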

By default, Kafka allows consumers to be grouped (Consumer Group). For example, if two consumers A and B in the same group consume the topic order_info, the messages consumed by A and those consumed by B do not overlap.

For example, if order_info contains 100 messages, each with an id numbered from 0 to 99, then A might consume numbers 0-49 while B consumes numbers 50-99.

// In a production environment, can multiple consumers consume data from the same topic? Yes, but this needs to be set up and tuned.

Implementation method 1: this can be done in code.

When using the Consumer high-level API, a message of a given Topic can only be consumed by one Consumer within the same Consumer Group, but multiple Consumer Groups can consume that message at the same time.

Method 2: put consumers A and B into two different consumer groups; different consumer groups can consume the same topic's data.
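A minimal sketch of method 2, assuming group ids group-A and group-B: each group independently receives every message of the topic, while within a single group each message goes to only one member.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TwoGroupsSketch {
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.0.151:9092"); // assumed
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("order_info")); // assumed topic
        return consumer;
    }
    public static void main(String[] args) {
        KafkaConsumer<String, String> a = consumerFor("group-A"); // receives every message
        KafkaConsumer<String, String> b = consumerFor("group-B"); // also receives every message
        a.poll(Duration.ofSeconds(1));
        b.poll(Duration.ofSeconds(1));
        a.close();
        b.close();
    }
}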
