2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article analyzes Kafka data loss problems and how to mitigate them. The approach described here is simple and practical, so let's walk through the analysis step by step.
Data loss is a serious problem, and it needs to be diagnosed methodically. Summarizing my experience over this period, my approach to Kafka data loss problems is as follows:
1. Confirm that data has really been lost. For example, colleagues may have been operating on the test environment, so first make sure no third party has interfered with the data.
2. Understand your business process and data flow: where exactly is data being lost, in the link before Kafka or in the processing after it? For example, if Kafka's data is fed by Flume, perhaps Flume lost the data, and Kafka naturally never had it.
3. Establish how the loss was noticed and how to verify it from the business side. In the education industry, for example, data volume surges every year after the college entrance examination; if the volume is abnormally lower than before the exam, or the volume at the source does not match the volume at the destination, data has likely been lost.
4. Determine whether the data was lost before Kafka or within/after it:
Kafka supports data replay: switch to a new consumer group, clear all data at the destination, and re-consume.
If data is being lost on the consumer side, the probability that several replayed runs produce exactly the same result is very low.
If data was lost on the write side, the result should be exactly the same on every run (provided the write side itself has no problem).
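The replay check in step 4 can be sketched as a consumer configuration. This is a minimal sketch: the class and group names are illustrative, and only standard consumer property names are assumed.

```java
import java.util.Properties;

// Sketch: consumer configuration for a replay run used to verify data loss.
// Class and method names are illustrative, not from the original article.
public class ReplayConsumerConfig {
    public static Properties props(String freshGroupId) {
        Properties p = new Properties();
        // A group id that has never committed offsets...
        p.setProperty("group.id", freshGroupId);
        // ...combined with auto.offset.reset=earliest makes the consumer
        // start from the beginning of each partition.
        p.setProperty("auto.offset.reset", "earliest");
        // Commit manually only after the destination write succeeds,
        // so the replayed record counts can be compared reliably.
        p.setProperty("enable.auto.commit", "false");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(props("replay-check-1"));
    }
}
```

Running several replays with fresh group ids and comparing the destination counts distinguishes write-side loss (identical results every run) from consumer-side loss (varying results).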
5. Data lost within the Kafka link itself. Common causes of data loss inside Kafka include:
If auto.commit.enable=true and the consumer crashes after fetching a batch but before fully processing it, while the commit interval has already triggered an offset commit, then the fetched-but-unprocessed data has been marked as committed and will never be fetched again, so it is lost.
Messages are not automatically retried when the network is loaded or the disk is busy, and there is no rate limiting, so sends can exceed the network bandwidth. Kafka producers must be configured with a retry mechanism, and the retry interval should be increased: the default of 1 second is not suitable for production (a network interruption can easily last longer than 1 second).
If a disk fails, the data already flushed to it is lost.
If a single batch exceeds the configured size limit, data is lost and a kafka.common.MessageSizeTooLargeException is thrown. The relevant size settings are:
Consumer side: fetch.message.max.bytes — the largest message the consumer can fetch.
Broker side: replica.fetch.max.bytes — the largest message replicas can exchange within the cluster, which ensures messages are replicated correctly. If this is too small, a large message is never replicated and therefore never committed (fully replicated), so the consumer never sees it.
Broker side: message.max.bytes — the largest message the broker will accept from a producer.
Broker side (per topic): max.message.bytes — the largest message the broker will allow to be appended to the topic. This size is validated pre-compression. (Defaults to the broker's message.max.bytes.)
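As a rough illustration, the four limits above stay mutually consistent when set to the same value; the 10 MB used here is an example only, not a recommendation:

```properties
# Consumer: largest message the consumer will attempt to fetch
fetch.message.max.bytes=10485760
# Broker: largest message accepted from a producer
message.max.bytes=10485760
# Broker: must be >= message.max.bytes, or an oversized message is never
# replicated, never committed, and therefore never seen by consumers
replica.fetch.max.bytes=10485760
# Per-topic override (defaults to the broker's message.max.bytes)
max.message.bytes=10485760
```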
6. The partition leader crashes before the follower replicas finish catching up. Even if a new leader is elected, the data already pushed is lost because it was never replicated! With synchronous replication configured, the data of multiple replicas all sits in PageCache, but the probability of several replicas failing at the same time is much smaller than that of a single replica failing. (The official recommendation is to ensure data integrity through replicas.)
7. Kafka data first lands in PageCache and is flushed to disk periodically, i.e. not every message is persisted immediately. On a power failure or machine crash, the data still in PageCache is lost. The flush interval can be configured via log.flush.interval.messages and log.flush.interval.ms; a longer interval loses more data on a crash, while a shorter one hurts performance. Since version 0.8, the replication mechanism can be used to ensure data is not lost, at the cost of more resources, especially disk; Kafka currently supports GZIP and Snappy compression to alleviate this. Whether to use replicas is a balance between reliability and resource cost.
Kafka also provides configuration parameters for balancing performance against reliability:
Flush data into the log file once the following number of messages has accumulated:
log.flush.interval.messages=10000
Force a flush once the following time (ms) has passed; whichever of interval.ms and interval.messages is reached first triggers the flush:
log.flush.interval.ms=1000
Interval (ms) at which to check whether logs need to be flushed:
log.flush.scheduler.interval.ms=3000
Kafka optimization suggestions
producer end
Design for data reliability and safety: set up data backup according to the number of partitions, configure the replica count, and so on. Push mode: synchronous or asynchronous. Weigh safety requirements against speed when choosing between synchronous and asynchronous push; when data problems arise, you can switch to synchronous mode to locate the problem.
Flushing is Kafka's internal mechanism: Kafka first completes data exchange in memory, then persists the data to disk. Kafka caches data in memory and then flushes it in batches. The flush intervals can be configured via log.flush.interval.messages and log.flush.interval.ms.
The replication mechanism can be used to ensure that data is not lost, at the cost of more resources, especially disk; Kafka currently supports GZIP and Snappy compression to alleviate this. Whether to use replicas depends on the balance between reliability and resource cost.
From broker to consumer, the Kafka consumer provides two interfaces:
The high-level API encapsulates partition and offset management; by default it periodically auto-commits offsets, which may cause data loss.
The low-level API manages the correspondence between spouts and partitions itself, along with the consumed offset on each partition (written periodically to ZooKeeper). The offset is only updated to ZooKeeper after it is acked, i.e. after successful processing, so data loss is basically prevented: even if a spout thread crashes, the offset can still be read back from ZooKeeper after a restart.
With asynchronous push you must consider the case where the partition leader goes down before the follower replicas finish catching up: even if a new leader is elected, the pushed data is lost because it was never replicated!
Do not let the in-memory buffer pool fill up. If it fills and memory overflows, data is being written faster than the Kafka buffer pool can drain it, which will certainly cause data loss.
Try to keep the producer side in a thread-blocking state, so that data is flushed to disk while it is being written to memory.
For asynchronous writes, you can also set batching similar to Flume's rollover settings: trigger a batch based on accumulated message count, accumulated time interval, or accumulated data size.
Set these appropriately and increase the batch size to reduce the number of network IO and disk IO requests; this is how Kafka achieves its efficiency.
However, loss under asynchronous writes is still difficult to control completely.
You still need a stable overall cluster architecture, especially ZooKeeper. When accepting possible asynchronous data loss, at least ensure that the broker side runs stably.
Unlike Hadoop, Kafka's message queue is better at handling small data. For specific services that continuously push large volumes of data (e.g. a web crawler), message compression can be considered; this puts some pressure on the CPU, so choose based on testing against your business data.
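The buffer-pool advice above maps to a few standard producer settings in newer Kafka clients. This is a minimal sketch, with illustrative values; the class name is made up for this example.

```java
import java.util.Properties;

// Sketch of producer settings that keep the in-memory buffer pool from
// filling and dropping data. Property names are standard producer
// configs; the values are illustrative only.
public class ProducerBufferConfig {
    public static Properties props() {
        Properties p = new Properties();
        // Total bytes the producer may buffer before send() blocks.
        p.setProperty("buffer.memory", String.valueOf(64L * 1024 * 1024));
        // How long send() may block when the buffer is full before
        // throwing an exception, instead of failing immediately.
        p.setProperty("max.block.ms", "60000");
        // Batch size in bytes: larger batches mean fewer network/disk
        // IO requests, at the cost of more memory per in-flight batch.
        p.setProperty("batch.size", "16384");
        return p;
    }
}
```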
broker end
Set multiple partitions per topic, and adapt partitions to the machines. To distribute partitions evenly across brokers, the number of partitions should be greater than the number of brokers. Partitions are Kafka's unit of parallel reads and writes and are the key to increasing Kafka's speed.
The maximum message size a broker can receive must be smaller than the maximum a consumer can fetch; otherwise the consumer gets stuck because it cannot consume the message.
The maximum message size a broker can replicate must be greater than the maximum it can accept; otherwise the broker cannot replicate the message, resulting in data loss.
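The two constraints above amount to a simple invariant over the size limits. This sketch checks it; the class and parameter names are illustrative.

```java
// Sketch: sanity check for the size relationships described above.
// The limit values passed in are illustrative only.
public class SizeLimitCheck {
    // Returns true when the limits are mutually consistent:
    // the broker's accept limit must not exceed the replica fetch limit
    // (so messages can replicate) nor the consumer fetch limit
    // (so consumers can read them).
    public static boolean consistent(long messageMaxBytes,
                                     long replicaFetchMaxBytes,
                                     long consumerFetchMaxBytes) {
        return messageMaxBytes <= replicaFetchMaxBytes
            && messageMaxBytes <= consumerFetchMaxBytes;
    }

    public static void main(String[] args) {
        // 1 MB accept limit against 10 MB fetch limits: consistent.
        System.out.println(consistent(1_048_576, 10_485_760, 10_485_760));
    }
}
```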
consumer end
Turn off automatic offset commits and commit the offset manually only after the data has been processed.
Before processing, verify whether the fetched data is data that was already consumed last time; if something is wrong, handle the error first before continuing.
In general, as long as the offsets recorded in ZooKeeper are stable there is no problem, unless multiple consumers in the same group consume from one partition at the same time: whichever commits first wins, and the other's progress is lost.
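A minimal sketch of the pre-consumption duplicate check described above, tracking the last processed offset per partition; all names here are illustrative, and a production version would persist this state alongside the output.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: before processing a record, compare its offset with the last
// offset processed for that partition, so duplicate deliveries after a
// rebalance or replay are skipped. Names are illustrative.
public class OffsetGuard {
    private final Map<Integer, Long> lastProcessed = new HashMap<>();

    // Returns true if the record at (partition, offset) is new and
    // records it as processed; false if it was already seen.
    public boolean shouldProcess(int partition, long offset) {
        long last = lastProcessed.getOrDefault(partition, -1L);
        if (offset <= last) {
            return false; // duplicate delivery: skip, do not reprocess
        }
        lastProcessed.put(partition, offset);
        return true;
    }
}
```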
problem
Kafka data first lands in PageCache and is flushed to disk periodically, so not every message is persisted immediately; on a power failure or machine crash, the data still in PageCache is lost. This is the summary of the data loss scenarios so far.
// producer: compression type used to compress data. The default is no compression; valid values are none, gzip and snappy. Compression works best with batching: the more messages per batch, the better the compression ratio.
props.put("compression.type", "gzip");
// add a small send delay so more messages accumulate per batch
props.put("linger.ms", "50");
// acks=all means the leader waits until all in-sync replicas have acknowledged the write before responding. This guarantees that no data is lost as long as at least one in-sync replica survives, and is the strongest available guarantee.
props.put("acks", "all");
// retry indefinitely until the problem is noticed: any value greater than 0 makes the client resend data whose send failed. These retries are no different from the client resending after receiving a send error. Note that allowing retries can reorder data: if two records are sent to the same partition and the first fails while the second succeeds, the second may end up before the first.
props.put("retries", Integer.MAX_VALUE);
props.put("reconnect.backoff.ms", 20000);
props.put("retry.backoff.ms", 20000);
// disable unclean leader election, i.e. do not allow replicas outside the ISR to be elected leader, to avoid data loss (note: this is a broker/topic-level setting, not a producer property)
props.put("unclean.leader.election.enable", false);
// turn off automatic offset commits (a consumer-side setting)
props.put("enable.auto.commit", false);
// Limit the number of unacknowledged requests the client may send on a single connection. Setting this to 1 means the client cannot send another request to the same broker until the broker has responded to the previous one. Note: this setting avoids message reordering.
props.put("max.in.flight.requests.per.connection", 1);
Causes of repeated consumption in Kafka
Forcibly killing a thread causes the offsets of consumed data not to be committed. Another common case: processing the consumed data takes so long that it exceeds Kafka's session timeout (30 seconds by default in 0.10.x versions), triggering a rebalance; there is then a certain probability that offsets were not committed, leading to repeated consumption after the rebalance.
If consumer.unsubscribe() is called before close(), some offsets may not be committed, and the next restart will re-consume them.
Kafka is designed around at-least-once semantics, which means data may be duplicated. Kafka uses a time-based SLA (service level agreement): messages are deleted after a certain period (usually 7 days).
Kafka's duplicate data should generally be handled on the consumer side; with log.cleanup.policy=delete, the broker itself only removes old data through its periodic deletion mechanism.
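To reduce the rebalance-driven duplicates described above when processing is slow, newer consumers (0.10.1+) separate the session timeout from the poll interval. This is a sketch with illustrative values; the class name is made up for this example.

```java
import java.util.Properties;

// Sketch of consumer settings that reduce rebalance-driven duplicate
// consumption when batch processing is slow. Property names are standard
// consumer configs; the values are illustrative only.
public class SlowConsumerConfig {
    public static Properties props() {
        Properties p = new Properties();
        // Heartbeats run on a background thread, so the session timeout
        // only has to cover heartbeat loss, not processing time.
        p.setProperty("session.timeout.ms", "30000");
        // Upper bound on the time between poll() calls before the group
        // considers the consumer dead and rebalances.
        p.setProperty("max.poll.interval.ms", "300000");
        // Fetch fewer records per poll so each batch finishes
        // within the interval above.
        p.setProperty("max.poll.records", "100");
        return p;
    }
}
```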
At this point, you should have a deeper understanding of Kafka data loss analysis and optimization. Go try it out in practice!