This article looks at what to do when Kafka messages are lost. The approach described here is simple, fast, and practical, so let's walk through it together.
While handling a production issue, we found that a Kafka cluster had lost a small amount of data on the 12th, at a rate of roughly 3 in 10,000,000. The data was written to Kafka and then simply vanished; consumers never received it at all. We pulled that day's data, located the problematic records, and checked the application logs around the time they were written to Kafka. A small number of errors like the following appeared:
[2019-10-12 11:03:xx,xxx] This is not the correct coordinator.
In theory, Kafka should not lose data under normal circumstances. If it does, the cause is almost always something the developer or the hardware introduced, since the application logs confirm the writes were made. So the first step was to check the application's producer configuration:
acks=1
I immediately realized this was the likely breakthrough for the problem.
acks=0: the producer assumes a message was written to Kafka successfully as soon as it has been sent over the network, so some data is bound to be lost.
acks=1: the leader returns an acknowledgement or an error as soon as it has written the message to its own partition log, so data can still be lost if the leader fails before the followers replicate it.
acks=all: the leader waits for all in-sync replicas to receive the message before returning an acknowledgement or an error.
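As a hedged illustration (not the application's actual code), here is a minimal Java producer sketch showing where the acks setting lives; the broker address and topic name are made-up placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker list; replace with the real cluster.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "xx.xx.xx.57:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=1: only the leader must persist the record before acknowledging it.
        props.put(ProducerConfig.ACKS_CONFIG, "1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"),
                    (metadata, exception) -> {
                        // With acks=1 this callback fires once the leader has the record;
                        // the followers may not have replicated it yet.
                        if (exception != null) {
                            exception.printStackTrace();
                        }
                    });
        }
    }
}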
Presumably acks=1 was chosen at some point as a compromise to keep write performance high.
The next thought was to look at the Kafka logs, guessing that during this period the leader must have been unavailable at some point, which would explain the data loss.
The following log appears on the machine of node 57 in the Kafka cluster:
[2019-10-12 11:03:40,xxx] WARN Client session timed out, have not heard from server in 4034ms for sessionid 0x396aaaadbbxx00 (org.apache.zookeeper.ClientCnxn)
[2019-10-12 11:03:40,xxx] INFO Client session timed out, have not heard from server in 4034ms for sessionid 0x396aaaadbbxx00, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-10-12 11:03:41,253] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2019-10-12 11:03:41,xxx] INFO Opening socket connection to server xx.xx.xx.59/xx.xx.xx.59:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-10-12 11:03:41,962] INFO Socket connection established to xx.xx.xx.59/10.xx.xx.xx:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2019-10-12 11:03:42,293] WARN Unable to reconnect to ZooKeeper service, session 0x396cf664cdbb0000 has expired (org.apache.zookeeper.ClientCnxn)
[2019-10-12 11:03:42,293] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient)
[2019-10-12 11:03:42,293] INFO Initiating client connection, connectString=xx.xx.xx.55:2181,xx.xx.xx.56:2181,xx.xx.xx.57:2181,xx.xx.xx.58:2181,xx.xx.xx.59:2181 sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@342xxx2d (org.apache.zookeeper.ZooKeeper)
[2019-10-12 11:03:42,xxx] INFO Unable to reconnect to ZooKeeper service, session 0x396cxxxxxb0000 has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
[2019-10-12 11:03:42,323] INFO EventThread shut down for session: 0x396cxxxx000 (org.apache.zookeeper.ClientCnxn)
[2019-10-12 11:03:42,342] INFO Opening socket connection to server xx.xx.xx.58/xx.xx.xx.58:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-10-12 11:03:42,343] INFO Socket connection established to xx.xx.xx.58/xx.xx.xx.58:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2019-10-12 11:03:43,516] INFO Session establishment complete on server xx.xx.xx.58/xx.xx.xx.58:2181, sessionid = 0x3ax4xxxxx01, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2019-10-12 11:03:43,xxx] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
This matches the guess: the Kafka node could not reach ZooKeeper, so for a short time the broker was effectively "lost". Kafka's controller log tells the same story:
[2019-10-12 11:03:xx,xxx] INFO [Controller id=56] Newly added brokers: , deleted brokers: 57, all live brokers: ... (kafka.controller.KafkaController)
[2019-10-12 11:03:xx,xxx] INFO [Controller-56-to-broker-57-send-thread]: Shutting down (kafka.controller.RequestSendThread)
[2019-10-12 11:03:xx,xxx] INFO [Controller-56-to-broker-57-send-thread]: Stopped (kafka.controller.RequestSendThread)
[2019-10-12 11:03:xx,xxx] INFO [Controller-56-to-broker-57-send-thread]: Shutdown completed (kafka.controller.RequestSendThread)
[2019-10-12 11:03:xx,xxx] INFO [Controller id=56] Broker failure callback for 57 (kafka.controller.KafkaController)
[2019-10-12 11:03:xx,xxx] INFO [Controller id=56] Removed ArrayBuffer() from list of shutting down brokers. (kafka.controller.KafkaController)
Finally, a look at the network-level monitoring of this node:
The truth emerges, and it matches the speculation. The problem is similar to one encountered before, except that this time only one node briefly dropped off the network, whereas last time all nodes did. For details, see my earlier blog post: https://my.oschina.net/110NotFound/blog/3105190
Analysis of the cause this time:
After node 57 dropped off, a leader election was triggered immediately. Data that arrived at the leader during this window was acknowledged under acks=1 but had not yet been synchronized to the followers; when the leadership changed, that unreplicated data was lost.
Summary:
If strong data consistency is required, be sure to use acks=all; otherwise, even slight network or hardware jitter will one day trigger a re-election in the Kafka cluster and lose data.
At this point, I believe you have a deeper understanding of what to do when Kafka messages are lost; the best way to consolidate it is to try it out in practice.
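A minimal sketch of the safer producer configuration recommended above; the retry count and placeholder addresses are illustrative assumptions, not the article's actual settings:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "xx.xx.xx.57:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: the leader waits for all in-sync replicas before acknowledging.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retries let the producer resend while a leader election is in progress (illustrative value).
        props.put(ProducerConfig.RETRIES_CONFIG, 3);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"), (md, e) -> {
                if (e != null) {
                    // Surfacing the error is what makes acks=all useful: the application
                    // can retry or alert instead of silently losing the record.
                    e.printStackTrace();
                }
            });
        }
    }
}

Note that, on the broker or topic side, acks=all only guarantees replication to more than one copy when min.insync.replicas is set above 1; that setting is not discussed in this article, but it is worth checking alongside the producer change.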