
How to solve a Kafka bug triggered by a network failure


How do you solve a Kafka bug that is triggered by a network failure? Many people without much hands-on experience are at a loss when it happens, so this article summarizes the cause of the problem and how to resolve it. I hope it helps you deal with the issue.

2019-09-03 17:06:25

At that time, the data-center network fluctuated for about one minute; a switch problem caused the brokers in the Kafka cluster to intermittently lose contact with one another.

The Kafka log was as follows:

WARN Attempting to send response via channel for which there is no open connection, connection id xxxxx (kafka.network.Processor)
[2019-09-03 17:06:xx,xxx] INFO Unable to read additional data from server sessionid 0x46b0xxxx027, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:32,076] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2019-09-03 17:06:32,609] INFO Opening socket connection to server xxxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:xx,xxx] WARN Client session timed out, have not heard from server in 1796ms for sessionid 0x46bxxxx40027 (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:33,810] INFO Client session timed out, have not heard from server in 1796ms for sessionid 0x46b03bxxx027, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:34,942] INFO Opening socket connection to server xxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:36,059] WARN Client session timed out, have not heard from server in 2092ms for sessionid 0x46b0xxxx027 (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:xx,xxx] INFO Client session timed out, have not heard from server in 2092ms for sessionid 0x46b0xxxx0027, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:xx,xxx] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient)
[2019-09-03 17:06:37,305] INFO Opening socket connection to server xxxx. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-09-03 17:06:xx,xxx] WARN Client session timed out, have not heard from server in 2135ms for sessionid 0x46bxxxx0027 (org.apache.zookeeper.ClientCnxn)

The network fluctuation lasted about one minute, after which the network recovered.

A highly available Kafka cluster should have recovered from this on its own, but that is not what happened.

kafka-manager then started reporting anomalies: a large number of topics showed no broker, and the message offsets stopped advancing.

Connecting with a Kafka client manually also looked odd: no messages came in for consumption, even though the applications' own production and consumption still appeared normal (a quick client-side check is sketched below).
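For reference, the same kind of manual check can be scripted with the plain Java client (kafka-clients). This is only a minimal sketch under my own assumptions: the bootstrap address and group id are placeholders, and the topic name is simply one of the topics that appears in the logs. It lists each partition's leader and ISR, then samples the end offsets twice to see whether they are still advancing.

import java.util.*;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

// Minimal health check: do the partitions of a topic still have leaders,
// and are their end offsets still moving? Broker address and group id are placeholders.
public class KafkaOffsetCheck {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");           // placeholder
        props.put("group.id", "offset-check");                    // placeholder, not strictly required here
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            String topic = "opStaffCancelPost";                    // one of the topics from the logs above
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo p : consumer.partitionsFor(topic)) {
                // A null leader here corresponds to the "topic has no broker" symptom in kafka-manager.
                System.out.printf("partition %d leader=%s isr=%s%n",
                        p.partition(), p.leader(), Arrays.toString(p.inSyncReplicas()));
                partitions.add(new TopicPartition(topic, p.partition()));
            }
            Map<TopicPartition, Long> before = consumer.endOffsets(partitions);
            Thread.sleep(10_000);                                  // wait, then sample again
            Map<TopicPartition, Long> after = consumer.endOffsets(partitions);
            // If producers are really writing, at least some end offsets should have grown.
            before.forEach((tp, off) ->
                    System.out.printf("%s: %d -> %d%n", tp, off, after.get(tp)));
        }
    }
}

If the leaders look fine but the end offsets never move while producers claim to be sending, the cluster is in the stuck state described here.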

In Kafka's own log you could see the following:

[2019-09-03 17:11:xx,xxx] INFO [Partition opStaffCancelPost-18 broker=180] Cached zkVersion [50] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-03 17:11:37,572] INFO [Partition sendThrowThirdBoxRetry-1 broker=180] Shrinking ISR from 180,182,183 to 180 (kafka.cluster.Partition)
[2019-09-03 17:11:37,574] INFO [Partition sendThrowThirdBoxRetry-1 broker=180] Cached zkVersion [48] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-03 17:11:37,574] INFO [Partition __consumer_offsets-42 broker=180] Shrinking ISR from 180,181,182 to 180,181 (kafka.cluster.Partition)
[2019-09-03 17:11:37,576] INFO [Partition __consumer_offsets-42 broker=180] Cached zkVersion [45] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

The next day the monitoring still showed the same picture, which was clearly wrong: the applications could still produce and consume through Kafka normally, yet every monitoring metric reported the cluster as abnormal, and the cluster remained in that broken state. After 11:00 we manually restarted a node, and the failure got worse: the whole Kafka cluster could no longer be reached to produce messages. The Kafka log was as follows:

[2019-09-04 11:59:12,862] INFO [Partition routeStaffPostQueue-15 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:12,864] INFO [Partition routeStaffPostQueue-15 broker=180] Cached zkVersion [43] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:xx,xxx] INFO [Partition sfPushFvpRetryMsgProcQueue-5 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:xx,xxx] INFO [Partition sfPushFvpRetryMsgProcQueue-5 broker=180] Cached zkVersion [41] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:12,866] INFO [Partition openRouteRetryMsgProcQueue-3 broker=180] Shrinking ISR from 180,182,183 to ... (kafka.cluster.Partition)
[2019-09-04 11:59:xx,xxx] INFO [Partition openRouteRetryMsgProcQueue-3 broker=180] Cached zkVersion [44] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2019-09-04 11:59:xx,xxx] INFO [Partition routeStaffCancelQueue-5 broker=180] Shrinking ISR from 180,182,183 to 180,183 (kafka.cluster.Partition)
[2019-09-04 11:59:12,870] INFO [Partition routeStaffCancelQueue-5 broker=180] Cached zkVersion [43] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)

Reason:

When the network problem causes the session between Kafka's controller and ZooKeeper to expire, the controller loses its controllership, but for a short while this "zombie" controller keeps updating ZooKeeper and sending LeaderAndIsrRequests to the brokers. The other brokers therefore end up holding stale leader and ISR information (including a stale zkVersion), so when they later need to update the ISR themselves, the conditional write against ZooKeeper fails and is skipped, which is exactly the "Cached zkVersion not equal to that in zookeeper" loop in the logs above. A simplified model of this failure is sketched below.
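For intuition only, here is a toy model (in Java, not Kafka's actual code; the class and names are made up for illustration) of the version-checked update behind that message: a broker's ISR write only succeeds if its cached znode version still matches ZooKeeper's, and once the zombie controller has bumped the version behind the broker's back, every later attempt is skipped.

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of an optimistic, version-checked ISR update (illustration, not Kafka's real code).
public class ZkVersionDemo {

    // Stand-in for the partition "state" znode: data plus a version counter.
    static class FakeZnode {
        volatile List<Integer> isr = List.of(180, 182, 183);
        final AtomicInteger version = new AtomicInteger(50);

        // Conditional write: succeeds only if the caller's expected version matches.
        synchronized boolean setDataIfVersion(List<Integer> newIsr, int expectedVersion) {
            if (version.get() != expectedVersion) return false;
            isr = newIsr;
            version.incrementAndGet();
            return true;
        }
    }

    public static void main(String[] args) {
        FakeZnode znode = new FakeZnode();
        int cachedZkVersion = znode.version.get();      // broker 180 caches version 50

        // The zombie controller (whose ZK session has already expired) still writes the znode,
        // bumping the version to 51 without broker 180 ever learning about it.
        znode.setDataIfVersion(List.of(180, 182, 183), cachedZkVersion);

        // Later, broker 180 tries to shrink the ISR using its stale cached version 50.
        boolean ok = znode.setDataIfVersion(List.of(180), cachedZkVersion);
        if (!ok) {
            // This is the situation behind the repeating
            // "Cached zkVersion [50] not equal to that in zookeeper, skip updating ISR" lines.
            System.out.println("Cached zkVersion [" + cachedZkVersion
                    + "] not equal to that in zookeeper, skip updating ISR");
        }
    }
}

In this model nothing ever refreshes the broker's cached version, so the skip repeats indefinitely, which matches the endlessly repeating log lines above.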

The Kafka project has acknowledged this bug and fixed it in KAFKA-5642 by handling ZooKeeper session expiration events properly.

The Kafka version we run is 1.0.0, and the official fix shipped in 1.1.0, so if you are still on a version older than 1.1.0, be sure to watch out for this problem. As a mitigation you can lengthen the broker's ZooKeeper connection and session timeouts by a few seconds, or upgrade the Kafka version.

zookeeper.connection.timeout.ms=10000
zookeeper.session.timeout.ms=10000

After reading the above, do you know how to solve the Kafka bug triggered by a network failure? If you want to learn more skills or want to know more about it, you are welcome to follow the industry information channel. Thank you for reading!
