
Kafka messages can't be consumed?! My distributed message service Kafka is as stable as Mount Tai!


Late one dark and windy night, I was suddenly woken by an alarm about a Kafka message backlog in the live production environment, and I got up immediately to check the logs.

Symptom: consumption requests are stuck at finding the Coordinator

What is the Coordinator? The Coordinator manages the individual members of a Consumer Group and is responsible for consumer offset management and Consumer Rebalance. Before consuming, a Consumer must first determine the Coordinator of its Consumer Group, then join the Group and obtain its assigned topic partitions to consume.

So how is the Coordinator of a Consumer Group determined? It takes two steps:

1. Each Consumer Group maps to one partition of __consumer_offsets. First, compute the __consumer_offsets partition corresponding to the Consumer Group, using the following formula:

__consumer_offsets partition# = Math.abs(groupId.hashCode() % groupMetadataTopicPartitionCount), where groupMetadataTopicPartitionCount is specified by offsets.topic.num.partitions.

2. The broker hosting the Leader of the partition computed in step 1 is the broker on which the Coordinator runs (see the sketch below).
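As a minimal sketch of this lookup in Java (the group id and partition count below are illustrative assumptions, not values from this incident):

    // Compute which __consumer_offsets partition holds a group's offsets.
    // The broker hosting that partition's Leader is the group's Coordinator.
    public class CoordinatorPartition {
        public static void main(String[] args) {
            String groupId = "my-consumer-group";       // hypothetical group id
            int groupMetadataTopicPartitionCount = 50;  // default of offsets.topic.num.partitions
            int partition = Math.abs(groupId.hashCode() % groupMetadataTopicPartitionCount);
            System.out.println("__consumer_offsets partition = " + partition);
        }
    }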

Troubleshooting process

The Coordinator node has been located; now let's see whether there is a problem with the Coordinator:

As expected, the Leader of the Coordinator's partition is -1. The consumer program waits until a Leader is elected, which directly causes consumption to hang.
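A leaderless partition like this can also be confirmed from a client. Here is a minimal sketch, assuming the Kafka 3.x Java AdminClient and a placeholder bootstrap address; a partition with no Leader reports leader() as null, which tools display as -1:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;

    public class LeaderCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
            try (Admin admin = Admin.create(props)) {
                // Describe the internal offsets topic and print each partition's Leader and ISR.
                TopicDescription desc = admin
                        .describeTopics(Collections.singleton("__consumer_offsets"))
                        .allTopicNames().get()
                        .get("__consumer_offsets");
                desc.partitions().forEach(p -> System.out.printf(
                        "partition %d leader=%s isr=%s%n",
                        p.partition(),
                        p.leader() == null ? "-1 (none)" : p.leader().idString(),
                        p.isr()));
            }
        }
    }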

Why can't a Leader be elected? The Controller is responsible for Leader election. The Controller node manages the state of partitions and replicas across the entire cluster: partition Leader election, topic creation, replica assignment, partition and replica expansion, and so on. Now let's look at the Controller's log:

1. At 15:48:30.006 on June 10, Broker 1 becomes Controller.

At this point, the nodes it perceives as alive are 1 and 2; node 3 cannot be read from ZooKeeper:

At 31.847 seconds, the Leader of partition 3 of __consumer_offsets is elected as 1, the ISR is [1,2], and the leader_epoch is 14:

One second later, it perceives that the Controller has changed and clears its own Controller state.

2. A few hundred milliseconds later, Broker 2 also becomes Controller.

But Broker 2 perceives that node Broker 3 is alive; the log is as follows:

Note that at this point in time, Broker 1 had not yet updated the Leader of partition 3 of __consumer_offsets from node 3 to 1 in ZooKeeper. So Broker 2 still believed that Broker 3 was the Leader, and since Broker 3 was alive in its view, there was no need to re-elect a Leader. This state persisted for quite a long time; even after Broker 1 switched the Leader for this partition, Broker 2 remained unaware of it. The sketch below models this stale-cache race.
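To make the race concrete, here is an illustrative model in Java (not Kafka source code; all values are assumptions taken from the log narrative above) of why the new Controller skips the election:

    import java.util.Set;

    // Illustrative model of the stale-cache race (not Kafka source code).
    // Broker 2's in-memory view still says the Leader is 3; since broker 3
    // looks alive to Broker 2, it sees no reason to re-elect, even though
    // ZooKeeper already records Leader = 1.
    public class StaleLeaderView {
        public static void main(String[] args) {
            int inMemoryLeader = 3;                  // Broker 2's cached Leader for partition 3
            int zkLeader = 1;                        // what Broker 1 already wrote to ZooKeeper
            Set<Integer> liveBrokers = Set.of(1, 2, 3); // brokers Broker 2 perceives as alive

            if (liveBrokers.contains(inMemoryLeader)) {
                System.out.println("Cached leader " + inMemoryLeader
                        + " looks alive -> skip re-election (zk already says " + zkLeader + ")");
            }
        }
    }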

3. At 21:43:19 on June 12, Broker 2 detects that Broker 1 has dropped off the network and handles the node-failure event:

Because in Broker 2's memory the Leader of __consumer_offsets partition 3 is still Broker 3, no Leader switch is triggered for partition 3.

However, when Broker 2 processes the failed node Broker 1, it removes Broker 1's replica from the ISR list, reading ZooKeeper once before the removal, as follows:

It then finds that in ZooKeeper the Leader of partition 3 has already become 1 and the ISR list is [1,2]. Because the node being removed, node 1, is the Leader, the Leader is set to -1 and the ISR shrinks to just [2]. You can also see this in the log:

In this way, the Leader of partition 3 stays at -1 until some new event triggers node 2 to run another election (such as restarting a node), as modeled in the sketch below.
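An illustrative sketch of this ISR-removal step (again a simplified model under the assumptions above, not Kafka source code):

    import java.util.ArrayList;
    import java.util.List;

    // Simplified model of removing a failed broker from a partition's ISR.
    // If the broker being removed is the current Leader, the partition is
    // left with Leader = -1 until some later event triggers a re-election.
    public class IsrRemoval {
        public static void main(String[] args) {
            int failedBroker = 1;                               // Broker 1 went offline
            int zkLeader = 1;                                   // Leader recorded in ZooKeeper
            List<Integer> isr = new ArrayList<>(List.of(1, 2)); // ISR read from ZooKeeper

            isr.remove(Integer.valueOf(failedBroker));
            int newLeader = (zkLeader == failedBroker) ? -1 : zkLeader;
            System.out.println("leader=" + newLeader + " isr=" + isr); // leader=-1 isr=[2]
        }
    }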

Root cause summary

After the network exception, the new Controller and the old Controller perceived different sets of available nodes, so the new Controller's in-memory record of a partition's Leader became inconsistent with the metadata recorded in ZooKeeper. The election logic therefore went wrong and no Leader could be chosen; a fresh election event, such as a restart, was needed to trigger a Leader election.

Problem summary

This is a typical split-brain caused by a network exception, which left the cluster with multiple Controllers. The distributed message service Kafka from the "Chrysanthemum Factory" (Huawei, https://www.huaweicloud.com/product/dmskafka.html) has passed telecom-grade reliability verification and handles these problems completely.
