Kafka troubleshooting: exceptions caused by consumer processing timeouts

2025-01-16 Update From: SLTechnology News&Howtos

Recently we ran into a Kafka problem: a consumer took too long to process its messages, so it could not commit its offset, and as a result new messages were never consumed. This article reviews the incident in four parts: business background, problem description, troubleshooting approach, and lessons learned.

First, business background

First, a brief description of the business background. One of our services must consume the messages of a topic in strict order, so we configured the topic with a single partition and a single replica. When several consumers in the same consumer group are started, only one of them is assigned the partition and consumes from it, which preserves the consumption order within the group.

Note: the groupId of the consumer group is "smart-building-consumer-group" and the subscribed topic is "gate_contact_modify".

Second, problem description

One day we received a report that, after the producer side had generated a message, the consumer side did not produce the expected result. After ruling out a bug in the business logic, we concluded that the most likely cause was that the Kafka message had not been consumed successfully. To confirm this, we examined the consumer logs and found the following:

(1) Occasionally, a Kafka warning is printed in the log:

org.springframework.kafka.KafkaListenerEndpointContainer#2-0-C-1 org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.maybeAutoCommitOffsetsSync:648-Auto-commit of offsets {gate_contact_modify-0=OffsetAndMetadata{offset=2801, metadata=''}} failed for group smart-building-consumer-group: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.

(2) The warning is followed by a rebalance.

(3) The consumer then prints the business log of the topic consumer, which shows that the topic messages were fetched normally.

We then checked the offset of the topic for this consumer group in Kafka and found that it had not advanced.

Third, troubleshooting approach

The warning in the log states the cause directly: after the topic messages were consumed, the group rebalanced, so the commit was rejected. The interval between two consecutive poll() calls exceeded the maximum defined by max.poll.interval.ms, which means that processing the messages returned by one poll took too long, and it was this overrun that triggered the rebalance. The message also suggests two remedies: increase the timeout, or reduce the maximum number of messages fetched per poll.

In addition, because the commit fails, the offset does not advance, so every poll returns the same batch of messages again. The resulting cycle is: poll a batch, spend longer than max.poll.interval.ms processing it, lose the partition in a rebalance, fail to commit, and then poll the same batch once more.
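The cycle described above can be illustrated with a toy model. This is only a sketch that does not use the Kafka client: the class and method names are hypothetical, the 300 s interval is the Kafka default, and the batch size and processing times are illustrative placeholders. It assumes a commit is accepted only when the whole batch is processed within the poll interval.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the poll -> process -> commit cycle (hypothetical, no Kafka client involved). */
final class PollLoopModel {

    /**
     * Returns the committed offset after each simulated poll cycle.
     * If processing the batch takes longer than the poll interval, the model
     * assumes a rebalance happened and the commit was rejected.
     */
    static List<Long> committedOffsets(long startOffset, int batchSize,
                                       long batchProcessingMs, long maxPollIntervalMs,
                                       int cycles) {
        List<Long> committed = new ArrayList<>();
        long offset = startOffset;
        for (int i = 0; i < cycles; i++) {
            boolean rebalanced = batchProcessingMs > maxPollIntervalMs;
            if (!rebalanced) {
                offset += batchSize; // commit accepted: the next poll starts further on
            }
            // Otherwise the commit is rejected and the next poll returns the same batch.
            committed.add(offset);
        }
        return committed;
    }

    public static void main(String[] args) {
        // Batch takes 400 s, interval is 300 s: the offset never moves.
        System.out.println(committedOffsets(2801, 500, 400_000, 300_000, 3)); // [2801, 2801, 2801]
        // Batch takes 200 s, interval is 300 s: the offset advances on every cycle.
        System.out.println(committedOffsets(2801, 500, 200_000, 300_000, 3)); // [3301, 3801, 4301]
    }
}
```

The first call reproduces the stuck offset seen in the consumer group; the second shows the same loop making progress once a batch fits inside the interval.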

Based on this information, we checked the consumer's max.poll.records and max.poll.interval.ms settings and measured the processing time of each topic message. max.poll.records used its default value of 500, max.poll.interval.ms used its default value of 300 s, and each message took about 10 s to process. A full batch of 500 messages would therefore need roughly 5,000 s, far more than the 300 s interval. This further confirms our inference:

Because too many messages are pulled per poll and each message takes a long time to process, the time needed to process one batch exceeds the poll interval. The group therefore rebalances, the commit fails, the next poll returns the same messages, processing times out again, and the consumer is stuck in an endless loop.

Once we knew the root cause, we set max.poll.records=1 to suit the characteristics of this business, so that only one message is pulled and processed per poll, and the problem was resolved.
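The arithmetic behind this conclusion can be written out as a small check. This is a rough sketch: the class and method names are hypothetical, and it assumes every message takes the same time and that nothing else happens between polls.

```java
import java.time.Duration;

/** Back-of-the-envelope check of a poll batch against max.poll.interval.ms (hypothetical helper). */
final class PollBudget {

    /** Worst-case time needed to process one full batch. */
    static Duration batchTime(int maxPollRecords, Duration perMessage) {
        return perMessage.multipliedBy(maxPollRecords);
    }

    /** Upper bound on the batch size that still fits inside the poll interval. */
    static long maxSafeBatch(Duration maxPollInterval, Duration perMessage) {
        return maxPollInterval.toMillis() / perMessage.toMillis();
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofSeconds(300);  // default max.poll.interval.ms
        Duration perMessage = Duration.ofSeconds(10); // measured in this incident
        int records = 500;                            // default max.poll.records

        Duration worstCase = batchTime(records, perMessage);
        System.out.println("worst-case batch time: " + worstCase.getSeconds() + " s"); // 5000 s
        System.out.println("exceeds interval: " + (worstCase.compareTo(interval) > 0)); // true
        System.out.println("upper bound on batch size: " + maxSafeBatch(interval, perMessage)); // 30
    }
}
```

The bound of 30 leaves no headroom at all, so in practice a noticeably smaller batch is the safer choice when per-message time varies.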
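As a rough illustration of the change, the consumer settings could be assembled as plain properties like this. The keys are standard Kafka consumer configuration names, but the class name and the bootstrap address are assumptions for the example, and the original project (which uses Spring Kafka, judging by the log) may well set the value elsewhere, for instance through the Spring Boot property spring.kafka.consumer.max-poll-records.

```java
import java.util.Properties;

/** Hypothetical holder for the consumer settings discussed in this article. */
final class ConsumerSettings {

    static Properties build() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed address
        props.setProperty("group.id", "smart-building-consumer-group");
        // Fetch a single record per poll so one slow message cannot exhaust the interval.
        props.setProperty("max.poll.records", "1");
        // Left at the default of 300 s; one 10 s message fits comfortably.
        props.setProperty("max.poll.interval.ms", "300000");
        return props;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Pulling one record at a time trades throughput for safety, which is acceptable here because the topic is already consumed strictly in order by a single consumer.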

Fourth, lessons learned

This troubleshooting gave us a deeper understanding of Kafka's poll mechanism and of how rebalance and commit interact.

(1) Kafka lets you set the number of messages returned per poll to improve throughput, but the batch size must be weighed against the poll interval timeout and the processing time of each message.

(2) Once the interval between two polls exceeds the threshold, the group treats the current consumer as possibly failed and triggers a rebalance to reassign the topic's partitions.

(3) If a rebalance happens before the commit, that commit fails and the next poll returns the old data again (repeated consumption), so message processing must be idempotent.
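One possible shape for idempotent processing, given as a minimal sketch rather than the approach used in the original project: remember the last offset applied for each partition and skip anything at or below it. The class is hypothetical and keeps its state in memory only; a real service would need to persist that state together with the business effect, or deduplicate on a business key instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical wrapper that skips records already applied for a partition. */
final class IdempotentHandler {

    private final Map<String, Long> lastApplied = new HashMap<>();
    private final Consumer<String> businessLogic;

    IdempotentHandler(Consumer<String> businessLogic) {
        this.businessLogic = businessLogic;
    }

    /** Applies the payload unless this offset was already applied; returns true if it ran. */
    boolean handle(String topicPartition, long offset, String payload) {
        Long last = lastApplied.get(topicPartition);
        if (last != null && offset <= last) {
            return false; // redelivered after a failed commit: skip
        }
        businessLogic.accept(payload);
        lastApplied.put(topicPartition, offset);
        return true;
    }

    public static void main(String[] args) {
        List<String> applied = new ArrayList<>();
        IdempotentHandler handler = new IdempotentHandler(applied::add);
        handler.handle("gate_contact_modify-0", 2801, "first");
        handler.handle("gate_contact_modify-0", 2801, "first"); // duplicate delivery, ignored
        handler.handle("gate_contact_modify-0", 2802, "second");
        System.out.println(applied); // [first, second]
    }
}
```

Tracking offsets works here because the topic has a single partition consumed in order; with unordered or multi-source input, a business-level identifier is usually the more robust deduplication key.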
