In this issue, the editor walks you through troubleshooting a Kafka restart failure. The article is rich in content and analyzes the problem from a professional point of view; I hope you get something out of it.
Background
After receiving feedback from a user, we found that partition 34 of topic A in the log Kafka cluster could not elect a leader. When messages were sent to this partition, the following "no leader" error was reported:
In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.
Next, the ops team could not find the broker0 node in kafka-manager, although its process was still running. A restart attempt got no response for a long time, so the node's process was killed with kill -9; the subsequent restart then failed, which led to the following problems:
The leader replica of partition 34 of topic A was on broker0, and the other replica had already been kicked out of the ISR because it could not keep up with the leader. In Kafka 0.11, the broker parameter unclean.leader.election.enable defaults to false, which means the partition cannot elect a leader from replicas outside the ISR. As a result, sending messages to that partition of topic A reported that the partition leader did not exist, and messages not yet consumed from the partition could no longer be consumed.
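For context, here is a minimal Scala sketch of one way to check a partition's leader and ISR with Kafka's AdminClient (it assumes the kafka-clients library is on the classpath; "broker0:9092" and "topic-a" are placeholders for the real bootstrap address and topic name):

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}

    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker0:9092")  // placeholder address
    val admin = AdminClient.create(props)

    // Describe the topic and print the leader and ISR of each partition.
    val desc = admin.describeTopics(Collections.singletonList("topic-a")).all().get().get("topic-a")
    desc.partitions().forEach { p =>
      // p.leader() is null while the partition has no leader, which matches the error above.
      println(s"partition=${p.partition()} leader=${p.leader()} isr=${p.isr()}")
    }
    admin.close()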
Kafka log analysis
Looking at the KafkaServer.log file, we found that a large number of log entries were generated while Kafka was restarting:
Among them were a large number of warnings saying that topic index files were corrupted and the index files were being rebuilt. This points to the source code:
kafka.log.OffsetIndex#sanityCheck
In my own words, the logic is as follows:
When Kafka starts, it checks whether the previous shutdown was clean by looking for a .kafka_cleanshutdown file in the ${log.dirs} directory. If the broker did not exit normally, that file is absent and log recovery is required; during recovery, the sanityCheck() method is called on the index file of each log segment to verify its integrity:
entries: Kafka index files are sparse indexes, so the position of every message is not written to the .index file. Instead, index entries are used: only one position is recorded per batch of messages, so the number of entries in the index file is entries = mmap.position / entrySize.
lastOffset: the offset of the last entry, i.e. lastOffset = lastEntry.offset.
baseOffset: the base offset of the index file, i.e. the number in the index file's name.
The corresponding relationship between the index file and the log file is as follows:
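As a rough illustration (all offsets and byte positions below are made-up example values), a toy Scala sketch of how a sparse .index entry maps into the corresponding .log file:

    // Toy model of a sparse index: each entry maps a relative offset to a byte position in the .log file.
    // A base offset of 368769 would come from the file names 00000000000000368769.index / .log.
    val baseOffset = 368769L
    val sparseIndex = Seq(0 -> 0, 35 -> 4597, 70 -> 9213)  // (relativeOffset, physicalPosition) pairs

    // To locate absolute offset 368820: find the largest indexed relative offset not past the target,
    // then scan the .log file forward from that physical position.
    val relative = (368820L - baseOffset).toInt
    val (relOff, pos) = sparseIndex.takeWhile(_._1 <= relative).last
    println(s"start scanning the .log file at byte $pos (indexed relative offset $relOff)")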
The basis for determining whether the index file is corrupted is:
_entries == 0 || _lastOffset > baseOffset  is false  // index corrupted
_entries == 0 || _lastOffset > baseOffset  is true   // index intact
My understanding of this judgment logic is:
When entries equals 0, the index has no content, and the index file is considered not corrupted; when entries is not 0, we need to check whether the last offset in the index file is greater than the file's base offset. If it is not, the index file is corrupted and needs to be rebuilt.
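To make this concrete, here is a minimal Scala sketch of the check as I understand it from the 0.11 source, with the fields passed in explicitly rather than read from the memory-mapped index (a paraphrase, not the exact upstream code):

    // Paraphrase of the 0.11.x OffsetIndex sanity check; field names follow the description above.
    // entries    = mmap.position / entrySize  (number of index entries)
    // lastOffset = offset of the last index entry
    // baseOffset = offset encoded in the index file name
    def sanityCheck(entries: Int, lastOffset: Long, baseOffset: Long, indexPath: String): Unit = {
      require(entries == 0 || lastOffset > baseOffset,
        s"Corrupt index found: $indexPath has entries but its last offset ($lastOffset) " +
          s"is not larger than its base offset ($baseOffset).")
    }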
Then why did this happen?
I found what look like some answers in the related issues:
https://issues.apache.org/jira/browse/KAFKA-1112
https://issues.apache.org/jira/browse/KAFKA-1554
Overall, it seems this kind of problem after an abnormal exit was more likely to occur in older versions?
Interestingly, the restart failure was not caused by this corrupted-index problem itself, because it has been fixed in later versions, and as the logs show, Kafka deletes the corrupted index files and rebuilds them. Let's move on to the error message that actually caused the restart to fail:
And here lies the problem: the error above can occur during the process of deleting and rebuilding the index. There are quite a few descriptions of this bug on issues.apache.org; I will post two of them here:
https://issues.apache.org/jira/browse/KAFKA-4972
https://issues.apache.org/jira/browse/KAFKA-3955
These bugs are very obscure and extremely hard to reproduce. Since the problem no longer exists in later versions, upgrading the Kafka version is the urgent fix. Later, once I am more familiar with Scala, I will continue studying the source code; the details have to be found there.
Solution analysis
Given the two problems described in the background, the crux is that broker0 failed to restart, so partition 34 of topic A can only be restored if broker0 is started successfully.
Since it is the corrupted log and index files that keep the broker from starting, we simply need to delete the corrupted log and index files and restart it.
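As a rough sketch of that cleanup (the partition directory path below is a placeholder; this only removes the index files, which Kafka rebuilds from the .log segments on the next start, while a corrupted .log segment would have to be removed the same way at the cost of its data):

    import java.nio.file.{Files, Paths}
    import scala.collection.JavaConverters._

    // Placeholder path: the affected partition directory under log.dirs.
    val partitionDir = Paths.get("/data/kafka-logs/topic-a-34")

    // Remove the offset/time index files; Kafka rebuilds them from the .log segments on startup.
    Files.list(partitionDir).iterator().asScala
      .filter(p => p.toString.endsWith(".index") || p.toString.endsWith(".timeindex"))
      .foreach(p => Files.delete(p))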
However, if the log and index files of partition 34 are themselves corrupted, the unconsumed data in that partition will be lost, for the following reason:
At this point the leader of partition 34 is still on broker0. Because broker0 is down and it holds the only in-sync replica of partition 34, the partition is unavailable. If we wipe the leader's data on broker0, Kafka will still use the replica on broker0 as the leader after the restart, and the followers must align with the leader's offset: since the leader's data has been cleared, the followers can only truncate their data to 0, because their offsets must not be greater than the leader's.
This seems unreasonable. Would it be possible to provide an operation for this situation:
when the partition is unavailable, let the user manually designate any replica of the partition as the leader?
Later, I will analyze this problem in a separate article.
Follow-up cluster optimization
Develop a plan to upgrade the cluster to version 2.2.
Raise the default systemd stop timeout to 600 s on each node's server. On the day of the failure, I found that node 33 did not respond for a long time while the ops staff were shutting it down, and it was then force-killed with kill -9. As far as I know, shutting down a Kafka server involves a lot of work and can take quite some time, while systemd's default is to stop the process after a 90-second timeout, which amounts to an abnormal exit.
Set the broker parameter unclean.leader.election.enable to true (to ensure that a partition can elect a leader from replicas outside the ISR).
Set the broker parameter default.replication.factor to 3 (improves availability, but increases the storage pressure on the cluster; this can be discussed later).
Set the broker parameter min.insync.replicas to 2 (ensures there are at least two replicas in the ISR at all times; but is this necessary, given the performance cost and the fact that we have already set unclean.leader.election.enable to true?).
Have producers send with acks=1 (ensures the leader replica has written the message before the send is acknowledged; but is this necessary, given the possible performance cost?).
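For the acks=1 item, here is a minimal Scala producer sketch (it assumes the kafka-clients library; the bootstrap address and topic name are placeholders):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
    import org.apache.kafka.common.serialization.StringSerializer

    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker0:9092")  // placeholder address
    props.put(ProducerConfig.ACKS_CONFIG, "1")  // the leader must persist the record before acknowledging
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("topic-a", "key", "value"))  // placeholder topic
    producer.close()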
The above is the editor's walkthrough of troubleshooting a Kafka restart failure. If you happen to have similar questions, you can refer to the analysis above. If you want to learn more, you are welcome to follow the industry information channel.