
How to solve the sudden downtime of Kafka


This article looks at a real case of a sudden Kafka node failure that left consumers unable to receive messages, and walks through the high-availability concepts needed to understand and fix it.

A high-availability problem caused by Kafka downtime

Since it was first deployed, the Kafka cluster used internally by our system had been running stably and had never become unavailable.

Recently, however, testers reported that Kafka consumers occasionally could not receive messages, and logging into the management console showed that one of the three nodes was down.

But by the logic of high availability, with two of three nodes still alive, how could consumers across the entire cluster fail to receive messages?

To answer that, we have to start from how Kafka implements high availability.

Kafka's multi-replica redundancy design

Whether in a traditional system built around a relational database or in distributed systems such as ZooKeeper, Redis, Kafka, and HDFS, high availability is usually achieved through redundancy, so that a single node going down does not make the service unavailable.

First, a quick look at a few Kafka concepts:

(The original article illustrates these concepts with a physical-model diagram and a logical-model diagram; the figures are omitted in this text version.)

Broker (node): a Kafka service node. Put simply, a broker is one Kafka server, i.e. one physical node.

Topic: in Kafka, messages are categorized by topic, and every topic has a name. Producers send messages to a specific topic by its name, and consumers likewise consume from the corresponding topic by name.

Partition: a topic is the unit of message classification, but each topic can be subdivided into one or more partitions, and a partition belongs to exactly one topic.

Topics and partitions are logical concepts. For example, messages 1 and 2 sent to topic 1 may land in the same partition or in different ones (so different partitions of the same topic contain different messages), and each message is then stored on the broker that hosts its partition.

Offset: a partition can be thought of as an append-only queue (Kafka only guarantees ordering within a partition). Messages are appended to the end of the queue, and each message is assigned an offset that identifies its position within the partition. Consumers use the offset to locate and consume messages.
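
To make the topic / partition / offset relationship concrete, here is a minimal sketch of a Java consumer that prints the partition and offset of every record it receives. The broker address localhost:9092, the topic demo-topic, and the group demo-group are placeholder values, not anything from the original setup.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class OffsetDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("group.id", "demo-group");              // hypothetical consumer group
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("demo-topic")); // hypothetical topic
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record carries the partition it came from and its offset within that partition.
                    System.out.printf("topic=%s partition=%d offset=%d value=%s%n",
                            record.topic(), record.partition(), record.offset(), record.value());
                }
            }
        }
    }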

Based on these concepts, can you already guess how Kafka's multi-replica redundancy is implemented? Don't worry, let's keep going.

Prior to Kafka 0.8 there was no multi-replica redundancy mechanism: once a node died, all partition data on that node could no longer be consumed, which meant part of the data sent to the topic was simply lost.

The replica mechanism introduced in version 0.8 largely solves the problem of data loss after a node goes down. Replication works at the granularity of each partition of a topic: a partition's data is synchronized to other physical nodes, forming multiple replicas.

For each partition, the replicas consist of one leader replica and several follower replicas. The leader is elected from among all the replicas, and the remaining replicas act as followers.

Producers write to, and consumers read from, the leader only. After data is written, the followers pull it from the leader to stay in sync.

Is it really that simple? Yes. Kafka's high availability is built on this multi-replica architecture.

When a broker dies, there is no need to panic: the partitions on that broker still have replicas on other broker nodes.

What if the leader itself is the one that went down? Then a new leader is elected from among the followers, and producers and consumers simply carry on with the new leader. That is high availability.

You may still be wondering: how many replicas are enough? What if the followers are not fully in sync with the leader? And what are the rules for electing a leader after a node goes down?

Let's go straight to the conclusions. How many replicas are enough? The more replicas, the higher Kafka's availability, but more replicas also consume more network and disk resources and degrade performance.

Generally speaking, a replication factor of 3 is enough to ensure high availability; in extreme cases you can increase the replication-factor further.
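
As a rough illustration of that recommendation, the sketch below creates a topic with 6 partitions and a replication factor of 3 through the Java AdminClient; the broker address and topic name are placeholders.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // 6 partitions, replication factor 3: each partition keeps copies on 3 brokers.
                NewTopic topic = new NewTopic("demo-topic", 6, (short) 3);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }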

What if the followers and the leader are not fully synchronized? Replication in Kafka is neither fully synchronous nor fully asynchronous; instead it uses the ISR (In-Sync Replica) mechanism.

Each leader dynamically maintains an ISR list containing the followers that are essentially caught up with it.

If a follower fails to issue fetch requests to the leader for reasons such as network problems or GC pauses and falls out of sync, it is kicked out of the ISR list.

So the followers in the ISR list are the replicas that are keeping up with the leader.
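
One way to see the ISR in practice is to describe a topic with the Java AdminClient, which reports the leader, the full replica set, and the current ISR for every partition. A rough sketch, again with placeholder broker and topic names:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class IsrDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin.describeTopics(Collections.singleton("demo-topic"))
                        .all().get().get("demo-topic");
                for (TopicPartitionInfo p : desc.partitions()) {
                    // leader(): the current leader replica; replicas(): all assigned replicas;
                    // isr(): the replicas (leader included) currently considered in sync.
                    System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr());
                }
            }
        }
    }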

What are the leader election rules after a node goes down? There are many distributed election algorithms, such as ZooKeeper's Zab, Raft, Viewstamped Replication, and Microsoft's PacificA.

Kafka's leader election idea is very simple: based on the ISR list above, after a failure the replicas are scanned in order, and the first one found that is in the ISR list is chosen as the new leader.

It must also be certain that the previous leader has stepped down; otherwise a split-brain situation would arise (two leaders at once).

How is that guaranteed? Kafka designates a Controller to ensure that there is only one leader.

The acks parameter determines the degree of reliability

In addition, here is one more piece of essential knowledge for Kafka high availability: the acks parameter (request.required.acks in older producer clients).

acks is an important configuration on the producer client; it determines what the producer requires before a send is considered successful.

This parameter has three possible values:

0

1

all (-1)

The first option, acks=0, means that after the producer sends a message it simply does not care what happens to it: fire and forget, with nobody taking responsibility. With nobody responsible, messages can naturally be lost, and availability goes with them.

The second option, acks=1, means that after the producer sends a message, the send is considered successful as soon as the leader has received it, regardless of whether the followers have synchronized it.

There is a failure mode here: the leader receives the message, but its broker goes down before the followers have had time to synchronize it. The producer already believes the message was sent successfully, yet the message is lost.

Note that acks=1 is Kafka's default! So the default configuration is not all that highly available; it is a trade-off between high availability and high throughput.

The third option, acks=all (or -1), means that after the producer sends a message, not only the leader but also every follower in the ISR list must have synchronized it before the producer considers the send successful.

Think about it a bit more: does acks=all mean messages can never be lost? The answer is no.

When only the leader remains in the ISR list, acks=all degenerates into acks=1. In that case, if the leader's node goes down, can you still guarantee that no data is lost? Clearly not.

Therefore, data is guaranteed not to be lost only when acks=all and there are at least two replicas in the ISR.
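
Putting the acks discussion into code, here is a minimal sketch of a producer configured with acks=all; the broker address, topic, key, and value are placeholders. Whether this actually prevents loss still depends on how many replicas remain in the ISR, as discussed above (the broker/topic setting min.insync.replicas can be used to enforce a minimum ISR size).

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AcksAllDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            // acks=all: the leader only acknowledges once every replica in the ISR has the message.
            props.put(ProducerConfig.ACKS_CONFIG, "all");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("demo-topic", "key", "value"), (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace(); // the ISR never confirmed the write
                    } else {
                        System.out.printf("written to partition %d at offset %d%n",
                                metadata.partition(), metadata.offset());
                    }
                });
            }
        }
    }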

Solving the problem

After this long detour through Kafka's high-availability mechanism, we finally come back to the original question: why did the cluster become unavailable after one Kafka node went down?

In the development and test environment I had configured 3 broker nodes, a topic replication factor of 3, 6 partitions per topic, and acks=1.

What does the cluster do first when one of the three nodes goes down? As described above, it notices that the leaders of some partitions are gone and re-elects new leaders from the ISR lists.

Does a partition become unavailable if its ISR list is empty? No: one of the surviving replicas of the partition is chosen as leader instead, though with a potential risk of data loss.

Therefore, as long as the topic replication factor matches the number of brokers, Kafka's multi-replica redundancy can keep the cluster available when a node goes down (note, though, that Kafka has a protection policy and stops serving once more than half of the nodes are unavailable).

So think carefully: is there a topic in the cluster with only 1 replica?

The problem lies in __consumer_offsets. __consumer_offsets is a topic that Kafka creates automatically to store consumers' offset information, and its default number of partitions is 50.

This is exactly such a topic: its replication factor defaulted to 1. If all of its partitions live on the same machine, that is an obvious single point of failure!

Kill the broker that holds the __consumer_offsets partitions and you will find that every consumer stops consuming.
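
You can confirm this single point of failure by describing __consumer_offsets the same way as the earlier ISR sketch; with a replication factor of 1, every partition lists exactly one replica. Again, the broker address is a placeholder.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class OffsetsTopicCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin.describeTopics(Collections.singleton("__consumer_offsets"))
                        .all().get().get("__consumer_offsets");
                for (TopicPartitionInfo p : desc.partitions()) {
                    // With replication factor 1, replicas() contains a single broker:
                    // killing that broker takes the stored consumer offsets down with it.
                    System.out.printf("partition=%d leader=%s replicas=%s%n",
                            p.partition(), p.leader(), p.replicas());
                }
            }
        }
    }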

So how do we solve this problem?

First, delete __consumer_offsets. Note that this topic is built into Kafka and cannot be deleted with the normal topic-deletion command; I removed it by deleting its log files on disk.

Second, change the replication factor of __consumer_offsets to 3 by setting offsets.topic.replication.factor to 3.
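
For reference, a sketch of what that looks like in each broker's server.properties; the setting only takes effect when Kafka creates the __consumer_offsets topic, which is presumably why the topic had to be deleted first.

    # server.properties on every broker (sketch)
    offsets.topic.replication.factor=3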

With __consumer_offsets now redundantly replicated, the problem of consumers being unable to consume after a node failure was solved.

That concludes this look at how to handle a sudden Kafka node failure. Theory works best when paired with practice, so go and try it out for yourself!
