
Understanding Kafka Downtime and the Principles of Kafka High Availability


How should we understand Kafka downtime and the principles behind Kafka's high availability? Many people with little hands-on experience are unsure what to do when a broker goes down, so this article walks through the cause of one such problem and how it was solved. I hope it helps you resolve similar issues.

High availability problems caused by Kafka downtime

The problem started with a Kafka outage.

I work at a financial technology company. Rather than RabbitMQ, which is more popular in the payments field, the company uses Kafka, a system originally designed for log processing, so I have always been curious about how Kafka implements and guarantees high availability. Since Kafka was deployed, the clusters used inside our systems had been running stably and had never become unavailable.

Recently, however, system testers kept reporting that consumers occasionally could not receive messages, and logging in to the management interface showed that one of the three nodes was down. But by the usual notion of high availability, with three nodes and two of them still alive, how could consumers across the entire cluster fail to receive messages?

To answer that question, we have to start with how Kafka implements high availability.

Kafka's Multi-Replica Redundancy Design

Whether in a traditional system built on a relational database or in distributed systems such as ZooKeeper, Redis, Kafka, and HDFS, high availability is usually achieved through redundant design, so that the failure of a single node does not make the service unavailable.

First, a quick look at a few Kafka concepts:

(In the list below, Broker belongs to the physical model, while Topic, Partition, and Offset are logical concepts.)

Broker (node): a Kafka service node. Simply put, a Broker is one Kafka server, i.e. a physical node.

Topic: in Kafka, messages are classified by topic, and each topic has a Topic Name. Producers send messages to a specific Topic according to its Topic Name, and consumers likewise consume from the corresponding Topic by its Topic Name.

Partition: a Topic is the unit of message classification, but each topic can be subdivided into one or more Partitions, and a partition can belong to only one topic. Topics and partitions are logical concepts. For example, if messages 1 and 2 are both sent to Topic 1, they may land in the same partition or in different partitions (so different partitions under the same topic contain different messages), and each partition lives on a corresponding Broker node.

Offset: a partition can be regarded as an append-only queue (Kafka only guarantees that messages within a single partition are ordered), and messages are appended to its tail. When a message enters a partition it is assigned an offset that identifies its position in that partition, and consumers use the offset to locate the messages they want to consume.
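To make these concepts concrete, here is a minimal consumer sketch using the Java client. The broker address localhost:9092, the consumer group name, and the topic name payments are assumptions for illustration; each record it prints carries the topic, the partition it landed in, and its offset within that partition.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "offset-demo");             // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments")); // hypothetical topic name
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                // each record exposes the topic, the partition it was written to,
                // and its offset within that partition
                System.out.printf("topic=%s partition=%d offset=%d value=%s%n",
                        r.topic(), r.partition(), r.offset(), r.value());
            }
        }
    }
}
```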

Given the concepts above, can you already guess how Kafka's multi-replica redundancy is implemented? Don't worry, let's move on.

Prior to Kafka version 0.8 there was no multi-replica redundancy mechanism: once a node died, all Partition data on that node could no longer be consumed, which meant some of the data sent to a Topic was lost.

The replica mechanism introduced in version 0.8 solves the problem of data loss after downtime nicely. Replicas are maintained per Partition of a Topic: each Partition's data is synchronized to other physical nodes, forming multiple replicas.

Each Partition's replicas consist of one Leader replica and several Follower replicas. The Leader is elected from all the replicas, and the remaining replicas are Followers. Producers write to and consumers read from the Leader only; after data is written, the Followers pull it from the Leader to stay in sync.

Is it really that simple? Yes, Kafka's high availability is built on this multi-replica architecture. When a Broker dies, don't worry: the Partitions on that Broker have replicas on other Broker nodes. And what if the one that died was the Leader? Then a new Leader is elected from the Followers, and producers and consumers carry on happily with the new Leader. That is high availability.

You may still be wondering: how many replicas are enough? What if a Follower is not fully in sync with the Leader? And what are the rules for electing a Leader after a node goes down?

Let's go straight to the conclusions:

How many replicas are enough? The more replicas there are, the better Kafka's availability, but more replicas also means more network and disk consumption and lower performance. Generally, 3 replicas are enough to ensure high availability; in extreme cases you can increase the replication-factor further.
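As a sketch of how the replication factor is applied in practice, the Java AdminClient can create a topic with 3 replicas per partition. The broker address localhost:9092 and the topic name payments are assumed values for illustration, not anything from the original setup.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // hypothetical topic: 6 partitions, 3 replicas per partition
            NewTopic topic = new NewTopic("payments", 6, (short) 3);
            admin.createTopics(List.of(topic)).all().get(); // blocks until the broker confirms
        }
    }
}
```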

What if a Follower is not fully in sync with the Leader? Followers and the Leader are neither fully synchronous nor fully asynchronous; instead Kafka uses the ISR mechanism (In-Sync Replica). Each Leader dynamically maintains an ISR list holding the Followers that are basically in sync with it. If a Follower falls behind because it fails to fetch data from the Leader for too long, due to network problems, GC pauses, and so on, it is kicked out of the ISR list. So the Followers in the ISR list are the replicas that are keeping up with the Leader.
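If you want to see the ISR for yourself, a small AdminClient sketch like the one below can describe a topic and print each partition's Leader and ISR list (same assumed broker address and hypothetical topic name as above).

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class IsrDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("payments")) // hypothetical topic
                    .all().get().get("payments");
            desc.partitions().forEach(p ->
                    // the ISR is the subset of replicas currently keeping up with the leader
                    System.out.printf("partition=%d leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}
```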

What are the rules for electing a Leader after a node goes down? There are many distributed election algorithms, such as ZooKeeper's Zab, Raft, Viewstamped Replication, and Microsoft's PacificA. Kafka's Leader election idea is very simple: based on the ISR list just mentioned, it walks through the replicas after the failure, and the first surviving replica found in the ISR list becomes the new Leader. It must also be ensured that the previous Leader has stepped down, otherwise a split brain occurs (two Leaders at once). How is that guaranteed? Kafka designates a single controller, which ensures that only one Leader is elected.

The acks Parameter Determines the Degree of Reliability

In addition, here is an extra piece of essential knowledge about Kafka high availability: the request.required.acks parameter.

The acks parameter is an important producer-client configuration that can be set when sending messages. It has three possible values: 0, 1, and all.

The first, 0, means that once the producer has sent a message out, it no longer cares whether the message lives or dies: fire and forget, with nobody taking responsibility. Naturally, messages may then be lost, and availability of the data is lost with them.

The second, 1, means that after the producer sends a message, it is considered successful as long as the message reaches the Leader, regardless of whether the Followers have synchronized it. There is a failure mode here: the Leader receives the message and its Broker goes down before the Followers have had time to sync, yet the producer already believes the message was sent successfully, so the message is lost. Note that 1 is Kafka's default setting! So the default configuration is not all that highly available; it is a trade-off between high availability and high throughput.

The third, all (or -1), means that after the producer sends a message, not only the Leader but also all the Followers in the ISR list must have synchronized it before the producer treats the send as successful.

Thinking a step further: does acks=all never lose messages? The answer is no. When the ISR list has shrunk to the Leader alone, acks=all degenerates into acks=1. In that case, can you still guarantee no data loss if that node goes down? Clearly not. Data is guaranteed not to be lost only when acks=all and there are at least two replicas in the ISR.
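Here is a hedged producer sketch with acks=all; the broker address, topic name, and record contents are assumed values for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AcksAllProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("acks", "all"); // wait for the leader and every follower in the ISR
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() returns a future; get() surfaces any broker-side failure to the caller
            producer.send(new ProducerRecord<>("payments", "order-1", "paid")).get();
        }
    }
}
```

On the broker or topic side, the min.insync.replicas setting can be raised to 2 so that, when the ISR shrinks to the Leader alone, acks=all writes are rejected rather than silently degenerating into acks=1.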

Solving the Problem

After this long detour through Kafka's high-availability mechanisms, we finally come back to the original question: why did the cluster become unusable after a single Kafka node went down?

In the development and test environment I had configured 3 Broker nodes, a Topic replication factor of 3, 6 Partitions, and acks set to 1.

When one of the three nodes goes down, what does the cluster do first? As described above, it notices that the Partition Leaders on that node have failed and re-elects Leaders from the ISR lists. If an ISR list is empty, is the partition unavailable? No: one of the surviving replicas of the Partition is chosen as Leader instead, but with a potential risk of data loss.

So as long as a Topic's replication factor equals the number of Brokers, Kafka's multi-replica redundancy can keep it available; a single node going down does not make it unavailable (note, however, that Kafka has a protection policy and stops serving when more than half of the nodes are unavailable). Now think carefully: is there any Topic on this Kafka cluster with only 1 replica?

The problem lies in __consumer_offsets. __consumer_offsets is a Topic created automatically by Kafka to store consumers' offset information, with a default of 50 Partitions. And it was this Topic whose replication factor defaulted to 1. If all of its Partitions live on the same machine, that is an obvious single point of failure! Kill the Broker hosting the Partitions of __consumer_offsets and you will find that every consumer has stopped consuming.

How to solve this problem?

First, delete __consumer_offsets. Note that this Topic is built into Kafka and cannot be deleted with the normal command; I removed it by deleting its log directories on disk.

Second, change the replication factor of __consumer_offsets to 3 by setting offsets.topic.replication.factor to 3.
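To verify the fix, a small sketch along the lines of the earlier AdminClient example (same assumed broker address) can confirm that every partition of __consumer_offsets now reports 3 replicas.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class OffsetsTopicCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("__consumer_offsets"))
                    .all().get().get("__consumer_offsets");
            // after the fix, each of the 50 partitions should report 3 replicas
            desc.partitions().forEach(p ->
                    System.out.printf("partition=%d replicas=%d%n",
                            p.partition(), p.replicas().size()));
        }
    }
}
```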

With __consumer_offsets now replicated as well, the problem of consumers being unable to consume after a node goes down was solved.

After reading the above, have you grasped how to think about Kafka downtime and the principles of Kafka high availability? Thank you for reading!
