2025-03-29 Update From: SLTechnology News&Howtos
This article is about how to analyze the Kafka architecture and its high-availability mechanisms. The editor finds it very practical and shares it here in the hope that you get something out of it.
Today, let's talk about Kafka first. Since there don't seem to be many people following the HBase series, I'll skip it and talk about everyone's favorite instead.
1. Kafka architecture diagram
A Kafka deployment contains multiple Producers, multiple Brokers, and multiple Consumers. Each Producer can write to multiple Topics, and each Consumer belongs to exactly one ConsumerGroup.
The whole Kafka cluster relies on one ZooKeeper (ZK) ensemble, which stores cluster configuration, takes part in leader election, and triggers a rebalance when the membership of a consumer group changes.
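To make the ConsumerGroup semantics above concrete, here is a minimal illustrative sketch (not Kafka's actual implementation, and the function name is invented): within one consumer group, each partition of a topic is consumed by exactly one consumer, so the partitions are divided among the group's members.

```python
# Illustrative sketch: dividing the partitions of one topic among the
# consumers of a single consumer group. Each partition goes to exactly
# one consumer in the group (here by simple round-robin).

def assign_partitions(partitions, consumers):
    """Round-robin a list of partition ids over a list of consumer ids."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Four partitions, two consumers in the group:
print(assign_partitions([0, 1, 2, 3], ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1, 3]}
```

A second consumer group would get its own independent assignment over the same partitions, which is how Kafka lets multiple applications read the same topic in full.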
For a complex distributed system, without rich experience and strong architectural skills, it is difficult to keep the system simple and easy to maintain. As we all know, late-stage maintenance accounts for about 70% of a software product's life cycle, so maintainability is extremely important. Kafka has become the de facto standard in the big data field largely because its operation and maintenance are convenient and simple. Today let's take a look at how Kafka simplifies operations.
Kafka uses multiple replicas to ensure that messages are not lost, which brings us to Kafka's replication mechanism. In a very large cluster there are always problems: a disk fails here, CPU load spikes there. If you want fully automated fault tolerance, you have to make some trade-offs. Let's use an example to illustrate the complexity operations teams face. As we all know, Kafka has an ISR set; let me first explain the concept:
Kafka replication is neither fully synchronous nor fully asynchronous; it uses the ISR mechanism:
1. The leader maintains a list of replicas that are roughly in sync with it, called the ISR (in-sync replicas). Each partition has its own ISR, dynamically maintained by its leader.
2. If a follower lags too far behind the leader, or has not issued a replication (fetch) request for a certain period, the leader removes it from the ISR.
3. Only when every replica in the ISR has sent an ACK to the leader does the leader commit the message, and only then can the producer consider the messages in that request committed.
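The commit rule in step 3 can be sketched as a toy model (illustrative only, not Kafka's actual code; Kafka calls the committed position the "high watermark"): the leader can only treat as committed the messages that every ISR replica has acknowledged.

```python
# Toy model of the ISR commit rule: the committed offset is the smallest
# acknowledged offset among the leader and all replicas in the ISR.

def high_watermark(leader_end_offset, isr_acked_offsets):
    """Return the highest offset that is safe to consider committed."""
    return min([leader_end_offset] + list(isr_acked_offsets))

# The leader has appended up to offset 10; its two ISR followers have
# acknowledged offsets 10 and 8 respectively:
print(high_watermark(10, [10, 8]))  # 8 -> only messages up to 8 are committed
```

This is why, as the article notes later, producer-visible write latency is bounded by the slowest broker in the ISR: the minimum cannot advance past it.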
Under this mechanism, what happens if a producer sends so many messages in one request that a follower instantly falls far behind the leader? If followers keep moving in and out of the ISR, does that hurt performance? If we alert on this situation, we may cause an alert storm. If we do not alert, what if a broker really has died, or lags far behind the leader because of IO pressure or GC pauses?
Today let's take a look at how Kafka is designed to avoid this operational headache altogether.
2. The replication mechanism of Kafka
Each Kafka partition is an immutable sequence of messages appended in order, with each message carrying a unique offset that marks its position.
Kafka replicates at partition granularity. When we create a topic, we can set a replication factor that determines the number of replicas per partition. If the leader fails, Kafka fails the partition over to another replica, ensuring that the partition's messages remain available. The leader node is responsible for receiving messages from producers, and the other replicas (followers) copy messages from the leader.
The guarantee provided by Kafka's log-replication algorithm is that once a message is considered committed on the producer side, it can still be consumed even if the leader dies and another node is elected leader.
For this to hold, a new leader can only be elected from the ISR, and every node in that set must be in sync with the leader's committed messages, that is, without lag. The partition leader maintains the ISR list and kicks out any node that falls too far behind.
After a producer sends a message to the leader, the leader commits it only once every replica in the ISR has sent an ACK; only then can the producer consider the message committed. Because of this, the write performance seen by Kafka clients depends on the slowest broker in the ISR. If one node performs too poorly, it must be identified quickly and kicked out of the ISR to avoid dragging down performance.
3. How a replica keeps up with the leader
If a replica cannot "catch up" with the leader, it is likely to be kicked out of the ISR. Let's use an example to show what "caught up" really means: being in sync with the leader's messages.
Take a single-partition topic, topic-foo, with a replication factor of 3. The partition layout, leader, and followers are shown in the figure below: brokers 2 and 3 are followers and both are in the ISR. We set replica.lag.max.messages to 4, meaning that as long as a follower lags the leader by no more than 3 messages, it is considered caught up and will not be kicked out. We set replica.lag.time.max.ms to 500ms, meaning that as long as a follower sends a fetch request within every 500ms window, it is not considered dead and is not kicked out of the ISR.
Now the producer sends a message with offset 3, and a GC pause occurs on broker 3, as shown below:
Because broker 3 is still in the ISR, the message will not be committed until broker 3 either fetches the message at offset 3 or is kicked out of the ISR. Since replica.lag.max.messages is 4 and broker 3 is only one message behind, it is not kicked out. If broker 3's GC finishes after 100ms and it then fetches the message at offset 3, it is fully back in sync with the leader, having stayed in the ISR the whole time, as shown in the following figure:
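The message-count check in this example can be sketched as follows (a simplification of the pre-0.9 behavior; the function name is illustrative, the config name is real): with replica.lag.max.messages = 4, a follower stays in the ISR while it lags by at most 3 messages.

```python
# Sketch of the message-count lag check from the example above.
REPLICA_LAG_MAX_MESSAGES = 4  # real Kafka config name (pre-0.9)

def stays_in_isr(leader_end_offset, follower_end_offset):
    """Follower is kicked out only when it lags by >= replica.lag.max.messages."""
    return leader_end_offset - follower_end_offset < REPLICA_LAG_MAX_MESSAGES

# Broker 3 pauses for GC while the leader appends offset 3: lag of 1 message.
print(stays_in_isr(leader_end_offset=4, follower_end_offset=3))  # True: stays in ISR
# Had the leader raced 4 messages ahead, broker 3 would be evicted:
print(stays_in_isr(leader_end_offset=7, follower_end_offset=3))  # False: kicked out
```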
4. When a replica is kicked out of the ISR
There are several reasons why a replica may be kicked out of the ISR:
A replica fails to keep up with the leader for a period of time, i.e. the offset gap between it and the leader exceeds replica.lag.max.messages. Usually this is because its IO cannot keep up or its CPU load is too high, so the broker appends messages to its disk more slowly than it receives them from the leader.
A broker has not sent a fetch request to the leader for a long time (longer than replica.lag.time.max.ms), perhaps because a full GC occurred on the broker, or because it died for some other reason.
A broker starts replicating from scratch: for example, a broker id is reused on a new machine after a disk failure, or a partition is reassigned to a new broker; it then begins synchronizing from the oldest message present on the partition leader.
So from Kafka version 0.8 on there are two parameters: replica.lag.max.messages identifies nodes that are persistently slow, and replica.lag.time.max.ms identifies nodes that are stuck.
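The two checks just named can be combined in one toy eviction function (a simplification, not Kafka's actual code; the config names are real, the function name is illustrative):

```python
# Sketch of the two pre-0.9 ISR eviction checks:
REPLICA_LAG_MAX_MESSAGES = 4   # catches followers that are too slow
REPLICA_LAG_TIME_MAX_MS = 500  # catches followers that are stuck

def should_evict(lag_messages, ms_since_last_fetch):
    slow = lag_messages >= REPLICA_LAG_MAX_MESSAGES
    stuck = ms_since_last_fetch > REPLICA_LAG_TIME_MAX_MS
    return slow or stuck

print(should_evict(lag_messages=1, ms_since_last_fetch=100))  # False: healthy
print(should_evict(lag_messages=5, ms_since_last_fetch=100))  # True: too slow
print(should_evict(lag_messages=0, ms_since_last_fetch=800))  # True: stuck
```

The "slow" branch is exactly the one that causes trouble under traffic jitter, as the next section shows.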
5. When is a node really lagging behind
Judging from the cases above, the two parameters seem sufficient: if a replica has not sent a fetch request for longer than replica.lag.time.max.ms, it can be considered stuck and kicked out. But there is a special case that has not been taken into account.
We set replica.lag.max.messages to 4 above because we already knew that the producer sends fewer than 4 messages per request. But if this parameter applies across many topics, it is hard to estimate the maximum number of messages any producer sends, and what happens under frequent traffic jitter? Let's use an example to illustrate:
Suppose the producer of topic-foo, due to a traffic spike, sends a single request containing 4 messages while replica.lag.max.messages is still set to 4. At that instant, all followers are kicked out of the ISR because of the number of messages they lag behind:
Then, because the followers are actually healthy, they catch up with the leader on the next fetch request and rejoin the ISR. If jitter is frequent, replicas keep moving in and out of the ISR, causing a headache-inducing alert storm.
The core problem here is that with a massive number of topics, or under frequent traffic jitter, we cannot make any assumption about how many messages each topic's producer sends per request, so it is hard to pick an appropriate replica.lag.max.messages value.
6. One configuration does it all
In fact, there are only two genuinely abnormal cases: a follower is stuck, or a follower is slow. If we judge whether a follower should be kicked out of the ISR purely by how many messages it lags behind the leader, we are forced to predict and estimate traffic. How can we avoid this unreliable estimation?
The solution given by kafka is as follows:
The meaning of replica.lag.time.max.ms was enhanced. As before, a follower that is stuck and sends no fetch request for longer than this time is kicked out of the ISR. The new logic is that a follower that falls more than replica.lag.max.messages messages behind the leader is no longer kicked out immediately; it is removed only if it stays behind for longer than replica.lag.time.max.ms. This avoids the operational problems caused by traffic jitter, because a healthy follower catches up with the leader on its next fetch, and there is no longer any need to estimate a topic's write rate.
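The enhanced rule can be sketched like this (a simplification of the behavior the article describes; the function and parameter names are illustrative, the config name is real): eviction now depends only on how long the follower has failed to be caught up, not on how many messages a single burst added.

```python
# Sketch of the enhanced, time-only eviction rule.
REPLICA_LAG_TIME_MAX_MS = 500  # real Kafka config name

def should_evict(now_ms, last_caught_up_ms):
    """last_caught_up_ms: last time the follower's fetch reached the leader's
    log end offset. A stuck follower never catches up, so this single check
    covers both the slow case and the stuck case."""
    return now_ms - last_caught_up_ms > REPLICA_LAG_TIME_MAX_MS

# A burst arrives at t=1000ms; 400ms later the follower is still behind,
# but within the grace period, so it stays in the ISR:
print(should_evict(now_ms=1400, last_caught_up_ms=1000))  # False: jitter tolerated
# If it is still behind 600ms later, it really is lagging and is evicted:
print(should_evict(now_ms=1600, last_caught_up_ms=1000))  # True: persistently behind
```

Note how this folds the old two-parameter decision into a single time-based one, which is exactly why traffic jitter no longer causes replicas to flap in and out of the ISR.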
The above is how to analyze the Kafka architecture and its high-availability mechanisms. The editor believes these are points you may well see or use in daily work, and hopes you have learned something new from this article.