What is the scaling process of KAFKA's ISR?

2025-07-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

I. Preface

Summary of knowledge points

II. Overview

The ISR (in-sync replica) list is elastic: a replica is kicked out of the ISR list promptly once it becomes invalid, and it is added back once it catches up. The rest of this article looks at the details along those lines.

III. What is an invalid replica?

Functional failure: the node is down, so every replica on that node is a functionally failed replica.

Synchronization failure: the follower replica cannot complete synchronization in time, for example because of limited bandwidth or high load on its broker, and is therefore kicked out of the ISR.

IV. Which parameters control ISR scaling?

Before version 0.9.x there was one control parameter, replica.lag.max.messages, with a default value of 4000: if a follower fell more than 4000 messages behind the leader, it was kicked out of the ISR list.

Is it reasonable to specify a fixed number of messages like this? Clearly not, for two reasons:

High-throughput scenario: tens of thousands of messages can arrive in an instant, so a follower that is only a few seconds behind may be judged invalid and kicked out. This can lead to frequent changes to the ISR list and frequent metadata updates.

Low-throughput scenario: with only a few messages a day, a follower could lag behind for several days and still remain in the ISR, which makes the ISR meaningless.

So starting with version 0.9.x, that parameter was removed and replaced with replica.lag.time.max.ms, which defaults to 10000 ms (10 s).

In other words, if a follower fails to catch up with the leader's LEO (log end offset) within 10 seconds, it is considered invalid and is kicked out of the ISR list.
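The contrast between the two criteria can be sketched in a few lines of Python. This is not Kafka's code: the function names and the sample numbers are assumptions made up for illustration, and only the two default values (4000 messages and 10000 ms) come from the text above.

```python
# Illustrative only: the helper names below are hypothetical, not Kafka APIs.
REPLICA_LAG_MAX_MESSAGES = 4000      # pre-0.9 default, from the text
REPLICA_LAG_TIME_MAX_MS = 10_000     # 0.9+ default, from the text


def out_of_sync_by_count(leader_leo: int, follower_leo: int) -> bool:
    """Old criterion: the follower is behind by more than a fixed message count."""
    return leader_leo - follower_leo > REPLICA_LAG_MAX_MESSAGES


def out_of_sync_by_time(now_ms: int, last_caught_up_ms: int) -> bool:
    """New criterion: the follower has not been caught up for too long."""
    return now_ms - last_caught_up_ms > REPLICA_LAG_TIME_MAX_MS


# High throughput: a burst of 50,000 messages arrives and the follower is
# only 1 second behind. The count rule evicts it, the time rule does not.
burst_by_count = out_of_sync_by_count(leader_leo=50_000, follower_leo=0)
burst_by_time = out_of_sync_by_time(now_ms=1_000, last_caught_up_ms=0)

# Low throughput: the follower is 3 messages behind but has been stuck for
# two days. The count rule keeps it, the time rule evicts it.
two_days_ms = 2 * 24 * 60 * 60 * 1000
idle_by_count = out_of_sync_by_count(leader_leo=3, follower_leo=0)
idle_by_time = out_of_sync_by_time(now_ms=two_days_ms, last_caught_up_ms=0)

print(burst_by_count, burst_by_time)  # True False
print(idle_by_count, idle_by_time)    # False True
```

In both scenarios the time-based rule gives the answer one would intuitively expect, which is the motivation for the switch.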

V. How does the ISR eliminate invalid replicas?

Now that we know how an invalid replica is identified, let's look at how it actually gets kicked out.

1. Each broker launches two scheduled tasks at startup:

isr-expiration: periodically checks the replicas of every partition whose leader is on the current broker, that is, whether the ISR list of each such leader contains an invalid replica. The default execution period is replica.lag.time.max.ms / 2 = 5s.

isr-change-propagation: periodically checks whether the in-memory isrChangeSet holds any new changes, with a fixed execution period of 2.5s.
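One consequence of the 5s check period is worth working through. The sketch below is plain arithmetic, not Kafka code: it assumes a follower counts as invalid once the current time minus its last caught-up time exceeds replica.lag.time.max.ms, that the check fires at 5000, 10000, 15000, ... ms, and that the follower never catches up again. All names are hypothetical.

```python
# Illustrative arithmetic only; the names below are hypothetical, not Kafka APIs.
REPLICA_LAG_TIME_MAX_MS = 10_000
ISR_EXPIRATION_PERIOD_MS = REPLICA_LAG_TIME_MAX_MS // 2   # 5000 ms


def first_detection_ms(last_caught_up_ms: int) -> int:
    """Return the time of the first isr-expiration run that flags the follower.

    Assumes the task runs at 5000, 10000, 15000, ... ms and that the follower
    never catches up again after last_caught_up_ms.
    """
    check_ms = ISR_EXPIRATION_PERIOD_MS
    while check_ms - last_caught_up_ms <= REPLICA_LAG_TIME_MAX_MS:
        check_ms += ISR_EXPIRATION_PERIOD_MS
    return check_ms


# A follower that last caught up right after a check (t = 1 ms) is flagged at
# t = 15000 ms, almost 1.5 x replica.lag.time.max.ms later.
worst_case_delay = first_detection_ms(1) - 1

# A follower that last caught up right before a check (t = 4999 ms) is also
# flagged at t = 15000 ms, just over replica.lag.time.max.ms later.
best_case_delay = first_detection_ms(4_999) - 4_999

print(worst_case_delay, best_case_delay)  # 14999 10001
```

Under these assumptions, a lagging follower stays in the ISR for somewhere between 10s and roughly 15s, depending on how its last caught-up time lines up with the check schedule.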

2. Judging that a replica is invalid:

The isr-expiration task subtracts a follower's lastCaughtUpTimeMs from the current time (now). If the difference is greater than replica.lag.time.max.ms, the follower is invalid.

lastCaughtUpTimeMs is updated when the follower's LEO equals the leader's LEO (the leader keeps track of each follower's LEO).

In other words, lastCaughtUpTimeMs is updated only when the follower has fully caught up with the leader, not on every Fetch.

Why not update this value on every Fetch?

Suppose the leader's write rate is much higher than the follower's synchronization rate: the leader may have written 100,000 messages while the follower is still synchronizing slowly because of network or load problems. If lastCaughtUpTimeMs were updated on every Fetch, the follower would still be considered valid simply because its Fetch requests keep arriving, and a huge gap between leader and follower could build up unnoticed, which hurts data reliability.
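The scenario above can be replayed with a small simulation. This is a simplified sketch, not Kafka's implementation: the class, the functions, the update_on_every_fetch switch, and the write and fetch rates are all assumptions chosen to make the effect visible.

```python
from dataclasses import dataclass

# Hypothetical names throughout; only the 10000 ms default comes from the text.
REPLICA_LAG_TIME_MAX_MS = 10_000


@dataclass
class FollowerState:
    leo: int = 0                 # follower's log end offset, as seen by the leader
    last_caught_up_ms: int = 0


def on_fetch(state, fetch_offset, leader_leo, now_ms, update_on_every_fetch=False):
    """Record a Fetch; refresh last_caught_up_ms only if the rule allows it."""
    state.leo = fetch_offset
    if update_on_every_fetch or fetch_offset >= leader_leo:
        state.last_caught_up_ms = now_ms


def is_invalid(state, now_ms):
    return now_ms - state.last_caught_up_ms > REPLICA_LAG_TIME_MAX_MS


def simulate(update_on_every_fetch):
    """Leader appends 1000 msgs/s, follower fetches only 100 msgs/s, for 30 s."""
    state = FollowerState()
    leader_leo = 0
    for second in range(1, 31):
        leader_leo += 1000
        on_fetch(state, state.leo + 100, leader_leo, second * 1000,
                 update_on_every_fetch)
    return is_invalid(state, 30_000), leader_leo - state.leo


print(simulate(update_on_every_fetch=False))  # (True, 27000)
print(simulate(update_on_every_fetch=True))   # (False, 27000)
```

With the "only when fully caught up" rule the follower is flagged as invalid, whereas refreshing the timestamp on every Fetch would keep it in the ISR despite being 27,000 messages behind.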

3. How is an ISR change communicated?

The isr-expiration task on the leader's broker detects the invalid replica, updates the /state node data in zk, and also writes the change to isrChangeSet.

The isr-change-propagation task checks whether isrChangeSet contains new data and, if so, creates a child node under the /isr_change_notification node in zk.

The Controller has a Watcher on this node. When it finds a new child node, it reads the latest metadata from zk and notifies all brokers to update their metadata.

This process also shows that a change stays in memory for a while before it is propagated. If the broker goes down during that window, has zk been changed without the other brokers ever updating their metadata?

No. When a broker goes down, its node under /brokers/ids in zk is removed. The Controller reacts to that as well and has the other brokers update their metadata, so the update happens anyway.

Finally, let's summarize the whole process of removing a replica from the ISR:

Each broker starts two scheduled tasks at startup, one of which checks for invalid replicas at regular intervals.

If a follower's lastCaughtUpTimeMs lies more than replica.lag.time.max.ms (10s by default) in the past, it is judged to be an invalid replica.

When the scheduled task finds an invalid replica, it writes the latest ISR list to the /state node in zk and records the change in the in-memory isrChangeSet.

The propagation task then periodically checks isrChangeSet for pending changes and, if there are any, creates a child node under the /isr_change_notification node in zk.

Finally, the Controller notices the new node, reads the latest metadata from zk, and notifies all brokers to update their metadata, which completes the update of the ISR list across the cluster.
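The five steps above can be traced end to end with a toy, in-memory model. Everything in it is hypothetical: FakeZk is just two Python containers standing in for ZooKeeper, and the class and function names mirror the steps in the text rather than Kafka's actual classes.

```python
# A toy, in-memory model of the flow above. Every class and method name here
# is hypothetical; it only mirrors the steps in the text, not Kafka's code.
REPLICA_LAG_TIME_MAX_MS = 10_000


class FakeZk:
    """Stand-in for ZooKeeper: just two plain containers."""

    def __init__(self):
        self.state = {}          # partition -> ISR list (the /state data)
        self.notifications = []  # children of /isr_change_notification


class LeaderBroker:
    def __init__(self, zk, partition, leader_id, isr, last_caught_up_ms):
        self.zk = zk
        self.partition = partition
        self.leader_id = leader_id
        self.isr = list(isr)
        self.last_caught_up_ms = dict(last_caught_up_ms)
        self.isr_change_set = set()
        zk.state[partition] = list(isr)

    def isr_expiration(self, now_ms):
        """Find invalid followers, write /state, record the change in memory."""
        invalid = [
            r for r in self.isr
            if r != self.leader_id
            and now_ms - self.last_caught_up_ms[r] > REPLICA_LAG_TIME_MAX_MS
        ]
        if invalid:
            self.isr = [r for r in self.isr if r not in invalid]
            self.zk.state[self.partition] = list(self.isr)
            self.isr_change_set.add(self.partition)

    def isr_change_propagation(self):
        """Turn pending changes into a notification child node."""
        if self.isr_change_set:
            self.zk.notifications.append(sorted(self.isr_change_set))
            self.isr_change_set.clear()


def controller_on_notification(zk, broker_caches):
    """Read the latest ISR from zk and push it to every broker's cache."""
    for partitions in zk.notifications:
        for partition in partitions:
            for cache in broker_caches.values():
                cache[partition] = list(zk.state[partition])
    zk.notifications.clear()


zk = FakeZk()
leader = LeaderBroker(
    zk, "topic-0", leader_id=1, isr=[1, 2, 3],
    last_caught_up_ms={2: 19_000, 3: 5_000},
)
broker_caches = {1: {}, 2: {}, 3: {}}

leader.isr_expiration(now_ms=20_000)      # follower 3 is 15 s behind
leader.isr_change_propagation()
controller_on_notification(zk, broker_caches)

print(zk.state["topic-0"])                # [1, 2]
print(broker_caches[2]["topic-0"])        # [1, 2]
```

The model leaves out watchers, timers, and failure handling on purpose; it only shows which component writes what, and in which order.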

VI. How does a caught-up replica rejoin the ISR?

After the fifth section, the sixth is simple: you only need to know when a replica is judged to be in sync again. That is, when the LEO of the currently invalid follower has caught up with the leader's HW (high watermark), it can rejoin the ISR.

So the next question is: where is the check followerLEO >= leaderHW performed?

Unlike the removal of ISR members above, this is not detected by a scheduled task. Instead, it happens while a Fetch request is processed: if the request was sent by a follower (replicaId >= 0), the leader looks at the follower's current LEO (which is carried in the Fetch request) and checks whether it has caught up with the current leaderHW. If so, it performs the ISR expansion.

Expanding the ISR then follows the same process as above: first write the /state data in zk, then write isrChangeSet, and finally the Controller notices the change and updates the cluster metadata.

The main difference to remember is that ISR expansion is checked and executed while handling a Fetch request.
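A minimal sketch of that check on the Fetch path might look like the following. The function name and its parameters are hypothetical, and the code makes two assumptions that go slightly beyond the text: "caught up" is treated as follower LEO >= leader HW, and a follower is only admitted if it is one of the partition's assigned replicas.

```python
# Hypothetical names throughout; a sketch of the check described above,
# not Kafka's implementation.
def maybe_expand_isr(replica_id, follower_leo, leader_hw, isr, assigned_replicas):
    """Add the follower to isr if this Fetch shows it has caught up.

    Returns True when the ISR was expanded.
    """
    if replica_id < 0:
        return False                      # Fetch from a consumer, not a follower
    if replica_id in isr or replica_id not in assigned_replicas:
        return False                      # already in sync, or not a replica
    if follower_leo >= leader_hw:         # assumption: "caught up" means LEO >= HW
        isr.add(replica_id)
        return True
    return False


isr = {1, 2}
assigned = {1, 2, 3}

still_behind = maybe_expand_isr(3, follower_leo=90, leader_hw=100,
                                isr=isr, assigned_replicas=assigned)
caught_up = maybe_expand_isr(3, follower_leo=100, leader_hw=100,
                             isr=isr, assigned_replicas=assigned)
from_consumer = maybe_expand_isr(-1, follower_leo=100, leader_hw=100,
                                 isr=isr, assigned_replicas=assigned)

print(still_behind, caught_up, from_consumer, sorted(isr))
# False True False [1, 2, 3]
```

After a successful expansion, the /state write and isrChangeSet bookkeeping described above would follow; they are omitted here to keep the check itself in focus.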

VII. Summary of the whole scaling process

Finally, two diagrams reinforce the picture:

[Figure 1: Invalid replicas (source: "In-Depth Understanding of Kafka")]

[Figure 2: Kicking a replica out of the ISR list]

VIII. Performance optimization

As shown above, every ISR change requires metadata updates in zk, in the Controller, and on each broker, so changes that happen too often will cause performance problems.

So before propagating ISR changes, Kafka checks two more conditions to reduce the frequency, and propagates only if at least one of them holds:

The ISR set last changed more than 5s ago.

The last ISR change notification was written to zk more than 60s ago.

Consider a replica that has just caught up with the leader and joined the ISR, but then fails to keep up with the leader's LEO and is found invalid by a check 5s later. Announcing every such flip immediately would update metadata far too often. With the two conditions above, an occasional ISR change is still propagated within a few seconds, while a burst of changes is batched and propagated roughly once every 60 seconds.
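The throttling logic can be sketched as a single predicate. The names below are hypothetical, and the sketch assumes that the two conditions guard the propagation step and that either one is sufficient; only the 5s and 60s thresholds come from the text.

```python
# Hypothetical names; the two thresholds are the 5 s and 60 s from the text.
ISR_CHANGE_BLACKOUT_MS = 5_000
ISR_PROPAGATION_INTERVAL_MS = 60_000


def should_propagate(now_ms, has_pending_changes, last_isr_change_ms,
                     last_propagation_ms):
    """Decide whether pending ISR changes should be announced now.

    Assumption: the two conditions are combined with OR.
    """
    isr_is_quiet = last_isr_change_ms + ISR_CHANGE_BLACKOUT_MS < now_ms
    overdue = last_propagation_ms + ISR_PROPAGATION_INTERVAL_MS < now_ms
    return has_pending_changes and (isr_is_quiet or overdue)


# ISR changed 1 s ago, last propagation 10 s ago: hold back.
too_soon = should_propagate(10_000, True, last_isr_change_ms=9_000,
                            last_propagation_ms=0)

# ISR has been stable for 7 s: propagate.
quiet = should_propagate(16_000, True, last_isr_change_ms=9_000,
                         last_propagation_ms=0)

# ISR is still changing, but the last propagation was 70 s ago: propagate.
overdue = should_propagate(70_000, True, last_isr_change_ms=69_000,
                           last_propagation_ms=0)

# Nothing pending: never propagate.
nothing = should_propagate(70_000, False, last_isr_change_ms=0,
                           last_propagation_ms=0)

print(too_soon, quiet, overdue, nothing)  # False True True False
```

The four calls cover the interesting cases: a change that is too recent, a change that has settled, a long-running burst, and an empty change set.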
