2025-04-07 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/03 Report
Kafka is a popular distributed streaming and big-data message queue solution that has been widely adopted across the technology industry, and Dropbox is no exception. Kafka plays an important role in many of Dropbox's distributed systems: data analytics, machine learning, monitoring, search, stream processing, and more. At Dropbox, Kafka clusters are managed by the Jetstream team, whose primary responsibility is to provide high-quality Kafka services. One of the team's main goals is to understand the throughput limits of Kafka within the Dropbox infrastructure, which is critical for making appropriate configuration decisions for different use cases. Recently, the team built an automated test platform to achieve this goal. This article shares their methodology and some interesting findings.
Test platform
The figure above depicts the setup of the test platform used in this article. We use the Kafka client in Spark so that traffic can be generated and consumed at arbitrary scale. We built three Kafka clusters of different sizes; to resize the cluster under test, we only need to redirect traffic to a different cluster. We created a Kafka topic to carry the test traffic. For simplicity, we distribute traffic evenly among the Kafka brokers. To achieve this, we created the test topic with 10 times as many partitions as brokers, so that each broker is the leader of exactly 10 partitions. Because writes to a single partition are serial, too few partitions per broker can cause write contention, which limits throughput. According to our experiments, 10 is the right number to avoid throughput bottlenecks caused by write contention.
Due to the distributed nature of the infrastructure, our clients are spread across different regions of the United States. Because the test traffic is far below the limits of the Dropbox network backbone, we can safely assume that the cross-region throughput limits also apply to local traffic.
What affects the workload?
Many factors affect the workload of a Kafka cluster: the number of producers, the number of consumer groups, the initial consumer offset, the number of messages per second, the size of each message, the number of topics and partitions involved, and so on. Since we are free to set any of these parameters, we need to find the dominant factors in order to reduce the test complexity to a practical level.
We studied different combinations of parameters and concluded that the main factors we need to consider are the number of messages produced per second (mps) and the size in bytes of each message (bpm).
Flow model
We took a formal approach to understanding the throughput limits of Kafka. A given Kafka cluster has an associated traffic space; each point in this multidimensional space corresponds to a Kafka traffic pattern and can be expressed as a parameter vector. All traffic patterns that do not overload Kafka form a closed subspace, whose boundary is the throughput limit of the Kafka cluster.
For the initial test, we chose mps and bpm as the basis of the traffic space, so the space reduces to a two-dimensional plane. The set of acceptable traffic patterns forms a closed region, and finding the throughput limit of Kafka is equivalent to plotting the boundary of that region.
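The two-dimensional traffic space above can be made concrete with a small sketch (all names are illustrative; the boundary here is a toy stand-in for the measured one):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrafficPattern:
    mps: float  # messages produced per second
    bpm: float  # bytes per message

def within_limit(p: TrafficPattern, boundary_mps_for: dict) -> bool:
    """A pattern is acceptable if its mps is at or below the measured
    boundary mps for its message size (bpm); unknown sizes are rejected."""
    return p.mps <= boundary_mps_for.get(p.bpm, 0.0)

# Toy boundary: at 1 KiB messages, up to 50,000 mps is safe.
boundary = {1024.0: 50_000.0}
assert within_limit(TrafficPattern(mps=40_000, bpm=1024.0), boundary)
assert not within_limit(TrafficPattern(mps=60_000, bpm=1024.0), boundary)
```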
Automated testing
To draw the boundary with reasonable precision, we need to run hundreds of experiments with different settings, which would be impractical to do manually. We therefore designed an algorithm that can run all the experiments without human intervention.
Overload indicator
We need to find a series of indicators that can programmatically judge the health of Kafka. We studied a large number of candidate indicators and finally locked in the following:
IO threads are less than 20% idle: this means the worker thread pool Kafka uses to process client requests is too busy to handle additional workload.
The in-sync replica (ISR) set changes by more than 50%: this means at least one broker cannot keep up with replicating the leader's data.
The Jetstream team also uses these metrics to monitor Kafka health; they are the first to signal when a cluster is under too much pressure.
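The two overload signals above can be expressed as a simple predicate (a minimal sketch; the parameter names are paraphrases of the metrics described in the article, not actual Kafka metric identifiers):

```python
def is_overloaded(io_thread_idle_ratio: float, isr_change_ratio: float) -> bool:
    """True if either health signal fires:
    - request-handler (IO) threads are less than 20% idle, or
    - the in-sync replica set changed by more than 50%."""
    return io_thread_idle_ratio < 0.20 or isr_change_ratio > 0.50

assert is_overloaded(0.15, 0.0)       # busy IO threads -> overloaded
assert is_overloaded(0.90, 0.60)      # ISR churn -> overloaded
assert not is_overloaded(0.50, 0.10)  # healthy cluster
```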
Finding the boundary
To find a boundary point, we fix the bpm dimension and try to overload Kafka by changing the mps value. The boundary is found once we have one mps value that is safe and another that brings the cluster close to overload; we treat the safe value as a boundary point. We then trace the entire boundary line by repeating this process, as follows:
It is worth noting that, instead of adjusting mps directly, we adjusted the number of producers (denoted np), each running at the same production rate. This is mainly because batching makes it difficult to control the production rate of a single producer, whereas changing the number of producers scales traffic linearly. According to our early research, increasing the number of producers alone does not impose a significant additional load on Kafka.
We use binary search to find a single boundary point. The search starts with a very large np window [0, max], where max is a value that is certain to cause overload. In each iteration, the midpoint is chosen to generate traffic: if Kafka overloads at this value, the value becomes the new upper bound; otherwise it becomes the new lower bound. The process stops when the window is narrow enough, and we take the mps value corresponding to the final lower bound as the boundary point.
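The search loop above can be sketched as follows (an illustrative sketch, not Jetstream's actual code; `overloads` is a stand-in for launching Spark producers and checking the overload indicators):

```python
from typing import Callable

def find_boundary_np(overloads: Callable[[int], bool],
                     np_max: int, tolerance: int = 1) -> int:
    """Binary-search the largest producer count (np) that does NOT
    overload the cluster, for a fixed message size (bpm).

    overloads(np) generates traffic with np producers and reports
    whether the overload indicators fired; np_max must overload."""
    lo, hi = 0, np_max  # invariant: lo is safe, hi overloads
    while hi - lo > tolerance:
        mid = (lo + hi) // 2
        if overloads(mid):
            hi = mid  # overloaded -> new upper bound
        else:
            lo = mid  # safe -> new lower bound
    return lo  # the safe value just below the boundary

# Toy stand-in: pretend the cluster overloads beyond 37 producers.
assert find_boundary_np(lambda np: np > 37, np_max=1024) == 37
```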
Results
The figure above plots the boundaries for Kafka clusters of different sizes. Based on this result, we can conclude that the maximum throughput the Dropbox infrastructure can sustain is 60 MB/s per broker.
It is worth noting that this is a conservative limit, since the message contents in our tests were completely random, mainly to minimize the effect of Kafka's internal message compression. In production, Kafka messages usually follow a pattern because they are typically generated by similar processes, which leaves a lot of room for compression. We tested an extreme case in which every message consists of the same repeated character, and observed a much higher throughput limit.
In addition, this throughput limit still holds when five consumer groups subscribe to the test topic; in other words, the same write throughput can be achieved when read throughput is five times the write throughput. When the number of consumer groups grows beyond five, write throughput begins to decline as the network becomes the bottleneck. Because the read/write traffic ratio in Dropbox production environments is well below five, the limit we obtained applies to all production clusters.
This result provides guidance for future Kafka configuration. Assuming we allow up to 20% of brokers to be offline, the maximum safe throughput of a single broker is 60 MB/s * 0.8 ≈ 50 MB/s. With this figure, we can determine cluster sizes based on the estimated throughput of future use cases.
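The sizing arithmetic above can be captured in a few lines (a minimal sketch under the article's assumptions; the helper name is illustrative, not Jetstream's actual tooling):

```python
import math

MAX_THROUGHPUT_MBPS = 60   # measured per-broker limit from the tests
BROKER_AVAILABILITY = 0.8  # tolerate up to 20% of brokers offline

def brokers_needed(estimated_write_mbps: float) -> int:
    """Minimum broker count for an estimated aggregate write throughput,
    keeping each broker under its safe limit (60 * 0.8 = 48 MB/s,
    roughly the article's ~50 MB/s figure)."""
    safe_per_broker = MAX_THROUGHPUT_MBPS * BROKER_AVAILABILITY  # 48 MB/s
    return math.ceil(estimated_write_mbps / safe_per_broker)

assert brokers_needed(480) == 10  # 480 MB/s fits exactly on 10 brokers
assert brokers_needed(500) == 11  # any excess rounds up to another broker
```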
Impact on future work
This platform and automated test suite will be a valuable asset to the Jetstream team. When we switch to new hardware, change the network configuration, or upgrade the Kafka version, we can rerun these tests and derive new throughput limits. We can apply the same approach to explore other factors that affect Kafka performance. Finally, the platform lets Jetstream simulate new traffic patterns or reproduce problems in an isolated environment.
Summary
In this article, we presented a systematic approach to understanding the throughput limits of Kafka. Note that these results were obtained on the Dropbox infrastructure, so the numbers may not apply to other Kafka deployments due to differences in hardware, software stacks, and network conditions. We hope the techniques introduced here will help readers understand their own Kafka systems.