This article introduces the relevant knowledge around the question "are more Kafka partitions always better?". Many people run into this dilemma in real deployments, so let's work through how to reason about it. I hope you read it carefully and get something out of it!
The advantage of more partitions
Kafka uses partitioning to spread a topic's messages across multiple partitions stored on different brokers, which is how it achieves high message throughput for both producers and consumers. Both the Kafka producer and consumer can operate with multiple threads in parallel, each thread handling one partition's data, so the partition is in fact the smallest unit of Kafka parallelism. On the producer side, multiple threads concurrently open Socket connections to the brokers holding different partitions and send messages to those partitions at the same time; on the consumer side, each consumer thread in a consumer group reads from partitions of the subscribed topic, and any given partition is consumed by only one thread in the group.
So, the more partitions a topic has, the more throughput the entire cluster can theoretically achieve.
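As a minimal sketch of this threading model (using the Java client from Scala; the topic name "demo-topic", the broker address and the partition count of 4 are assumptions for illustration, not from the article), each thread below is assigned exactly one partition and polls it independently:

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
import org.apache.kafka.common.TopicPartition

object OneThreadPerPartition {
  def main(args: Array[String]): Unit = {
    val partitionCount = 4  // assumed partition count of the hypothetical topic "demo-topic"
    val threads = (0 until partitionCount).map { p =>
      new Thread(() => {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")  // assumed broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        val consumer = new KafkaConsumer[String, String](props)
        // each thread reads exactly one partition: the smallest unit of parallelism
        consumer.assign(Collections.singletonList(new TopicPartition("demo-topic", p)))
        while (true) {
          val records = consumer.poll(Duration.ofMillis(500))
          records.forEach((r: ConsumerRecord[String, String]) =>
            println(s"partition ${r.partition()} offset ${r.offset()}: ${r.value()}"))
        }
      })
    }
    threads.foreach(_.start())
  }
}

Adding partitions lets you add more such threads (up to one per partition), which is where the extra throughput comes from.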
The more partitions, the better?
Is it better to have as many partitions as possible? Obviously not, because each partition has its own overhead:
First, the client and server need more memory
Since Kafka 0.8.2, the client-side producer has a parameter batch.size, which defaults to 16KB. It buffers messages for each partition, and once a buffer fills up the batch is packaged and sent out. This looks like a performance-friendly design, but because the parameter is per partition, the more partitions there are, the more memory this buffering requires. Suppose you have 10000 partitions; with the default setting, this buffering alone takes up about 157MB of memory. What about the consumer side? Leaving aside the memory needed to hold the fetched data, consider just the thread overhead. If you again assume 10000 partitions and one consumer thread per partition (which in most cases is the configuration that gives the best consumer throughput), the consumer client creates 10000 threads and needs roughly 10000 Sockets to fetch the partition data. The cost of the thread switching alone is not to be underestimated.
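A quick back-of-the-envelope check of that figure (plain arithmetic, not Kafka code): 10000 partitions times the default 16KB batch.size comes to roughly 156 MiB, in line with the ~157MB quoted above.

object BatchMemoryEstimate extends App {
  val batchSizeBytes = 16 * 1024        // producer batch.size default (16KB)
  val partitions     = 10000
  val totalMiB       = partitions.toLong * batchSizeBytes / 1024.0 / 1024.0
  println(f"~$totalMiB%.0f MiB of per-partition batching buffers")  // prints ~156 MiB
}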
The overhead on the server side is not small either. If you read the Kafka source code, you will find that many server-side components, such as the controller and the FetcherManager, maintain partition-level caches in memory, so the more partitions there are, the larger these caches become.
Second, the cost of file handles
Each partition has its own directory on the underlying file system. This directory usually contains two kinds of files: base_offset.log and base_offset.index. Kafka's controller and ReplicaManager keep these file handles open for every broker. Obviously, the more partitions you have, the more file handles must stay open, and you may eventually hit the ulimit -n limit.
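A rough way to see this on a broker is to count the .log/.index files under the log directory. The sketch below assumes a log.dirs location of /tmp/kafka-logs (an assumption, not from the article) and only approximates what the broker actually holds open:

import java.io.File

object OpenFileEstimate extends App {
  val logDir = new File("/tmp/kafka-logs")  // assumed broker log.dirs location
  val segmentFiles = Option(logDir.listFiles()).getOrElse(Array.empty[File])
    .filter(_.isDirectory)                  // one directory per partition
    .flatMap(d => Option(d.listFiles()).getOrElse(Array.empty[File]))
    .count(f => f.getName.endsWith(".log") || f.getName.endsWith(".index"))
  println(s".log/.index files the broker may hold open: $segmentFiles")
}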
Third, it reduces availability
Kafka ensures high availability through its replica mechanism. Each partition keeps several replicas (the replication factor specifies how many), each stored on a different broker. One replica acts as the leader and handles producer and consumer requests; the others act as followers, and the Kafka controller is responsible for keeping them in sync with the leader. If the broker holding a leader dies, the controller detects it and, with the help of ZooKeeper, elects a new leader; this causes a brief window of unavailability, though in most cases it may only be a few milliseconds. But if you have 10000 partitions on 10 brokers, that averages 1000 partitions per broker. When one of those brokers dies, ZooKeeper and the controller must immediately run leader elections for those 1000 partitions. This is bound to take longer than electing leaders for a few partitions, and the cost does not simply add up linearly. If that broker also happens to be the controller, the situation is even worse.
How to determine the number of partitions?
You can follow a simple procedure to estimate the number of partitions: create a topic with only 1 partition, then measure the producer throughput and consumer throughput on that topic. Suppose their values are Tp and Tc respectively, in MB/s. If the total target throughput is Tt, then the number of partitions should be at least max(Tt/Tp, Tt/Tc).
Note: Tp is the producer throughput. Measuring the producer is usually easy, because its logic is very simple: it just sends messages straight to Kafka. Tc is the consumer throughput. Measuring Tc usually depends more on the application, because Tc is determined by what you do with each message after you receive it, so testing it is usually more involved.
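A worked example with made-up numbers: if a single-partition topic measures Tp = 20 MB/s and Tc = 10 MB/s, and the target is Tt = 100 MB/s, the consumer side is the bottleneck and you need at least 10 partitions.

object PartitionCountEstimate extends App {
  val tp = 20.0   // measured producer throughput per partition, MB/s (illustrative)
  val tc = 10.0   // measured consumer throughput per partition, MB/s (illustrative)
  val tt = 100.0  // target throughput for the whole topic, MB/s (illustrative)
  val partitions = math.ceil(math.max(tt / tp, tt / tc)).toInt
  println(s"at least $partitions partitions")  // prints: at least 10 partitions
}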
How does a message know which partition to send to?
Assign according to key value
By default, Kafka chooses the partition based on the key of the message, that is, hash(key) % numPartitions:
def partition(key: Any, numPartitions: Int): Int = { Utils.abs(key.hashCode) % numPartitions }
This ensures that messages with the same key are always routed to the same partition. If the key is null, the producer takes a partition id from its cache or picks one at random. So if you don't specify a key at all, how does Kafka decide which partition the message goes to?
When no key is specified, Kafka picks an essentially random partition for the keyless message and then puts that partition number into a cache for later use; Kafka itself clears this cache periodically (by default every 10 minutes, or whenever topic metadata is requested).
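A minimal producer sketch of key-based routing (Java client from Scala; the broker address, topic name and keys are placeholders, and the exact no-key behavior varies between client versions): records sharing a key always land in the same partition, while keyless records are spread by the client's default strategy.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KeyedProducerDemo extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")  // assumed broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)

  for (i <- 1 to 5) {
    // same key => hashed to the same partition on every send
    val keyed = new ProducerRecord[String, String]("demo-topic", "user-42", s"event-$i")
    val meta  = producer.send(keyed).get()
    println(s"key=user-42 went to partition ${meta.partition()}")

    // no key => partition chosen by the client's default no-key strategy
    producer.send(new ProducerRecord[String, String]("demo-topic", null, s"audit-$i"))
  }
  producer.close()
}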
What is the relationship between the number of consumers and the number of partitions?
A partition of a topic can only be consumed by one consumer thread within the same consumer group, but the reverse is not true: one consumer thread can consume data from multiple partitions. For example, the ConsoleConsumer shipped with Kafka by default uses a single thread to consume data from all partitions.
The partitions are then assigned to the different consumer threads in round-robin style.
In this example, if the topic's partitions, sorted by hashCode, are T1-5, T1-3, T1-0, T1-8, T1-2, T1-1, T1-4, T1-7, T1-6, T1-9, and the consumer thread order is C1-0, C1-1, C2-0, C2-1, then the final partition assignment is:
C1-0 will consume partitions T1-5, T1-2, T1-6
C1-1 will consume partitions T1-3, T1-1, T1-9
C2-0 will consume partitions T1-0, T1-4
C2-1 will consume partitions T1-8, T1-7
Partition assignment across multiple topics works similarly to that of a single topic. Unfortunately, we cannot customize the assignment policy at this point; we can only choose between range and roundrobin via the partition.assignment.strategy parameter.
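The example allocation above can be reproduced with a small standalone sketch of round-robin dealing (illustrative code, not Kafka's own assignor):

object RoundRobinAssignment extends App {
  // partition order and thread order taken from the example above
  val partitions = Seq("T1-5", "T1-3", "T1-0", "T1-8", "T1-2",
                       "T1-1", "T1-4", "T1-7", "T1-6", "T1-9")
  val threads = Seq("C1-0", "C1-1", "C2-0", "C2-1")

  // deal the partitions out to the threads in turn (round-robin)
  val assignment = partitions.zipWithIndex
    .groupBy { case (_, i) => threads(i % threads.size) }
    .map { case (t, ps) => t -> ps.map(_._1) }

  threads.foreach(t => println(s"$t -> ${assignment(t).mkString(", ")}"))
  // C1-0 -> T1-5, T1-2, T1-6
  // C1-1 -> T1-3, T1-1, T1-9
  // C2-0 -> T1-0, T1-4
  // C2-1 -> T1-8, T1-7
}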
This is the end of "are more Kafka partitions always better?". Thank you for reading.