Example Analysis of the Excellent Architecture Design of the Kafka Producer


2025-07-19 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article analyzes, by example, the architecture design of the Kafka producer. The design ideas are practical and worth studying.

Kafka is a high-throughput, distributed publish-subscribe messaging system that is widely used around the world, especially in big data projects. I have read the source code of many open source big data products, and I consider Kafka's to be among the best, which reflects its authors' strong coding skills and architectural design ability.

The core source code of Kafka is divided into two parts: the client and the server. The client is further divided into the producer and the consumer. In my opinion the producer code is the most technically interesting part of the Kafka source, so this article analyzes the architecture design of the Kafka producer. Kafka is developing rapidly and its architecture keeps evolving. The version analyzed here is Kafka 1.0.0, which is relatively mature and stable.

Figure 1 Kafka core module

Overview of the producer workflow

Let's start with the overall flow of the producer.

Figure 2 how Kafka works

As shown in the figure above, the flow has six steps.

Step 1: a message is first wrapped in a ProducerRecord object.

Step 2: the record is serialized. A Kafka message has to travel from the client to the server over the network, so it must be converted to bytes first. Kafka provides default serializers and also supports custom ones (a design worth remembering, because it makes a project easier to extend).

Step 3: after serialization, the message is assigned to a partition. Partitioning requires the cluster metadata. This step is critical, because it decides which partition of which topic on the Kafka server the message will be sent to.

Step 4: the partitioned message is not sent to the server directly. It is placed in a cache inside the producer, where multiple messages are packed into a batch. The default batch size is 16 KB.

Step 5: once the Sender thread has started, it fetches the batches that are ready to be sent from the cache.

Step 6: the Sender thread sends the batches to the server. This design deserves attention. Before Kafka 0.8, the producer sent messages to the server one at a time, and the resulting stream of small network requests hurt performance. When the architecture was later reworked, sending was changed to batch processing and throughput improved considerably. This design is also worth remembering.
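Steps 2 and 3 can be illustrated with a minimal sketch. This is not Kafka's code: the class name `SendPathSketch` and both methods are hypothetical, and `Arrays.hashCode` is only a stand-in assumption for the hash function (Kafka's default partitioner hashes the serialized key with murmur2). The point is that the same key always lands on the same partition.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for the producer's serialize-then-partition steps.
final class SendPathSketch {
    // Step 2: turn the key (or value) into bytes for network transfer.
    static byte[] serialize(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Step 3: map the serialized key to one of the topic's partitions.
    // Arrays.hashCode is an assumption used for brevity; it is not Kafka's hash.
    static int partition(byte[] keyBytes, int numPartitions) {
        int hash = Arrays.hashCode(keyBytes);
        return (hash & 0x7fffffff) % numPartitions; // the mask keeps the result non-negative
    }

    public static void main(String[] args) {
        int numPartitions = 50; // would come from cluster metadata in a real producer
        for (String key : new String[] {"user-1", "user-2", "user-1"}) {
            int p = partition(serialize(key), numPartitions);
            System.out.println(key + " -> partition " + p);
        }
    }
}
```

In a real producer the partition count is read from the cluster metadata, which is why step 3 cannot run before the metadata is available.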

In-depth analysis of producer details

Now for the most technically interesting part of the producer. The overview showed that a partitioned message is placed in a cache. Let's look at what happens inside it. The default size of the cache is 32 MB. It contains an important data structure, batches, which is a key-value structure. The key is a partition of a topic. The value is a queue holding the batches to be sent to that partition. The Sender thread sends these batches to the server.
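The shape of this structure can be sketched as a map from partition to a queue of batches. The following is a simplified, single-threaded model under stated assumptions: `AccumulatorSketch`, `Batch`, `append`, and `drainIfFull` are hypothetical names, the key is a plain string rather than Kafka's own partition type, and only full batches are handed to the sender (the real Sender also sends batches that have waited long enough).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, single-threaded model of the producer cache.
final class AccumulatorSketch {
    // A batch collects serialized records until the size limit is reached.
    static final class Batch {
        final int limit;
        int usedBytes = 0;
        final List<byte[]> records = new ArrayList<>();

        Batch(int limit) { this.limit = limit; }

        boolean tryAppend(byte[] record) {
            // The first record is always accepted, so an oversized record still gets a batch.
            if (!records.isEmpty() && usedBytes + record.length > limit) return false;
            records.add(record);
            usedBytes += record.length;
            return true;
        }

        boolean isFull() { return usedBytes >= limit; }
    }

    final int batchSize;
    // key: topic partition, value: queue of batches waiting for that partition
    final Map<String, Deque<Batch>> batches = new HashMap<>();

    AccumulatorSketch(int batchSize) { this.batchSize = batchSize; }

    void append(String topicPartition, byte[] record) {
        Deque<Batch> queue = batches.computeIfAbsent(topicPartition, k -> new ArrayDeque<>());
        Batch last = queue.peekLast();
        if (last == null || !last.tryAppend(record)) {
            // No open batch, or the open batch has no room: start a new one at the tail.
            Batch fresh = new Batch(batchSize);
            fresh.tryAppend(record);
            queue.addLast(fresh);
        }
    }

    // Sender side in this model: take the oldest batch only if it is full.
    Batch drainIfFull(String topicPartition) {
        Deque<Batch> queue = batches.get(topicPartition);
        if (queue == null) return null;
        Batch first = queue.peekFirst();
        return (first != null && first.isFull()) ? queue.pollFirst() : null;
    }
}
```

New records go to the tail of the queue and the sender takes from the head, so batches for one partition leave in the order they were created.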

Figure 3 producer architecture

01 / Producer advanced design: a custom data structure

The producer stores its batches in the batches object. If it were up to you, which data structure would you choose to hold them?

Kafka's answer is a custom data structure: CopyOnWriteMap. Readers familiar with Java know that java.util.concurrent (JUC) provides CopyOnWriteArrayList, but it has no CopyOnWriteMap. Here is why Kafka designed one.

1. The stored information is key-value data. The key is the partition, and the value is the set of batches destined for that partition (there may be several, so a queue is used). Key-value data calls for a Map.

2. The Kafka producer works under high concurrency, with a large number of messages flowing into this structure. It therefore has to be thread-safe, which rules out structures such as HashMap.

3. The structure has to support a read-mostly workload. Reads dominate because every message looks up the value by its key: with 10 million messages, the batches object is read 10 million times. Writes are rare because, for example, if the producer sends to a topic with 50 partitions, only 50 key-value entries ever need to be written into batches. (To be clear, the 10 million records themselves are written into the batches held in the queues, not into the batches map, so in this scenario the map receives no more than 50 writes.)

From points 2 and 3 it follows that Kafka needs a thread-safe Map that is optimized for many reads and few writes. The JDK, however, has no ready-made copy-on-write Map. The closest match is CopyOnWriteArrayList, but it is not a Map, so Kafka designed CopyOnWriteMap in imitation of it. Separating reads from writes gives both thread safety and good performance for a read-mostly workload.
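The copy-on-write idea can be sketched in a few lines. This is a simplified illustration rather than Kafka's actual class: the name `CopyOnWriteMapSketch` is hypothetical and only three operations are shown. Reads take no lock and see an immutable snapshot, while each write copies the map and publishes the copy through a volatile field.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Simplified copy-on-write map: lock-free reads, copying writes.
final class CopyOnWriteMapSketch<K, V> {
    // volatile so that readers always see the most recently published snapshot
    private volatile Map<K, V> map = Collections.emptyMap();

    // Read path: no lock, just a lookup in the current immutable snapshot.
    public V get(K key) {
        return map.get(key);
    }

    // Write path: copy, modify the copy, publish it. Writers are serialized.
    public synchronized V put(K key, V value) {
        Map<K, V> copy = new HashMap<>(map);
        V previous = copy.put(key, value);
        map = Collections.unmodifiableMap(copy);
        return previous;
    }

    // Returns the existing value, or null if this call inserted the new one.
    public synchronized V putIfAbsent(K key, V value) {
        V existing = map.get(key);
        return existing != null ? existing : put(key, value);
    }

    public int size() {
        return map.size();
    }
}
```

Each write costs a full copy of the map, which is acceptable only because writes are rare (about one per partition), while the far more frequent reads pay nothing for synchronization.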

An efficient data structure underpins the producer's performance. Readers who are not familiar with CopyOnWriteArrayList can look it up. I also recommend reading the producer source code that inserts data into batches. To keep insertion fast it is multithreaded, and to keep it thread-safe it uses techniques such as segmented locking. The code is well worth reading.
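One way to picture the segmented locking mentioned above is to lock only the queue of the partition being written, and to allocate new batches outside the lock. The sketch below is a hedged model, not the Kafka source: `LockedAppendSketch` and its members are hypothetical names, a batch only counts bytes, and `ConcurrentHashMap` stands in for CopyOnWriteMap to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical model of a thread-safe append with one lock per partition queue.
final class LockedAppendSketch {
    static final int BATCH_SIZE = 16 * 1024;

    // In this model a batch only counts the bytes it holds.
    static final class Batch {
        int usedBytes;

        boolean tryAppend(int recordBytes) {
            if (usedBytes > 0 && usedBytes + recordBytes > BATCH_SIZE) return false;
            usedBytes += recordBytes;
            return true;
        }
    }

    // Assumption: ConcurrentHashMap replaces Kafka's CopyOnWriteMap here.
    final ConcurrentMap<Integer, Deque<Batch>> batches = new ConcurrentHashMap<>();

    void append(int partition, int recordBytes) {
        Deque<Batch> queue = batches.computeIfAbsent(partition, p -> new ArrayDeque<>());
        // First attempt: lock only this partition's queue, not the whole map.
        synchronized (queue) {
            Batch last = queue.peekLast();
            if (last != null && last.tryAppend(recordBytes)) return;
        }
        // Allocate outside the lock so other threads are not blocked meanwhile.
        Batch fresh = new Batch();
        synchronized (queue) {
            // Re-check: another thread may have added a batch in between.
            Batch last = queue.peekLast();
            if (last != null && last.tryAppend(recordBytes)) return; // fresh is dropped
            fresh.tryAppend(recordBytes);
            queue.addLast(fresh);
        }
    }

    int totalBytes(int partition) {
        Deque<Batch> queue = batches.get(partition);
        if (queue == null) return 0;
        synchronized (queue) {
            int sum = 0;
            for (Batch b : queue) sum += b.usedBytes;
            return sum;
        }
    }
}
```

Because each partition has its own lock, threads writing to different partitions never wait for each other, and the re-check inside the second synchronized block prevents two threads from both adding a new batch for the same record.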

02 / Producer advanced design: the memory pool

We just saw that the batches structure holds batches. The default size of a batch is 16 KB, and the whole cache is 32 MB. The producer has to obtain memory for every batch. Normally, once a batch has been sent, its 16 KB of memory would be left for the GC to reclaim. Done that way, however, Full GC could be triggered frequently and hurt producer performance. The cache therefore contains a memory pool, similar to the database connection pools we commonly use. When a 16 KB block is no longer needed, its data is cleared and the block goes back into the pool, where the next batch can take it directly. This greatly reduces GC frequency and keeps the producer stable and efficient (GC problems in Java are a headache, so this design is also worth remembering).
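With the default sizes, the cache can hold 32 MB / 16 KB = 2048 batches at most. The pooling idea itself can be sketched as follows. This is a minimal model under stated assumptions: `BufferPoolSketch`, `allocate`, `release`, and `pooledBuffers` are hypothetical names, and the sketch throws when memory runs out instead of making the caller wait.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of a pool that recycles fixed-size buffers.
final class BufferPoolSketch {
    private final int poolableSize;   // size of a recyclable buffer, e.g. 16 KB
    private long availableBytes;      // memory not yet handed out
    private final Deque<ByteBuffer> free = new ArrayDeque<>();

    BufferPoolSketch(long totalBytes, int poolableSize) {
        this.availableBytes = totalBytes;
        this.poolableSize = poolableSize;
    }

    synchronized ByteBuffer allocate(int size) {
        // Fast path: reuse a pooled buffer of the standard size.
        if (size == poolableSize && !free.isEmpty()) {
            return free.pollFirst();
        }
        // Simplification: fail instead of waiting for memory to be released.
        if (size > availableBytes) {
            throw new IllegalStateException("pool exhausted");
        }
        availableBytes -= size;
        return ByteBuffer.allocate(size);
    }

    synchronized void release(ByteBuffer buffer) {
        if (buffer.capacity() == poolableSize) {
            buffer.clear();              // reset position and limit for the next user
            free.addLast(buffer);        // keep the buffer instead of leaving it to the GC
        } else {
            availableBytes += buffer.capacity(); // odd sizes are not pooled
        }
    }

    synchronized int pooledBuffers() {
        return free.size();
    }
}
```

Only buffers of the standard batch size are recycled. A buffer of any other size is dropped on release, so just its byte count returns to the budget and the memory itself is left to the GC.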

That concludes this example analysis of the architecture design of the Kafka producer. Several of these ideas, such as batching, a read-mostly copy-on-write map, and a memory pool, are likely to come up in everyday work, and I hope this article helps you apply them.
