
Why is Kafka writing so fast?

2025-01-17 Update

Why are Kafka writes so fast? Many people who are new to Kafka cannot answer this question, so this article summarizes the reasons; I hope that after reading it you will be able to answer it yourself.

Kafka messages are saved or cached on disk, and it is generally assumed that reading and writing data on disk degrades performance because seeking takes time; yet in fact, one of Kafka's defining characteristics is high throughput.

Even on ordinary servers, Kafka can easily handle a very large number of write requests per second, more than most message middleware can sustain. This characteristic makes Kafka widely used in massive-data scenarios such as log processing.

The following analyzes why Kafka is so fast, looking at data writing and data reading in turn.

Data writing

Kafka writes every message it receives to the hard disk and never discards data. To optimize write speed, Kafka uses two techniques: sequential writes and MMAP (Memory Mapped Files).

Sequential write

How fast a disk reads and writes depends on how you use it, that is, whether the access pattern is sequential or random. With sequential access, the read/write speed of a disk is comparable to that of memory.

Because a hard disk is a mechanical device, every read or write goes through seek -> write, and seeking is a "mechanical action" that takes by far the most time.

So hard disks hate random I/O and love sequential I/O. To improve read and write speed, Kafka uses sequential I/O.

Linux also applies many optimizations to disk reads and writes, including read-ahead, write-behind, and the disk cache.
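
As a rough illustration of the gap between the two access patterns, here is a minimal, hypothetical Java sketch (not from the original article) that writes the same amount of data sequentially and then at random positions. On a real disk the random version is much slower, although the operating system's page cache can mask the difference for small files.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

public class SeqVsRandomWrite {
    static final int BLOCK = 4096;
    static final int COUNT = 50_000;

    public static void main(String[] args) throws IOException {
        byte[] block = new byte[BLOCK];

        long t0 = System.nanoTime();
        try (RandomAccessFile f = new RandomAccessFile("seq.dat", "rw")) {
            for (int i = 0; i < COUNT; i++) {
                f.write(block);                              // always writes at the current (end) position
            }
        }
        long seqMs = (System.nanoTime() - t0) / 1_000_000;

        Random rnd = new Random(42);
        long t1 = System.nanoTime();
        try (RandomAccessFile f = new RandomAccessFile("rnd.dat", "rw")) {
            f.setLength((long) BLOCK * COUNT);
            for (int i = 0; i < COUNT; i++) {
                f.seek((long) rnd.nextInt(COUNT) * BLOCK);   // jump to a random block before each write
                f.write(block);
            }
        }
        long rndMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.println("sequential: " + seqMs + " ms, random: " + rndMs + " ms");
    }
}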

If you kept all of this in memory instead, there would be two problems: the memory overhead of Java objects is high, and as the amount of data on the heap grows, Java GC pauses become very long.

Using disk operations has the following benefits:

Disk sequential read and write speed exceeds memory random read and write.

JVM garbage collection is inefficient and memory-hungry; using the disk avoids this problem.

After the system starts cold, the disk cache is still available.

Kafka writes data as follows: each Partition is actually a file, and after receiving a message, Kafka appends the data to the end of that file.
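
To make the append-only idea concrete, here is a minimal, hypothetical Java sketch (not Kafka's actual code): every message goes at the end of a per-partition file, and the position at which it landed is returned. Note that real Kafka offsets are logical record numbers maintained per partition, not raw byte positions; byte positions are used here only to keep the sketch short.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical append-only "partition" file: every message is written at the end,
// nothing is ever updated in place.
public class AppendOnlyLog {
    private final FileChannel channel;

    public AppendOnlyLog(Path file) throws IOException {
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Returns the position in the file at which the message was written.
    public long append(String message) throws IOException {
        long position = channel.size();
        channel.write(ByteBuffer.wrap(message.getBytes(StandardCharsets.UTF_8)));
        return position;
    }

    public static void main(String[] args) throws IOException {
        AppendOnlyLog log = new AppendOnlyLog(Path.of("partition-0.log"));
        System.out.println("written at " + log.append("hello\n"));
        System.out.println("written at " + log.append("world\n"));
    }
}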

This approach has one drawback: there is no way to delete data in place, so Kafka does not delete it. It keeps all the data, and each Consumer maintains an Offset for each Topic (per Partition) to record which message it has read up to.

Two consumers:

Consumer1 has two Offsets, corresponding to Partition0 and Partition1 (assuming each Topic has one Partition).

Consumer2 has one Offset, corresponding to Partition2.

The Offset is saved by the client SDK; Kafka's Broker is completely unaware of it.

Normally the SDK saves it in ZooKeeper, which is why the Consumer has to be given the ZooKeeper address.
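
For illustration, this is roughly what the configuration of the old (pre-0.9) high-level consumer looked like: because the offset lived in ZooKeeper, the consumer was pointed at the ZooKeeper ensemble rather than at the brokers. The host names and group name below are placeholders, and this sketch is an assumption about typical usage rather than code from the article.

import java.util.Properties;

public class OldConsumerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181"); // where offsets were kept
        props.put("group.id", "demo-group");                          // consumer group identity
        props.put("auto.commit.interval.ms", "1000");                 // how often offsets were committed
        System.out.println(props);                                    // sketch only: no consumer is started
    }
}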

The hard disk would certainly fill up if nothing were ever deleted, so Kafka provides two strategies for deleting data:

Based on time

Based on Partition file size

For specific configuration, please refer to its configuration documentation.
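
As an illustration only, the broker-side retention settings look roughly like this in server.properties; the property names are Kafka's, but the values are examples and the defaults vary by version:

log.cleanup.policy=delete
# time-based: delete log segments older than 7 days
log.retention.hours=168
# size-based: limit the bytes retained per partition (1 GB here)
log.retention.bytes=1073741824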

Memory Mapped Files

Even with sequential writes, the hard disk cannot catch up with memory in access speed. Therefore, Kafka does not write data to the hard disk in real time; instead it makes full use of the paged memory management of modern operating systems to improve I/O efficiency.

Memory Mapped Files (hereinafter mmap), also translated as memory-mapped files, can generally map a data file of about 20 GB on a 64-bit operating system. The working principle is to use the operating system's pages to map a file directly onto physical memory.

Once the mapping is complete, your operations on that memory are synchronized to the hard disk by the operating system at an appropriate time.

Through mmap, the process reads and writes the hard disk as if it were memory (virtual memory, of course) without having to worry about how much physical memory there is; virtual memory takes care of that for us.

In this way you get a large improvement in I/O, because the overhead of copying between user space and kernel space is saved. (Calling read on a file first copies the data into a kernel-space buffer and then copies it again into a user-space buffer.)

But there is also an obvious defect: reliability. Data written through mmap is not really written to the hard disk; the operating system only writes it to disk when the program actively calls flush.

Kafka provides a parameter, producer.type, to control whether it flushes actively:

If Kafka flushes immediately after writing to mmap and only then returns to the Producer, this is called sync.

If Kafka returns to the Producer immediately after writing to mmap, without calling flush, this is called async.
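
Here is a minimal Java sketch of the mmap idea using MappedByteBuffer (an illustration under my own assumptions, not Kafka's actual implementation): the file is mapped into memory, written like a byte array, and flushed explicitly, which corresponds to the sync behaviour described above; skipping the force() call corresponds to async.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapWriteDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("mmap-demo.log"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {

            // Map the first 4 KB of the file directly into memory.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);

            buf.put("hello kafka".getBytes(StandardCharsets.UTF_8)); // lands in the page cache

            // "Sync" behaviour: flush to disk before acknowledging the write.
            // Without this call ("async"), the OS flushes the dirty pages later,
            // which is faster but can lose data if the machine crashes first.
            buf.force();
        }
    }
}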

Data reading

What optimizations did Kafka make when reading the disk?

Implementation of Zero Copy based on Sendfile

In traditional mode, when a file needs to be transferred, the details of the process are as follows:

The read system call is made, and the file data is copied into a kernel buffer.

read returns, and the file data is copied from the kernel buffer into the user buffer.

The write system call copies the file data from the user buffer into the kernel's socket-related buffer.

The data is copied from the socket buffer to the protocol engine.

The above is the traditional read/write approach to sending a file over the network. As you can see, in this process the file data goes through four copy operations:

Hard disk -> kernel buffer -> user buffer -> socket-related buffer -> protocol engine
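
In Java, that traditional path corresponds to the familiar read-then-write loop. The sketch below (a hypothetical helper, not from the article) makes the user-space buffer and the two copies per iteration explicit.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

public class TraditionalCopy {
    // Each loop iteration copies kernel buffer -> user buffer (read) and
    // user buffer -> socket buffer (write), plus two system-call context switches.
    public static void send(String path, Socket socket) throws IOException {
        byte[] buf = new byte[8192];                 // user-space buffer
        try (FileInputStream in = new FileInputStream(path);
             OutputStream out = socket.getOutputStream()) {
            int n;
            while ((n = in.read(buf)) != -1) {       // kernel buffer -> user buffer
                out.write(buf, 0, n);                // user buffer -> socket buffer
            }
        }
    }
}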

The sendfile system call provides a way to reduce the number of copies described above and improve the performance of file transfer.

The sendfile system call was introduced in kernel version 2.1 to simplify data transmission over the network and between two local files.

The introduction of Sendfile reduces not only data replication but also context switching.

sendfile(socket, file, len)

The running process is as follows:

The sendfile system call is made, and the file data is copied into a kernel buffer.

Then the data is copied from the kernel buffer into the kernel's socket-related buffer.

Finally, the data is copied from the socket-related buffer to the protocol engine.

Compared with the traditional read/write approach, the sendfile introduced in the 2.1 kernel eliminates the copies from the kernel buffer to the user buffer and from the user buffer to the socket-related buffer.

After kernel version 2.4, the handling of file descriptors was changed and sendfile was implemented in an even simpler way, reducing the copy operations once more.

Apache, Nginx, Lighttpd and other Web servers all have a sendfile-related option, and enabling sendfile can greatly improve file transfer performance.

Kafka stores all messages in files. When consumers need the data, Kafka sends the files to them directly: mmap is used as the way to read and write the files, and sendfile is used to send them out.
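
On the JVM, sendfile is typically reached through FileChannel.transferTo. The sketch below is a hypothetical stand-alone example (the host, port, and file name are placeholders): the file bytes go from the page cache to the socket without passing through a user-space buffer.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = FileChannel.open(Path.of("partition-0.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {

            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {                               // transferTo may send less than requested
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}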

Batch compression

In many cases the bottleneck of the system is not CPU or disk but network I/O, especially for data pipelines that need to send messages between data centers over a WAN.

Data compression consumes a small amount of CPU, but for Kafka, network I/O is the bigger concern:

Compressing each message individually yields a relatively low compression ratio, so Kafka uses batch compression, meaning that multiple messages are compressed together rather than one at a time.

Kafka allows recursive message sets: a batch of messages can be transmitted in compressed form and kept in that compressed format in the log until it is decompressed by the consumer.

Kafka supports a variety of compression protocols, including Gzip and Snappy.
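
With the modern Java producer client, batch compression is just configuration. The sketch below is an illustration (the broker address and topic name are placeholders, and it assumes the kafka-clients library is on the classpath): whole batches are compressed with Snappy before they go over the network.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompressedProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("compression.type", "snappy");  // the whole batch is compressed, not each message
        props.put("batch.size", "65536");         // bytes collected into one batch before sending
        props.put("linger.ms", "20");             // wait briefly so batches have time to fill

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
        }
    }
}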

The secret of Kafka's speed is that it turns all messages into batched files, applies reasonable batch compression to reduce the cost of network I/O, and speeds up disk I/O through mmap.

When writing data, it appends to the end of a single Partition file, so the write speed is optimal; when reading data, it sends the data straight out with sendfile.

Having read the above, have you grasped why Kafka writes so fast? If you want to learn more, you are welcome to follow the Internet Technology channel. Thank you for reading!
