Why Kafka Is Fast and Achieves High Throughput
Kafka is a ubiquitous piece of message middleware in the big data field. It is widely used in enterprises' real-time data pipelines and helps them build their own stream-processing applications. Although Kafka stores its data on disk, it delivers high performance, high throughput, and low latency, with throughput on the order of tens of thousands to tens of millions of messages per second. Yet many people who have used Kafka are stumped when asked why it is so fast and can sustain such throughput; most can only name a few scattered points. This article briefly explains the reasons.
I. Sequential Reads and Writes
It is well known that Kafka persists message records to the local disk. Disk I/O is generally assumed to be slow, so people often question how Kafka's performance can be guaranteed. In fact, for both memory and disk, what determines speed is the access pattern: disk access can be sequential or random, and so can memory access. Random disk I/O is indeed very slow, but sequential disk I/O is very fast, typically about three orders of magnitude faster than random disk I/O. In some cases sequential disk access is even faster than random memory access, as a well-known performance comparison published in ACM Queue illustrates.
Sequential access is the most predictable way to use a disk, and operating systems optimize heavily for it. Kafka exploits this by only ever appending messages to the end of the log file on local disk, never writing at random positions, which gives Kafka a large boost in write throughput.
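To make the append-only pattern concrete, here is a minimal Java sketch (the names are hypothetical, and this illustrates the access pattern rather than Kafka's actual log implementation). It appends length-prefixed records to the end of a file, so the kernel always sees a purely sequential write stream:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal append-only log: every write lands at the current end of the
// file, so the disk only ever sees a sequential stream of writes.
public class AppendOnlyLog implements AutoCloseable {
    private final FileChannel channel;

    public AppendOnlyLog(Path file) throws IOException {
        // APPEND guarantees each write goes to the end of the file.
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    public void append(byte[] record) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(4 + record.length);
        buf.putInt(record.length);  // length prefix, as log formats typically use
        buf.put(record);
        buf.flip();
        while (buf.hasRemaining()) {
            channel.write(buf);     // sequential write, buffered by the OS page cache
        }
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        try (AppendOnlyLog log = new AppendOnlyLog(Path.of("segment-0.log"))) {
            log.append("hello kafka".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```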
II. Page Cache
To optimize read and write performance, Kafka takes advantage of the operating system's own page cache: it uses the operating system's memory rather than the JVM heap. The benefits are:
Avoiding object overhead: Java objects on the heap carry significant memory overhead, often twice the size of the stored data or more.
Avoiding GC problems: as the amount of data in the JVM grows, garbage collection becomes slower and more complicated, whereas the system cache involves no GC at all.
Compared with JVM data structures or an in-process in-memory cache, using the operating system's page cache is simpler and more reliable. First, cache utilization at the operating-system level is higher, because the page cache stores compact byte data rather than individual objects. Second, the operating system itself heavily optimizes the page cache, providing mechanisms such as write-behind, read-ahead, and flushing. Finally, the page cache survives a restart of the service process, so an in-process cache does not have to be rebuilt.
Because Kafka's reads and writes go through the operating system's page cache, its I/O is essentially memory-based, which greatly improves read and write speed.
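This behavior is easy to observe from Java. In the sketch below (the file name is a placeholder), FileChannel.write returns as soon as the bytes reach the kernel page cache, while force() is the explicit fsync that waits for the physical disk; Kafka's default configuration leaves flushing to the operating system rather than forcing it on every write:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PageCacheDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("pagecache-demo.log"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer data = ByteBuffer.wrap(
                    "buffered by the kernel".getBytes(StandardCharsets.UTF_8));

            // Returns once the bytes are in the kernel page cache; the OS
            // writes them back to disk later (write-behind).
            ch.write(data);

            // Only this call forces the data onto the physical device (fsync).
            // By default Kafka leaves this to the OS instead of syncing per write.
            ch.force(true);
        }
    }
}
```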
III. Zero Copy
This concerns the optimization Kafka applies on the consumer side using the "zero-copy" mechanism of the Linux operating system. First, consider the usual path data takes from a file to a socket connection:
1. The operating system reads the data from disk into the page cache in kernel space.
2. The application reads the data from the page cache into a user-space buffer.
3. The application writes the data from its user-space buffer into a socket buffer in kernel space.
4. The operating system copies the data from the socket buffer to the NIC buffer, from which it is sent over the network.
This path involves four copy operations and two system calls, with their accompanying context switches, and is quite inefficient. The "zero-copy" mechanism of the Linux operating system, exposed through the sendfile system call, lets the operating system send data from the page cache directly to the network: only the final copy into the NIC buffer remains, and the redundant copies through user space are eliminated.
Through this "zero copy" mechanism, Page Cache combined with the sendfile method, the performance of the consumer side of Kafka is also greatly improved. This is why sometimes when consumers continue to consume data, we do not see that the disk io is relatively high. At this time, it is the operating system cache that provides data.
IV. Partitioning and Segmentation
Kafka stores messages by topic, and the data within a topic is distributed across different broker nodes by partition. Each partition corresponds to a directory on the operating system, and within a partition the data is actually stored segment by segment. This closely follows the partition-and-bucket design idea common in distributed systems.
Through this partitioning and segmentation, Kafka's messages are in effect distributed across many small segment files, and every file operation acts directly on a single segment. To optimize queries further, Kafka by default builds an index file for each segment's data file: the .index file seen on the file system. This partition + segment + index design improves not only the efficiency of reading data but also the parallelism of data operations.
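To illustrate the role of the index (a simplified in-memory model, not Kafka's actual .index file format), the sketch below keeps sparse (offset, file position) entries for a segment and binary-searches them to find where to start scanning for a requested offset:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a sparse segment index: one (offset, filePosition)
// entry every N records, searched to locate a fetch start point.
public class SparseSegmentIndex {
    private record Entry(long offset, long position) {}

    private final List<Entry> entries = new ArrayList<>();

    // Entries must be appended in increasing offset order.
    public void add(long offset, long position) {
        entries.add(new Entry(offset, position));
    }

    // Returns the file position of the last indexed entry whose offset
    // is <= target; scanning forward from there finds the exact record.
    public long lookup(long target) {
        int lo = 0, hi = entries.size() - 1, best = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (entries.get(mid).offset() <= target) {
                best = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return entries.get(best).position();
    }

    public static void main(String[] args) {
        SparseSegmentIndex idx = new SparseSegmentIndex();
        idx.add(0, 0);        // one entry every 100 records, say
        idx.add(100, 4096);
        idx.add(200, 8192);
        System.out.println(idx.lookup(150)); // 4096: start scanning there
    }
}
```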
Summary
Sequential reads and writes, the page cache, zero copy, and the partition-plus-segment design, together with the index optimization and the fact that Kafka reads and writes data in batches rather than one record at a time, are what give Kafka its high performance, high throughput, and low latency. They also turn the high-capacity disk storage Kafka relies on into an advantage rather than a liability.
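The batching mentioned above is visible in the standard Kafka producer configuration. For example (the broker address and sizes are placeholder values, not recommendations):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Accumulate up to 64 KB per partition before sending...
        props.put("batch.size", 65536);
        // ...and wait up to 10 ms for a batch to fill.
        props.put("linger.ms", 10);
        // Compress whole batches, which works well on batched data.
        props.put("compression.type", "lz4");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("demo-topic", "key-" + i, "value-" + i));
            }
        }
    }
}
```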