Why is Kafka so fast? Many people who are new to Kafka find this hard to answer. This article summarizes the reasons, and I hope that after reading it you can answer the question yourself.
Whether Kafka is used as an MQ or as a storage layer, it does no more than two things: store the data produced by the Producer on the Broker, and let the Consumer read data from the Broker.
So Kafka's speed shows up in both reading and writing. Let's go through the reasons why Kafka is fast.
Using Partitions to achieve parallel processing
We all know that Kafka is a Pub-Sub messaging system: both publishing and subscribing require specifying a Topic.
A Topic is only a logical concept. Each Topic contains one or more Partitions, and different Partitions can be located on different nodes.
On the one hand, because different Partitions can be located on different machines, Kafka can take full advantage of the cluster and process in parallel across machines.
On the other hand, because each Partition physically corresponds to a folder, even when multiple Partitions are located on the same node you can configure them on different disks, achieving parallelism across disks and making full use of multiple drives.
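For illustration, a topic with several Partitions can be created with the Kafka AdminClient. This is only a sketch; the broker address, topic name, partition count, and replication factor below are arbitrary placeholders:

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions with replication factor 3: reads and writes can then proceed
            // in parallel across the brokers (and disks) that host these partitions.
            admin.createTopics(Collections.singleton(new NewTopic("demo-topic", 6, (short) 3))).all().get();
        }
    }
}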
If work can be processed in parallel, speed certainly improves: multiple workers are faster than one. But can you really write to different disks in parallel? Is the speed of reading and writing on a disk something you can control? Before answering, let's briefly go over disk I/O.
What limits hard disk performance? How should a system be designed around the characteristics of disk I/O?
The main internal components of a hard disk are the platters, the actuator arm, the read/write heads, and the spindle motor. The data actually lives on the platters, and reading and writing are mainly done by the heads mounted on the actuator arm.
In operation, the spindle spins the platters, and the actuator arm positions the heads over the platter surface to read and write.
The physical structure of the disk is shown in the following figure:
Because a single platter has limited capacity, a typical hard disk has two or more platters. Each platter has two surfaces on which information can be recorded, so each platter corresponds to two heads.
A platter is divided into many wedge-shaped areas called sectors. The concentric circles of different radii centered on the platter are called tracks, and the set of tracks with the same radius across different platters forms a cylinder.
Both tracks and cylinders describe circles of different radii, and in many contexts the terms track and cylinder are used interchangeably.
The vertical view of the disk is shown in the following figure:
The key factor in disk performance is the disk service time, i.e., the time the disk takes to complete one I/O request, which consists of seek time, rotational latency, and data transfer time.
A mechanical hard disk has very good sequential read/write performance but very poor random read/write performance, mainly because it takes time for the head to move to the correct track. Under random access the head must move constantly, and time is wasted on seeking, so performance is low. The key metrics for measuring a disk are IOPS and throughput.
Many open-source frameworks, including Kafka and HBase, use append-only writes to turn random I/O into sequential I/O as much as possible, reducing seek time and rotational latency and maximizing IOPS.
If you are interested, take a look at the material on disk I/O [1]. Disk read/write speed depends on how you use the disk, that is, whether you read and write sequentially or randomly.
Writing to disk sequentially
Each partition in Kafka is an ordered, immutable sequence of messages, and new messages are constantly appended to the end of the Partition. This is sequential writing.
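To make the append-only pattern concrete, here is a minimal sketch (not Kafka's actual log implementation) that keeps appending records to the end of a file with Java NIO; the directory name, file name, and record format are made up for illustration:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendOnlyLog {
    public static void main(String[] args) throws IOException {
        Path dir = Path.of("demo-topic-0");                       // one directory per partition
        Files.createDirectories(dir);
        Path segment = dir.resolve("00000000000000000000.log");   // segment-style file name
        try (FileChannel ch = FileChannel.open(segment,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            // Every record goes to the end of the file, so the disk head barely has to move.
            ch.write(ByteBuffer.wrap("message-1\n".getBytes(StandardCharsets.UTF_8)));
            ch.write(ByteBuffer.wrap("message-2\n".getBytes(StandardCharsets.UTF_8)));
        }
    }
}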
A long time ago someone ran a benchmark: 2 million writes per second, on three cheap machines.
http://ifeve.com/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines/
Because disk space is limited, it is impossible to keep all data forever. In fact, as a messaging system, Kafka does not need to keep all data, so old data has to be deleted.
Also because of sequential writing, when Kafka applies its various retention policies to delete data, it does not modify files in a read-then-rewrite fashion; instead it splits each Partition into multiple Segments.
Each Segment corresponds to a physical file, and data in a Partition is deleted by deleting entire Segment files.
This way of cleaning up old data also avoids random writes to files.
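As an illustration of the idea only (this is not Kafka's retention code), the sketch below removes the oldest data by deleting a whole segment file rather than rewriting any existing file; the directory layout and the assumption that file names encode the base offset are hypothetical:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class SegmentCleaner {
    // Delete the oldest segment file in a partition directory: old data disappears
    // file by file, so no random writes are ever issued against existing files.
    static void deleteOldestSegment(Path partitionDir) throws IOException {
        try (Stream<Path> files = Files.list(partitionDir)) {
            files.filter(p -> p.toString().endsWith(".log"))
                 .min(Comparator.comparing((Path p) -> p.getFileName().toString())) // lowest base offset = oldest
                 .ifPresent(oldest -> {
                     try {
                         Files.delete(oldest);
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        }
    }
}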
Make full use of Page Cache
Linux introduces a Cache layer to improve the performance of disk access: it caches part of the data on disk in memory.
When a request for data arrives, if the data is in the Cache and up to date, it is handed to the user program directly, skipping the operation on the underlying disk and improving performance. The Cache layer is also one of the main reasons disk IOPS can exceed 20,000.
In the Linux implementation, the file cache is divided into two levels: the Page Cache and the Buffer Cache, where each Page Cache contains several Buffer Caches.
The Page Cache mainly caches file data for the file system, especially when a process reads from or writes to files.
The Buffer Cache mainly caches blocks when the system reads and writes block devices.
Benefits of using Page Cache:
The I/O Scheduler assembles consecutive small writes into large physical writes, which improves performance.
The I/O Scheduler also tries to reorder some writes to reduce the time spent moving the disk head.
It makes full use of all free memory (non-JVM memory); using an application-level cache (i.e., JVM heap memory) instead would increase the GC burden.
Reads can be served directly from the Page Cache. If consumption keeps pace with production, data does not even need to be exchanged through the physical disk (it flows directly through the Page Cache).
If the process restarts, any cache inside the JVM is lost, but the Page Cache remains available.
After the Broker receives data, writing it to disk only writes it into the Page Cache; there is no guarantee the data has actually reached the disk. From this point of view, if the machine crashes, data sitting in the Page Cache that has not yet been written to disk may be lost.
However, this kind of loss only happens when the operating system itself stops working, for example in a power failure, and that scenario can be handled by Kafka's Replication mechanism.
Forcing data to be flushed from the Page Cache to disk to guarantee durability in that case would instead degrade performance.
For this reason, although Kafka provides two parameters, flush.messages and flush.ms, to force data in the Page Cache to be flushed to disk, Kafka does not recommend using them.
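For completeness, here is a sketch of how those topic-level overrides could be set with the Kafka AdminClient; the broker address, topic name, and values are placeholders, and, as noted above, Kafka does not recommend forcing flushes:

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class FlushConfigExample {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "demo-topic");
            Collection<AlterConfigOp> ops = Arrays.asList(
                new AlterConfigOp(new ConfigEntry("flush.messages", "10000"), AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("flush.ms", "1000"), AlterConfigOp.OpType.SET));
            // Force a flush every 10000 messages or every second: safer against power loss, but slower.
            admin.incrementalAlterConfigs(Collections.singletonMap(topic, ops)).all().get();
        }
    }
}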
Zero copy technology
In Kafka, a large amount of network data is persisted to disk (Producer to Broker) and disk files are sent over the network (Broker to Consumer). The performance of this process directly affects the overall throughput of Kafka.
The core of the operating system is the kernel, which is independent of ordinary applications and can access protected memory space as well as the underlying hardware devices.
In order to prevent user processes from directly operating the kernel and ensure kernel security, the operating system divides virtual memory into two parts, one is kernel space (Kernel-space) and the other is user space (User-space).
In a traditional Linux system, the standard I/O interfaces (such as read and write) are based on data copying: an I/O operation copies data between a buffer in kernel address space and a buffer in user address space, which is why standard I/O is also called buffered I/O.
The advantage is that if the requested data is already in the kernel's cache, the actual I/O operation can be skipped; the downside is that the copying itself costs CPU cycles.
Let's simplify the production and consumption of Kafka into the following two processes [2]:
Network data persisted to disk (Producer to Broker)
Disk files are sent over the network (Broker to Consumer)
① Network data persisted to disk (Producer to Broker)
In traditional mode, data transfer from network to file requires 4 data copies, 4 context switches and 2 system calls.
data = socket.read()        // read network data
File file = new File()
file.write(data)            // persist to disk
file.flush()
Four copies of the data actually take place in this process:
First, the network data is copied into the kernel-space Socket Buffer via DMA copy.
Then the application reads the kernel-space Buffer data into user space (CPU copy).
Next, the user program copies the user-space Buffer back into kernel space (CPU copy).
Finally, the data is copied to the disk file via DMA copy.
DMA (Direct Memory Access) is a hardware mechanism that allows bidirectional data transfer between peripherals and system memory without CPU involvement.
Using DMA frees the CPU from the actual data transfer, which greatly improves system throughput.
At the same time, it is accompanied by four context switches, as shown in the following figure:
Writing data to disk is usually not done in real time, and the persistence of Kafka producer data is no exception. Kafka does not write data to the hard disk in real time; it makes full use of the paged memory management of modern operating systems to improve I/O efficiency, which is exactly the Page Cache discussed in the previous section.
For Kafka, the data produced by the Producer is stored on the Broker. This process reads network data into the socket buffer, and that data can in fact be written to disk directly in kernel space:
there is no need to read the socket buffer's network data into an application-process buffer. (Here the application process is simply the Broker, which receives the producer's data and persists it.)
In this special scenario, where network data arriving in the socket buffer needs no intermediate processing by the application and can be persisted directly, mmap memory file mapping can be used.
Memory Mapped Files, abbreviated mmap and also known as MMFile, map the address of the kernel read buffer into a user-space buffer.
As a result, the kernel buffer is shared with application memory, eliminating the copy from the kernel read buffer to the user buffer.
It works by using the operating system's pages to map a file directly onto physical memory; once the mapping is established, operations on that memory are synchronized to the hard disk.
This yields a large I/O improvement because the overhead of copying between user space and kernel space is saved.
mmap also has an obvious drawback: it is unreliable. Data written to mmap has not really been written to the hard disk; the operating system only writes it to disk when the program actively calls Flush.
Kafka provides a parameter, producer.type, to control whether it flushes actively: if Kafka calls Flush immediately after writing to mmap and only then returns to the Producer, this is called sync;
if it returns to the Producer immediately after writing to mmap without calling Flush, this is called async. The default is sync.
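To illustrate memory-mapped writes and the role of the explicit flush (this is only a sketch, not Kafka's source code; the file name and mapping size are made up):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapWriteExample {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("mmap-demo.log"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map 4 KB of the file into memory; writes land in the page cache, not directly on disk.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.put("hello kafka".getBytes(StandardCharsets.UTF_8));
            // The explicit flush described above: force() pushes the mapped pages out to the disk.
            buf.force();
        }
    }
}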
Zero-copy technology means that when the computer performs an operation, the CPU does not need to copy data from one memory area to another first, which reduces context switches and CPU copy time.
Its purpose is to reduce the number of data copies and system calls while a datagram travels from the network device to user program space, achieving zero CPU involvement and completely eliminating the CPU's load in this respect.
At present, there are three main types of zero copy technology [3]:
Direct I/O: data is transferred directly between user address space and the I/O device, bypassing the kernel, which only performs the necessary auxiliary work such as virtual memory configuration.
Avoiding copies between kernel and user space: when the application does not need to access the data, the copy from kernel space to user space can be avoided entirely; mmap, sendfile, splice & tee, and sockmap fall into this category.
Copy-on-write: data does not need to be copied in advance; only the parts that are modified are copied when a write happens.
② Disk files are sent over the network (Broker to Consumer)
The traditional way to implement this is to read from disk first and then send with the Socket, which in fact also involves four copies:
buffer = File.read()
Socket.send(buffer)
This process can be compared with the message-production path above:
First, the file data is read into the kernel-space Buffer via a system call (DMA copy).
Then the application reads the kernel-space Buffer data into the user-space Buffer (CPU copy).
Next, when the user program sends the data through the Socket, it copies the user-space Buffer into the kernel-space Socket Buffer (CPU copy).
Finally, the data is copied to the NIC Buffer via DMA copy.
The Linux 2.4+ kernel provides zero copy through the sendfile system call: after the data is copied into the kernel Buffer by DMA, it is copied directly to the NIC Buffer by DMA, with no CPU copy in between. This is also where the term zero copy comes from.
Besides reducing data copies, because the entire read-the-file-and-send-it-over-the-network operation is performed by a single sendfile call, the whole process involves only two context switches, which greatly improves performance.
The solution Kafka adopts here is to call the operating system's sendfile through NIO's transferTo/transferFrom, achieving zero copy.
In total there are 2 kernel data copies, 2 context switches, and 1 system call, and the CPU data copies are eliminated.
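A minimal sketch of that NIO call (this is not Kafka's actual network layer; the file name, host, and port are placeholders):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SendfileExample {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = FileChannel.open(Path.of("00000000000000000000.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("consumer-host", 9999))) {
            long position = 0;
            long remaining = file.size();
            // transferTo maps to sendfile on Linux: the kernel moves bytes from the file's
            // page cache straight to the socket, with no copy into user space.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}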
Batch processing
In many cases the bottleneck of the system is not the CPU or the disk, but network I/O.
Therefore, in addition to the low-level batching provided by the operating system, Kafka clients and the Broker accumulate multiple records into a batch, for both reads and writes, before sending data over the network.
Batching records amortizes the network round-trip overhead and uses larger packets, which improves bandwidth utilization.
Data compression
The Producer can compress data before sending it to the Broker to reduce network transmission cost. The compression algorithms currently supported are Snappy, Gzip, and LZ4. Data compression is generally used together with batch processing as an optimization.
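As a sketch of how both batching and compression are enabled from the producer side (the broker address, topic, and the specific values below are placeholders, not recommendations):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);      // accumulate up to 64 KB per batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);              // wait up to 10 ms to fill a batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");    // compress each batch before sending
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("demo-topic", "key-" + i, "value-" + i));
            }
        }
    }
}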
The next time the interviewer asks me why Kafka is fast, I will say:
Partition parallel processing.
Sequential disk writes, making full use of the characteristics of the disk.
Use of the modern operating system's paged memory (Page Cache), trading memory for I/O efficiency.
Zero-copy technology: the data produced by the Producer is persisted to the Broker with mmap file mapping for fast sequential writes; the Consumer reads data from the Broker using sendfile, which reads the disk file into the OS kernel buffer and then transfers it to the NIC buffer for network transmission, reducing CPU consumption.
After reading the above, do you now understand why Kafka can be so fast? Thank you for reading!