RocketMQ High-performance low-level Storage Design

2025-01-16 Update From: SLTechnology News&Howtos


A few words up front

RocketMQ borrows from Kafka in its underlying storage, but it also has its own distinctive design. This article focuses on the underlying file storage structure that has a profound impact on RocketMQ's performance, with a little Kafka sprinkled in as a comparison.

Key concepts

Commit Log: a collection of files, each 1 GB in size; when one file fills up, writing continues in the next. For the purposes of this discussion you can treat it as a single file into which all message content is persisted.

Consume Queue: a Topic can have multiple of these files, each representing one logical queue. Each entry stores a message's offset in the Commit Log, along with its size and Tag attribute.
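The Consume Queue entry format is fixed at 20 bytes: an 8-byte Commit Log offset, a 4-byte message size, and an 8-byte Tag hashcode. A minimal Python sketch of packing and unpacking one entry (the field values here are made up for illustration):

```python
import struct

# Each Consume Queue entry is a fixed 20 bytes:
#   offset in the Commit Log (8 bytes) + message size (4 bytes) + Tag hashcode (8 bytes)
ENTRY = struct.Struct(">qiq")  # big-endian: long, int, long

# Pack one illustrative entry: a message at offset 1 MB, 230 bytes long.
tags_code = hash("TagA") & 0x7FFFFFFFFFFFFFFF  # placeholder tag hash
entry = ENTRY.pack(1048576, 230, tags_code)
assert ENTRY.size == 20

offset, size, _ = ENTRY.unpack(entry)
print(offset, size)  # 1048576 230
```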

For the sake of simplicity, let's give an example.

Suppose the cluster has one Broker, and a Topic named binlog has one queue (Consume Queue). As shown in the figure below, five messages with different contents are sent in sequence.

Let's take a brief look at Commit Log and Consume Queue.

Messages in RMQ are ordered as a whole, so the five messages are persisted to the Commit Log in that order. The Consume Queue's job is to spread messages evenly across different logical queues, so that in cluster mode multiple Consumers can consume them in parallel.

Page Cache

Now that you know which files are stored where and what each contains, it's time to discuss why this storage scheme improves performance.

File reads and writes are usually relatively slow, yet sequential file IO can approach the speed of random memory access. Why is it so fast? The reason is the Page Cache.

For an intuitive look: the machine has 3.7 GB of physical memory and 2.7 GB is consumed, so about 1 GB should be free, yet the OS reports only 175 MB. Of course, the arithmetic cannot actually be done that way.

When the OS finds a large amount of physical memory sitting idle, it uses the surplus as a file cache to improve IO performance; this is the buff/cache figure in the output above. Broadly speaking, the Page Cache is a subset of that memory.

When the OS reads from disk, it reads the whole surrounding region into the cache, so the next read can hit the cache directly. When writing to disk, it writes to the cache and returns immediately; the OS's pdflush mechanism then flushes the cached data back to disk according to certain policies.

But there are many files on a system, and even surplus Page Cache is a precious resource; the OS cannot hand Page Cache out to arbitrary files at random. Linux provides mmap at the bottom layer to map a file specified by the program into virtual memory, so that reads and writes of the file become reads and writes of memory, making full use of the Page Cache. However, Page Cache alone is not enough for file IO: if the file is read and written randomly, it causes a large number of page fault interrupts in virtual memory.
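The mapping described above can be seen in miniature with Python's mmap module; this is a generic sketch of mmap-backed file IO, not RocketMQ code:

```python
import mmap, os, tempfile

# Create a small file, then map it into the process's virtual memory.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 4096)        # the mapped region must exist on disk

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)   # map the whole file
    mm[0:5] = b"hello"              # a plain memory write, no write() syscall
    mm.flush()                      # ask the OS to write dirty pages back
    mm.close()

with open(path, "rb") as f:
    data = f.read(5)
print(data)                         # b'hello'
os.close(fd); os.remove(path)
```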

Each process in user space has its own virtual memory, and each process believes it owns all of physical memory, but virtual memory is only logical memory. To access data in memory, the memory management unit (MMU) must look up the page table and map the virtual address to a physical one. If the mapped file is very large and the program accesses a part of virtual memory not yet mapped to physical memory, a page fault occurs, and the OS must read the real data from the disk file and load it into memory. It is just like an application whose cache misses on a piece of data: it has to go to the database for the data and then write the result back to the cache, and that round trip is relatively slow.

With sequential IO, however, the regions being read and written are hot areas the OS has already cached intelligently, so no flood of page faults occurs. File IO becomes almost equivalent to memory IO, and performance naturally goes up.

Having listed so many advantages of the Page Cache, we should also mention its drawback. After the kernel hands the available memory over to the Page Cache, free memory becomes relatively scarce. If the program then needs a new memory allocation or hits a page fault while free memory happens to be insufficient, the kernel has to spend time reclaiming memory from low-heat Page Cache, which causes latency spikes on very demanding systems.

Flushing to disk

Flushing generally comes in two flavors: synchronous and asynchronous.

Synchronous flush

Success is returned to the Producer only after the message is actually on disk. As long as the disk is not damaged, the message is never lost.

This is generally used only in financial scenarios. It is not the focus of this article, because it does not exploit the Page Cache; RMQ uses GroupCommit to optimize synchronous flushing.

Asynchronous flush

Reads and writes make full use of the Page Cache: writing to the Page Cache already counts as success to the Producer. RMQ has two ways of flushing asynchronously, but the overall principle is the same.

Flushing is controlled by both the program and the OS.

Let's talk about the OS first. When a program writes a file sequentially, the data first goes into the cache. Those pages are modified but not yet flushed to disk, so they are inconsistent with the disk; such inconsistent memory is called dirty pages (Dirty Page).

If the dirty-page threshold is set too low, flushes become more frequent and performance drops; if it is set too high, performance improves, but if the OS crashes outright before the dirty pages are flushed, those messages are lost.
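On Linux, these dirty-page thresholds are exposed under /proc/sys/vm/. A small Linux-only Python sketch to inspect them (the exact values depend on your distribution's defaults):

```python
# dirty_background_ratio: % of memory at which background writeback starts.
# dirty_ratio: % at which writing processes are throttled and forced to flush.
values = {}
for name in ("dirty_background_ratio", "dirty_ratio"):
    with open(f"/proc/sys/vm/{name}") as f:
        values[name] = int(f.read())
    print(name, "=", values[name])
```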

Unless you are a power user, just use the OS defaults, as shown above.

RMQ wants high performance, so when a message is sent it is written to the Page Cache rather than directly to disk, and when a message is fetched it is served directly from the Page Cache rather than read from disk through page faults.

All right, with the principles reviewed, we can look at the IO on the mmap-ed Commit Log and Consume Queue in RMQ from the perspectives of sending and receiving messages.

RMQ sending logic

When sending, the Producer does not deal with the Consume Queue directly. As mentioned above, all RMQ messages are stored in the Commit Log; to keep message storage from getting interleaved, the Commit Log is locked before writing.

After acquiring the lock, the serialized message is written sequentially to the Commit Log, the operation commonly called Append. Thanks to the Page Cache, RMQ is very efficient when writing the Commit Log.

After the message is persisted to the Commit Log, its data is dispatched to the corresponding Consume Queue.

Each Consume Queue represents one logical queue and is appended to by ReputMessageService in a single-threaded loop, so it is obviously also written sequentially.
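The send path above, a locked sequential Append to one commit log followed by a dispatch of 20-byte entries into per-queue files, can be sketched as a toy in-memory model (illustrative Python, not RocketMQ's real implementation):

```python
import struct, threading

ENTRY = struct.Struct(">qiq")             # offset (8B) + size (4B) + tag hash (8B)
commit_log = bytearray()                  # stand-in for the single Commit Log
consume_queues = {0: bytearray(), 1: bytearray()}  # two logical queues
lock = threading.Lock()

def send(queue_id: int, body: bytes, tags_code: int = 0) -> int:
    with lock:                            # one writer at a time
        offset = len(commit_log)
        commit_log.extend(body)           # sequential Append to the commit log
    # dispatch: record (offset, size, tag) in the logical queue, also an Append
    consume_queues[queue_id].extend(ENTRY.pack(offset, len(body), tags_code))
    return offset

for i in range(5):
    send(i % 2, f"msg-{i}".encode())      # five 5-byte messages, round-robin
print(len(commit_log), len(consume_queues[0]) // ENTRY.size)  # 25 3
```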

The underlying consumption logic

When consuming, the Consumer does not deal with the Commit Log directly; instead it pulls data from the Consume Queue.

The pull order is from oldest to newest, and each Consume Queue file is read sequentially, again making full use of the Page Cache.

Pulling from the Consume Queue alone yields no message data, only references into the Commit Log, so the Commit Log must be pulled next.

Reads of the Commit Log are random.

But the whole of RMQ has only one Commit Log. Although it is read randomly, the reads are ordered overall; as long as the active region stays within the reach of the Page Cache, the Page Cache is still fully utilized.
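The consume path can be sketched the same way: walk the consume queue sequentially, and use each (offset, size) pair to fetch the body from the commit log (toy data, for illustration only):

```python
import struct

ENTRY = struct.Struct(">qiq")  # offset (8B) + size (4B) + tag hash (8B)

# Toy commit log holding three concatenated message bodies.
commit_log = b"firstsecondthird"
consume_queue = ENTRY.pack(0, 5, 0) + ENTRY.pack(5, 6, 0) + ENTRY.pack(11, 5, 0)

messages = []
for pos in range(0, len(consume_queue), ENTRY.size):   # sequential queue read
    offset, size, _ = ENTRY.unpack_from(consume_queue, pos)
    messages.append(commit_log[offset:offset + size])  # "random" log read
print(messages)  # [b'first', b'second', b'third']
```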

Watching the network and the disk on a real MQ, even while consumers keep reading messages, you almost never see the process pulling data from disk: the data is sent from the Page Cache straight to the Consumer over the socket.
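That "Page Cache straight to the socket" behavior is the classic sendfile(2) zero-copy path: the kernel moves file bytes to the socket without copying them through user space. A minimal Linux sketch using Python's os.sendfile (the file is a throwaway created for the demo):

```python
import os, socket, tempfile

# Prepare a small file to act as the message payload on disk.
fd, path = tempfile.mkstemp()
os.write(fd, b"message-bytes")
os.close(fd)

# A connected pair of stream sockets stands in for the broker/consumer link.
a, b = socket.socketpair()
with open(path, "rb") as f:
    os.sendfile(a.fileno(), f.fileno(), 0, 13)  # kernel-side copy, no read()
received = b.recv(64)
print(received)  # b'message-bytes'
a.close(); b.close(); os.remove(path)
```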

Comparison with Kafka

As stated at the beginning of the article, RMQ borrows ideas from Kafka, yet in its underlying storage it also breaks with Kafka's design.

In Kafka, message storage revolves around a single kind of file called the Partition (ignoring the finer-grained Segment). It fulfills the combined responsibilities of RMQ's Commit Log and Consume Queue: it splits storage logically to raise consumption parallelism, and it stores the real message content inside.

This looks perfect: for both Producer and Consumer, a single Partition file is sequential IO in the normal send and consume paths, fully enjoying the huge performance boost of the Page Cache. But suppose there are many Topics, each split into N Partitions; then from the OS's point of view, the sequential reads and writes of so many files, happening concurrently, degenerate into random reads and writes.

At this point, for some reason, the game whack-a-mole comes to mind. For any single hole, the moles I whack always come up in order; but imagine 10,000 holes, with you as the only player, and countless moles popping in and out of every hole at random. Picture that scene for yourself.

Of course, sharp readers will immediately ask: with a huge number of queues, isn't RMQ's Consume Queue in the same situation as Kafka, each file sequential IO but random IO overall? Don't forget that RMQ's Consume Queue does not store message content: each entry occupies only 20 bytes, so the files can be kept very small and most accesses hit the Page Cache rather than the disk. In a formal deployment, the Commit Log and Consume Queues can also be placed on different physical SSDs to avoid IO contention between the two kinds of files.

A few words at the end

For more wonderful articles, please follow my official Wechat account: Eric's technology.
