How RocketMQ's Message Persistence Design Breaks Through the Disk Performance Bottleneck

2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article looks at how the design of RocketMQ's message persistence breaks through the disk performance bottleneck. Many readers are not familiar with the topic, so the key points are summarized below.


Distributed message queues usually require high reliability, so message data must be stored persistently. The question is how to persist it.

In terms of storage efficiency, the ranking is file system > KV store > relational database. Operating on the file system directly is naturally the fastest option, but is that enough?

Of course not. It is well known that disk I/O is often what holds system performance back. So how does RocketMQ deal with it?

Let's go through it step by step.

Storage architecture design

First, a thought experiment: you come across a Chinese character you don't know, and you happen to have a Chinese dictionary at hand. How do you find the character as quickly as possible?

Nobody who has been through primary school would start at page one and search page by page.

As good primary school graduates, we first look the character up in the radical index to get its page number, and then turn to that page in the dictionary to find out how it is read.

The same principle applies here.

In RocketMQ, all messages are stored in the same file, which is like one thick dictionary. For a consumer to consume in real time with maximum efficiency, it must be able to locate a message within that file quickly; it certainly cannot scan downward from file offset 0.

A picture is worth a thousand words:

RocketMQ has three main storage files, which are:

CommitLog: message store file, where all messages are stored

ConsumeQueue: the consumption queue file. After a message is stored in the CommitLog, its CommitLog offset, size, and tag hashcode are asynchronously forwarded to the consumption queue for consumers to read. It works like a database index file and stores physical storage addresses. Each message queue under each topic has a corresponding ConsumeQueue file.

Index: the index file. After a message is stored in the CommitLog, the message key and the message's CommitLog offset are forwarded to the index file, which is used for querying messages.

The schematic shows that message production and consumption are decoupled. A message sent by the producer is ultimately written to the CommitLog. The consumer first reads the message's starting physical offset, size, and tag hashcode from the ConsumeQueue, and then reads the actual message body from the CommitLog.
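The two-step read path described above can be sketched with in-memory structures. This is only an illustrative model, not RocketMQ's code: the class `MiniStore` and its methods `put` and `get` are hypothetical names, a byte stream stands in for the CommitLog, and the queue index is updated synchronously for simplicity (the article says RocketMQ does this asynchronously).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model: one shared log plus one small index per queue.
final class MiniStore {
    // Stand-in for the CommitLog: messages of every topic are appended here.
    private final ByteArrayOutputStream commitLog = new ByteArrayOutputStream();
    // Stand-in for ConsumeQueue files: "topic-queueId" -> list of {offset, size}.
    private final Map<String, List<long[]>> consumeQueues = new HashMap<>();

    // Appends the body to the shared log, then records where it landed.
    synchronized long put(String topic, int queueId, String body) {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        long offset = commitLog.size();
        commitLog.write(bytes, 0, bytes.length);
        consumeQueues.computeIfAbsent(topic + "-" + queueId, k -> new ArrayList<>())
                     .add(new long[] {offset, bytes.length});
        return offset;
    }

    // Step 1: read {offset, size} from the queue index.
    // Step 2: read exactly that byte range from the shared log.
    synchronized String get(String topic, int queueId, int logicalIndex) {
        long[] entry = consumeQueues.get(topic + "-" + queueId).get(logicalIndex);
        byte[] all = commitLog.toByteArray();
        return new String(all, (int) entry[0], (int) entry[1], StandardCharsets.UTF_8);
    }
}
```

Messages from different topics end up interleaved in the one log, yet each consumer only walks its own small index.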

That covers how consumers locate messages quickly so they can consume efficiently. Now let's look at how RocketMQ makes message storage efficient.

Consider a question first. If you owned a printing plant, how would you print a complete Chinese dictionary quickly and without errors?

The answer is simple: start at the first page and print page by page in order. Don't skip pages, let alone print them in random order.

Disk writes work the same way. It is widely reported that sequential writes on a high-performance disk can approach memory write speed, whereas random writes hit an obvious bottleneck and are much slower.

RocketMQ therefore stores all messages in the CommitLog and takes a lock (putMessageLock) on the write path so that messages are written strictly sequentially. This avoids the rise in I/O wait caused by contention for the disk and greatly improves write efficiency.
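As a rough illustration of "lock, then append", here is a minimal sketch of an append-only log. It is not RocketMQ's implementation: the class `AppendOnlyLog` and its method `append` are hypothetical, and only the lock name is borrowed from the article. It uses a plain `FileChannel` rather than a mapped file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical append-only log: one lock, one write position, no seeking back.
final class AppendOnlyLog implements AutoCloseable {
    private final ReentrantLock putMessageLock = new ReentrantLock();
    private final FileChannel channel;
    private long writePosition;

    AppendOnlyLog(Path file) {
        try {
            channel = FileChannel.open(file, StandardOpenOption.CREATE,
                    StandardOpenOption.READ, StandardOpenOption.WRITE);
            writePosition = channel.size(); // continue after existing data
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Appends the bytes at the current end of the log and returns their offset.
    long append(byte[] data) {
        putMessageLock.lock(); // only one writer advances the position at a time
        try {
            long offset = writePosition;
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                writePosition += channel.write(buf, writePosition);
            }
            return offset;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            putMessageLock.unlock();
        }
    }

    @Override
    public void close() {
        try {
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because every writer goes through the same lock and the position only moves forward, concurrent producers still produce a strictly sequential write pattern on disk.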

We can use a more detailed diagram to illustrate:

The producer writes the CommitLog sequentially, and the consumer reads the ConsumeQueue sequentially. Note, however, that reading the ConsumeQueue sequentially does not mean the messages themselves are read sequentially: the message body is fetched from the CommitLog at the starting physical offset recorded in the ConsumeQueue, so under very high concurrency the CommitLog is effectively read at random. Random file reads are relatively expensive, so RocketMQ relies on the operating system's page cache, which reads from disk in batches and keeps the data cached in memory to speed up reads.

Storage file

Opening the directory where RocketMQ persists data on disk (the store directory) shows the three folders CommitLog, ConsumeQueue, and Index directly. (The config folder holds some runtime configuration; the roles of abort and checkpoint are left for later articles.)

Contents of the CommitLog folder (${ROCKET_HOME}/store/commitlog)

Each file is 1 GB in size and is named after the offset of its first byte, left-padded with zeros to 20 digits. In the figure, the first file starts at offset 9663676416 and the second at offset 10737418240.
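The naming rule can be reproduced in a few lines. The sketch below is a hypothetical helper (`CommitLogNames` is not a RocketMQ class); it assumes the 1 GB file size is exactly 1024 × 1024 × 1024 bytes, which is consistent with the two offsets quoted above.

```java
import java.util.Locale;

// Hypothetical helper that derives CommitLog file names from offsets.
final class CommitLogNames {
    // Assumed file size: 1 GiB, matching the offsets 9663676416 and 10737418240.
    static final long FILE_SIZE = 1024L * 1024 * 1024;

    // Start offset of the file that contains the given physical offset.
    static long fileStartOffset(long physicalOffset) {
        return physicalOffset - (physicalOffset % FILE_SIZE);
    }

    // File name: the start offset, left-padded with zeros to 20 digits.
    static String fileName(long startOffset) {
        return String.format(Locale.ROOT, "%020d", startOffset);
    }
}
```

For example, `CommitLogNames.fileName(9663676416L)` yields `00000000009663676416`, the ninth multiple of the file size.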

The internal storage logic of the CommitLog file is that the first four bytes of each message store the total length of the message (including the length information itself), followed by the message content. As shown in the figure:

Message length = message length information (4 bytes) + message content length.
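A length-prefixed record of this kind can be sketched as follows. The class `LengthPrefixedFrame` and its methods are hypothetical, and `body` is a stand-in for everything the article calls "message content"; the sketch does not attempt to model the fields inside a real RocketMQ message.

```java
import java.nio.ByteBuffer;

// Hypothetical framing helper: [4-byte total length][content].
final class LengthPrefixedFrame {
    // Total length = 4 bytes of length information + content length.
    static byte[] encode(byte[] body) {
        int totalSize = 4 + body.length;
        return ByteBuffer.allocate(totalSize).putInt(totalSize).put(body).array();
    }

    // Reads one record knowing only its starting offset in the log.
    static byte[] decodeAt(ByteBuffer log, int offset) {
        int totalSize = log.getInt(offset);   // first 4 bytes hold the total length
        byte[] body = new byte[totalSize - 4];
        ByteBuffer view = log.duplicate();    // leave the caller's position untouched
        view.position(offset + 4);
        view.get(body);
        return body;
    }
}
```

Since the prefix includes its own 4 bytes, adding the value read at an offset to that offset gives the start of the next record.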

Steps to implement message lookup:

1. The consumer gets the message's offset and size from the consumption queue.

2. Locate the CommitLog physical file that contains the message, based on the offset.

3. Take the offset modulo the file size to get the message's position inside that CommitLog file.

4. Read size bytes starting from that position and return them.
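The lookup steps above can be sketched like this. `SegmentedLogReader` is a hypothetical class; it assumes fixed-size segments held in memory, with segment 0 starting at offset 0, and a message that never spans two segments.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical reader over fixed-size log segments.
final class SegmentedLogReader {
    private final List<ByteBuffer> segments; // segment i starts at offset i * segmentSize
    private final int segmentSize;

    SegmentedLogReader(List<ByteBuffer> segments, int segmentSize) {
        this.segments = segments;
        this.segmentSize = segmentSize;
    }

    // Step 1 (getting offset and size from the consumption queue) is the caller's job.
    byte[] read(long physicalOffset, int size) {
        int fileIndex = (int) (physicalOffset / segmentSize); // step 2: which file
        int posInFile = (int) (physicalOffset % segmentSize); // step 3: where in the file
        ByteBuffer view = segments.get(fileIndex).duplicate();
        view.position(posInFile);
        byte[] out = new byte[size];
        view.get(out);                                        // step 4: read size bytes
        return out;
    }
}
```

Both the file and the position inside it come from simple arithmetic on the offset, so no searching is involved.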

Note: to look up a message by offset alone, first find the position within the file, read the first 4 bytes to get the message's actual length, and then read that many bytes.

There is also an ingenious design here: a pre-allocation mechanism. RocketMQ does not wait until the current CommitLog file is full before creating the next one.

When a CommitLog file is created, the path of the next file, the path of the file after that, and the file size are wrapped in an AllocateRequest object and added to a request queue. The AllocateMappedFileService thread runs continuously in the background; whenever the queue contains a request, it creates the corresponding CommitLog file. Because the file after next is created ahead of time and kept with its request, it can be returned immediately the next time a file is needed, with no delay spent waiting for the file to be created and allocated.
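The idea of pre-allocation can be sketched with a queue and a background thread. This is a simplified model under stated assumptions, not the AllocateMappedFileService code: the class `PreAllocator` and its methods are hypothetical, and a `byte[]` stands in for a created and mapped file.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical pre-allocation service; byte[] stands in for a mapped file.
final class PreAllocator implements AutoCloseable {
    private static final class Request {
        final String path;
        final CompletableFuture<byte[]> result = new CompletableFuture<>();
        Request(String path) { this.path = path; }
    }

    private final ConcurrentHashMap<String, Request> pending = new ConcurrentHashMap<>();
    private final BlockingQueue<Request> queue = new LinkedBlockingQueue<>();
    private final int fileSize;
    private final Thread worker;

    PreAllocator(int fileSize) {
        this.fileSize = fileSize;
        this.worker = new Thread(this::serveRequests, "allocate-service");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    // Background loop: create whatever has been requested, in request order.
    private void serveRequests() {
        try {
            while (true) {
                Request r = queue.take();
                r.result.complete(new byte[fileSize]); // stand-in for creating the file
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop quietly on close()
        }
    }

    // Enqueues a request unless one for the same path already exists.
    private void submit(String path) {
        Request r = new Request(path);
        if (pending.putIfAbsent(path, r) == null) {
            queue.add(r);
        }
    }

    // Requests the next file and the one after it, then waits only for the next.
    synchronized byte[] take(String nextPath, String nextNextPath) {
        submit(nextPath);
        submit(nextNextPath);
        return pending.remove(nextPath).result.join();
    }

    int pendingCount() { return pending.size(); }

    @Override
    public void close() { worker.interrupt(); }
}
```

On the second call, the "next" file was already requested as the previous call's "file after next", so it is usually ready by the time the writer asks for it.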

Contents of the ConsumeQueue folder (${ROCKET_HOME}/store/consumequeue)

Consumers care mostly about the messages under a particular topic, but in RocketMQ the messages of different topics are interleaved in the same file. To speed up lookups, a file similar to a search index is needed, and that is the consumption queue file, ConsumeQueue.

In terms of physical storage, there is one set of ConsumeQueue files per Topic and QueueId. In the figure above, 00000000000012000000 is a ConsumeQueue file for topic sim-online-orders, QueueId 1. Each file holds 300,000 entries of 20 bytes each, so the default file size is 6,000,000 bytes (about 5.72 MB). When a ConsumeQueue file is full, writing continues in the next file.

The internal storage logic of the ConsumeQueue file is shown in the figure:

Each entry contains the message's offset in the CommitLog, the message length, and the hashcode of the message tag. A single ConsumeQueue file can be thought of as an array of ConsumeQueue entries whose subscript is the logical offset in the ConsumeQueue.
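The "array of fixed-size entries" view can be sketched as below. `ConsumeQueueEntry` is a hypothetical class. The article only gives the 20-byte total; the split into 8 bytes (offset), 4 bytes (size), and 8 bytes (tag hashcode) is an assumption that adds up to that total.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-size entry: 8-byte offset + 4-byte size + 8-byte tag hash (assumed split).
final class ConsumeQueueEntry {
    static final int UNIT_SIZE = 20;             // bytes per entry, from the article
    static final int ENTRIES_PER_FILE = 300_000; // entries per file, from the article

    final long commitLogOffset;
    final int size;
    final long tagsHash;

    ConsumeQueueEntry(long commitLogOffset, int size, long tagsHash) {
        this.commitLogOffset = commitLogOffset;
        this.size = size;
        this.tagsHash = tagsHash;
    }

    // Writes the entry at array slot logicalIndex, i.e. byte position logicalIndex * 20.
    static void write(ByteBuffer file, int logicalIndex, ConsumeQueueEntry e) {
        int pos = logicalIndex * UNIT_SIZE;
        file.putLong(pos, e.commitLogOffset);
        file.putInt(pos + 8, e.size);
        file.putLong(pos + 12, e.tagsHash);
    }

    // Reads the entry at array slot logicalIndex without scanning earlier entries.
    static ConsumeQueueEntry read(ByteBuffer file, int logicalIndex) {
        int pos = logicalIndex * UNIT_SIZE;
        return new ConsumeQueueEntry(
                file.getLong(pos), file.getInt(pos + 8), file.getLong(pos + 12));
    }
}
```

Because every entry has the same width, the position of entry n is simply n × 20, and 300,000 entries fill exactly 6,000,000 bytes.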

Message consumption queues are index files that RocketMQ builds for message subscription, to speed up retrieving messages by topic and message queue.

Contents of the Index folder (${ROCKET_HOME}/store/index)

To look up a message's actual content by its key, RocketMQ introduces a hash index mechanism. On disk, each IndexFile is named after the timestamp of its creation, has a fixed size of about 400 MB, and can hold 20 million index entries.

Let's first look at the internal storage logic of the Index index file:

IndexFile consists of three parts: the IndexHead, the hash slots, and the index entries.

1. IndexHead: 40 bytes that record some statistics:

BeginTimestamp: the minimum storage time among the messages in this index file.

EndTimestamp: the maximum storage time among the messages in this index file.

BeginPhyoffset: the minimum physical offset (CommitLog offset) among the messages in this index file.

EndPhyoffset: the maximum physical offset (CommitLog offset) among the messages in this index file.

HashslotCount: the number of hash slots. This is not the number of slots in use, so it carries little meaning here.

IndexCount: the number of index entries currently in use. Index entries are stored sequentially in the index entry list.

2. Hash slots: 5 million slots by default. Each slot stores the number (subscript) of the most recent index entry whose key HashCode maps to that slot.

3. Index entries: by default an index file contains 20 million entries, each with:

Hashcode: the HashCode of the key.

Phyoffset: the physical offset of the message.

Timedif: the difference between the message's storage time and the timestamp of the first message. If it is less than 0, the message is invalid.

PreIndexNo: the number of the previous index entry for this HashCode. This forms the linked list that is built when a hash conflict occurs.
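The three-part layout makes every position in the file computable. The sketch below is a hypothetical helper (`IndexFileLayout` is not a RocketMQ class). The header size and the slot and entry counts come from the article; the 4-byte slot width and the 20-byte entry width (4 + 8 + 4 + 4 for the four fields) are assumptions, chosen because they reproduce the "about 400 MB" file size.

```java
// Hypothetical layout calculator for an index file: header, then slots, then entries.
final class IndexFileLayout {
    static final int HEADER_SIZE = 40;          // from the article
    static final int SLOT_COUNT = 5_000_000;    // from the article
    static final int ENTRY_COUNT = 20_000_000;  // from the article
    static final int SLOT_SIZE = 4;             // assumption: one int per slot
    static final int ENTRY_SIZE = 20;           // assumption: 4 + 8 + 4 + 4 bytes

    // Byte position of a hash slot.
    static long slotPosition(int slot) {
        return HEADER_SIZE + (long) slot * SLOT_SIZE;
    }

    // Byte position of an index entry, counted from 0 in this sketch.
    static long entryPosition(int entryNo) {
        return HEADER_SIZE + (long) SLOT_COUNT * SLOT_SIZE + (long) entryNo * ENTRY_SIZE;
    }

    // Total file size: 40 + 5,000,000 * 4 + 20,000,000 * 20 bytes.
    static long totalSize() {
        return entryPosition(ENTRY_COUNT);
    }
}
```

Under these assumptions the total is 420,000,040 bytes, or roughly 400.5 MiB, in line with the figure quoted earlier.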

Does this data structure make sense? The design is really elegant.

If not, here is a picture that shows the subtlety of this data structure:

First, the HashCode of the key is taken modulo the number of slots to get the slot. The corresponding data is then appended to the index entry list, and the number of the new entry is written back to that slot.

If a hash conflict occurs, the index entries form a linked list through pre index no:

As shown for the conflict in the second slot, the pre index no of the fifth index entry stores the entry number that the slot held before, which is 2. It is in effect a variant of the HashMap structure.

With this structure, a message can be located quickly using its key.
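The slot-plus-linked-list structure can be sketched with plain arrays. `MiniHashIndex` is a hypothetical in-memory model, not the IndexFile code: entries are numbered from 1 so that 0 can mean "empty slot" or "end of chain", and the time fields are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of the hash slot + index entry structure.
final class MiniHashIndex {
    private final int[] slots;       // slot -> number of the newest entry, 0 = empty
    private final int[] keyHash;     // entry fields, indexed by entry number (1-based)
    private final long[] phyOffset;
    private final int[] preIndexNo;  // previous entry in the same slot, 0 = none
    private int indexCount = 0;

    MiniHashIndex(int slotCount, int maxEntries) {
        slots = new int[slotCount];
        keyHash = new int[maxEntries + 1];
        phyOffset = new long[maxEntries + 1];
        preIndexNo = new int[maxEntries + 1];
    }

    private static int hashOf(String key) {
        return key.hashCode() & 0x7fffffff; // keep the hash non-negative
    }

    // Appends an entry and makes it the new head of its slot's chain.
    boolean put(String key, long offset) {
        if (indexCount >= keyHash.length - 1) {
            return false; // file is full; a real store would roll to a new file
        }
        int h = hashOf(key);
        int slot = h % slots.length;
        int entryNo = ++indexCount;
        keyHash[entryNo] = h;
        phyOffset[entryNo] = offset;
        preIndexNo[entryNo] = slots[slot]; // link to the previous head
        slots[slot] = entryNo;             // the slot now points at the newest entry
        return true;
    }

    // Walks the chain from the slot head; returns candidate offsets, newest first.
    List<Long> query(String key) {
        int h = hashOf(key);
        List<Long> result = new ArrayList<>();
        for (int n = slots[h % slots.length]; n != 0; n = preIndexNo[n]) {
            if (keyHash[n] == h) {
                result.add(phyOffset[n]);
            }
        }
        return result;
    }
}
```

Only hashcodes are compared, so two different keys with the same hashcode would both match; a caller would still need to check the key on the message itself.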

Memory mapping

If the above is RocketMQ improving the performance of a distributed message queue by optimizing data structures, what follows is optimization through the underlying operating system.

In Linux, execution is divided into "user mode" and "kernel mode". With standard I/O, reading a file and sending it over the network works as follows: data is first copied from disk into kernel memory, then from kernel memory into user memory, which completes the read; it is then copied from user memory into the kernel memory of the network stack, and finally from there to the network card for transmission, which completes the write.

The whole process involves four copies, so it is clearly inefficient.

RocketMQ therefore uses MappedByteBuffer (mmap) in Java to achieve "zero copy", which eliminates the copy into user-mode memory and speeds up both message storage and network transmission.

So what is mmap memory mapping?

mmap maps a region of a user process's private address space directly onto a file object, so the program can read and write the file as if it were memory. When a page fault occurs, the file data is loaded from disk straight into memory that the user process can address, so it is copied only once. For large files (the file size generally needs to be kept below 1.5 to 2 GB), mmap is very efficient for reading and writing. As shown in the figure:

Restrictions on using Mmap:

a. Releasing mmap-mapped memory: the mapped memory is not part of the JVM heap (Java Heap), so it is not managed by JVM GC. Releasing it requires calling unmap(), but unmap() is a private method of the FileChannelImpl class and cannot be called explicitly. RocketMQ instead uses Java reflection to call the clean() method of the Cleaner class in the "sun.misc" package to release the mapped memory.

b. MappedByteBuffer mapping size limit: because it consumes virtual memory (not JVM heap memory), its size is not limited by the JVM -Xmx parameter, but it is still limited by the OS virtual memory size. Generally only about 1.5 to 2 GB of a file can be mapped into user-mode virtual address space at a time, which is why RocketMQ sets the default size of a single CommitLog data file to 1 GB.

c. Other issues with MappedByteBuffer: high memory usage and no guarantee of when the file is actually closed.
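The mapping mechanism described in this section can be tried with the standard `FileChannel.map` API. The sketch below is a minimal demonstration, not RocketMQ's MappedFile: the class `MmapDemo` and its methods are hypothetical, and it makes no attempt to unmap the buffer explicitly.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo of reading and writing a file through a memory mapping.
final class MmapDemo {
    // Maps fileSize bytes of the file and writes data at the given position.
    static void writeMapped(Path file, int fileSize, int position, byte[] data) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // A READ_WRITE mapping larger than the file extends the file to fileSize.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, fileSize);
            map.position(position);
            map.put(data);  // a plain memory write, no write() system call
            map.force();    // ask the OS to flush the dirty pages to disk
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Maps the whole file read-only and copies out length bytes from position.
    static byte[] readMapped(Path file, int position, int length) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] out = new byte[length];
            map.position(position);
            map.get(out);
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Closing the channel does not release the mapping; as point a notes, the mapped region stays alive until the buffer is garbage collected unless it is cleaned explicitly.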

What are the ways to break through the performance bottleneck?

1. Simple, efficient data structures that speed up retrieval

2. Sequential disk writes that avoid contention from out-of-order I/O and speed up message storage

3. A pre-allocation mechanism that reduces time spent waiting on file creation

4. Reliance on the page cache, so messages are read from disk in batches and cached, which speeds up reads

5. Memory mapping, which reduces the number of copies between user mode and kernel mode and improves processing efficiency

