This article explains Pulsar's message storage mechanism and the principles behind the GC (garbage collection) mechanism of Bookie, and closes with an analysis of problems encountered in operation.
Pulsar Message Storage
Pulsar's messages are stored in BookKeeper. BookKeeper is a "fat client" system: the client part is the BookKeeper library, while each storage node in the server-side cluster is called a bookie. A Pulsar broker acts as a client of the BookKeeper storage system and stores Pulsar's messages in the bookie cluster through the client SDK provided by BookKeeper.
Each partition of each topic in Pulsar (a non-partitioned topic can be understood as having the single partition 0; partitions of a partitioned topic are numbered starting from 0) corresponds to a series of ledgers, and each ledger stores only the messages of its partition. For each partition, only one ledger is open for writing at any given time.
When Pulsar produces and stores a message, it first finds the ledger currently in use by the partition, then generates the entry ID for the message; entry IDs increase monotonically within the same ledger. When producing without batching (batching can be configured on the producer side and is enabled by default), one entry contains one message; in batch mode, one entry may contain multiple messages. Bookies, for their part, write, look up, and retrieve data only at the entry granularity.
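As a concrete illustration, here is a minimal sketch using the Pulsar Java client; the service URL, topic name, and batching limits are placeholders:

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class BatchingExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker address
                .build();

        // Batching is on by default; made explicit here with illustrative limits.
        // Multiple messages may be packed into a single BookKeeper entry.
        Producer<byte[]> batched = client.newProducer()
                .topic("my-topic")
                .enableBatching(true)
                .batchingMaxMessages(100)
                .batchingMaxPublishDelay(10, TimeUnit.MILLISECONDS)
                .create();

        // With batching disabled, each entry stored in BookKeeper holds one message.
        Producer<byte[]> unbatched = client.newProducer()
                .topic("my-topic")
                .enableBatching(false)
                .create();

        batched.send("hello".getBytes());
        unbatched.send("world".getBytes());
        client.close();
    }
}
```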
Therefore, the msgID of each Pulsar message is composed of four parts (three in older versions): (ledgerID, entryID, partition-index, batch-index), where partition-index is -1 for non-partitioned topics and batch-index is -1 for non-batched messages.
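To see the four parts on a received message, one can inspect the message ID with the Pulsar Java client. Note that MessageIdImpl and BatchMessageIdImpl are internal client classes, so treat this as an illustrative sketch rather than a stable API:

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.impl.BatchMessageIdImpl;
import org.apache.pulsar.client.impl.MessageIdImpl;

public class MsgIdParts {
    // Prints (ledgerID, entryID, partition-index, batch-index) for one message.
    static void printParts(Consumer<byte[]> consumer) throws Exception {
        Message<byte[]> msg = consumer.receive();
        MessageId id = msg.getMessageId();
        if (id instanceof BatchMessageIdImpl batchId) {
            // Batched message: all four components are meaningful.
            System.out.printf("ledger=%d entry=%d partition=%d batch=%d%n",
                    batchId.getLedgerId(), batchId.getEntryId(),
                    batchId.getPartitionIndex(), batchId.getBatchIndex());
        } else if (id instanceof MessageIdImpl impl) {
            // Non-batched message: batch-index is -1.
            System.out.printf("ledger=%d entry=%d partition=%d batch=-1%n",
                    impl.getLedgerId(), impl.getEntryId(), impl.getPartitionIndex());
        }
    }
}
```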
A ledger is rolled over when its lifetime or the number of entries it holds exceeds a threshold, and subsequent messages of the same partition are stored in the next ledger. A ledger is only a logical concept, a logical grouping dimension for data; it has no corresponding physical entity.
After each bookie node in the BookKeeper cluster receives a message, the data is stored and processed in three parts: journal files, entrylog files, and index files.
Entry data is written to journal files in WAL (write-ahead log) fashion. Each journal file has a size limit; once a single file exceeds it, writing switches to the next file. Because journal files are flushed to disk in real time, it is recommended to keep the journal storage directory separate from the entrylog directory, and to mount a dedicated disk (an SSD is recommended) for the journal directory, in order to improve performance and keep read and write I/O from interfering with each other. Only a limited number of journal files are retained; files beyond the configured count are deleted. Entries land in journal files in pure arrival order, first come first served; the journal files exist to guarantee that messages are not lost.
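The rotation logic can be sketched as follows. This is an illustration of the idea under stated assumptions (the file naming scheme and the explicit fsync per append are mine), not Bookie's actual journal code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Append-only journal with a per-file size limit; rotates to a new file when full.
class JournalWriter {
    private final Path dir;
    private final long maxBytesPerFile;
    private FileChannel channel;
    private long written;

    JournalWriter(Path dir, long maxBytesPerFile) {
        this.dir = dir;
        this.maxBytesPerFile = maxBytesPerFile;
    }

    synchronized void append(ByteBuffer entry) throws IOException {
        if (channel == null || written + entry.remaining() > maxBytesPerFile) {
            rotate(); // size limit reached: switch to the next journal file
        }
        written += channel.write(entry);
        channel.force(false); // flush to disk before acking: WAL semantics
    }

    private void rotate() throws IOException {
        if (channel != null) channel.close();
        Path file = dir.resolve("journal-" + System.nanoTime() + ".txn");
        channel = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        written = 0;
    }
}
```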
After a bookie receives an add-entry request, the entry is mapped to a journal directory and an entrylog directory according to its ledger id, and the data is stored under the corresponding directories. At present, a bookie does not support changing its storage directories while running (adding or removing directories causes some data to become unfindable).
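To illustrate why, here is a hypothetical mapping of ledger id to directory; Bookie's actual placement logic may differ, but any mapping that depends on the directory count has the same failure mode:

```java
public class DirMapping {
    // Hypothetical: pick one of N storage directories by ledger id.
    static int dirIndexFor(long ledgerId, int numDirs) {
        return (int) Math.floorMod(ledgerId, (long) numDirs);
    }

    public static void main(String[] args) {
        // Ledger 10 was written under dir-1 when there were 3 directories...
        System.out.println(dirIndexFor(10, 3)); // 1
        // ...but after adding a 4th directory, lookups go to dir-2 and miss.
        System.out.println(dirIndexFor(10, 4)); // 2
    }
}
```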
When a bookie receives an entry write request, the entry is written to the journal file and saved into the write cache at the same time. The write cache is split into two parts, one that is currently accepting writes and one that is currently being flushed, and the two parts are used in alternation.
The write cache contains an index structure through which the corresponding entry can be found. This index is memory-only and is based on the ConcurrentLongLongPairHashMap structure that bookie defines itself.
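A minimal sketch of this double-buffered write cache, assuming simplified on-heap types (the real bookie indexes with ConcurrentLongLongPairHashMap over its own buffers):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Two halves: 'active' accepts writes while 'flushing' is persisted; they swap roles.
class WriteCache {
    private volatile Map<Long, ConcurrentSkipListMap<Long, byte[]>> active = new ConcurrentHashMap<>();
    private volatile Map<Long, ConcurrentSkipListMap<Long, byte[]>> flushing = new ConcurrentHashMap<>();

    void put(long ledgerId, long entryId, byte[] data) {
        // ConcurrentSkipListMap keeps each ledger's entries sorted by entry id,
        // which is what later makes the flushed entrylog locally ordered.
        active.computeIfAbsent(ledgerId, k -> new ConcurrentSkipListMap<>()).put(entryId, data);
    }

    byte[] get(long ledgerId, long entryId) {
        Map<Long, byte[]> inActive = active.get(ledgerId);
        if (inActive != null && inActive.containsKey(entryId)) return inActive.get(entryId);
        Map<Long, byte[]> inFlushing = flushing.get(ledgerId);
        return inFlushing != null ? inFlushing.get(entryId) : null;
    }

    synchronized Map<Long, ConcurrentSkipListMap<Long, byte[]>> swapForFlush() {
        Map<Long, ConcurrentSkipListMap<Long, byte[]>> toFlush = active;
        active = new ConcurrentHashMap<>();
        flushing = toFlush;
        return toFlush; // the caller flushes this snapshot to the entrylog
    }
}
```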
In addition, each entrylog storage directory corresponds to one instance of the SingleDirectoryDbLedgerStorage class, and each SingleDirectoryDbLedgerStorage object holds an index structure implemented on RocksDB, through which the entrylog file storing any given entry can be located quickly. Each write cache sorts entries as they are added, so within the same write cache the data of the same ledger sits together in order; when the write cache is flushed to an entrylog file, the data written to the file is therefore locally ordered. This design greatly improves subsequent read efficiency.
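Continuing the sketch, a flush pass might look like the following, where EntryLog and LocationIndex are hypothetical interfaces standing in for the entrylog writer and the RocksDB-backed index:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

class Flusher {
    interface EntryLog { long append(long ledgerId, long entryId, byte[] data); } // returns location
    interface LocationIndex { void put(long ledgerId, long entryId, long location); }

    // Ledgers are iterated in sorted order and each ledger's entries are already
    // sorted by entry id, so the entrylog file ends up locally ordered, and the
    // index records where each (ledgerId, entryId) landed for later reads.
    void flush(Map<Long, ConcurrentSkipListMap<Long, byte[]>> snapshot,
               EntryLog log, LocationIndex index) {
        new TreeMap<>(snapshot).forEach((ledgerId, entries) ->
                entries.forEach((entryId, data) ->
                        index.put(ledgerId, entryId, log.append(ledgerId, entryId, data))));
    }
}
```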
The index data in SingleDirectoryDbLedgerStorage is also flushed to index files as entries are flushed. When a bookie goes down and restarts, data can be recovered from the journal files and entrylog files, so no data is lost.
A Pulsar consumer benefits from multiple tiers of cache acceleration when it consumes data.
The lookup order is as follows:
1. The entry cache on the broker side; if missed, continue.
2. The part of the bookie write cache that is currently being written; if missed, continue.
3. The part of the bookie write cache that is currently being flushed; if missed, continue.
4. The read cache of the bookie; if missed, continue.
5. Finally, the entrylog file on disk, located through the index.
At each of the steps above, if the data is found it is returned immediately and the remaining steps are skipped. If the data is read from a disk file, it is also placed into the read cache on the way back. Moreover, a disk read performs read-ahead: because the data was locally ordered when stored, the probability that adjacent data will be requested next is very high, so read-ahead greatly improves the efficiency of subsequent reads.
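The cascade can be sketched as below; every interface here is a simplified stand-in, not an actual Pulsar or BookKeeper API:

```java
// Cascading lookup across the cache tiers; a disk hit populates the read cache
// and reads ahead, since locally ordered storage makes neighbours likely next.
class TieredReader {
    interface Cache { byte[] get(long ledgerId, long entryId); }
    interface Disk { byte[] readAndReadAhead(long ledgerId, long entryId, Cache readCache); }

    byte[] read(long ledgerId, long entryId,
                Cache brokerEntryCache, Cache writeCacheActive,
                Cache writeCacheFlushing, Cache readCache, Disk disk) {
        Cache[] tiers = {brokerEntryCache, writeCacheActive, writeCacheFlushing, readCache};
        for (Cache tier : tiers) {
            byte[] hit = tier.get(ledgerId, entryId);
            if (hit != null) return hit; // found: skip the remaining tiers
        }
        return disk.readAndReadAhead(ledgerId, entryId, readCache); // slowest path
    }
}
```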
In practice, we should try to avoid or reduce scenarios that consume data old enough to require reading entries from disk files, so that the performance of the overall system is not affected.
GC Mechanism of BookKeeper
Each bookie in BookKeeper performs data cleanup periodically; by default a check runs every 15 minutes. The main steps of the cleanup are as follows.
1. Clean up the ledger ids stored on the bookie: compare the ledger ids held by the bookie with those recorded in ZooKeeper, and delete from the bookie any ledger id that no longer exists in ZooKeeper.
2. Based on each entrylog's metadata, count the proportion of surviving entries in each entrylog file; when the number of live ledgers in an entrylog drops to 0 (that is, when every ledger id the entrylog contains has become invalid), delete the entrylog file.
3. Compact entrylog files whose surviving-entry ratio falls below a threshold: below 0.5 for major GC (default period one day), or below 0.2 for minor GC (default period one hour). Compaction copies the surviving entries from the old file into a new file and then deletes the old entrylog file; a single GC run may take a long time if the entrylog files are large.
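The decisions in steps 2 and 3 reduce to two simple checks, sketched below with a hypothetical metadata type. The 0.2/0.5 thresholds and hourly/daily periods are the defaults quoted above (in BookKeeper they correspond to settings along the lines of minorCompactionThreshold and majorCompactionThreshold; treat the exact names as an assumption):

```java
import java.util.Set;

class EntryLogGcPolicy {
    interface EntryLogMeta { Set<Long> containedLedgerIds(); double liveEntryRatio(); }

    static final double MINOR_THRESHOLD = 0.2; // minor GC, default period one hour
    static final double MAJOR_THRESHOLD = 0.5; // major GC, default period one day

    // Step 2: an entrylog is pure garbage once none of its ledgers still exist
    // in the metadata store (ZooKeeper); the whole file can then be deleted.
    static boolean isDeletable(EntryLogMeta log, Set<Long> ledgersAliveInZk) {
        return log.containedLedgerIds().stream().noneMatch(ledgersAliveInZk::contains);
    }

    // Step 3: otherwise, compact (copy surviving entries to a new file, then
    // delete the old one) when the live ratio falls below the run's threshold.
    static boolean shouldCompact(EntryLogMeta log, boolean isMajorRun) {
        return log.liveEntryRatio() < (isMajorRun ? MAJOR_THRESHOLD : MINOR_THRESHOLD);
    }
}
```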
The process above is, in outline, how a bookie cleans up entrylog files.
Note that whether a ledger can be deleted is triggered entirely by the client side; in Pulsar, that client is the broker.
The broker runs a periodic thread (every 2 minutes by default) that cleans up ledgers whose messages have been consumed: it obtains the last position confirmed by the cursors of each topic, then deletes all ledgers that precede it in the topic's ledger list (excluding the current ledger), including the metadata in ZooKeeper, and notifies the bookies to delete the corresponding ledgers.
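A sketch of that trimming rule, with hypothetical types: everything strictly before the ledger still holding the slowest cursor's acknowledged position is deletable, and the currently open ledger is always kept:

```java
import java.util.List;
import java.util.NavigableSet;

class LedgerTrimmer {
    // topicLedgers: the topic's ledger ids in order; slowestAckLedgerId: the ledger
    // containing the last position confirmed by the slowest cursor (subscription).
    static List<Long> ledgersToDelete(NavigableSet<Long> topicLedgers, long slowestAckLedgerId) {
        // headSet(..., false) excludes the ledger still referenced by the cursor;
        // the current (last) ledger can never appear in this head set.
        return List.copyOf(topicLedgers.headSet(slowestAckLedgerId, false));
    }
}
```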
Analysis of Problems Encountered in Operation
In practice we have run into bookies with insufficient disk space many times, with large numbers of entrylog files piling up on the bookie. There are two typical causes.
Reason one:
Production messages are spread across too many topics. In an extreme scenario, for example, messages are produced sequentially across 10,000 topics, one message per topic per round. Under this pattern, within any short time window no topic's ledger reaches its rollover threshold (lifetime or storage size), so active ledger ids end up scattered across a large number of entrylog files, and none of those entrylog files can be deleted or compacted in time.
If you encounter such a scenario, you can force the ledgers to roll over by restarting. Of course, if consumption cannot keep up at that point, the ledger containing the last acknowledged position of each subscription is also in the active state and cannot be deleted.
Reason two:
During a GC pass, if a large number of entrylog files already exist and many of them meet the minor or major GC threshold, a single minor or major GC pass can take too long, and expired entrylog files cannot be cleaned up during that time.
This is because a single cleanup pass executes sequentially, and the next round starts only after the previous round finishes. An optimization of this process has been proposed to keep an overlong sub-step from affecting the whole.
This concludes our study of Pulsar's message storage mechanism and the principle of Bookie's GC mechanism.