How to parse the message storage model of Apache Pulsar 07/09 Update SLTechnology News&Howtos

How to parse the message storage model of Apache Pulsar

2025-07-09 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Shulou(Shulou.com)05/31 Report--

Today, the editor will show you how to parse the message storage model of Apache Pulsar. The knowledge points in the article are introduced in great detail. Friends who feel helpful can browse the content of the article with the editor, hoping to help more friends who want to solve this problem find the answer to the problem. Follow the editor to learn more about "how to parse the message storage model of Apache Pulsar".

Guide reading

Apache Pulsar, a top-level project of the Apache Software Foundation, is a next-generation cloud native distributed message flow platform that integrates message, storage and lightweight functional computing. It adopts a separate computing and storage architecture, supports multi-tenant, persistent storage, and multi-room cross-regional data replication, and has streaming data storage features such as strong consistency, high throughput, low latency and high scalability.

Background

In the community, we can often see users' confusion about policies such as Backlog,storage size and retention, and some common problems, such as:

I did not set the Retention policy, why can I see that storage size is much larger than backlog size through topics stats? My msg backlog size is small, but storage size is growing all the time.

...

Message Model of Pulsar

First, let's take a look at Pulsar's message model.

As shown in the figure above, Pulsar provides the most basic processing model for pub-sub.

Producer

First, the Producer side produces the message and appends the message to the Topic in the form of append. Which Topic is distributed here will vary depending on whether the msg key is set for the message.

If msg key is set, messages will be hash based on key to distribute messages to different partitions

If msg key is not set, messages will be distributed to different partitions in the form of round robin

In the model of message distribution, Pulsar is similar to Kafka.

Consumer

In addition to Consumer, Pulsar abstracts a subscription layer for subscribing to Topic. Through the abstraction of the subscription layer, Pulsar can flexibly support two types of message queues: Queue and Streaming. Each sub can get the complete copy of all the data in this Topic, which is somewhat similar to consumer group in Kafka. Depending on the type of subscription, there can be one or more Consumer under each subscription to receive messages.

Currently, Pulsar supports the following four message subscription models:

ExclusiveFailoverShared

Key_Shared

Storage model

Messages are stored only once in the distributed log of each Partition Topic

This means that when Producer successfully sends a message to Topic, the message is stored only once in the storage layer, and no matter how many Subscription subscriptions you have to the Topic, you are actually operating on the same data. On this basis, we can see that Apache Pulsar's top-to-bottom level of abstraction is shown in the following figure:

First of all, the first layer of abstraction is Topic (Partition), which is used to store the messages information appended by Producer. Under Topic, the corresponding ledger,ledger is divided into fragments, and smaller-grained ertries,entries is stored with messages of [a] or [a batch].

Tips: in Pulsar, a batch is treated as a message on the broker side, and the specific logic of batch parsing is operated when the message is received on the consumer side.

Node: in Bookkeeper, the smallest unit that manipulates data operates at the granularity of segment.

Why do you need to do hierarchical abstraction?

The most straightforward explanation here is to ensure that the data is scattered enough and evenly distributed in each bk node. This is also one of the benefits of hierarchical sharding architecture design.

Ack mechanism

Two kinds of Ack mechanisms are supported in Pulsar, which are single Ack and batch Ack. A single Ack (AckIndividual) means that Consumer can perform Ack operations on a specific message according to the messageID of the message; batch Ack (AckCumulative) refers to Ack multiple messages at a time.

Subscription mechanism

To better understand Strorage Size and Backlog, we first need to understand the subscription mechanism in Pulsar, as shown in the following figure:

When there is a backlog of messages, you can clear the backlog of messages through clear-backlog. Clearing the backlog of messages in backlog is a relatively dangerous operation, so you will be prompted to confirm whether you want to delete messages in backlog. Clear-backlog provides a parameter of-f (--force) to block the prompt.

Producer continues to send messages to the Topic in the form of appends. The Consumer side will create a Subscription to subscribe to the Topic. When the subscription is successful, it will initialize a Cursor pointing to the location of the specific message. By default, it is Latest.

Cursor is used to store status information of consumption in a subscription.

In the figure above, we can see the message that the Topic below the subscription has successfully Receive and the Ack has lost M4. Then all message states, including M4, will be marked as deletable. In Pulsar, use MarkDeletePosition to mark this location. All messages after that represent messages that have not been consumed by this subscription.

With the passage of time, suppose that in the scenario of AckCumulative, the Consumer in the above subscription consumes some more messages. At present, the location of Cursor has moved to the position of M8, which means that all messages before M8 can be deleted.

Suppose that in the AckIndividual scenario, the Consumer in the above subscription only consumes the message M7 and sends the Ack request, and the messages M5 and M6 are still not successfully consumed, then the messages currently in the deleted state are the message before M4 and the message before M7. That is, in this scenario, there is an Ack hole in the middle of the Topic due to the use of a single Ack.

Cursor = Offset + IndevidualDeletes, Ack will trigger the movement of Cursor, but will not delete any messages

Over time, in the case of a single Ack, the Ack hole may disappear by itself, as shown in the following figure:

Above we described the movement of cursor in a single subscription in the case of a mixture of a single Ack and a batch Ack. Suppose there are multiple Subscription subscribers to this Topic, then each Subscription can get the complete Copy of the data in this Topic, that is, a Subscription will initialize a new Cursor in this Topic. The progress of consumption between each Cursor does not intersect and does not affect each other, so the situation in the following figure may occur:

In the figure above, there are two subscriptions for this Topic: Subscription-1 and Subscription-2. Consumer in Subscription-1 consumes messages before M4, and Consumer in Subscription-2 consumes messages before M8. Although the four messages between m4-m8 are consumed by Subscription-2, Subscription-1 has not yet consumed this part of the data, so this part of the message cannot be deleted. The messages currently in the deletable state are those before M4, that is, the messages consumed by the Subscription with the slowest progress of consumption in this Topic. Then there will be a problem. Suppose my current Subscription-1 is offline, and the location of its Cursor has not changed, which will cause the data in this Topic to remain undeletable all the time.

In view of the above scenario, Pulsar introduces the concept of TTL, which allows users to set the time of TTL. When the message reaches the threshold Cursor specified by TTL and still does not move, then the mechanism of TTL will be triggered and the Cursor will be moved back to the specified location automatically. One thing to note here is that what we have been emphasizing is that TTL will move the location of Cursor. So far, we have not mentioned the concept of message deletion, so don't confuse the two. All TTL will do is move the location of the Cursor without any logic associated with message deletion.

Backlog

In order to better express the data that is not consumed in Topic, Pulsar introduces the concept of Backlog to describe this part of the message. Backlog can be divided into two forms:

Topic Backlog: the slowest collection of subscribed Backlog

Subscription Backlog: a collection of unconsumed data for a single subscription level

As shown in the following figure: Backlog A belongs to Topic Backlog;Backlog A belongs to Subscription-1 Backlog;Backlog B belongs to Subscription-2 Backlog.

The Backlog changes over time, as shown in the following figure:

One thing to note here is that the backlogSize here records a message with batch, that is, a batch will be treated as a message. Because parsing the whole batch on the broker side will bring a certain burden to the broker and waste a lot of CPU resources, so the parsing of the specific batch logic is put on the Consumer side to deal with. So Backlog essentially records the number of entries we mentioned above.

In Pulsar, there are two metrics for Backlog, as follows:

MsgBacklog: records a collection of all entries that have not been Ack

BacklogSize: records the size of all messages that have not been Ack

Retention mechanism

In Apache Pulsar, BookKeeper is used as the storage layer to allow users to persist messages. In order to ensure that messages will not be persisted indefinitely, Pulsar introduces the mechanism of Retention, which allows users to configure message persistence policies. By default, the persistence mechanism is turned off, that is, after the message is Ack, it will enter the logic of deletion.

When configuring a Retention policy, you can specify the following two parameters:

Size: the threshold for persistence size. 0 means no Retention size policy is configured, and-1 means the size set is infinite.

Time: the threshold of persistence time. 0 means no Retention time policy is configured, and-1 means infinite time.

After the introduction of the Retention policy, the view represented by the entire Topic is shown below. M0-m5 represents messages that have been acknowledged by all subscriptions and have exceeded the threshold of the Retention policy, that is, these messages are ready to be deleted. Note that what I am describing here is [ready to delete] whether it can be deleted or not, which is not sure yet.

In the beginning, we abstracted step by step from the top Topic to a concrete msg. (here, for convenience of description, we ignore the concept of batch, that is, a msg is equivalent to an entry.) now we superimpose all the concepts back. Because in bk, the smallest unit allowed to operate is a segment, there is no way to delete a message at a specific msg (entry) level, and deletion operations need to be performed against a segment. As shown in the following figure:

Suppose m0-m3 belongs to segment3;m4-m7, belongs to segment2;m8-m11, belongs to segment1. According to the description in the figure above, all m0-m5 messages can be deleted, but segment 2 contains M6, and M7 does not reach the Retention threshold, so segment cannot be deleted yet.

Storage Size

In order to express the storage space occupied by current messages more conveniently, Pulsar introduces storageSize to describe the whole concept. As shown in the following figure: when the messages identified by backlog B and storageSize are the same, backlogSize is equivalent to storageSize.

When a single Ack,Retention policy is introduced and Bookkeeper is deleted based on segment, it is likely to cause scenarios where Storage Size is greater than backlog Size, as shown in the following figure:

Messages are stored only once in the distributed log of each Partition Topic

Cursor is used to store the consumption status of a subscribed Consumer.

Cursor is equivalent to offset (kafka) + individualDeletes

Ack will update the location of Cursor in Topic

When a message is Ack by all subscribers, the message enters a state that can be deleted

All messages that have not been confirmed will always be saved in Subscription backlog.

TTL can automatically update the location of Cursor by setting a time threshold

The Retention policy is used to manipulate what should be done with messages that have been Ack.

Messages are deleted in segment, not entry.

Thank you for reading, the above is the whole content of "how to parse the message storage model of Apache Pulsar", learn friends to hurry up to operate it. I believe that the editor will certainly bring you better quality articles. Thank you for your support to the website!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.