How to analyze the tiered storage of Apache Pulsar

2025-07-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how tiered storage works in Apache Pulsar and how to use it.

In some streaming use cases, users want to keep the data in a stream for a long time. Although Apache Pulsar places no limit on the size of a topic backlog, storing all data in the Pulsar cluster for a long time is expensive.

The sections below describe the tiered storage feature of Apache Pulsar (available in version 2.1 and later), which allows older data to be moved to long-term storage without affecting end users.

In a recommendation service, developers do not want to limit the size of the backlog. Take a music service as an example: each time an end user listens to a song, a message is added to a topic. This topic is used to train a recommendation algorithm, which suggests music the user may like based on what the user has already listened to. The results are then recommended to the user, and the cycle repeats.

Recommendation algorithms are not static. The data scientists at a music service continually tune them to better predict what users will like, which improves user satisfaction and engagement with the service.

However, if a modified algorithm is run only on user data generated after the modification, prediction accuracy suffers and it takes a long time to judge whether the change helped. To solve this problem, the algorithm needs to run over as much historical user data as possible.

Pulsar allows users to store a topic backlog of any size. When the cluster is about to run out of space, users only need to add new storage nodes, and the system automatically rebalances the data. However, after running this way for some time, the cost of operating the cluster becomes very high.

Pulsar addresses this cost/size trade-off with tiered storage (a feature added in Apache Pulsar 2.1). Tiered storage gives users an unlimited backlog without adding storage nodes: older topic data is offloaded to long-term storage, which costs an order of magnitude less than storing it in the Pulsar cluster. For end users, there is no noticeable difference between consuming topic data stored in the Pulsar cluster and consuming topic data in tiered storage. Messages are produced and consumed in exactly the same way in both cases.

Pulsar implements tiered storage on top of its segment-oriented architecture. The message log of a Pulsar topic consists of a sequence of segments. The last segment in the sequence is the one Pulsar is currently writing to. All earlier segments are sealed, that is, the data in them is immutable. Because the data is immutable, it can be copied to another storage system without any consistency concerns. Once the copy is complete, the data pointer in the message log metadata can be updated immediately, and the copy that Pulsar stores in Apache BookKeeper can be deleted.
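The segment life cycle described above can be illustrated with a small model. This is only a conceptual sketch, not Pulsar code: the class names `Segment` and `TopicLog`, the `location` field, and the method names are hypothetical, and the actual byte copy and the BookKeeper cleanup are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    segment_id: int
    size_bytes: int
    sealed: bool = False          # sealed segments are immutable
    location: str = "bookkeeper"  # where the log metadata currently points


@dataclass
class TopicLog:
    """Hypothetical model of a topic's message log as a list of segments."""
    segments: List[Segment] = field(default_factory=list)

    def open_segment(self, size_bytes: int) -> Segment:
        """Seal the segment currently being written and open a new one."""
        if self.segments:
            self.segments[-1].sealed = True
        segment = Segment(segment_id=len(self.segments), size_bytes=size_bytes)
        self.segments.append(segment)
        return segment

    def offload(self, segment_id: int) -> None:
        """Repoint the metadata of a sealed segment to long-term storage."""
        segment = self.segments[segment_id]
        if not segment.sealed:
            raise ValueError("only sealed (immutable) segments can be offloaded")
        # The byte copy would happen here. Only after it completes is the
        # pointer updated, so readers never observe a partially copied segment.
        segment.location = "long-term"


log = TopicLog()
for size in (100, 200, 300):
    log.open_segment(size)

log.offload(0)  # oldest sealed segment
print([(s.segment_id, s.sealed, s.location) for s in log.segments])
```

In this model the last segment stays open, so trying to offload it raises an error. That mirrors the point of the paragraph above: offloading is safe precisely because it only ever touches data that can no longer change.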

Using tiered storage in Pulsar

Pulsar currently supports Amazon S3, GCS (Google Cloud Storage), and the filesystem as long-term storage. To use S3 for tiered storage, the administrator first creates a bucket in S3, then configures each broker with the bucket name and the region where the bucket was created.

managedLedgerOffloadDriver=S3
s3ManagedLedgerOffloadRegion=eu-west-3
s3ManagedLedgerOffloadBucket=pulsar-topic-offload
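Because a missing key is an easy mistake to make when editing every broker, a small pre-flight check can help. The following is a minimal sketch that assumes the three settings are `managedLedgerOffloadDriver`, `s3ManagedLedgerOffloadRegion`, and `s3ManagedLedgerOffloadBucket`, written as plain key=value lines. The helper names `parse_properties` and `missing_offload_keys` are hypothetical and are not part of Pulsar.

```python
REQUIRED_KEYS = (
    "managedLedgerOffloadDriver",
    "s3ManagedLedgerOffloadRegion",
    "s3ManagedLedgerOffloadBucket",
)


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_offload_keys(text: str) -> list:
    """Return the required S3 offload keys that are absent or empty."""
    settings = parse_properties(text)
    return [key for key in REQUIRED_KEYS if not settings.get(key)]


sample = """
managedLedgerOffloadDriver=S3
s3ManagedLedgerOffloadRegion=eu-west-3
s3ManagedLedgerOffloadBucket=pulsar-topic-offload
"""
print(missing_offload_keys(sample))
```

The check only confirms that the keys are present and non-empty. It does not verify that the bucket exists or that the region matches it.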

Users do not configure authentication directly in Pulsar. Pulsar uses the `DefaultAWSCredentialsProviderChain`, which looks for credentials in multiple locations.

The easiest way to provide credentials is to set environment variables in conf/pulsar_env.sh.

For more information about configuring authentication methods, see the tiered storage documentation:

http://pulsar.apache.org/docs/en/concepts-tiered-storage/

Once all brokers are configured, you can start using tiered storage. Offloading can run automatically, or you can trigger it manually.

Automatically migrate data to long-term storage

Administrators can set a size threshold policy on a namespace. Once the policy is set, if the amount of data that any topic in the namespace stores on the Pulsar cluster exceeds the threshold, the topic offloads segments to long-term storage until the data size on the Pulsar cluster is back within the threshold.

For example, the following command tells topics in the namespace to start offloading segments once their data on the Pulsar cluster exceeds 1 GB:

pulsar-admin namespaces set-offload-threshold --size 1G my-tenant/my-namespace

When any topic in the namespace exceeds the threshold, the topic moves the data to long-term storage, freeing up storage space on the Pulsar cluster.
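The threshold rule can be sketched as a small selection function: walk the sealed segments from oldest to newest and mark them for offload until what remains on the cluster fits within the threshold. This is an illustration under stated assumptions, not Pulsar's implementation. The function names are hypothetical, and the size parser assumes binary multiples (1K = 1024 bytes), which may differ from how pulsar-admin interprets suffixes.

```python
from typing import List

# Assumption: size suffixes are binary multiples.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> int:
    """Convert a size string such as '1G' or '10m' to bytes."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def select_for_offload(sealed_sizes: List[int], open_size: int, threshold: int) -> List[int]:
    """Return indexes of sealed segments to offload, oldest first.

    sealed_sizes lists the sealed segments from oldest to newest.
    open_size is the segment still being written, which is never offloaded.
    """
    remaining = sum(sealed_sizes) + open_size
    chosen = []
    for index, size in enumerate(sealed_sizes):
        if remaining <= threshold:
            break
        chosen.append(index)
        remaining -= size
    return chosen


# Three sealed segments plus an open one, with a threshold of 500 bytes.
print(select_for_offload([400, 300, 200], open_size=100, threshold=500))
```

With these numbers the topic holds 1000 bytes. Offloading the two oldest segments leaves 300 bytes on the cluster, which is within the threshold, so the third sealed segment stays where it is.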

Manual offload

In addition to automatic offloading, an offload can be triggered manually on a single topic through the REST interface or the command line interface. When triggering from the command line, the user must specify the maximum amount of data to keep on the Pulsar cluster for the topic. If the topic's data on the Pulsar cluster exceeds this threshold, segments of the topic are moved to long-term storage until the data size on the Pulsar cluster is within the threshold. Older segments are moved first.

pulsar-admin topics offload --size-threshold 10m my-tenant/my-namespace/topic1
