What is the system architecture and design concept of Apache Pulsar?

2025-07-04 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/01 Report--

Today we look at the system architecture and design concepts of Apache Pulsar, which many people may not understand well. The editor has summarized the following content to make them easier to follow, and hopefully you will take something away from this article.

We will introduce some of the system architecture and design concepts behind Apache Pulsar, and finally compare it with the architecture of Apache Kafka.

Layered architecture of Pulsar

The most fundamental difference between Apache Pulsar and other messaging systems is its layered (hierarchical) architecture. An Apache Pulsar cluster consists of two layers: a stateless service layer, made up of a set of Brokers that receive and deliver messages, and a stateful persistence layer, made up of a set of Apache BookKeeper storage nodes called bookies, which store messages durably. The following figure shows a typical deployment of Apache Pulsar.

The Pulsar client provides the Producer and Consumer interfaces, and applications use the Pulsar client to connect to Brokers to publish and consume messages.

The Pulsar client does not interact directly with the storage layer, Apache BookKeeper, and it has no direct access to ZooKeeper either. This isolation is the basis on which Pulsar implements a secure, multi-tenant, unified authentication model.

Apache Pulsar provides clients for multiple languages, including Java, C++, Python, Go, and WebSocket.

Apache Pulsar also provides a set of Kafka-compatible APIs. Users can migrate existing Kafka applications by updating dependencies and pointing the clients at the Pulsar cluster, so that existing Kafka applications can work with Apache Pulsar without any code changes.

Broker layer: stateless service layer

Brokers form the stateless service layer of Apache Pulsar. The service layer is "stateless" because Brokers do not store any message data locally. Messages for Pulsar topics are stored in a distributed log storage system, Apache BookKeeper, which is covered in more detail in the next section.

Pulsar assigns each topic partition to a Broker, which is called the owner of that topic partition. Pulsar producers and consumers connect to the owner Broker of a topic partition to send messages to it and consume messages from it.

If a Broker fails, Pulsar automatically moves the topic partitions it owned to the remaining available Brokers in the cluster. Because Brokers are stateless, migrating a topic simply transfers ownership from one Broker to another; no data is copied in the process.
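The ownership transfer described above can be pictured with a small toy model. This is only a sketch: the class name `ToyCluster`, the broker and partition names, and the "least-loaded survivor" rule are assumptions made for illustration, not Pulsar's actual load-balancing logic. The point it shows is that failover only rewrites ownership metadata.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy model: brokers hold only ownership metadata, never message data."""
    brokers: set
    owner: dict = field(default_factory=dict)  # partition name -> broker name

    def fail_broker(self, broker):
        """Remove a broker and hand its partitions to the least-loaded survivors."""
        self.brokers.discard(broker)
        if not self.brokers:
            raise RuntimeError("no brokers left to take ownership")
        for partition, current in sorted(self.owner.items()):
            if current != broker:
                continue
            # Count how many partitions each surviving broker owns right now.
            load = {b: 0 for b in self.brokers}
            for b in self.owner.values():
                if b in load:
                    load[b] += 1
            # Hypothetical rule: least-loaded broker wins, ties broken by name.
            self.owner[partition] = min(sorted(load), key=load.get)


cluster = ToyCluster(
    brokers={"broker-1", "broker-2", "broker-3", "broker-4"},
    owner={
        "topic1-part1": "broker-1",
        "topic1-part2": "broker-2",
        "topic1-part3": "broker-3",
        "topic1-part4": "broker-4",
    },
)
cluster.fail_broker("broker-2")
print(cluster.owner["topic1-part2"])  # a surviving broker now owns the partition
```

Nothing in `fail_broker` touches message data, which is why the handover in this model costs only a metadata update.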

The following figure shows a Pulsar cluster with four Brokers and four topic partitions distributed across them. Each Broker owns one topic partition and serves messages for it.

BookKeeper layer: persistent storage layer

Apache BookKeeper is the persistent storage layer of Apache Pulsar. Each topic partition in Apache Pulsar is essentially a distributed log stored in Apache BookKeeper.

Each distributed log is divided into Segments. Each Segment is stored as a Ledger in Apache BookKeeper and is distributed evenly across multiple Bookies (the storage nodes of Apache BookKeeper) in the BookKeeper cluster.

A new Segment is created in any of the following cases: the current Segment reaches the configured size, the configured rollover time elapses, or the owner of the Segment changes.
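The three triggers can be written as a single predicate. The sketch below is illustrative only: `RolloverPolicy`, its field names, and the default values are hypothetical and are not Pulsar or BookKeeper configuration keys.

```python
from dataclasses import dataclass


@dataclass
class RolloverPolicy:
    # Hypothetical names and defaults, chosen only for this illustration.
    max_segment_bytes: int = 64 * 1024 * 1024
    max_segment_age_seconds: float = 4 * 60 * 60


def should_roll(policy, segment_bytes, segment_age_seconds, owner_changed):
    """Return True when any of the three triggers from the text fires."""
    return (
        segment_bytes >= policy.max_segment_bytes
        or segment_age_seconds >= policy.max_segment_age_seconds
        or owner_changed
    )


policy = RolloverPolicy()
print(should_roll(policy, 10_000, 60.0, owner_changed=False))  # False
print(should_roll(policy, 10_000, 60.0, owner_changed=True))   # True
```

A small, young Segment is still closed when ownership changes, which matches the third trigger in the text.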

Through segmentation, the messages in a topic partition can be distributed evenly across all Bookies in the cluster. This means that the size of a topic partition is not limited by the capacity of a single node; instead, it can grow to the total capacity of the entire BookKeeper cluster.

The following figure illustrates a topic partition divided into x Segments. Each Segment is stored as 3 copies, and all Segments are distributed across 4 Bookies.
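To make the "3 copies on 4 Bookies" layout concrete, here is a minimal sketch that spreads Segments with a simple rotating assignment. The rotation rule and the bookie names are assumptions for illustration; BookKeeper's real placement policies are more elaborate.

```python
from collections import Counter


def place_segments(num_segments, bookies, replicas=3):
    """Assign each segment to `replicas` distinct bookies, rotating the start."""
    if replicas > len(bookies):
        raise ValueError("need at least as many bookies as replicas")
    placement = {}
    for seg in range(num_segments):
        # Segment `seg` starts at bookie index `seg` and wraps around the list.
        placement[seg] = [
            bookies[(seg + i) % len(bookies)] for i in range(replicas)
        ]
    return placement


bookies = ["bookie-1", "bookie-2", "bookie-3", "bookie-4"]
placement = place_segments(8, bookies)
copies_per_bookie = Counter(
    b for replica_set in placement.values() for b in replica_set
)
print(copies_per_bookie)  # every bookie ends up holding 6 segment copies
```

With 8 Segments and 3 copies each, the 24 copies land evenly on the 4 Bookies, so no single node has to hold the whole partition.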

Segment-centric storage

The layered architecture that separates serving from storage, and Segment-centric storage (built on Apache BookKeeper), are two key design concepts of Apache Pulsar. These two foundations provide many important benefits for Pulsar:

Unlimited topic partition storage

Immediate expansion without data migration

Seamless Broker failure recovery

Seamless cluster capacity expansion

Seamless storage (Bookie) failure recovery

Independent scalability

Next, let's look at these benefits one by one.

Unlimited topic partition storage

Because a topic partition is divided into Segments and stored in a distributed manner in Apache BookKeeper, its capacity is not limited by the capacity of any single node. Instead, a topic partition can grow to the total capacity of the entire BookKeeper cluster, and the cluster capacity can be expanded simply by adding Bookie nodes. This is the key to how Apache Pulsar supports storing unbounded stream data and processing it in an efficient, distributed manner. Distributed log storage built on Apache BookKeeper is critical for unifying messaging and storage.
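A rough back-of-the-envelope model shows how capacity follows the number of Bookies. The function name and the numbers are hypothetical, and the result is an idealized upper bound that assumes Segments can be spread perfectly and ignores metadata and journal overhead.

```python
def topic_capacity_gb(bookie_capacities_gb, replicas):
    """Idealized upper bound on logical data when segments use every bookie."""
    if replicas < 1 or replicas > len(bookie_capacities_gb):
        raise ValueError("replicas must be between 1 and the number of bookies")
    # Each logical byte is stored `replicas` times somewhere in the cluster.
    return sum(bookie_capacities_gb) / replicas


before = topic_capacity_gb([1000, 1000, 1000, 1000], replicas=3)
after = topic_capacity_gb([1000, 1000, 1000, 1000, 1000, 1000], replicas=3)
print(round(before), round(after))  # 1333 2000
```

Adding two Bookies raises the bound without touching any existing Segment, which is the behavior the paragraph describes.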

Immediate expansion without data migration

Because message serving and message storage are split into two layers, moving a topic partition from one Broker to another can be done almost instantly, without any data rebalancing (copying data from one node to another). This property is critical for many aspects of high availability, such as cluster expansion and fast reaction to Broker and Bookie failures. The examples below explain this in more detail.

Seamless Broker failure recovery

The following figure shows an example of how Pulsar handles a Broker failure. In the example, Broker 2 goes offline for some reason, such as a power outage. Pulsar detects that Broker 2 is down and immediately transfers ownership of Topic1-Part2 from Broker 2 to Broker 3. Because data storage and data serving are separated in Pulsar, Broker 3 does not need to copy the partition's data when it takes over ownership of Topic1-Part2. If new data arrives, it is immediately appended and stored as Segment x + 1 of Topic1-Part2. Segment x + 1 is distributed and stored on Bookies 1, 2, and 4. Because no data has to be re-copied, the ownership transfer happens immediately, without sacrificing the availability of the topic partition.

Seamless cluster capacity expansion

The following figure illustrates how Pulsar handles cluster capacity expansion. While Broker 2 is writing messages to Segment X of Topic1-Part2, Bookie X and Bookie Y are added to the cluster. Broker 2 immediately discovers the newly added Bookies X and Y and then tries to store the messages of Segments X + 1 and X + 2 on them. The new Bookies are put to use right away and start taking traffic immediately, without any data being re-copied. In addition to rack-aware and zone-aware policies, Apache BookKeeper provides a resource-aware placement policy to keep traffic balanced across all storage nodes in the cluster.
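One way to picture resource-aware placement is a greedy rule that puts each new Segment on the least-used Bookies. This is a simplified sketch, not BookKeeper's placement policy: the function `pick_ensemble`, the bookie names, and the usage figures are all assumptions.

```python
def pick_ensemble(used_gb, replicas):
    """Choose the `replicas` bookies with the least stored data (ties by name)."""
    if replicas > len(used_gb):
        raise ValueError("not enough bookies for the requested replica count")
    return sorted(used_gb, key=lambda b: (used_gb[b], b))[:replicas]


# Existing bookies already hold data; bookie-x and bookie-y have just joined.
used_gb = {"bookie-1": 300, "bookie-2": 310, "bookie-3": 290, "bookie-4": 305}
used_gb.update({"bookie-x": 0, "bookie-y": 0})

ensemble = pick_ensemble(used_gb, replicas=3)
print(ensemble)  # ['bookie-x', 'bookie-y', 'bookie-3']
```

Only the next Segment is steered to the new nodes; the data already stored on bookie-1 to bookie-4 stays where it is.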

Seamless storage (Bookie) failure recovery

The following figure illustrates how Pulsar (through Apache BookKeeper) handles a disk failure on a bookie. A disk failure corrupts Segment 4 stored on bookie 2. A background process in Apache BookKeeper detects the error and repairs it by re-replicating the affected data.

Replica repair in Apache BookKeeper is a fast, many-to-many repair at the Segment (or even Entry) level. It is finer-grained than re-copying an entire topic partition, because only the necessary data is replicated. In this example, Apache BookKeeper can read the messages of Segment 4 from bookie 3 and bookie 4 and repair Segment 4 on bookie 1. All replica repair is done in the background and is transparent to Brokers and applications.
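The repair in this example can be sketched as a toy function that touches only the damaged Segment. The data layout, the function name `repair_segment`, and the "first spare bookie" rule are assumptions for illustration, not BookKeeper's recovery implementation.

```python
def repair_segment(placement, segment, failed_bookie, all_bookies):
    """Replace the lost replica of one segment; return (sources, target)."""
    survivors = [b for b in placement[segment] if b != failed_bookie]
    if not survivors:
        raise RuntimeError("no surviving replica to read from")
    # Any bookie that does not already hold this segment can take the new copy.
    candidates = [b for b in all_bookies if b not in placement[segment]]
    if not candidates:
        raise RuntimeError("no spare bookie available for the new replica")
    target = candidates[0]
    placement[segment] = survivors + [target]
    return survivors, target


all_bookies = ["bookie-1", "bookie-2", "bookie-3", "bookie-4"]
placement = {
    3: ["bookie-1", "bookie-2", "bookie-3"],
    4: ["bookie-2", "bookie-3", "bookie-4"],
}
sources, target = repair_segment(placement, 4, "bookie-2", all_bookies)
print(sources, target)  # ['bookie-3', 'bookie-4'] bookie-1
```

Segment 3 is left untouched even though it also has a replica on bookie 2, because in this scenario only Segment 4 was damaged.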

Even when a Bookie node fails, a newly available Bookie can take the place of the failed one, so all Brokers can continue to accept writes without sacrificing the availability of the topic partition.

Independent scalability

Because the message service layer and the persistent storage layer are separate, Apache Pulsar can scale the storage layer and the service layer independently. This independent scaling is more cost-effective:

When you need to support more consumers or producers, you can simply add more Brokers. Topic partitions are immediately rebalanced across the Brokers, and ownership of some topic partitions is immediately transferred to the new Brokers.

When you need more storage space to retain messages for longer, you only need to add more Bookies. With intelligent resource-aware data placement, traffic automatically shifts to the new Bookies. Apache Pulsar does not involve unnecessary data migration: old data is not copied from existing storage nodes to new ones.

Comparison with Kafka

Apache Kafka and Apache Pulsar have similar messaging concepts. Clients interact with the messaging system through topics, and each topic can be divided into multiple partitions. The fundamental difference is that Apache Kafka is partition-centric in its storage, while Apache Pulsar is Segment-centric.

The above figure shows the difference between partition-centric and Segment-centric systems.

In Apache Kafka, a partition can only be stored on a single node and replicated to other nodes, so its capacity is limited by the capacity of the smallest of those nodes. This means that capacity expansion requires rebalancing partitions, which in turn requires re-copying entire partitions to balance data and traffic onto the newly added brokers.
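The two costs named in this paragraph can be expressed as simple arithmetic. The sketch below is a toy model of a partition-centric system in general; the function names, node sizes, and partition sizes are hypothetical and are not measurements of Kafka.

```python
def partition_limit_gb(replica_node_capacities_gb):
    """A whole partition must fit on every node that hosts one of its replicas."""
    return min(replica_node_capacities_gb)


def data_copied_on_rebalance_gb(partition_sizes_gb, partitions_moved):
    """Every partition moved to a new node is copied there in full."""
    return sum(partition_sizes_gb[p] for p in partitions_moved)


partition_sizes_gb = {"p0": 500, "p1": 800, "p2": 300}

print(partition_limit_gb([2000, 1000, 4000]))                       # 1000
print(data_copied_on_rebalance_gb(partition_sizes_gb, ["p1"]))      # 800
print(data_copied_on_rebalance_gb(partition_sizes_gb, ["p0", "p2"]))  # 800
```

In this model the copy volume grows with the size of the partitions being moved, which is the cost the text contrasts with a Segment-centric design, where only newly written Segments go to new nodes.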

Re-copying data is very expensive and error-prone, and it consumes network bandwidth and I/O. Operators must be very careful when performing this operation to avoid harming the production system.

In partition-centric systems such as Kafka, re-copying partition data does not happen only on cluster expansion. Many other events can trigger it, such as replica failure, disk failure, or machine failure. During re-copying, the partition is usually unavailable until the re-copy completes. For example, if a partition is configured with 3 replicas and one replica is lost, the entire partition must be re-copied before the partition becomes available again.

Users often overlook this shortcoming until they hit a failure, because in many cases only data cached in memory is read within a short period. Once data is persisted to disk, users will inevitably run into data loss and failure recovery problems, especially when data must be retained for a long time.

In contrast, Apache Pulsar also uses the partition as the logical unit, but uses the Segment as the physical storage unit. Partitions are segmented over time and spread evenly across the cluster, a design intended to scale effectively and quickly.

Because Pulsar is Segment-centric, expanding capacity requires no data rebalancing or copying, and old data is not re-replicated. This is thanks to Apache BookKeeper, a scalable, Segment-centric distributed log storage system.

By building on distributed log storage, Pulsar maximizes the placement options for Segments to achieve high write and read availability. For example, with BookKeeper and a replication setting of 2, a topic partition can be written as long as any 2 Bookies are up. For read availability, a user can read the topic partition, without any inconsistency, as long as one replica of it is alive.
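These two availability rules reduce to simple checks. The sketch below assumes that a replication setting of 2 means a new write needs any 2 live Bookies, and that a read needs at least one live replica of the Segment being read; the function and bookie names are hypothetical.

```python
def can_write(live_bookies, replicas):
    """A new segment can be placed as long as enough bookies are up."""
    return len(live_bookies) >= replicas


def can_read(segment_replicas, live_bookies):
    """A segment is readable while at least one of its replicas is alive."""
    return any(b in live_bookies for b in segment_replicas)


live = {"bookie-3", "bookie-4"}                  # bookie-1 and bookie-2 are down
print(can_write(live, replicas=2))               # True
print(can_read(["bookie-1", "bookie-3"], live))  # True
print(can_read(["bookie-1", "bookie-2"], live))  # False
```

Writes do not depend on any particular pair of Bookies being up, which is what gives Segment placement its flexibility in this model.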
