This article walks through the architecture and core concepts of Apache Pulsar, a next-generation distributed message queue, for readers who are new to it.
Introduction to Pulsar
Apache Pulsar is an enterprise-grade distributed messaging system, originally developed at Yahoo and open-sourced in 2016; it is currently incubating under the Apache Foundation. Pulsar has been running in Yahoo's production environment for more than three years, mainly serving Mail, Finance, Sports, Flickr, the Gemini Ads platform, and Sherpa, Yahoo's KV store.
Pulsar can be called a next-generation message queue mainly because of the following features:
Linear scalability. Capacity can be expanded smoothly to hundreds of nodes (expanding a Kafka cluster requires copying data between nodes, which consumes a lot of system resources, whereas Pulsar needs no such data movement).
High throughput. Proven in Yahoo's production environment at millions of messages per second.
Low latency. Latency stays low (< 5 ms) even under heavy message volume.
Durable persistence. Pulsar's persistence layer is built on top of Apache BookKeeper, which provides I/O isolation between writes and reads.
Geo-replication. Pulsar supports replication across multiple regions / availability zones as a first-class feature. Users only need to configure the availability zones, and messages are continuously replicated to the others; when an availability zone goes down or a network partition occurs, Pulsar keeps retrying later.
Flexible deployment. Pulsar runs on bare metal as well as in containerized environments such as Docker and Kubernetes and on different cloud vendors; for local development, a single command starts the entire environment.
Topics support multiple consumption modes: exclusive, shared, failover.
Architecture Overview
At the top level, a Pulsar instance consists of one or more clusters, and the clusters within an instance can replicate data to each other. A Pulsar deployment usually contains the following components:
Broker: processes messages from producers and dispatches them to consumers. A global ZooKeeper cluster handles coordination tasks such as geo-replication; the messages themselves are stored in BookKeeper, and each cluster additionally needs its own ZooKeeper cluster to store metadata.
BookKeeper cluster: contains multiple bookies internally to persist messages.
ZooKeeper cluster: stores configuration and coordination metadata, both globally (shared across clusters) and within each cluster.
Broker
In Kafka and RocketMQ, the broker is responsible for storing message data and consumer offsets, but the broker in Pulsar is different from both: it is a stateless node that is mainly responsible for three things:
Exposing a REST interface to execute administrators' commands, topic-owner lookups, and so on.
Running an asynchronous TCP server for communication between nodes; the wire protocol currently uses Protocol Buffers, which Google open-sourced.
To support geo-replication, republishing messages produced in its own cluster to the other availability zones.
Messages are first published to BookKeeper, and a copy is then cached in the broker's local memory, so reads are generally served from memory. Finding the topic owner mentioned above matters because a ledger in BookKeeper allows only a single writer, so each topic is owned by exactly one broker; the REST API can be called to get the current owner of a topic.
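As a minimal sketch of that owner lookup, the code below uses the Pulsar Java admin client to ask which broker currently owns a topic; the admin URL and topic name are illustrative placeholders, not values from this article.

import org.apache.pulsar.client.admin.PulsarAdmin;

public class TopicOwnerLookup {
    public static void main(String[] args) throws Exception {
        // Connect to a broker's HTTP admin endpoint (placeholder URL).
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();
        // Returns the URL of the broker that currently owns this topic.
        String owner = admin.lookups().lookupTopic("persistent://public/default/my-topic");
        System.out.println("Owner broker: " + owner);
        admin.close();
    }
}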
BookKeeper
BookKeeper is a scalable, fault-tolerant, low-latency distributed storage service. Its most basic unit is the record, which is essentially a byte array, and an append-only sequence of records is called a ledger. BK replicates records to multiple storage nodes, called bookies, to achieve availability and fault tolerance. BK was designed from the start with failures in mind: bookies can go down, lose data, or return corrupt data, and the system remains correct as long as enough bookies in the cluster are healthy.
In Pulsar, each partitioned topic is backed by a series of ledgers. A ledger is an append-only data structure that allows only a single writer, and each record in a ledger is replicated to multiple bookies. Once a ledger is closed (for example, because the owning broker went down or the ledger reached a certain size), it becomes read-only, and once its data is no longer needed (for example, all consumers have consumed the messages it contains), it is deleted.
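To make the ledger model concrete, here is a minimal sketch using BookKeeper's classic Java ledger API (not Pulsar's client): it creates a ledger as its single writer, appends a few entries, closes it, and reads it back. The ZooKeeper address, digest type, and password are assumptions for illustration.

import java.util.Enumeration;
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;

public class LedgerSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the bookies through ZooKeeper (placeholder address).
        BookKeeper bk = new BookKeeper("localhost:2181");

        // Create a ledger; this client is its only writer.
        LedgerHandle writer = bk.createLedger(BookKeeper.DigestType.CRC32, "secret".getBytes());
        for (int i = 0; i < 3; i++) {
            writer.addEntry(("entry-" + i).getBytes()); // append-only writes
        }
        long ledgerId = writer.getId();
        writer.close(); // a closed ledger is read-only

        // Re-open the ledger for reading and iterate over its entries.
        LedgerHandle reader = bk.openLedger(ledgerId, BookKeeper.DigestType.CRC32, "secret".getBytes());
        Enumeration<LedgerEntry> entries = reader.readEntries(0, reader.getLastAddConfirmed());
        while (entries.hasMoreElements()) {
            System.out.println(new String(entries.nextElement().getEntry()));
        }
        reader.close();
        bk.close();
    }
}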
BookKeeper's main advantage is that it guarantees read consistency over ledgers in the presence of failures. Because a ledger can only ever have one writer, there is no write contention, so BK can write efficiently. When a broker restarts after an outage, Pulsar starts a recovery operation, reading from ZooKeeper the last ledger written and the last committed record; after that, all consumers are guaranteed to see the same content.
Before version 0.8, Kafka stored consumption progress in ZooKeeper, but ZooKeeper is essentially a centralized service built on a single log. Simply put, ZooKeeper's performance does not scale linearly as you add nodes; it actually degrades, because more nodes mean the log has to be synchronized to more nodes, so QPS is bounded by single-node performance. From version 0.8 onward, Kafka therefore stores consumption progress in an internal Kafka topic. RocketMQ followed a similar path, with several implementations over time (ZooKeeper, a database, and others); the current version stores it on the local file system. Pulsar adopts the same idea as Kafka and stores consumption progress in a BookKeeper ledger.
Metadata
Pulsar's metadata is mainly stored in ZooKeeper. For example, configuration related to the different availability zones is stored in the global ZooKeeper, while each cluster's ZooKeeper stores data such as which ledgers a given topic's messages were written to, the broker's current instrumentation and load data, and so on.
Core concepts of Pulsar
Topic
The core concept in a publish/subscribe system is the topic. Simply put, a topic can be understood as a pipe: producers throw messages into one end of the pipe, consumers read messages out of the other end, and multiple consumers can read from the same pipe at the same time.
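To show the pipe in action, here is a minimal sketch with the Pulsar Java client, assuming a local broker at pulsar://localhost:6650 and a topic named my-topic (both placeholders): a consumer subscribes, a producer sends one message, and the consumer receives and acknowledges it.

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class ProduceConsumeSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker URL
                .build();

        // Subscribe first so the message sent below is retained for this subscription.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("my-topic")
                .subscriptionName("my-subscription")
                .subscribe();

        // Producer: throw a message into the pipe.
        Producer<byte[]> producer = client.newProducer()
                .topic("my-topic")
                .create();
        producer.send("hello pulsar".getBytes());

        // Consumer: read the message out of the other end and acknowledge it.
        Message<byte[]> msg = consumer.receive();
        System.out.println("Received: " + new String(msg.getData()));
        consumer.acknowledge(msg);

        producer.close();
        consumer.close();
        client.close();
    }
}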
Each topic can be divided into multiple partitions, and different partitions of the same topic contain different messages. Each message added to a partition is assigned a unique offset, and messages within the same partition are ordered, so a client can hash on, for example, the user ID so that all of a given user's messages land in the same partition, which avoids race conditions to some extent.
Through partitioning, a large number of messages are spread across different nodes for processing, achieving high throughput. By default, Pulsar topics are non-partitioned, but you can create a topic with a given number of partitions through the CLI or the admin interface, as sketched below.
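Here is a sketch of creating a partitioned topic with the Java admin client; the admin URL, topic name, and partition count are placeholders (the pulsar-admin CLI offers the equivalent operation).

import org.apache.pulsar.client.admin.PulsarAdmin;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder admin endpoint
                .build();
        // Create a topic with 4 partitions (placeholder topic name).
        admin.topics().createPartitionedTopic("persistent://public/default/my-topic", 4);
        admin.close();
    }
}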
By default, Pulsar balances producers and consumers automatically, but sometimes clients want to route messages according to their own business rules. Pulsar supports the following routing modes out of the box: single partition, round-robin, hash, and custom (you implement the relevant interface to define your own routing); see the sketch after this paragraph.
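A rough sketch of selecting a routing mode with the Java client follows; the broker URL and topic name are placeholders. When a key is set, messages sharing that key are hashed to the same partition.

import org.apache.pulsar.client.api.MessageRoutingMode;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class RoutingModeSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker URL
                .build();

        // Round-robin routing across the partitions of the topic.
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/my-partitioned-topic")
                .messageRoutingMode(MessageRoutingMode.RoundRobinPartition)
                .create();

        // Keyed messages are routed by hashing the key, keeping one user's
        // messages in one partition (ties back to the hashing example above).
        producer.newMessage()
                .key("user-42")
                .value("hello".getBytes())
                .send();

        producer.close();
        client.close();
    }
}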
Consumption modes
The consumption mode determines exactly how messages are distributed to consumers. Pulsar supports several different modes: exclusive, shared, and failover.
Exclusive: a topic can be consumed by only one consumer at a time. This is Pulsar's default mode.
Shared: also known as round-robin mode; multiple consumers can connect to the same topic, and messages are distributed to them in turn. When a consumer goes down or disconnects, any messages sent to it but not yet acknowledged are rescheduled and distributed to the other consumers.
Failover: multiple consumers can connect to the same topic and are sorted lexicographically by name; the first consumer, called the master, consumes the messages. When the master disconnects, all unacknowledged messages and the messages remaining in the queue are distributed to the next consumer.
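As a rough sketch, the consumption mode is chosen when a consumer subscribes; the broker URL, topic, and subscription name below are placeholders.

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class SharedConsumerSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker URL
                .build();

        // Several consumers using the same subscription name in Shared mode
        // receive messages from the topic in turn.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("my-topic")
                .subscriptionName("shared-subscription")
                .subscriptionType(SubscriptionType.Shared) // or Exclusive / Failover
                .subscribe();

        Message<byte[]> msg = consumer.receive();
        System.out.println("Got: " + new String(msg.getData()));
        consumer.acknowledge(msg);

        consumer.close();
        client.close();
    }
}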
Pulsar also provides a Reader interface, which accepts a message ID, such as MessageId.earliest, to start reading from the earliest message.
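A minimal Reader sketch with the Java client, assuming the same placeholder broker URL and topic as above:

import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.MessageId;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Reader;

public class ReaderSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker URL
                .build();

        // Start reading from the earliest message retained on the topic.
        Reader<byte[]> reader = client.newReader()
                .topic("my-topic")
                .startMessageId(MessageId.earliest)
                .create();

        while (reader.hasMessageAvailable()) {
            Message<byte[]> msg = reader.readNext();
            System.out.println("Read: " + new String(msg.getData()));
        }

        reader.close();
        client.close();
    }
}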