This article gives an overview of Apache Pulsar: its core concepts and terminology, how it partitions data, how it guarantees persistence, and how it is used in production.
Apache Pulsar (Incubator Project) is an enterprise-level publish and subscribe (pub-sub) messaging system, originally developed by Yahoo and open source at the end of 2016, and is now an incubator project of the Apache Software Foundation. Pulsar has been running in Yahoo production environment for more than three years, helping the main applications of Yahoo, such as Yahoo Mail, Yahoo Finance, Yahoo Sports, Flickr, Gemini advertising platform and Yahoo distributed key storage system Sherpa.
Concepts and terminology
Applications that send data to Pulsar are called producers, while applications that read data from Pulsar are called consumers (sometimes also called subscribers). The topic is the core resource in Pulsar: a topic can be thought of as a channel to which producers send data and from which consumers pull data.
Figure 1: producers, consumers, and topics
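To make the relationship between producers, consumers, and topics concrete, here is a minimal sketch using the Apache Pulsar Java client. The service URL, topic name, and subscription name are placeholders, and the sketch uses the builder-style API of current Pulsar releases, which differs slightly from the incubator-era API described in this article.

```java
import org.apache.pulsar.client.api.*;

public class MinimalPubSub {
    public static void main(String[] args) throws Exception {
        // Connect to a Pulsar broker (placeholder URL).
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // A producer publishes messages to a topic.
        Producer<byte[]> producer = client.newProducer()
                .topic("my-topic")
                .create();
        producer.send("hello pulsar".getBytes());

        // A consumer subscribes to the same topic and pulls messages.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("my-topic")
                .subscriptionName("my-subscription")
                .subscribe();
        Message<byte[]> msg = consumer.receive();
        consumer.acknowledge(msg);

        producer.close();
        consumer.close();
        client.close();
    }
}
```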
Pulsar is built to support multi-tenant application scenarios. Pulsar's multi-tenancy mechanism involves two kinds of resources: properties and namespaces. A property represents a tenant in the system. Suppose you have a Pulsar cluster that supports multiple applications (as at Yahoo); each property in the cluster can represent an organization's team, a core function, or a product line. A property can contain multiple namespaces, and a namespace can contain any number of topics.
Figure 2: the relationship between Pulsar components
The namespace is the basic administrative unit in Pulsar. At the namespace level we can set permissions, adjust replication options, manage cross-cluster data replication, control message expiration, and perform other critical tasks. Topics in a namespace inherit its configuration, so all topics in the same namespace can be configured at once. There are two types of namespaces:
- local - the namespace is visible only within the cluster that owns it.
- global - the namespace is visible to multiple clusters, whether within the same data center or across geographically separate data centers. This depends on cluster replication being enabled.
Although local and global namespaces differ in scope, both can be shared across different teams or organizations. An application that has write permission on a namespace can write data to any topic within that namespace; if it writes to a topic that does not yet exist, the topic is created automatically.
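As an illustration of namespace-level configuration, the sketch below uses the Pulsar Java admin client to set a message time-to-live on a namespace, which every topic in that namespace then inherits. The admin URL and the namespace name are placeholders; the call follows the current admin API, in which what this article calls a property is named a tenant.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class NamespaceConfig {
    public static void main(String[] args) throws Exception {
        // Connect to the broker's HTTP admin endpoint (placeholder URL).
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // Set a message TTL of one hour at the namespace level; every
        // topic in "my-tenant/my-namespace" inherits this policy.
        admin.namespaces().setNamespaceMessageTTL("my-tenant/my-namespace", 3600);

        admin.close();
    }
}
```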
Each namespace can contain one or more topics, each topic can have multiple subscriptions, and every subscription receives all of the messages published to the topic. To give applications more flexibility, Pulsar provides three subscription types that can coexist on the same topic:
- exclusive subscription - only one consumer can be attached to the subscription at a time.
- shared subscription - multiple consumers can attach to the subscription, and each consumer receives a portion of the messages.
- failover subscription - multiple consumers can connect to the topic, but only one consumer receives messages; the other consumers start receiving messages only when the current consumer fails.
Figure 3 illustrates these three subscription types (a client-side sketch follows the figure). Pulsar's subscription mechanism decouples message producers from consumers, giving applications more flexibility without adding complexity or development effort.
Figure 3: different types of Pulsar subscriptions
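As a client-side illustration of these subscription types, the following sketch (Java client, placeholder URL, topic, and subscription names) shows how the type is chosen when a consumer subscribes.

```java
import org.apache.pulsar.client.api.*;

public class SubscriptionTypes {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder broker URL
                .build();

        // Exclusive: only one consumer may attach to this subscription.
        Consumer<byte[]> exclusive = client.newConsumer()
                .topic("my-topic")
                .subscriptionName("exclusive-sub")
                .subscriptionType(SubscriptionType.Exclusive)
                .subscribe();

        // Shared: several consumers share the subscription and each
        // receives a portion of the messages.
        Consumer<byte[]> shared = client.newConsumer()
                .topic("my-topic")
                .subscriptionName("shared-sub")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        // Failover: several consumers may connect, but only one receives
        // messages until it fails and another takes over.
        Consumer<byte[]> failover = client.newConsumer()
                .topic("my-topic")
                .subscriptionName("failover-sub")
                .subscriptionType(SubscriptionType.Failover)
                .subscribe();

        exclusive.close();
        shared.close();
        failover.close();
        client.close();
    }
}
```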
Data partitioning
The data written to a topic may amount to only a few MB, or to several TB, so a topic's throughput can be very low at times and very high at others, depending on the number of consumers. How should topics with such widely varying throughput be handled? To solve this problem, Pulsar can distribute a topic's data across multiple machines, in units known as partitions.
Partitioning is a very common technique for guaranteeing high throughput when dealing with large amounts of data. By default, Pulsar topics are not partitioned, but you can easily create a partitioned topic and specify its number of partitions through the command-line tools or the API, as in the sketch below.
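For example, a partitioned topic might be created through the admin interface as in the following sketch, which uses the Java admin client of current Pulsar releases with placeholder names.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")   // placeholder admin URL
                .build();

        // Create a topic with 4 partitions; producers and consumers
        // can keep using the topic name unchanged.
        admin.topics().createPartitionedTopic(
                "persistent://my-tenant/my-namespace/my-topic", 4);

        admin.close();
    }
}
```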
Once a partitioned topic has been created, Pulsar partitions the data automatically without affecting producers or consumers. That is, if an application writes data to a topic and the topic is later partitioned, the application's code does not need to change. Partitioning is purely an operational concern, and applications do not need to care about how it is done.
The partitions of a topic are handled by processes called brokers; each node in a Pulsar cluster runs its own broker.
Figure 4: a topic divided across multiple brokers
Besides keeping partitioning transparent to applications, Pulsar provides several message routing strategies to help distribute data across partitions and consumers (see the sketch after this list):
- single partition - the producer picks a partition at random and writes all of its data to that partition. This strategy provides the same guarantees as a non-partitioned topic, and it is useful when multiple producers write data to the same topic.
- round robin partitions - the producer distributes data evenly across partitions in round-robin fashion; for example, the first message goes to the first partition, the second message to the second partition, and so on.
- hash partition - each message carries a key, and the key determines which partition the message is written to. This strategy guarantees ordering for messages with the same key.
- custom partition - the producer uses a custom function to compute a value for each message and writes the message to the partition selected by that value.
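The sketch below shows how a routing strategy might be selected with the Java client: the producer uses round-robin routing, and when a message carries a key the default routers hash that key to pick the partition. The URL, topic, and key are placeholders.

```java
import org.apache.pulsar.client.api.*;

public class RoutingModes {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder broker URL
                .build();

        // Round-robin routing: keyless messages are spread evenly across partitions.
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://my-tenant/my-namespace/my-topic")
                .messageRoutingMode(MessageRoutingMode.RoundRobinPartition)
                .create();

        // If a message carries a key, the default routers hash the key to
        // choose the partition, preserving per-key ordering.
        producer.newMessage()
                .key("user-42")
                .value("event payload".getBytes())
                .send();

        producer.close();
        client.close();
    }
}
```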
Persistence
After a Pulsar broker receives and acknowledges a message, it must guarantee that the message is never lost under any circumstances. Unlike some other messaging systems, Pulsar uses Apache BookKeeper to provide durability. BookKeeper offers persistent storage with low latency: before acknowledging a message, a BookKeeper node forces its journal to be written to persistent storage, so the data survives even a power failure. Because the Pulsar broker sends data to multiple BookKeeper nodes, it acknowledges the message to the producer only after a quorum of nodes has confirmed a successful write. In this way Pulsar guarantees that data is not lost even in the face of hardware failures, network failures, or other faults. Later articles will dig into the details of this mechanism.
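From a client's perspective, this guarantee surfaces in the publish acknowledgement: a send completes only after the broker has confirmed the durable write. A minimal sketch with the Java client's asynchronous send (placeholder names) follows.

```java
import org.apache.pulsar.client.api.*;
import java.util.concurrent.CompletableFuture;

public class DurablePublish {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")   // placeholder broker URL
                .build();

        Producer<byte[]> producer = client.newProducer()
                .topic("my-topic")
                .create();

        // sendAsync() completes only after the broker acknowledges the
        // message, i.e. after it has been durably written to storage.
        CompletableFuture<MessageId> future = producer.sendAsync("payload".getBytes());
        MessageId id = future.get();
        System.out.println("Durably stored as " + id);

        producer.close();
        client.close();
    }
}
```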
Production practice
Pulsar currently powers major Yahoo applications such as Yahoo Mail, Yahoo Finance, Yahoo Sports, the Gemini advertising platform, and Sherpa, Yahoo's distributed key-value store. Many of these scenarios require strong durability guarantees, such as zero data loss, together with high performance. Pulsar has been deployed in production since 2015 and now runs at large scale in Yahoo's production environment:
- deployed in more than 10 data centers, with full mesh replication capability
- processes more than 100 billion messages per day
- supports 1.4 million topics
- overall message publish latency below 5 milliseconds
This article has briefly introduced some of Apache Pulsar's concepts and explained how Pulsar ensures persistence by committing data before sending acknowledgements and improves throughput through partitioning. Future articles will dig into Pulsar's overall architecture and feature details and provide guidance on how to make better use of Pulsar.