
How to analyze the Mechanism of Pulsar Connector

2025-04-02 Update From: SLTechnology News&Howtos



This article explains how the Pulsar Connector works. The material is quite practical, and I hope you take something away from it.

Apache Pulsar is a next-generation distributed messaging system originally open-sourced by Yahoo. It graduated from the Apache Software Foundation incubator in September 2018 to become a top-level project. Pulsar's distinctive layered, segment-based architecture delivers the performance and throughput required of a big data message streaming system while also providing high availability, high scalability, and easy maintenance.

The segmented architecture reduces the storage granularity of streaming data from the partition to the segment, and together with the corresponding tiered storage this makes Pulsar an excellent choice for storing unbounded streaming data. It also lets Pulsar match the unified batch-stream computing model of Flink well.

1. Brief introduction to Pulsar

1.1 Features

Being open source, Pulsar has been given richer functionality by enterprises in different industries according to their needs. It is therefore no longer just middleware, but has gradually developed into an Event Streaming Platform, with Connect, Store, and Process capabilities.

■ Connect: Pulsar has its own unified Pub/Sub model, which can cover the application scenarios of both Kafka and RocketMQ. At the same time, Pulsar IO, which is essentially a Connector framework, makes it very convenient to import data sources into Pulsar or export data from Pulsar. In addition, Pulsar 2.5.0 introduced an important mechanism: the Protocol Handler. It allows custom protocol support to be added to the broker, so that you can enjoy some of Pulsar's advanced features without changing your existing codebase. On top of this, Pulsar has protocol extensions such as KoP (Kafka on Pulsar), ActiveMQ, REST, and so on.

■ Store: once Pulsar gives users a way to import data, it must consider how to store it. Pulsar uses distributed storage, initially on Apache BookKeeper. Tiered storage was added later, with backends selectable through several modes such as JClouds and HDFS. Of course, tiered storage is still limited by storage capacity.

■ Process: Pulsar provides an abstraction of unlimited storage, which enables better unified batch-stream computing on third-party platforms; this is Pulsar's data processing capability. In practice this capability is divided according to the difficulty and latency requirements of the computation. Pulsar currently includes the following integrated processing approaches:

Pulsar Functions: Pulsar's built-in function processing, which lets you write functions on different system sides to perform computation on data in Pulsar.

Pulsar-Flink connector and Pulsar-Spark connector: as unified batch-stream computing engines, both Flink and Spark provide stream computing mechanisms. If you are already using them, good news: Pulsar supports both, so no extra migration work is needed.

Presto (Pulsar SQL): some users rely more on SQL in their application scenarios, for interactive queries and the like. Pulsar integrates well with Presto, so data in Pulsar can be processed with SQL.

1.2 Subscription model

From a usage point of view, Pulsar resembles traditional messaging systems and is based on the publish-subscribe model. Users take two roles: producer (Producer) and consumer (Consumer); for more specific needs, data can also be consumed in the Reader role. A user can publish data under a particular topic as a producer, or subscribe to a particular topic (Subscription) as a consumer to obtain data. In this process Pulsar handles data persistence and data distribution, and it also provides a Schema feature to validate data.

As shown in the figure below, there are several subscription modes in Pulsar:

Exclusive subscription (Exclusive)

Failover subscription (Failover)

Shared subscription (Shared)

Key ordered shared subscription (Key_shared)
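As a rough illustration (not Pulsar's actual broker implementation), the four modes differ in how an incoming message is dispatched to the consumers attached to a subscription. A minimal Java sketch with hypothetical names:

```java
import java.util.List;

// Illustrative model of Pulsar's four subscription modes (not real broker code).
public class SubscriptionDispatch {
    public enum Mode { EXCLUSIVE, FAILOVER, SHARED, KEY_SHARED }

    private long counter = 0; // round-robin counter for Shared mode

    /** Pick which consumer (by index) receives a message with the given key. */
    public int dispatch(Mode mode, List<String> consumers, String key) {
        switch (mode) {
            case EXCLUSIVE:  // only one consumer may attach; it gets everything
            case FAILOVER:   // all attach, but only the first active one consumes
                return 0;
            case SHARED:     // messages fan out round-robin across consumers
                return (int) (counter++ % consumers.size());
            case KEY_SHARED: // same key always lands on the same consumer
                return Math.floorMod(key.hashCode(), consumers.size());
            default:
                throw new IllegalArgumentException("unknown mode");
        }
    }
}
```

In the real client these modes are chosen via the `SubscriptionType` enum when building a consumer; the sketch only shows the dispatch semantics each mode implies.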

Topics in Pulsar fall into two categories: partitioned topics (Partitioned Topic) and non-partitioned topics (Non-Partitioned Topic).

A partitioned topic is actually made up of multiple non-partitioned topics. Topics and partitions are logical concepts: a topic can be thought of as a large, unbounded event stream that partitions divide into several smaller unbounded event streams.

Correspondingly, at the physical level Pulsar adopts a layered structure. Each event stream is stored in Segments, and each Segment contains a number of Entries; an Entry holds one or more messages sent by users. A Message is the unit of data stored in an Entry and also the unit a consumer fetches from Pulsar at one time. Besides the byte-stream payload, a Message also carries a Key attribute, two time attributes, a MessageId, and other information. The MessageId is the unique identifier of the message and encodes its ledger-id, entry-id, batch-index, and partition-index. As shown in the figure below, these record the Segment, Entry, position within the Entry, and Partition where the message is stored, so a Message's content can also be located physically.
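The global ordering of MessageIds described above can be sketched in Java. This is an illustrative model whose field names follow the text, not the real `org.apache.pulsar.client.api.MessageId` class:

```java
// Illustrative model of a Pulsar MessageId: ordered by ledger, then entry, then batch index.
// Not the real client class; it only demonstrates why MessageIds within a partition
// form a total order that maps to physical storage positions.
public record MsgId(long ledgerId, long entryId, int batchIndex, int partitionIndex)
        implements Comparable<MsgId> {
    @Override
    public int compareTo(MsgId o) {
        int c = Long.compare(ledgerId, o.ledgerId);       // ledger = segment
        if (c != 0) return c;
        c = Long.compare(entryId, o.entryId);             // entry within the segment
        if (c != 0) return c;
        return Integer.compare(batchIndex, o.batchIndex); // message within the entry
    }
}
```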

2. Pulsar architecture

A Pulsar cluster consists of a cluster of Brokers and a cluster of Bookies. Brokers are independent of one another and are responsible for serving producers and consumers on particular topics. Bookies are likewise independent of one another; they store Segment data and are where messages are persisted. To manage configuration and component metadata, Pulsar also uses ZooKeeper; both Brokers and Bookies register themselves with ZooKeeper. The structure of Pulsar is best described by following the read and write paths of a message (see the figure below).

On the write path, a producer creates and sends a message to a topic. The message may be routed to a specific partition by some algorithm (such as round robin). Pulsar selects a Broker to serve that partition, and messages for the partition are actually sent to that Broker. When the Broker receives a message, it writes it to Bookies according to the write quorum (Qw). When the number of Bookies that have successfully written the message reaches the configured count, the Broker receives a completion notification and in turn notifies the producer that the write succeeded.
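The quorum write above can be sketched as a small ack counter. The Qw/Qa names follow BookKeeper's write-quorum/ack-quorum terminology; the class itself is illustrative, not broker code:

```java
// Illustrative ack tracker for a quorum write: the broker sends an entry to Qw bookies
// and reports success to the producer once Qa of them have acknowledged.
public class QuorumWrite {
    private final int writeQuorum; // Qw: how many bookies receive the entry
    private final int ackQuorum;   // Qa: how many acks complete the write
    private int acks = 0;

    public QuorumWrite(int writeQuorum, int ackQuorum) {
        if (ackQuorum > writeQuorum) throw new IllegalArgumentException("Qa must be <= Qw");
        this.writeQuorum = writeQuorum;
        this.ackQuorum = ackQuorum;
    }

    /** Record one bookie ack; returns true once the write is considered successful. */
    public boolean onBookieAck() {
        acks = Math.min(acks + 1, writeQuorum);
        return acks >= ackQuorum;
    }
}
```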

On the read path, a consumer must first create a subscription before it can connect to the Broker serving the topic; the Broker requests data from the Bookies and sends it to the consumer. When data is received successfully, the consumer can choose to send an acknowledgement to the Broker, which then updates the consumer's position. As mentioned earlier, freshly written data is also kept in the Broker cache, so it can be read directly from the cache, shortening the read path. Pulsar separates storage from serving and thus achieves good scalability: at the platform level the number of Bookies can be adjusted to meet demand, while users only communicate with Brokers, which are designed to be stateless; when a Broker fails, a new Broker can be started dynamically to replace it.
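The cache-first read path above can be sketched like this; a hypothetical simplification, not the actual broker cache:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative cache-first read path: recently written entries are served from the
// broker cache; older (or evicted) entries fall back to the persistent copy on bookies.
public class BrokerReadPath {
    private final Map<Long, String> cache = new HashMap<>();   // entryId -> payload
    private final Map<Long, String> bookies = new HashMap<>(); // persistent copy

    public void write(long entryId, String payload) {
        bookies.put(entryId, payload); // persisted on the bookies
        cache.put(entryId, payload);   // kept in the broker cache for tailing reads
    }

    public String read(long entryId) {
        String hit = cache.get(entryId);
        return hit != null ? hit : bookies.get(entryId); // cache miss -> read from bookies
    }

    public void evict(long entryId) {
        cache.remove(entryId); // the real cache is bounded; old entries get evicted
    }
}
```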

3. Internal mechanism of Pulsar Connector

First of all, the Pulsar Connector is fairly simple to use. It consists of a Source and a Sink. The Source transfers messages under one or more topics into a Flink Source; the Sink takes data from a Flink Sink and writes it to certain topics. In usage it is similar to the Kafka Connector: you just need to set some parameters.

StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
Properties props = new Properties();
props.setProperty("topic", "test-source-topic");
FlinkPulsarSource<String> source =
    new FlinkPulsarSource<>(serviceUrl, adminUrl, new SimpleStringSchema(), props);
DataStream<String> stream = see.addSource(source);

FlinkPulsarSink<Person> sink = new FlinkPulsarSink<>(
    serviceUrl,
    adminUrl,
    Optional.of(topic),      // mandatory target topic
    props,
    TopicKeyExtractor.NULL,  // replace this to extract key or topic for each record
    Person.class);
stream.addSink(sink);

Now let's look at how some features of the Pulsar Connector are implemented.

3.1 Exactly Once

Because the MessageId in Pulsar is globally unique and ordered, and corresponds to the physical storage location of the message, the Pulsar Connector implements Exactly Once by saving MessageIds into Checkpoints with the help of Flink's Checkpoint mechanism.

For the connector's Source task, each time a Checkpoint is triggered, the MessageId currently being processed in each partition is saved to state storage. When the task restarts, each partition can locate the position of its saved MessageId through the Reader seek interface provided by Pulsar, and resume reading message data from there.
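The per-partition snapshot/restore logic can be sketched as follows; a simplified model of what the Source keeps in Flink state, with hypothetical names rather than the actual connector classes:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the Source task's checkpoint state: one MessageId (here a String
// placeholder) per partition. On restore, each value is the position to seek the Reader to.
public class SourceCheckpointState {
    private final Map<String, String> lastProcessed = new HashMap<>();

    /** Called as messages are processed. */
    public void advance(String partition, String messageId) {
        lastProcessed.put(partition, messageId);
    }

    /** Called when a checkpoint triggers: snapshot current positions. */
    public Map<String, String> snapshot() {
        return new HashMap<>(lastProcessed);
    }

    /** Called on restart: the position each partition's Reader should seek to. */
    public static String restorePosition(Map<String, String> snapshot, String partition) {
        return snapshot.getOrDefault(partition, "earliest");
    }
}
```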

Through the Checkpoint mechanism, a notification that the data has been consumed can also be sent to the nodes storing it, so that expired data can be deleted accurately and storage used efficiently.

3.2 Dynamic Discovery

Since Flink tasks run for a long time, users may need to add topics or partitions while a task is running. The Pulsar Connector provides automatic discovery for this.

Pulsar's strategy is to start a separate thread that periodically checks whether the configured topics have changed and whether partitions have been added or removed. When a new partition appears, a new Reader task is created to deserialize its data; when a partition is deleted, the corresponding read task is removed.
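The periodic check amounts to a set difference between the previously known partitions and the currently visible ones. An illustrative sketch with hypothetical names, not the connector's actual discoverer:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative partition-discovery step: compare the currently visible partitions with
// the previously known set, start Readers for new ones, stop Readers for removed ones.
public class PartitionDiscovery {
    private Set<String> known = new HashSet<>();

    public record Diff(Set<String> added, Set<String> removed) {}

    /** One round of the periodic check; `current` would come from the Pulsar admin API. */
    public Diff check(Set<String> current) {
        Set<String> added = new HashSet<>(current);
        added.removeAll(known);          // partitions that need a new Reader task
        Set<String> removed = new HashSet<>(known);
        removed.removeAll(current);      // partitions whose Reader task should stop
        known = new HashSet<>(current);
        return new Diff(added, removed);
    }
}
```

In the real connector this check runs on a schedule (e.g. a timer thread), with the interval controlled by a connector configuration option.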

In addition, building on Flink 1.9, Pulsar has added support for the Table API and for Catalog. Pulsar uses a simple mapping, as shown in the figure below: a Pulsar tenant/namespace maps to a Catalog database, and a topic maps to a concrete table in that database.

As mentioned earlier, Pulsar stores data in BookKeeper and can also offload it to file systems such as HDFS or S3. For analytical applications, however, we often care about only a few attributes of each record, so columnar storage improves both IO and network performance; Pulsar is exploring storing data as columns within Segments. On the original read path, both Reader and Consumer fetch data through Brokers. With a new Bypass Broker approach, the Bookie location of each Message can be found directly by querying metadata, so data can be read straight from the Bookies, shortening the read path and improving efficiency.

Compared with Kafka, because Pulsar physically stores data in Segments, reading can be parallelized: multiple threads can read multiple Segments at the same time, improving the efficiency of the whole job. This requires that the task itself has no strict ordering requirement on each topic partition, and newly produced data that has not yet been sealed into a Segment still needs to be read through the cache. Parallel reading is therefore offered as an option, giving users more choice.

The above is how to analyze the mechanism of the Pulsar Connector. Some of these points may come up in your daily work, and I hope you can learn more from this article.



