How to Query Data Streams with Apache Pulsar SQL
Apache Pulsar is growing in popularity, especially since becoming a top-level project of the Apache Software Foundation.
Users not only publish and subscribe to messages with Pulsar, but also take advantage of its scalable storage architecture and tiered storage feature to store data streams. Once data is stored, users need a way to query it.
In addition, some users want to query data as soon as it lands in Pulsar, without waiting for it to be moved to an external system such as a database.
This requirement led to the development of Pulsar SQL, a new framework first released in Apache Pulsar 2.2.0. With Pulsar SQL, users can efficiently query the data streams stored in Pulsar through a SQL interface.
This article describes the architecture, implementation, and features of Pulsar SQL, as well as the background and use cases that drove its development.
Background
Apache Pulsar was originally developed as a next-generation publish/subscribe messaging system to address the shortcomings of existing messaging and streaming systems, and it can handle more use cases than a traditional publish/subscribe messaging system.
Pulsar has an innovative architecture that separates serving/compute from storage, making it easy for users to scale compute and storage resources independently and to add storage capacity with little effort.
Because of this architecture, users treat Pulsar not only as a publish/subscribe system but also as a platform for storing both new and old stream data. With the addition of tiered storage, Pulsar has become increasingly practical as a "stream storage" and "event store" system.
With tiered storage, users can extend an existing Pulsar cluster with cloud storage (for example, Amazon S3 or Google Cloud Storage) and keep an almost unlimited amount of stream data in the cloud at a very low unit cost.
Because Pulsar can store, archive, and process data streams at the same time, both real-time and historical data can be accessed from a single system. Until now, doing so has required multiple systems and tools; Apache Pulsar supports both kinds of access in one system. The addition of a schema registry makes it straightforward to support SQL queries on top of this data.
Data streams are produced, consumed, and stored in Pulsar in a structured manner
Pulsar SQL is a query layer built on top of Apache Pulsar that lets users interactively query both new and historical data streams stored in Pulsar. With it, users can gain insight from a single system that holds both kinds of data.
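As a minimal sketch (the "orders" topic and its fields are hypothetical, not from this article), Pulsar SQL exposes each topic as a table named pulsar."tenant/namespace"."topic", with columns derived from the topic's registered schema plus metadata columns such as __publish_time__:

```sql
-- Inspect the columns derived from the topic's registered schema
-- ("orders" in the "public/default" namespace is a hypothetical topic).
DESCRIBE pulsar."public/default"."orders";

-- Query recent events; __publish_time__ is one of the metadata
-- columns Pulsar SQL attaches to every message.
SELECT order_id, amount, __publish_time__
FROM pulsar."public/default"."orders"
WHERE __publish_time__ > TIMESTAMP '2019-01-01 00:00:00'
LIMIT 10;
```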
Another important use case for Pulsar SQL is that it can greatly simplify certain data pipelines. A traditional ETL pipeline (for example, one that exports data to a data lake) must extract data from a set of external systems and run a series of transformations to cleanse and convert it to the target format before loading it into the destination system.
The conversions generally run sequentially as separate steps, and a failure in any step halts the whole process. This approach has two fundamental drawbacks:
Each ETL step is tied to the framework it runs in; for example, Sqoop or Flume extracts the data, Hive or Pig scripts transform it, and Hive or Impala loads it into queryable tables.
The pipeline is inherently batch-oriented, so the data loaded into the data lake lags behind the incoming stream. The longer the interval between batches, the less current the data, and the less timely any decisions based on it.
With Pulsar SQL, Apache Pulsar can extract, cleanse, transform, and query data streams on one system, avoiding both of these problems.
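For example, a cleanup that would traditionally run as a separate ETL step can be expressed directly in the query. This is a hypothetical sketch; the "raw-events" topic and its fields are illustrative:

```sql
-- Cleanse and reshape raw events at query time instead of in a
-- separate batch ETL job ("raw-events" and its fields are illustrative).
SELECT
  lower(trim(user_id))                 AS user_id,    -- normalize the key
  CAST(amount AS DECIMAL(10, 2))       AS amount,     -- enforce a numeric type
  date_trunc('hour', __publish_time__) AS event_hour  -- bucket by publish time
FROM pulsar."public/default"."raw-events"
WHERE user_id IS NOT NULL;
```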
Pulsar's storage layer scales out (Pulsar uses Apache BookKeeper as its event storage layer), so Pulsar can operate on all data (streaming and historical) uniformly within a single system.
Pulsar SQL leverages the unique architectures of Presto and Pulsar to run queries in a highly scalable way, regardless of the number of topic partitions that make up the stream. We discuss that architecture next.
Architecture
The Presto Pulsar connector integrates Pulsar with Presto. The connector runs on the Presto workers in a Presto cluster; the workers use it to read data from the Pulsar cluster and execute queries against that data.
How does the Presto Pulsar connector read data from Pulsar efficiently?
In Pulsar, producers write messages into named channels called topics. Topics are stored in Apache BookKeeper as segments, and each segment is replicated to a configurable number of BookKeeper storage nodes, called bookies (the default is 2).
Overview of Pulsar SQL architecture
Pulsar SQL is designed to maximize data-scan throughput, so the Presto Pulsar connector reads data directly from the bookies (rather than through the Pulsar consumer API), taking advantage of Pulsar's segment-based architecture. The consumer API is well suited to consuming messages in publish/subscribe use cases, but it is not necessarily optimized for bulk reads.
To guarantee ordering, a topic in Pulsar is served by a single broker, which limits read throughput to that of one broker. Users can raise read throughput with topic partitions, but we want topics to be queryable at high performance without modifying existing topics. For query use cases, ordering does not matter; we simply need to read all of the data.
Reading directly from the segments that make up a topic is a better approach. Because segments and their replicas are spread across multiple bookies, Presto workers can read segments concurrently from many BookKeeper nodes to achieve high throughput. Users can push throughput even higher simply by configuring a larger replication factor for the topic.
Presto workers read from multiple replicas in parallel for high throughput
Pulsar SQL can query not only the data held on bookies but also data offloaded to cloud storage. With tiered storage, users can retain data beyond the physical capacity of the cluster and still query it for insight.
Use cases
Here are some common Pulsar SQL use cases. In each one, Pulsar simplifies an architecture that would otherwise require multiple systems; with Pulsar SQL added, a single Pulsar deployment handles both ingestion and querying.
Real-time analytics: Pulsar data can be queried the moment a message arrives, making it possible to feed the latest data into real-time dashboards or monitor it with SQL queries.

Web and mobile application analytics: web and mobile applications generate streams of usage and interaction data that can be queried in real time to detect user habits, improve applications, and optimize the experience.

Event logging and analysis: Pulsar can store event logs from user applications or system logs from the operating system. Pulsar SQL can then query the stored logs to debug applications and troubleshoot failures.

Event playback: SQL queries can extract events in chronological order, for example to identify a spike of fraudulent transactions within a short window. Captured event streams can then be replayed to simulate the fraudulent activity while improving a fraud-detection algorithm (see the sketch below).
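As a rough sketch of the playback case (the "transactions" topic is hypothetical), a time-range query over the __publish_time__ metadata column replays events in publish order:

```sql
-- Hypothetical replay query: pull one hour of events in publish order
-- to re-drive a fraud-detection algorithm.
SELECT *
FROM pulsar."public/default"."transactions"
WHERE __publish_time__ BETWEEN TIMESTAMP '2019-06-01 00:00:00'
                           AND TIMESTAMP '2019-06-01 01:00:00'
ORDER BY __publish_time__;
```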
This concludes the introduction to querying data streams with Apache Pulsar SQL.