2025-01-17 Update From: SLTechnology News&Howtos
This article introduces how to build a Lakehouse with Apache Pulsar and Apache Hudi.
About Apache Pulsar
Apache Pulsar, a top-level project of the Apache Software Foundation, is a next-generation cloud-native distributed message streaming platform that integrates messaging, storage, and lightweight functional computing. It adopts an architecture that separates compute from storage, supports multi-tenancy, persistent storage, and cross-region data replication across multiple datacenters, and offers streaming data storage characteristics such as strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/
This article originally appeared on the Apache Hudi account. Author: Guo Sijie, CEO of StreamNative and Apache Pulsar PMC member.
Typesetting of this issue: Tango@StreamNative
Motivation
The Lakehouse was first proposed by Databricks: a data management system offering low-cost, direct access to cloud storage while providing the performance of a traditional DBMS along with ACID transactions, versioning, auditing, indexing, caching, and query optimization. The Lakehouse combines the advantages of the data lake and the data warehouse: the low-cost storage and open data formats of the former, and the strong management and optimization capabilities of the latter. Delta Lake, Apache Hudi, and Apache Iceberg are the three main technologies for building a Lakehouse.
At the same time, Pulsar provides a range of features, including tiered storage, streaming offload, and columnar offload, which position it as a storage layer that unifies batch processing and event streams. In particular, tiered storage makes Pulsar a lightweight data lake, but Pulsar still lacks some performance optimizations, such as indexes and data versioning, which are common in traditional DBMSs. The columnar offloader was introduced to narrow this performance gap, but it is not enough.
This proposal explores using Apache Pulsar as a Lakehouse. It provides only the top-level design; detailed design and implementation will be addressed in later sub-proposals.
Analysis
This section analyzes the key features needed to build a Lakehouse, examines whether Pulsar meets those requirements, and identifies the gaps.
Lakehouse has the following key features:
Transaction support: many data pipelines in an enterprise Lakehouse read and write data concurrently, and ACID transactions guarantee the consistency of concurrent reads and writes, typically exposed via SQL. Delta Lake, Iceberg, and Hudi all implement a transaction layer on top of low-cost object storage. Pulsar introduced transaction support in version 2.7.0, including cross-topic transactions.

Schema constraints and governance: a Lakehouse needs to support schema enforcement and evolution, including warehouse schema paradigms such as star and snowflake schemas. The system should be able to reason about data integrity and should provide robust governance and auditing mechanisms; all three frameworks offer these capabilities. Pulsar has a built-in schema registry service that meets the basic requirements of schema enforcement and governance, though there is still room for improvement.

BI support: a Lakehouse can run BI tools directly on the source data, which reduces staleness, improves freshness, lowers query latency, and avoids the cost of operating two copies of the data in both a data lake and a warehouse. The three frameworks integrate tightly with Apache Spark while also allowing Redshift and Presto/Athena to query the source data, and the Hudi community has added support for engines such as Flink. Pulsar exposes segments in tiered storage for direct access, so it can integrate tightly with popular data processing engines; however, Pulsar's tiered storage still has performance gaps when serving BI workloads, and this proposal aims to address them.

Separation of storage and compute: storage and compute run on separate clusters, so each can be scaled out independently and almost without limit. All three frameworks support this separation, and Pulsar itself is deployed in a multi-tier architecture with storage and compute separated.

Openness: open, standardized data formats such as Parquet are used, with APIs so that a variety of tools and engines (including machine learning and Python/R libraries) can access the data directly and efficiently. All three frameworks support Parquet; Iceberg also supports ORC, and ORC support is in progress in the Hudi community. Pulsar does not yet store data in an open format, though its columnar offloader supports Parquet.

Support for diverse data types, from unstructured to structured: a Lakehouse can store, refine, analyze, and access the data types needed by many new data applications, including images, video, audio, semi-structured data, and text. It is not clear exactly how Delta, Iceberg, and Hudi support this; Pulsar supports all of these data types.

Support for diverse workloads: including data science, machine learning, SQL, and analytics. Multiple tools may be needed to cover all of these workloads, but they all rely on the same data repository. The three frameworks integrate closely with Spark, which offers a wide selection of tools, and Pulsar is also closely integrated with Spark.

End-to-end streaming: real-time reporting is the norm in many enterprises, and native streaming support removes the need for a separate system dedicated to serving real-time data applications. Delta Lake and Hudi provide streaming capabilities through change logs, but this is not true streaming; Pulsar is a true streaming system.
As you can see, Pulsar meets all the conditions for building a Lakehouse. However, today's tiered storage still has a large performance gap. For example:

Pulsar does not store data in an open, standard format such as Parquet; Pulsar does not maintain any index over the offloaded data; and Pulsar does not support efficient upserts.
The purpose of this proposal is to solve the performance problems of the Pulsar storage layer so that Pulsar can serve as a Lakehouse.
Current scheme
Figure 1 shows the storage layout of the current Pulsar stream.
Pulsar stores segment metadata in ZooKeeper; the latest segments are stored in Apache BookKeeper (the faster storage tier); and older segments are offloaded from Apache BookKeeper to tiered storage (the cheaper storage tier). The metadata of offloaded segments remains in ZooKeeper, referencing the offloaded objects in tiered storage.
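As a rough illustration of this layout (not Pulsar's actual code; all class and field names here are hypothetical), we can model a topic as a list of segment records whose metadata always stays in one place while the payload moves from the fast tier to tiered storage on offload:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Segment:
    """Toy model of a segment's placement (illustrative only)."""
    segment_id: int
    storage: str = "bookkeeper"         # recent segments live in the fast tier
    offload_object: Optional[str] = None  # set once moved to tiered storage

@dataclass
class TopicLayout:
    """In real Pulsar, this per-segment metadata is kept in ZooKeeper."""
    segments: List[Segment] = field(default_factory=list)

    def offload(self, segment_id: int, object_key: str) -> None:
        # Move a segment's data to cheap tiered storage;
        # its metadata entry stays put and now references the object.
        seg = next(s for s in self.segments if s.segment_id == segment_id)
        seg.storage = "tiered"
        seg.offload_object = object_key

layout = TopicLayout([Segment(0), Segment(1)])
layout.offload(0, "s3://bucket/topic/segment-0")
```

Note how even after the offload, the metadata entry for segment 0 remains in the layout (i.e., in ZooKeeper), which is exactly the scalability concern the proposal raises below.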
Figure 1
The current scheme has some shortcomings:
1. It does not use any open storage format for the offloaded data, which makes it difficult to integrate with the broader ecosystem.
2. It keeps all metadata in ZooKeeper, which may limit scalability.
New Lakehouse storage scheme
The new scheme recommends using a Lakehouse to store the offloaded data in tiered storage. The proposal recommends Apache Hudi as the Lakehouse storage, for the following reasons:
Cloud providers offer good support for Apache Hudi; Apache Hudi has graduated to a top-level Apache project; and Apache Hudi supports multiple engines, including both Spark and Flink. It also has a very active community in China.
New storage layout
Figure 2 shows the new layout of Pulsar topic.
The metadata of the latest (not-yet-offloaded) segments is stored in ZooKeeper; the data of the latest (not-yet-offloaded) segments is stored in BookKeeper; and the metadata and data of offloaded segments are stored directly in tiered storage. Because a topic is an append-only stream, we do not strictly need a Lakehouse repository such as Apache Hudi; but if we also store the metadata in tiered storage, using a Lakehouse repository to guarantee ACID makes more sense.
Figure 2

Supporting efficient upserts
Pulsar does not support upserts directly; it supports them only through topic compaction. However, the current topic compaction approach is neither scalable nor efficient:
1. Topic compaction is performed inside the broker, so it cannot handle upserting large amounts of data, especially when the dataset is large.
2. Topic compaction does not support storing the compacted data in tiered storage.
To support efficient and scalable upserts, the proposal recommends using Apache Hudi to store the compacted data in tiered storage. Figure 3 shows how Apache Hudi supports efficient upserts in topic compaction.
Figure 3
The idea is to implement a topic compaction service. The topic compaction service can run as a separate service (for example, a Pulsar Function) to compact a topic.
1. The broker issues a topic compaction request to the compaction service.
2. The compaction service receives the request, reads the messages, and upserts them into the Hudi table.
3. After completing the upserts, it advances the topic compaction cursor to the last message it compacted.
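The steps above can be sketched as a stand-alone loop. This is a minimal, self-contained simulation of the proposed service's logic (the function and the in-memory "Hudi table" are hypothetical; a real implementation would read via the Pulsar reader API and write with a Hudi writer):

```python
from typing import Dict, List, Tuple

# Illustrative sketch of the proposed compaction service (names hypothetical).
# A message is (offset, key, value); the "Hudi table" is a plain dict here.

def compact(messages: List[Tuple[int, str, bytes]],
            cursor: int,
            table: Dict[str, bytes]) -> int:
    """Upsert every message after `cursor` into the table by key,
    then return the new cursor (the last compacted offset)."""
    new_cursor = cursor
    for offset, key, value in messages:
        if offset <= cursor:
            continue               # already compacted in an earlier run
        table[key] = value         # upsert: the last write per key wins
        new_cursor = offset
    return new_cursor

table: Dict[str, bytes] = {}
msgs = [(1, "a", b"v1"), (2, "b", b"v1"), (3, "a", b"v2")]
cursor = compact(msgs, cursor=0, table=table)
```

Because the cursor only advances after the upserts complete, a crash between steps 2 and 3 simply replays the same messages, and the idempotent key-based upsert absorbs the duplicates.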
The topic compaction cursor stores the metadata referencing the location in tiered storage where the Hudi table is kept.
Treating the Hudi table as a Pulsar topic
Hudi maintains a timeline of all operations performed on the table at different instants, which helps provide instantaneous views of the table while efficiently supporting data retrieval in _arrival_ order. Hudi supports incrementally pulling changes from a table. We can support a _ReadOnly_ topic backed by the Hudi table, which allows applications to stream changes to the Hudi table from Pulsar brokers. Figure 4 shows this idea.
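The incremental-pull idea can be illustrated with a toy commit timeline (this is not Hudi's real API; with Spark one would instead run a Hudi incremental query against the table). Each commit carries a commit time and its records, and a reader resumes from a checkpointed commit time:

```python
from typing import Dict, List, Tuple

# Toy model of a Hudi-style commit timeline (illustrative only).
# Incremental pull returns every record committed strictly after
# the reader's checkpoint, preserving arrival order.

Timeline = List[Tuple[int, List[Dict]]]  # (commit_time, records)

def incremental_pull(timeline: Timeline, checkpoint: int) -> List[Dict]:
    """Return records from commits newer than `checkpoint`."""
    out: List[Dict] = []
    for commit_time, records in timeline:
        if commit_time > checkpoint:
            out.extend(records)
    return out

timeline: Timeline = [
    (100, [{"k": "a", "v": 1}]),
    (200, [{"k": "b", "v": 1}, {"k": "a", "v": 2}]),
]
changes = incremental_pull(timeline, checkpoint=100)
```

A _ReadOnly_ topic backed by the table would repeatedly run such a pull, publish the returned records to its subscribers, and advance the checkpoint to the latest commit time it has served.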
Figure 4

Scalable metadata management
Once we store all data in tiered storage, the proposal recommends no longer storing the metadata of offloaded or compacted data in ZooKeeper, and instead relying solely on tiered storage to store that metadata.
The proposal organizes the offloaded and compacted data in the following directory layout:
-/-segments/