What Apache Hudi's Unified Storage and Serving for Batch and Near Real-Time Analytics Looks Like

Many newcomers do not know how Apache Hudi unifies storage and serving for batch and near real-time analytics, so this article summarizes the topic; I hope that after reading it you will be able to answer the question yourself.
The following introduces the background and design of Hudi, which remains well worth understanding today. It is organized into several modules: background, motivation, design, use cases, and a demo.
By 2018, Uber's trip business had reached a scale of more than 2 million drivers across 700 cities in 70 countries.
At Uber, data work divides into ingestion and query. Ingestion includes consuming data from Kafka and HDFS; queries include data scientists using Spark notebooks, ad hoc queries and dashboards via Hive/Presto, data pipelines or ETL jobs built with Spark/Hive, and so on. With Hudi introduced, Hudi manages the raw datasets and provides upsert and incremental processing semantics as well as snapshot isolation.
Stream and batch processing both consume data from message middleware (such as Kafka). Stream processing produces results with less than one minute of latency, batch processing produces results with roughly one hour of latency, and the batch results can correct the streaming results. This is a typical Lambda architecture: two systems must be maintained, so the maintenance cost is high.
Using a data lake instead provides the following advantages:
1. Support for ad hoc queries on the latest data
2. Near real-time processing (micro-batches); many business scenarios do not require fully real-time results
3. More appropriate data handling, such as controlling file sizes, which matters for storage systems like HDFS, without having to rewrite entire partitions
4. Lower maintenance cost, e.g., no data replication and no need to maintain multiple systems
As Uber's open-source data lake framework, Hudi abstracts the storage layer (supporting dataset mutation and incremental processing); it is a library for Spark (horizontally scalable, storing data on HDFS); and it is open source (incubated at Apache).
The Hudi-based architecture supports upsert, incremental processing, different views, and so on. Unlike the typical Lambda architecture, a Hudi-based analytics architecture only needs to maintain Hudi itself, and the capabilities Hudi provides can meet the different needs of the upper-layer applications.
Hudi manages datasets on HDFS, including indexes, data files and metadata, and supports Hive/Presto/Spark queries.
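As a Spark library, Hudi accepts writes through the standard DataFrame writer. Below is a minimal sketch, assuming a recent Hudi release (which registers the "hudi" format name); the table name, HDFS paths, and the trip_id/ts/city fields are hypothetical, not taken from the article.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-upsert-sketch")
  // Hudi requires Kryo serialization in the Spark session
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// Hypothetical batch of trip records to ingest
val df = spark.read.json("hdfs:///data/trips/incoming")

df.write.format("hudi")
  .option("hoodie.table.name", "trips")
  // upsert: insert new records, update existing ones in place
  .option("hoodie.datasource.write.operation", "upsert")
  // the record key identifies a row; the precombine field decides which
  // version wins when the same key appears more than once in a batch
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .mode(SaveMode.Append)
  .save("hdfs:///warehouse/hudi/trips")
```

The resulting dataset can then be queried from Hive, Presto, or Spark like any other table, with Hudi's metadata deciding which file versions each query sees.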
Hudi provides three different types of views: the read optimized view, the real-time view, and the incremental view. The community is reworking these definitions into the read optimized view, the snapshot view, and the incremental view.
Originally, COW tables supported only the read optimized view, while MOR tables supported the read optimized and real-time views; in the latest releases, COW supports the read optimized and incremental views, and MOR supports the read optimized, real-time, and incremental views.
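As a sketch of how a view is selected at read time (using the query-type option of recent Hudi releases; older releases used hoodie.datasource.view.type instead, and the path and instant time here are hypothetical):

```scala
val basePath = "hdfs:///warehouse/hudi/trips"

// Snapshot (real-time) view: the latest merged state of the table
val rtDf = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .load(basePath)

// Read optimized view: only compacted base (Parquet) files, no log merging
val roDf = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load(basePath)

// Incremental view: only records committed after a given instant time
val incDf = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20180101000000")
  .load(basePath)
```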
In COW mode, the read optimized view reads only Parquet data files. After the upsert of batch 1, the read optimized view reads files File1 and File2; after the upsert of batch 2, it reads File1' and File2.
COW mode solves many problems, but it also has drawbacks on the write side: update latency is high, because every update copies an entire file.
The following workflow shows how a delayed update is handled: it is first reflected in the source table, then in ETL table A, and then in ETL table B.
From the above analysis, COW has the following problems: high update latency, write amplification, limited data freshness, and small files.
MOR mode addresses these. Unlike COW, which copies an entire file per update, MOR writes updates to an incremental (log) file, which reduces data ingestion latency and write amplification.
MOR mode provides both a read optimized view and a real-time view.
After the upsert of batch 1, the read optimized view still reads just the Parquet files; after the upsert of batch 2, the real-time view reads the merged result of the Parquet files and the log files.
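Whether a table behaves as COW or MOR is chosen at write time. A sketch, reusing the hypothetical trips dataset from the earlier snippets:

```scala
// COPY_ON_WRITE rewrites whole Parquet files on update; MERGE_ON_READ appends
// updates to log files and defers the merge to compaction or to read time.
df.write.format("hudi")
  .option("hoodie.table.name", "trips_mor")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "city")
  .mode(SaveMode.Append)
  .save("hdfs:///warehouse/hudi/trips_mor")
```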
Comparing the trade-offs between the different Hudi views: the read optimized view under COW offers native Parquet read performance, but data ingestion is slower; the read optimized view under MOR also offers native Parquet read performance, but it may return stale data (updates not yet compacted); the real-time view under MOR has high ingestion performance, but files must be merged at read time. Compaction converts log files into Parquet files, turning the real-time view back into the read optimized view.
For compaction, Hudi provides a lock-free, MVCC-based asynchronous compaction mechanism that decouples compaction from ingestion, so ingestion is not affected.
Asynchronous compression merges the log file and the data file to form a new data file, and then reads the optimized view to reflect the latest data.
Hudi also provides concurrency guarantees such as snapshot isolation and atomicity of batch writes.
Hudi use cases
At Uber, data in Kafka is consumed through Marmaray, a framework developed by Uber, and written into the Hudi data lake: more than 1,000 datasets and over 100 TB of data are ingested every day, and the total size of the datasets managed by Hudi has reached 10 PB.
As for HDFS's classic small-file problem, Hudi automatically handles small files during ingestion to reduce pressure on the NameNode: it supports writing into large files and incrementally updating existing files.
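A sketch of the file-sizing knobs involved (the config keys are Hudi storage configs; the byte values are illustrative, close to the defaults):

```scala
df.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // target size for the base files Hudi writes
  .option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString)
  // files below this size count as "small": new inserts are routed into
  // them rather than creating yet another small file
  .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString)
  .mode(SaveMode.Append)
  .save("hdfs:///warehouse/hudi/trips")
```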
Hudi also considers data privacy, that is, how to delete data. Hudi provides two methods: soft delete and hard delete. A soft delete keeps the key and deletes only the content, while a hard delete removes both the key and the content.
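A sketch of both delete styles against the hypothetical trips table (the fare column stands in for arbitrary content fields; the Spark session and earlier imports are reused):

```scala
import org.apache.spark.sql.functions.{col, lit}

// Records to delete, selected by some business condition (illustrative)
val toDelete = spark.read.format("hudi")
  .load("hdfs:///warehouse/hudi/trips")
  .filter(col("city") === "san_francisco")

// Soft delete: upsert the same keys with the content columns nulled out,
// so the key stays queryable but the payload is gone
val softDeletes = toDelete.select(
  col("trip_id"), col("ts"), col("city"),
  lit(null).cast("double").as("fare"))
softDeletes.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save("hdfs:///warehouse/hudi/trips")

// Hard delete: a write with operation "delete" removes key and content
toDelete.write.format("hudi")
  .option("hoodie.table.name", "trips")
  .option("hoodie.datasource.write.operation", "delete")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save("hdfs:///warehouse/hudi/trips")
```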
Hudi's incremental processing can be used to build incremental pipes and dashboards, as sketched below.
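A sketch of such an incremental pipe: pull only the rows changed since the last checkpointed commit from the source table and upsert them into a downstream ETL table, mirroring the source-to-ETL-table flow described earlier (names and the checkpoint value are hypothetical):

```scala
import org.apache.spark.sql.functions.col

// Commit instant up to which the downstream table was last refreshed; a
// real pipeline would persist this checkpoint itself
val lastCommit = "20180101000000"

val changes = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", lastCommit)
  .load("hdfs:///warehouse/hudi/trips")

// Downstream ETL: transform only the changed rows, not whole partitions,
// and upsert them into ETL table A keyed the same way as the source
val etlA = changes.select(col("trip_id"), col("ts"), col("city"), col("fare"))

etlA.write.format("hudi")
  .option("hoodie.table.name", "etl_table_a")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "trip_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save("hdfs:///warehouse/hudi/etl_table_a")
```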
That concludes the overview of how Apache Hudi unifies storage and serving for batch and near real-time analytics. I hope you have now grasped it; thank you for reading!