What are the frequently asked questions about Apache Hudi?

Many newcomers are not clear about the most common Apache Hudi questions. To help, this article walks through each of them in detail; hopefully you will gain something from it.

1. When is Apache Hudi useful for individuals and organizations?

If you want to quickly ingest data onto HDFS or cloud storage, Hudi can help. In addition, if your Hive/Spark ETL jobs are slow or resource-intensive, Hudi can help by providing an incremental way to read and write data.
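
For illustration, here is a minimal sketch of such an ingest using the Spark DataSource API. The table name, field names, and paths are hypothetical, and the option keys follow recent Hudi releases:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-ingest")
  // Hudi's Spark bundle must be on the classpath; Hudi requires Kryo serialization
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
import spark.implicits._

// Hypothetical input batch: events keyed by eventId, partitioned by eventDate
val inputDF = Seq(
  ("e1", "2019-09-01", 1567328400L),
  ("e2", "2019-09-02", 1567414800L)
).toDF("eventId", "eventDate", "updatedAt")

// Upsert into a Hudi dataset on HDFS; re-running with the same keys
// updates records in place instead of rewriting the whole table
inputDF.write.format("hudi")
  .option("hoodie.table.name", "user_events")
  .option("hoodie.datasource.write.recordkey.field", "eventId")
  .option("hoodie.datasource.write.partitionpath.field", "eventDate")
  .option("hoodie.datasource.write.precombine.field", "updatedAt")
  .mode(SaveMode.Append)
  .save("hdfs:///data/hudi/user_events")
```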

As an organization, Hudi can help you build an efficient data lake, solve some of the most complex underlying storage management problems, and deliver data to data analysts, engineers, and scientists faster.

2. What goals does Hudi not intend to achieve?

Hudi is not designed for OLTP use cases, where you would typically use an existing NoSQL or RDBMS data store. Hudi cannot replace your in-memory analytics database (at least not yet!). Hudi supports near-real-time ingestion on the order of minutes, trading some latency for efficient batch processing. If you truly need sub-minute processing latency, use your favorite stream processing solution.

3. What is incremental processing? Why does Hudi keep talking about it?

Incremental processing was first introduced by Vinoth Chandar in an O'Reilly blog post, which describes most of this work. In purely technical terms, incremental processing simply means writing micro-batch programs in a streaming style. A typical batch job consumes all of its input every few hours and recomputes all of its output; a typical streaming job consumes new input continuously (every few seconds) and computes new or changed output. Although recomputing all output in batch mode may be simpler, it is wasteful and expensive. Hudi makes it possible to run the same batch pipeline in a streaming fashion, every few minutes.

Although we could call this stream processing, we prefer the term incremental processing, to distinguish it from pure stream processing pipelines built with Apache Flink, Apache Apex, or Apache Kafka Streams.
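
As a concrete sketch of incremental consumption (the path and begin instant are hypothetical; option keys as in recent Hudi releases), a downstream job can pull only the records committed after its last run:

```scala
// Incremental pull: read only records committed after the given instant,
// instead of rescanning the entire dataset on every run
val incrementalDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20190901000000")
  .load("hdfs:///data/hudi/user_events")

// The micro-batch then recomputes only the outputs affected by these changes
incrementalDF.createOrReplaceTempView("changed_events")
spark.sql("SELECT eventDate, count(*) FROM changed_events GROUP BY eventDate").show()
```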

4. What is the difference between the copy-on-write (COW) and merge-on-read (MOR) storage types?

Copy On Write (COW): this storage type lets clients ingest data in a columnar file format (currently Parquet). With the COW storage type, any new data written to the Hudi dataset goes into new Parquet files, and updating existing records rewrites the entire Parquet files that contain the affected records. Therefore, all write activity on such a dataset is bounded by Parquet write performance, and the larger the Parquet files, the longer data ingestion takes.

Merge On Read (MOR): this storage type lets clients quickly ingest data in a row-based format such as Avro. With the MOR storage type, any new data written to the Hudi dataset goes into new log/delta files that internally encode the data in Avro. A compaction process (configured to run inline or asynchronously) converts the log-file format into the columnar file format (Parquet).

The two formats provide two different views: the read-optimized view depends on the read performance of the columnar Parquet files, while the real-time view depends on the read performance of the columnar and/or log files.

Updating existing records results in either a) writing updates to log/delta files that correspond to a base Parquet file previously produced by compaction, or b) writing updates to log/delta files when no compaction has run yet. Therefore, all write activity on such a dataset is bounded by Avro/log-file write performance, which is much faster than Parquet (which requires copying on write). Reading the log/delta files is, however, more expensive than reading Parquet files, since a merge is required at read time.
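
As an illustrative sketch (names and paths hypothetical), the storage type is chosen per dataset at write time with a single option; in recent releases the key is hoodie.datasource.write.table.type:

```scala
// MERGE_ON_READ writes Avro log files and compacts them to Parquet later;
// COPY_ON_WRITE rewrites the affected Parquet files on every update
inputDF.write.format("hudi")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") // or "COPY_ON_WRITE"
  .option("hoodie.table.name", "user_events")
  .option("hoodie.datasource.write.recordkey.field", "eventId")
  .option("hoodie.datasource.write.partitionpath.field", "eventDate")
  .option("hoodie.datasource.write.precombine.field", "updatedAt")
  .mode(SaveMode.Append)
  .save("hdfs:///data/hudi/user_events")
```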

5. How to select a storage type for a workload?

Hudi's main goal is to provide an update capability that is several orders of magnitude faster than rewriting entire tables or partitions.

Select copy-on-write (COW) storage if the following conditions are met:

You are looking for a simple replacement for existing Parquet tables and do not need real-time data.

Your current workflow rewrites entire tables or partitions to handle updates, even though only a few files in each partition actually change.

You want operations to stay simple (no compaction to manage, etc.), with ingest/write performance bounded only by Parquet file size and the number of files affected by updates.

Your workload is well understood and does not see sudden bursts of large updates or inserts into older partitions. COW pays a merge cost at write time, so such sudden spikes can back up ingestion and interfere with normal ingest latency targets.

Select MOR storage if the following conditions are met:

You want data to be ingested and queryable as quickly as possible.

Your workload can have sudden spikes or changing patterns (for example, bulk updates to older transactions in an upstream database causing a flood of updates to old partitions on DFS). Asynchronous compaction helps amortize the write amplification caused by such spikes, while regular ingestion keeps up with incoming upstream changes; a configuration sketch follows this list.
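
Here is a sketch of the compaction knobs for such a MOR workload, reusing the hypothetical writer from section 4; the values are illustrative only and the keys follow recent Hudi releases:

```scala
// With inline compaction disabled, ingestion only appends log files and a
// separate asynchronous compaction job merges them into Parquet, so update
// spikes in old partitions do not block the ingest path
inputDF.write.format("hudi")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.table.name", "user_events")
  .option("hoodie.datasource.write.recordkey.field", "eventId")
  .option("hoodie.datasource.write.partitionpath.field", "eventDate")
  .option("hoodie.datasource.write.precombine.field", "updatedAt")
  .option("hoodie.compact.inline", "false")               // defer to async compaction
  .option("hoodie.compact.inline.max.delta.commits", "5") // threshold if inline is enabled
  .mode(SaveMode.Append)
  .save("hdfs:///data/hudi/user_events")
```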

No matter which storage type you choose, Hudi provides:

Snapshot isolation and atomic writes of record batches

Incremental pull

Deduplication capability

6. Is Hudi an analytical database?

A typical database has long-running servers that serve reads and writes. Hudi's architecture is different: it highly decouples reads from writes, so writes and queries/reads can be scaled independently to meet scaling challenges. Therefore, it may not always look like a database.

Nonetheless, Hudi is designed very much like a database and provides similar functionality (updates, change capture) and semantics (transactional writes, snapshot-isolated reads).

7. How to model the data stored in Hudi?

When writing data to Hudi, records are modeled as in a key-value store: you specify a key field (unique within a partition or across the entire dataset), a partition field (indicating the partition a key should be placed in), and preCombine/combine logic (specifying how to handle duplicate records within a batch of writes). This model enables Hudi to enforce primary-key constraints, just like on a database table. A sketch follows below.
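
In this sketch (field names hypothetical; assumes the SparkSession and implicits from the first example), two records in one batch share the same key, and the preCombine field decides which one survives:

```scala
// Two records with the same key ("u1") arrive in one batch; the record with
// the larger precombine value (updatedAt = 200) wins, so the dataset behaves
// as if it had a primary-key constraint on userId
val batch = Seq(
  ("u1", "2019-09-01", "alice@old.example.com", 100L),
  ("u1", "2019-09-01", "alice@new.example.com", 200L)
).toDF("userId", "eventDate", "email", "updatedAt")

batch.write.format("hudi")
  .option("hoodie.table.name", "users")
  .option("hoodie.datasource.write.recordkey.field", "userId")
  .option("hoodie.datasource.write.partitionpath.field", "eventDate")
  .option("hoodie.datasource.write.precombine.field", "updatedAt")
  .mode(SaveMode.Append)
  .save("hdfs:///data/hudi/users")
```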

When querying or reading data, Hudi simply presents itself as a json-like hierarchical table, of the kind everyone is used to querying with Hive/Spark/Presto over Parquet/JSON/Avro.
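
For example, a minimal snapshot read of the hypothetical dataset above through Spark SQL:

```scala
// Snapshot read: the Hudi dataset registers like any other Spark table
val usersDF = spark.read.format("hudi").load("hdfs:///data/hudi/users")
usersDF.createOrReplaceTempView("users")
spark.sql("SELECT userId, email FROM users WHERE eventDate = '2019-09-01'").show()
```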

8. Does Hudi support cloud storage / object stores?

In general, Hudi provides this capability on top of any Hadoop FileSystem implementation, so you can read and write datasets on cloud stores (Amazon S3, Microsoft Azure, or Google Cloud Storage). Hudi is also specifically designed to make building Hudi datasets on the cloud very easy, with features such as consistency checking for S3 and zero moves/renames of data files.
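
A sketch, assuming the hadoop-aws module is on the classpath and using a hypothetical bucket; compared to the HDFS examples above, only the path scheme and filesystem credentials change:

```scala
// Standard s3a credential settings (placeholder values)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<access-key>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<secret-key>")

// The writer itself is unchanged; Hudi sees S3 as just another FileSystem
batch.write.format("hudi")
  .option("hoodie.table.name", "users")
  .option("hoodie.datasource.write.recordkey.field", "userId")
  .option("hoodie.datasource.write.partitionpath.field", "eventDate")
  .option("hoodie.datasource.write.precombine.field", "updatedAt")
  .mode(SaveMode.Append)
  .save("s3a://my-bucket/hudi/users")
```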

9. Which versions of Hive/Spark/Hadoop does Hudi support?

As of September 2019, Hudi supports Spark 2.1+, Hive 2.x, and Hadoop 2.7+ (not Hadoop 3).

10. How does Hudi actually store data in a dataset?

At a high level, Hudi is based on an MVCC design: it writes data to versioned Parquet base files and to log files that contain changes against those base files. All files are stored under the dataset's partitioning scheme, which is very similar to the layout of Apache Hive tables on DFS.
