How to compare Apache Hudi with other similar systems


2025-04-26 Update. From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/01 Report

This article compares Apache Hudi with several similar systems and walks through the analysis in detail, in the hope of giving readers who face this question a simpler way to approach it.

Apache Hudi fills a large gap in processing data on a distributed file system (DFS) and coexists well with most other big data technologies. Still, it is useful to compare Hudi with a number of related systems, both to understand how Hudi fits into the current big data ecosystem and to understand the different trade-offs these systems make in their design.

Kudu

Apache Kudu is a storage system with goals similar to Hudi's: it provides real-time analytics on petabytes of data through support for upserts. A key difference is that Kudu also tries to serve as a data store for OLTP workloads, which Hudi does not aim to do. As a consequence, Kudu does not support incremental pulling (as of early 2017), whereas Hudi does, in order to enable incremental processing.
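To make "upsert" and "incremental pull" concrete, here is a minimal in-memory sketch. It is a toy model, not the Hudi or Kudu API: the class `ToyTable`, its methods, and the string commit times are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy in-memory stand-in for an upsertable table (hypothetical, not the Hudi API)."""
    rows: dict = field(default_factory=dict)     # record key -> (commit_time, value)
    commits: list = field(default_factory=list)  # commit times, in order

    def upsert(self, commit_time: str, records: dict) -> None:
        """Insert new keys and overwrite existing ones, stamping each with the commit time."""
        if self.commits and commit_time <= self.commits[-1]:
            raise ValueError("commit times must be strictly increasing")
        for key, value in records.items():
            self.rows[key] = (commit_time, value)
        self.commits.append(commit_time)

    def snapshot(self) -> dict:
        """Latest value of every record."""
        return {key: value for key, (_, value) in self.rows.items()}

    def incremental_pull(self, after_commit: str) -> dict:
        """Only the records written by commits later than `after_commit`."""
        return {key: value for key, (ts, value) in self.rows.items() if ts > after_commit}


table = ToyTable()
table.upsert("001", {"a": 1, "b": 2})
table.upsert("002", {"b": 20, "c": 3})  # updates b, inserts c

print(table.snapshot())               # {'a': 1, 'b': 20, 'c': 3}
print(table.incremental_pull("001"))  # {'b': 20, 'c': 3}
```

A downstream job that remembers the last commit it saw can ask only for what changed since then, instead of rescanning the whole table.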

Kudu departs from the distributed file system abstraction and from HDFS altogether: it has its own set of storage servers that communicate with each other via Raft. Hudi, in contrast, is designed to work with an underlying Hadoop-compatible file system (HDFS, S3 or Ceph) and does not have its own fleet of storage servers, relying instead on Apache Spark to do the heavy lifting. As a result, Hudi scales as easily as any other Spark job, while Kudu requires the hardware and operational support typical of data stores such as HBase or Vertica. So far, we have not run any direct benchmarks comparing Kudu and Hudi. However, going by the results shared by CERN, we would expect Hudi to perform better when ingesting Parquet files.

Hive Transactions

Hive Transactions/ACID is another similar effort, which tries to implement merge-on-read style storage on top of the ORC file format. Understandably, this feature is closely tied to Hive and to related efforts such as LLAP. Hive transactions do not provide the read-optimized storage option or the incremental pull that Hudi provides. In terms of implementation choices, Hudi takes full advantage of a processing framework such as Spark, while the Hive transactions feature is implemented underneath by Hive tasks/queries started by the user or by the Hive Metastore. Based on our production experience, embedding Hudi as a library in an existing Spark pipeline is much easier and operationally lighter than the other approach. Hudi is also designed to work with non-Hive engines such as Presto and Spark, and plans to introduce file formats other than Parquet.
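As a rough idea of what "embedding Hudi as a library" can look like from PySpark, the sketch below only builds the write options as a plain dictionary. The option keys are the Spark datasource keys used by recent Hudi releases as far as I know, so check them against the documentation for your version; the helper `hudi_write_options`, the table name, the column names and the path are hypothetical.

```python
def hudi_write_options(table_name, record_key, precombine_field, partition_field,
                       table_type="COPY_ON_WRITE"):
    """Build Spark datasource options for a Hudi upsert (hypothetical helper)."""
    if table_type not in ("COPY_ON_WRITE", "MERGE_ON_READ"):
        raise ValueError(f"unknown table type: {table_type}")
    return {
        "hoodie.table.name": table_name,
        # Column that uniquely identifies a record
        "hoodie.datasource.write.recordkey.field": record_key,
        # Column used to pick the latest version when keys collide
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": table_type,
    }


# Assumed example names: a "trips" table keyed by trip_id, partitioned by city.
opts = hudi_write_options("trips", "trip_id", "updated_at", "city")

# Inside an existing PySpark job (needs the Hudi Spark bundle on the classpath):
# df.write.format("hudi").options(**opts).mode("append").save("/tmp/hudi/trips")
```

The point of the comparison is that the write is one more step in a Spark job the team already runs, rather than a separate service to operate.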

HBase

Although HBase is ultimately a key-value store for OLTP workloads, users often associate it with analytics because of its proximity to Hadoop. Because HBase is heavily write-optimized, it supports sub-second upserts out of the box, and Hive-on-HBase lets users query that data. However, for analytical workloads, which are predominantly read-heavy, hybrid columnar storage formats such as Parquet/ORC easily outperform HBase. Hudi bridges this gap between faster data and analytical storage formats. From an operational perspective, giving users a library that delivers faster data is more scalable than managing a large cluster of HBase region servers just for analytics. Finally, HBase does not treat incremental processing primitives such as commit times and incremental pull as first-class features the way Hudi does.

Stream Processing

A common question is "How does Hudi relate to stream processing systems?", which we will try to answer here. In short, Hudi can be integrated with today's batch (copy-on-write storage) and streaming (merge-on-read storage) jobs to store the results of computations in Hadoop. For Spark applications, this can be done by integrating the Hudi library directly with the Spark/Spark Streaming DAG. For non-Spark processing systems (e.g. Flink, Hive), the processing can be done in the respective system and the results then sent to a Hudi table via a Kafka topic or an intermediate DFS file. Conceptually, a data processing pipeline consists of only three parts: source, processing, and sink, and users ultimately run queries against the sink to consume the pipeline's results. Hudi can act as either a source or a sink that stores data on DFS. The suitability of Hudi for a given stream processing pipeline ultimately comes down to whether Presto/SparkSQL/Hive suit your queries.
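The difference between copy-on-write and merge-on-read (the two storage types named above) can be sketched with two toy classes. This is only a conceptual model under simplifying assumptions (one in-memory "file", dictionaries instead of Parquet and log files); the class and method names are hypothetical and are not Hudi's API.

```python
class CopyOnWriteTable:
    """Toy model: every upsert rewrites the base data, so reads are a plain lookup."""

    def __init__(self):
        self.base = {}
        self.rewrites = 0

    def upsert(self, records):
        merged = dict(self.base)  # copy the existing data...
        merged.update(records)    # ...and apply the changes at write time
        self.base = merged
        self.rewrites += 1

    def read(self):
        return dict(self.base)


class MergeOnReadTable:
    """Toy model: upserts are appended to a log and merged with the base at read time."""

    def __init__(self):
        self.base = {}
        self.log = []

    def upsert(self, records):
        self.log.append(dict(records))  # cheap write: just append a delta

    def read(self):
        merged = dict(self.base)
        for delta in self.log:          # the merge cost is paid by the reader
            merged.update(delta)
        return merged

    def compact(self):
        """Fold the log into the base so later reads are cheap again."""
        self.base = self.read()
        self.log = []


cow, mor = CopyOnWriteTable(), MergeOnReadTable()
for batch in ({"a": 1, "b": 2}, {"b": 20}, {"c": 3}):
    cow.upsert(batch)
    mor.upsert(batch)

print(cow.read() == mor.read())     # True
print(cow.rewrites, len(mor.log))   # 3 3
mor.compact()
print(len(mor.log), mor.read())     # 0 {'a': 1, 'b': 20, 'c': 3}
```

Both tables return the same data; they differ in where the merge cost lands. That is one way to see why the rewrite-on-write variant tends to pair with batch jobs and the append-then-merge variant with frequent streaming writes.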

More advanced use cases revolve around the concept of incremental processing, in which Hudi is used even inside the processing engine to speed up typical batch pipelines. For example, Hudi can be used as a state store within a DAG (similar to the RocksDB state backend used by Flink: https://ci.apache.org/projects/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend). This is an item on the roadmap and will eventually take the form of a Beam Runner.
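As an illustration of why incremental processing can speed up a batch pipeline, the sketch below keeps a per-user total up to date by folding in only the events committed since the last checkpoint, and compares the result with a full recompute. It is a toy model with append-only events; the function names and the `(commit_time, user, amount)` tuple layout are assumptions for the example.

```python
from collections import defaultdict


def full_recompute(events):
    """Batch style: scan every event on every run."""
    totals = defaultdict(int)
    for _, user, amount in events:
        totals[user] += amount
    return dict(totals)


def pull_since(events, last_commit):
    """Stand-in for an incremental pull: only events committed after `last_commit`."""
    return [e for e in events if e[0] > last_commit]


def apply_increment(totals, new_events, last_commit):
    """Incremental style: fold only the new events into the previous result."""
    updated = dict(totals)
    checkpoint = last_commit
    for commit_time, user, amount in new_events:
        updated[user] = updated.get(user, 0) + amount
        checkpoint = max(checkpoint, commit_time)
    return updated, checkpoint


events = [("001", "alice", 10), ("001", "bob", 5)]
totals, checkpoint = apply_increment({}, pull_since(events, "000"), "000")

events.append(("002", "alice", 7))  # a later commit arrives
totals, checkpoint = apply_increment(totals, pull_since(events, checkpoint), checkpoint)

print(totals, checkpoint)                # {'alice': 17, 'bob': 5} 002
print(totals == full_recompute(events))  # True
```

In this toy, `pull_since` still filters a Python list; in a real table the storage layer would be expected to hand back only the changed records, which is where the savings come from.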

Iceberg & Delta

For a comparison with Iceberg and Delta, see the following comparison chart (from the Qubole Tech Blog, current as of September 2019).

The Hudi community does not want to document Hudi's differences from Iceberg and Delta, both of which are open source frameworks for data lakes, because doing so might make developers feel that Hudi is not neutral. To maintain a more neutral position, the community prefers to leave this comparison to developers and let them choose the framework that suits them.

