What are the typical application scenarios of Apache Hudi

This article explains in detail the typical application scenarios of Apache Hudi. It is shared here as a reference; after reading it, you should have a solid understanding of where Hudi fits.

1. Near Real-Time Ingestion

Extracting data from external sources such as event logs and databases into a Hadoop data lake is a common problem. In most Hadoop deployments, it is solved piecemeal with a mix of ingestion tools, even though this data is among the most valuable in the organization.

For RDBMS ingestion, Hudi provides faster loading through upserts rather than expensive, inefficient bulk loads. For example, you can read the MySQL binlog or a Sqoop incremental import and apply the changes to a Hudi table on DFS, which is faster and more efficient than a batch merge job or a complex handcrafted merge workflow, as the sketch below illustrates.
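The following PySpark sketch shows this upsert-based write path. The staging path, table name, and key/precombine fields are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch: upsert-based ingestion into a Hudi table on DFS.
# Paths, table name, and field names below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-upsert-ingest")
         # Hudi requires Kryo serialization and its Spark bundle on the classpath
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# One row per change, e.g. parsed from the MySQL binlog or a Sqoop
# incremental import staged on DFS (hypothetical path).
changes_df = spark.read.parquet("/staging/mysql/users_changes")

hudi_options = {
    "hoodie.table.name": "users",
    "hoodie.datasource.write.recordkey.field": "user_id",      # primary key
    "hoodie.datasource.write.precombine.field": "updated_at",  # newest version wins
    "hoodie.datasource.write.operation": "upsert",             # merge, not bulk load
}

# "append" save mode plus operation=upsert merges changes into existing files.
(changes_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/data/lake/users"))
```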

For NoSQL stores like Cassandra, Voldemort, or HBase, even moderately sized clusters store billions of rows, so full bulk loads are simply infeasible; more efficient approaches are needed if ingestion is to keep pace with the typically high volume of updates.

Even for immutable data sources like Kafka, Hudi helps enforce a minimum file size on DFS, which addresses the age-old small-files problem in the Hadoop world and improves NameNode health. This is especially important for event streams (such as clickstreams), which are typically high-volume and can severely degrade Hadoop cluster performance if poorly managed.
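File sizing is controlled by a pair of table-level configs; the sketch below shows the two main knobs with illustrative values (exact defaults vary by Hudi version).

```python
# Hudi file-sizing knobs (values illustrative). During writes, Hudi routes
# incoming records into existing files below the small-file limit, so files
# on DFS converge toward the target size instead of proliferating.
file_sizing_options = {
    "hoodie.parquet.max.file.size": str(128 * 1024 * 1024),    # target ~128 MB per file
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024), # pad files under ~100 MB
}

# These can be merged into the write options from the ingestion example above.
hudi_options.update(file_sizing_options)
```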

For all data sources, Hudi publishes new data to consumers atomically via commits, shielding downstream consumers from partial ingestion failures.

2. Near Real-Time Analytics

Real-time data marts are often backed by specialized analytics stores such as Druid, MemSQL, or even OpenTSDB. This is a perfect fit for smaller-scale data (relative to a full Hadoop installation) that needs sub-second query responses, such as system monitoring or interactive real-time analytics. But because data on Hadoop is intolerably stale, these systems often end up being abused for less interactive queries as well, resulting in underutilization and wasted hardware and license costs.

Interactive SQL engines on Hadoop, on the other hand, such as Presto and SparkSQL, can complete queries within a few seconds. By reducing data update latency to a few minutes, Hudi provides an efficient alternative and enables near real-time analytics over several much larger tables stored on DFS. In addition, Hudi has no external dependencies (such as an HBase cluster dedicated to real-time analytics), enabling faster analysis of fresher data without added operational overhead.
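As a sketch of what this looks like in practice, the Hudi table from the ingestion example can be queried directly with Spark SQL. The table path and column names are assumptions carried over from above.

```python
# Query the Hudi table like any other Spark SQL table; the snapshot read
# reflects data committed within the last few minutes.
users_df = spark.read.format("hudi").load("/data/lake/users")
users_df.createOrReplaceTempView("users")

spark.sql("""
    SELECT country, COUNT(*) AS weekly_active_users
    FROM users
    WHERE last_login >= date_sub(current_date(), 7)
    GROUP BY country
    ORDER BY weekly_active_users DESC
""").show()
```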

3. Incremental Processing Pipelines

One of the basic capabilities Hadoop provides is building chains of derived tables and expressing the entire workflow as a DAG. Workflows often depend on new data output by multiple upstream workflows, and traditionally the appearance of a new DFS folder/Hive partition signals that new data is available. For example, an upstream workflow U may create a Hive partition every hour, containing that hour's data (event_time) by the end of the hour (processing_time), providing 1 hour of data freshness. A downstream workflow D then kicks off immediately after U completes and spends the next hour on its own processing, pushing the effective delay to 2 hours.

The example above ignores late-arriving data, i.e. cases where processing_time and event_time drift apart. Unfortunately, in today's post-mobile and pre-IoT world, late data arrival is very common. In such cases, the only way to guarantee correctness is to reprocess the last few hours of data every hour, which seriously harms the efficiency of the entire ecosystem: imagine reprocessing terabytes of data every hour across hundreds of workflows.

Hudi solves this problem well: a downstream job can consume new data from an upstream Hudi table HU at record granularity (rather than folder or partition granularity), apply its processing logic, and update/reconcile late-arriving data in a downstream Hudi table HD. Both jobs can then be scheduled at a much higher frequency (e.g., every 15 minutes), providing an end-to-end latency of 30 minutes on HD, as sketched below.
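A minimal sketch of record-granularity incremental consumption, assuming hypothetical table paths and field names; the begin instant time would normally come from a checkpoint persisted by the downstream job.

```python
# Incremental query: read only the records committed to HU after the
# given instant time, instead of rescanning whole folders/partitions.
incremental_df = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")  # from last checkpoint
    .load("/data/lake/HU"))

processed_df = incremental_df  # downstream processing logic goes here

# Upsert into HD so late-arriving records update earlier results
# rather than forcing a full reprocess of recent partitions.
(processed_df.write.format("hudi")
    .option("hoodie.table.name", "HD")
    .option("hoodie.datasource.write.recordkey.field", "key")   # assumed key field
    .option("hoodie.datasource.write.precombine.field", "ts")   # assumed ordering field
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("/data/lake/HD"))
```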

To achieve this, Hudi borrowed concepts from stream processing frameworks such as Spark Streaming, publish/subscribe systems such as Kafka, and database replication technologies such as Oracle XStream. For interested readers, a more detailed discussion of the advantages of incremental processing over stream and batch processing can be found in the Hudi documentation.

4. Data Distribution on DFS

A classic application of Hadoop is to process data and then distribute it to online storage for applications to serve; for example, a Spark pipeline that loads Hadoop data into ElasticSearch for the Uber application to use. A typical architecture inserts a queue between Hadoop and the serving store to decouple the two and avoid overwhelming the target store. Kafka is usually chosen as that queue, which results in the same data being stored redundantly on both DFS (for offline analysis of computed results) and in Kafka (for dispersal).

Hudi can solve this problem effectively as well: the Spark pipeline upserts its output into a Hudi table, which is then read incrementally (much like a Kafka topic) to obtain the new data and write it to the serving store, i.e., using Hudi as the single unified storage layer. A sketch follows.
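A hedged sketch of this dispersal loop, assuming the elasticsearch-hadoop Spark connector is available and using illustrative paths and index names; the last synced instant would be persisted by the sync job itself.

```python
# Use the Hudi table itself as the "queue": pull only the rows committed
# since the last sync, then push them to the serving store.
last_synced_instant = "20240101000000"  # illustrative; persisted by the sync job

new_rows = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_synced_instant)
    .load("/data/lake/results"))

# Write to ElasticSearch via the elasticsearch-hadoop connector (assumed
# to be on the classpath); "results" is a hypothetical target index.
(new_rows.write
    .format("org.elasticsearch.spark.sql")
    .option("es.resource", "results")
    .mode("append")
    .save())
```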

That covers the typical application scenarios of Apache Hudi. I hope the content above is helpful and gives you something new to learn. If you found the article worthwhile, feel free to share it so that more people can see it.
