How Uber Built a PB-Scale Data Lake with Apache Hudi


This article looks at how Uber built a PB-scale data lake on Apache Hudi and what other organizations can learn from that experience.

1. Introduction

From ensuring accurate estimated arrival times to predicting optimal traffic routes, providing a safe, seamless transportation and delivery experience on the Uber platform requires reliable, high-performance, large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power its business-critical data pipelines with low latency and high efficiency. A year later we open-sourced the solution so that other organizations could take advantage of Hudi, and in 2019 we went a step further and donated it to the Apache Software Foundation. About a year and a half after that, Apache Hudi graduated to become a top-level Apache Software Foundation project. To mark this milestone, we would like to share the story of building, releasing, optimizing, and graduating Apache Hudi for the benefit of the broader big data community.

2. What is Apache Hudi?

Apache Hudi is a storage abstraction framework that helps organizations build and manage PB-scale data lakes. By using primitives such as upsert and incremental pull, Hudi brings stream-style processing to batch-oriented big data. These features, together with a unified serving layer with data latency on the order of a few minutes, help us serve fresher data faster while avoiding the overhead of maintaining multiple systems. Apache Hudi can also run flexibly on the Hadoop Distributed File System (HDFS) or on cloud storage.
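
To make the upsert primitive concrete, here is a minimal sketch (not from the original article) using the open-source Hudi Spark datasource from PySpark; the table name, field names, and base path are hypothetical, and a real job would point at HDFS or cloud storage.

```python
# Minimal PySpark sketch of a Hudi upsert; assumes Spark was launched with the Hudi
# Spark bundle, e.g. spark-submit --packages org.apache.hudi:hudi-spark-bundle_2.12:<version>
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-upsert-sketch")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

base_path = "hdfs:///tmp/hudi/trips"   # hypothetical base path
updates = spark.createDataFrame(
    [("trip-001", "2020-06-01 10:05:00", "COMPLETED")],
    ["trip_id", "ts", "status"])

hudi_options = {
    "hoodie.table.name": "trips",                          # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "trip_id",  # unique record key
    "hoodie.datasource.write.precombine.field": "ts",      # latest ts wins for duplicate keys
    "hoodie.datasource.write.operation": "upsert",         # insert-or-update semantics
}

# Records with an existing trip_id are updated in place; new trip_ids are inserted.
updates.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```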

Hudi brings atomicity, consistency, isolation, and durability (ACID) semantics to the data lake. Its two most widely used features are upserts and incremental pull, which let users capture change data and apply it to the data lake. To support this, Hudi provides a pluggable index mechanism along with custom index implementations. Hudi can also control and manage the file layout in the data lake, which not only helps work around the limitations of the HDFS NameNode and of cloud storage, but is also important for keeping the data ecosystem healthy by improving reliability and query performance. In addition, Hudi supports multiple query engines, such as Presto, Apache Hive, Apache Spark, and Apache Impala.

Figure 1. Apache Hudi ingests change logs, events, and incremental streams, and serves different application scenarios by exposing different views on the table

Overall, Hudi is conceptually divided into three main components: the raw data to be stored, the index data that provides upsert capability, and the metadata used to manage the dataset. At its core, Hudi maintains a timeline of all actions performed on the table at different points in time; in Hudi these are called instants, and they provide an instantaneous view of the table while efficiently supporting retrieval of data in arrival order. Hudi guarantees that operations on the timeline are atomic and, based on instant times, consistent with the time the changes were made in the database. With this information, Hudi exposes different views of the same Hudi table: a read-optimized view for fast columnar query performance, a real-time view for fast data ingestion, and an incremental view for reading the Hudi table as a change-log stream, as shown in Figure 1 above.

Hudi organizes data tables into a directory structure under a base path (basepath) on the distributed file system. Tables are divided into partitions, and within each partition the files are organized into file groups, each uniquely identified by a file ID. Each file group contains several file slices, and each slice contains a base data file (*.parquet) produced at a particular commit/compaction instant, along with a set of log files (*.log) containing inserts and updates against that base file. Hudi uses Multi-Version Concurrency Control (MVCC): compaction merges log and base files to produce new file slices, while cleaning removes unused or older file slices to reclaim space on the file system.
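
As an aside not in the original article, the layout described above can be inspected directly; the short sketch below walks a hypothetical local base path and groups base and log files by partition and file ID. On HDFS or cloud storage the structure is the same, with the timeline metadata kept under <basepath>/.hoodie.

```python
# Sketch: group a Hudi table's data files into file groups by (partition, file ID).
# Hudi base file names start with the file ID (e.g. <fileId>_<writeToken>_<instant>.parquet),
# and log files are hidden files containing ".log." in the name; the parsing below is
# a simplification of that convention.
import os
from collections import defaultdict

base_path = "/tmp/hudi/trips"   # hypothetical local base path

file_groups = defaultdict(list)
for root, _, files in os.walk(base_path):
    if ".hoodie" in root:
        continue                 # skip timeline metadata
    for name in files:
        if name.endswith(".parquet") or ".log." in name:
            file_id = name.lstrip("._").split("_")[0]
            partition = os.path.relpath(root, base_path)
            file_groups[(partition, file_id)].append(name)

for (partition, file_id), names in sorted(file_groups.items()):
    print(f"partition={partition} file_group={file_id} files={len(names)}")
```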

Hudi supports two table types: copy-on-write and merge-on-read. Copy-on-write tables store data exclusively in a columnar file format (for example, Apache Parquet). With copy-on-write, updates simply version and rewrite the affected files by performing a synchronous merge during the write.

Merge-on-read tables store data using a combination of columnar (for example, Apache Parquet) and row-based (for example, Apache Avro) file formats. Updates are recorded in delta files and later compacted, synchronously or asynchronously, to produce new versions of the columnar files.
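
For illustration only (the option names come from the open-source Hudi Spark datasource, not from the article), the table type is chosen at write time; the names below are hypothetical and reuse the write pattern from the earlier sketch.

```python
# Sketch: selecting copy-on-write vs. merge-on-read when writing with the Hudi datasource.
cow_options = {
    "hoodie.table.name": "trips_cow",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",  # columnar-only; files rewritten on update
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "ts",
}

mor_options = {
    "hoodie.table.name": "trips_mor",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # parquet base files plus row-based logs
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "ts",
}

# e.g. updates.write.format("hudi").options(**mor_options).mode("append").save(base_path)
```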

Hudi also supports two query types: snapshot queries and incremental queries. A snapshot query requests a "snapshot" of the table as of a given commit or compaction action. For snapshot queries, a copy-on-write table exposes only the base/columnar files in the latest file slices and guarantees the same columnar query performance as a non-Hudi columnar table; copy-on-write therefore serves as a drop-in replacement for existing Parquet tables while adding upserts, deletes, and other features. For merge-on-read tables, snapshot queries expose near-real-time data (within minutes) by merging the base and delta files of the latest file slice on the fly. For copy-on-write tables, incremental queries provide the new data and the stream of changes written to the table since a given commit or compaction, enabling incremental data pipelines.
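
As a hedged illustration (again using the open-source Spark datasource, with hypothetical paths and instant times), snapshot and incremental queries differ only in the read options:

```python
# Sketch: snapshot vs. incremental reads against the same Hudi table.

# Snapshot query (the default): the latest committed state of the table.
snapshot_df = spark.read.format("hudi").load(base_path)

# Incremental query: only records changed after a given commit instant on the timeline.
begin_instant = "20191201000000"   # hypothetical commit time (yyyyMMddHHmmss)
incremental_df = (spark.read.format("hudi")
                  .option("hoodie.datasource.query.type", "incremental")
                  .option("hoodie.datasource.read.begin.instanttime", begin_instant)
                  .load(base_path))

# Hudi adds metadata columns such as _hoodie_commit_time that incremental pipelines can use.
incremental_df.select("_hoodie_commit_time", "trip_id", "status").show()
```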

3. How Apache Hudi is used at Uber

At Uber, we use Hudi in a wide range of scenarios, from providing fast, accurate data about trips on the Uber platform, to detecting fraud, to powering restaurant and dish recommendations on our UberEats platform. To show how Hudi works, let's walk through how we keep trip data in the Uber Marketplace fresh on the data lake, thereby improving the experience for riders and drivers on the Uber platform. The typical life cycle of a trip begins when a rider requests it, continues as the trip progresses, and ends when the trip completes and the rider reaches the final destination. Uber's core trip data is stored as tables in Schemaless, Uber's scalable datastore. A single trip entry may receive many updates over the life cycle of the trip. Before Uber adopted Hudi, large Apache Spark jobs periodically rewrote the entire dataset to HDFS to pick up upstream inserts, updates, and deletions from the online tables and reflect changes in trip status. For context, in early 2016 (before Hudi was built), some of our largest jobs used over 1,000 executors and processed more than 20 TB of data; this process was not only inefficient but also hard to scale. Teams across the company rely on fast, accurate data analysis to deliver a high-quality user experience, and our existing solutions could not scale to support incremental processing on the data lake. With the snapshot-and-reload approach to moving data into HDFS, these inefficiencies leaked into all of our data pipelines, including the downstream ETL jobs that consumed this raw data, and we could see the problems only getting worse as we scaled.

With no other viable open-source solution available, we built and launched Hudi for Uber at the end of 2016 to create a transactional data lake that supports fast, reliable data updates at scale. Uber's first generation of Hudi used the copy-on-write table type exclusively, which increased job processing speed to 20 GB every 30 minutes and reduced write amplification by 100x. By the end of 2017, all of Uber's raw data tables were stored in Hudi format, forming one of the largest transactional data lakes on the planet.

Figure 2. Hudi's copy-on-write capability enables file-level updates, greatly improving data freshness

4. Improving Apache Hudi

As Uber's data processing and storage requirements grew, we began to run into the limitations of Hudi's copy-on-write capability, mainly because we needed to keep improving processing speed and data freshness. Even with copy-on-write, the updates received by some of our tables were spread across 90% of their files, so any given large table in the data lake had to rewrite on the order of 100 TB of data. Because copy-on-write rewrites an entire file even for a single modified record, it leads to high write amplification and reduced freshness, causing unnecessary I/O on the HDFS cluster and rapid consumption of disk space. In addition, more table updates mean more file versions and a surge in the number of HDFS files, which in turn destabilizes the HDFS NameNode and raises compute costs.

To address these growing concerns, we implemented our second table type, merge-on-read. Because merge-on-read serves near-real-time data by merging data on the fly, we need to use this mode carefully to avoid pushing compute costs onto the query side. Our merge-on-read deployment model consists of three independent jobs: an ingestion job that brings in new data consisting of inserts, updates, and deletions; a minor compaction job that asynchronously and aggressively compacts the small number of updates/deletions in the latest partitions; and a major compaction job that slowly and steadily compacts the updates/deletions in the large number of older partitions. Each of these jobs runs at a different frequency, with the minor compaction and ingestion jobs running more often than the major compaction job, so that data in the latest partitions is quickly available in columnar format. With this deployment model we can serve fresh data in columnar form to thousands of queries while limiting query-side merge costs to the most recent partitions. Merge-on-read solved all three of the problems described above, and our Hudi tables became largely insensitive to the volume of updates and deletions hitting the data lake. Today at Uber we use both Apache Hudi's copy-on-write and merge-on-read capabilities, depending on the scenario.
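
The article does not show Uber's internal three-job configuration; as a rough open-source analog (option names from the Hudi documentation, values hypothetical), compaction of a merge-on-read table can be made frequent for recent data like this:

```python
# Rough analog of keeping recent merge-on-read data compacted into columnar files.
# This is NOT Uber's internal setup; it only shows the open-source knobs involved.
mor_ingest_options = {
    "hoodie.table.name": "trips_mor",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Schedule and run a compaction after every N delta commits so that the newest
    # partitions are rewritten into base (parquet) files soon after ingestion.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}

# e.g. updates.write.format("hudi").options(**mor_ingest_options).mode("append").save(base_path)
```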

Figure 3. Uber's Apache Hudi team developed a data compaction strategy for merge-on-read tables that frequently converts the most recent partitions to columnar format, reducing query-side compute costs

With Hudi, Uber ingests more than 500 billion records per day into a data lake of over 150 PB, using more than 30,000 cores across more than 10,000 tables and thousands of data pipelines, and Hudi serves more than 1 million queries per week across our various services.

5. Lessons learned from Apache Hudi

Uber open-sourced Hudi in 2017 to bring the benefits of the solution to others who need to ingest and manage data storage at scale, bringing stream processing to big data. As Hudi graduated to a top-level project under the Apache Software Foundation, our big data team reflected on the questions that prompted us to build Hudi in the first place, including:

How can we store and process data more efficiently?

How can we ensure our data lake contains high-quality tables?

How can we continue to provide low-latency data at scale, efficiently, as the business grows?

How can we unify the serving layer in scenarios that require minute-level latency?

Without good standardization and primitives, a data lake quickly becomes an unusable "data swamp." Such a swamp not only costs a great deal of time and resources to reconcile, clean, and repair tables, but also forces service owners to build complex algorithms for tuning, reorganizing, and transacting on their data, adding unnecessary complexity to the technology stack.

As described above, Hudi fills these gaps by helping users take control of their data lakes, seamlessly ingesting and managing large analytical datasets on distributed file systems. Building a data lake is a multifaceted problem that requires trade-offs among data standardization, storage technology, file management practices, and the performance balance between data ingestion and data querying. When we talked to other members of the big data community while building Hudi, we learned that these problems are common across many engineering organizations. We hope that the past few years of open-source work and collaboration with the Apache community around Hudi have given others better insight into big data operations across different industries. Beyond Uber, Apache Hudi is used in production at a number of companies, including Aliyun, Tencent Cloud, AWS, Udemy, and others.

6. Plan for the future

Figure 4. Apache Hudi scenarios include data analysis and infrastructure health monitoring

Hudi provides high-quality insights by enforcing schema on datasets, helping users build more robust and fresher data lakes.

At Uber, operating one of the largest transactional data lakes in the world gives us opportunities for a wide variety of Apache Hudi use cases, and because solving problems and improving efficiency at this scale has a significant impact, there is a direct incentive to go deeper. At Uber we have used advanced Hudi primitives such as incremental pull to build chained incremental pipelines, reducing the compute footprint of jobs that would otherwise perform large scans and writes. We tune the compaction strategy of our merge-on-read tables to fit specific use cases and requirements. In the months since we donated Hudi to the Apache Foundation, Uber has contributed features such as an embedded timeline service for efficient file-system access, the removal of renames to support cloud-friendly deployments, and improved incremental pull performance.

In the coming months, Uber plans to contribute more new features to the Apache Hudi community. Some of these features help reduce costs by optimizing compute usage and improving the performance of data applications; we will also look more closely at how to improve storage management and query performance based on access patterns and the needs of data applications.

For more detail on how we plan to achieve these goals, you can read the RFCs that Uber's Hudi team has proposed to the Apache community, including smart metadata that supports column indexes and O(1) query planning, efficient bootstrapping of Parquet tables into Hudi, and record-level index support for faster inserts.

With Apache Hudi's graduation as a top-level Apache project, we are pleased to be contributing to the project's ambitious roadmap. Hudi enables Uber and other companies to future-proof the speed, reliability, and transactional capabilities of their data lakes using open-source file formats, eliminating many of the challenges of big data and enabling rich, portable data applications.

On "Uber based on Apache Hudi how to build PB-level data lake practice" is introduced here, more related content can search previous articles, hope to help you answer questions, please support the website!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 258

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report