What Is Apache Hudi's Solution for the Cloud Data Lake?


2026-02-08 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

What does Apache Hudi offer as a solution for the cloud data lake? This article summarizes the project's background, how organizations such as Uber and Kyligence use it, and where its development is heading.

1. Introduction

The open source Apache Hudi project brings stream processing capabilities to the data lake, which large organizations such as Uber use to process billions of records every day.

As organizations around the world adopt this technology, the Apache open source data lake project has matured.

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a data lake project that enables stream processing of data on Apache Hadoop-compatible cloud storage systems, including Amazon S3 and Aliyun OSS.
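
The project's name lists the operations it is built around. As a rough illustration only, the following plain-Python sketch models a table addressed by a record key and applies upserts and deletes to it. The `ToyTable` class and the `trip_id` and `fare` fields are hypothetical and are not part of Hudi's API; a real Hudi table keeps its data as files on storage such as S3 or HDFS.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a keyed data lake table."""
    key_field: str
    rows: Dict[Any, dict] = field(default_factory=dict)

    def upsert(self, records: Iterable[dict]) -> None:
        # Insert new keys and overwrite existing ones in a single operation.
        for record in records:
            self.rows[record[self.key_field]] = dict(record)

    def delete(self, keys: Iterable[Any]) -> None:
        # Remove records by key; unknown keys are ignored.
        for key in keys:
            self.rows.pop(key, None)


trips = ToyTable(key_field="trip_id")
trips.upsert([{"trip_id": 1, "fare": 10.0}, {"trip_id": 2, "fare": 7.5}])
# The second batch updates trip 2 and inserts trip 3.
trips.upsert([{"trip_id": 2, "fare": 8.0}, {"trip_id": 3, "fare": 12.0}])
trips.delete([1])
print(sorted(trips.rows))  # [2, 3]
```

The point of the sketch is that a writer does not need to know whether a key already exists: an upsert handles both cases, which is what makes it practical to apply a stream of changes to a table.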

The project was first developed at Uber in 2016, was open sourced in 2017, and entered the Apache Incubator in January 2019. As an open source project, Hudi has been adopted by major technology companies such as Alibaba, Tencent, AWS, Uber and Kyligence.

On June 4, 2020, Hudi (pronounced "Hoodie") officially became a top-level project of the Apache Software Foundation (ASF), a milestone indicating that the project had reached a high level of code maturity and developer community participation. The ASF is home to Hadoop, Spark, Kafka and other widely used database and data management projects.

2. How Hudi powers Uber's cloud data lake

Hudi is now an open source project used by many organizations, and Uber remains a committed user.

Tanvi Kothari, data engineering manager at Uber, said Uber uses Hudi to process more than 500 billion records in its 150 PB data lake every day.

Kothari leads Uber's Global Data Warehouse team, which is responsible for providing core data tables for all Uber businesses. She pointed out that Hudi lets Uber incrementally process reads and writes for more than 10,000 tables and thousands of data pipelines.

"Hudi removes many of the challenges in dealing with big data," says Kothari. "It can help you expand the ETL [extract, transform, load] pipeline and improve data fidelity."

3. Hudi as the cornerstone of cloud data lake analytics

Big data analytics vendor Kyligence Solutions uses Apache Hudi as part of its product. Shi Shaofeng, partner and chief architect of Kyligence, which has offices in Shanghai, China and San Jose, Calif., says his company uses many Apache open source projects, including Apache Kylin, Hadoop and Spark, to help companies manage their data.

Shi said that Apache Hudi gives Kyligence a way to manage changing datasets directly on the Hadoop Distributed File System (HDFS) or Amazon S3.

Kyligence began using Hudi for US customers in 2019, around the time AWS introduced Hudi integration in its Amazon Elastic MapReduce (EMR) service. The Kyligence Cloud service now also supports Hudi as a data source format for online analytical processing (OLAP) for all its users.

Shi said he was pleased to see Hudi graduate to an Apache top-level project. "Hudi has an open and enthusiastic community that has even translated a series of Hudi articles into Chinese, making it easier for Chinese users to understand the technology," he said.

4. How Hudi enables stream processing on the cloud data lake

Vinoth Chandar, Apache Hudi co-founder and VP of the project at the ASF, says Hudi provides the ability to consume data streams and enables users to update datasets.

Chandar describes Hudi-enabled stream processing as a style of data processing in which data lake administrators process data incrementally and then make that data available for use.
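
To make the idea of incremental processing concrete, here is a minimal plain-Python sketch, not Hudi code. It assumes a hypothetical commit log in which every write is tagged with an increasing commit time, so a downstream job can ask only for the records that changed after the last commit it processed. The `timeline` structure and the `changes_since` function are illustrative names invented for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical commit log: (commit_time, {record_key: value}) in commit order.
Commit = Tuple[int, Dict[str, int]]


def changes_since(timeline: List[Commit], last_seen: int) -> Dict[str, int]:
    """Return the latest value of every key changed after `last_seen`."""
    changed: Dict[str, int] = {}
    for commit_time, records in timeline:
        if commit_time > last_seen:
            changed.update(records)  # later commits win for the same key
    return changed


timeline: List[Commit] = [
    (1, {"a": 10, "b": 20}),
    (2, {"b": 25}),
    (3, {"c": 30, "a": 11}),
]

# A downstream job that already processed commit 1 reads only newer changes.
delta = changes_since(timeline, last_seen=1)
print(sorted(delta.items()))  # [('a', 11), ('b', 25), ('c', 30)]
```

Compared with rescanning the whole table on every run, the downstream job here only touches the records written in commits 2 and 3, which is the kind of saving that matters for pipelines over very large tables.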

"A good way to really think about Hudi is as a data store or database that provides transaction capabilities on top of data stored in [AWS] S3, [Aliyun] OSS," says Chandar.

Chandar went on to say that Hudi's status as a top-level project also reflects the maturity of the project. However, although Hudi is now an Apache top-level project, it has not yet reached version 1.0. The latest release at the time of graduation was the 0.5.2 milestone, released on March 25 (version 0.5.3 was released after graduation).

Hudi developers are currently working on version 0.6.0, which Chandar says will be released at the end of June. Chandar said the release will be an important milestone, with performance enhancements and improved data migration capabilities to help users bring data into a Hudi data lake. "Our plan is to release at least one major version every quarter, and then we want to release a bugfix version on top of the major version every month," he said.
