Many readers are unsure how to build a high-performance data lake with Apache Hudi and Alluxio. This article summarizes how T3 Travel did it: the problems that motivated the architecture and the solutions adopted. We hope it helps you solve the same problem.
1. Overview of the T3 Travel data lake
T3 Travel is still in a period of business expansion. Before the data lake was built, different business lines chose their own storage systems, transmission tools, and processing frameworks. The result was serious data silos, which made mining the value of the data very costly. As the business grew rapidly, this inefficiency became an engineering bottleneck.
We turned to a unified data lake solution based on Alibaba Cloud OSS (an object store similar to AWS S3), following the design principle of a multi-cluster, shared-data architecture, to provide a central location for storing structured and unstructured data. In contrast to the separate data silos, all applications access the OSS storage as the single source of truth. This architecture lets us store data as-is, without having to structure it first, and run different types of analysis on it, from big data processing and real-time analytics to machine learning, to build dashboards and visualizations that guide better decision-making.
2. Efficient near real-time analysis using Hudi
T3 Travel's intelligent mobility services drive the need for near real-time data processing and analysis. With a traditional data warehouse, we faced the following challenges:
- Long-tail updates cause cold data to be updated frequently
- Very long business windows lead to high backtracking costs for order analysis
- Random updates and late-arriving data cannot be predicted
- The data ingestion pipeline cannot guarantee reliability
- Data lost in the distributed pipeline cannot be reconciled
- Data warehouse ingestion latency is very high
Therefore, we use Apache Hudi on top of OSS to solve these problems. The following figure shows the architecture of Hudi:
2.1 Enabling near real-time data ingestion and analysis
The T3 Travel data lake ingests Kafka messages, MySQL binlogs, GIS data, business logs, and other data sources in near real time. More than 60% of the company's data is already stored in the data lake, and this proportion is still growing. By introducing Hudi into the data pipeline, T3 Travel reduced data ingestion time to a few minutes; combined with interactive big data query and analysis frameworks such as Presto and Spark SQL, this enables more real-time insight and analysis of the data.
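For illustration, a minimal PySpark Structured Streaming sketch of this kind of pipeline is shown below, streaming records from Kafka into a Hudi table behind an alluxio:// path. The topic, schema, record key, hosts, and paths are hypothetical placeholders, not T3's actual configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructType, TimestampType

# Assumes the Hudi Spark bundle and the Alluxio client jar are on the Spark
# classpath (e.g. via --packages / --jars); all names below are placeholders.
spark = SparkSession.builder.appName("kafka-to-hudi").getOrCreate()

# Hypothetical schema of the order events carried in the Kafka payload
schema = (StructType()
          .add("order_id", StringType())
          .add("status", StringType())
          .add("update_time", TimestampType()))

# Read the Kafka topic as a stream
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "orders")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("o"))
          .select("o.*"))

def upsert_to_hudi(batch_df, batch_id):
    # Upsert each micro-batch into a Hudi table (non-partitioned for simplicity)
    (batch_df.write.format("hudi")
        .option("hoodie.table.name", "orders")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "order_id")
        .option("hoodie.datasource.write.precombine.field", "update_time")
        .option("hoodie.datasource.write.keygenerator.class",
                "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
        .mode("append")
        # with Alluxio mounted over OSS, the base path uses alluxio:// not oss://
        .save("alluxio://alluxio-master:19998/datalake/hudi/orders"))

query = (events.writeStream
         .option("checkpointLocation",
                 "alluxio://alluxio-master:19998/checkpoints/orders")
         .foreachBatch(upsert_to_hudi)
         .start())
query.awaitTermination()
```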
2.2 Enabling an incremental processing pipeline
T3 Travel relies on the incremental query capability provided by Hudi. In multi-layer data processing scenarios with frequent changes, only the incremental changes need to be fed to the downstream derived tables, and those tables only need to apply this change data to quickly complete a local update of the multi-layer pipeline. This greatly improves the efficiency of data updates under frequent changes and effectively avoids the full-partition and cold-data updates of a traditional Hive data warehouse.
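As a sketch of what such an incremental read can look like (the table path and commit time are placeholders, not T3's exact job), a downstream Spark job pulls only the records committed after the last instant it processed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-read").getOrCreate()

base_path = "alluxio://alluxio-master:19998/datalake/hudi/orders"  # placeholder
last_processed_instant = "20240101000000"  # placeholder Hudi commit time

# Incremental query: only commits after `last_processed_instant` are returned,
# so the derived table applies just the changed rows instead of rescanning
# whole partitions.
changes = (spark.read.format("hudi")
           .option("hoodie.datasource.query.type", "incremental")
           .option("hoodie.datasource.read.begin.instanttime", last_processed_instant)
           .load(base_path))

changes.createOrReplaceTempView("order_changes")
spark.sql("SELECT order_id, status, update_time FROM order_changes").show()
```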
2.3 Using Hudi as a unified data format
Traditional data warehouses often deploy Hadoop to store data and provide batch analysis, while Kafka is used separately to distribute data to other processing frameworks, which duplicates the data. Hudi effectively solves this problem: we always use a Spark-Kafka pipeline to insert the latest updates into the Hudi table, and then incrementally read the updates from the Hudi table. In other words, Hudi unifies the storage.
3. Using Alluxio for efficient data caching
In earlier versions of the data lake we did not use Alluxio: Spark processed the data received from Kafka in real time and then wrote it to OSS via a Hudi DeltaStreamer task. In this setup, the network latency of Spark writing directly to OSS was very high, and because all data lived in OSS there was no data locality, so OLAP queries on the Hudi data were also slow. To solve the latency problem, we deployed Alluxio as a data orchestration layer, co-located with compute engines such as Spark and Presto, and used Alluxio to speed up reads and writes to the data lake, as shown in the following figure:
Most of the data, in Hudi, Parquet, ORC, and JSON formats, is stored on OSS, accounting for about 95% of the total. Compute engines such as Flink, Spark, Kylin, and Presto are deployed in separate clusters. When each engine accesses OSS, Alluxio acts as a virtual distributed storage system that accelerates the data and is co-located with each compute cluster. Below are several cases of how Alluxio is used in the T3 Travel data lake.
3.1 Data ingestion into the lake
We co-deploy Alluxio with the compute cluster. Before data enters the lake, the corresponding OSS path is mounted into the Alluxio file system, and the "--target-base-path" parameter of Hudi is changed from oss://... to alluxio://.... During ingestion, we use the Spark engine to launch the Hudi program and continuously ingest data, which at this point flows through Alluxio. After the Hudi program starts, it is configured to asynchronously synchronize data from the Alluxio cache to remote OSS every minute. In this way, Spark writes to local Alluxio instead of remote OSS, which shortens the time for data to enter the lake.
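A minimal sketch of this setup follows, assuming an Alluxio master at alluxio-master:19998 and hypothetical bucket, path, and table names. The OSS mount is normally done once with the Alluxio CLI (shown here in comments), and the ingestion job then simply targets an alluxio:// base path:

```python
# One-time setup with the Alluxio shell (bucket, paths, and credentials are placeholders):
#
#   alluxio fs mount /datalake oss://t3-datalake-bucket/datalake \
#       --option fs.oss.accessKeyId=<id> \
#       --option fs.oss.accessKeySecret=<secret> \
#       --option fs.oss.endpoint=<endpoint>
#
# With a write type such as ASYNC_THROUGH (alluxio.user.file.writetype.default),
# writes land in the Alluxio cache first and are persisted to OSS asynchronously.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-via-alluxio").getOrCreate()

# Placeholder raw source; in practice this is the Spark/Hudi ingestion job itself
df = spark.read.json("alluxio://alluxio-master:19998/datalake/raw/orders")

(df.write.format("hudi")
   .option("hoodie.table.name", "orders")
   .option("hoodie.datasource.write.recordkey.field", "order_id")
   .option("hoodie.datasource.write.precombine.field", "update_time")
   .option("hoodie.datasource.write.keygenerator.class",
           "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
   .mode("append")
   # the equivalent of DeltaStreamer's --target-base-path, pointing at Alluxio
   .save("alluxio://alluxio-master:19998/datalake/hudi/orders"))
```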
3.2 Data analysis on the lake
We use Presto as a self-service query engine to analyze the Hudi tables in the lake. Alluxio is co-located on each Presto worker node. When Presto runs together with the Alluxio service, Alluxio can cache the input data locally on the Presto workers and serve subsequent retrievals at memory speed. In that case, Presto can read data from the local Alluxio worker storage (so-called short-circuit reads) without any additional network transfer.
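For illustration, such a query might be issued with the presto-python-client as below. The host, catalog, and table names are placeholders, and the acceleration comes entirely from Presto's connector reading through the co-located Alluxio workers, not from anything in this client code:

```python
import prestodb  # pip install presto-python-client

# Connect to the Presto coordinator (host/catalog/schema are placeholders)
conn = prestodb.dbapi.connect(
    host="presto-coordinator",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="datalake",
)

cur = conn.cursor()
# The Hudi table's location points at alluxio://..., so workers read from
# their local Alluxio storage whenever the data is already cached.
cur.execute("""
    SELECT status, count(*) AS cnt
    FROM orders
    GROUP BY status
""")
for row in cur.fetchall():
    print(row)
```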
3.3 Concurrent access across multiple storage systems
To ensure the accuracy of training samples, our machine learning team often synchronizes desensitized production data into the offline machine learning environment. During synchronization, the data crosses multiple file systems: from the production OSS to the offline data lake cluster's HDFS, and finally to the HDFS of the machine learning cluster. For data modelers this migration process is not only inefficient but also error-prone, because several file systems with different configurations are involved. We therefore introduced Alluxio and mounted the multiple file systems under the same Alluxio instance, unifying the namespace. Each end of the pipeline interacts only with its own Alluxio path, which ensures that applications with different APIs can access and transfer the data seamlessly. This data access layout can also improve performance.
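A rough sketch of this unified-namespace idea, with hypothetical mount points and cluster addresses: the mounts are created once with the Alluxio CLI (shown in comments), and every job then reads and writes through alluxio:// paths regardless of where the bytes actually live.

```python
# One-time namespace setup (paths and credentials are placeholders):
#
#   alluxio fs mount /prod-oss  oss://prod-bucket/desensitized \
#       --option fs.oss.accessKeyId=<id> \
#       --option fs.oss.accessKeySecret=<secret> \
#       --option fs.oss.endpoint=<endpoint>
#   alluxio fs mount /lake-hdfs hdfs://lake-nn:8020/warehouse
#   alluxio fs mount /ml-hdfs   hdfs://ml-nn:8020/training
#
# With all three systems mounted under one Alluxio namespace, a sync job only
# sees alluxio:// paths. Assumes the Alluxio client jar is on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sync-training-samples").getOrCreate()

# Copy a desensitized sample set from production OSS to the ML cluster's HDFS,
# addressing both through the same Alluxio namespace.
samples = spark.read.parquet(
    "alluxio://alluxio-master:19998/prod-oss/samples/2024-01")
samples.write.mode("overwrite").parquet(
    "alluxio://alluxio-master:19998/ml-hdfs/samples/2024-01")
```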
3.4 Benchmark tests
Overall, we observed the following advantages of Alluxio:
- Alluxio supports a tiered, transparent caching mechanism
- Alluxio supports cache promotion on read
- Alluxio supports an asynchronous write mode
- Alluxio supports an LRU eviction policy
- Alluxio provides pin and TTL features
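As a concrete illustration of the last two items, pin and TTL can be driven from the Alluxio shell. The sketch below wraps those CLI calls (the table paths are placeholders, and the exact setTtl flags may vary slightly between Alluxio versions) so a housekeeping job can keep hot Hudi partitions cached and let cold ones expire:

```python
import subprocess

ALLUXIO = "alluxio"  # assumes the Alluxio client CLI is on PATH

def alluxio_fs(*args):
    """Run an Alluxio filesystem shell command and fail loudly on error."""
    subprocess.run([ALLUXIO, "fs", *args], check=True)

# Pin a hot partition so the LRU policy never evicts it from the cache
alluxio_fs("pin", "/datalake/hudi/orders/dt=2024-01-30")

# Give a cold partition a 7-day TTL, after which Alluxio frees the cached copy
# ("free" drops only the cache, leaving the data in OSS untouched)
seven_days_ms = str(7 * 24 * 60 * 60 * 1000)
alluxio_fs("setTtl", "--action", "free",
           "/datalake/hudi/orders/dt=2023-06-01", seven_days_ms)
```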
After comparison and verification, we chose Spark SQL as the query engine and queried the same Hudi table against three different storage layers: Alluxio + OSS, OSS alone, and HDFS. During the stress tests we found that once the data volume exceeds a certain order of magnitude (about 24 million rows), querying through Alluxio + OSS becomes faster than the co-deployed HDFS, and beyond about 100 million rows the speedup grows several-fold. At about 600 million rows, queries are roughly 12 times faster than against native OSS and roughly 8 times faster than against native HDFS. The larger the data scale, the more significant the performance improvement, with the exact multiple depending on the machine configuration.
4. Outlook
As the T3 Travel data lake ecosystem expands, we will continue to face the key scenario of separated compute and storage. As T3's demand for data processing grows, our team plans to deploy Alluxio at larger scale to strengthen data lake query capability. So, in addition to deploying Alluxio on the data lake compute engines (mainly Spark SQL), we will add a layer of Alluxio to the OLAP cluster (Apache Kylin) and the ad-hoc query cluster (Presto). Alluxio will then cover all scenarios, and interconnecting the Alluxio deployments across scenarios will improve the read and write efficiency of the data lake and the ecosystem around it.