This article describes how Huami Technology applied Apache Hudi to carry out a lakehouse (integrated data lake and data warehouse) transformation. The content is detailed and easy to follow, the approach is practical, and it offers useful reference value for similar migrations.
1. Application background and pain points
Huami Technology is a cloud-based health service provider with world-leading smart wearable technology. Data construction at Huami focuses on two categories of data: device data and APP data. Both are characterized by delayed uploads, frequent and wide-ranging updates, and the possibility of deletion. Based on these characteristics, the previous data warehouse ETL updated data daily in a "historical full volume + daily increment" merge mode. As the business grew, the existing warehouse infrastructure could no longer keep up with the continuously growing data volume, causing significant problems such as rising cost and reduced output efficiency.
In view of the problems with the existing warehouse infrastructure, we analyzed the main factors affecting cost and efficiency as follows:
The update mode is too heavy. The daily incremental data is small and long-tail distributed across historical partitions, yet every daily warehouse update has to load the full historical data in order to merge the increments. The whole update process therefore involves a large amount of redundant reading and rewriting of historical data, which wastes cost and hurts update efficiency.
Backtracking is expensive, and keeping multiple full copies wastes storage. To let users access the historical state of the data at a given point in time, multiple full copies are retained by update date, so large volumes of unchanged historical cold data are stored repeatedly, resulting in storage waste.
To solve these problems and meet the goal of reducing cost and improving efficiency, we decided to introduce a data lake and restructure the data warehouse as follows:
Real-time business data sources are connected to Kafka, and Flink plus Kafka are used to build the real-time incremental ODS layer. The real-time ODS incremental layer mainly serves two purposes:
Based on the real-time ODS incremental data (kept in its original format, without cleaning or conversion), the offline ODS lakehouse layer is built daily; the ODS layer serves as a backup of the business data and supports full rebuilds of the DWD layer when needed.
The real-time ODS incremental data is cleaned, converted, and encoded, and the daily increment is then written offline to build the DWD lakehouse layer (a minimal write sketch follows this list).
The DWS layer is the subject-oriented common wide-table layer; it integrates tables from the DWD layer and the DIM dimension layer according to business requirements, providing model data that is easier for the business and for analysts to use.
The OLAP layer provides fast query capability and acts as the unified external query entrance; users can run ad-hoc queries and analysis over all tables in the lakehouse directly through the OLAP engine.
The ADS layer relies on the other layers to provide customized data services for the business.
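To make the daily offline write into the DWD lakehouse layer concrete, here is a minimal Spark (Scala) sketch of a Hudi upsert; the paths, table name, and field names (uid, event_id, ts, dt) are hypothetical illustrations, not details from the original article.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object DailyDwdIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dwd-daily-upsert")
      // Hudi requires Kryo serialization for Spark jobs
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    // Cleaned and encoded daily increment produced from the real-time ODS layer (hypothetical path)
    val increment = spark.read.parquet("s3://warehouse/ods_inc/dt=2022-01-01/")

    increment.write.format("hudi")
      .option("hoodie.table.name", "dwd_device_events")                   // hypothetical table name
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.datasource.write.recordkey.field", "uid,event_id")  // hypothetical composite key
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .mode(SaveMode.Append)
      .save("s3://warehouse/dwd/dwd_device_events")

    spark.stop()
  }
}
```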
2. Technical solution selection

| | Hudi | Iceberg | Delta |
| --- | --- | --- | --- |
| Engine support | Spark, Flink | Spark, Flink | Spark |
| Atomic semantics | Delete/Update/Merge | Insert/Merge | Delete/Update/Merge |
| Streaming write | Supported | Supported | Supported |
| File formats | Avro, Parquet, ORC | Avro, Parquet, ORC | Parquet |
| MOR capability | Supported | Not supported | Not supported |
| Schema Evolution | Supported | Supported | Supported |
| Cleanup | Automatic | Manual | Manual |
| Compaction | Automatic / Manual | Manual | Manual |
| Small file management | Automatic | Manual | Manual |
The comparison above covers the indicators we care most about. Hudi can merge small files while the job runs, which greatly reduces the complexity of file governance. Weighing the atomic semantics required by our business scenarios, the complexity of small file management, and community activity, we chose Hudi for the lakehouse transformation.
3. Problems and solutions
3.1 Incremental data field alignment problem
For business reasons, table Schemas at Huami sometimes need to change, and we want to avoid the high compute cost of rebuilding the historical Base data every time the Schema changes. However, newly added fields disturb the relative order of the fields in the data, which caused exceptions while syncing Hive data into the lake. The Huami big data team is working with the community on this field-alignment problem. Until the community supports better Schema Evolution, the team's current solution is to rearrange the incremental data's Schema to follow the Schema order of the historical Base data, and then merge the increment into the lake. The processing flow is as follows: the Schema order of the historical Base data is {id, fdata, tag, uid}, while the Schema of the incremental data is {id, fdata, extract, tag, uid}. The newly added extract field disrupts the order of the original Base Schema, so the new data is adjusted according to the Schema order read from the historical data:
Change {id, fdata, extract, tag, uid} to {id, fdata, tag, uid, extract}, then call Schema Evolution to append an extract field to the Schema of the historical Base data, and finally write the adjusted incremental data into the historical Base.
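A minimal Spark (Scala) sketch of this reordering step, assuming the Base Schema can be read from the existing Hudi table; the paths are hypothetical and the column names follow the example above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("schema-align").getOrCreate()

// Schema order of the historical Base data, e.g. {id, fdata, tag, uid}
val baseCols = spark.read.format("hudi")
  .load("s3://warehouse/dwd/base_table")             // hypothetical path
  .columns.filterNot(_.startsWith("_hoodie"))        // drop Hudi metadata columns

// Incremental data with the new field in the "wrong" position: {id, fdata, extract, tag, uid}
val increment = spark.read.parquet("s3://warehouse/ods_inc/dt=2022-01-01/")

// Keep the Base order first, then append any newly added fields (here: extract)
val newCols = increment.columns.filterNot(baseCols.contains)
val aligned = increment.select((baseCols ++ newCols).map(col): _*)
// aligned now has the order {id, fdata, tag, uid, extract} and can be upserted into the Base table
```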
3.2 Global storage compatibility issues
Huami's big data storage involves several storage systems (HDFS, S3, KS3). The Huami big data team added support for KS3 storage and merged it into the community code, so KS3 storage is supported after Hudi version 0.9.
3.3 Unification of CVM time zone
As Huami's data centers around the world add nodes on demand, the CVMs that are provisioned may sit in inconsistent time zones, which leads to commit failures. We modified the Hudi source code to unify the Timeline time zone to UTC, guaranteeing a consistent time zone and avoiding commit failures caused by commitTime going backwards.
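To illustrate the idea (this is not Hudi's actual internal code), a commit timestamp generated with a formatter pinned to UTC is identical regardless of each node's local time zone; a minimal Scala sketch:

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hudi commit times use a yyyyMMddHHmmss-style pattern; pinning the formatter to UTC
// makes the generated commitTime independent of the node's local time zone.
val commitTimeFormatter =
  DateTimeFormatter.ofPattern("yyyyMMddHHmmss").withZone(ZoneOffset.UTC)

val commitTime = commitTimeFormatter.format(Instant.now())
println(s"commitTime = $commitTime")   // the same value on every node at the same instant
```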
3.4 Upgrading to the new version
During the upgrade from Hudi 0.9 to 0.10, data updates fail because of a version inconsistency. The inconsistency has been reported to the community, and community members are working on it. For now we work around it by rebuilding the metadata table (deleting the metadata directory directly); when the job is executed again, Hudi rebuilds the metadata table automatically.
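As a hedged sketch of that workaround: the metadata table lives under .hoodie/metadata inside the table's base path (the base path below is hypothetical), and Hudi rebuilds it on the next run when the metadata table is enabled.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical base path of the affected Hudi table
val basePath     = "s3://warehouse/dwd/dwd_device_events"
val metadataPath = new Path(s"$basePath/.hoodie/metadata")

val fs: FileSystem = metadataPath.getFileSystem(new Configuration())
if (fs.exists(metadataPath)) {
  // Drop the metadata table; Hudi rebuilds it on the next run when hoodie.metadata.enable=true
  fs.delete(metadataPath, true)
}
```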
3.5 Performance issues with multi-partition upsert
Hudi on Spark needs to collect the index files of every partition touched by the incremental data; if too many partitions are updated, performance suffers. We currently address this at two levels:
Promote upstream data governance to limit delayed and duplicate uploads as much as possible.
Optimize at the code level: a time-range switch restricts the daily ingestion into the lake to a configured window, so that long-delayed data does not enter the lake with the daily update and drag down the table's daily update performance; the long-delayed data is collected and ingested into the lake periodically instead, reducing the overall task overhead (a sketch of this filter follows the list).
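A minimal Spark (Scala) sketch of such a time-range switch, assuming a dt partition column and a configurable window; the column name, window length, and paths are hypothetical.

```scala
import java.time.LocalDate

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("ingest-window").getOrCreate()

// How many days back the daily update is allowed to touch (hypothetical setting)
val maxBackfillDays = 7
val today    = LocalDate.now()
val earliest = today.minusDays(maxBackfillDays).toString   // e.g. "2022-01-01"

val increment = spark.read.parquet(s"s3://warehouse/ods_inc/dt=$today/")

// Only rows whose event date falls inside the window go into the daily upsert;
// older (long-delayed) rows are set aside and ingested by a periodic batch job.
val inWindow = increment.filter(col("dt") >= earliest)
val delayed  = increment.filter(col("dt") <  earliest)

delayed.write.mode("append").parquet("s3://warehouse/ods_delayed/")   // hypothetical staging area
// inWindow is then upserted into the Hudi table as in the earlier sketch
```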
3.6 Adapting to data characteristics
Performance tests of ingestion into the lake show that Hudi's performance is closely tied to how the data is organized, which shows up in the following aspects:
The field order of a composite primary key determines how data is sorted within Hudi and affects the performance of subsequent ingestion. Because records that sort close together are physically co-located, choosing a key field order that matches the distribution of the updated data minimizes the number of Base files touched in the lake and improves write performance.
The fit between the number of records per file block and the Bloom filter parameters affects index-building performance. With the Bloom filter index, the default number of entries stored per Bloom filter is 60,000 (assuming a maxParquetFileSize of 128 MB and an averageRecordSize of 1,024 bytes). If the data is sparse or highly compressible, each file block may hold far more than 60,000 records, so every index lookup scans more Base files and performance drops sharply. We recommend tuning this value to the characteristics of the business data to improve ingestion performance (a tuning sketch follows this list).
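A hedged sketch of tuning those knobs on a Hudi write, assuming a composite key and well-compressing data; the concrete values, paths, and field names are illustrative, not the article's.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("bloom-tuning")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val df = spark.read.parquet("s3://warehouse/ods_inc/dt=2022-01-01/")   // hypothetical increment

df.write.format("hudi")
  .option("hoodie.table.name", "dwd_device_events")
  .option("hoodie.datasource.write.operation", "upsert")
  // Composite record key: order the fields to match how the updated data clusters
  .option("hoodie.datasource.write.recordkey.field", "uid,event_id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.index.type", "BLOOM")
  // Default is 60000 entries per Bloom filter; raise it when each Parquet file
  // actually holds far more records (sparse or highly compressible data)
  .option("hoodie.index.bloom.num_entries", "300000")
  .option("hoodie.parquet.max.file.size", String.valueOf(128L * 1024 * 1024))
  .mode(SaveMode.Append)
  .save("s3://warehouse/dwd/dwd_device_events")
```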
4. Benefits after going live
Starting from the business scenarios and analysis requirements, we compared the cost and benefit of a real-time data lake model against an offline data lake model; the cost of the real-time model is much higher than the offline one. Since the business has no strong real-time requirement, the Huami data warehouse adopted the offline update mode of Hudi + Spark when introducing the data lake, and used it to build the original ODS layer and the DWD detail layer of the lakehouse. Based on test comparisons and the situation after going live, the benefits are summarized as follows:
4.1 Cost
After introducing Hudi data lake technology, the overall cost of the data warehouse has come down, with an expected reduction of roughly 1/4 to 1/3. The savings come mainly from the technical capabilities the Hudi data lake provides, which address the two pain points described in the application background and cut both the Merge-update cost and the storage cost of the warehouse.
4.2 Efficiency
Hudi's index-based update mechanism avoids rewriting the full table data on every update, so each warehouse table update no longer involves large volumes of redundant reads and writes, and table update efficiency has improved accordingly. Looking at the overall data warehouse + BI report pipeline, reports are now produced somewhat earlier.
4.3 Stability
We have not yet made a detailed evaluation of program stability; here is the current situation in actual scenarios:
Updates of medium and large tables are more stable after introducing Hudi. Under the AWS Spot Instance mechanism, a table with too much data shuffles a huge volume of data on every full merge, so pulling the data takes too long, Spot machines go offline, and the job retries or even fails, or memory-induced fetch failures make the task unstable. After introducing Hudi, the amount of data shuffled in each run is greatly reduced, which effectively alleviates this problem.
The functional stability of Hudi's Metadata table mechanism still needs improvement, and enabling it affects job stability. We enabled the Metadata table early on to improve performance, but jobs started failing after running for a while; the error has been reported to the community. We have turned the feature off for now and will turn it on again once it is stable.
4.4 Query performance
When Hudi writes files, records are sorted by the primary key fields, so the records within each Parquet file are ordered by primary key. Queries through Hive or Spark can therefore make good use of Parquet predicate pushdown to quickly filter out irrelevant data, giving better query efficiency than the previous warehouse tables.
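For illustration, a minimal Spark (Scala) query over the Hudi table where a primary-key predicate can be pushed down to the Parquet reader; the table path and column names are the hypothetical ones used in the sketches above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("hudi-query").getOrCreate()

val snapshot = spark.read.format("hudi")
  .load("s3://warehouse/dwd/dwd_device_events")

// The uid predicate is pushed down to the Parquet reader; because records within each
// file are sorted by the key fields, row groups that cannot match are skipped quickly.
snapshot.filter(col("uid") === "some-user-id")
  .select("uid", "event_id", "ts")
  .show()
```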