Example Analysis of Apache Hudi Combined with Flink

This article presents an example analysis of combining Apache Hudi with Flink. It is fairly detailed and should have real reference value; if you are interested, read it to the end!
1. Evolution of Real-Time Data Landing Requirements
After the real-time platform went online, the main requirement was to develop real-time reports: extract data from various sources, run real-time ETL, and write the resulting real-time indicators to an Oracle database for display and query.
As the real-time platform stabilized and opened up, users brought a wider range of demands:
Real-time development needs to debug real-time SQL ETL logic, sample data, and so on.
Data analysts and the business want to combine real-time data with the existing data warehouse assets, for example analyzing user behavior against warehouse models to gain insight, rather than only looking at highly aggregated reports.
The business wants real-time data to drive business processes and close the business loop.
To meet some of these needs, real-time data is landed, combined with other warehouse data, and then consumed by T-1 offline batch runs that produce the reports.
Beyond the main requirements listed above, there are also assorted smaller ones.
In general, the highly aggregated data the real-time platform outputs can no longer satisfy users; they want data that is more detailed, more raw, more self-service, and with more possibilities.
That requires the platform to land real-time data into the offline data warehouse. Driven by this evolution of requirements, the real-time platform began exploring and practicing real-time data landing.
2. Real-Time Data Landing Practice Based on Spark + Hudi
Our first technology choice was the popular Spark + Hudi stack; the overall landing architecture was as follows:
This choice was based mainly on the following considerations:
Data warehouse developers do not need to write Scala/Java code or package jars to develop tasks.
ETL logic can be embedded in the data tasks.
A unified development entry point.
At that time we built a generic data channel consisting of a Spark task jar and a shell script. The data warehouse development entry point was a unified scheduling platform: a data landing requirement was translated into the corresponding shell parameters, and running the script completed the landing.
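The channel's source is not shown in this article, but a minimal sketch of such a parameterized Spark landing job might look like the following. Every argument, class name, path, and field name here is hypothetical; the point is only to illustrate how shell parameters map onto a Hudi write.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

// Hypothetical entry point of the generic data channel: the scheduling
// platform's shell script passes source path, table name, target path
// and key fields as arguments.
public class HudiLandingJob {
    public static void main(String[] args) {
        String sourcePath = args[0];   // staged input data
        String tableName  = args[1];   // target Hudi table name
        String basePath   = args[2];   // target location on HDFS/S3
        String recordKey  = args[3];   // primary key field
        String precombine = args[4];   // field used to deduplicate on upsert

        SparkSession spark = SparkSession.builder()
                .appName("hudi-landing-" + tableName)
                // Hudi requires Kryo serialization in Spark jobs
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();

        Dataset<Row> df = spark.read().parquet(sourcePath);

        df.write().format("hudi")
                .option("hoodie.table.name", tableName)
                .option("hoodie.datasource.write.recordkey.field", recordKey)
                .option("hoodie.datasource.write.precombine.field", precombine)
                .mode(SaveMode.Append)
                .save(basePath);

        spark.stop();
    }
}
```

With a job like this, the scheduling platform only has to assemble five shell arguments per landing requirement, which is what made a code-free entry point possible for warehouse developers.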
3. Custom Real-Time Data Landing Practice Based on Flink
Since our real-time platform was built on Flink at the time, and Spark + Hudi had problems supporting high-traffic tasks, such as increased latency when landing data and frequent task OOMs, we decided to explore landing data with Flink instead.
At that time the community's Flink + Hudi integration did not yet exist, so we referred to the Flink + ORC data landing process and implemented real-time data landing ourselves. The main work was a parameterized definition of the schema of the landed data, so that data development colleagues could land data through shell parameters alone.
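To make the idea concrete, here is a minimal sketch of this kind of parameterized ORC landing expressed with today's Flink SQL filesystem connector; the schema fragment, topic, and output path arrive as shell parameters. This is an illustration of the approach under those assumptions, not our platform's actual code, and all names are hypothetical.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class OrcLandingJob {
    public static void main(String[] args) {
        // Hypothetical shell parameters, e.g.:
        //   args[0] = "id BIGINT, name STRING, ts TIMESTAMP(3)"
        String schemaDdl  = args[0];
        String kafkaTopic = args[1];
        String outputPath = args[2];

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);  // file rolling is tied to checkpoints
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Source table: the schema fragment is spliced into the DDL, so
        // developers supply parameters instead of writing code.
        tEnv.executeSql(
            "CREATE TABLE source_t (" + schemaDdl + ") WITH ("
          + " 'connector' = 'kafka',"
          + " 'topic' = '" + kafkaTopic + "',"
          + " 'properties.bootstrap.servers' = 'broker:9092',"
          + " 'format' = 'json')");

        // Sink table: same schema, landed as ORC files on HDFS.
        tEnv.executeSql(
            "CREATE TABLE sink_t (" + schemaDdl + ") WITH ("
          + " 'connector' = 'filesystem',"
          + " 'path' = '" + outputPath + "',"
          + " 'format' = 'orc')");

        tEnv.executeSql("INSERT INTO sink_t SELECT * FROM source_t");
    }
}
```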
4. Data Landing Practice Based on Flink + Hudi
After the Hudi version with Flink integration was released, the real-time platform began compatibility work and incorporated Hudi into the platform's development scope.
The overall architecture after integration looks like this:
The real-time platform connects its various data sources and sink targets through plug-ins. We referred to the sink process of HudiFlinkTable and wired Hudi into our real-time development platform.
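As an illustration of what such a wired-up sink amounts to, here is a minimal sketch using the Hudi Flink SQL connector. The source table, field names, and paths are hypothetical and this is not our platform's plugin code.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class HudiSinkExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Hudi commits are driven by checkpoints, so checkpointing must be on.
        env.enableCheckpointing(60_000);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Hypothetical Kafka source table.
        tEnv.executeSql(
            "CREATE TABLE kafka_source (uuid STRING, name STRING, ts TIMESTAMP(3)) WITH ("
          + " 'connector' = 'kafka',"
          + " 'topic' = 'events',"
          + " 'properties.bootstrap.servers' = 'broker:9092',"
          + " 'format' = 'json')");

        // Hudi sink table; MERGE_ON_READ suits write-heavy streaming loads.
        tEnv.executeSql(
            "CREATE TABLE hudi_sink ("
          + "  uuid STRING PRIMARY KEY NOT ENFORCED,"
          + "  name STRING,"
          + "  ts TIMESTAMP(3)"
          + ") WITH ("
          + " 'connector' = 'hudi',"
          + " 'path' = 'hdfs:///warehouse/hudi_sink',"
          + " 'table.type' = 'MERGE_ON_READ')");

        tEnv.executeSql("INSERT INTO hudi_sink SELECT uuid, name, ts FROM kafka_source");
    }
}
```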
To improve usability, we mainly added the following auxiliary functions:
Automatic synchronization and update of Hive metadata (see the sketch after this list)
Automatic splicing of the Hudi schema
Task monitoring, metrics access, and so on
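The first item comes largely for free from the Hudi Flink connector's built-in Hive sync. Below is a hedged sketch of the extra WITH options involved; the exact option names vary across Hudi releases (check the FlinkOptions class for your version), and the metastore address, database, and table names here are hypothetical.

```java
// Extra options for the hudi_sink DDL above that turn on automatic
// Hive metadata synchronization after each commit (option names per
// recent Hudi releases; values are hypothetical).
String hiveSyncOptions =
      ", 'hive_sync.enabled' = 'true'"          // turn Hive sync on
    + ", 'hive_sync.mode' = 'hms'"              // sync through the Hive metastore client
    + ", 'hive_sync.metastore.uris' = 'thrift://hive-metastore:9083'"
    + ", 'hive_sync.db' = 'ods'"
    + ", 'hive_sync.table' = 'hudi_sink'";
```

Appended to the sink table's WITH clause, these options make the landed table visible and queryable from Hive once a commit completes, which is what the automatic Hive metadata synchronization above refers to.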
The actual usage flow is as follows:
After the whole system went live, it has been used by various business lines for report development, real-time online analysis, and other scenarios, and it serves the business well: 26 links are online in total, landing roughly 300 million records per day.
5. Follow-up Plans and Outlook
Follow-up work focuses mainly on the following aspects:
5.1 Replace offline reports to improve their timeliness and stability
Offline reports are characterized by T-1 latency, early-morning batch runs, and long dependency chains, which causes two problems. First, timeliness is low. Second, with a long data dependency chain, a problem in intermediate data easily delays everything downstream, and many anomalies are only exposed when the report task actually runs. Worse, batch problems surface in the early hours of the morning, when both the options for fixing them and the resources that can be coordinated are limited. This is unacceptable for reports that demand stability and timeliness, especially at a financial company. Migrating these reports to the real-time platform not only improves their timeliness; because extraction and report ETL run continuously in real time, the stability of the report data also improves greatly. This is one of the planned applications of Hudi real-time data.
5.2 Improve the monitoring system to increase the stability of data tasks
At present we only monitor the landing task itself: whether it runs normally, whether it throws exceptions, and so on. But actual users care more about monitoring the whole link from upstream data to Hive: whether data is delayed, whether there is backpressure, whether the data source lags in consumption, whether landed data is lost, whether any task is a bottleneck, and so on. In general, users want a more comprehensive and detailed view of how tasks are running, and that is the goal of the next round of monitoring improvements.
5.3 Explore visualization of the intermediate stages of data landing
This is similar to the monitoring above: users want to trace a single record from the data source through each operator and see its details, for example whether the record was filtered out, which window it fell into, and the processing time at each operator. Otherwise the whole SQL data processing pipeline is a black box to users.
That is all of "Example Analysis of Apache Hudi Combined with Flink". Thank you for reading, and I hope the content helps you!