Example Analysis of Apache Hudi Combined with Flink

This article presents an example analysis of combining Apache Hudi with Flink. It is fairly detailed and should have real reference value; if you are interested, read it to the end!
1. Evolution of Real-Time Data Landing Requirements
After the real-time platform went online, the main requirement was to develop real-time reports: extract data from various sources, run real-time ETL, and write the resulting real-time indicators to an Oracle database for display and query.
As the real-time platform stabilized and opened up, users brought a wider range of demands:
Real-time development needs to debug real-time SQL ETL logic, sample data, and so on.
Data analysts and the business want to combine real-time data with the existing data warehouse assets, for example analyzing user behavior against warehouse models to gain insight, rather than only looking at highly aggregated reports.
The business wants real-time data to drive business processes and close the business loop.
To meet some of these needs, real-time data is landed, combined with other warehouse data, and then consumed by T-1 offline batch runs that produce the reports.
Beyond the main requirements listed above, there are also assorted smaller ones.
In general, the highly aggregated data the real-time platform outputs can no longer satisfy users; they want data that is more detailed, more raw, more self-service, and with more possibilities.
That requires the platform to land real-time data into the offline data warehouse. Driven by this evolution of requirements, the real-time platform began exploring and practicing real-time data landing.
2. Real-Time Data Landing Practice Based on Spark + Hudi
Our first technology choice was the popular Spark + Hudi stack; the overall landing architecture was as follows:
This choice was based mainly on the following considerations:
Data warehouse developers do not need to write Scala/Java code or package jars to develop tasks.
ETL logic can be embedded in the data tasks.
A unified development entry point.
At that time we built a generic data channel consisting of a Spark task jar and a shell script. The data warehouse development entry point was a unified scheduling platform: a data landing requirement was translated into the corresponding shell parameters, and running the script completed the landing.
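The channel's source is not shown in this article, but a minimal sketch of such a parameterized Spark landing job might look like the following. Every argument, class name, path, and field name here is hypothetical; the point is only to illustrate how shell parameters map onto a Hudi write.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

// Hypothetical entry point of the generic data channel: the scheduling
// platform's shell script passes source path, table name, target path
// and key fields as arguments.
public class HudiLandingJob {
    public static void main(String[] args) {
        String sourcePath = args[0];   // staged input data
        String tableName  = args[1];   // target Hudi table name
        String basePath   = args[2];   // target location on HDFS/S3
        String recordKey  = args[3];   // primary key field
        String precombine = args[4];   // field used to deduplicate on upsert

        SparkSession spark = SparkSession.builder()
                .appName("hudi-landing-" + tableName)
                // Hudi requires Kryo serialization in Spark jobs
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .getOrCreate();

        Dataset<Row> df = spark.read().parquet(sourcePath);

        df.write().format("hudi")
                .option("hoodie.table.name", tableName)
                .option("hoodie.datasource.write.recordkey.field", recordKey)
                .option("hoodie.datasource.write.precombine.field", precombine)
                .mode(SaveMode.Append)
                .save(basePath);

        spark.stop();
    }
}
```

With a job like this, the scheduling platform only has to assemble five shell arguments per landing requirement, which is what made a code-free entry point possible for warehouse developers.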
3. Custom Real-Time Data Landing Practice Based on Flink
Since our real-time platform was built on Flink at the time, and Spark + Hudi had problems supporting high-traffic tasks, such as increased latency when landing data and frequent task OOMs, we decided to explore landing data with Flink instead.
At that time the community's Flink + Hudi integration did not yet exist, so we referred to the Flink + ORC data landing process and implemented real-time data landing ourselves. The main work was a parameterized definition of the schema of the landed data, so that data development colleagues could land data through shell parameters alone.
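To make the idea concrete, here is a minimal sketch of this kind of parameterized ORC landing expressed with today's Flink SQL filesystem connector; the schema fragment, topic, and output path arrive as shell parameters. This is an illustration of the approach under those assumptions, not our platform's actual code, and all names are hypothetical.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class OrcLandingJob {
    public static void main(String[] args) {
        // Hypothetical shell parameters, e.g.:
        //   args[0] = "id BIGINT, name STRING, ts TIMESTAMP(3)"
        String schemaDdl  = args[0];
        String kafkaTopic = args[1];
        String outputPath = args[2];

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);  // file rolling is tied to checkpoints
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Source table: the schema fragment is spliced into the DDL, so
        // developers supply parameters instead of writing code.
        tEnv.executeSql(
            "CREATE TABLE source_t (" + schemaDdl + ") WITH ("
          + " 'connector' = 'kafka',"
          + " 'topic' = '" + kafkaTopic + "',"
          + " 'properties.bootstrap.servers' = 'broker:9092',"
          + " 'format' = 'json')");

        // Sink table: same schema, landed as ORC files on HDFS.
        tEnv.executeSql(
            "CREATE TABLE sink_t (" + schemaDdl + ") WITH ("
          + " 'connector' = 'filesystem',"
          + " 'path' = '" + outputPath + "',"
          + " 'format' = 'orc')");

        tEnv.executeSql("INSERT INTO sink_t SELECT * FROM source_t");
    }
}
```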
4. Data Landing Practice Based on Flink + Hudi
After the Hudi version with Flink integration was released, the real-time platform began compatibility work and incorporated Hudi into the platform's development scope.
The overall architecture after integration looks like this:
The real-time platform connects its various data sources and sink targets through plug-ins. We referred to the sink process of HudiFlinkTable and wired Hudi into our real-time development platform.
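As an illustration of what such a wired-up sink amounts to, here is a minimal sketch using the Hudi Flink SQL connector. The source table, field names, and paths are hypothetical and this is not our platform's plugin code.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class HudiSinkExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Hudi commits are driven by checkpoints, so checkpointing must be on.
        env.enableCheckpointing(60_000);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Hypothetical Kafka source table.
        tEnv.executeSql(
            "CREATE TABLE kafka_source (uuid STRING, name STRING, ts TIMESTAMP(3)) WITH ("
          + " 'connector' = 'kafka',"
          + " 'topic' = 'events',"
          + " 'properties.bootstrap.servers' = 'broker:9092',"
          + " 'format' = 'json')");

        // Hudi sink table; MERGE_ON_READ suits write-heavy streaming loads.
        tEnv.executeSql(
            "CREATE TABLE hudi_sink ("
          + "  uuid STRING PRIMARY KEY NOT ENFORCED,"
          + "  name STRING,"
          + "  ts TIMESTAMP(3)"
          + ") WITH ("
          + " 'connector' = 'hudi',"
          + " 'path' = 'hdfs:///warehouse/hudi_sink',"
          + " 'table.type' = 'MERGE_ON_READ')");

        tEnv.executeSql("INSERT INTO hudi_sink SELECT uuid, name, ts FROM kafka_source");
    }
}
```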
To improve usability, we mainly added the following auxiliary functions:
Automatic synchronization and update of Hive metadata (see the sketch after this list)
Automatic splicing of the Hudi schema
Task monitoring, metrics access, and so on
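The first item comes largely for free from the Hudi Flink connector's built-in Hive sync. Below is a hedged sketch of the extra WITH options involved; the exact option names vary across Hudi releases (check the FlinkOptions class for your version), and the metastore address, database, and table names here are hypothetical.

```java
// Extra options for the hudi_sink DDL above that turn on automatic
// Hive metadata synchronization after each commit (option names per
// recent Hudi releases; values are hypothetical).
String hiveSyncOptions =
      ", 'hive_sync.enabled' = 'true'"          // turn Hive sync on
    + ", 'hive_sync.mode' = 'hms'"              // sync through the Hive metastore client
    + ", 'hive_sync.metastore.uris' = 'thrift://hive-metastore:9083'"
    + ", 'hive_sync.db' = 'ods'"
    + ", 'hive_sync.table' = 'hudi_sink'";
```

Appended to the sink table's WITH clause, these options make the landed table visible and queryable from Hive once a commit completes, which is what the automatic Hive metadata synchronization above refers to.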
The actual usage flow is as follows:
After the whole system went live, it has been used by various business lines for report development, real-time online analysis, and other scenarios, and it serves the business well: 26 links are online in total, landing roughly 300 million records per day.
5. Follow-up Plans and Outlook
Follow-up work focuses mainly on the following aspects:
5.1 Replace offline reports to improve their timeliness and stability
Offline reports are characterized by T-1 latency, early-morning batch runs, and long dependency chains, which causes two problems. First, timeliness is low. Second, with a long data dependency chain, a problem in intermediate data easily delays everything downstream, and many anomalies are only exposed when the report task actually runs. Worse, batch problems surface in the early hours of the morning, when both the options for fixing them and the resources that can be coordinated are limited. This is unacceptable for reports that demand stability and timeliness, especially at a financial company. Migrating these reports to the real-time platform not only improves their timeliness; because extraction and report ETL run continuously in real time, the stability of the report data also improves greatly. This is one of the planned applications of Hudi real-time data.
5.2 Improve the monitoring system to increase the stability of data tasks
At present we only monitor the landing task itself: whether it runs normally, whether it throws exceptions, and so on. But actual users care more about monitoring the whole link from upstream data to Hive: whether data is delayed, whether there is backpressure, whether the data source lags in consumption, whether landed data is lost, whether any task is a bottleneck, and so on. In general, users want a more comprehensive and detailed view of how tasks are running, and that is the goal of the next round of monitoring improvements.
5.3 Explore visualization of the intermediate stages of data landing
This is similar to the monitoring above: users want to trace a single record from the data source through each operator and see its details, for example whether the record was filtered out, which window it fell into, and the processing time at each operator. Otherwise the whole SQL data processing pipeline is a black box to users.
That is all of "Example Analysis of Apache Hudi Combined with Flink". Thank you for reading, and I hope the content helps you!