2025-04-07 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
How can Delta Lake be used to bring CDC data into a data lake in real time? This article summarizes the common CDC approaches, their drawbacks, and a solution built on Streaming SQL and Delta Lake.
What is CDC?
Change Data Capture (CDC) tracks and captures changes in a data source and synchronizes them to target storage (such as a data lake or data warehouse) for backup or downstream analysis. Synchronization can run at minute, hour, or day granularity, or in real time. CDC schemes fall into two types: intrusive and non-intrusive.
Intrusive
An intrusive scheme queries the data source system directly (for example, reading data over JDBC), which puts performance pressure on the source system. Common approaches are as follows:
Last modified time (Last Modified)
The source table needs a modification-time column, and the synchronization job takes a last-modified-time parameter so that only rows changed after that point in time are synchronized. This method cannot capture deleted records, and if a record changes several times between runs, only its latest state is captured.
Auto-increment id column
The source table needs an auto-increment id column, and the synchronization job takes the maximum id value from the previous run so that only rows added since then are synchronized. This method cannot capture deleted records either, and it does not see changes to existing records.
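The two intrusive modes above can be sketched with a small in-memory example. This is only an illustration, assuming a hypothetical table `test` with `id`, `a`, `b`, and `modified_at` columns in SQLite; the helper names `pull_last_modified` and `pull_by_id` are made up for this sketch and do not come from any synchronization tool.

```python
import sqlite3

# Hypothetical source table held in an in-memory SQLite database.
src = sqlite3.connect(":memory:")
src.execute(
    "CREATE TABLE test ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, a TEXT, b INTEGER, modified_at INTEGER)"
)
src.executemany(
    "INSERT INTO test (a, b, modified_at) VALUES (?, ?, ?)",
    [("hangzhou", 1, 100), ("beijing", 1, 100)],
)


def pull_last_modified(conn, since):
    """Last Modified mode: fetch rows changed after the given timestamp."""
    sql = "SELECT id, a, b, modified_at FROM test WHERE modified_at > ? ORDER BY id"
    return conn.execute(sql, (since,)).fetchall()


def pull_by_id(conn, max_id):
    """Auto-increment id mode: fetch rows whose id is above the last synced id."""
    sql = "SELECT id, a, b, modified_at FROM test WHERE id > ? ORDER BY id"
    return conn.execute(sql, (max_id,)).fetchall()


# Assume a first sync has already copied both rows (timestamp 100, max id 2).
# The source then changes: two updates, one delete, one insert.
src.execute("UPDATE test SET b = 2, modified_at = 200 WHERE a = 'hangzhou'")
src.execute("UPDATE test SET b = 3, modified_at = 300 WHERE a = 'hangzhou'")
src.execute("DELETE FROM test WHERE a = 'beijing'")
src.execute("INSERT INTO test (a, b, modified_at) VALUES ('shanghai', 1, 300)")

by_time = pull_last_modified(src, 100)  # only the latest hangzhou state; delete unseen
by_id = pull_by_id(src, 2)              # only the new row; update and delete unseen
print(by_time)
print(by_id)
```

In both result sets the deleted `beijing` row simply never appears, so the target keeps a stale copy. The intermediate `b = 2` state is also lost in Last Modified mode, and the id-based pull misses the `hangzhou` updates entirely.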
Non-intrusive
A non-intrusive scheme generally captures changes through the data source's logs (such as a database's binlog), so binlog must be enabled on the source database. Every operation on the data source (insert, update, delete, and so on) is recorded in the binlog, which makes it possible to track inserts, deletes, repeated updates to the same record, and DDL operations in real time.
Example:
insert into testdb.test values ("hangzhou", 1);
update testdb.test set b = 2 where a = "hangzhou";
update testdb.test set b = 3 where a = "hangzhou";
delete from testdb.test where a = "hangzhou";
Replaying the binlog against the target storage in order reproduces these changes there, which is how the data source is exported and kept in sync.
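The ordered replay described above can be illustrated with a minimal sketch. It assumes the binlog has already been parsed into simple `(operation, key, value)` tuples; that tuple layout and the `replay` function are hypothetical simplifications for this sketch, not the actual binlog format.

```python
from typing import Dict, List, Optional, Tuple

# A change event: (operation, key column "a", value column "b").
Event = Tuple[str, str, Optional[int]]


def replay(events: List[Event], target: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Apply change events to the target table strictly in log order."""
    table: Dict[str, int] = {} if target is None else target
    for op, key, value in events:
        if op in ("insert", "update"):
            # For simplicity, an update to a missing key is treated as an upsert.
            table[key] = value
        elif op == "delete":
            table.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return table


# The four statements from the example, in the order they were executed.
events: List[Event] = [
    ("insert", "hangzhou", 1),
    ("update", "hangzhou", 2),
    ("update", "hangzhou", 3),
    ("delete", "hangzhou", None),
]

print(replay(events[:3]))  # state after the two updates
print(replay(events))      # state after the delete
```

Order matters here: applying the same events in a different order (for example, the delete before the last update) would leave a row in the target that no longer exists in the source.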
Common CDC implementations
Two CDC approaches are common in the open source ecosystem:
Sqoop offline synchronization
Sqoop is an open source data synchronization tool that copies database data to HDFS/Hive. It supports both full and incremental synchronization, and users can schedule hourly or daily jobs to synchronize data on a regular basis.
Sqoop incremental synchronization is an intrusive CDC scheme that supports two modes, Last Modified and Append.
Disadvantages:
It pulls data from the source database directly over JDBC, which affects the performance of the source database.
Jobs are scheduled hourly or daily, so data freshness is low.
Delete operations on the source database cannot be synchronized, and Append mode does not support data updates.
Binlog real-time synchronization
Various tools can stream binlog data into message middleware such as Kafka in real time, and a streaming engine such as Spark or Flink can then replay the binlog into target storage (such as Kudu or HBase) in real time.
Disadvantages:
Kudu and HBase are expensive to operate and maintain.
Kudu can become unstable at large data volumes, and HBase is not suited to high-throughput analytics.
The logic for replaying binlog with Spark Streaming is complex, and writing it in Java/Scala code raises the barrier to entry.
Streaming SQL + Delta Lake: real-time ingestion into the lake
The two common CDC schemes introduced above each have drawbacks. The Aliyun E-MapReduce team provides a new CDC solution that combines its self-developed Streaming SQL with Delta Lake to make real-time CDC ingestion into the lake straightforward. Together with Aliyun's newly released Data Lake Formation (DLF) service, the solution also offers a one-stop lake ingestion experience.
Streaming SQL
Spark Streaming SQL provides SQL capabilities on top of Spark Structured Streaming. It lowers the barrier to developing real-time jobs and makes it simpler to turn offline workloads into real-time ones.
The following example consumes data from SLS (LogHub) in real time and writes it to a Delta table:
# create loghub source table
spark-sql> CREATE TABLE loghub_intput_tbl (content string)
         > USING loghub
         > OPTIONS
         > (...)
# create delta target table
spark-sql> CREATE TABLE delta_output_tbl (content string)
         > USING delta
         > OPTIONS
         > (..);
# create streaming SCAN
spark-sql> CREATE SCAN loghub_table_intput_test_stream
         > ON loghub_intput_tbl
         > USING STREAM
# insert loghub source table data into delta target table
spark-sql> INSERT INTO delta_output_tbl SELECT content FROM loghub_table_intput_test_stream;
Delta Lake
Delta Lake is a data lake format open-sourced by Databricks. It provides ACID transactions and metadata management on top of the Parquet format. It also performs better than plain Parquet and supports richer data scenarios, such as data updates and schema evolution.
The E-MapReduce team has added many functional and performance optimizations on top of open source Delta Lake, such as small-file merging, Optimize/DataSkipping/Zorder, and deep integration of Delta with SparkSQL, Streaming SQL, Hive, and Presto.
Streaming SQL + Delta Lake: real-time CDC ingestion into the lake
Spark Streaming SQL provides Merge Into syntax. Combined with Delta Lake's real-time write capability, it makes real-time CDC ingestion into the lake very convenient to implement.
As shown in the image above, SQL alone is enough to complete real-time CDC ingestion into the lake.
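The merge semantics behind this approach can be illustrated in plain Python. This is a rough sketch and not the Streaming SQL syntax itself: the record layout (`key`, `op`, `value`) and the functions `compact` and `merge_into` are assumptions made for the illustration. It assumes each micro-batch arrives in binlog order and is first reduced to the latest change per key, since a merge is generally expected to match at most one source row to each target row.

```python
from typing import Dict, List, Optional, TypedDict


class Change(TypedDict):
    key: str               # primary key of the changed row
    op: str                # "insert", "update" or "delete"
    value: Optional[int]   # new column value; None for deletes


def compact(batch: List[Change]) -> Dict[str, Change]:
    """Keep only the last change per key; the batch must be in binlog order."""
    latest: Dict[str, Change] = {}
    for change in batch:
        latest[change["key"]] = change
    return latest


def merge_into(target: Dict[str, int], batch: List[Change]) -> Dict[str, int]:
    """Merge one micro-batch of changes into the target table (a dict here)."""
    for key, change in compact(batch).items():
        if change["op"] == "delete":
            # Matched and deleted at the source: remove the row.
            # A delete for a key the target never saw is a no-op.
            target.pop(key, None)
        else:
            # Matched: update the existing row. Not matched: insert a new row.
            target[key] = change["value"]
    return target


target = {"beijing": 1}
batch: List[Change] = [
    {"key": "hangzhou", "op": "insert", "value": 1},
    {"key": "hangzhou", "op": "update", "value": 3},
    {"key": "beijing", "op": "delete", "value": None},
    {"key": "shanghai", "op": "insert", "value": 1},
    {"key": "shanghai", "op": "delete", "value": None},
]
print(merge_into(target, batch))
```

Compacting first means a row that is inserted and deleted within the same micro-batch (`shanghai` above) never reaches the target, and a row updated several times is written only once with its final value.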