2025-01-18 Update From: SLTechnology News&Howtos
Author: reposted from the Alibaba Data Mid-End official site, https://dp.alibaba.com

Overview: data extraction means pulling data out of source systems and is the first step in building a data mid-end (data platform). Data sources are usually relational databases; in recent years, with the rapid growth of the mobile internet, other source types have appeared, such as website browsing logs, app browsing logs, and IoT device logs. Technically, extraction from a relational database can be subdivided into two approaches: full extraction and incremental extraction.

A common practice is for the business system to add a timestamp field to the source table and to update it whenever a record is created or modified. The extraction task then scans the full table and selects records by comparing the timestamp field against the task's business time window. This method has two accuracy drawbacks:

1. It captures only the latest state and misses intermediate changes. In an e-commerce scenario, if a customer pays shortly after placing an order, the next day's incremental extraction sees only the "paid" state; the state at order time may already be lost. Whether lost states need to be reconstructed has to be judged against business requirements.
2. Physically deleted records are lost. If the business system deletes records physically, incremental extraction cannot detect the deletions. The usual remedy is to require the business system not to delete records physically but only to mark them as deleted.

Timestamp maintained by the business system: with traditional relational databases such as Oracle and DB2, the application itself must maintain the timestamp field.
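As a minimal sketch of the timestamp-window scan described above (the table `orders` and column `gmt_modified` are illustrative names, not from the original), the incremental pull amounts to a bounded range query against the source database:

```sql
-- Illustrative only: pull rows created or modified within the task's
-- business time window. The half-open interval [start, end) avoids
-- extracting boundary rows twice across consecutive runs.
SELECT *
FROM orders
WHERE gmt_modified >= '2024-01-01 00:00:00'
  AND gmt_modified <  '2024-01-02 00:00:00';
```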
When the business system updates business data, application code also updates the timestamp field. This approach is common, but because it relies on hand-written code, the workload grows and omissions are possible.

Timestamp maintained by triggers: all typical relational databases support triggers. When a record changes, the trigger fires and updates the timestamp field.
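The original example did not survive transcription; a minimal MySQL-style sketch of such a trigger (table and column names are illustrative) might look like:

```sql
-- Illustrative trigger: refresh gmt_modified on every update.
-- BEFORE UPDATE lets the trigger modify NEW before the row is written,
-- so the application never has to touch the timestamp column itself.
CREATE TRIGGER trg_orders_touch
BEFORE UPDATE ON orders
FOR EACH ROW
SET NEW.gmt_modified = NOW();
```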
Timestamp maintained by the database: MySQL can maintain the change-timestamp column automatically, which reduces development work to a certain extent. A typical example creates a table and inserts a record; the database fills in the timestamp columns by itself.
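The original example was lost in transcription; a hedged sketch (table and column names are assumptions) using MySQL's automatic timestamp handling:

```sql
-- Illustrative table: MySQL populates gmt_create on insert and
-- advances gmt_modified automatically on every update.
CREATE TABLE orders (
  id           BIGINT PRIMARY KEY,
  status       VARCHAR(32),
  gmt_create   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  gmt_modified TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
                         ON UPDATE CURRENT_TIMESTAMP
);

INSERT INTO orders (id, status) VALUES (1, 'created');
```

`DEFAULT CURRENT_TIMESTAMP` and `ON UPDATE CURRENT_TIMESTAMP` on multiple columns require MySQL 5.6 or later.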
After the insert, both timestamp columns are populated automatically. When the record is subsequently updated, the database advances the change-timestamp column on its own, with no application code involved.
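Continuing the illustrative table above, an update shows the automatic behavior:

```sql
-- Updating the row: gmt_modified advances automatically.
UPDATE orders SET status = 'paid' WHERE id = 1;

-- gmt_create still holds the insert time; gmt_modified now
-- reflects the time of the update.
SELECT id, status, gmt_create, gmt_modified
FROM orders
WHERE id = 1;
```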
Parsing MySQL binlog logs: in recent years, with the rapid growth of the internet, internet companies have generally adopted MySQL as their primary database. Because it is open source, many companies have built customized extensions on top of it. One of its most important features is read/write splitting with real-time master-slave synchronization, implemented by subscribing to the MySQL binlog: the application writes to the master, replicas apply the binlog stream, and reads are served from the replicas.
Parsing the binlog offers a new approach to data synchronization: the parsed change events are delivered to big data platforms such as Hive or MaxCompute, achieving synchronization with second-level latency. Incremental synchronization by binlog parsing is a mature technique with three major advantages:

1. Low data latency. Even under the massive data volume of Alibaba's Double 11, second-level latency is achievable.
2. No data loss: deletions are captured as well.
3. No extra requirements on the business table; the timestamp field can even be absent.

This synchronization method also has some disadvantages:

1. The technical bar is high. Most companies lack the engineering depth to build the whole pipeline themselves; in China it remains largely limited to the leading internet companies and major state-owned and central enterprises. With the rapid growth of cloud computing, however, the tooling is available as managed services on Alibaba Cloud, where real-time synchronization can be achieved directly; the classic combination is MySQL, DTS, DataHub, and MaxCompute.
2. The resource cost is relatively high: a system must receive the business database's binlog in real time, running continuously and consuming resources.
3. The business table needs a primary key, used to sort and collapse the change events.

Oracle Redo Log: Oracle is a very powerful database. Oracle GoldenGate parses the Redo Log in real time and publishes the parsed results to designated downstream systems.

Full extraction: full extraction copies the data of a table or view in the data source, unchanged, into big data platforms such as Hive and MaxCompute. It is somewhat similar to data migration between business databases. Full synchronization is relatively simple and is often used in offline synchronization scenarios with small data volumes.
However, this synchronization method also has two disadvantages, exactly the same as offline incremental synchronization: 1. it captures only the latest state; 2. it loses physically deleted records. From a business perspective, offline table synchronization can be subdivided into four scenarios, described below.
In principle, when moving data to the cloud, only mirror synchronization is recommended; no business-specific transformation should be performed during the move. There are three motivations for shifting from an ETL strategy to an ELT strategy:

1. Machine cost. Transforming outside the warehouse requires additional machines, bringing new cost.
2. Communication cost. The developers of the business system are also users of the data platform; they are very familiar with the original business tables, and transforming during extraction would force them to learn additional tools and products.
3. Execution efficiency. Transformation machines outside the warehouse generally perform worse than MaxCompute or Hadoop clusters, lengthening execution time.

During synchronization it is recommended to mirror all fields of the whole table, reducing the cost of later changes. The four scenarios:

1. Small source data, updated in full daily: extract the full table by direct database connection and write it into a daily/monthly partition table.
2. Log tables: extract the raw log increment into a daily incremental table, stored increment by increment. Log data is append-only, never modified, so there is no need to keep a full table.
3. Large databases: extract incremental data by business timestamp via direct connection into today's incremental partition, then merge it with yesterday's full partition and write the result into today's full partition.
4. Hour/minute incremental tables, or irregular full pulls: for source data that is updated frequently, at minute or hour frequency.
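The merge step for the large-database scenario can be sketched in Hive/MaxCompute-style SQL (table, partition, and column names are assumptions, not from the original): union yesterday's full partition with today's increment, and keep, per primary key, the row with the latest business timestamp.

```sql
-- Illustrative merge: yesterday's full + today's increment -> today's full.
INSERT OVERWRITE TABLE ods_orders_full PARTITION (ds = '20240102')
SELECT id, status, gmt_modified
FROM (
  SELECT id, status, gmt_modified,
         ROW_NUMBER() OVER (PARTITION BY id
                            ORDER BY gmt_modified DESC) AS rn
  FROM (
    SELECT id, status, gmt_modified
    FROM ods_orders_full WHERE ds = '20240101'   -- yesterday's full
    UNION ALL
    SELECT id, status, gmt_modified
    FROM ods_orders_inc  WHERE ds = '20240102'   -- today's increment
  ) unioned
) ranked
WHERE rn = 1;  -- keep only the newest version of each primary key
```

This is why the binlog approach requires a primary key: without one, there is nothing to deduplicate and order the change records by.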
For the hour/minute scenario: incremental data is extracted from the source database by timestamp into an hour/minute incremental partition table; the day's N hour/minute incremental partitions are merged into the daily incremental partition table; and that daily increment is merged with yesterday's full partition and written into today's full partition table.

For more information, see the Alibaba Data official site, https://dp.alibaba.com. The Alibaba Data Mid-End team is committed to sharing Alibaba Cloud's best practices in data intelligence, helping every enterprise build its own data platform and move together toward intelligent business in the new era. The core products of the Alibaba data mid-end solution: Dataphin, driven by Alibaba's big-data methodology OneData, provides one-stop data construction and management; Quick BI, distilling Alibaba's data-analysis experience, provides one-stop data analysis and presentation; Quick Audience, integrating Alibaba's consumer insight and marketing experience, provides one-stop audience selection, insight, and marketing to connect with Alibaba's ecosystem and achieve user growth.
This article is original content from the Yunqi community and may not be reproduced without permission.
© 2024 shulou.com SLNews company. All rights reserved.