Example Analysis of ETL loading Strategy

2025-04-07 Update | From: SLTechnology News&Howtos > Servers


Shulou (Shulou.com) 05/31 Report

This article introduces ETL loading strategies through example analysis. It should have some reference value for interested readers; we hope you learn a lot from it.

Overview of ETL data loading Mechanism

ETL is the abbreviation of Extract, Transform, and Load. Its job is to extract the required data from the data source, clean and transform it, and finally load it into the data warehouse according to a predefined warehouse model. It is the most important step in building a data warehouse.

Loading data into the warehouse falls into two categories: full load (update) and incremental load (update).

Full load: the target table is emptied and then fully reloaded from the source.

Incremental load: only the data that has changed in the source table is applied to the target table.

Full loading is technically much simpler than incremental loading: in general, you only empty the target table before loading and then import all data from the source table. However, because of data volumes, system resource limits, and real-time requirements, an incremental loading mechanism is needed in many cases.
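As a minimal sketch of a full load, assuming hypothetical src_orders and dw_orders tables in SQLite via Python (the table and column names are illustrative, not from the article), the target is emptied and repopulated in a single transaction:

```python
import sqlite3

# In-memory database standing in for the source system and the warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE dw_orders  (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO src_orders VALUES (1, 10.0), (2, 20.0);
    INSERT INTO dw_orders  VALUES (9, 99.0);   -- stale row from a prior load
""")

def full_load(conn):
    """Empty the target table, then copy every source row into it."""
    with conn:  # one transaction: either both steps apply or neither does
        conn.execute("DELETE FROM dw_orders")
        conn.execute("INSERT INTO dw_orders SELECT id, amount FROM src_orders")

full_load(conn)
rows = conn.execute("SELECT id, amount FROM dw_orders ORDER BY id").fetchall()
```

Wrapping both statements in one transaction means a failed reload never leaves the warehouse half empty.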

The difficulty of incremental loading is that correct and effective methods must be designed to extract from the data source both the changed data and the unchanged data that is affected by those changes, and to apply both, after the appropriate logical transformations, to the data warehouse. A good incremental extraction mechanism must capture the changed data in the business system accurately at the required frequency without putting undue pressure on the business system or disrupting existing operations, while still meeting the logical requirements of the transformation process and guaranteeing the correctness of the target table after loading. The performance of the load itself and the maintainability of restart-and-recover after a job failure are also important considerations.

The incremental extraction mechanism is more suitable for data tables with the following characteristics:

A target table with a huge amount of data.

The source table's data changes in a regular pattern, for example growing or shrinking along a time series.

The change data of the source table is relatively small compared to the total data.

The target table needs to retain expired or redundant information (so it cannot simply be truncated and reloaded).

The business system can provide delta data directly.

If more than one quarter (1/4) of the source data needs to be updated in each extraction, consider switching the ETL loading method from incremental extraction to full extraction. Full extraction is also suitable for systems with small data volumes and low update frequency.

ETL incremental loading mechanisms

The main approaches to ETL incremental loading are:

Syslog analysis mode

Trigger mode

Timestamp mode

Full table comparison mode

Source system incremental (delta) data is loaded directly or after conversion

One-to-one incremental updates between source and target tables

This kind of update is most common between a production database and its backup, or in data distribution between data marts. The correspondence between source and target is very simple: the data is identical, or the source data is merely filtered.

Syslog analysis can be used for this type of incremental update.

Syslog analysis mode

This method identifies changed data by analyzing the database's own logs. Relational database systems record all DML operations in log files to support backup and restore. By analyzing the log and extracting the DML operations performed on the relevant source tables after a given time, the ETL incremental extraction process learns what has changed in the table since the last extraction and can act on exactly those changes.

Advantages and disadvantages of system log analysis

Advantages: simple to implement, with good isolation; a load failure does not cascade into the source table or its transactions.

Disadvantages: log table maintenance must be performed by the OLTP system, so the OLTP system's business procedures have to be modified to record log information. Maintaining the log table is troublesome and has a significant impact on the original system.

Trigger mode

There are two main ways to extract trigger increments:

Load the data directly

Incremental loading using incremental log table

In the direct-load approach, a temporary table with a structure similar to the source table is created, along with three triggers covering insert, update, and delete. Whenever data changes in the source table, the triggers copy the changed data into the temporary table. The ETL process then applies the temporary table's contents to the target table and empties the temporary table when it finishes.
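A minimal sketch of the direct-load trigger approach, using SQLite through Python (the table names and the I/U/D operation codes are assumptions for illustration): three triggers copy every change into a change table that the ETL process later drains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER PRIMARY KEY, val TEXT);
    -- Temporary change table mirroring the source structure, plus an op code.
    CREATE TABLE src_changes (id INTEGER, val TEXT, op TEXT);

    CREATE TRIGGER trg_ins AFTER INSERT ON src BEGIN
        INSERT INTO src_changes VALUES (NEW.id, NEW.val, 'I');
    END;
    CREATE TRIGGER trg_upd AFTER UPDATE ON src BEGIN
        INSERT INTO src_changes VALUES (NEW.id, NEW.val, 'U');
    END;
    CREATE TRIGGER trg_del AFTER DELETE ON src BEGIN
        INSERT INTO src_changes VALUES (OLD.id, OLD.val, 'D');
    END;
""")

# Normal business activity on the source table fires the triggers.
conn.execute("INSERT INTO src VALUES (1, 'a')")
conn.execute("UPDATE src SET val = 'b' WHERE id = 1")
conn.execute("DELETE FROM src WHERE id = 1")

changes = conn.execute("SELECT id, val, op FROM src_changes").fetchall()
# The ETL process would apply these rows to the target table, then:
conn.execute("DELETE FROM src_changes")  # empty the temp table at the end
```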

Incremental loading with an incremental log table does not extract the source data directly; instead, the operations are written to an incremental log table (and processed entries in that table must be marked or deleted promptly). The incremental log table generally does not store all the fields of the incremental data, only the source table name, the updated key value, and the operation type (insert, update, or delete). The ETL incremental extraction process first fetches the complete record from the source table using the table name and key value, then processes the target table according to the operation type.

Because the incremental log table records only the origin of the incremental data rather than the data itself, the incremental ETL process obtains the real incremental data by looking it up in the source table using the entries in the log.
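The look-up-from-log step can be sketched as follows (a hypothetical change_log table holding only table name, key value, and operation type; names are assumptions, not from the article):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER PRIMARY KEY, val TEXT);
    -- The log records only WHERE the change happened, not the data itself.
    CREATE TABLE change_log (tab TEXT, key_val INTEGER, op TEXT);
    INSERT INTO src VALUES (1, 'a'), (2, 'b');
    INSERT INTO change_log VALUES ('src', 2, 'U');
""")

# ETL step: for each logged key, fetch the full current record from the source.
increments = []
for tab, key_val, op in conn.execute("SELECT tab, key_val, op FROM change_log"):
    # Table name comes from the log, so it is interpolated; key is bound.
    row = conn.execute(f"SELECT * FROM {tab} WHERE id = ?", (key_val,)).fetchone()
    increments.append((op, row))

# Mark or delete processed entries so they are not picked up again.
conn.execute("DELETE FROM change_log")
```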

Advantages and disadvantages of trigger mode

Advantages: high performance of data extraction.

Disadvantages: triggers must be created on the business tables, which affects the business system and requires modifying the user database; triggers cannot operate across multiple tables or on views; and an error in the target table can cause a cascading transaction failure, which is unacceptable in a production system. Another drawback is that if a problem occurs while the triggers are active, the whole table sometimes has to be reloaded to resume the job. This method is suitable for one-to-one incremental updates of tables whose business logic is not complex.

One-to-one incremental updates between source and target tables that require data conversion

Some source and target tables do not have a simple one-to-one data correspondence; their data structures differ, and operations such as summarization or row-to-column pivoting are needed before the target can be updated. The log analysis and trigger methods described above are not suitable here, so timestamps are the better fit.

Timestamp mode

The principle is that during incremental extraction, the process compares the system time (or the maximum timestamp of the last extraction) with the value of a timestamp field in the source table to decide which rows to extract. This requires adding a timestamp field to the source table and updating its value whenever the row's data is inserted or modified.

When using timestamps for incremental updates, the source table must have a suitable timestamp field; source tables without one must be retrofitted to add it.
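A minimal sketch of timestamp-based extraction (the last_modified column and the watermark value are hypothetical): rows newer than the high-water mark saved by the previous run are extracted, and a new watermark is recorded for the next run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The source table has been retrofitted with a last_modified timestamp.
    CREATE TABLE src (id INTEGER PRIMARY KEY, val TEXT, last_modified TEXT);
    INSERT INTO src VALUES
        (1, 'old', '2024-01-01 00:00:00'),
        (2, 'new', '2024-01-02 08:30:00');
""")

# High-water mark saved at the end of the previous extraction run.
last_extract_ts = "2024-01-01 12:00:00"

# Only rows touched since the last run are extracted.
delta = conn.execute(
    "SELECT id, val FROM src WHERE last_modified > ? ORDER BY id",
    (last_extract_ts,),
).fetchall()

# Persist the new high-water mark for the next run.
new_watermark = conn.execute("SELECT MAX(last_modified) FROM src").fetchone()[0]
```

Note that deleted rows never appear in this query, which is exactly the weakness discussed under the disadvantages below.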

Advantages and disadvantages of timestamp

Advantages: the logic is simple to implement and large volumes of data can be updated. It can capture incremental data not just from one source table but from multiple source tables.

Disadvantages:

a. The timestamp method captures insert and update operations on the source table correctly, but it cannot capture deletes; another mechanism is needed for those. One option is to design a companion table with the same structure as the source table that records deleted rows together with a deletion timestamp, and to read that table as well when updating the target.

Alternatively, delete logically in the target table by setting a marker (e.g. update active_flag=1).

b. Data delays may occur when the timestamp field is updated automatically by the system or manually.

c. If the system updates the timestamp automatically, take special care that the field's value does not change while the whole table is being loaded; otherwise the timestamp will advance to the load time and break the table's incremental update logic.

d. There are some limitations in applicability: the source table must have a timestamp field. If some source (or reference) tables have no timestamp field, and fields of the source table are updated (common with dimension definition updates, e.g. a region or product dimension), we face the problem of updating historical data. The timestamp method can then only apply the new dimension definition to new data, not to historical data. In that case historical data generally has to be updated with SQL statements or with the full table comparison method described below, or the timestamp range has to be widened to refresh the whole table. This approach requires that the target table's real-time requirements are low, so the work can be done while the system is idle.

e. Since timestamp incremental updates are often used in ETL processes with complex business logic, take care when multiple source records are summarized into one target record. For example, if source records R1, R2, and R3 (all with the same primary key) are summarized into target record T1, and one of them (say R2) changes within the time window, then R2's primary key must be used to capture R1 and R3 from the source table as well (even though they did not change within the window), so that T1 can be re-summarized and updated correctly, avoiding data loss or incorrect logic.
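The re-summarization rule above can be sketched as follows (hypothetical src/tgt tables; K1 stands for the shared primary key): the keys of changed rows are found by timestamp, but the aggregation re-reads all rows for those keys, including unchanged ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (k TEXT, amount REAL, last_modified TEXT);
    CREATE TABLE tgt (k TEXT PRIMARY KEY, total REAL);
    INSERT INTO src VALUES
        ('K1', 10.0, '2024-01-01'),   -- R1, unchanged
        ('K1', 99.0, '2024-01-02'),   -- R2, changed inside the window
        ('K1', 30.0, '2024-01-01');   -- R3, unchanged
    INSERT INTO tgt VALUES ('K1', 60.0);  -- stale summary from before R2 changed
""")

last_extract_ts = "2024-01-01"

# Step 1: keys of rows that changed inside the time window (only R2 here).
changed_keys = [r[0] for r in conn.execute(
    "SELECT DISTINCT k FROM src WHERE last_modified > ?", (last_extract_ts,))]

# Step 2: re-read ALL source rows for those keys (R1 and R3 too, even though
# they did not change) and re-summarize, so the target total stays correct.
for k in changed_keys:
    total = conn.execute(
        "SELECT SUM(amount) FROM src WHERE k = ?", (k,)).fetchone()[0]
    conn.execute(
        "INSERT INTO tgt (k, total) VALUES (?, ?) "
        "ON CONFLICT(k) DO UPDATE SET total = excluded.total", (k, total))

result = conn.execute("SELECT total FROM tgt WHERE k = 'K1'").fetchone()[0]
```

Summing only the changed row would have produced a wrong total; re-aggregating the whole key group yields 10 + 99 + 30.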

Many-to-one incremental updates between source and target tables

In day-to-day ETL updates we also encounter target tables whose data comes from multiple source tables, updated by joining on key fields. If all the source tables have timestamp fields, timestamps can be used for incremental updates; full table comparison is also an option.

Full table comparison mode

Full table comparison means that during incremental extraction, the ETL process compares the records of the source and target tables one by one and reads out the new and modified records. The optimized form uses MD5 checksums: a temporary table with a similar structure is created in advance for each table to be extracted, recording the source table's primary key value and an MD5 checksum computed over all of the source row's fields. On each extraction, the checksums in the source table and the MD5 temporary table are compared. If a checksum differs, an UPDATE is performed; if a primary key value does not yet exist in the target table, the record is new and an INSERT is performed; and primary key values that no longer exist in the source but remain in the target are DELETEd.
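A minimal sketch of the MD5 variant in Python (hypothetical src and src_md5 tables; the per-row checksum scheme is an illustration): the previous run's checksums classify each source row as insert, update, or untouched, and vanished keys become deletes.

```python
import hashlib
import sqlite3

def row_md5(row):
    """MD5 checksum over the concatenation of all field values of a row."""
    return hashlib.md5("|".join(map(str, row)).encode()).hexdigest()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER PRIMARY KEY, val TEXT);
    -- MD5 temporary table: primary key plus checksum over all source fields.
    CREATE TABLE src_md5 (id INTEGER PRIMARY KEY, md5 TEXT);
""")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(1, "a"), (2, "b-changed"), (3, "brand-new")])
# Checksums saved on the previous run: row 2 has changed since then,
# row 3 did not exist yet, and row 4 has been deleted from the source.
conn.executemany("INSERT INTO src_md5 VALUES (?, ?)",
                 [(1, row_md5((1, "a"))),
                  (2, row_md5((2, "b"))),
                  (4, row_md5((4, "gone")))])

old = dict(conn.execute("SELECT id, md5 FROM src_md5"))
inserts, updates, current_ids = [], [], set()
for row in conn.execute("SELECT id, val FROM src"):
    current_ids.add(row[0])
    if row[0] not in old:
        inserts.append(row)               # unknown primary key -> INSERT
    elif old[row[0]] != row_md5(row):
        updates.append(row)               # checksum differs -> UPDATE
deletes = sorted(set(old) - current_ids)  # vanished keys -> DELETE
```

Every source row is read and hashed on every run, which is exactly why this method scales poorly to large tables.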

Advantages and disadvantages of full table comparison

The advantage is that it applies to:

a. It involves the extraction and transformation of multiple source tables and the incremental update of complex business logic.

b. The source table has no timestamp field and cannot be updated incrementally by timestamp.

The disadvantage is that full table comparison is inefficient for large data tables.

Summary and analysis

Table 1. Comparison table of various incremental extraction methods

                          | Syslog analysis | Trigger          | Timestamp    | Full table comparison
Captures new (insert) data| Yes             | Yes              | Yes          | Yes
Captures updated data     | Yes             | Yes              | Yes          | Yes
Captures deleted data     | Yes             | Yes              | No           | Yes
Suitable data volume      | Large           | Large            | Moderate     | Small
Target table type         | All             | All except views | All          | All
Source-to-target mapping  | 1:1             | 1:1              | Multi-source | Multi-source
Processing logic          | Simple          | Simple           | Complex      | Complex
System resource usage     | High            | Low              | Low          | High
Extraction performance    | Good            | Good             | Good         | Poor
Timestamp field required  | No              | No               | Yes          | No
Disaster tolerance        | Poor            | Fair             | Fair         | Excellent

When choosing the appropriate incremental loading mechanism, you need to be aware of the following:

1. Disaster recovery capability of incremental extraction. This refers mainly to whether, after a failed run or a database crash and restart, the incremental data can be reloaded, or whether manual bookkeeping is required; it also affects the maintainability of incremental loading to some extent.

If a run fails or does not execute, the system log analysis method cannot recapture the source table's incremental data or recover the loading of historical increments, so its disaster recovery capability is the worst. With the trigger and timestamp methods, if a corresponding log table exists, or the incremental log has not yet been cleared, the historical incremental data can be reloaded after a restart, so their disaster tolerance is fair. The full table comparison method, by its very principle, is unaffected by a restart and can always capture the increments in full, giving it the best disaster tolerance.

2. Performance factors of incremental extraction. These show up in two ways: the performance of the extraction process itself, and the negative impact on the source system's performance. The trigger, log table, and system log analysis methods need no comparison step during extraction, so their incremental extraction performance is better. The full table comparison method must go through a costly comparison process to identify changed records, so its extraction performance is the worst.

If the business logic of the incremental update is complex, demands significant machine resources, and the update may affect the performance of existing business tables, evaluate how important the incremental update really is. If it has little impact on existing business, the source table is relatively stable, and customers can tolerate a temporary degree of data inconsistency, consider running the incremental update on a schedule: for example once a day in the early hours, or even once a week during idle time.

In addition, a common industry practice is for the source system to export incremental delta data files during its spare time; the data warehouse system then picks up the delta files and loads them. In data warehouse architecture design, direct data exchange between the production system layer and the ODS layer is of this type.
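A sketch of loading such a delta file (the op,id,value CSV layout and the dw table are assumptions for illustration; real delta formats vary by source system):

```python
import csv
import io
import sqlite3

# Stand-in for a delta file exported by the source system in its off-hours.
# Each line: operation code (I/U/D), key, value.
delta_file = io.StringIO("I,1,alpha\nI,2,beta\nU,2,beta2\nD,1,\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw (id INTEGER PRIMARY KEY, val TEXT)")

# Replay the delta operations against the warehouse table in file order.
for op, rid, val in csv.reader(delta_file):
    rid = int(rid)
    if op == "I":
        conn.execute("INSERT INTO dw VALUES (?, ?)", (rid, val))
    elif op == "U":
        conn.execute("UPDATE dw SET val = ? WHERE id = ?", (val, rid))
    elif op == "D":
        conn.execute("DELETE FROM dw WHERE id = ?", (rid,))

rows = conn.execute("SELECT id, val FROM dw ORDER BY id").fetchall()
```

Because the file is replayed in order, a later delete of key 1 correctly removes the row inserted earlier in the same file.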

3. The trigger mode requires creating triggers on the source tables, which some applications refuse to allow. Methods that create temporary tables, such as full table comparison and log tables, may be blocked by restrictions on the database permissions granted to the ETL process. The same can happen with the syslog-based approach, since most database products allow only specific user groups, or even only the DBA, to perform log analysis. For example, DB2's Replication function can only be maintained and modified by a DBA, not by ordinary users.

4. To guarantee the accuracy of incremental updates, it is recommended to build a set of verification scripts that check data consistency between the source tables and the warehouse tables. These audit scripts can be run at frequencies matching their priority.

5. The source systems that data extraction faces are not always relational databases. It is common for an ETL process to extract Excel or CSV data from legacy systems built years ago. None of the incremental mechanisms based on relational database features work there, although timestamps and full table comparison may still be useful to some extent. In the worst case, we must abandon incremental extraction and fall back to full delete-and-insert (or require the source system to provide incremental delta data files).

Thank you for reading this article carefully. We hope "Example Analysis of ETL Loading Strategy" has been helpful to you.
