
What are the modeling and ETL practice skills of data warehouses?

2025-03-27 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

Today I would like to talk with you about data warehouse modeling and ETL practice skills. Many people may not be familiar with these topics, so to help you understand them better, the editor has summarized the following content; I hope you gain something from this article.

How should a data warehouse be built, what methods and principles should be followed in the process, and what practical skills come up in real projects? Let's look at each in turn.

First, the "heart" of data warehouse

First, let's talk about data models. A model is a simulation and abstraction of real-world features: maps, architectural sand tables, and airplane models are all examples.

A data model (Data Model), likewise, is an abstraction of the data features of the real world.

In a data warehouse project, establishing the data model is of great significance. The customer's business scenarios, process rules, and industry knowledge are all expressed through the data model, and it builds a communication bridge between business staff and technical staff. This is why some foreign data warehouse literature calls the data model "The Heart of the Data Warehouse".

The design of the data model has a direct impact on the data's:

Stability

Ease of use

Access efficiency

Storage capacity

Maintenance cost

II. Overview of Data Models, Data Layering, and ETL Programs in the Data Warehouse

A data warehouse is an information system that integrates various external data sources in a (near) real-time or batch way and provides end users with many ways to consume the data.

Facing a variety of upstream business systems, an important task of the data warehouse is to clean and integrate their data into a standardized structure, providing a credible, consistent basis for all subsequent data analysis.
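As a minimal sketch of this integration step (all table and column names here are assumptions for illustration, not from the article), the SQL below conforms inconsistent gender codes from two hypothetical source systems into one standard code:

    -- Conform source-specific codes into one standard during integration.
    -- src_sys_a.customer and src_sys_b.client are hypothetical source tables.
    INSERT INTO dw_base.customer_std (customer_id, gender_cd, src_system)
    SELECT customer_id,
           CASE gender WHEN 'M' THEN '1' WHEN 'F' THEN '2' ELSE '0' END,
           'SYS_A'
    FROM   src_sys_a.customer
    UNION ALL
    SELECT client_no,
           CASE sex_flag WHEN '1' THEN '1' WHEN '2' THEN '2' ELSE '0' END,
           'SYS_B'
    FROM   src_sys_b.client;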

On the other hand, the data in the warehouse needs to be expressed in many forms: fixed reports for understanding the state of enterprise production, KPI dashboards for reporting to management, real-time data pushed to large display screens, data marts for departmental applications, and data laboratories for analysts. For these different ways of consuming data, the data needs to move from a highly consistent base model into dimensional models that are convenient for presentation and analysis. Data at different stages therefore needs models with different architectural characteristics, and this is why a data warehouse layers its data.

Data flows between layers by being transformed from one data model to another, and this transformation is carried out by ETL algorithms. By analogy: data is the raw material of the warehouse, the data models are the molds for different product forms, and the data layers are the warehouse's "workshops". The assembly-line movement of data through the workshops depends on the scheduling tool (the factory's process-automation software), the client tool that executes SQL is the mechanical arm on the line, and the ETL program is the algorithmic core that drives that arm to machine the product.

[Figure: the hybrid hub-and-spoke data warehouse architecture, from The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling.]

2.2 The Layered Model in the Financial Industry

The financial industry has the highest requirements for model construction and is the most mature in data warehouse practice. Over many years of financial-industry data warehouse projects, a basic structure of buffer layer, base model layer, summary layer (common processing layer), and data mart layer has taken shape. Different customers evolve this four-layer model differently: layers may be merged to form three, or subdivided to form five or six. This article briefly introduces the most common four-layer model:

Buffer layer: in some projects also known as the ODS layer. Simply put, the data model of this layer stays close to the source: it forms a landing buffer for the upstream systems inside the warehouse, where the original production data is preserved and reflected, so the retention period of this layer is relatively short. Its most common use is to provide simple access to the original data in the source-system structure, for example for audit.
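As a minimal sketch (every name below is illustrative, not from any real project), a buffer-layer table for accounts could simply mirror the source structure and add load metadata:

    -- Buffer (ODS) table: same columns as the hypothetical source table,
    -- plus an etl_date column recording the load date.
    CREATE TABLE ods.account_buf (
        account_no   VARCHAR(32),
        customer_id  VARCHAR(32),
        balance      DECIMAL(18,2),
        status       CHAR(1),
        etl_date     DATE
    );

    -- Short retention: purge loads older than, say, 7 days.
    DELETE FROM ods.account_buf WHERE etl_date < CURRENT_DATE - 7;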

Base layer: also called the core layer, base model layer, PDM layer, and so on. After the data is divided and integrated by subject area, detailed data is kept here for a long period. This layer is highly integrated; it is the core area of the entire data warehouse and the foundation of all subsequent data layers. Data is kept in this layer for at least 13 months, and 2 to 5 years is common.
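Continuing the sketch under the same illustrative names, the base-layer version of the account data is integrated by subject area and kept as a history chain ("zipper"), with a start/end date pair on every record:

    -- Base-layer table: long retention, one row per version of each key.
    -- end_dt = '9999-12-31' marks the currently open version.
    CREATE TABLE base.account (
        account_no   VARCHAR(32),
        customer_id  VARCHAR(32),
        balance      DECIMAL(18,2),
        status       CHAR(1),
        start_dt     DATE,
        end_dt       DATE
    );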

Data mart layer: let us skip ahead to the last layer first. The data models of the mart layer have strong business meaning and are easy for business staff to understand and use. This layer serves the access and queries of department users, business users, and key management users, and it is typically what the front-end portal's data queries, reporting tools, and data mining and analysis tools connect to.
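A hedged sketch of the kind of query the mart layer is built for, assuming hypothetical star-schema tables that a reporting tool might hit:

    -- Mart layer: business-friendly star-schema query,
    -- e.g. monthly transaction totals per branch.
    SELECT d.branch_name,
           t.month_id,
           SUM(f.txn_amount) AS total_amount
    FROM   mart.fact_transaction f
    JOIN   mart.dim_branch d ON f.branch_key = d.branch_key
    JOIN   mart.dim_date   t ON f.date_key   = t.date_key
    GROUP  BY d.branch_name, t.month_id;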

Summary layer: the summary layer is usually not set up from the start. Often, after the base and mart layers have been built, it turns out that producing the marts repeatedly queries and scans the same base-layer data, and the statistical calculations of different departments' marts have much in common. The shared summary results between the two layers are therefore extracted into an independent data layer, a bridge between the layer below and the layer above that saves computing resources for the whole system.
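For instance (a sketch under the same illustrative names as above), a summary-layer table can hold one shared daily aggregate computed once from the base layer and then reused by every downstream mart:

    -- Summary layer: compute the shared aggregate once, reuse it downstream.
    INSERT INTO agg.daily_account_balance (etl_date, customer_id, total_balance)
    SELECT CURRENT_DATE, customer_id, SUM(balance)
    FROM   base.account
    WHERE  end_dt = DATE '9999-12-31'   -- open-chain (current) records only
    GROUP  BY customer_id;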

2.3 Common ETL Algorithms in the Data Warehouse

Although the data models in the warehouse vary widely across industries and business scenarios, in essence the processing from the buffer layer to the base layer is about how to efficiently merge incremental or full data into the base-layer tables while forming a coherent chain of each record's historical changes, and the processing from the base layer to the summary layer and on to the mart layer is about joining, aggregating, and grouping data. After long-term accumulation, the data transformations between layers therefore settle into a fixed set of ETL algorithms, which is also why many data warehouse code-generation tools on the market can automatically produce ETL scripts for warehouse development without hand coding. For reasons of space, we only briefly list several common algorithms from the buffer layer to the base layer; the full SQL script for each algorithm can be covered in detail another time, though hedged sketches of the first three are given after the list below.

1. Full table overwrite (A1)
Algorithm description: delete all data from the target table, then insert the current data.
Source data: full.
Applicable scenarios: no historical track needs to be kept; only the latest state data is used.

2. Update-insert (Upsert, A2)
Algorithm description: compare today's data with the target by primary key; update the matched records and insert the new ones.
Source data: incremental or full.
Applicable scenarios: no historical track needs to be kept; only the latest state data is used.

3. History chain (HistoryChain, A3)
Algorithm description: compare today's data with the previous day's by primary key; close the chain of updated records as of the current day and open a new chain for them, and insert new records as open-chain records of the day.
Source data: incremental or full.
Applicable scenarios: data whose historical track must be kept but for which deletions are ignored, such as customer tables and account tables.

4. Full history chain (FullHistoryChain, A4)
Algorithm description: compare today's full data with the open-chain data in the zipper table; records present in the chain but absent from today's data are closed as deleted, updated records are closed and reopened as of the current day, and new records are inserted as open-chain records of the day.
Source data: full.
Applicable scenarios: data whose historical track must be kept, with deletions detected by comparing the two data sets.

5. Incremental history chain with deletion (Fx:DeltaHistoryChain, A5)
Algorithm description: today's incremental data closes deleted records according to the change flag carried in the increment; updated and new records are compared with the previous day's data by primary key, chains are closed and opened as needed, and new records are inserted as open-chain records of the day.
Source data: incremental.
Applicable scenarios: data whose historical track must be kept, with deletions determined by a change code such as CHG_CODE.

6. Append (A6)
Algorithm description: first delete any data already loaded for the current day/month (so the load can be rerun), then insert the current day's/month's incremental data.
Source data: incremental.
Applicable scenarios: transaction-flow or event data.
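Hedged sketches of the first three algorithms, reusing the illustrative ods.account_buf and base.account tables from earlier (generic SQL; MERGE syntax in particular varies by database, and base.account_latest is another assumed table):

    -- A1 Full table overwrite: empty the target, reload the full snapshot.
    TRUNCATE TABLE base.account_latest;
    INSERT INTO base.account_latest (account_no, customer_id, balance, status)
    SELECT account_no, customer_id, balance, status
    FROM   ods.account_buf
    WHERE  etl_date = CURRENT_DATE;

    -- A2 Upsert: update matched primary keys, insert the rest.
    MERGE INTO base.account_latest t
    USING (SELECT * FROM ods.account_buf WHERE etl_date = CURRENT_DATE) s
    ON (t.account_no = s.account_no)
    WHEN MATCHED THEN UPDATE SET
         customer_id = s.customer_id, balance = s.balance, status = s.status
    WHEN NOT MATCHED THEN INSERT (account_no, customer_id, balance, status)
         VALUES (s.account_no, s.customer_id, s.balance, s.status);

    -- A3 History chain, step 1: close the open chain of changed records.
    UPDATE base.account t
    SET    end_dt = CURRENT_DATE
    WHERE  t.end_dt = DATE '9999-12-31'
    AND    EXISTS (SELECT 1 FROM ods.account_buf s
                   WHERE  s.etl_date   = CURRENT_DATE
                   AND    s.account_no = t.account_no
                   AND   (s.balance <> t.balance OR s.status <> t.status));

    -- A3 step 2: open a new chain for changed and brand-new records.
    INSERT INTO base.account (account_no, customer_id, balance, status,
                              start_dt, end_dt)
    SELECT s.account_no, s.customer_id, s.balance, s.status,
           CURRENT_DATE, DATE '9999-12-31'
    FROM   ods.account_buf s
    WHERE  s.etl_date = CURRENT_DATE
    AND    NOT EXISTS (SELECT 1 FROM base.account t
                       WHERE  t.account_no = s.account_no
                       AND    t.end_dt = DATE '9999-12-31');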

III. GaussDB (DWS) and the Data Warehouse

Huawei's GaussDB (DWS) is a distributed MPP database built on public cloud infrastructure and oriented mainly to massive data analysis scenarios. The MPP database is the mainstream database architecture for data warehouse systems in the industry. Its main feature is a shared-nothing distributed architecture composed of many logical nodes (DN nodes), each with its own independent, unshared CPU, memory, and storage.

In such an architecture, business data is spread across the nodes, SQL is pushed down to execute as close to the data as possible, and large-scale data processing is completed in parallel to achieve fast response. The shared-nothing architecture also ensures that business processing capacity grows linearly as the cluster expands.
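As a small illustration of how the shared-nothing design surfaces in everyday DDL, GaussDB (DWS) lets a table declare its distribution key; the table and column names below are assumptions, while DISTRIBUTE BY HASH is the documented GaussDB (DWS) form:

    -- Hash-distribute rows across DN nodes by customer_id, so scans,
    -- joins, and aggregations on that key can run locally on each node.
    CREATE TABLE base.txn_detail (
        txn_id       VARCHAR(32),
        customer_id  VARCHAR(32),
        txn_amount   DECIMAL(18,2),
        txn_date     DATE
    )
    DISTRIBUTE BY HASH (customer_id);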

After reading the above, do you have a better understanding of data warehouse modeling and ETL practice skills? If you want to learn more, please follow the industry information channel. Thank you for your support.

