The editor would like to share some practical techniques for data warehouse modeling and ETL. Most people are not very familiar with this topic, so this article is offered for your reference; I hope you will gain a lot from reading it. Let's look at it together.
I. The structure of the data warehouse
A data warehouse (Data Warehouse, DW) is a relational database built by storing data according to a specific schema, in order to support multi-dimensional analysis and presentation from multiple angles. Its data comes from OLTP source systems. The data in the warehouse is detailed, integrated, and subject-oriented, and is aimed at the analysis requirements of OLAP systems.
The architecture of a data warehouse follows two models: the star schema and the snowflake schema. In the star schema, the fact table sits in the middle and the dimension tables surround it, like the points of a star. By comparison, the snowflake schema also places the fact table in the middle, but each dimension table can have its own associated sub-tables, which expresses a clear dimensional hierarchy.
Looking at both the analysis requirements of the OLAP system and the processing efficiency of ETL, the star schema converges quickly and is efficient for analysis, while the snowflake schema has a clear structure and interacts easily with the OLTP system. Therefore, in real projects we usually combine the star schema and the snowflake schema when designing the data warehouse.
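As an illustration, here is a minimal sketch of the two styles using hypothetical tables (DimProduct_Star, DimProductCategory, DimProduct_Snowflake are assumed names, not from the original article): in a star schema the product dimension carries its category attributes directly, while in a snowflake schema the category is split out into a sub-table.

    -- Star schema: denormalized dimension, category attributes kept in the same table
    CREATE TABLE DimProduct_Star (
        ProductKey      INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
        ProductID       VARCHAR(20),                    -- business key from the source system
        ProductName     VARCHAR(100),
        CategoryName    VARCHAR(50),
        SubCategoryName VARCHAR(50)
    );

    -- Snowflake schema: the category is normalized into a separate sub-table
    CREATE TABLE DimProductCategory (
        CategoryKey  INT IDENTITY(1,1) PRIMARY KEY,
        CategoryName VARCHAR(50)
    );

    CREATE TABLE DimProduct_Snowflake (
        ProductKey  INT IDENTITY(1,1) PRIMARY KEY,
        ProductID   VARCHAR(20),
        ProductName VARCHAR(100),
        CategoryKey INT REFERENCES DimProductCategory(CategoryKey)
    );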
With that in mind, let us walk through the process of building an enterprise-level data warehouse.
II. The five-step method of building an enterprise-level data warehouse
(1) Determine the subject
That is, determine the subject of the data analysis or front-end presentation. For example, if we want to analyze the beer sales of a certain region in a certain month of a certain year, that is a subject. The subject should reflect the relationship between the angles of analysis (dimensions) and the statistical values (measures) of a certain aspect, and both should be considered when the subject is determined.
We can picture a subject as a star: the statistical data (measures) sit in the middle of the star, the angles of analysis (dimensions) are its points, and we examine the measures through combinations of dimensions. Thus, the subject "beer sales of a certain region in a certain year and month" requires us to examine the sales measure through the combination of time and region. Different subjects come from different subsets of the data warehouse, which we call data marts. A data mart embodies one aspect of the information in the data warehouse, and multiple data marts together make up the data warehouse.
(2) Determine the measures
After determining the subject, we consider the technical indicators to be analyzed, such as annual sales. These are generally numeric values. We can sum the data, count occurrences or distinct occurrences, or take the maximum or minimum value; such a value is called a measure.
Measures are the indicators to be computed, and they must be chosen carefully in advance. Based on different measures, complex key performance indicators (KPIs) can be designed and calculated.
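For example, the sketch below (assuming a hypothetical FactSales table with SalesAmount and Quantity columns) computes a few typical measures and a simple derived KPI, the average unit price:

    SELECT
        SUM(SalesAmount)                            AS TotalSales,       -- summed measure
        COUNT(*)                                    AS TransactionCount, -- count measure
        MAX(SalesAmount)                            AS MaxSale,          -- maximum measure
        SUM(SalesAmount) / NULLIF(SUM(Quantity), 0) AS AvgUnitPrice      -- derived KPI
    FROM FactSales;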
(3) Determine the granularity of the fact data
After determining the measures, we need to consider how the measures will be summarized and how they aggregate across the different dimensions. With the degree of aggregation in mind, we adopt the "minimum granularity principle": the granularity of the measures is set to the smallest level that may be needed.
For example, suppose the current data is recorded down to the second, that is, the transaction volume of every second is stored in the database. If we can confirm that future analysis will only ever need the time to be accurate to the day, we can summarize the data by day during ETL processing; the granularity of the measure in the data warehouse is then "day". Conversely, if we are not sure whether future analysis may need to go down to the second, then we follow the "minimum granularity principle" and keep the per-second data in the fact table of the data warehouse, so that "seconds" can still be analyzed later.
While adopting the "minimum granularity principle", we do not need to worry that the massive volume of data will hurt the efficiency of aggregate analysis, because when we later build the multidimensional analysis model (the cube), we pre-aggregate the data to guarantee the speed with which analysis results are produced. The issues around building the multidimensional analysis model (cube) will be discussed in detail in the next column.
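As a minimal sketch of summarizing to day granularity during ETL (the table and column names below are assumed for illustration), a second-level source table can be rolled up like this:

    -- Roll per-second transaction records up to one row per product per day
    INSERT INTO FactSalesDaily (SaleDate, ProductID, SalesAmount, Quantity)
    SELECT
        CAST(TransactionTime AS DATE) AS SaleDate,   -- truncate the timestamp to the day
        ProductID,
        SUM(SalesAmount),
        SUM(Quantity)
    FROM StagingTransactions
    GROUP BY CAST(TransactionTime AS DATE), ProductID;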
(4) Determine the dimensions
Dimensions are the angles of analysis. For example, if we want to analyze by time, by region, or by product, then time, region, and product are the corresponding dimensions. Based on different dimensions we can see how the measures summarize, and we can also cross-analyze the measures across all of the dimensions.
Here we first determine the hierarchies (Hierarchy) and levels (Level) of each dimension. In the time dimension, "year - quarter - month" forms one hierarchy, in which "year", "quarter", and "month" are its three levels. Similarly, when building the product dimension, "product category - product subcategory - product" can form a hierarchy that contains the levels "product category", "product subcategory", and "product".
So, in what form do the dimensions used in our analysis exist in the data warehouse?
We can store the three levels as three fields of a single dimension table, as is common for the time dimension; or we can use three tables to store the product category, product subcategory, and product data separately, as in the snowflake form of the product dimension.
In addition, it is worth mentioning that we should make full use of surrogate keys when creating dimension tables. A surrogate key is a numeric ID (usually the first field of the table) that uniquely identifies each dimension member. More importantly, during aggregation, matching and comparing numeric fields makes the JOIN efficient and the aggregation easy. At the same time, the surrogate key plays an important role in slowly changing dimensions: when the original business key stays the same, it distinguishes the new data from the historical data.
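Here is a minimal sketch of a dimension table with a surrogate key, using a hypothetical time dimension that carries its "year - quarter - month" levels as columns:

    CREATE TABLE DimTime (
        TimeKey   INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key, referenced by the fact table
        FullDate  DATE NOT NULL,                  -- business key from the source data
        [Year]    INT NOT NULL,                   -- level 1 of the hierarchy
        [Quarter] INT NOT NULL,                   -- level 2
        [Month]   INT NOT NULL                    -- level 3
    );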
Here we should also talk about the problem of dimension tables changing over time, a problem we often run into, which is called the slowly changing dimension.
For example, a new product may be added, a product's ID number may change, or a product may gain a new attribute; in each case the dimension table has to be modified or new rows have to be added. We therefore have to decide how to handle slowly changing dimensions during ETL. There are three cases:
1. Type 1 slowly changing dimension:
The historical data needs to be corrected. In this case we simply UPDATE the data in the dimension table. For example, the ID number of a product is 123; later it is found that this ID is wrong and should be 456. When handling this in ETL, we directly modify the original ID number in the dimension table.
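A minimal Type 1 sketch, overwriting the wrong business key in place (hypothetical table name, values taken from the example above):

    -- Type 1: overwrite the attribute; no history is kept
    UPDATE DimProduct
    SET ProductID = '456'
    WHERE ProductID = '123';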
2. Type 2 slowly changing dimension:
The historical data is kept, and the new data is kept as well. To do this, the original row is updated (expired) and a new row is inserted, so we use UPDATE / INSERT. For example, an employee was in department A in 2005 and transferred to department B in 2006. When counting the 2005 data the employee should be attributed to department A, when counting the 2006 data to department B, and new facts inserted after the transfer should be processed according to the new department (department B). One approach is to add a status column to the dimension table, marking the historical row as "expired" and the current row as "current". Another approach is to timestamp the dimension, that is, to store the time period during which each historical row was valid as one of its attributes; when the source table is matched against the dimension to generate the fact table, the join is done by time period. The advantage of this approach is that the period during which each dimension member was valid is explicit.
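A minimal Type 2 sketch of the expire-and-insert pattern, assuming a hypothetical DimEmployee table that carries both a status flag and effective-date columns (combining the two approaches described above):

    -- Type 2: expire the current row for the employee...
    UPDATE DimEmployee
    SET Status = 'expired',
        EffectiveTo = '2005-12-31'
    WHERE EmployeeID = 'E001'
      AND Status = 'current';

    -- ...then insert a new row carrying the new department, with a new surrogate key
    INSERT INTO DimEmployee (EmployeeID, Department, Status, EffectiveFrom, EffectiveTo)
    VALUES ('E001', 'Department B', 'current', '2006-01-01', '9999-12-31');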
3. Type 3 slowly changing dimension:
A new attribute is added to the dimension members. For example, a dimension member gains a new column; the historical data cannot be browsed by it, but the current and future data can. In that case we need to change the structure of the dimension table, that is, add a new field, and then use a stored procedure or program to populate the new attribute; subsequent data is then viewed according to the new attribute.
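A minimal sketch of adding and populating the new attribute column (table, column, and value names are assumed for illustration):

    -- Add the new attribute to the dimension table; historical rows stay NULL
    ALTER TABLE DimProduct ADD PackageType VARCHAR(50) NULL;

    -- A stored procedure or program then fills the attribute for current members
    UPDATE DimProduct
    SET PackageType = 'bottle'
    WHERE Status = 'current';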
(5) Create the fact table
After determining the fact data and the dimensions, we consider loading the fact table.

When a company's data has piled up in large volumes and we want to see what is inside, we find that it is a record of production and transactions. These records are the raw data for the fact table we are going to create, that is, the fact table for a given subject.
Our approach is to join the original table with the dimension tables to generate the fact table. Note that when there is missing data (the data source is dirty), we need to use an outer join. After the join, we take the surrogate key of each dimension and put it into the fact table. Besides the surrogate keys, the fact table also contains each measure, which comes from the original table. The fact table should hold the dimension surrogate keys and the measures, but no descriptive information; this follows the "tall and thin" principle: the fact table should have as many rows as possible (the smallest granularity) and as little descriptive information as possible.
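A minimal sketch of generating the fact table by joining the source records to the dimension tables (all table and column names are assumed for illustration); the outer join keeps source rows even when a dimension member cannot be matched:

    INSERT INTO FactSales (TimeKey, ProductKey, RegionKey, SalesAmount, Quantity)
    SELECT
        t.TimeKey,        -- surrogate keys looked up from the dimension tables
        p.ProductKey,
        r.RegionKey,
        s.SalesAmount,    -- measures taken from the original table
        s.Quantity
    FROM StagingSales s
    LEFT OUTER JOIN DimTime    t ON t.FullDate  = CAST(s.TransactionTime AS DATE)
    LEFT OUTER JOIN DimProduct p ON p.ProductID = s.ProductID
    LEFT OUTER JOIN DimRegion  r ON r.RegionID  = s.RegionID;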
If you want to allow for future extension, you can add a unique identifier column to the fact table, in case the fact table is later treated as a snowflaked dimension; but in general it is recommended not to do this unless it is necessary.
The fact table is the core of the data warehouse and needs careful maintenance. After the JOIN we obtain the fact table, which generally has a very large number of records. We need to build a composite primary key and indexes for it, to preserve data integrity and to optimize the performance of queries against the data warehouse. The fact table is placed in the data warehouse together with the dimension tables. If the front end needs to connect to the data warehouse to run queries, we should also build some intermediate summary tables or materialized views to make querying easier.
III. What is ETL
In building a data warehouse, ETL runs through the whole project; it is the lifeline of the entire data warehouse and includes processes such as data cleaning, integration, transformation, and loading. If the data warehouse is a building, then ETL is its foundation. The quality of the ETL extraction and data integration directly affects the final results that are presented. ETL therefore plays a key role in the whole data warehouse project and must be given a very important position.
ETL is the abbreviation of data extraction (Extract), transformation (Transform), and loading (Load). It refers to extracting data from the OLTP systems, transforming and integrating the data from different data sources to obtain consistent data, and then loading it into the data warehouse.
In the transform step we typically complete three kinds of operations: correcting the data format, merging data fields, and calculating new indicators. In the same way, we can refine the data in the data warehouse according to other requirements.
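A minimal sketch of these three operations in a single transform query (the source columns and code values are assumed for illustration): a format correction, a field merge, and a newly calculated indicator:

    SELECT
        CASE GenderCode WHEN '1' THEN 'Male'
                        WHEN '2' THEN 'Female'
                        ELSE 'Unknown' END AS Gender,       -- format correction
        FirstName + ' ' + LastName         AS FullName,     -- field merge
        SalesAmount - CostAmount           AS GrossProfit   -- newly calculated indicator
    FROM StagingCustomerSales;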
In short, with ETL we can generate the data warehouse from the data in the source systems. ETL builds the bridge between the OLTP systems and the OLAP system.
IV. Practical project skills
(1) The use of a staging (data preparation) area
When building a data warehouse, if the data source is on one server and the data warehouse is on another, and considering that the data source server is accessed frequently, the data volume is large, and the data needs constant updating, a staging-area database can be set up. The data is first extracted into the staging area, and further processing is then based on the staging data. The advantage of this is that it avoids performing frequent access, data calculation, sorting, and similar operations directly in the original OLTP system.
For example, we can extract the data into the staging area once a day. Based on the staging area we then transform and integrate the data and reconcile the consistency of data coming from different sources. The staging area will contain the raw extraction tables, intermediate and temporary transformation tables, and the ETL log tables.
(2) The use of timestamps
The time dimension is very important for a fact subject, because different times carry different statistical information, so the information recorded by time plays an important role. In ETL, timestamps have a special use. In the slowly changing dimensions mentioned above, we can use timestamps to identify dimension members; when recording operations on the source database and on the data warehouse, we also use timestamps to identify the information. For example, when we perform data extraction we extract data from the OLTP system according to the timestamp: if at midnight (0:00) we take the previous day's data, we select the records whose timestamp in the OLTP system falls between GETDATE() minus one day and GETDATE(), which gives us the previous day's data.
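A minimal sketch of timestamp-driven incremental extraction run at midnight (the table and column names are assumed; GETDATE() is the SQL Server current-time function mentioned above):

    -- Pull only the rows stamped within the last day from the OLTP source
    INSERT INTO StagingSales (TransactionTime, ProductID, RegionID, SalesAmount, Quantity)
    SELECT TransactionTime, ProductID, RegionID, SalesAmount, Quantity
    FROM SourceDB.dbo.Sales
    WHERE TransactionTime >= DATEADD(DAY, -1, GETDATE())
      AND TransactionTime <  GETDATE();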
(3) The use of log tables
When processing data, errors and error messages are hard to avoid completely, so how do we capture the error information and correct it in time? The method is to use one or more log tables to record the error information. In the log table we record, for each run, the number of rows extracted, the number processed successfully, the number that failed, the data that failed processing, the processing time, and so on. In this way, when an error appears in the data it is easy to locate the problem and then correct or reprocess the faulty data.
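A minimal sketch of such an ETL log table and a logging insert at the end of a run (hypothetical names and values, following the fields listed above):

    CREATE TABLE EtlLog (
        LogID         INT IDENTITY(1,1) PRIMARY KEY,
        JobName       VARCHAR(100),
        RowsExtracted INT,
        RowsSucceeded INT,
        RowsFailed    INT,
        ErrorMessage  VARCHAR(1000),
        ProcessedAt   DATETIME DEFAULT GETDATE()
    );

    -- Written at the end of each ETL run
    INSERT INTO EtlLog (JobName, RowsExtracted, RowsSucceeded, RowsFailed, ErrorMessage)
    VALUES ('Load FactSales', 10000, 9990, 10, 'Unmatched product keys');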
(4) The use of scheduling
Scheduling must be used when incrementally updating the data warehouse, that is, when incrementally updating the fact table. Before setting up scheduling, consider the actual data volume and decide how often the updates need to run. For example, if the data is to be viewed by day, it is best to extract by day; if the data volume is small, the data can be updated monthly or half-yearly. If there are slowly changing dimensions, the schedule also has to take the dimension table updates into account, and the dimension tables must be updated before the fact table.
Scheduling is a key link in the data warehouse and should be considered carefully. Once the ETL process has been built, it has to run regularly, so scheduling is the key step that keeps the ETL process running. Each scheduled run should not only write the data processing information into the log table, but also send e-mail or alert notifications, which makes it easy for the technicians to keep track of the ETL process and improves the safety and accuracy of data processing.
V. Summary
Building an enterprise-level data warehouse takes five simple steps, and with this five-step method we can build a powerful data warehouse. However, each step has deeper content to be studied and explored, and in real projects everything must be considered together. For example, if the data source contains a lot of dirty data, we should clean the data before building the data warehouse, removing the unwanted information and the dirty data.
ETL is the bridge between the OLTP systems and the OLAP system, and the channel through which data flows from the source systems into the data warehouse. In a data warehouse project it determines the data quality of the whole project, so it must not be treated carelessly; it must be given an important place so that it lays a solid foundation for the data warehouse building.
This is the full content of the article "What are the practical skills of data warehouse modeling and ETL". Thank you for reading! I hope the content shared here has been helpful to you.