In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >
Share
Shulou(Shulou.com)05/31 Report--
This article mainly introduces the ETL method and process example analysis, has a certain reference value, interested friends can refer to, I hope you can learn a lot after reading this article, the following let Xiaobian take you to understand.
ETL is the process of loading the data of the business system into the data warehouse after extraction, cleaning and transformation. The purpose is to integrate the scattered, disorderly and non-uniform data in the enterprise, and provide the analysis basis for the decision-making of the enterprise. ETL is an important part of BI project. Usually, in a BI project, ETL takes at least 1 to 3 times of the whole project, and the quality of ETL design is directly related to the success or failure of the BI project.
The design of ETL is divided into three parts: data extraction, data cleaning and conversion, and data loading. When designing ETL, we also start from these three parts. Data extraction is extracted from different data sources to ODS (Operational Data Store, operational data storage)-this process can also do some data cleaning and conversion). In the process of extraction, different extraction methods need to be selected to improve the efficiency of ETL as much as possible. Of the three parts of ETL, the one that takes the longest time is the part of "T" (Transform, cleaning and conversion), which is generally 2x3 of the whole ETL. Data loading is usually written directly into DW (Data Warehousing, data Warehouse) after data cleaning.
There are many ways to implement ETL, three of which are commonly used. One is realized with the help of ETL tools (such as Oracle's OWB, SQL Server 2000's DTS, SQL Server2005's SSIS service, Informatic, etc.), one is realized by SQL, the other is the combination of ETL tools and SQL. The first two methods have their own advantages and disadvantages. With the help of tools, the ETL project can be set up quickly, shielding complex coding tasks, improving the speed and reducing the difficulty, but lack of flexibility. The advantage of SQL method is that it is flexible and improves the running efficiency of ETL, but the coding is complex and the technical requirements are high. The third is the combination of the advantages of the previous two, which will greatly improve the development speed and efficiency of ETL.
1. Data extraction (Extract)
This part needs to do a lot of work in the research stage. first of all, we need to find out what DBMS the database server of each business system runs, whether there is manual data, how much manual data, whether there is unstructured data, and so on. After collecting these information, we can carry on the design of data extraction.
1. The same data source processing method as the database system where DW is stored.
This type of data source is relatively easy to design. In general, DBMS (SQLServer, Oracle) will provide database link function, which can be directly accessed by writing Select statements by establishing a direct link relationship between the DW database server and the original business system.
2. The processing method of data source which is different from DW database system.
For this type of data source, you can also establish a database link through ODBC-such as between SQL Server and Oracle. If you cannot establish a database link, there are two ways to do so, one is to export the source data to .txt or .xls files through the tool, and then import these source system files into ODS. Another way is to do it through a program interface.
3. For file type data sources (.txt, .xls)
Business personnel can be trained to import this data into a specified database using database tools, and then extract from the specified database. Or it can be achieved with the help of tools.
4. The problem of incremental update
For systems with a large amount of data, incremental extraction must be considered. In general, the business system records the time when the business occurs, and we can use it to mark the increment. Before each extraction, we first determine the maximum time recorded in the ODS, and then go to the business system to retrieve all records greater than this time. Using the timestamp of the business system, in general, the business system has no or part of the timestamp.
2. Data cleaning and conversion (Cleaning, Transform)
In general, data warehouse is divided into two parts: ODS and DW. The usual practice is to clean from the business system to ODS, filter out dirty data and incomplete data, convert from ODS to DW, and calculate and aggregate some business rules.
1. Data cleaning
The task of data cleaning is to filter the data that do not meet the requirements, give the filtering results to the business department, confirm whether it is filtered or corrected by the business unit, and then extract it.
The data that does not meet the requirements are mainly divided into three categories: incomplete data, wrong data and duplicate data.
(1) incomplete data: this kind of data is mainly the lack of some information that should have, such as the name of the supplier, the name of the branch, the lack of regional information of the customer, the mismatch between the main table and the parts list in the business system, and so on. Filter out this kind of data, write different Excel files according to the missing content and submit it to the customer, which is required to be completed within a specified time. After completion, it is written to the data warehouse.
(2) incorrect data: this kind of error occurs because the business system is not sound enough and is written directly to the backend database without judgment after receiving the input. For example, numerical data is input into full-width numeric characters, string data is followed by an enter operation, incorrect date format, date out of bounds, and so on. This kind of data should also be classified. For problems similar to full-width characters and invisible characters before and after the data, we can only find out by writing SQL statements, and then require customers to extract after the business system correction. Errors such as incorrect date format or out-of-line date will cause ETL to fail. This kind of error needs to be picked out by SQL in the business system database and submitted to the business department for correction within a time limit, and then extracted.
(3) duplicate data: for this type of data, especially in dimensional tables, all fields of the duplicate data record are derived for the customer to confirm and collate.
Data cleaning is an iterative process, which can not be completed in a few days, only constantly find and solve problems. For whether to filter or correct, customers are generally required to confirm that for the filtered data, write the filtered data into the Excel file or write the filtered data into the data table. In the early days of ETL development, you can send emails filtering data to business units every day to urge them to correct errors as soon as possible. At the same time, it can also be used as a basis for future data verification. Data cleaning needs to be careful not to filter out useful data, carefully verify each filtering rule, and ask the user to confirm.
2. Data conversion
The task of data conversion is mainly for inconsistent data conversion, data granularity conversion, and the calculation of some business rules.
(1) inconsistent data conversion: this process is an integration process, which unifies the same type of data from different business systems, just as a supplier's code is XX0001 in the settlement system and YY0001 in CRM, so that it is converted into a single code after extraction.
(2) conversion of data granularity: business systems generally store very detailed data, while the data in the data warehouse is used for analysis and does not need very detailed data. In general, business system data is aggregated according to the data warehouse granularity.
(3) calculation of business rules: different enterprises have different business rules and different data indicators, which sometimes cannot be completed simply by adding or subtracting. At this time, these data indicators need to be calculated in ETL and stored in the data warehouse for analysis.
Third, send ETL logs and warnings
1. ETL log
ETL logs fall into three categories.
One is the execution process log, this part of the log is in the ETL execution process of each step of the record, record each run each step of the start time, affected how many lines of data, the form of the current account.
One is the error log, which writes the error log when a module goes wrong, recording the time of each error, the module of the error and the information of the error.
The third type of log is the overall log, which only records whether the ETL start time and end time is successful. If you use the ETL tool, the ETL tool automatically generates some logs, which can also be used as part of the ETL log.
The purpose of logging is to know how ETL is running at any time, and if something goes wrong, you can know what went wrong.
2. Warning sending
If ETL goes wrong, not only form an ETL error log, but also send a warning to the system administrator. There are many ways to send warnings, generally commonly used is to send an email to the system administrator, and attach the error information to facilitate the administrator to troubleshoot the error.
ETL is not only a key part of the BI project, but also a long-term process. Only by constantly finding and solving problems can we make ETL more efficient and provide accurate and efficient data for the later development of the BI project.
Postscript
As a data warehouse system, ETL is a key link. To put it too far, ETL is a data integration solution, and if it is small, it is a tool to pour data. Recall that after working for such a long time, there is really a lot of work to deal with data migration and transformation. But those jobs are basically one-time jobs or a small amount of data. However, in the data warehouse system, ETL has risen to a certain theoretical level, which is different from the original small tools. What is the difference? as can be seen from the name, the process of dumping data has been divided into three steps. E, T and L represent extraction, conversion and loading respectively.
In fact, the ETL process is the process of data flow, from different data sources to different target data. But in the data warehouse,
There are several characteristics of ETL
First, data synchronization, it is not an one-time data to pull, it is a regular activity, according to a fixed cycle of operation, and even now there are people put forward the concept of real-time ETL.
Second, the amount of data is generally huge, so it is worth you to split the process of data flow into E, T and L.
There are many mature tools that provide ETL functionality, not to mention whether they are good or bad. From the perspective of application, the process of ETL is not very complex. These tools bring great convenience to data warehouse engineering, especially the convenience of development and maintenance. On the other hand, developers tend to get lost in these tools. For example, VB is a very simple language and a very easy-to-use programming tool, getting started very quickly, but how many real VB masters are there? Microsoft products usually have a principle of "treat the user as a fool". Under this principle, Microsoft things are really easy to use, but for developers, if you think of yourself as a fool, you are really stupid. The same is true of ETL tools, which provide us with a graphical interface that allows us to focus on rules with a view to improving development efficiency. In terms of effectiveness, it is true that using these tools can very quickly build a job to process a certain data, but overall, it is not necessarily much more efficient. The problem is not in the tools, but in the designers and developers. They are lost in the tools and do not explore the nature of ETL. It can be said that these tools have been used for such a long time, in so many projects, environments, it must have its success, it must reflect the essence of ETL. If we do not look at the ideas behind these tools through the simple use of these tools on the surface, what we end up with is a separate job, and there is still a huge amount of work to integrate them. We all know that "the combination of theory and practice". If it is surpassed in a field, it must reach a certain level in theory.
Thank you for reading this article carefully. I hope the article "sample Analysis of ETL methods and processes" shared by the editor will be helpful to you. At the same time, I also hope you will support us and pay attention to the industry information channel. More related knowledge is waiting for you to learn!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.