In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-15 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >
Share
Shulou(Shulou.com)05/31 Report--
This article mainly explains "what are the subsystems in the ETL architecture". The content in the article is simple and clear, and it is easy to learn and understand. Please follow the editor's train of thought to study and learn what are the subsystems in the ETL architecture.
Three simple letters, E-T-L, make it easy to overlook the importance of 38 ETL subsystems in the construction of a data warehouse.
The extract-transform-load (ETL) system, or "back-end system" informally, takes up 70% of the workload and time in building the entire data warehouse system. But this is not enough to illustrate the complexity of the ETL system. Everyone understands the meaning of these three letters: e, taking the data from the source system; T, processing the data; and L, loading it into the table accessed by the end user.
But when we ask how to break down these three steps, many designers will say, "specific problems, specific analysis." For example, this depends on different data sources; it depends on the nature of the data; it depends on the scripting language and the ETL tools available; it depends on the technical capabilities of the staff; and it also depends on the query and reporting tools used by the end user.
"specific situation, specific analysis" is a very dangerous thing, because it can easily be called an excuse for system confusion. With the experience of thousands of successful data warehouse projects, we have compiled a series of best practices.
For the last 18 months, we have been delving into ETL practices and ETL products. We identified 38 subsystems that are involved in the background part of each data warehouse project. The bad news is that the ETL system does take up most of the resources of the data warehouse project. The good news is that if you can master all these subsystems, you can easily use your experience to build a successful data warehouse system.
1. Decimation system (Extract System)
The main functions include the adapter of source data, the work scheduling of pushing / dragging / handling data, the filtering and sorting function of source data, the conversion of data format, and the temporary storage function of data after migrating to ETL environment.
two。 Change data capture system (Change Data Capture System)
The main functions include the reading function of source data log file, the filtering function of source data date and sequence number, and the record comparison function based on CRC algorithm.
3. Data profile Analysis system (Data Profiling System)
The main functions include field attribute analysis, such as reference domain analysis, structure analysis, such as primary and foreign key relationship analysis, data rule analysis, value rule analysis and so on.
4. Data cleaning system (Data Cleansing System)
The main functions include a typical data dictionary-driven system for parsing the name, address and other information of individuals and organizations, as well as products, places, etc.; a "De-duplication" system for identifying and removing individual and organizational information, as well as products and places A "Surviving" system, using specific data merge logic, is used to save the specified fields of a specific data source, and the data of this particular data source will become the final version of the data warehouse; maintain the correspondence of the background data for all data sources, such as natural keys and proxy keys.
5. Data consistency processing system (Data Conformer System)
The main functions include identifying and generating dedicated consistency dimension attributes and measurement attributes of consistency facts. as the basis of data integration work, these two sets of attributes are used to support data integration across multiple data sources.
6. Audit Dimension Generation system (Audit Dimension Assembler System)
The main function is to load the metadata related to the fact table into an audit dimension table so that the end user can view the metadata related to the fact table as if it were a normal dimension.
7. Data quality filtering system (Quality Screen Handler System)
The main function is to automatically detect all data quality problems in the process of ETL. The results of the detection will be entered into the error event handling system (see Subsystem 8 for details).
8. Error event handling system (Error Event Hander System)
The main function is to fully record and report all error events in ETL processing. It includes the branch processing logic of all kinds of errors and the real-time monitoring of data quality in ETL processing.
9. Proxy key generation system (Surrogate Key Create System)
The main function is to generate pipelined surrogate keys with a robust mechanism, and the generation rules do not depend on any dimension or any database instance, and can support distributed systems.
10. Slowly changing Dimension processing system (Slowly Changing Dimension Processor,SCD)
The main function is to deal with the changes in the properties of the dimension table over time, in the following ways: type 1 (direct override), type 2 (generate new rows), and type 3 (add new columns).
11. Late Dimension processing system (Late Arriving Dimension Handler)
The main function is to insert and update the dimension data when the change of the dimension data arrives at the data preparation area later than the corresponding factual data.
12. Fixed hierarchical structure Generation system (Fixed Hierarchy Dimension Builder)
The main function is to check and maintain the data validity of all kinds of hierarchical structures of many-to-one relations in the dimension table.
13. Variable hierarchical structure Generation system (Variable Hierarchy Dimension Builder)
The main function is to check the data validity and dimensions of all the hierarchical structures with variable depth in the dimension table, such as the hierarchical structure of the organization, the hierarchical structure of the part, and so on.
14. Multi-valued Dimension Bridge Table Generation system (Multivalued Dimension Bridge Table Builder)
The main function is to establish and maintain bridge tables to describe many-to-many relationships between dimensions.
15. Miscellaneous Dimension Generation system (Junk Dimension Builder)
The main function is to establish and maintain multiple low-cardinality flag fields, status fields and other small dimensions from multiple data sources into a miscellaneous dimension.
16. Transaction granularity fact table loading system (Transaction grain fact table loader)
The main function is to update the transaction granularity fact table, including the processing of data, indexes and partitions. It is usually used to deal with incremental data, that is, the latest data. The piping system needs to be replaced with a surrogate key (see subsystem 19 for details).
17. Periodic snapshot fact table loading system (Periodic snapshot grain fact table loader)
The main function is to update periodic snapshot fact tables, including processing of data, indexes, and partitions. Including the incremental update strategy for the current data. The piping system needs to be replaced with a surrogate key (see subsystem 19 for details).
18. Cumulative Snapshot fact Table loading system (Accumulating snapshot grain fact table loader)
The main function is to update the cumulative snapshot fact table, including the processing of data, indexes, and partitions, as well as updating dimension foreign keys and cumulative facts. The piping system needs to be replaced with a surrogate key (see subsystem 19 for details).
19. Surrogate key replacement piping system (Surrogate key pipeline)
The main function is to use multithreading technology to replace the natural key that comes to the data warehouse with the surrogate key.
20. Late fact processing system (Late arriving fact handler)
The main function is to deal with the insertion and update strategy of late fact records.
21. Aggregation Generation system (Aggregate builder)
The main function is to create and maintain the physical structure of the database, such as aggregation tables, to be used in conjunction with query-rewrite technology to improve database query performance. It also includes separate aggregation tables and physical and chemical tables.
twenty-two。 Multidimensional cube generation system (Multidimensional cube builder)
The main function is to create and maintain a star schema for loading multidimensional cube, including some proprietary work of cube technology, such as the maintenance of dimensional hierarchies.
23. Real-time partition generation system (Real-time partition builder)
The special logic of the three fact table types (refer to subsystem 16 ~ 17 ~ 18) maintains a "hot partition" in memory that contains only part of the incremental data that has been counted in the data warehouse table since the last time.
24. Dimension Management Subsystem (Dimension manager system)
As the name implies, it is a system for managing dimension tables. It is responsible for storing the dimensional consistency between the dimension table and the fact table from the center, please refer to subsystem 25.
25. Fact Management system (Fact table provider system)
Corresponding to the dimension table management system, it is a fact table management system that receives consistency dimensions sent from the dimension management system. It includes maintenance series work such as local key replacement, dimension version check, and aggregation table.
twenty-six。 Task scheduling system (Job scheduler)
It is responsible for scheduling and initiating ETL tasks. It can wait for a variety of system conditions, including dependence on the completion of high-priority tasks. Be able to send a warning for an exception.
twenty-seven。 Workflow Monitoring system (Workflow monitor)
Its main function is to have a console and a report system to monitor the execution status of ETL tasks after being started by the task scheduling system. This includes the number of records processed, error summaries, and activities performed.
twenty-eight。 Restore and redo the system (Recovery and restart system)
Restart after the task is paused during task execution, or resume execution to the state before the task execution. This subsystem relies heavily on the backup subsystem (refer to subsystem 38).
twenty-nine。 Parallel processing and Pipeline processing system (Parallelizing/pipelining system)
Its main function is to use multiprocessors, grid computing resources to improve performance, and to achieve data flow processing. Parallelization and plumbing are necessary when ETL is not written to the hard disk or waits for a condition to occur during execution.
thirty。 Abnormal amplification system (Problem escalation system)
Its main function is to be responsible for raising the level of errors under certain conditions to track and solve problems. It includes simple error logging, operator notification, administrator notification and system developer notification.
thirty-one。 Version control system (Version control system)
So that the archiving of metadata can have a solid snapshot function, and you can check the state before and after a change at a certain time. Ability to move in and out of all ETL modules and tasks. Source code comparison function to quickly show the difference before and after the change.
thirty-two。 Version Portable system (Version migration system)
So that programs can be quickly switched in development environment, test environment, and formal environment. The version control system is an interface for restoring migration and an interface for configuring complete database connection information. Makes surrogate key generation independent of the location of the database.
thirty-three。 Architecture and dependency Analysis system (Lineage and dependency analyzer)
For any selected data component, show its physical data source and all subsequent transformations, whether you select the component in the middle of the ETL pipeline or the final data result. For any selected data component, it is necessary to show its downstream data components and the field structure of the final data result that may change, regardless of whether the component in the middle of the ETL pipeline is selected or the data source is selected.
thirty-four。 Compliance reporting system (Compliance reporter)
Comply with the prescribed rules to prove the credibility of the system report. Prove that the data and transformations have not changed. Show who has accessed or changed any data.
thirty-five。 Security control system (Security system)
Role-based access control over all data and metadata is implemented in the ETL pipeline. Prove that the version of the module has not changed. Show who has made any changes.
thirty-six。 Backup system (Backup system)
Backup of data and metadata for future data recovery, restart, security, and compliance with specified requirements.
thirty-seven。 Metadata Management system (Metadata repository manager)
A system for capturing and maintaining metadata for all ETL, including all conversion logic. Including processing metadata, technical metadata and business logic metadata.
thirty-eight。 Project Management system (Project management system)
A tracking system developed for all ETL tasks.
Thank you for your reading, these are the contents of "what are the subsystems in the ETL architecture". After the study of this article, I believe you have a deeper understanding of what the subsystems in the ETL architecture have, and the specific use needs to be verified in practice. Here is, the editor will push for you more related knowledge points of the article, welcome to follow!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.