In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/01 Report--
This article mainly introduces "what is the data warehouse design system of the database". In the daily operation, I believe that many people have doubts about what the data warehouse design system of the database is. The editor consulted all kinds of materials and sorted out the simple and easy-to-use operation methods. I hope it will be helpful to answer the doubts of "what is the data warehouse design system of the database?" Next, please follow the editor to study!
In-depth analysis of data warehouse
Data Warehouse concept:
The English name is Data Warehouse, but it can be abbreviated as DW or DWH. The purpose of data warehouse is to build an analysis-oriented integrated data environment and provide decision support (Decision Support) for enterprises. It is created for analytical reporting and decision support purposes.
The data warehouse itself does not "produce" any data, and it does not need to "consume" any data. The data comes from outside and is open to external applications, which is why it is called a "warehouse" rather than a "factory".
Basic characteristics:
Data warehouse is a topic-oriented, integrated, non-volatile and time-varying collection of data to support management decisions.
Note: the article was first posted on the official account: learn from big data in five minutes.
Topic-oriented:
In the traditional database, the biggest feature is the application-oriented data organization, and each business system may be separated from each other. On the other hand, data warehouse is topic-oriented. Topic is an abstract concept, which is the abstraction of data synthesis, classification, analysis and utilization in enterprise information systems at a higher level. In the logical sense, it is the analysis object involved in a certain macro-analysis field in the corresponding enterprise.
Integration:
The data of the data warehouse is obtained by extracting, cleaning, transforming and summarizing the scattered, independent and heterogeneous database data, which ensures the consistency of the data in the data warehouse to the whole enterprise.
The comprehensive data in the data warehouse can not be obtained directly from the original database system. Therefore, before the data enters the data warehouse, it must be unified and integrated. This step is the most critical and complex step in the construction of the data warehouse. The tasks to be completed are as follows:
It is necessary to unify all contradictions in the source data, such as synonyms of the same name, synonyms of different names, different units, inconsistent word length, and so on.
Carry out data synthesis and calculation. The data synthesis work in the data warehouse can be generated when the data is extracted from the original database, but many of it is generated inside the data warehouse, that is, it is generated comprehensively after entering the data warehouse.
The following figure illustrates a simple process of comprehensive data processing for an insurance company, where the data related to the "insurance" topic in the data warehouse comes from several different operational systems. The names of the data within these systems may be different, and the data formats may be different. These inconsistencies need to be removed before data from different sources are stored in the data warehouse.
Data warehouse theme
Non-volatile (non-updatable)
The data of the data warehouse reflects the content of historical data over a long period of time, a collection of database snapshots at different points in time, and the exported data for statistics, synthesis and reorganization based on these snapshots.
The non-volatility of data is mainly for applications. Most of the data operations of the users of the data warehouse are data query or complex mining. Once the data enters the data warehouse, it is generally retained for a long time. There are generally a large number of query operations in data warehouses, but few modify and delete operations. Therefore, after data is processed and integrated into the data warehouse, it is rarely updated and usually only needs to be loaded and updated regularly.
Time variation
The data warehouse contains historical data of various granularities. The data in the data warehouse may be related to a particular date, week, month, quarter, or year. The purpose of data warehouse is to mine the hidden patterns by analyzing the business status of enterprises in the past period of time. Although users of the data warehouse cannot modify the data, it does not mean that the data of the data warehouse is immutable. The results of the analysis can only reflect the past situation, when the business changes, the mined patterns will lose timeliness. Therefore, the data of the data warehouse needs to be updated to meet the needs of decision-making. From this point of view, data warehouse construction is not only a project, but also a process. The data of the data warehouse changes over time in the following aspects:
(1) the data time limit of data warehouse is generally much longer than that of operational data.
(2) the operational system stores the current data, while the data in the data warehouse is historical data.
(3) the data in the data warehouse is appended in chronological order, and they all have time attributes.
1. The difference between data Warehouse and Database
The difference between a database and a data warehouse is actually the difference between OLTP and OLAP.
Operational processing, called online transaction processing OLTP (On-Line Transaction Processing,), can also be called transaction-oriented processing system. It is aimed at the daily operation of specific business online in the database, usually query and modify a small number of records. Users are more concerned about the response time of operations, data security, integrity and the number of concurrent users. Traditional database system, as the main means of data management, is mainly used for operational processing. Relational databases such as Mysql,Oracle generally belong to OLTP.
Analytical processing, called online analytical processing OLAP (On-Line Analytical Processing), generally analyzes the historical data of certain topics and supports management decisions.
First of all, it is important to understand that the emergence of data warehouse is not to replace the database. The database is designed for transaction, and the data warehouse is designed for topic. Databases generally store business data, while data warehouses generally store historical data.
Database design is to avoid redundancy, generally designed for a business application, such as a simple User table, recording user names, passwords and other simple data, in line with the business application, but not in line with the analysis. Redundancy is intentionally introduced into the design of data warehouse, and it is designed according to analysis requirements, analysis dimensions and analysis indicators.
The database is designed to capture data, and the data warehouse is designed to analyze data.
Take banking as an example. The database is the data platform of the transaction system, and every transaction made by the customer in the bank will be written into the database and recorded. Here, it can be simply understood as accounting with the database. Data warehouse is the data platform of the analysis system, it obtains data from the transaction system, and does summary, processing, to provide decision-making basis for decision-makers. For example, how many transactions take place in a branch of a bank in a month, and what is the current deposit balance of that branch? If there are more deposits and more consumer transactions, then it is necessary to set up an ATM in the region.
Obviously, the trading volume of banks is huge, usually calculated in millions or even tens of millions of times. The transaction system is real-time, which requires timeliness, and it is unbearable for customers to save a sum of money for dozens of seconds, which requires that the database can only store data for a very short period of time. The analysis system is ex post facto, it should provide all the valid data during the concerned period of time. These data are massive, and the aggregate calculation is slower, but as long as the effective analytical data can be provided, the goal will be achieved.
Data warehouse is produced in order to further mine data resources and make decisions when a large number of databases already exist. It is by no means a so-called "large database".
two。 Hierarchical architecture of data warehouse
According to the process of data inflow and outflow, data warehouse architecture can be divided into: source data, data warehouse, data application.
Paradigm modeling
According to Inmon, the construction method of the data warehouse model is similar to the enterprise data model of the business system. In the business system, the enterprise data model determines the source of the data, and the enterprise data model is also divided into two levels, namely the subject domain model and the logical model. Similarly, the subject domain model can be regarded as the conceptual model of the business model, while the logical model is the instantiation of the domain model on the relational database.
two。 Dimensional modeling method (Dimensional Modeling)
Dimensional model is advocated by another master in the field of data warehouse, Ralph Kimall, whose "data warehouse toolbox" is the most popular classic of data warehouse modeling in the field of data warehouse engineering. Dimensional modeling builds the model based on the requirements of analysis and decision-making, and the constructed data model serves the analysis requirements, so it focuses on how users complete the analysis requirements more quickly, as well as better response performance of large-scale complex queries.
Solid modeling
The picture above shows an abstract meaning, if we describe a simple fact: "Xiao Ming drives to school." Taking this business fact as an example, we can regard "Xiaoming" and "school" as an entity, "going to school" describes a business process, and here we can be abstracted as a concrete "event". While "driving to school" can be seen as an explanation of the event "going to school".
Dimensional modeling
Dimensional modeling is a method specially applied to the modeling of analytical database, data warehouse and data Mart. A data Mart can be understood as a "small data warehouse".
1. The type of table in dimensional modeling 1. Fact table
The measurable values generated by operational events that occur in the real world are stored in the fact table. At the lowest level of granularity, the fact table row corresponds to a measurement event, and vice versa.
The fact table represents the measurement of the analysis topic. For example, a purchase can be understood as a fact.
two。 Snowflake model
Snowflake pattern (Snowflake Schema) is an extension of the star pattern. Snowflake dimension tables can have other dimension tables, although this model is more standardized than the star model, but because this model is not easy to understand, the maintenance cost is high, and the performance needs to be associated with multi-layer dimension tables. Performance is also lower than the star model. So it's not very common in general.
Constellation model
3. Dimensional modeling process
We know that the table types of dimensional modeling are real tables, dimension tables; schemas have the concepts of star model, snowflake model and constellation model, but in actual business, they have given us a pile of data. how do we use these data to build data warehouses? based on more than 60 years of practical business experience, the author of the warehouse toolbox summed up the following four steps, please remember!
There are four steps to dimensional modeling in the warehouse toolbox:
Data hierarchical architecture
Concrete implementation of data layer
Use four diagrams to illustrate the specific implementation of each layer
Data source layer ODS
Data detail layer
Each row in the fact table corresponds to a measure, and the data in each row is a specific level of detail, called granularity. One of the core principles of dimensional modeling is that all measures in the same fact table must have the same granularity. This ensures that there is no problem of double-calculating metrics.
Dimension tables are generally a single primary key and a few are federated primary keys. Be careful not to duplicate data in the dimension table, otherwise there will be data divergence problems associated with the fact table.
Sometimes it is not possible to determine whether the column data is a fact attribute or a dimension attribute. The most practical facts to remember are numerical types and additive facts. So you can analyze whether the column is a measure that contains multiple values and acts as a participant in the calculation, in which case the column is often a fact; if the column is a description of a specific value, it is a text or constant, a constraint and a row identify the participant, this attribute is often a dimension attribute. However, it is still necessary to combine the business to determine whether it is a dimension or a fact.
Data mild summary layer DM
Data application layer
The table in the data application layer is provided to users, and the construction of the data warehouse is coming to an end, and then different data will be fetched according to different needs, such as directly displaying the report, or providing the data needed by colleagues for data analysis. or other business support.
At this point, the study of "what is the data warehouse design system of the database" is over. I hope to be able to solve your doubts. The collocation of theory and practice can better help you learn, go and try it! If you want to continue to learn more related knowledge, please continue to follow the website, the editor will continue to work hard to bring you more practical articles!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.