2025-01-31 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains "what is the data warehouse in the database". The explanation is kept simple and clear so that it is easy to learn and understand. Let's work through the topic step by step.
1. What is a data warehouse?
A data warehouse (Data Warehouse, abbreviated DW or DWH) is a strategic collection of data that supports decision-making at every level of an enterprise. It is built for analytical reporting and decision support. For enterprises that need business intelligence, it provides guidance on improving business processes and on monitoring time, cost, quality, and control.
2. What can a data warehouse do?
Annual sales targets should be set on the basis of historical reports, not on gut feeling.
It also shows how to optimize business processes.
For example, completing an order on an e-commerce website involves browsing, placing the order, payment, and logistics, and the logistics step may rely on express companies such as Zhongtong, Shentong, and Yunda. Every order an express company delivers has a confirmed delivery time. From those delivery times we can analyze which express companies are faster and more efficient, decide which ones to keep working with and which to drop, and so improve the user experience.
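The courier comparison described above can be sketched in a few lines of Python. This is only an illustration: the delivery records, the timestamps, and the function name `average_delivery_hours` are invented for the example and do not come from any real system.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical delivery records: (courier, shipped_at, delivered_at).
deliveries = [
    ("Zhongtong", "2024-01-01 09:00", "2024-01-03 09:00"),
    ("Zhongtong", "2024-01-02 09:00", "2024-01-05 09:00"),
    ("Shentong", "2024-01-01 09:00", "2024-01-02 21:00"),
    ("Yunda", "2024-01-01 09:00", "2024-01-05 09:00"),
]


def average_delivery_hours(records):
    """Return {courier: mean hours from shipment to confirmed delivery}."""
    fmt = "%Y-%m-%d %H:%M"
    totals = defaultdict(lambda: [0.0, 0])  # courier -> [sum of hours, count]
    for courier, shipped, delivered in records:
        elapsed = datetime.strptime(delivered, fmt) - datetime.strptime(shipped, fmt)
        totals[courier][0] += elapsed.total_seconds() / 3600
        totals[courier][1] += 1
    return {courier: total / count for courier, (total, count) in totals.items()}


# Fastest courier first.
ranking = sorted(average_delivery_hours(deliveries).items(), key=lambda kv: kv[1])
for courier, hours in ranking:
    print(f"{courier}: {hours:.1f} h")
```

In a real warehouse the same question would normally be answered by a query over historical order data; the point here is only that the decision rests on aggregated history rather than on individual orders.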
In short, a data warehouse collects data from all directions, cleans it, and then serves it to its consumers.
3. The characteristics of data warehouse
1. The data in the data warehouse is topic-oriented.
Traditional databases organize data around applications, whereas a data warehouse organizes data by topic. What is a topic? A topic is an abstract concept: a higher-level abstraction under which the data in enterprise information systems is synthesized, classified, analyzed, and used. Logically, it corresponds to the analysis object of a particular macro-analysis area within the enterprise. Topic-oriented data organization gives a complete and consistent description of each analysis object at this higher level, covering all the enterprise data the object involves and the relationships among that data. "Higher level" is relative to application-oriented organization: data organized by topic sits at a higher level of abstraction. Put simply, it is like writing an essay, which always needs a theme.
2. The data in the data warehouse is integrated.
The data in the data warehouse is extracted from the original distributed databases, and operational data differs greatly from analytical data.
The source data behind each topic contains many duplicates and inconsistencies across the original distributed databases, and data from different online systems is bound to different application logic.
The aggregated data in the data warehouse cannot be obtained directly from the original database systems. Before data enters the data warehouse it must therefore be unified and integrated. This is the most critical and most complex step in building a data warehouse. The tasks to be completed are as follows:
1. Resolve all conflicts in the source data in a uniform way, such as homonyms (the same name with different meanings), synonyms (different names with the same meaning), inconsistent units, and inconsistent field lengths.
2. Aggregate and compute the data. Aggregated data can be produced while the data is being extracted from the original databases, but much of it is produced inside the data warehouse, after the data has been loaded.
3. In most cases the data warehouse is built by the big data department, and the other analysis and business departments do not have permission to use the online tables directly.
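The first task, resolving naming and unit conflicts, might look roughly like the sketch below. The two source systems, the field names `amt` and `order_amount`, and the `FIELD_MAP` table are all hypothetical; they stand in for whatever mapping rules a real project would define.

```python
# Hypothetical rows from two source systems describing orders.
# System A calls the field "amt" and stores yuan; system B calls it
# "order_amount" and stores fen (1 yuan = 100 fen).
system_a = [{"order_id": "A1", "amt": 12.5}]
system_b = [{"order_id": "B7", "order_amount": 990}]

# Mapping from each source field name to (unified name, factor to yuan).
FIELD_MAP = {
    "amt": ("amount_yuan", 1.0),
    "order_amount": ("amount_yuan", 0.01),
}


def integrate(rows, source):
    """Rename synonymous fields and convert units to one convention."""
    unified = []
    for row in rows:
        out = {"source": source}
        for name, value in row.items():
            if name in FIELD_MAP:
                target, factor = FIELD_MAP[name]
                out[target] = round(value * factor, 2)
            else:
                out[name] = value
        unified.append(out)
    return unified


warehouse_rows = integrate(system_a, "A") + integrate(system_b, "B")
print(warehouse_rows)
```

Keeping the mapping in a table rather than in code paths makes it easier to add another source system later without touching the integration logic.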
3. The data in the data warehouse is not updatable.
The data in a data warehouse is used mainly for enterprise decision analysis. The operations involved are mostly queries, and modifications are rare. The warehouse reflects historical data over a long period: it is a collection of database snapshots taken at different points in time, plus data derived from those snapshots through statistics, aggregation, and reorganization, rather than live transactional data. Data from the online processing databases is integrated and loaded into the data warehouse, and once stored data exceeds the warehouse's retention period it is deleted from the warehouse. Because the data warehouse mostly performs queries, its management system is much simpler than a database management system.
Many technical difficulties of a database management system, such as integrity protection and concurrency control, can largely be left out of data warehouse management. However, because warehouse queries often touch very large amounts of data, the demands on query performance are higher and call for a variety of sophisticated indexing techniques. And because the data warehouse serves the senior managers of commercial enterprises, they expect a friendlier interface for querying and presenting data.
4. The data in the data warehouse changes over time.
The data in the data warehouse is not updatable from the application's point of view: users do not update the data while they analyze and process it. That does not mean, however, that the data never changes over its whole life cycle, from integration and loading into the warehouse to final deletion.
The data in the data warehouse changes continuously over time. This is its fourth characteristic, and it shows in three ways:
1. The data warehouse keeps adding new data over time. The system must continuously capture changed data in the OLTP databases and append it to the warehouse. In other words, it keeps producing snapshots of the OLTP databases and adds them after unified integration. Existing snapshots are never changed: when new changes are captured, a new snapshot is generated and added, and the original snapshots are left unmodified. Put simply, each day's data is collected and added to the full data set.
2. The data warehouse keeps deleting old data over time. Data in the warehouse also has a retention period, and expired data is deleted once that period is exceeded. The retention period is simply much longer than in the operational environment. An operational environment typically keeps only 60 to 90 days of data, whereas a data warehouse needs long-term data (such as 5-10 years) to support trend analysis by a DSS (Decision Support System).
3. The data warehouse contains a large amount of aggregated data, much of which is time-related: data is often aggregated by time period or sampled at certain time slices. Such data has to be re-aggregated as time passes. For this reason the data in the warehouse carries a time attribute that indicates the historical period it covers.
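The first two behaviours, append-only snapshots and deletion after a retention period, can be illustrated with a small in-memory sketch. The five-year `RETENTION_DAYS` value, the dictionary used as a stand-in for the warehouse, and both function names are assumptions made for the example.

```python
from datetime import date, timedelta

RETENTION_DAYS = 365 * 5  # assumed retention period (the text mentions 5-10 years)


def add_snapshot(warehouse, snapshot_date, rows):
    """Append a new dated snapshot; existing snapshots are never modified."""
    if snapshot_date in warehouse:
        raise ValueError(f"snapshot for {snapshot_date} already exists")
    warehouse[snapshot_date] = list(rows)


def purge_expired(warehouse, today):
    """Delete snapshots older than the retention period; return removed dates."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    expired = [d for d in warehouse if d < cutoff]
    for d in expired:
        del warehouse[d]
    return expired


warehouse = {}
add_snapshot(warehouse, date(2015, 1, 1), [{"sku": "X", "stock": 10}])
add_snapshot(warehouse, date(2024, 6, 1), [{"sku": "X", "stock": 7}])
removed = purge_expired(warehouse, today=date(2024, 6, 2))
print(removed, list(warehouse))
```

Refusing to overwrite an existing snapshot is one simple way to express "not updatable": history is only ever added to or aged out, never edited in place.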
4. Development of the data warehouse
The development of the data warehouse has gone through three phases:
Simple report phase:
In this phase the system mainly produces the reports that business staff need in their daily work, together with some simple summary data that helps leaders make decisions. It usually takes the form of a database plus front-end reporting tools.
Data mart phase:
In this phase data is collected and organized according to the needs of one business unit and presented in multi-dimensional reports as its staff require. It can provide data that guides that specific business and data that supports leadership decisions about it.
Data warehouse phase:
In this phase the data of the whole enterprise is collected and organized according to a data model, so that consistent, cross-departmental report data can be provided to every business department. The data warehouse produces data that guides the business and, at the same time, gives leaders comprehensive support for decision-making.
These phases show that the important difference between building a data warehouse and building a data mart lies in the support of a data model. Building the data model is therefore of decisive importance when building a data warehouse.
5. Database VS data warehouse
Before looking at the difference between a database and a data warehouse, it helps to master three concepts: database software, database, and data warehouse.
Database software: software that you can see and operate, used to implement the logical functions of a database. It belongs to the physical layer. Examples include Oracle, MySQL, Redis, MongoDB, Neo4j, and JanusGraph. There are also graphical clients such as Navicat, DBeaver, and Hue.
Database: a logical concept, a store for data, implemented by database software. A database consists of many tables. A table is two-dimensional and can have many fields: the fields are laid out as columns, and data is written into the table row by row. The strength of database tables is that they can represent multi-dimensional relationships in two dimensions. The popular databases on the market today are of this two-dimensional kind, such as Oracle, DB2, and MySQL.
Data warehouse: an upgrade of the database concept. Logically there is no difference between a database and a data warehouse: both are places where data is stored by database software. In terms of data volume, however, a data warehouse is much larger than a database. A data warehouse is used mainly for data mining and data analysis that help leaders make decisions.
In an IT architecture a database must exist, because the data has to be stored somewhere. Take online shopping on Taobao or JD.com: the stock level of each product, its price, the user's account balance, and so on are all stored in a back-end database. An even simpler example is the user name and password of a Weibo or QQ account. The back-end database must have a user table with at least two fields, user name and password, and our data is stored in that table row by row. When we log in, we enter the user name and password, and they are sent to the back end and matched against the table. If the match succeeds, we are logged in. If it fails, the system reports a wrong password or an unknown user name. This is a database: it works in the production environment, and everything tied to a business application uses one.
The data warehouse is one of the technologies under business intelligence (BI). Because a database is tied to a business application, no single database can hold all of a company's data, and its tables are usually designed for one particular application. The user table for the login function above has only those two fields and nothing else. That table suits the application, but it does not suit analysis. Suppose I want to know in which time period users log in most, or which user shops the most in a year. Questions like these require a different table structure. For data analysis and data mining we therefore introduce the concept of the data warehouse, whose table structure is designed around analysis requirements, analysis dimensions, and analysis indicators.
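As a rough illustration of that last point, the sketch below uses Python's built-in `sqlite3` module. The `users` table mirrors the two-field login table from the text (with a `password_hash` column as an assumption), while `login_events` is a hypothetical analysis-oriented table that records one row per login so the "which time period has the most logins" question can be answered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Application table: just enough to check a login.
    CREATE TABLE users (username TEXT PRIMARY KEY, password_hash TEXT);
    -- Analysis table (hypothetical): one row per login event.
    CREATE TABLE login_events (username TEXT, login_at TEXT);
""")
conn.executemany(
    "INSERT INTO login_events VALUES (?, ?)",
    [
        ("alice", "2024-03-01 09:15:00"),
        ("bob", "2024-03-01 09:40:00"),
        ("alice", "2024-03-01 20:05:00"),
        ("carol", "2024-03-02 09:59:00"),
    ],
)

# Which hour of the day sees the most logins?
logins_by_hour = conn.execute("""
    SELECT strftime('%H', login_at) AS login_hour, COUNT(*) AS logins
    FROM login_events
    GROUP BY login_hour
    ORDER BY logins DESC, login_hour
""").fetchall()
print(logins_by_hour)
```

The `users` table cannot answer this question at all, because it keeps no history. The analysis table is shaped by the dimension (time of day) and the indicator (login count) instead of by the login feature.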
5. OLTP and OLAP
The difference between a database and a data warehouse is actually the difference between OLTP and OLAP.
Operational processing, called online transaction processing (OLTP, On-Line Transaction Processing) or transaction-oriented processing, handles the day-to-day online operations of a specific business against the database, usually querying and modifying a small number of records. Users care most about response time, data security, integrity, and the number of concurrent users. The traditional database system, as the main means of data management, is used mainly for operational processing.
Analytical processing, called online analytical processing (OLAP, On-Line Analytical Processing), generally analyzes historical data on particular topics to support management decisions. Its data is typically loaded through ETL.
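The two access patterns can be contrasted with a small `sqlite3` sketch. The `accounts` and `order_history` tables and their contents are invented for illustration, and a single in-memory SQLite database stands in for what would normally be two separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("CREATE TABLE order_history (order_day TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.executemany(
    "INSERT INTO order_history VALUES (?, ?, ?)",
    [
        ("2023-12-30", "north", 20.0),
        ("2023-12-31", "north", 30.0),
        ("2023-12-31", "south", 45.0),
    ],
)

# OLTP-style: modify one record inside a short transaction.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 30.0 WHERE account_id = 1")
balance = conn.execute(
    "SELECT balance FROM accounts WHERE account_id = 1"
).fetchone()[0]

# OLAP-style: read-only query that aggregates many historical rows by topic.
revenue_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM order_history GROUP BY region ORDER BY region"
).fetchall()

print(balance, revenue_by_region)
```

The update touches a single row and must be fast and safe under concurrency. The aggregate reads every historical row but changes nothing, which is why the two workloads are usually served by differently designed systems.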
Operational processing (OLTP) | Analytical processing (OLAP)
Detailed data | Aggregated or refined data
Entity-relationship model | Star model or snowflake model
Accesses current (instantaneous) data | Stores historical data, does not contain the most recent data
Updatable | Read-only, append-only
Operates on one unit at a time | Operates on one set at a time
High performance requirements, short response time | Loose performance requirements
Transaction-oriented | Analysis-oriented
Small amount of data per operation | Large amount of data per operation
Supports daily operations | Supports decision-making
Small data volume | Large data volume
Customer orders, inventory levels, bank account queries, etc. | Customer revenue analysis, market segmentation, etc.
6. Data warehouse architecture layering (key point)
1. Data warehouse architecture
A standard data warehouse can be divided into four layers: ODS (temporary storage layer), PDW (data warehouse layer), DM (data mart layer), and APP (application layer).
The source data of each system is synchronized through ETL into the operational data store (ODS). The ODS data is then modeled to form the DW (data warehouse). The DM builds a model for one business domain, and specific users (the decision-making layer) view the reports generated from the DM.
Temporary storage layer: ODS (Operational Data Store):
The ODS layer is a data storage system that uses the ETL (Extract-Transform-Load) process to integrate data from different sources (operational databases, external data sources, and so on) into a topic-oriented, integrated, enterprise-wide, consistent data set. It mainly holds the latest or recent detail data and any summary data that may be needed. The ODS layer has the finest data granularity. It usually contains two types of tables: one stores the data currently waiting to be loaded, and the other stores historical data that has already been processed. Historical data is generally cleared after 3-6 months to save space, but projects differ: if the source system holds little data, it can be kept longer or even kept in full.
Data warehouse layer: DW (Data Warehouse):
The data in the DW layer should be consistent, accurate, and clean, that is, source system data after cleaning (with the impurities removed). Data in this layer generally follows third normal form, and its granularity is usually the same as that of the ODS. The DW layer keeps all the historical data in the BI system, for example ten years of data.
The DW layer is itself divided, from bottom to top, into DWD, DWB, and DWS.
DWD (Data Warehouse Detail): the detail data layer, which some also call the ODS layer. It is the isolation layer between the business layer and the data warehouse.
DWB (Data Warehouse Base): the basic data layer, which stores objective data. It generally serves as the middle layer and can be regarded as the data layer behind a large number of indicators.
DWS (Data Warehouse Service): the service data layer, which integrates the basic data in DWB into service data for analyzing one topic domain, usually as a wide table.
Data mart layer: DM (Data Mart):
Data in the data mart layer is organized by topic, usually in a star or snowflake structure. In terms of granularity, it is lightly summarized and no longer contains detail data. In terms of time span, it usually holds only part of the DW layer's data: its main purpose is to meet users' analysis needs, and users usually only need to analyze recent years (such as the last three years). In terms of breadth, it still covers all business data.
Application layer: APP (Application):
Data in this layer is built entirely to meet specific analysis needs and also uses a star or snowflake structure. In terms of granularity, it is highly aggregated. In terms of breadth, it does not necessarily cover all business data: it is a proper subset of the DM layer's data and, in a sense, a repetition of it. In extreme cases a separate model can be built in the APP layer for each report, trading space for time. This standard layering is only a suggested standard. In practice the layering has to be decided according to the actual situation, and different types of data may use different layering schemes.
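The flow through the four layers can be sketched as a chain of small Python functions. Everything here is illustrative: the sample orders, the cleaning rule (dropping rows with a missing amount), and the function names `build_dw`, `build_dm`, and `build_app` are assumptions, not a prescribed implementation.

```python
from collections import defaultdict

# ODS: raw rows as extracted from a source system (finest granularity).
ods_orders = [
    {"order_id": 1, "city": " Beijing ", "amount": "120.0", "day": "2024-05-01"},
    {"order_id": 2, "city": "beijing", "amount": "80.0", "day": "2024-05-01"},
    {"order_id": 3, "city": "Shanghai", "amount": None, "day": "2024-05-02"},
    {"order_id": 4, "city": "Shanghai", "amount": "50.0", "day": "2024-05-02"},
]


def build_dw(ods_rows):
    """DW: cleaned detail rows, same granularity as ODS."""
    clean = []
    for row in ods_rows:
        if row["amount"] is None:  # drop rows that fail a basic quality rule
            continue
        clean.append({
            **row,
            "city": row["city"].strip().title(),
            "amount": float(row["amount"]),
        })
    return clean


def build_dm(dw_rows):
    """DM: lightly summarized by topic (here: sales per city per day)."""
    totals = defaultdict(float)
    for row in dw_rows:
        totals[(row["city"], row["day"])] += row["amount"]
    return dict(totals)


def build_app(dm_rows):
    """APP: highly aggregated figures for one specific report."""
    totals = defaultdict(float)
    for (city, _day), amount in dm_rows.items():
        totals[city] += amount
    return dict(totals)


dw = build_dw(ods_orders)
dm = build_dm(dw)
app_report = build_app(dm)
print(app_report)
```

Each layer reads only from the layer below it, so a change to the cleaning rule affects `build_dw` alone, and the pre-computed `dm` and `app_report` results are the redundant data that buys faster reports.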
2. Why layer the data warehouse?
Trading space for time: heavy preprocessing improves the user experience (efficiency) of the application system, which is why a data warehouse contains a lot of redundant data.
Decoupling: without layers, a change to the business rules of a source system would affect the entire data cleaning process and create a huge amount of work.
Simplification: managing data in layers simplifies data cleaning. Splitting what used to be one step into several steps breaks a complex job into simple tasks and turns one large black box into a white box. The processing logic of each layer is relatively simple and easy to understand, so it is easier to ensure that each step is correct, and when the data goes wrong we often only need to adjust one step locally.
7. Introduction to metadata
When you need to find out about a local business and the services it provides, the value of the telephone yellow pages becomes clear. Metadata plays a similar role.
1. Definition of metadata
The metadata of a data warehouse is data about the data in the data warehouse. Its role is similar to the data dictionary of a database management system. It can be thought of as the table of contents of a book: it records logical data structures, files, addresses, indexes, and similar information. More broadly, metadata in a data warehouse describes the structure of the warehouse's data and how it was built.
Metadata is an important part of the data warehouse management system. The metadata manager is a key component of the enterprise data warehouse: it runs through the whole process of building the warehouse and directly affects how the warehouse is built, used, and maintained.
One of the main steps in building a data warehouse is ETL, and metadata plays an important role here. It defines the mapping from the source systems to the data warehouse, the rules of data transformation, the logical structure of the warehouse, the rules for updating data, the history of data imports, and the loading cycle. Data extraction and transformation specialists and data warehouse administrators rely on metadata to build the warehouse efficiently.
When using the data warehouse, users access data through metadata, which clarifies the meaning of data items and lets them customize reports.
As a data warehouse grows in scale and complexity, it depends on correct metadata management, which includes adding or removing external data sources, changing data cleaning methods, controlling faulty queries, and scheduling backups.
Metadata can be divided into technical metadata and business metadata.
Technical metadata is used by the IT staff who develop and manage the data warehouse. It describes data related to the development, management, and maintenance of the warehouse, including data source information, data transformation descriptions, the data warehouse model, data cleaning and update rules, data mappings, and access rights.
Business metadata serves management and business analysts. It describes data from a business perspective, including business terminology, what data is in the data warehouse, and where the data is and whether it is available, to help business people understand which data the warehouse offers and how to use it.
As this shows, metadata not only defines the schema, sources, and extraction and transformation rules of the data in the warehouse, but is also the basis on which the whole data warehouse system operates. Metadata connects the loosely coupled components of the system into an organic whole, as shown in the figure.
2. How metadata is stored
There are two common ways to store metadata:
The first is dataset-based: each dataset has its own metadata file containing the metadata for that dataset. The advantage is that when the data is called, the corresponding metadata travels with it as an independent file, so it is less tied to the database, and the database's own functions can still be used to retrieve the metadata. Metadata files can also be moved to other database systems. The drawback is that with one metadata file per dataset, a large database ends up with a large number of metadata files, which is inconvenient to manage.
The second is database-based, that is, a meta-database. The metadata file consists of several fields, each representing one metadata element, and each record holds the metadata content of one dataset. With this approach the meta-database contains only one metadata file, which is easy to manage: adding or deleting a dataset only means adding or deleting the corresponding record in the file. When the metadata of a dataset is retrieved, what is actually returned is a record from a relational table, so the user's system must be able to accept data in that form. The meta-database approach is therefore the recommended one.
Since the meta-database stores the metadata, it is best to choose a mainstream relational database management system for it. The meta-database also includes mechanisms for manipulating and querying metadata. Its main advantage is that it provides unified data structures and business rules and makes it easy to integrate multiple data marts within the enterprise. Some enterprises currently prefer to build several data marts rather than one centralized data warehouse, so it is worth considering a meta-database that describes data and service application integration before the data warehouse (or data mart) is built. Doing this groundwork well is a great help in later development and maintenance. The meta-database safeguards the consistency and accuracy of the warehouse's data and gives the enterprise a basis for data quality management.
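A database-based metadata store can be sketched with Python's built-in `sqlite3` module. The table name `dataset_metadata`, its columns, and the sample records are illustrative assumptions. A real meta-database would hold far more detail, such as mappings, transformation rules, and access rights.

```python
import sqlite3

meta = sqlite3.connect(":memory:")
# One record per dataset; the column names are illustrative assumptions.
meta.execute("""
    CREATE TABLE dataset_metadata (
        dataset_name   TEXT PRIMARY KEY,
        source_system  TEXT,   -- technical metadata: where the data comes from
        load_cycle     TEXT,   -- technical metadata: how often it is loaded
        business_term  TEXT    -- business metadata: what it means to analysts
    )
""")
meta.executemany(
    "INSERT INTO dataset_metadata VALUES (?, ?, ?, ?)",
    [
        ("dw_orders", "order_system", "daily", "Confirmed customer orders"),
        ("dm_sales_by_city", "dw_orders", "daily", "Sales summarized by city"),
    ],
)


def describe(dataset_name):
    """Look up the metadata record for one dataset, or None if unknown."""
    return meta.execute(
        "SELECT source_system, load_cycle, business_term "
        "FROM dataset_metadata WHERE dataset_name = ?",
        (dataset_name,),
    ).fetchone()


print(describe("dw_orders"))

# Removing a dataset only means removing its record.
meta.execute(
    "DELETE FROM dataset_metadata WHERE dataset_name = ?", ("dm_sales_by_city",)
)
remaining = [row[0] for row in meta.execute("SELECT dataset_name FROM dataset_metadata")]
print(remaining)
```

Because every dataset is described by a row with the same columns, consumers get a uniform structure to query, which is the manageability advantage the text attributes to the meta-database approach.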
3. The role of metadata
In a data warehouse, the main functions of metadata are as follows.
Describe what data is in the data warehouse to help decision analysts locate the content of the data warehouse.
Define how data enters the data warehouse as a guide for data aggregation, mapping, and cleaning.
Record the schedule of the data extraction jobs that run as business events occur.
Record and test the requirements and implementation of system data consistency.
Evaluate the quality of data.
In effect, metadata serves as a user guide to the data.
8. Star model and snowflake model
In the business intelligence solution of multidimensional analysis, according to the relationship between fact table and dimension table, the common models can be divided into star model and snowflake model. When designing the model of logical data, we should consider whether the data is organized according to the star model or the snowflake model.
1. Star model
When all dimension tables are directly connected to the fact table, the whole diagram is like a star, so the model is called a star model.
The star architecture is a denormalized structure: every dimension of the cube is connected directly to the fact table, and there are no hierarchical dimensions, so the data carries some redundancy. For example, if the regional dimension table contains two records, city C in province B of country A and city D in province B of country A, then the information for country A and province B is stored twice, which is redundancy.
2. Snowflake model
When one or more dimension tables are not connected directly to the fact table but reach it through other dimension tables, the diagram looks like several snowflakes joined together, so the model is called the snowflake model. The snowflake model is an extension of the star model. It further breaks the star model's dimension tables into hierarchies: an original dimension table may be expanded into a small fact table, forming local hierarchical regions, and the decomposed tables are connected to the main dimension table rather than to the fact table. As shown in the figure, the regional dimension table is decomposed into country, province, city, and other dimension tables. Its advantage is that it minimizes data storage and joins smaller dimension tables, which can improve query performance. The snowflake structure removes data redundancy.
The star model is generally more efficient than the snowflake model: because its data is redundant, many statistical queries need no extra joins. The star structure also leaves aside many normalization concerns, so it is relatively simple to design and implement. Because the snowflake model removes redundancy, some statistics have to be produced by joining tables, so it is not necessarily as efficient as the star model. Normalization is also a fairly complex process, and the corresponding database design, ETL, and later maintenance are all more complex. Where the redundancy is acceptable, the star model is therefore used more often in practice and is more efficient.
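The difference in join depth can be seen in a toy `sqlite3` example that reuses the country A, province B, cities C and D scenario. All table and column names here (`dim_region_star`, `dim_city`, `fact_sales`, and so on) are hypothetical, and one fact table is shared by both layouts purely to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star: one denormalized region dimension (country and province repeated).
    CREATE TABLE dim_region_star (
        region_id INTEGER PRIMARY KEY, country TEXT, province TEXT, city TEXT);
    INSERT INTO dim_region_star VALUES (1, 'A', 'B', 'C'), (2, 'A', 'B', 'D');

    -- Snowflake: the same dimension normalized into a hierarchy.
    CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_province (
        province_id INTEGER PRIMARY KEY, name TEXT,
        country_id INTEGER REFERENCES dim_country(country_id));
    CREATE TABLE dim_city (
        city_id INTEGER PRIMARY KEY, name TEXT,
        province_id INTEGER REFERENCES dim_province(province_id));
    INSERT INTO dim_country VALUES (1, 'A');
    INSERT INTO dim_province VALUES (1, 'B', 1);
    INSERT INTO dim_city VALUES (1, 'C', 1), (2, 'D', 1);

    -- Shared fact table; region_id doubles as city_id in this toy example.
    CREATE TABLE fact_sales (region_id INTEGER, amount REAL);
    INSERT INTO fact_sales VALUES (1, 10.0), (2, 15.0), (1, 5.0);
""")

# Star: a single join reaches the province name.
star_total = conn.execute("""
    SELECT r.province, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_region_star r ON r.region_id = f.region_id
    GROUP BY r.province
""").fetchall()

# Snowflake: the same answer needs a join per hierarchy level.
snowflake_total = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_city c ON c.city_id = f.region_id
    JOIN dim_province p ON p.province_id = c.province_id
    GROUP BY p.name
""").fetchall()

print(star_total, snowflake_total)
```

Both queries return the same total for province B. The star layout stores 'A' and 'B' twice but needs one join, while the snowflake layout stores each name once and needs two.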
3. Comparison between star model and snowflake model
Star model and snowflake model are two commonly used methods in data warehouse, and the comparison between them should be discussed from four angles.
Data optimization
The snowflake model uses normalized data, meaning that the data is organized inside the database so as to eliminate redundancy, which effectively reduces the amount of data. Its business hierarchies and dimensions are stored in the data model through referential integrity.
By contrast, the star model uses denormalized data. In the star model, dimensions are linked directly to the fact table, and business hierarchies are not implemented through referential integrity between dimensions.
Business model
In the snowflake model, the business hierarchy of the data model is represented by primary key-foreign key relationships between different dimension tables. In the star model, every necessary dimension table is referenced only by a foreign key in the fact table.
Performance
The third difference lies in performance. The snowflake model requires many joins between dimension tables and the fact table, so its performance is relatively low. For example, to get the details of a user, the snowflake model has to join several tables before producing the final result.
The star model needs far fewer joins. To get the corresponding information, you only have to join the dimension table to the fact table.
ETL
The snowflake model loads data marts, so its ETL is more complex to design, and dependencies among its subsidiary dimension tables prevent it from being parallelized.
The star model loads the dimension tables directly and needs no subsidiary models between dimensions, so its ETL is relatively simple and can be highly parallelized.
Thank you for reading. That concludes "what is the data warehouse in the database". After studying this article you should have a deeper understanding of what a data warehouse is, although how it is used in a specific situation still has to be verified in practice.