This article explains how to approach data warehouse architecture and component selection. The method introduced here is simple, fast, and practical, so let's walk through it.
Enterprise Data Warehouse Architecture
There is a blunt saying about data warehouses: "any data warehouse is a database that connects raw data on one end to an analytical interface on the other through data integration tools."
A data warehouse manages an enterprise's huge data sets and provides the storage on top of which data is transformed, moved, and presented to end users. Many architectural approaches extend the warehouse's capabilities in one way or another; here we focus on the most essential issues. Setting aside the finer technical details, the overall layered architecture can be divided into four layers:
Raw data layer (data sources)
Data warehouse layer (the architectural form)
Data collection, cleaning, and transformation (ETL)
Application and analysis layer
Single-tier architecture (direct connection)
In most cases, a data warehouse is a relational database that either contains modules supporting multi-dimensional data or is divided into multiple easily accessible, topic-oriented information domains. The simplest data warehouse has a single-tier architecture.
In a single-tier architecture, the data warehouse connects directly to the analysis interface, and end users query it directly. But this simplicity has drawbacks and limits of applicability:
Traditionally, a data warehouse starts at around 100 GB of storage, and direct connection can make query processing slow: to pull exactly the right data straight from the warehouse, large amounts of irrelevant data must be filtered out, which places heavy demands on both the database and the front-end BI tool, so baseline performance will not be high.
Performance is also limited for complex dimensional analysis, so this architecture is rarely used on large data platforms: it is slow and unpredictable. To support higher-level queries, the warehouse should be extended with lower-level layers that simplify querying.
Two-tier architecture (data mart layer)
The two-tier architecture adds a data mart layer between the front-end application layer and the EDW layer. A data mart is a lower-level repository containing information for a specific subject domain; in short, a smaller database that extends the EDW for a particular topic (such as sales, operations, or marketing).
This approach solves departmental query and analysis: each department can reach the data it needs more easily because each mart contains only the information for its domain, and the mart also naturally limits end users' access to data, acting as a data-permission boundary. The trade-off is that building a data mart layer requires additional hardware resources and integration with the platform's other databases (a minimal sketch of building a mart follows below).
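To make the idea concrete, here is a minimal sketch of carving a sales mart out of the EDW with a CREATE TABLE ... AS SELECT, assuming a PostgreSQL-compatible warehouse reachable via psycopg2. All schema and table names (edw.fact_orders, mart_sales.monthly_orders) and connection parameters are hypothetical placeholders.

```python
# A minimal sketch, assuming a PostgreSQL-compatible EDW and psycopg2;
# every name below is a hypothetical placeholder.
import psycopg2

conn = psycopg2.connect(host="dw-host", dbname="edw", user="etl")  # credentials omitted
with conn, conn.cursor() as cur:
    # Materialize only the columns and aggregation level the sales
    # department needs, limiting both data volume and data exposure.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS mart_sales.monthly_orders AS
        SELECT date_trunc('month', order_date) AS order_month,
               region,
               SUM(amount) AS total_amount,
               COUNT(*)    AS order_count
        FROM edw.fact_orders
        GROUP BY 1, 2
    """)
conn.close()
```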
Three-tier architecture (OLAP)
Above the data mart layer, an online analytical processing (OLAP) layer typically handles cubes. An OLAP dataset is a specific kind of database that describes data along multiple dimensions: where a relational table presents data in two dimensions (rows and columns), OLAP lets data be compiled across many dimensions and moved between them.
OLAP is dedicated to analyzing dimensionally modeled data; the OLAP results are then presented graphically through BI.
The business value of OLAP lies in letting data be sliced and diced for multi-dimensional analysis, providing access to all enterprise data or to a specific data mart. It has by now become a mainstream architectural component (a toy illustration follows below).
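The slicing and dicing idea can be shown with plain pandas. This is a toy stand-in, not a real OLAP server; all dimensions and figures below are made up.

```python
# A toy cube in pandas; all columns and values are fabricated for illustration.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "amount":  [100, 80, 120, 90],
})

# "Dice": cross two dimensions at once (region x product).
cube = sales.pivot_table(index="region", columns="product",
                         values="amount", aggfunc="sum")
print(cube)

# "Slice": fix one dimension (quarter = Q1) and analyze the rest.
print(sales[sales["quarter"] == "Q1"])
```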
The most widely used architecture consists of three tiers: bottom, middle, and top.
Bottom tier: the data warehouse server's underlying database, usually a relational database system. Back-end tools clean, transform, and load data into this layer.
Middle tier: an OLAP server implemented with the ROLAP or MOLAP model. For users, this layer presents an abstract view of the database and acts as an intermediary between end users and the database.
Top tier: the front-end application layer, which connects to the warehouse and retrieves data from it through queries or APIs. Common applications include data querying, report building, BI analysis, data mining, and other application development.
Viewed in terms of functional applications and technical architecture, large and medium-sized enterprises typically run a much more detailed version of this data warehouse architecture.
A data warehouse has four core components: the underlying source database (the data storage scheme), ETL, front-end applications, and OLAP services.
Data warehouse database
The underlying data warehouse server is usually a relational database system (SQL makes statistics and joins across tables convenient, and non-relational databases still lag behind in this respect). Common solutions include Oracle, DB2, and SQL Server, plus dedicated data warehouse products such as Essbase, Greenplum, and Teradata.
1. A traditional relational database, or an MPP database with extended capabilities.
① Traditional relational databases: Oracle, MySQL, DB2
② Massively parallel processing (MPP) databases: Vertica, Teradata (commercial), Greenplum (open source)
Teradata is a veteran product widely used in banking, but it is genuinely expensive. These days more of our projects use Greenplum, regarded by many as the fastest and most cost-effective high-end data warehouse solution in the industry. Greenplum is built on PostgreSQL and was open-sourced in 2015. To my knowledge, three of the four major domestic banks use it, four of the five major logistics companies use it, and many companies are migrating from Teradata to GP.
2. Big data platform architecture: Hadoop+Hive
The ubiquity of this scheme needs no introduction. A common rule of thumb is PostgreSQL for TB-scale data, Greenplum at the 100 TB scale, and Hadoop for PB-scale data (a minimal Hive query sketch follows below).
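As a rough illustration of the Hive side, here is a minimal query sketch using the PyHive client (pip install pyhive). The host, port, user, and the ods.user_events table are hypothetical placeholders, and it assumes a reachable HiveServer2 instance.

```python
# A minimal sketch of querying a Hive warehouse from Python via PyHive;
# connection details and table names are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hive-server", port=10000, username="etl")
cur = conn.cursor()
# A typical warehouse-style scan-and-aggregate over a partitioned log table.
cur.execute("SELECT dt, COUNT(*) FROM ods.user_events GROUP BY dt LIMIT 10")
for row in cur.fetchall():
    print(row)
conn.close()
```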
Below is a comparison of the traditional data warehouse architecture, GP, and the Hadoop big data platform.
Data collection, cleaning, and transformation tools (ETL)
Data sourcing, transformation, and migration tools perform all the conversions, summarizations, and changes needed to bring data into a unified format in the warehouse; they are also known as extraction, transformation, and loading (ETL) tools. Their functions include:
1. Extraction
Full extraction: suited to small data volumes, or to tables where changes are hard to detect, such as relational tables, dimension tables, and configuration tables.
Incremental extraction: an extraction strategy suited to large data volumes, pulling only changed rows to save extraction time (see the sketch below).
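Here is a minimal sketch of incremental extraction using a "watermark", assuming a PostgreSQL-compatible source via psycopg2. The table src.customers and its updated_at column are hypothetical placeholders; it assumes the source tracks a last-modified timestamp.

```python
# A minimal watermark-based incremental extraction sketch; table and
# column names are hypothetical.
import psycopg2

def extract_incremental(conn, last_watermark):
    """Pull only the rows changed since the last successful run."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, name, updated_at FROM src.customers "
            "WHERE updated_at > %s ORDER BY updated_at",
            (last_watermark,),
        )
        rows = cur.fetchall()
    # Advance the watermark to the newest change seen; persist it so the
    # next run starts where this one left off.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark
```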
2. Cleaning
Null handling: replace null values with a specific value, or filter them out entirely
Validate data correctness: handle data that violates its business meaning in a uniform way
Standardize data formats: for example, normalize all dates to the YYYY-MM-DD format
Data transcoding: convert fields represented by codes in the source data into their real meanings via an associated code table
Unify data standards: for example, source systems may represent male and female in many different ways; during extraction, convert them to the values defined in the model (the sketch below walks through several of these steps)
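A minimal pandas sketch of several of these cleaning steps; every column name, code table, and fill value below is a made-up placeholder.

```python
# Null handling, date standardization, transcoding, and standard unification
# on a toy frame; all data is fabricated for illustration.
import pandas as pd

df = pd.DataFrame({
    "name":   ["Ann", None, "Lee"],
    "birth":  ["1990/01/02", "Mar 3, 1991", "1992.04.05"],
    "gender": ["F", "female", "0"],
    "city":   ["010", "021", "010"],
})

# Null handling: replace nulls with a sentinel (or drop with dropna()).
df["name"] = df["name"].fillna("UNKNOWN")

# Standardize formats: normalize every date to YYYY-MM-DD.
df["birth"] = df["birth"].apply(lambda s: pd.to_datetime(s).strftime("%Y-%m-%d"))

# Transcoding: map source codes to their real meaning via a code table.
city_codes = {"010": "Beijing", "021": "Shanghai"}
df["city"] = df["city"].map(city_codes)

# Unify standards: collapse the many gender spellings into the model's values.
gender_map = {"F": "female", "female": "female", "0": "female",
              "M": "male", "male": "male", "1": "male"}
df["gender"] = df["gender"].map(gender_map)
print(df)
```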
3. Transformation and loading
Transformation: refresh tables in the DW with incremental or full data from the ODS (operational data store)
Loading: every insertion of data into a target table counts as data loading (a minimal upsert sketch follows below)
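Here is a minimal sketch of an ODS-to-DW incremental refresh done as an upsert, written in PostgreSQL-style SQL. The schema and table names are hypothetical, and it assumes dw.dim_customer has a unique constraint on id.

```python
# A minimal ODS-to-DW upsert refresh sketch; names are hypothetical and the
# target table is assumed to have a unique constraint on id.
import psycopg2

conn = psycopg2.connect(host="dw-host", dbname="edw", user="etl")  # credentials omitted
with conn, conn.cursor() as cur:
    cur.execute("""
        INSERT INTO dw.dim_customer (id, name, updated_at)
        SELECT id, name, updated_at FROM ods.customers_delta
        ON CONFLICT (id) DO UPDATE
           SET name = EXCLUDED.name,
               updated_at = EXCLUDED.updated_at
    """)
conn.close()
```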
As for selecting an ETL tool, here is a comparison table covering the commonly used ETL tools.
Front-end application tools
The ultimate purpose of building a data warehouse platform is to organize useful data, deliver valuable information, and help the business make the right decisions.
Front-end applications interact with data in different parts of the warehouse. They fall into four categories:
Data query and reporting tools
BI ad hoc analysis tools
Data mining tools
Various application development tools based on the data warehouse or data marts
Among these, data analysis tools mainly target the OLAP server, while reporting and data mining tools mainly target the data warehouse itself.
1. Data query and reporting tools
These are typically used to generate fixed reports, automate reporting, and support mass batch jobs such as printing and computation.
In the old data warehouse era, the popular reporting tools were mainly SAP's BO, Oracle's BIEE, IBM's Cognos, and Microsoft's offerings, packaged as components inside overall data warehouse solutions. But traditional data warehouses are architecturally heavy and expensive, so many companies now design their own architecture per project rather than being forced into a costly packaged solution, often building on open-source components and platforms.
As for reporting tools, FanRuan's FineReport is now widely used in projects and adapts to many different warehouse structures and reporting requirements: generating reports directly against various databases, presenting multi-dimensional reports over collected data to support business analysis, or connecting to a group data warehouse to build a data hub and decision analysis platform.
FineReport functional architecture
2. BI ad hoc analysis tools
BI tools generally integrate an OLAP server with report presentation. Built on the multi-dimensional database concept, analytical BI can examine data from multiple dimensional perspectives. Usually a subset of detailed data is extracted from the warehouse, aggregated as needed, and held in OLAP memory for the front-end BI analysis tool to read.
At the front end, BI presents data across dimensions through drag-and-drop of data fields, ultimately generating various analysis reports. Commonly used BI tools include PowerBI, Tableau, FineBI, and the open-source Superset. Personally I use the first two; enterprise projects more often choose FineBI, since performance and vendor support have to be considered. The rest are built in-house or assembled from open source; Superset is the most widely recognized open-source BI.
FineBI architecture
There is little need to explain what BI tools do. In tool selection, the main considerations are ease of getting started (business users without a technical background must be able to use it), data processing performance, and other factors such as technology stack and cost.
3. Data mining tools
OLAP presents and analyzes data from multiple dimensional perspectives, while data mining applies algorithms to reveal regularities in the data, such as correlations, patterns, and trends. That is what data mining tools do: they automate these algorithms and processes.
For example, suppose a bank's data warehouse stores data organized around the "customer" subject. OLAP lets that data be presented and analyzed in reports or visualizations by customers' basic information, savings accounts, historical balances, transaction logs, and so on, tracking customer dynamics from many angles, surfacing data problems, and enabling targeted marketing to different user types. Data mining, by contrast, builds a model from historical data and, having fitted the past, projects future trends, for instance judging which factors are likely to lead to customer churn so that it can be prevented (a minimal sketch follows below).
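To show the mining side in miniature, here is a hedged sketch that fits a logistic regression to synthetic "customer" features with scikit-learn. The two features and the label rule are fabricated placeholders; a real project would pull features from the warehouse's customer subject area.

```python
# A toy churn model; all features, data, and thresholds are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.normal(50, 20, n),   # hypothetical: average monthly balance
    rng.integers(0, 30, n),  # hypothetical: transactions last month
])
# Synthetic label: low balance plus low activity means higher churn risk.
y = ((X[:, 0] < 40) & (X[:, 1] < 10)).astype(int)

model = LogisticRegression().fit(X, y)
# Negative coefficients suggest higher balance/activity lowers churn risk.
print(dict(zip(["balance", "txn_count"], model.coef_[0])))
```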
The commonly used data mining tools are R, Python, and SPSS, and the open-source ones are available to any individual. Unlike BI and reporting, few commercial tools or project services provide customers with customized analysis and mining, because the domain demands are steep: you must be deeply familiar with the business, the data, and the platform. In my experience, companies either grow their own data analysis teams or recruit this kind of talent.
4. Application development
Beyond the reporting and analysis data products above, there are also extended decision-support systems for specific businesses: the "president's cockpit" in banking for management monitoring, retail decision systems built on store data, or marketing assistants on e-commerce platforms (the marketer inputs goals and parameters, say a Singles' Day promotion in the mother-and-baby category, and the system uses historical data to work out which product categories to select, which user segments to target, and which campaign formats will perform best). These are deep business applications built on the same logic; here the data warehouse plays the role of a service platform. The currently popular data middle platform generally follows the same idea of serving data out to applications.
Of course, such services generally have to be developed in-house.
Between these layers also sits the middle-tier OLAP server, typically implemented with the ROLAP or MOLAP model. Because many mature BI tools now integrate an OLAP server, in practice you usually only need to choose the ETL tool, the storage scheme, and the BI visualization scheme, which is why this article does not dwell on OLAP.