This article explains how to divide a Hive data warehouse into layers. The walkthrough is fairly detailed and should make a useful reference; if you are interested in the topic, read on!
1. The four operations of a data warehouse
ETL (extraction, transformation, loading) is responsible for extracting data from distributed, heterogeneous data sources into a temporary staging layer, then cleaning, transforming, and integrating it, and finally loading it into the data warehouse or data mart. ETL is the core and soul of building a data warehouse: designing and implementing the ETL rules accounts for roughly 60% to 80% of the total construction workload.
(1) Data extraction (extraction) covers both initial loading and data refreshing. Initial loading is mainly concerned with how to create the dimension tables and fact tables and populate them with the corresponding data. Data refreshing is concerned with how to append and update the data in the warehouse when the source data changes (for example, via scheduled tasks, or by refreshing data through triggers). A minimal sketch of both is shown below.
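As a hedged illustration of the two cases, the HiveQL below first performs an initial snapshot load into a partitioned table, then runs an incremental refresh driven by a scheduler-supplied date. All table, column, and variable names (dw.dim_student, src.student, run_date) are assumptions for the sketch, not the platform's real schema.

```sql
-- Target dimension table, partitioned by snapshot date (names are assumptions).
CREATE TABLE IF NOT EXISTS dw.dim_student (
  student_id   BIGINT,
  student_name STRING,
  dept_code    STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Initial load: populate the first snapshot partition from the source copy.
INSERT OVERWRITE TABLE dw.dim_student PARTITION (dt = '2024-01-01')
SELECT student_id, student_name, dept_code
FROM src.student;

-- Daily refresh: a scheduled job rewrites the latest partition with rows
-- changed since the run date (passed in with --hiveconf run_date=...).
INSERT OVERWRITE TABLE dw.dim_student PARTITION (dt = '${hiveconf:run_date}')
SELECT student_id, student_name, dept_code
FROM src.student
WHERE update_time >= '${hiveconf:run_date}';
```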
(2) Data cleaning (cleaning) mainly deals with ambiguity, duplication, incompleteness, and violations of business or logic rules in the source data; that is, it weeds out data that is useless or inconsistent with the business. For example, a Hive or MapReduce job can filter out records whose field lengths do not meet the requirements, as in the sketch below.
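A minimal cleaning sketch in HiveQL, assuming a target table dw.person_clean with a matching schema already exists; all table and column names are illustrative:

```sql
-- Keep only rows whose key is present and whose id_card field has the
-- required length; everything else is cleaned out.
INSERT OVERWRITE TABLE dw.person_clean
SELECT person_id, id_card, phone
FROM src.person_raw
WHERE person_id IS NOT NULL
  AND length(id_card) = 18;   -- discard records with an invalid field length
```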
(3) The main purpose of data transformation (transformation) is to convert the cleaned data into the form the data warehouse needs. The data dictionary or data format of the same field may differ across source systems (for example, a key called id in table A may be called ids in table B); the warehouse must give them a unified data dictionary and format to normalize the content. On the other hand, some fields the warehouse requires may not exist in any single source system and must be derived from the contents of several source fields.
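A hedged sketch of this unification step, continuing the id/ids example from the text; all table names, the second date format, and the target schema are assumptions:

```sql
-- Map table A's "id" and table B's "ids" to one person_id column, and
-- normalize differing date formats to yyyy-MM-dd.
INSERT OVERWRITE TABLE dw.person_unified
SELECT a.id  AS person_id,
       a.name,
       date_format(a.birth_date, 'yyyy-MM-dd') AS birth_date
FROM src.table_a a
UNION ALL
SELECT b.ids AS person_id,
       b.name,
       from_unixtime(unix_timestamp(b.birthday, 'yyyy/MM/dd'),
                     'yyyy-MM-dd')             AS birth_date
FROM src.table_b b;
```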
(4) Data loading (loading) imports the processed data into the target storage (MySQL and the like) so that the data mart can serve it, for example to visualization tools; a sketch follows below. Generally speaking, for data security and ease of operation, large companies build their own data platforms and task-scheduling platforms. These wrap the underlying big data clusters (Hadoop, Spark, Sqoop, Hive, ZooKeeper, HBase, and so on), expose only a web interface, and grant different permissions to different employees, who then perform different operations and calls against the cluster. Taking the data warehouse as an example, it is divided into several logical levels, so tasks that operate on different levels of data can be placed into task flows of the corresponding level (a large company's cluster typically has thousands or even tens of thousands of scheduled tasks to run every day, so routing tasks of different levels into the matching task flows makes management and maintenance much easier).
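One possible loading pattern, kept in HiveQL for consistency: materialize a mart result set to an HDFS directory as delimited text, from which a tool such as Sqoop can then export it to MySQL. The path and table names are assumptions, and this is only one of several ways to hand data off.

```sql
-- Write the aggregate to HDFS as tab-separated text; a downstream export
-- job (e.g. Sqoop) can push this into a MySQL table for the data mart.
INSERT OVERWRITE DIRECTORY '/tmp/dm_export/dept_headcount'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT dept_code, count(*) AS student_cnt
FROM dm.student_detail
GROUP BY dept_code;
```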
2. The four logical layers of a data warehouse
A data warehouse is conventionally divided into four layers. Note, however, that this division and its naming are not unique: most warehouses have four layers, but different companies may name them differently. For example, the staging layer described here is called the replication layer (SSA), while JD.com calls it BDM; Alibaba uses a more fine-grained five-layer structure, but its core ideas still come from the four-layer data model.
(1) Replication layer (SSA, system-of-record staging area)
SSA holds a direct copy of the source-system data (for example, reading all the data from MySQL and importing it, unprocessed, into a table of identical structure in Hive), preserving the original form of the business data as faithfully as possible. The only difference from the source data is that SSA adds timestamp information, so multiple historical versions of the data can be kept.
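A possible SSA pattern, with all names assumed: replicate the source table verbatim and add only a load-timestamp partition, so that historical versions coexist side by side.

```sql
-- Same columns as the source table; load_ts is the only addition.
CREATE TABLE IF NOT EXISTS ssa.student (
  student_id   BIGINT,
  student_name STRING,
  dept_code    STRING
)
PARTITIONED BY (load_ts STRING)   -- timestamp of each load, per the text
STORED AS ORC;

-- Each run appends a new versioned copy (load_ts passed via --hiveconf).
INSERT INTO TABLE ssa.student PARTITION (load_ts = '${hiveconf:load_ts}')
SELECT student_id, student_name, dept_code
FROM src.student;
```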
(2) Atomic layer (SOR, system of record)
SOR is a set of table structures developed from a model that follows 3NF normalization rules. It stores the finest-grained data in the warehouse and classifies it by subject area. For example, to meet current requirements, the university data statistics service platform stores the whole school's data in the SOR layer under four major subjects: personnel, students, teaching, and scientific research. SOR is the core and foundation of the entire warehouse; it should be designed flexibly enough to absorb additional data sources, support more analysis requirements, and allow further upgrades and updates.
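Illustrative SOR tables in (roughly) 3NF, split within a subject area; all names are assumptions rather than the platform's real schema:

```sql
-- Person facts reference departments by key instead of repeating the name,
-- keeping the layer normalized.
CREATE TABLE IF NOT EXISTS sor.person (
  person_id BIGINT,
  name      STRING,
  dept_id   BIGINT      -- references sor.department.dept_id
)
STORED AS ORC;

CREATE TABLE IF NOT EXISTS sor.department (
  dept_id   BIGINT,
  dept_name STRING
)
STORED AS ORC;
```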
(3) Summary layer (SMA, summary area)
SMA is the intermediate transition between SOR and DM (the mart layer). Because SOR data is highly normalized, completing a query takes a great deal of join work; at the same time, the data granularity in DM is often much coarser than in SOR, so the raw data must be heavily summarized. SMA therefore denormalizes the SOR data as needed (for example, designing a wide table structure that combines personnel information, cadre information, and other multi-table data) and pre-summarizes it (for example, commonly used headcount summaries, organizational summaries, and so on) to improve the warehouse's query performance.
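A sketch of an SMA wide table, reusing the assumed SOR tables above: denormalize person and department data once, so downstream queries skip the repeated join.

```sql
-- Pre-join person and department into one wide row per person.
INSERT OVERWRITE TABLE sma.person_wide
SELECT p.person_id,
       p.name,
       d.dept_name
FROM sor.person p
LEFT JOIN sor.department d
  ON p.dept_id = d.dept_id;
```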
(4) Mart/presentation layer (DM, data mart)
The data stored in DM is accessed by users directly: DM can be understood as what the end user ultimately wants to see. DM mainly holds event data at various granularities, adapting to different access needs by providing data at different granularities; the DM data of the university data statistics service platform follows this approach.
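An illustrative DM aggregate at department granularity, built on the assumed SMA wide table; this is the kind of table report users would query directly.

```sql
-- One coarse-grained summary row per department for direct consumption.
INSERT OVERWRITE TABLE dm.dept_headcount
SELECT dept_name,
       count(*) AS person_cnt
FROM sma.person_wide
GROUP BY dept_name;
```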
That covers "how to divide a data warehouse into layers in Hive". Thanks for reading! I hope the article was helpful; for more related knowledge, follow the industry information channel!