Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

Programmer's notes | 3 questions to get you started with data modeling

2025-04-05 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/03 Report--

Author introduction: Han Feng: chief engineer, Development and Management of CITIC Database

Member of ACMUG presidium, executive director of CCIA (China computer Industry Association), co-founder of Oracle ACE,DBAplus, member of ODF advisory panel, ACOUG,ACMUG,DBGeek writer, author of "Best practices for SQL Optimization". Engaged in software development in the early years, and later transferred to the field of database because of personal interest. Has many years of first-line database architecture, design, development experience, has served as the chief DBA of many companies, database architect and other positions.

[technology Salon 002] data Center: construction practice of Credit Agile data Center | Credit Technology Salon will be broadcast live online at 8: 00 p.m. on May 23. Click to sign up.

1. What is modeling?

Data is almost always used for two purposes: the preservation of operational records and the making of analytical decisions. To put it simply, operational systems save data, and typed systems use data.

The former generally reflects only the latest state of the data and is processed as a single record transaction; the core of its optimization is to process transactions faster. The latter often reflects the state changes of data over a period of time and processes data in large quantities; its core is high-performance, multi-dimensional data processing.

Usually we call the operational system OLTP (On-Line Transaction Processing)-online transaction processing and the analytical system OLAP (On-Line Analytical Processing)-online analytical processing.

In view of these two different data uses, how to organize the data to better meet the needs of data use. Here is the problem of data modeling. That is, to design a data organization mode (model) to meet different scenarios. In OLTP scenarios, entity relation model (ER) is commonly used to store data, so as to solve the problem of data redundancy and consistency in transaction processing. In the OLAP scene, there are many modeling methods: ER model, star model and multi-dimensional model. The following are explained separately:

ER model

The ER model in OLAP is different from that in OLTP. The essential difference is the abstraction of topic-oriented from the point of view of the enterprise, rather than the abstraction of the entity object relationship of a specific business process.

Star model

Star model is an implementation of dimensional model on relational database. The model indicates that each business process contains a fact table, which stores the numerical measurement of the event, around multiple dimension tables of the fact table, and the dimension table contains the text environment that actually exists at the time of the event. This star-like structure is often called a "star connection". It focuses on how users can complete the requirement analysis more quickly and has better response performance of large-scale complex queries. Based on the star model, the snowflake model can be further derived in complex scenes.

Multidimensional model

Multidimensional model is another implementation of dimensional model. When the data is loaded into the OLAP multidimensional database, the storage of the data is indexed using the format and technology involved in the dimensional data. Performance aggregation or precomputed summary tables are usually established and managed by the multidimensional database engine. Due to the use of pre-calculation, indexing strategy and other optimization methods, multi-dimensional database can achieve high-performance query.

In these three ways, the star model is widely used, which is also emphasized below.

Second, dimensional modeling 1. Basic concepts

In the process of modeling, many concepts are involved. Let's illustrate them through a scenario. For example: the common e-commerce order process, each user to submit an order (only one item), corresponds to an order record.

[business process]: place an order

[granularity]: each order (split into individual items)

[dimension]: region, age, channel, etc. (perspective for analysis)

[facts / measures]: order amount, etc. (data available for analysis)

2. Modeling steps

Collect business requirements and data implementation

Before you begin dimensional modeling, you need to understand the business requirements and the actual situation as the underlying source data. Identify requirements by communicating with business parties, viewing existing reports, etc., to understand their goals based on key performance indicators, competitive business issues, decision-making processes, and support analysis requirements. At the same time, the actual situation of the data can be communicated with database system experts to understand the feasibility of accessing data.

Select business process

A business process is an operational activity completed by an organization. Business process time establishes or acquires performance metrics and converts them to facts in the fact table. Most facts are concerned with the results of a business process. The choice of the process is very important because the process defines specific design goals and definitions of granularity, dimensions, and facts.

Declaration granularity

Declaration granularity is an important step in dimension design. Granularity is used to determine what rows in a fact table represent. Granularity must be declared before selecting a dimension or fact, because each candidate dimension or fact must be consistent with the defined granularity. Atomic granularity is the lowest level of granularity when getting data from a given business process. It is strongly recommended to start with focusing on atomic granularity data, because atomic granularity data can withstand unexpected user queries.

Confirm the dimension (describe the environment)

The dimension provides the background of "who, what, where, when, why, and how" around a business process event. The dimension table contains descriptive attributes for filtering and classifying facts needed by analytical applications. With a firm grasp of the granularity of the fact table, you can distinguish all possible dimensions.

Confirm the facts (for measurement)

In fact, metrics involving events from business processes are basically expressed in terms of data values. There is an one-to-one relationship between a fact table row and a measurement event described according to the granularity of the fact table, so the fact table corresponds to a physically observable event. In the fact table, all facts are only allowed to be consistent with the granularity of the declaration.

Deployment mode-star model or multidimensional model

Select a landing method for the dimensional model. You can choose the star model and deploy it on the relational database, through the fact table and the dimension table associated by the main foreign key, or you can choose the multidimensional model to land in the multidimensional database.

3. Modeling specification

Based on the theory of dimensional modeling, a series of terms are defined to describe modeling objects. The following picture is from "Alibaba big data's Road to practice".

Data domain

A collection that abstracts business processes or dimensions for business analysis. When dividing the data domain, it can not only cover all the current business requirements, but also be included in the existing data domain and expand the new data domain without influence when the new business enters.

Business process

Refers to the business activities of an enterprise, such as placing orders, payments, and refunds are all business processes. Please note that a business process is an inseparable behavioral event, and generally speaking, a business process is an event in an enterprise activity.

Time period

It is used to define the time range or time point of data statistics, such as the last 30 days, natural week, as of that day, etc.

Modification type

It is an abstract division of modifiers and is subordinate to a business domain.

Modifier

Refers to the limited abstraction of business scenarios for indicators other than statistical dimensions. Modifiers belong to a type of modification.

Metric / atomic index

Atomic indicators and metrics have the same meaning, based on the measurement of a business event behavior, is an indestructible indicator in the business definition, with a clear business meaning of nouns, such as the amount of payment.

Dimension

Dimension is the environment of measurement, which is used to reflect a class of attributes of the business. the collection of such attributes forms a dimension, which can also be called entity object. The dimension belongs to a data domain, such as geographical dimension (including content at country, region, province and city level) and time dimension (including content at year, quarter, month, week, day, etc.).

Dimension attribute

Dimension attributes belong to a dimension, such as country name, country ID, province name and so on.

Derived index

Derived indicator = one atomic index + multiple modifiers (optional) + time period. It can be understood as the delineation of the business statistical scope of atomic indicators.

Third, design key points 1. Dimension table design

Dimension is the foundation and soul of dimensional modeling. In dimensional modeling, the measurement is called "fact", the environment is described as "dimension", and the dimension is the diverse environment needed to analyze the facts. The columns that a dimension contains that represent a dimension are called dimension attributes. Dimension attribute is the basic source of query constraints, grouping and report label generation, and is the key to the ease of use of data. The function of dimension is generally query constraint, classification summary, sorting and so on. The design process of dimension is the process of determining dimension attributes. How to generate dimension attributes and the advantages and disadvantages of the generated dimension attributes determine the convenience of dimension use and become the key to the ease of use of data warehouse. As Kimball said, the capability of the data warehouse is directly proportional to the quality and depth of the dimension attributes.

The following principles should be followed throughout the design process:

The dimension attributes are as rich as possible to lay the foundation for the use of data. Give a detailed and meaningful text description. Precipitate the general dimension attributes to pave the way for the establishment of a consistent dimension. Strict distinction between facts and dimensions, through the use of scenarios to distinguish. 2. Fact table design

As the core of dimensional modeling of data warehouse, fact table is closely designed around the business process, and expresses the business process by obtaining the metrics that describe the business process, including the referenced dimensions and the metrics related to the business process. In the design process, you can select different types of fact tables, which have their own applicable scenarios.

The following principles should be followed throughout the design process:

Select an appropriate fact table type. The facts are as complete as possible, containing all the facts of the entire business process. Make sure that every factual measure is consistent and that repeated calculations will get the same result. Try to record some "atomic" facts, not the processed results. Some "dimension degradation attributes" can be done appropriately to improve the query performance of the fact table. In order to improve the performance of aggregation, we can do some rollup aggregation fact tables appropriately.

Source: official account-Han Feng Channel, welcome to follow.

Yixin Institute of Technology

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report