Updated 2025-03-13. Source: SLTechnology News&Howtos (Shulou.com), Internet Technology.
This article explains the advanced settings available for a Cube in Apache Kylin.
As the number of dimensions increases, the number of Cuboids explodes. To ease the build pressure on a Cube, Apache Kylin provides a set of advanced settings that help users keep only the Cuboids they really need: aggregation groups (Aggregation Group), joint dimensions (Joint Dimension), hierarchy dimensions (Hierarchy Dimension), and mandatory dimensions (Mandatory Dimension).
The main job of Apache Kylin is to build an N-dimensional Cube over the source data so that aggregations are precomputed. In theory, building a Cube with N dimensions generates 2^N Cuboids. As shown in figure 1, a Cube with 4 dimensions (A, B, C, D) requires 16 Cuboids.
(figure 1)
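The 2^N growth is easy to check by enumerating every subset of the dimension list. The following is a minimal sketch, not Kylin code; the helper name all_cuboids is made up for illustration.

```python
from itertools import combinations

def all_cuboids(dimensions):
    """Return every subset of the dimensions, each as a frozenset."""
    return [
        frozenset(combo)
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]

cuboids = all_cuboids(["A", "B", "C", "D"])
print(len(cuboids))                           # 16, i.e. 2**4
print(len(all_cuboids(list("ABCDEFGHIJ"))))   # 1024, i.e. 2**10
```

Each extra dimension doubles the count, which is why pruning matters long before a Cube reaches dozens of dimensions.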
As the number of dimensions grows, the number of Cuboids grows explosively, which takes up a lot of storage space and also lengthens the Cube build time. To ease the build pressure and reduce the number of Cuboids generated, Apache Kylin provides a set of advanced settings that help users keep only the Cuboids they really need. These settings are aggregation groups (Aggregation Group), joint dimensions (Joint Dimension), hierarchy dimensions (Hierarchy Dimension), and mandatory dimensions (Mandatory Dimension). The sections below explain what each setting means and the scenarios it applies to.
Aggregation Group
Based on the dimension combinations that users care about, the dimensions can be divided into a few broad groups; in Apache Kylin these are called aggregation groups. Take the Cube in figure 1: if the user only cares about combinations of dimensions A and B and combinations of dimensions C and D, the Cube can be split into two aggregation groups, AB and CD. As shown in figure 2, the number of Cuboids generated drops from 16 to 8.
(figure 2)
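The count of 8 can be reproduced with a small enumeration. This is a sketch under two assumptions that are not stated in the text: the base Cuboid containing all dimensions is always kept, and the empty (no-dimension) combination is counted. The function names are hypothetical.

```python
from itertools import combinations

def subsets(dims):
    """All subsets of a group of dimensions, as frozensets."""
    return {frozenset(c) for n in range(len(dims) + 1) for c in combinations(dims, n)}

def cuboids_for_groups(all_dims, groups):
    """Union of each group's subsets, plus the base cuboid of all dimensions."""
    kept = {frozenset(all_dims)}  # assumption: the base cuboid is always built
    for group in groups:
        kept |= subsets(group)
    return kept

kept = cuboids_for_groups("ABCD", ["AB", "CD"])
print(len(kept))  # 8: ABCD, AB, A, B, CD, C, D and the empty combination
```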
Aggregation groups may share dimensions. For example, aggregation group ABC and aggregation group BCD both contain dimensions B and C, so both groups derive some of the same Cuboids: group ABC produces Cuboid BC, and so does group BCD. Such Cuboids are not built twice; a single copy is shared by the groups, as shown in figure 3.
(figure 3)
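Sharing falls out naturally if the Cuboids of each group are modelled as sets. A minimal sketch (the variable names are illustrative, and this says nothing about how Kylin stores Cuboids internally):

```python
from itertools import combinations

def subsets(dims):
    """All subsets of a group of dimensions, as frozensets."""
    return {frozenset(c) for n in range(len(dims) + 1) for c in combinations(dims, n)}

group_abc = subsets("ABC")
group_bcd = subsets("BCD")

shared = group_abc & group_bcd   # Cuboids that both groups would derive
stored = group_abc | group_bcd   # a set union keeps each shared Cuboid once

print(sorted("".join(sorted(c)) for c in shared if c))
print(len(group_abc) + len(group_bcd), "derived,", len(stored), "stored")
```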
With aggregation groups, users can filter Cuboids at a coarse granularity and keep the dimension combinations they want.
Application example of aggregation groups
Suppose you create a Cube of transaction data with the following dimensions: customer ID buyer_id, transaction date cal_dt, payment method pay_type, and the buyer's city city. Sometimes analysts group by city, cal_dt and pay_type to see how different payment methods are used in different cities; at other times they group by city, cal_dt and buyer_id to see the consumption behavior of customers in different cities. For this example, two aggregation groups are recommended, with the dimensions shown in figure 4:
(figure 4)
Aggregation group 1: [cal_dt, city, pay_type]
Aggregation group 2: [cal_dt, city, buyer_id]
Leaving other factors aside, these aggregation groups avoid building 3 unnecessary Cuboids: [pay_type, buyer_id], [city, pay_type, buyer_id] and [cal_dt, pay_type, buyer_id], which saves storage and build time.
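The three pruned Cuboids can be derived by subtracting what the two groups cover from the full set of combinations. As a sketch, this assumes the base Cuboid with all four dimensions is still built; the helper name is made up.

```python
from itertools import combinations

def subsets(dims):
    """All subsets of a list of dimensions, as frozensets."""
    return {frozenset(c) for n in range(len(dims) + 1) for c in combinations(dims, n)}

dims = ["cal_dt", "city", "pay_type", "buyer_id"]
groups = [["cal_dt", "city", "pay_type"], ["cal_dt", "city", "buyer_id"]]

kept = {frozenset(dims)}          # assumption: the base cuboid is always built
for group in groups:
    kept |= subsets(group)

pruned = subsets(dims) - kept     # combinations no aggregation group covers
for cuboid in sorted(pruned, key=lambda c: (len(c), sorted(c))):
    print(sorted(cuboid))
```

Every pruned combination contains both pay_type and buyer_id, the two dimensions that never appear in the same group.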
Case 1:
SELECT cal_dt, city, pay_type, count(*) FROM table GROUP BY cal_dt, city, pay_type reads its data from Cuboid [cal_dt, city, pay_type].
Case 2:
SELECT cal_dt, city, buyer_id, count(*) FROM table GROUP BY cal_dt, city, buyer_id reads its data from Cuboid [cal_dt, city, buyer_id].
Case 3, a rarely used query:
SELECT pay_type, buyer_id, count(*) FROM table GROUP BY pay_type, buyer_id has no exact-match Cuboid.
In this case, Apache Kylin computes the final result from an existing Cuboid at query time.
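One way to picture the three cases is a lookup that returns the smallest available Cuboid containing every GROUP BY column. This is an illustrative sketch only: pick_cuboid is a hypothetical helper, and the "smallest superset" rule is an assumption rather than a description of Kylin's actual query planner.

```python
from itertools import combinations

def subsets(dims):
    """All subsets of a list of dimensions, as frozensets."""
    return {frozenset(c) for n in range(len(dims) + 1) for c in combinations(dims, n)}

dims = ["cal_dt", "city", "pay_type", "buyer_id"]
available = {frozenset(dims)}  # assumption: the base cuboid is always built
available |= subsets(["cal_dt", "city", "pay_type"])
available |= subsets(["cal_dt", "city", "buyer_id"])

def pick_cuboid(group_by, cuboids):
    """Return the smallest cuboid that contains every GROUP BY column."""
    wanted = frozenset(group_by)
    candidates = [c for c in cuboids if wanted <= c]
    return min(candidates, key=len)

print(sorted(pick_cuboid(["cal_dt", "city", "pay_type"], available)))  # exact match
print(sorted(pick_cuboid(["pay_type", "buyer_id"], available)))        # base cuboid
```

In the second call only the four-dimension base Cuboid covers both columns, so the result has to be aggregated further at query time.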
Joint Dimension
Sometimes users do not care about the finer combinations among certain dimensions. For example, their queries only ever contain group by A, B, C and never finer combinations such as group by A, B or group by C. Joint dimensions address this situation. If dimensions A, B, and C are defined as a joint dimension, Apache Kylin only builds Cuboid ABC; Cuboids such as AB, BC, and A are not generated. The resulting Cube is shown in figure 5, with the number of Cuboids reduced from 16 to 4.
(figure 5)
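A joint dimension behaves like a single unit that is either fully present or fully absent. The sketch below enumerates combinations of such units; the function name is hypothetical, and the count includes the empty combination, which is an assumption about how the figure counts.

```python
from itertools import combinations

def cuboids_with_joint(units):
    """Each unit is a tuple of dimensions that appear together or not at all."""
    result = set()
    for n in range(len(units) + 1):
        for combo in combinations(units, n):
            result.add(frozenset(d for unit in combo for d in unit))
    return result

kept = cuboids_with_joint([("A", "B", "C"), ("D",)])
print(sorted("".join(sorted(c)) for c in kept))  # ['', 'ABC', 'ABCD', 'D']
```

With A, B and C fused into one unit, the Cube effectively has two independent choices instead of four, hence 2^2 = 4 combinations.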
Application example of joint dimensions
Suppose you create a Cube of transaction data with many common dimensions, such as transaction date cal_dt, transaction city city, customer gender sex_id, and payment type pay_type. Analysts commonly aggregate by transaction date, transaction city and customer gender together (cal_dt, city and sex_id) to compare the consumption preferences of male and female customers in different cities. For this example, it is recommended to create a joint dimension inside the existing aggregation group, with the dimensions and combinations shown in figure 6:
(figure 6)
Aggregation group: [cal_dt, city, sex_id, pay_type]
Joint dimensions: [cal_dt, city, sex_id]
Case 1:
SELECT cal_dt, city, sex_id, count(*) FROM table GROUP BY cal_dt, city, sex_id reads its data from Cuboid [cal_dt, city, sex_id].
Case 2, a rarely used query:
SELECT cal_dt, city, count(*) FROM table GROUP BY cal_dt, city has no exact-match Cuboid, so Apache Kylin computes the final result from an existing Cuboid at query time.
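The query-time computation in Case 2 amounts to re-aggregating a finer Cuboid. The sketch below rolls up rows of Cuboid [cal_dt, city, sex_id] onto (cal_dt, city) by summing counts. The rows are invented sample data and roll_up is a hypothetical helper, not a Kylin API.

```python
from collections import Counter

# Hypothetical pre-aggregated rows of Cuboid [cal_dt, city, sex_id]: key -> count(*)
cuboid_rows = {
    ("2024-01-01", "Beijing", "F"): 3,
    ("2024-01-01", "Beijing", "M"): 2,
    ("2024-01-01", "Shanghai", "F"): 4,
}

def roll_up(rows, keep_positions):
    """Re-aggregate cuboid rows onto a subset of the key columns by summing counts."""
    totals = Counter()
    for key, count in rows.items():
        totals[tuple(key[i] for i in keep_positions)] += count
    return dict(totals)

# GROUP BY cal_dt, city: drop sex_id (position 2) and sum what remains.
by_date_city = roll_up(cuboid_rows, keep_positions=(0, 1))
print(by_date_city)
```

The extra pass is cheap compared with scanning the source data, which is why an occasional non-matching query is an acceptable price for building fewer Cuboids.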
Hierarchy Dimension
The dimensions a user selects often include some that are hierarchically related. For example, country (country), province (province) and city (city) form a one-to-many relationship from top to bottom. As a result, user queries over these three dimensions fall into the following three categories:
group by country
group by country, province (equivalent to group by province)
group by country, province, city (equivalent to group by country, city or group by city)
Take the Cube in figure 1 as an example. Assuming that dimension A represents the country, dimension B the province, and dimension C the city, the three dimensions A, B and C can be set as a hierarchy dimension, and the resulting Cube is shown in figure 7.
(figure 7)
For example, Cuboid [A, C, D] = Cuboid [A, B, C, D] and Cuboid [B, D] = Cuboid [A, B, D], so Cuboid [A, C, D] and Cuboid [B, D] do not need to be stored separately.
Figure 8 shows the Cube structure that results when Kylin prunes the redundant Cuboids in this way; the number of Cuboids is reduced from 16 to 8.
(figure 8)
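The pruning rule can be stated as: a level of the hierarchy may appear in a Cuboid only if every level above it appears too. A minimal sketch of that rule follows; the helper names are made up, and the count includes the empty combination.

```python
from itertools import combinations

def respects_hierarchy(cuboid, hierarchy):
    """True if the hierarchy levels present in the cuboid form a top-down prefix."""
    present = [level in cuboid for level in hierarchy]
    # Valid patterns look like True, True, False: never a True after a False.
    return all(upper or not lower for upper, lower in zip(present, present[1:]))

def cuboids_with_hierarchy(dimensions, hierarchy):
    """Enumerate all subsets and keep those that respect the hierarchy."""
    kept = []
    for n in range(len(dimensions) + 1):
        for combo in combinations(dimensions, n):
            if respects_hierarchy(frozenset(combo), hierarchy):
                kept.append(frozenset(combo))
    return kept

kept = cuboids_with_hierarchy(["A", "B", "C", "D"], hierarchy=["A", "B", "C"])
print(len(kept))  # 8
print(sorted("".join(sorted(c)) for c in kept))
```

The hierarchy A, B, C allows only four states (none, A, AB, ABC), and D is either present or absent, which gives 4 x 2 = 8.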
Application example of hierarchy dimensions
Suppose a Cube of transaction data has many common dimensions, such as transaction city city, transaction province province, transaction country country, and payment type pay_type. Analysts can aggregate by city, province, country and payment type to learn the payment preferences of consumers at different geographic levels. For this example, it is recommended to create a hierarchy dimension (country, province, city) inside the existing aggregation group, with the dimensions and combinations shown in figure 9:
(figure 9)
Aggregation group: [country, province, city, pay_type]
Hierarchical dimensions: [country, province, city]
Case 1, when analysts want consumer preferences at the city level:
SELECT city, pay_type, count(*) FROM table GROUP BY city, pay_type reads its data from Cuboid [country, province, city, pay_type].
Case 2, when analysts want consumer preferences at the province level:
SELECT province, pay_type, count(*) FROM table GROUP BY province, pay_type reads its data from Cuboid [country, province, pay_type].
Case 3, when analysts want consumer preferences at the country level:
SELECT country, pay_type, count(*) FROM table GROUP BY country, pay_type reads its data from Cuboid [country, pay_type].
Case 4, when analysts want aggregates over geographic dimensions at mixed granularity:
Without exception, such queries can be served by the Cuboids shown in figure 9.
For example, SELECT country, city, count(*) FROM table GROUP BY country, city reads its data from Cuboid [country, province, city].
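The mapping in Cases 1 to 4 follows one rule: take the deepest hierarchy level the query uses and add every level above it. The sketch below applies that rule; expand_with_ancestors is a hypothetical helper written for illustration, not part of Kylin.

```python
def expand_with_ancestors(group_by, hierarchy):
    """Add every hierarchy level above the deepest level used in the query."""
    used = [i for i, level in enumerate(hierarchy) if level in group_by]
    if not used:
        return list(group_by)
    ancestors = hierarchy[: max(used) + 1]          # top level down to the deepest one used
    others = [d for d in group_by if d not in hierarchy]
    return ancestors + others

geo = ["country", "province", "city"]
print(expand_with_ancestors(["city", "pay_type"], geo))
print(expand_with_ancestors(["province", "pay_type"], geo))
print(expand_with_ancestors(["country", "city"], geo))
```

Because each city belongs to exactly one province and country, adding the ancestor columns does not change the grouping, so the expanded Cuboid answers the original query directly.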
Mandatory Dimension
Users are sometimes particularly interested in one or more dimensions that appear in the group by of every query. Such a dimension is called a mandatory dimension, and only Cuboids that contain it are generated (figure 10).
(figure 10)
Take the Cube in figure 1 as an example. Assuming that dimension A is a mandatory dimension, the generated Cube is shown in figure 11, and the number of Cuboids drops from 16 to 8, since exactly half of the combinations contain A.
(figure 11)
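Applying the rule literally, a direct enumeration keeps only the combinations that contain the mandatory dimension. This is a minimal sketch with a made-up function name; it models nothing beyond the rule as stated.

```python
from itertools import combinations

def cuboids_with_mandatory(dimensions, mandatory):
    """Keep only the dimension combinations that contain every mandatory dimension."""
    required = frozenset(mandatory)
    return [
        frozenset(combo)
        for n in range(len(dimensions) + 1)
        for combo in combinations(dimensions, n)
        if required <= frozenset(combo)
    ]

kept = cuboids_with_mandatory(["A", "B", "C", "D"], mandatory=["A"])
print(len(kept))  # 8
print(sorted("".join(sorted(c)) for c in kept))
```

Each mandatory dimension removes the half of the remaining combinations that lack it, so two mandatory dimensions out of four would leave 4.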
Application example of mandatory dimensions
Suppose a Cube of transaction data has many common dimensions, such as transaction time order_dt, transaction location location, transaction commodity product, and payment type pay_type. Of these, transaction time is used as a grouping condition (group by) very frequently. If order_dt is set as a mandatory dimension, the resulting dimensions and combinations are as shown in figure 12:
(figure 12)