Understanding the Principle of Aggregation Groups in Apache Kylin Optimization


2025-07-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains the principle behind aggregation groups in Apache Kylin optimization: what they are, how they reduce the number of Cuboids, and how to apply them in practice.

"As the number of dimensions increases, the number of Cuboids explodes. To ease the pressure of building a Cube, Apache Kylin introduces a series of advanced settings that help users keep only the Cuboids they really need. These advanced settings include Aggregation Group, Joint Dimension, Hierarchy Dimension, Mandatory Dimension and so on."

As we all know, the main work of Apache Kylin is to build an N-dimensional Cube from source data and precompute its aggregations. In theory, building a Cube with N dimensions generates 2^N Cuboids. As shown in figure 1, a Cube with 4 dimensions (A, B, C, D) requires 16 Cuboids.
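The 2^N count can be checked with a minimal Python sketch that enumerates every dimension combination. The helper name all_cuboids is hypothetical and not part of Kylin; the sketch only illustrates the counting argument.

```python
from itertools import combinations

def all_cuboids(dimensions):
    """Return every combination of the dimensions (including the empty one)."""
    return [
        frozenset(combo)
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]

cuboids = all_cuboids(["A", "B", "C", "D"])
print(len(cuboids))  # 16, i.e. 2**4

# The count doubles with every added dimension.
for n in (4, 10, 20):
    print(n, 2 ** n)
```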

Figure 1

As the number of dimensions increases, the number of Cuboids grows explosively, which not only takes up a lot of storage space but also prolongs the build time of the Cube. To ease the build pressure and reduce the number of Cuboids generated, Apache Kylin introduces a series of advanced settings that help users keep only the Cuboids they really need. These advanced settings include aggregation groups (Aggregation Group), joint dimensions (Joint Dimension), hierarchy dimensions (Hierarchy Dimension), and mandatory dimensions (Mandatory Dimension). This series explains in depth what these advanced settings mean and the scenarios they apply to.

This article focuses on how aggregation groups work and gives an example of where to apply them.

Aggregation Group

The dimension combinations that users care about can usually be divided into a few broad categories. In Apache Kylin, each such category is called an aggregation group. Take the Cube shown in figure 1: if the user only cares about the combination of dimensions A and B and the combination of dimensions C and D, the Cube can be divided into two aggregation groups, namely aggregation group AB and aggregation group CD. As shown in figure 2, the number of Cuboids generated is reduced from 16 to 8.

Figure 2

The aggregation groups that users care about may share dimensions. For example, aggregation group ABC and aggregation group BCD both contain dimensions B and C. Such groups derive some of the same Cuboids: aggregation group ABC produces Cuboid BC, and aggregation group BCD also produces Cuboid BC. These Cuboids are not generated twice; a single copy is shared by the aggregation groups, as shown in figure 3.
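The sharing described above amounts to taking a set union of the Cuboids derived from each group. The following is a minimal sketch under that assumption; group_cuboids is a hypothetical helper, and the sketch does not model any other pruning rule that Kylin may apply.

```python
from itertools import combinations

def group_cuboids(group):
    """Every dimension combination derivable from one aggregation group."""
    return {
        frozenset(combo)
        for size in range(len(group) + 1)
        for combo in combinations(group, size)
    }

abc = group_cuboids("ABC")
bcd = group_cuboids("BCD")

shared = abc & bcd  # Cuboids that both groups derive
built = abc | bcd   # a set keeps only one copy of each shared Cuboid

print(sorted("".join(sorted(c)) for c in shared))  # ['', 'B', 'BC', 'C']
print(len(abc), len(bcd), len(built))              # 8 8 12
```

Because the result is a set, Cuboid BC appears once even though both groups derive it.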

Figure 3

With aggregation groups, users can filter Cuboids in a coarse-grained manner and keep only the dimension combinations they want.

Application example

Suppose you create a Cube of transaction data that contains the following dimensions: the customer ID buyer_id, the transaction date cal_dt, the payment method pay_type, and the buyer's city city. Sometimes analysts group and aggregate by city, cal_dt and pay_type to understand how different payment methods are used in different cities; at other times they aggregate by city, cal_dt and buyer_id to look at customers' consumption behavior in different cities. For this example, it is recommended to establish two aggregation groups with the dimensions shown in figure 4:

Aggregation group 1: [cal_dt, city, pay_type]

Aggregation group 2: [cal_dt, city, buyer_id]

Leaving other factors aside, these aggregation groups avoid building 3 unnecessary Cuboids: [pay_type, buyer_id], [city, pay_type, buyer_id] and [cal_dt, pay_type, buyer_id], which saves storage and build time.
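The three skipped Cuboids can be derived by comparing the full set of combinations with what the two groups produce. This is a minimal sketch, not Kylin's implementation: the subsets helper is hypothetical, and the sketch assumes the full four-dimension Cuboid is always kept, which is what the count of three in the text implies.

```python
from itertools import combinations

DIMS = ["cal_dt", "city", "pay_type", "buyer_id"]
GROUPS = [
    ["cal_dt", "city", "pay_type"],
    ["cal_dt", "city", "buyer_id"],
]

def subsets(dims):
    """Every combination of the given dimensions, as frozensets."""
    return {
        frozenset(combo)
        for size in range(len(dims) + 1)
        for combo in combinations(dims, size)
    }

full = subsets(DIMS)
built = set().union(*(subsets(group) for group in GROUPS))
built.add(frozenset(DIMS))  # assumption: the full-dimension Cuboid is always kept

skipped = full - built
print(len(full), len(built), len(skipped))  # 16 13 3
for cuboid in sorted(skipped, key=lambda c: (len(c), sorted(c))):
    print(sorted(cuboid))
```

Every skipped Cuboid contains both pay_type and buyer_id, the one pair of dimensions that no aggregation group covers together.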

Case 1:

SELECT cal_dt, city, pay_type, COUNT(*) FROM table GROUP BY cal_dt, city, pay_type will get data from Cuboid [cal_dt, city, pay_type].

Case 2:

SELECT cal_dt, city, buyer_id, COUNT(*) FROM table GROUP BY cal_dt, city, buyer_id will get data from Cuboid [cal_dt, city, buyer_id].

Case 3: a query that is not commonly used:

SELECT pay_type, buyer_id, COUNT(*) FROM table GROUP BY pay_type, buyer_id has no ready-made, exactly matching Cuboid.

In this case, Apache Kylin computes the final result at query time by further aggregating an existing Cuboid.
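The idea of answering a query from a larger Cuboid can be sketched as a roll-up over precomputed counts. Everything in this sketch is an assumption for illustration: the sample rows, the roll_up helper, and the choice of the four-dimension Cuboid as the source. How Kylin actually selects and scans a Cuboid is not described here.

```python
from collections import Counter

# Hypothetical precomputed Cuboid: one COUNT(*) per combination of all four dimensions.
CUBOID_DIMS = ("cal_dt", "city", "pay_type", "buyer_id")
cuboid = {
    ("2024-01-01", "Shanghai", "card", "u1"): 2,
    ("2024-01-01", "Beijing", "cash", "u2"): 1,
    ("2024-01-02", "Shanghai", "card", "u1"): 3,
    ("2024-01-02", "Shanghai", "cash", "u1"): 1,
}

def roll_up(cuboid, cuboid_dims, query_dims):
    """Aggregate precomputed counts down to the dimensions the query groups by."""
    positions = [cuboid_dims.index(dim) for dim in query_dims]
    result = Counter()
    for key, count in cuboid.items():
        result[tuple(key[i] for i in positions)] += count
    return dict(result)

# Equivalent of: SELECT pay_type, buyer_id, COUNT(*) ... GROUP BY pay_type, buyer_id
result = roll_up(cuboid, CUBOID_DIMS, ("pay_type", "buyer_id"))
print(result)
```

The roll-up only touches pre-aggregated rows rather than the raw source data, but it costs more at query time than reading an exactly matching Cuboid.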

As a multi-dimensional analysis tool, Apache Kylin relies on precomputation, trading space for time to improve query efficiency.
