2025-01-15 Update | From: SLTechnology News & Howtos
Shulou (Shulou.com), 06/03 report
Recently, I came across a saying: "The key thinking in architectural design is judgment and trade-off; the key thinking in programming is logic and implementation."
Text | Anson, CTO of Getui
Introduction
The first article in this series, "The Coming of the Era of Data Intelligence: Essence and Technical System Requirements," briefly described our understanding of data intelligence and the core technical systems it requires.
Data intelligence combines large-scale data processing, data mining, machine learning, human-computer interaction, visualization, and other technologies to extract, discover, and acquire knowledge from large volumes of data, providing effective intelligent support for data-driven decision-making and reducing or eliminating uncertainty.
Given this definition, the technical system for data intelligence needs to cover at least the aspects shown in the following figure:
▲ Composition of the data intelligence technology system
Among these, data asset governance, data quality assurance, and the secure computing system underlying data intelligence will be covered in depth in later articles in this series.
Recently, however, practical work has surfaced some real difficulties in handling multidimensional data to solve business problems, in particular uncertainty about which underlying system to choose. After all, few companies have the resources to experiment with every option.
So I studied the question with my team and, drawing on some external materials, wrote this second article in the series around the theme of "how to select a multidimensional analysis system." I hope it shortens your decision-making time.
Main content
Factors to consider in an analysis system
Everyone is familiar with the CAP theorem: we cannot have all three properties at once, so we must trade off among C, A, and P. In an analysis system we likewise have to weigh and balance three elements: data volume, flexibility, and performance.
▲ The three elements to weigh in an analysis system
With fixed resources, some systems cannot meet processing requirements once the data volume reaches a certain scale (for example, beyond the petabyte level), even for a simple analysis task.
Flexibility mainly refers to how freely data can be manipulated. For a typical analyst, SQL is the first choice because it imposes few constraints, whereas a domain-specific language (DSL) is relatively limiting. Flexibility also covers whether operations are subject to preconditions, such as whether flexible ad-hoc queries across multiple dimensions are supported. The last element is performance: whether the system supports concurrent operations and can respond within seconds.
Analysis of the data query process
An aggregate query over data generally follows three steps:
▲ Real-time query process
First, use an index to retrieve the row numbers or index positions of the matching data. This requires the ability to quickly filter hundreds of thousands or millions of rows out of hundreds of millions, which is what search engines do best; general relational databases, by contrast, are good at using indexes to retrieve small, precise result sets.
Next, load the actual data from main storage by row number or position. This requires quickly loading tens of millions of filtered rows into memory, which is what analytical databases do best, since they generally use column storage and some use mmap to speed up data access.
Finally, perform distributed computation to produce the final result set according to the GROUP BY and SELECT clauses. This is what big-data compute engines such as Spark and Hadoop do best.
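The three steps above can be sketched in a few lines of Python. This is purely illustrative: the column names, data, and `aggregate` function are invented for the example, and a real system would run each stage on different distributed components.

```python
# Sketch (illustrative only) of the three stages of an aggregate query:
# 1) index filter, 2) load from column storage, 3) aggregation.
from collections import defaultdict

# Stage 0: "stored" data, column-oriented (one list per column).
columns = {
    "country": ["US", "CN", "US", "DE", "CN", "US"],
    "amount":  [10,   20,   30,   40,   50,   60],
}

# A toy secondary index built at load time: column value -> row numbers.
index = defaultdict(list)
for row_no, value in enumerate(columns["country"]):
    index[value].append(row_no)

def aggregate(filter_value):
    # Step 1: use the index to retrieve matching row numbers.
    row_nos = index.get(filter_value, [])
    # Step 2: load only the needed column values for those rows
    # (column storage makes this a narrow, sequential read).
    amounts = [columns["amount"][r] for r in row_nos]
    # Step 3: compute the final aggregate (here a simple SUM).
    return sum(amounts)

print(aggregate("US"))  # rows 0, 2, 5 -> 10 + 30 + 60 = 100
```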
Comparison and analysis of architectures
Combining the factors above, current architectures fall into three main categories:
MPP (Massively Parallel Processing)
Architecture based on search engine
Precomputing system architecture
MPP architecture
Traditional RDBMSs have an absolute advantage in ACID guarantees. In the big-data era, if most of your data is still structured and not enormous, a platform like Hadoop is unnecessary; a distributed architecture can accommodate data growth and meet analysis needs while letting you keep using familiar SQL.
This architecture is called MPP (Massively Parallel Processing).
Of course, MPP is just an architecture, and its underlying layer need not be an RDBMS. It can be built on top of Hadoop infrastructure with a distributed query engine added (composed of a Query Planner, Query Coordinator, Query Exec Engine, and so on), instead of using batch processing like MapReduce.
Systems in this category include Greenplum, Impala, Drill, and Shark; Greenplum (generally abbreviated GP) uses PostgreSQL as its underlying database engine.
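The coordinator-plus-workers pattern behind an MPP GROUP BY can be sketched as follows. This is a minimal, assumed model, not the actual Greenplum or Impala implementation: each shard stands in for a worker node's local data, `node_partial_sum` for local aggregation, and `coordinator` for the final merge.

```python
# Sketch (illustrative): how an MPP engine might execute
#   SELECT country, SUM(amount) FROM t GROUP BY country
# Each node aggregates its local shard; a coordinator merges the partials.
from collections import Counter

# Each shard would live on a different node in a real MPP cluster.
shards = [
    [("US", 10), ("CN", 20)],
    [("US", 30), ("DE", 40)],
    [("CN", 50), ("US", 60)],
]

def node_partial_sum(shard):
    # Runs on each worker: a local GROUP BY over its own rows only.
    partial = Counter()
    for country, amount in shard:
        partial[country] += amount
    return partial

def coordinator(shards):
    # Runs on the coordinator: merge partial aggregates into the final result.
    total = Counter()
    for partial in map(node_partial_sum, shards):
        total.update(partial)  # Counter.update adds counts
    return dict(total)

print(coordinator(shards))  # {'US': 100, 'CN': 70, 'DE': 40}
```

The key property is that workers never exchange raw rows for a simple aggregate; only small partial results travel to the coordinator.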
Architecture based on search engine
Compared with MPP systems, a search engine converts data (documents) into an inverted index at write time, building the index with the three-level structure of Term Index, Term Dictionary, and Posting List, and applies compression techniques to save space.
Data (documents) are distributed across nodes by some rule, such as hashing the document ID. At query time the Scatter-Gather model is used: each node processes its own share of the data, and the results are gathered on the node that initiated the search for final aggregation.
The main systems in this category are ElasticSearch and Solr, which are generally operated through a DSL.
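The inverted index that makes this filtering fast can be sketched in a few lines. This is a bare-bones model with invented documents; real engines layer a term dictionary, term index, and posting-list compression on top of this idea.

```python
# Sketch (illustrative): a minimal inverted index, mapping each term to
# a posting list of document IDs. Queries become dictionary lookups
# instead of scans over all documents.
from collections import defaultdict

docs = {
    1: "big data analysis",
    2: "data mining and machine learning",
    3: "search engine analysis",
}

# Build: invert term -> list of doc IDs (the posting list).
postings = defaultdict(list)
for doc_id, text in docs.items():
    for term in set(text.split()):
        postings[term].append(doc_id)

def search(term):
    # Lookup is O(1) in the number of documents; only the posting
    # list for this term is touched.
    return sorted(postings.get(term, []))

print(search("analysis"))  # [1, 3]
print(search("data"))      # [1, 2]
```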
Precomputing system architecture
Systems like Apache Kylin use a precomputation architecture. They pre-aggregate data as it is loaded: a model is established in advance and the data is processed ahead of time to form a "materialized view" or data cube. Most of the processing is therefore completed before the query stage; a query is, in effect, a second pass over already-processed results.
The main systems in this category are Kylin and Druid. Although both use precomputation, there are many differences between them.
Kylin precomputes a cube (and supports SQL). Once the model is fixed, modifying it is expensive: essentially the entire cube must be recomputed. Moreover, precomputation does not run continuously but on a schedule, which limits Kylin's suitability for real-time data queries.
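The cube idea can be sketched as follows. This is an assumed toy model, not Kylin's actual implementation: at load time we precompute SUM(amount) for every combination of dimensions (each combination is one "cuboid"), so a query is reduced to a dictionary lookup. It also shows why changing the model is costly: the whole cube would have to be rebuilt.

```python
# Sketch (illustrative): Kylin-style cube precomputation.
# One cuboid per subset of dimensions, each mapping a dimension-value
# key to a pre-aggregated SUM(amount).
from itertools import combinations
from collections import defaultdict

rows = [
    {"country": "US", "year": 2023, "amount": 10},
    {"country": "US", "year": 2024, "amount": 30},
    {"country": "CN", "year": 2024, "amount": 50},
]
dimensions = ["country", "year"]

# Build the cube at load time (the expensive, scheduled step).
cube = defaultdict(lambda: defaultdict(int))
for row in rows:
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            key = tuple(row[d] for d in dims)
            cube[dims][key] += row["amount"]

def query(**filters):
    # Answer "SUM(amount) for these dimension values" from the
    # matching cuboid; no raw rows are scanned at query time.
    dims = tuple(d for d in dimensions if d in filters)
    key = tuple(filters[d] for d in dims)
    return cube[dims][key]

print(query(country="US"))             # 10 + 30 = 40
print(query(country="CN", year=2024))  # 50
print(query())                         # grand total: 90
```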
Druid is better suited to real-time computation and ad-hoc queries (it did not support SQL at the time of writing). It uses bitmaps as its main index, so it can filter and process data quickly, but for complex queries its performance is worse than Kylin's.
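How a bitmap index makes filtering fast can be sketched like this. The data and helper names are invented for illustration; a real system such as Druid compresses these bitmaps (for example with Roaring or Concise encodings) rather than using plain integers.

```python
# Sketch (illustrative): a bitmap index. Each (dimension, value) pair
# gets a bitmap with one bit per row; AND-ing bitmaps evaluates a
# multi-condition filter without scanning the rows.
rows = [
    {"country": "US", "device": "ios"},
    {"country": "CN", "device": "android"},
    {"country": "US", "device": "android"},
    {"country": "US", "device": "ios"},
]

# Build: one Python int used as a bitmap per (dimension, value);
# bit i corresponds to row i.
bitmaps = {}
for i, row in enumerate(rows):
    for dim, value in row.items():
        bitmaps[(dim, value)] = bitmaps.get((dim, value), 0) | (1 << i)

def filter_rows(*conditions):
    # AND the bitmaps together, then read off the surviving row numbers.
    bm = -1  # all bits set
    for cond in conditions:
        bm &= bitmaps.get(cond, 0)
    return [i for i in range(len(rows)) if bm & (1 << i)]

print(filter_rows(("country", "US"), ("device", "ios")))  # [0, 3]
```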
Based on this analysis, Kylin is generally positioned as an offline OLAP engine for very large data volumes, while Druid is positioned as a real-time OLAP engine for large data volumes.
Comparison of the three architectures
Systems based on the MPP architecture:
Good support for data volume and flexibility, but no firm guarantee on response time. As data volume and query complexity grow, response time degrades from seconds to minutes, or even hours.
Search-engine-based systems:
Compared with MPP systems, they sacrifice some flexibility for strong performance and can achieve sub-second responses on search-style queries. For scan-and-aggregate queries, however, response time degrades to minutes as the volume of data processed grows.
Precomputation systems:
They pre-aggregate data on load, sacrificing still more flexibility for performance, and achieve second-level responses even on very large datasets.
Summarizing the analysis above, across the three architectures:
Supported data volume goes from small to large
Flexibility goes from high to low
Performance on large data volumes goes from low to high
We can therefore choose based on the actual volume of business data and the requirements for flexibility and performance. For example, GP may meet the needs of most companies, while Kylin suits scenarios with very large data volumes.
Conclusion
To return to the saying quoted at the start: "The key thinking in architectural design is judgment and trade-off; the key thinking in programming is logic and implementation."
Going forward, our technical team will continue to explore selection methods for multidimensional analysis systems and discuss them with you, as always, to provide better service for developers.
For more information, please follow: the Getui Institute of Technology