
How to select the OLAP computing engine

2025-07-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Choosing an OLAP computing engine is hard for people who have not worked with these systems before. This article summarizes the main options and their trade-offs to help with that choice.

Today, let's talk about OLAP technology. I think a good OLAP engine should meet three conditions: it should be easy to develop against, easy to maintain, and easy to port. Below I share several common OLAP computing engines, along with their characteristics, applicable scenarios, advantages, and disadvantages. I hope this helps you with selection and application.

Kylin

Brief introduction

1. Kylin is a MOLAP system originally developed by eBay.

2. It provides a SQL query interface and multidimensional analysis on Hadoop, and supports very large data sets.

3. It integrates with BI tools such as Tableau.

Scope of application

Suitable for: data warehouses, user behavior analysis, traffic (log) analysis, self-service analysis platforms, e-commerce analysis, advertising effectiveness analysis, real-time analysis, data service platforms, and similar scenarios.

Product characteristics

1. Kylin precomputes aggregates over data stored in Hive, running the precomputation on Hadoop's MapReduce framework.

2. Kylin provides standard SQL on Hadoop and supports most query functions.

3. Users can interact with Hadoop data at sub-second latency, with better performance than Hive on the same data set.

4. Users can define data models and build cubes in Kylin over data sets of more than 10 billion records.

5. A friendly web interface for managing, monitoring, and using cubes.

6. A plug-in mechanism for adding further features.

7. Integration with life-cycle management systems such as job scheduling, ETL, and monitoring.

8. Kylin caches every result that may be queried by precomputing it, which requires a lot of storage space (10 or more times the size of the original data).
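The precomputation described in the list above can be sketched in a few lines of Python. This only illustrates cube building in general and is not Kylin's implementation; the sample table, the column names, and the helper `build_cube` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: two dimensions (region, device) and one measure (sales).
ROWS = [
    {"region": "EU", "device": "web", "sales": 10},
    {"region": "EU", "device": "app", "sales": 5},
    {"region": "US", "device": "web", "sales": 7},
]
DIMENSIONS = ("region", "device")


def build_cube(rows, dimensions, measure):
    """Precompute SUM(measure) for every subset of the dimensions (every cuboid)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


cube = build_cube(ROWS, DIMENSIONS, "sales")

# A query like "SUM(sales) GROUP BY region" is now a dictionary lookup,
# not a scan over the raw rows.
sales_by_region = cube[("region",)]
```

At query time the work is a lookup into `cube`, which is where the sub-second response comes from; the price is that every cuboid has to be built and stored ahead of time.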

Presto

Brief introduction

1. Presto is an open source distributed SQL query engine for interactive analytic queries. It handles data volumes from gigabytes to petabytes.

2. Presto was designed and written for interactive analytics, aiming to approach the speed of commercial data warehouses while scaling to organizations the size of Facebook.

Scope of application

1. Presto supports SQL and provides the syntax features of a standard database, but it is not a relational database in the usual sense.

2. Presto is one option for querying data stored in HDFS.

3. It is designed for data warehousing and analytics: analyzing data, aggregating large volumes of data, and generating reports. These workloads are usually classified as OLAP.

Product characteristics

1. Presto queries data where it lives, with support for sources including Hive and Cassandra.

2. A single Presto query can combine data from multiple data sources, allowing analysis across the entire organization.

3. Parallel computation performed entirely in memory

4. Pipelined execution

5. Data-local computation

6. Dynamic compilation of execution plans

7. Careful use of memory and data structures

8. Approximate queries, similar to BlinkDB

9. GC control
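The in-memory, pipelined execution named in the list above can be illustrated with Python generators. This is a conceptual sketch rather than Presto code; the operator names `scan`, `filter_rows`, and `count_rows`, and the sample data, are hypothetical.

```python
def scan(rows, log):
    """Source operator: emits rows one at a time instead of materializing a table."""
    for row in rows:
        log.append(f"scan {row['id']}")
        yield row


def filter_rows(rows, predicate, log):
    """Filter operator: handles each row as soon as the upstream operator emits it."""
    for row in rows:
        if predicate(row):
            log.append(f"keep {row['id']}")
            yield row


def count_rows(rows):
    """Final aggregation: pulls rows through the whole pipeline."""
    return sum(1 for _ in rows)


data = [{"id": 1, "amount": 5}, {"id": 2, "amount": 50}, {"id": 3, "amount": 70}]
log = []
pipeline = filter_rows(scan(data, log), lambda r: r["amount"] > 10, log)
result = count_rows(pipeline)
```

The `log` list records the order of work: scanning and filtering interleave row by row, so no stage waits for the previous stage to finish and nothing is written to disk between stages.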

Impala

Brief introduction

1. A real-time query and analysis tool for big data, developed by Cloudera.

2. It is a distributed, massively parallel processing (MPP) database engine made up of several daemon processes that run on the hosts of a CDH cluster.

3. Impala is mainly composed of Impalad, State Store and CLI.

Scope of application

Impala is suitable for real-time interactive SQL queries. It gives data analysts a big data analysis tool for quickly experimenting with and validating ideas.

Product characteristics

1. Fast queries. Hive's underlying execution uses the MapReduce engine, so it is still batch processing. Impala, by contrast, does not write intermediate results to disk; it streams them over the network as soon as they are available, which greatly reduces the I/O overhead on each node.

2. High flexibility. Impala can query raw data stored on HDFS directly, or data stored in optimized formats, as long as the format is compatible with MapReduce, Hive, Pig, and so on.

3. Easy to integrate. Impala integrates easily with Hadoop systems and uses the resources and strengths of the Hadoop ecosystem to meet query and analysis requirements, without migrating data to a dedicated storage system.

4. Scalability. It works well with BI applications such as MicroStrategy, Tableau, and QlikView.

5. Queries in Impala run 3 to 90 times faster than in Hive.

Kudu

Brief introduction

1. A storage system developed by Cloudera. Its overall usage model is similar to that of HBase: it supports row-level random reads and writes as well as batch sequential scans.

2. Kudu manages structured tables similar to those of a relational database.

3. Kudu's core is written in C++ and exposes a Java API.

Scope of application

1. Kudu is positioned to provide fast analytics on fast data, that is, quick queries over rapidly updated data.

2. It targets OLAP workloads plus a small amount of OLTP. For workloads dominated by random access, the official recommendation is that HBase remains the better fit.

Product characteristics

1. Kudu's cluster architecture is broadly similar to that of HBase. It uses a master-slave structure in which Master nodes manage metadata and Tablet nodes manage the sharded data.

2. Kudu takes an approach similar to that of a log-structured storage system. Inserts, updates, and deletes first go into an in-memory buffer and are later merged into persistent columnar storage. Kudu also uses write-ahead logs (WALs) so that the in-memory buffer can be recovered after a failure.
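The write path described above (log first, buffer in memory, flush to columns) can be sketched as a toy class. This is an illustration under simplifying assumptions, not Kudu's implementation or API: the class `TinyColumnStore` and its methods are hypothetical, the "disk" and the WAL are plain Python containers, and merging flushed data with existing rows (compaction) is left out.

```python
class TinyColumnStore:
    """Toy store: write-ahead log + in-memory row buffer + flushed column arrays."""

    def __init__(self, columns, flush_threshold=2):
        self.columns = columns                 # first column acts as the primary key
        self.flush_threshold = flush_threshold
        self.wal = []                          # stands in for an on-disk write-ahead log
        self.buffer = {}                       # in-memory buffer: primary key -> row
        self.disk = {c: [] for c in columns}   # stands in for persistent column files

    def upsert(self, row):
        self.wal.append(row)                   # log first so the buffer can be rebuilt
        self.buffer[row[self.columns[0]]] = row
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Move buffered rows into the column arrays, one array per column."""
        for key in sorted(self.buffer):
            for c in self.columns:
                self.disk[c].append(self.buffer[key][c])
        self.buffer.clear()
        self.wal.clear()                       # flushed rows no longer need replay

    def recover(self):
        """Rebuild the in-memory buffer by replaying the WAL after a crash."""
        self.buffer = {row[self.columns[0]]: row for row in self.wal}


store = TinyColumnStore(["id", "city"], flush_threshold=2)
store.upsert({"id": 1, "city": "Oslo"})
store.upsert({"id": 2, "city": "Rome"})   # buffer is full, so it is flushed
store.upsert({"id": 3, "city": "Lima"})   # stays in the buffer and the WAL
store.buffer = {}                         # simulate losing memory in a crash
store.recover()
```

After `recover()`, the unflushed row with `id` 3 is back in the buffer, while the first two rows already sit in the column arrays.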

Comparative analysis

Kylin

Kylin has taken a completely different path from the real-time OLAP (RTOLAP) engines that are popular on the market. Kylin optimizes how quickly precomputed results can be retrieved, and it optimizes query parsing so that more queries can be answered from precomputed results. Later versions of Kylin are expected to speed up precomputation so that Kylin can become a near-real-time analysis engine. Its disadvantages are that SQL support may be sacrificed to some extent and that the storage cost is relatively large.

Presto, SparkSQL, HAWQ, and similar engines instead focus on optimizing the query process itself. Like some other data warehouses, they use columnar storage, compression, parallel query execution, and related techniques. The advantage of this approach is that it is scalable and adapts to a wider range of queries, but because every aggregate is computed on the fly, its performance is still lower than that of Kylin.
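The storage cost of full precomputation grows quickly with the number of dimensions. As rough arithmetic, assuming a full cube with no pruning and hypothetical dimension cardinalities, the number of cuboids and an upper bound on the stored aggregate rows can be computed like this. The function names are illustrative only.

```python
from math import prod


def cuboid_count(num_dimensions):
    """A full cube has one cuboid per subset of the dimensions."""
    return 2 ** num_dimensions


def max_cube_rows(cardinalities):
    """Upper bound on aggregate rows across all cuboids.

    Summing the product of cardinalities over every subset of dimensions
    equals the product of (cardinality + 1) over all dimensions.
    """
    return prod(c + 1 for c in cardinalities)


# Hypothetical cardinalities for three dimensions.
cards = [10, 20, 5]
cuboids = cuboid_count(len(cards))
rows_upper_bound = max_cube_rows(cards)
```

With these assumed values the cube has 8 cuboids and at most 1386 aggregate rows; each added dimension doubles the number of cuboids.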

Presto

Presto is lightweight and fast, supports near-real-time interactive queries, and is widely used at Facebook, which leaves little doubt about its scalability and stability as a distributed query engine. Compared with traditional MapReduce, it avoids the latency and disk I/O overhead. UDFs are supported in later versions.

Impala

Pros: Impala supports SQL queries and queries big data quickly. It can query existing data in place, which reduces data loading and conversion. Multiple storage formats are available (Parquet, Text, Avro, RCFile, SequenceFile). It can be used together with Hive.

Cons: user-defined functions (UDFs) were not supported in early versions. Full-text search on text fields is not supported. Transforms are not supported. There is no fault tolerance while a query is running. Memory requirements are high.

In scenarios where real-time requirements are low, such as generating monthly, quarterly, and annual reports, traditional Hadoop MapReduce can be used to process massive data sets. In scenarios with high real-time requirements, however, a fast response both meets those requirements and improves the user experience, and Impala's fast response time makes it the preferred query and analysis tool.

Kudu

Kudu essentially optimizes performance on top of columnar storage: it aims to improve storage efficiency, speed up column projection and filtering, and reduce CPU overhead at query time. Most of its other design decisions exist to support random reads and writes on top of columnar storage. For example, the SQL-like schema is an aid to columnar storage efficiency, and the required unique primary key is a strategy tailored to columnar storage. Other trade-offs, such as delta storage and the compaction strategy, were introduced to support random reads and writes and to reduce latency variance under this design.

In the official test results, for pure random read and write workloads or single-row lookups, these trade-offs mean that HBase's throughput is much better than Kudu's (2 to 4 times). Kudu's advantage is its support for SQL-like queries, which typically involve batch sequential scans and analysis.
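The difference between the two access patterns can be seen in a small sketch that stores the same records in a row layout and in a column layout. This is a simplification for illustration; it is neither HBase's nor Kudu's actual storage format, and the data and function names are hypothetical.

```python
rows = [
    {"id": 1, "city": "Oslo", "amount": 10},
    {"id": 2, "city": "Rome", "amount": 20},
    {"id": 3, "city": "Lima", "amount": 30},
]

# Row layout: each record is stored together, keyed by its primary key.
row_store = {r["id"]: r for r in rows}

# Column layout: each column is stored as its own array.
column_store = {name: [r[name] for r in rows] for name in rows[0]}


def lookup_row_store(key):
    """Single-row lookup: one access returns the whole record."""
    return row_store[key]


def lookup_column_store(key):
    """Single-row lookup: find the position, then read every column array."""
    pos = column_store["id"].index(key)
    return {name: values[pos] for name, values in column_store.items()}


def total_amount_column_store():
    """Analytic scan: only the 'amount' column is touched."""
    return sum(column_store["amount"])
```

A single-row lookup in the column layout has to reassemble the record from every column, while an aggregate over one column reads only that column. This is the intuition behind the row-oriented store winning on point reads and the columnar store winning on scans.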

