What is the difference between Apache Kylin and ClickHouse

2025-07-04 Update From: SLTechnology News&Howtos

Apache Kylin and ClickHouse are both popular big data OLAP engines. Kylin was originally developed by eBay's China R&D Center and contributed to the Apache Software Foundation in 2014. With its sub-second query latency and very high query concurrency, it has been adopted by many large companies, including Meituan, Didi, Ctrip, Beike, Tencent, and 58.com.

ClickHouse, which has been popular in the OLAP field for the past two years, was developed by the Russian search giant Yandex and open-sourced in 2016. Typical users include well-known companies such as ByteDance, Sina, and Tencent.

What are the differences between the two engines, what are their respective strengths, and how should you choose between them? This article compares them in terms of technical principles, storage structure, optimization methods, and advantage scenarios, as a reference for technology selection.

01 Technical principles

For technical principles, the comparison covers two aspects: architecture and ecosystem.

1.1 Technical Architecture

Kylin is a MOLAP (Multidimensional OLAP) engine built on Hadoop, and its core technology is the OLAP Cube. Unlike traditional MOLAP technology, Kylin runs on Hadoop, a powerful and highly scalable platform, so it can support massive data volumes (TB to PB). It precomputes the multidimensional Cube with MapReduce or Spark and loads it into HBase, a low-latency distributed database, thereby achieving sub-second query response. More recently, Kylin 4 replaced HBase with Spark + Parquet, further simplifying the architecture. Because most of the aggregation has already been done in the offline task (the Cube build), a SQL query no longer needs to access the raw data; it uses the index and the pre-aggregated results and only aggregates them further, which can be hundreds or even thousands of times faster than scanning the raw data. Because each query consumes little CPU, Kylin can support high concurrency, which makes it especially suitable for multi-user, interactive scenarios such as self-service analysis and fixed reports.
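The precomputation idea can be illustrated with a small Python sketch. This is not Kylin code: the rows, the dimension names, and the helpers `build_cube` and `query` are hypothetical, and a real Cube build runs as a distributed job. The sketch only shows that once every dimension combination has been aggregated offline, a GROUP BY becomes a lookup rather than a scan.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (city, category, year, sales)
ROWS = [
    ("Beijing", "food", 2023, 10),
    ("Beijing", "food", 2024, 20),
    ("Beijing", "travel", 2024, 5),
    ("Shanghai", "food", 2024, 7),
]
DIMS = ("city", "category", "year")


def build_cube(rows, dims):
    """Precompute SUM(sales) for every subset of dimensions (one cuboid each)."""
    cube = {}
    for size in range(len(dims) + 1):
        for idx in combinations(range(len(dims)), size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in idx)
                cuboid[key] += row[-1]
            cube[tuple(dims[i] for i in idx)] = dict(cuboid)
    return cube


def query(cube, group_by):
    """Answer a GROUP BY from the matching cuboid without touching raw rows."""
    key = tuple(d for d in DIMS if d in group_by)  # same order as at build time
    return cube[key]


CUBE = build_cube(ROWS, DIMS)
print(len(CUBE))              # 8 cuboids for 3 dimensions (2**3)
print(query(CUBE, {"city"}))  # {('Beijing',): 35, ('Shanghai',): 7}
```

The number of cuboids doubles with every added dimension, which is why the pruning techniques discussed later in the article matter.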

ClickHouse is a distributed ROLAP (Relational OLAP) analysis engine based on an MPP architecture. Every node has equal responsibilities and processes its own part of the data (shared nothing). ClickHouse implements a vectorized execution engine and combines a log-structured merge tree (MergeTree), sparse indexes, and the CPU's SIMD (Single Instruction, Multiple Data) instructions to take full advantage of the hardware. As a result, when ClickHouse handles computations over large amounts of data, it can usually push the CPU to its performance limit.
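A rough sketch of the shared-nothing pattern follows, in plain Python rather than ClickHouse itself. The sharding key (user id modulo the shard count), the event data, and all function names are assumptions made for illustration. Each shard aggregates only its own rows, and an initiating node merges the partial results.

```python
from collections import Counter


def shard_rows(rows, num_shards):
    """Distribute rows across shards by user id (hypothetical sharding key)."""
    shards = [[] for _ in range(num_shards)]
    for user_id, event in rows:
        shards[user_id % num_shards].append((user_id, event))
    return shards


def local_count(shard):
    """Each node aggregates only the rows it owns (shared nothing)."""
    return Counter(event for _, event in shard)


def distributed_count(rows, num_shards=3):
    """The initiating node merges the partial counts from every shard."""
    total = Counter()
    for shard in shard_rows(rows, num_shards):
        total += local_count(shard)
    return dict(total)


EVENTS = [(1, "click"), (2, "view"), (3, "click"), (4, "click"), (5, "view")]
RESULT = distributed_count(EVENTS)
print(RESULT)
```

Nothing is precomputed here: every query does the full aggregation at request time, which is the trade-off the rest of the article contrasts with Kylin's approach.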

1.2 Technology ecosystem

Kylin is written in Java and is fully integrated into the Hadoop ecosystem. It uses HDFS for distributed storage; the computing engine can be MapReduce, Spark, or Flink; the storage engine can be HBase or Parquet (combined with Spark). Supported data sources include Hive, Kafka, and RDBMSs. Multi-node coordination depends on ZooKeeper. Kylin is compatible with Hive metadata and supports only SELECT queries; schema changes must be made in Hive and then synchronized to Kylin. Modeling and similar operations are done through the Web UI, tasks are scheduled through the REST API, and task progress can be checked in the Web UI.

ClickHouse is written in C++, forms a self-contained system, and relies little on third-party tools. It supports a more complete set of DDL and DML statements, and most operations can be done from the command line with SQL. Distributed clusters rely on ZooKeeper for coordination, while a single node does not need ZooKeeper. Most configuration is done by editing configuration files.

02 Storage

Kylin uses HBase or Parquet from the Hadoop ecosystem for storage. It relies on the HBase rowkey index or the Parquet row-group sparse index to speed up queries, and uses HBase Region Servers or Spark executors for distributed parallel computing. ClickHouse manages its own data storage, whose features include MergeTree as the main storage structure, data compression and partitioning, and sparse indexes. The two engines are compared in detail below.

2.1 Storage structure of Kylin

Kylin pre-aggregates the data into a multidimensional Cube and, at query time, dynamically selects the optimal Cuboid (similar to a materialized view) according to the query conditions, which greatly reduces both CPU computation and I/O.

During the Cube build, Kylin encodes and compresses dimension values, for example with dictionary encoding, to keep data storage as small as possible. Because Kylin's storage engine and build engine are pluggable, the storage structure differs between storage engines.

HBase storage

When HBase is the storage engine, each dimension is encoded during precomputation so that dimension values have a fixed length. When the HFile is generated, the dimensions in the result are concatenated into the rowkey and the aggregate values are stored as the value. The order of the dimensions determines the rowkey design and directly affects query efficiency.
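The following Python sketch shows why dimension order in the rowkey matters. It is a simplified model and does not reproduce Kylin's actual encoders or the HBase API: the dictionary scheme, the 2-byte width, and the helper names are all assumptions. A filter on the leading dimension becomes a rowkey prefix, so the matching rows sit next to each other in sorted order.

```python
def build_dictionary(values):
    """Map each distinct value to a small integer id (assigned in sorted order)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def encode_rowkey(row, dim_order, dictionaries, width=2):
    """Concatenate fixed-width encoded dimension ids in the chosen order."""
    return b"".join(
        dictionaries[d][row[d]].to_bytes(width, "big") for d in dim_order
    )


# Hypothetical pre-aggregated results
ROWS = [
    {"city": "Beijing", "category": "food", "sales": 30},
    {"city": "Beijing", "category": "travel", "sales": 5},
    {"city": "Shanghai", "category": "food", "sales": 7},
]
DICTS = {d: build_dictionary(r[d] for r in ROWS) for d in ("city", "category")}

# (rowkey, aggregate value) pairs kept sorted by rowkey, as a sorted store would
TABLE = sorted(
    (encode_rowkey(r, ("city", "category"), DICTS), r["sales"]) for r in ROWS
)


def prefix_scan(table, prefix):
    """Return the values whose rowkey starts with the given prefix."""
    return [value for key, value in table if key.startswith(prefix)]


BEIJING = DICTS["city"]["Beijing"].to_bytes(2, "big")
print(prefix_scan(TABLE, BEIJING))  # [30, 5]
```

With `city` first, a filter on `city` touches one contiguous key range. A filter on `category` alone has no usable prefix and would have to check every key, which is why frequently filtered dimensions are usually placed first.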

Parquet storage engine

When Parquet is the storage format, dimension values and aggregate values are stored directly, without encoding or rowkey concatenation. Before writing to Parquet, the computing engine sorts the results by dimension; the earlier a dimension appears in the sort order, the more efficient it is to filter on it. In addition, the number of shards and the number of row groups in the Parquet files under the same partition also affect query efficiency.

2.2 Storage structure of ClickHouse

ClickHouse generally requires the user to specify partition columns when creating a table. It uses data compression and purely columnar storage: MergeTree stores and compresses each column separately.

Data is always written to disk in the form of data parts, and when certain conditions are met, background threads periodically merge these parts.

As the amount of data keeps growing, ClickHouse merges the parts within each partition directory to make data scans more efficient.

ClickHouse also maintains a sparse index over the data blocks. When processing a query, it can use the sparse index to reduce the amount of data scanned and speed up the query.
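A sparse index can be sketched in a few lines of Python. This is a conceptual model only, not ClickHouse's on-disk format: the granule size of 4 is far smaller than a real granule, and the function names are hypothetical. The index keeps one key per granule of sorted rows, and a range filter is mapped to the few granules that might contain matches.

```python
from bisect import bisect_left, bisect_right

GRANULE = 4  # rows per granule; illustrative value, kept tiny for readability


def build_sparse_index(sorted_keys, granule=GRANULE):
    """Keep only the first key of each granule (one mark per granule)."""
    return sorted_keys[::granule]


def granules_to_scan(index, low, high):
    """Return the granule numbers that may hold keys in the range [low, high]."""
    # Step back one granule for the lower bound so that keys equal to a mark
    # that spill over from the previous granule are not missed.
    first = max(bisect_left(index, low) - 1, 0)
    last = bisect_right(index, high) - 1
    return range(first, last + 1)


KEYS = list(range(0, 40, 2))     # 20 sorted primary-key values
INDEX = build_sparse_index(KEYS)  # [0, 8, 16, 24, 32]
HITS = granules_to_scan(INDEX, 10, 18)
print(list(HITS))                 # [1, 2]
```

Only 2 of the 5 granules are read for this range. The index holds one entry per granule rather than one per row, so it stays small enough to keep in memory.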

03 Optimization methods

Kylin and ClickHouse are both big data processing systems. As data volume keeps growing, appropriate optimization pays off disproportionately: it can greatly reduce query response time and storage space. Because the two have different computing and storage systems, their optimization methods also differ. This section focuses on the optimization methods of each.

3.1 Optimization methods of Kylin

The core principle of Kylin is precomputation. As described in the technical principles section, Kylin's computing engine is Apache Spark or MapReduce, its storage is HBase or Parquet, and SQL parsing and post-computation use Apache Calcite. Kylin has developed a series of optimization methods to address dimension explosion and excessive data scanning, including aggregation groups, joint dimensions, derived dimensions, dimension table snapshots, rowkey order, and shard-by columns.

Aggregation groups: prune cuboids by aggregation group to avoid precomputing unnecessary combinations.

Joint dimensions: group dimensions that usually appear together so that they are treated as one unit, reducing unnecessary precomputation.

Derived dimensions: mark dimensions that can be computed from other dimensions (for example, year, month, and day can be computed from a date) as derived, so that they are not precomputed.

Dimension table snapshots: keep dimension tables in memory to reduce storage space.

Dictionary encoding: reduce the storage space used.

Rowkey encoding and shard-by columns: speed up queries by reducing the number of rows scanned.
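The effect of these pruning rules on the number of cuboids can be counted with a short Python sketch. It is a simplified model, not Kylin's cuboid scheduler: the dimension names and the `enumerate_cuboids` helper are hypothetical, derived dimensions are simply left out of the enumeration, and a joint group is treated as all-or-nothing.

```python
from itertools import combinations

DIMS = ("date", "year", "month", "country", "city")


def enumerate_cuboids(dims, derived=(), joint=()):
    """List the dimension combinations to precompute under simple pruning rules."""
    derived = set(derived)
    joint = frozenset(joint)
    base = [d for d in dims if d not in derived]  # derived dims are not built
    kept = []
    for size in range(len(base) + 1):
        for combo in combinations(base, size):
            chosen = frozenset(combo)
            # A joint group is all-or-nothing: skip partial overlaps.
            if chosen & joint and not joint <= chosen:
                continue
            kept.append(chosen)
    return kept


FULL = enumerate_cuboids(DIMS)
NO_DERIVED = enumerate_cuboids(DIMS, derived=("year", "month"))
PRUNED = enumerate_cuboids(
    DIMS, derived=("year", "month"), joint=("country", "city")
)
print(len(FULL), len(NO_DERIVED), len(PRUNED))  # 32 8 4
```

In this toy case the 32 combinations of five dimensions shrink to 4. The cost is that queries touching a derived dimension, or only part of a joint group, need some extra computation at query time.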

3.2 Optimization methods of ClickHouse

The most common optimization for an MPP system is splitting databases and tables. Likewise, the most common optimizations in ClickHouse are partitioning and sharding. ClickHouse also offers some special table engines. In summary, the optimization methods include:

Use a flat (wide) table structure instead of multi-table JOINs, avoiding expensive JOIN operations and data shuffling.

Set reasonable partition keys, sort keys, and secondary indexes to reduce data scanning.

Build a distributed ClickHouse cluster and add shards and replicas to increase computing resources.

Combine materialized views with precomputing engines such as SummingMergeTree and AggregatingMergeTree where appropriate.

As performance and concurrency requirements rise, so does the machine's resource consumption. The official ClickHouse documentation recommends that concurrency not exceed 100 queries. When higher concurrency is required, some of ClickHouse's special engines can be used to reduce resource consumption.

SummingMergeTree and AggregatingMergeTree are the most commonly used special engines. Both are derived from MergeTree. In essence, they precompute the data that will be queried and store the results in ClickHouse, which further reduces resource consumption at query time.
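The idea behind SummingMergeTree can be modeled in Python as a merge that collapses rows sharing the same sorting key. This is a conceptual sketch with hypothetical data and a hypothetical `merge_parts_summing` helper, not how ClickHouse implements merges.

```python
from collections import defaultdict


def merge_parts_summing(parts):
    """Merge data parts, summing the metric of rows that share a sorting key."""
    totals = defaultdict(int)
    for part in parts:
        for key, value in part:
            totals[key] += value
    return sorted(totals.items())  # the merged part stays sorted by key


# Two parts written at different times; the key is (date, page), the metric is views
PART_1 = [(("2024-01-01", "cart"), 1), (("2024-01-01", "home"), 3)]
PART_2 = [(("2024-01-01", "home"), 2), (("2024-01-02", "home"), 4)]

MERGED = merge_parts_summing([PART_1, PART_2])
for key, views in MERGED:
    print(key, views)
```

Four input rows become three after the merge. Because merges run in the background at unspecified times, rows with the same key can still sit in different parts, so queries against such a table normally still apply `sum()` with `GROUP BY` to get correct totals.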

In principle, SummingMergeTree and AggregatingMergeTree are similar to Kylin's Cube. However, when there are many dimensions, managing a large number of materialized views becomes impractical and costly. Unlike ClickHouse, Kylin provides a set of simple, direct optimization methods to avoid dimension explosion.

As you can see, both ClickHouse and Kylin provide ways to reduce storage footprint and the number of rows scanned at query time. It is generally believed that, with proper optimization, both can meet business requirements at large data volumes. ClickHouse computes on the fly with MPP, while Kylin precomputes. Because the two take different technical routes, their advantage scenarios also differ.

04 Advantage scenarios

Because Kylin uses precomputation, it suits aggregate queries with fixed patterns, that is, queries whose JOIN, GROUP BY, and WHERE patterns are known in advance. The larger the data volume, the more obvious Kylin's advantage. Kylin is particularly strong at COUNT DISTINCT, Top N, and percentile calculations, and is widely used for dashboards, reports of all kinds, large-screen displays, traffic statistics, and user behavior analysis. Meituan, Aurora, Beike, and others use Kylin to build their data service platforms, serving millions to tens of millions of queries a day, most of which complete within 2-3 seconds. There are few better alternatives for such high-concurrency scenarios.

Because of the strong on-the-fly computing power of its MPP architecture, ClickHouse is the better fit when queries are more flexible or detail-level queries are needed, and concurrency is modest. Typical scenarios include user tag filtering over tables with very many columns and arbitrary combinations of WHERE conditions, and complex ad hoc queries with low concurrency. If the data volume and query traffic are large, a distributed ClickHouse cluster is needed, which is demanding for operations staff.

If queries are very flexible but infrequent, computing on the fly saves resources: because there are few queries, the overall cost stays reasonable even if each query consumes a lot of computing resources. If queries follow fixed patterns and are numerous, Kylin is the better fit: the results are computed once at considerable cost and stored, and that up-front cost is amortized over every query, which makes it the most economical choice.
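The amortization argument can be made concrete with a small calculation. The cost units and numbers below are invented purely for illustration and do not come from any benchmark. The function returns the query count at which paying a one-time build cost stops being more expensive than computing every query on the fly.

```python
import math


def total_cost(num_queries, build_cost, per_query_cost):
    """Overall cost: a one-time build plus the cost of each query."""
    return build_cost + num_queries * per_query_cost


def break_even_queries(build_cost, precomputed_query_cost, on_the_fly_query_cost):
    """Smallest query count at which precomputation is no more expensive overall."""
    saving = on_the_fly_query_cost - precomputed_query_cost
    if saving <= 0:
        return None  # precomputation never pays off
    return math.ceil(build_cost / saving)


# Hypothetical costs in arbitrary units
BUILD = 1000       # one-time precomputation cost
PRECOMPUTED = 1    # cost of a query served from precomputed results
ON_THE_FLY = 21    # cost of a query computed from raw data

N = break_even_queries(BUILD, PRECOMPUTED, ON_THE_FLY)
print(N)                                    # 50
print(total_cost(N, BUILD, PRECOMPUTED))    # 1050
print(total_cost(N, 0, ON_THE_FLY))         # 1050
```

Below the break-even count, on-the-fly computation is cheaper overall; above it, each additional query widens the gap in favor of precomputation.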

05 Summary

This article has compared Kylin and ClickHouse in terms of technical principles, storage structure, optimization methods, and advantage scenarios.

Technical principles: ClickHouse adopts an MPP, shared-nothing architecture. It is flexible in querying and easy to install, deploy, and use, but because data is stored locally on each node, scaling out and operations are relatively troublesome. Kylin uses MOLAP precomputation on Hadoop with a shared-storage architecture that separates compute from storage (especially with Parquet storage), which suits scenarios where queries are relatively fixed but data volume is large. Being based on Hadoop, it integrates easily with an existing big data platform and scales horizontally (especially after the move from HBase to Spark + Parquet).

Storage structure: ClickHouse stores detail data using the MergeTree storage structure and sparse indexes, and aggregate tables can be created on top to speed up queries. Kylin stores pre-aggregated data in HBase or Parquet; its materialized views are transparent to queries, and aggregate queries are very efficient, but detail queries are not supported.

Optimization methods: ClickHouse offers partitioning, sharding, and secondary indexes, while Kylin offers aggregation groups, joint dimensions, derived dimensions, hierarchy dimensions, and rowkey ordering.

Advantage scenarios: ClickHouse is usually suited to flexible queries over data at the scale of hundreds of millions to billions (larger volumes are supported, but cluster operations become harder). Kylin is better suited to relatively fixed query patterns at the scale of billions to tens of billions.

The following figure is a summary of many aspects:

Taken together, Kylin and ClickHouse each have their own areas of application. In modern data analysis, no single engine fits every scenario. Enterprises need to choose the right tool for the problem at hand, based on their own business scenarios.
