This article explains why Ctrip's log platform migrated from ES (Elasticsearch) to ClickHouse, and how the migration was carried out.
Why choose ClickHouse?
ClickHouse is a high-performance, column-oriented, distributed database management system. We tested ClickHouse and found that it has the following advantages:
① ClickHouse has high write throughput. A single server ingests logs at 50 MB/s to 200 MB/s, more than 600,000 records per second, which is over 5 times the rate of ES.
② Write rejections, which are common in ES and lead to data loss, write delays, and other problems, rarely occur in ClickHouse.
③ Query speed is high. Officially, with the data in the page cache, a single server scans roughly 2-30 GB per second; without the page cache, query speed depends on the disk read rate and the data compression ratio. In our tests, ClickHouse queries were 5 to 30 times faster than ES.
ClickHouse also costs less in servers than ES:
On the one hand, ClickHouse compresses data better than ES: the same data takes only 1/3 to 1/30 of the disk space it needs in ES. Besides saving disk space, this effectively reduces disk IO, which is one reason ClickHouse queries are more efficient.
On the other hand, ClickHouse uses less memory and CPU than ES. We estimate that processing logs with ClickHouse can cut server costs in half.
④ Compared with ES, ClickHouse is more stable and cheaper to operate and maintain.
⑤ In ES, load is uneven across Groups; some Groups run hot, which causes write rejections and forces manual index migration. In ClickHouse, the cluster and shard strategy with round-robin writes distributes data evenly across all nodes.
⑥ A large query in ES may cause an OOM; ClickHouse, through preset query limits, fails such a query without affecting overall stability.
⑦ ES needs hot/cold data separation. Moving 200 TB of data per day runs into problems, and once a move fails, the hot nodes can quickly fill up, creating a lot of manual maintenance and recovery work.
⑧ ClickHouse partitions by day, so hot/cold separation is generally unnecessary; in the special scenarios where users do need it, the data volume involved is much smaller, and ClickHouse's built-in tiered-storage mechanism handles it well (a sketch follows this list).
⑨ ClickHouse uses SQL syntax, which is simpler and cheaper to learn than ES's DSL.
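As a concrete illustration of the tiered-storage mechanism mentioned in ⑧, here is a minimal sketch, assuming a storage policy named hot_to_cold with an SSD volume and an HDD volume has been declared in the server configuration; the database, table, and policy names are all invented for this example.

```sql
-- Sketch only: assumes a storage policy 'hot_to_cold' with volumes 'hot' (SSD)
-- and 'cold' (HDD) declared in the server configuration; all names are invented.
CREATE TABLE logs.app_log_tiered
(
    `timestamp` DateTime,
    `level`     LowCardinality(String),
    `message`   String
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)                     -- one partition per day
ORDER BY (level, timestamp)
TTL timestamp + INTERVAL 7 DAY TO VOLUME 'cold',   -- age parts from SSD to HDD
    timestamp + INTERVAL 30 DAY DELETE             -- drop data entirely after 30 days
SETTINGS storage_policy = 'hot_to_cold';
```

Parts then move between volumes purely by age, with no manual relocation.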
Ctrip's log analysis scenario fits well: logs are already formatted as JSON before entering ES, and logs of the same type share a unified Schema, which matches ClickHouse's table model.
Log queries generally count quantities, totals, and averages along some dimension, which matches the usage scenario of ClickHouse's column-oriented storage.
Occasionally a few scenarios need fuzzy matching on strings; after other conditions have filtered out most of the data, ClickHouse handles fuzzy matching on the small remainder perfectly well.
In addition, we found that more than 90% of the logs did not use ES's full-text indexing feature, so we decided to try to use ClickHouse to process logs.
Use ClickHouse to process logs
ClickHouse high-availability deployment
① Disaster-recovery deployment and cluster planning
We use multiple shards with 2 replicas each, backed up to each other through ZooKeeper, so that one server per shard can go down without losing data.
To accommodate logs of different volumes, we build clusters in two sizes: 6 nodes and 20 nodes.
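As a sketch of what one table in such a layout might look like, assuming a cluster named log_cluster and the usual {shard} and {replica} macros in each server's configuration; the database and table names are invented for illustration.

```sql
-- Two replicas per shard, kept in sync through ZooKeeper; one server of a
-- shard can go down without data loss.
CREATE TABLE logs.app_log_local ON CLUSTER log_cluster
(
    `timestamp` DateTime,
    `message`   String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/app_log_local', '{replica}')
PARTITION BY toDate(timestamp)
ORDER BY timestamp;
```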
② Cross-IDC deployment
Ctrip has multiple IDCs, and logs are spread across them. To avoid moving logs between IDCs, we deploy a ClickHouse cluster in each IDC, configure a cross-IDC Cluster in ClickHouse, and create distributed tables on top of it, so that a single search spans the data in all IDCs.
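A hedged sketch of such a distributed table, building on the hypothetical logs.app_log_local table above; cross_idc_cluster is assumed to be defined in the server configuration and to enumerate the shards of every IDC's cluster.

```sql
-- A query against this table fans out to the local tables in every IDC.
CREATE TABLE logs.app_log_all ON CLUSTER cross_idc_cluster
AS logs.app_log_local
ENGINE = Distributed(cross_idc_cluster, logs, app_log_local);
```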
③ Several important parameters, as follows (a SQL sketch follows the list):
max_threads: 32 # the number of query threads available to one user.
max_memory_usage: 10000000000 # at most about 9.31 GiB of memory per query.
max_execution_time: 30 # maximum execution time, in seconds, of a single query.
skip_unavailable_shards: 1 # when querying through a distributed table, data from the remaining shards is still returned if one shard is unreachable.
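These limits normally live in the user profile, but the same settings can also be applied per session or per query; a sketch, reusing the invented distributed table from above.

```sql
-- Session-level equivalents of the profile settings above (values from the article).
SET max_threads = 32;                -- threads available to one query
SET max_memory_usage = 10000000000;  -- about 9.31 GiB per query
SET max_execution_time = 30;         -- seconds before a long query is aborted

-- Per-query form: keep reading from the reachable shards of a distributed table.
SELECT count()
FROM logs.app_log_all
SETTINGS skip_unavailable_shards = 1;
```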
④ A pitfall we hit
We originally placed the Cluster configuration in the config.d directory. After an unexpected ClickHouse restart we found that some shards of the distributed tables could not be reached, so we stopped using config.d and now keep the Cluster configuration in metrika.xml.
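One way to confirm after a restart that the Cluster definition was actually loaded is to query the system.clusters table; the cluster name is the assumed one from above.

```sql
-- Every configured shard and replica should appear here after startup.
SELECT cluster, shard_num, replica_num, host_name
FROM system.clusters
WHERE cluster = 'cross_idc_cluster';
```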
Consuming data into ClickHouse
We use gohangout to consume data into ClickHouse. Some suggestions for writing data:
Write to all servers in the ClickHouse cluster in round-robin fashion to keep the data roughly evenly distributed.
Write large batches at low frequency to reduce the number of parts and the server-side merge load and to avoid Too many parts exceptions. We control write volume and frequency with two thresholds: a batch is flushed after 100,000 records or after 30 seconds, whichever comes first.
Write to the local tables, not the distributed table. A distributed table, after receiving data, splits it into many parts and forwards them to the other servers, which increases inter-server network traffic, increases the merge workload, slows down writes, and raises the chance of Too many parts.
Consider the partitioning scheme when creating the table. We once hit an exception where someone had set the partition key to the raw timestamp, so every insert reported Too many parts. We normally partition by day (see the sketch after this list).
Primary key and index settings, as well as out-of-order data, can also slow down writes.
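Tying these suggestions together, a sketch of a batched write into the hypothetical local table from earlier; the batching thresholds themselves are enforced by the writer (gohangout in our case), not by SQL.

```sql
-- The local table sketched earlier uses PARTITION BY toDate(timestamp), i.e. one
-- partition per day; PARTITION BY timestamp itself would create a partition per
-- distinct value and report Too many parts almost immediately.
-- The writer rotates through the cluster's nodes and flushes one large batch at
-- a time, e.g. every 100,000 records or every 30 seconds, whichever comes first.
INSERT INTO logs.app_log_local (timestamp, message) VALUES
    (now(), 'example log line 1'),
    (now(), 'example log line 2');
```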
Data presentation
We investigated tools such as Superset, Metabase, and Grafana, and finally decided to build on Kibana3, adding ClickHouse support for chart display.
The main reason is that Kibana3's powerful data-filtering capability is missing from many other systems, and the cost of migrating to another system would be too high for users to adapt to in the short term.
So far we have developed ClickHouse versions of the commonly used charts on K3 (terms, histogram, percentiles, ranges, table). The user experience is essentially on par with the original, and after optimization the query efficiency has improved greatly.
Query optimization
The Table panel in Kibana displays detailed log data. It typically queries all fields over the last hour, yet ultimately shows only the first 500 records. This pattern is very unfriendly to ClickHouse.
To solve this problem, we split the Table panel query into two steps:
First, query the amount of data per unit time interval, and from the number of records ultimately displayed compute a reasonable time range for the query.
Second, over the narrowed time range, query the detail data, fetching only the columns the Table panel is configured to display by default (a sketch of both steps follows).
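A hedged sketch of the two steps against the hypothetical distributed table; the minute granularity, the 5-minute narrowed range, and the column names are all illustrative.

```sql
-- Step 1: row counts per minute over the last hour, used to work out how far
-- back the 500 displayed records actually reach.
SELECT toStartOfMinute(timestamp) AS minute, count() AS cnt
FROM logs.app_log_all
WHERE timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute DESC;

-- Step 2: fetch only the columns the panel displays, over the narrowed range.
SELECT timestamp, message
FROM logs.app_log_all
WHERE timestamp >= now() - INTERVAL 5 MINUTE
ORDER BY timestamp DESC
LIMIT 500;
```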
After these optimizations, query time drops to 1/60 of the original, the queried columns are cut by 50%, and the amount of data read falls to 1/120 of the original.
ClickHouse also provides a variety of approximate calculation functions that keep accuracy relatively high while greatly reducing the amount of computation.
Using a MATERIALIZED VIEW or a MATERIALIZED column to move the computation to write time can likewise reduce the data volume and computation of queries.
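Two hedged sketches: an approximate aggregate, and a per-minute rollup materialized view. The rollup and all names are invented examples of the technique, not the article's actual views.

```sql
-- Approximate distinct count: uniqCombined trades a small error for far less
-- memory and CPU than uniqExact.
SELECT uniqCombined(message) FROM logs.app_log_all;

-- Pre-aggregate at write time so dashboards scan a small table.
CREATE MATERIALIZED VIEW logs.app_log_per_min
ENGINE = SummingMergeTree
PARTITION BY toDate(minute)
ORDER BY minute
AS SELECT
    toStartOfMinute(timestamp) AS minute,
    count() AS cnt
FROM logs.app_log_local
GROUP BY minute;
```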
Dashboard migration
Because there are many dashboards on Kibana3, we developed a migration tool that migrates a dashboard by rewriting its configuration in the kibana-init-* index.
Effect of connecting to ClickHouse
Currently our cluster holds about 100 TB of logs (about 600 TB before compression). The main monitoring metrics of the ClickHouse servers are discussed below.
ClickHouse uses less memory than ES. To improve query efficiency, ES keeps a lot of data in memory, such as segment index data, the filter cache, the field data cache, and the indexing buffer.
ES memory usage is proportional to the number of indexes, data volume, and write and query load. Deleting (taking offline) indexes, migrating indexes, or expanding capacity are the common ways to deal with ES memory problems.
However, deleting indexes prevents users from keeping data for longer, and expanding the server fleet raises costs.
ClickHouse's memory consumption mainly covers the memory engine, data indexes, data loaded into memory for computation, search results, and so on. How much log data ClickHouse can hold, and for how long, is mainly constrained by disk.
ClickHouse saves at least 60% of disk space compared with ES.
In our comparison, the disk space used in ClickHouse is 32% of ES for Netflow logs, 18% of ES for CDN logs, and 22.5% of ES for dbLog logs.
In terms of query speed, ClickHouse is 4.4 to 38 times faster than ES. Queries that could not run on ES at all are now basically solved, and slow queries have improved greatly.
Because of its data volume, Netflow could not be queried in ES at all; after optimization in ClickHouse the query takes 29.5 s. For CDN logs, the CK query is 38 times faster than in ES, and for dbLog it is 4.4 times faster.
On the fairness of the speed comparison: in the production environment we cannot guarantee that the ES and ClickHouse environments are identical. ES runs on 40-core, 256 GB servers, one ES instance per server, with about 3 TB of data per server.
ClickHouse runs on 32-core, 128 GB servers, one ClickHouse instance per server, with about 18 TB of data per server.
Processing log queries with ClickHouse is much faster, the problem of short data retention has largely been solved, and the user experience has improved as well.
We estimate that 50% of the server resources of the current ES log cluster would suffice to handle the existing ES log volume while providing a better user experience than today.
ClickHouse basic operation and maintenance
Overall, operating and maintaining ClickHouse is simpler than ES, mainly covering the following aspects:
① Onboarding new logs and performance optimization.
② Cleaning up expired logs: a scheduled task drops the partitions of expired logs every day (a sketch follows).
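Partitioning by day makes that cleanup a single statement per table; a sketch, where the scheduled job computes the expired date and the literal below is only an example.

```sql
-- Drop one day's partition across the cluster (date filled in by the job).
ALTER TABLE logs.app_log_local ON CLUSTER log_cluster DROP PARTITION '2024-01-01';
```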
③ ClickHouse monitoring, implemented with clickhouse-exporter + VictoriaMetrics + Grafana.
④ Data migration: thanks to ClickHouse's distributed tables, we generally do not move historical data. We simply point new data at the new cluster and query across both clusters through a distributed table.
As time passes, the historical data expires and is cleaned up; once all the old cluster's data is offline, the migration from the old cluster to the new one is complete.
When data really does need to be moved, we use clickhouse-copier or copy it directly.
⑤ FAQ handling:
Slow queries: terminate the offending query with KILL QUERY (a sketch follows) and then apply the optimization schemes described earlier.
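A hedged sketch of the usual sequence, reading from the system.processes table; the query_id value is a placeholder.

```sql
-- Find the long-running queries first.
SELECT query_id, elapsed, substring(query, 1, 100) AS query_head
FROM system.processes
ORDER BY elapsed DESC;

-- Then terminate the offender by its id (placeholder below).
KILL QUERY WHERE query_id = 'query-id-from-above';
```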
Too many parts exception: this occurs when parts are being created faster than merges can consume them.
The main causes of excessive parts include the following (a diagnostic query follows this list):
Unreasonable settings.
Writing to ClickHouse in small, high-frequency batches.
Writing to a ClickHouse distributed table.
The number of merge threads configured in ClickHouse being too small.
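To check whether merges are in fact falling behind, one can count active parts per partition in the system.parts table; a sketch.

```sql
-- A steadily growing count of active parts suggests merges cannot keep up.
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 20;
```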
Unable to start: we have previously run into cases where ClickHouse could not start, mainly for two reasons:
The file system was corrupted, which can be resolved by repairing the file system.
Abnormal data in a table caused ClickHouse to fail while loading it. You can delete the abnormal data and then start, or move the abnormal files into the detached directory, wait for ClickHouse to come up, and then attach the files back to recover the data (a sketch follows).
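A sketch of the re-attach step, assuming the healthy files have been moved back under the table's detached/ directory; the table and partition value are the invented examples from earlier.

```sql
-- Re-attach a detached partition after the server is up again.
ALTER TABLE logs.app_log_local ATTACH PARTITION '2024-01-01';
```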