In this article, the editor gives a detailed introduction to "how the database ClickHouse is applied in the field of big data". The content is detailed, the steps are clear, and the details are handled carefully; I hope this article helps you resolve your doubts.
I. Preface
ClickHouse's advantage is that it keeps good query performance even when the data volume is large (tens of millions of rows and beyond).
1. Application scenarios
ClickHouse belongs to the OLAP (online analytical processing) field. Specifically, it is a good fit for workloads with the following characteristics:
Wide tables are built from the transactional database tables through join queries.
Aggregate (statistical) calculations are used heavily.
High query efficiency is required, and results that are not yet final can be tolerated within a limited time window (eventual consistency).
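As a rough illustration of such a workload, the following is a sketch of a typical OLAP-style aggregation; the order_wide table and its order_date / amount columns are invented names, not from the original text:

SELECT toYYYYMM(order_date) AS month,
       count()              AS orders,
       sum(amount)          AS revenue
FROM order_wide
GROUP BY month
ORDER BY month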
2. Learning approach
Most people learning ClickHouse start from an OLTP database such as MySQL. With tens of millions of rows, merely counting the rows of an InnoDB table is already very slow, and for some aggregate functions the latency is simply unacceptable.
Upgrading the database hardware improves query efficiency to a certain extent, but it cannot solve the problem completely. The popularity of ClickHouse as soon as it was launched confirms how urgently developers need reasonable query efficiency on top of large data volumes.
Taking the typical read-write-separated MySQL deployment as an example, this article compares it horizontally with ClickHouse, looks at why MySQL queries are slow and why ClickHouse queries are fast, and considers how OLTP and OLAP can work together.
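As a rough, hypothetical illustration of that comparison (the orders table and amount column are made-up names; actual timings depend entirely on hardware and data), the same aggregate could be issued to both systems:

-- MySQL (InnoDB): the aggregate forces a scan over full rows
SELECT count(*), sum(amount) FROM orders;
-- ClickHouse: only the amount column needs to be read from disk
SELECT count(), sum(amount) FROM orders;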
II. Background knowledge
(1) Disk IO
1. Data volume and query efficiency
Once the data volume crosses a certain boundary, query efficiency drops sharply, and the main culprit is disk IO. In MySQL, for example, server-side optimization (spending more hardware resources) buys some performance, but it cannot fundamentally improve query efficiency at the software level.
For large tables with tens of millions of rows, the low query performance is mainly disk-related, with two influencing factors: one is locating the data through the index; the other is the disk IO itself.
(2) Performance comparison
1. Disk working mechanism
When the operating system reads data from disk into memory, it roughly goes through these steps: index to the location where the data is stored, then IO the data page by page. Once the data has been located, the IO itself is relatively fast, though still not in the same order of magnitude as memory IO.
For a page of disk IO, hitting a single record costs the same as hitting every record on that page: the IO time is identical. In practice, querying one record can therefore cost about the same as querying several consecutive records, because the underlying unit of IO is the disk page.
2. Row storage vs. column storage
A simple example compares the impact of row storage and column storage on queries, using disk IO as the main technical indicator. The test data volume is ten million rows.
CREATE TABLE `name` (
  `id` bigint(20) NOT NULL COMMENT 'ID',
  `name` varchar(32) DEFAULT NULL COMMENT 'name',
  `deleted` tinyint(1) NOT NULL DEFAULT '0' COMMENT 'logical deletion',
  `create_time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'creation time',
  `update_time` datetime DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time',
  `delete_time` datetime DEFAULT NULL COMMENT 'deletion time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='name information table';
Different scenarios are used to compare how much disk IO each storage layout consumes, and from that the query efficiency.
(1) Query name by id
Storage mode | Index mode | Disk IO | Execution procedure
Row storage | hash index O(1), or BTree index O(logN) | whole-row data | selection performed on the whole row on disk; projection performed in memory
Column storage | primary-key sparse index + secondary index | a single row of the name column | selection performed on disk; projection also completed on disk
Row storage saves time on the index; column storage saves time on disk IO. With a small amount of data the difference is negligible, so this round is a tie.
(2) Query name by a batch of ids
A batch query here means a bounded interval query or a bounded set query returning fewer than 100 rows. Both shapes involve a small amount of data, so the performance difference is small; still, looking at the process carefully, there are clear differences between the two.
Interval queries are more efficient than bounded set queries: the data of an interval query is stored contiguously, so a single index lookup and a single page of disk IO (for a small amount of data) are enough, and the queried data is compact. Row storage has a slight advantage here; since only a single field is queried, only one disk index lookup is needed (for column storage, the number of lookups equals the number of columns queried).
Because the conditions of a set query are not contiguous, each element must be located and read separately: a set of N (random) elements needs N index lookups, with the IO still done page by page.
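To make the two batch shapes concrete, here is a small sketch against the name table defined above; the id values are made up:

-- bounded interval: neighbouring rows, typically one index lookup and one disk page
SELECT name FROM name WHERE id BETWEEN 1000 AND 1100;
-- bounded set: scattered ids, one index lookup (and possibly one page of IO) per element
SELECT name FROM name WHERE id IN (3, 97, 1024, 500000);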
(3) Query a whole row of data by id
Column storage is usually more efficient than row storage for queries, especially for wide tables (aggregation tables with dozens of columns or more). Most queries ask for one column or a few columns, and a column-oriented database can drastically reduce the range of disk data scanned and the IO between disk and memory, which improves query efficiency at the IO level.
Extreme case
Suppose the database stores only id and name, both non-empty and required. In that case row and column storage are similar at the IO level: the range of data scanned on disk and the read/write IO are about the same. When querying name by a single id or a batch of ids, row storage may even reach O(1) time complexity with the help of a hash index.
Real data is never that pure. A row usually also carries a creation time, modification time, deletion time, and modification times of some core fields. With little data these auxiliary fields barely affect queries, but once the volume passes a certain threshold their impact becomes more and more visible. With column storage, the disk scans and IO caused by these auxiliary fields can simply be ignored.
Generally speaking, from a query point of view, storing by column is better than storing by row.
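As a sketch of this difference, again using the name table from above as a stand-in (the gap grows with wider tables):

-- whole-row query: row storage loses little here
SELECT * FROM name WHERE id = 42;
-- single-column query: column storage only has to scan the name and create_time columns
SELECT name FROM name WHERE create_time >= '2024-01-01';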
III. Basic knowledge
(1) Table structure
The table structure used by ClickHouse differs somewhat from that of common relational databases.
1. Sorting
The sort attribute is required by the MergeTree family of engines. The sorting strategy of the data inside a partition is set with the ORDER BY keyword; data is stored in that order when it is imported or saved, and because the data sits on disk already ordered, queries over it are efficient.
The sort columns also act as an index, so adding fields that are frequently used as query conditions to the sort key helps query efficiency.
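A minimal sketch of putting a frequently filtered column at the front of the sort key; the user_event table and its columns are invented for illustration:

CREATE TABLE user_event
(
    user_id     UInt64,
    event_type  String,
    create_time DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(create_time)
ORDER BY (user_id, create_time)  -- user_id is a common filter, so it leads the sort key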
2. Primary key
The primary key is defined rather unusually: it only acts as an index to filter queries and carries no uniqueness constraint.
When a primary key is set, its fields must be contained in the sort key as a left-to-right prefix.
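For example, in the hypothetical table below the primary key is accepted because it is a left prefix of the sort key, while a primary key of create_time alone would be rejected:

CREATE TABLE pk_demo
(
    id          UInt64,
    create_time DateTime
)
ENGINE = MergeTree()
ORDER BY (id, create_time)
PRIMARY KEY id               -- allowed: a left prefix of the sort key
-- PRIMARY KEY create_time   -- would be rejected: not a prefix of the sort key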
3. Default value
The Nullable type almost always drags performance down, for these reasons: null values cannot be indexed, and the extra special placeholder has to be handled separately. Storing the same number of values in every column also benefits columnar queries.
Null values should therefore be processed before the data is imported, replacing them with values that have no business meaning.
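One common way to do this is to declare business-neutral default values instead of Nullable columns; the sketch below uses invented names:

CREATE TABLE user_profile
(
    id          UInt64,
    nickname    String   DEFAULT '',     -- empty string instead of NULL
    age         UInt8    DEFAULT 0,      -- 0 stands for "unknown" here
    create_time DateTime DEFAULT now()
)
ENGINE = MergeTree()
ORDER BY id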
(2) Table engine
ClickHouse offers a rich set of table engines, of which the MergeTree family is the most commonly used.
1. MergeTree
The MergeTree engine can satisfy queries over large amounts of data. Because the primary key carries no unique constraint, duplicate rows can exist, and importing duplicate data during a migration is hard to avoid. Choose this engine if the business can tolerate some duplicate data or handles duplicates on the application side.
CREATE TABLE test_tbl
(
    id          UInt16,
    create_time Date,
    comment     Nullable(String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(create_time)
ORDER BY (id, create_time)
PRIMARY KEY (id)
TTL create_time + INTERVAL 1 MONTH
SETTINGS index_granularity = 8192
The MergeTree engine must specify a sort field.
Attribute | Meaning | Note
ORDER BY | specify the sort field (required) | one or more fields used as the sort key (sorting happens within a partition)
PARTITION BY | specify the partition rule | a date is generally used as the table's partitioning policy
PRIMARY KEY | primary key field | primary key values may repeat; multiple fields can be specified
TTL | record expiration time | an expiration time can be set for records
SETTINGS | sparse index granularity | keep the default unless there is a special need
The MergeTree primary key is used to speed up queries; unlike MySQL, it does not keep records unique.
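A small demonstration of this with the test_tbl defined above (the inserted values are made up):

INSERT INTO test_tbl VALUES (1, today(), 'first'), (1, today(), 'second');
SELECT count() FROM test_tbl WHERE id = 1;   -- returns 2: both rows are kept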
2. ReplacingMergeTree
The ReplacingMergeTree engine is used to remove duplicate rows. The deduplication here has three aspects: it happens within a partition; the primary key fields are what gets compared; and the actual deduplication only takes place during merges.
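The OPTIMIZE statement below assumes a table roughly like the following sketch (the actual schema in the original environment may differ); the create_time parameter passed to the engine is the version column discussed further down:

CREATE TABLE test_tbl_replacing
(
    id          UInt16,
    create_time Date,
    comment     Nullable(String)
)
ENGINE = ReplacingMergeTree(create_time)
PARTITION BY toYYYYMM(create_time)
ORDER BY id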
-- force a background merge; the service should be stopped while the table is being deduplicated
optimize table test_tbl_replacing final
ReplacingMergeTree provides primary-key deduplication, but it still has the following limitations:
optimize is a background action, so there is no way to predict when it will actually run.
Until a full optimize has completed, there is no way to know whether duplicate data still exists.
Running optimize manually over massive data takes a long time and cannot satisfy real-time business queries.
In a distributed setup, rows with the same primary key may be sharded onto different nodes, and data on different shards is not deduplicated against each other.
ReplacingMergeTree is therefore better seen as a guarantee that the data is eventually deduplicated; it does not guarantee that primary keys never repeat while queries are running.
The parameter in ReplacingMergeTree(create_time) is the version column: among duplicate rows, the one with the largest version value is kept. The parameter may be omitted, in which case the last inserted row among the duplicates is kept by default.
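A sketch of that behaviour against the test_tbl_replacing table sketched above (the values are invented):

INSERT INTO test_tbl_replacing VALUES (1, '2024-06-01', 'v1');
INSERT INTO test_tbl_replacing VALUES (1, '2024-06-15', 'v2');
OPTIMIZE TABLE test_tbl_replacing FINAL;
-- only the duplicate with the larger version value (create_time = 2024-06-15) survives
SELECT * FROM test_tbl_replacing WHERE id = 1;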
To understand this properly: the deduplication here does not achieve deduplication in the strict relational-database sense, which is something to keep in mind when using it. Nor should the issue be judged in black-and-white terms: ClickHouse made some compromises here in exchange for query speed.
3. SummingMergeTree
SummingMergeTree is a pre-aggregation engine: it is equivalent to grouping by the ORDER BY fields and summing within each group, except that the results are computed ahead of time (during merges) and do not need to be computed at query time.
If the pre-aggregated values do not meet the requirement, the result set can be aggregated again with aggregate functions at query time, which is then a real-time calculation.
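A minimal SummingMergeTree sketch with invented names, followed by the query-time re-aggregation just described:

CREATE TABLE daily_sales
(
    sale_date Date,
    shop_id   UInt32,
    amount    UInt64
)
ENGINE = SummingMergeTree(amount)
PARTITION BY toYYYYMM(sale_date)
ORDER BY (sale_date, shop_id);

-- rows sharing the same sort key are pre-summed during merges;
-- summing again at query time gives a complete result even before a merge has happened
SELECT sale_date, shop_id, sum(amount) AS amount
FROM daily_sales
GROUP BY sale_date, shop_id;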
(3) Built-in functions
A few common built-in functions deserve to be pointed out specifically; they come up when defining new table schemas, importing data, and so on.
1. Date formatting
Date-formatting functions are commonly used when setting table partitions; partitioning by day is the most common setup.
SELECT toYYYYMMDD(now())
2. Hash function
Use the hash string of the name field as the partitioning policy.
CREATE TABLE default.test02
(
    `id`          UInt16,
    `name`        String,
    `create_time` Date
)
ENGINE = MergeTree()
PARTITION BY LOWER(hex(MD5(name)))
PRIMARY KEY id
ORDER BY (id, create_time)
A table does not have to declare a primary key; once a primary key is set, the sort key must start with the primary key columns.
Using the raw string value directly as the partitioning key is also feasible, but given how wide the value range of a string can be, going through a hash function is safer.
3. Date functions
Various functions return dates and times. If no time zone is specified, the host's time zone is used by default.
SELECT toDateTime(now()) AS t1, toDate(now()) AS t2, toDate(now(), 'Asia/Shanghai') AS t3, toString(now()) AS t4
IV. Installation and configuration
The long-term-support version 20.8 is chosen and installed manually.
(1) Installation
rpm -ivh clickhouse-server-20.8.19.4-2.noarch.rpm
rpm -ivh clickhouse-client-20.8.19.4-2.noarch.rpm
rpm -ivh clickhouse-common-static-20.8.19.4-2.x86_64.rpm
(2) Configuration
1. Stripping comments with a regular expression
Use a regular-expression replacement to clear all the comments out of the XML configuration file.
# format the XML file
xmllint --format config.xml
2. Server configuration files
There are two server configuration files, config.xml and users.xml; the former is read-only while the server runs, while the latter can be modified dynamically at run time.
That is the end of this article on "how the database ClickHouse is applied in the field of big data". To really master these points, you still need to practise and use them yourself before they sink in.