How to use the ClickHouse sparse index in MergeTree

This article explains how the sparse index in the ClickHouse MergeTree engine works and how to use it. The material is meant to be easy to understand and clearly organized; hopefully it resolves your doubts. Let's study "how to use the ClickHouse sparse index in MergeTree" together.
Logical structure of MergeTree storage
In the MergeTree storage structure, data partitions are independent of each other and logically unrelated. A single data partition contains multiple MergeTree Data Parts. Once generated, a Data Part is immutable; Data Parts are created by writes and destroyed by asynchronous merges. The write path of a MergeTree table is essentially a batch-load process: a Data Part does not support appending single rows, and every batch insert generates a new MergeTree Data Part. If a user inserts one record at a time, a separate Data Part is generated for that single record, which is clearly unacceptable. So in practice, when using the MergeTree table engine, writes should be aggregated into batches on the client side.
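As a minimal sketch of this behavior (table and column names here are illustrative, not from the original article), each single-statement INSERT below produces one new immutable part, which can be observed in system.parts:

-- Illustrative table; a plain MergeTree with a two-column sort key.
CREATE TABLE user_actions
(
    time      DateTime,
    user_id   UInt32,
    action_id UInt32
)
ENGINE = MergeTree()
ORDER BY (user_id, time);

-- Each batch INSERT generates one new MergeTree Data Part.
INSERT INTO user_actions VALUES ('2021-01-01 00:00:00', 1, 10);
INSERT INTO user_actions VALUES ('2021-01-01 00:00:01', 2, 20);

-- Two separate active parts exist until a background merge combines them.
SELECT name, rows
FROM system.parts
WHERE table = 'user_actions' AND active;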
1. Concepts
Part: the data set generated by one batch write; each batch insert produces one part.
primary.idx file: stores the sparse index; each part has exactly one sparse index.
Mark number | value
0           | a
1           | a
2           | b
Bin file: the file that actually stores data, consisting of one or more compressed data blocks. A compressed data block is the smallest unit of storage and consists of a header plus the compressed data. The header has three parts: the compression algorithm, the size in bytes before compression, and the size in bytes after compression. The uncompressed size of each compressed data block is strictly limited to the 64K~1M byte range (a size ClickHouse considers the best trade-off for compression and decompression cost). In other words, a compressed data block packs N in-memory blocks, and a bin file consists of N compressed data blocks.
Mrk file: stores, for each mark, which compressed data block in the bin file the data falls in (the block's starting offset in the bin file) and the starting offset within the decompressed data block.
Mark index | offset of the compressed block in the bin file | offset within the decompressed block
0          | 0                                              | 0
1          | 0                                              | 1
2          | 0                                              | 2

2. Data storage files
action_id.bin, avatar_id.bin, and so on are per-column data files, compressed block by block; each stores a single column.
Data is stored in bin files in units of compressed data blocks. The uncompressed size of each compressed data block is strictly limited to the 64K~1M byte range:
(1) If an in-memory block is smaller than 64K, it is combined with subsequent blocks until the accumulated size reaches at least 64K.
(2) If a block falls in the 64K to 1M range, it directly becomes one compressed data block.
(3) If a block is larger than 1M, it is cut into multiple compressed data blocks.
Within a part, each column is stored separately, and all columns contain the same number of rows.
Mark files: action_id.mrk2, avatar_id.mrk2, and so on are the mark files for the column storage files. Marks connect two important concepts in MergeTree column storage: Granule and Block.
Granule is the logical unit into which data is divided by rows. In older versions, how many rows make up a Granule was a constant set by the parameter index_granularity: every index_granularity rows form one Granule. The current version adds another parameter, index_granularity_bytes, which also affects the number of rows per Granule: the total size of all columns in each Granule should not exceed the configured value where possible. The main problem with the old fixed-row Granule is that MergeTree indexes data at Granule granularity, so in scenarios with very wide tables the amount of data read from storage can amplify dramatically, forcing users to tune the parameter very carefully.
Block is the compression unit in a column storage file; each Block contains several Granules, the exact number of which is controlled by the parameter min_compress_block_size. Each time a Granule's data for a column is written into that column's current Block, the writer checks whether the current Block size has reached the configured value; if so, the current Block is compressed and written to disk.
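A hedged sketch of how these knobs are set when creating a table (the values shown are the usual defaults; the table is the illustrative one from above):

CREATE TABLE user_actions_tuned
(
    time      DateTime,
    user_id   UInt32,
    action_id UInt32
)
ENGINE = MergeTree()
ORDER BY (user_id, time)
SETTINGS index_granularity = 8192,           -- max rows per Granule
         index_granularity_bytes = 10485760, -- adaptive Granule size cap (10 MiB)
         min_compress_block_size = 65536,    -- ~64K lower bound before a Block is compressed
         max_compress_block_size = 1048576;  -- ~1M upper bound for a compressed Block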
From these two points we can see that a MergeTree Block has neither a fixed data size nor a fixed row count, and a Granule is not a fixed-length logical unit either. So extra information is needed to locate a Granule quickly. That is the purpose of the mark files: for each Granule they record its number of rows, the offset of its Block in the compressed file, and the offset of the Granule within the decompressed Block.
Primary key index: primary.idx is the table's primary key index. ClickHouse defines the primary key index slightly differently from traditional databases: it does not enforce uniqueness or deduplicate rows by primary key, but it retains the ability to locate rows by primary key quickly. The primary key index stores the primary key value of the first row of each Granule, and the data in MergeTree storage is strictly sorted by primary key. So given a primary key condition, the index can determine where matching data may exist, and combined with the marks described above, the location range of the data in the storage files can be narrowed further. ClickHouse's primary key index is a coarse index that strikes a balance between index construction cost and lookup efficiency. By default the primary key order of MergeTree is the same as the Order By order, but the user may define the primary key as a prefix of the Order By columns.
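A minimal sketch of a primary key defined as a prefix of Order By (names are illustrative):

CREATE TABLE events
(
    region  String,
    time    DateTime,
    user_id UInt32
)
ENGINE = MergeTree()
ORDER BY (region, time, user_id)
PRIMARY KEY (region, time);  -- primary.idx stores only (region, time) per Granule

-- A condition on the key prefix lets the sparse index skip whole Granules.
SELECT count() FROM events WHERE region = 'cn-hangzhou';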
Partition key indexes: minmax_time.idx and minmax_region_name.idx are the table's partition key indexes. MergeTree storage records the minimum and maximum values of the partition key columns in each Data Part. When a user query includes a condition on the partition key, irrelevant Data Parts can be excluded outright. This is the partition pruning technique commonly used in OLAP scenarios.
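A sketch of partition pruning, assuming a time-based partition key (names and dates are illustrative):

CREATE TABLE logs
(
    time        DateTime,
    region_name String,
    message     String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(time)
ORDER BY time;

-- Only Data Parts whose minmax_time range overlaps March 2021 are scanned;
-- all other parts are excluded before the scan starts.
SELECT count()
FROM logs
WHERE time >= '2021-03-01 00:00:00' AND time < '2021-04-01 00:00:00';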
Skipping index: skp_idx_avatar_id_minmax.idx is a MinMax index the user defined on the avatar_id column. Skipping indexes in MergeTree are a class of coarse indexes built from local aggregation. When defining a skipping index, the user sets a granularity parameter, which specifies how many Granules are aggregated together to generate one piece of index information, and an aggregate function for the index, such as minmax, set, bloom_filter, ngrambf_v1, and so on. The aggregate function summarizes the column values of several consecutive Granules into index entries. The idea is similar to the primary key index: because data is sorted by primary key, the primary key index is effectively a MinMax of the primary key columns at Granule granularity, while skipping indexes offer a variety of aggregate functions as a complement to the primary key index. Both kinds of index require users to understand how they work and match them to their own business scenarios.
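A sketch matching the skp_idx_avatar_id_minmax.idx example above, assuming a ClickHouse version where skipping indexes are generally available:

CREATE TABLE actions
(
    action_id UInt32,
    avatar_id UInt32,
    time      DateTime,
    -- minmax aggregation over every 4 consecutive Granules of avatar_id
    INDEX skp_idx_avatar_id_minmax avatar_id TYPE minmax GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY (action_id, time);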
3. Retrieval process
MergeTree storage implements range-checking methods on the KeyCondition class, which extracts the partition key and primary key conditions from a select query, to determine the Mark Ranges that may satisfy the filter conditions. As mentioned above, the column data in a MergeTree Data Part is indexed by the array of marks at Granule granularity, and a Mark Range denotes a subscript interval in that mark array that may satisfy the query conditions.
During index retrieval, the partition key KeyCondition is used first to prune irrelevant data partitions, then the primary key index selects coarse Mark Ranges, and finally the Skipping Indexes filter the Mark Ranges produced by the primary key index. The algorithm that selects coarse Mark Ranges with the primary key index is a process of repeatedly splitting Mark Ranges, producing a collection of Mark Ranges as its result. The initial Mark Range covers the whole MergeTree Data Part. Each round takes the Mark Ranges from the previous round and splits them into finer-grained Mark Ranges with a certain step, eliminates the Mark Ranges that definitely cannot satisfy the conditions, and stops when the Mark Ranges reach a certain granularity. This is a simple and efficient coarse filtering algorithm.
Before Skipping Indexes can filter the Mark Ranges returned by the primary key index, an IndexCondition must be constructed for each Skipping Index. Different skipping index aggregate functions have different IndexCondition implementations, but their interface for deciding whether a Mark Range satisfies the conditions is similar to KeyCondition's.
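One way to observe this staged filtering from SQL, assuming a recent ClickHouse version that supports the indexes option of EXPLAIN (using the actions table sketched above):

-- Reports, per stage, how the partition MinMax index, the primary key,
-- and any skipping indexes narrow the selected parts and granules.
EXPLAIN indexes = 1
SELECT count() FROM actions WHERE avatar_id = 100;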
4. Data sampling

After the index filtering in the previous section, we have the collection of Mark Ranges that need to be scanned; the next step would be data scanning. Before that, this section briefly describes how data sampling is implemented in MergeTree. It is not carried out during the data scan; it is already completed during index retrieval, in the interest of extreme sampling efficiency. When creating the table, the user can designate a column or expression in the primary key as the sampling key. ClickHouse takes a simple and blunt approach here: the value of the sampling key must be numeric, and the system assumes its values are uniformly and randomly distributed. If the sampling key is of type UInt32 and the sample ratio is set to 0.1, the sample is converted during index retrieval into a filter condition: sampling key < UInt32::max * 0.1. Users must be aware of this detail when using the sampling feature, otherwise sampling bias is likely. Generally, we recommend scattering the sampling key randomly by applying a hash function to the column value.
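A minimal sampling sketch along these lines (names are illustrative; the sampling key is scattered with a hash, as recommended above):

CREATE TABLE visits
(
    time    DateTime,
    user_id UInt32
)
ENGINE = MergeTree()
ORDER BY (time, intHash32(user_id))
SAMPLE BY intHash32(user_id);

-- During index retrieval, SAMPLE 0.1 is rewritten into roughly:
--   intHash32(user_id) < UInt32::max * 0.1
SELECT count() * 10 AS estimated_total FROM visits SAMPLE 0.1;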
5. Data scanning

Data scanning in MergeTree provides three different modes:
Final mode: this mode provides a view of the final, merged data for table engines such as CollapsingMergeTree and SummingMergeTree. As mentioned earlier, the advanced table engines built on MergeTree apply engine-specific merge logic to MergeTree Data Parts. Because the merging of Data Parts is an asynchronous process, users cannot see the final data result until the parts are merged into one Data Part. So ClickHouse offers a final mode in queries, which stacks merge streams such as DistinctSortedBlockInputStream and SummingSortedBlockInputStream on top of the BlockInputStreams of the individual Data Parts; this logic is consistent with the asynchronous merge logic, so users can see the "final" data result in advance.
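A minimal sketch of final mode with SummingMergeTree (names are illustrative):

CREATE TABLE daily_sums
(
    day Date,
    key UInt32,
    val UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (day, key);

INSERT INTO daily_sums VALUES ('2021-01-01', 1, 5);
INSERT INTO daily_sums VALUES ('2021-01-01', 1, 7);

-- Without FINAL, rows from the two unmerged parts may both appear;
-- FINAL applies the summing merge logic at query time.
SELECT * FROM daily_sums FINAL;  -- day=2021-01-01, key=1, val=12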
Sorted mode: sorted mode can be seen as a query optimization that pushes order by down into storage. Because the data inside each MergeTree Data Part is already ordered, when a user query contains an order by condition on the sort key, it is enough to place one InputStream that merges the ordered streams on top of the BlockInputStreams of the Data Parts to obtain globally ordered output.
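A sketch of the sorted-mode pattern; optimize_read_in_order is enabled by default in recent versions and is spelled out here only for clarity (using the user_actions table from above):

-- ORDER BY matches the table's sort key, so each part is read in order
-- and only a merge of already-sorted streams is needed at the top.
SELECT *
FROM user_actions
ORDER BY user_id, time
LIMIT 100
SETTINGS optimize_read_in_order = 1;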
Normal mode: this is the most common data scanning mode for plain MergeTree tables. Multiple Data Parts are scanned in parallel, and a single query can achieve very high read throughput.
Next, we introduce several key performance optimizations in Normal mode:
Parallel scanning: in traditional computing engines, the concurrency of the data scan is usually bound to the number of stored files, so parallel scanning across MergeTree Data Parts is a basic capability. However, the MergeTree storage structure requires data to be continuously merged, eventually into a single Data Part, which is most efficient for indexing and compression. So on top of Data Part parallelism, ClickHouse adds Mark Range parallelism: users can freely set the degree of parallelism for the scan, each scan thread is assigned tasks at Mark-Range-in-Data-Part granularity, and the Mark Range task pool is shared among the scan threads, which avoids long-tail effects in the storage scan.
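The degree of parallelism is set per query; a minimal sketch using the user_actions table from above:

-- Eight scan threads pull Mark Range tasks from a shared pool.
SELECT count()
FROM user_actions
SETTINGS max_threads = 8;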
Data cache: the query path of MergeTree has caches at several levels. The primary key index and partition key indexes are loaded into memory when a Data Part is loaded. Mark files and column data files have the corresponding MarkCache and UncompressedCache: MarkCache caches the raw contents of the mrk files, while UncompressedCache caches decompressed Block data.
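A hedged sketch of the query-level switch for the decompressed-block cache (MarkCache needs no per-query switch):

-- Cache decompressed Blocks for reuse by later queries over the same data.
SELECT count()
FROM user_actions
WHERE user_id = 1
SETTINGS use_uncompressed_cache = 1;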
SIMD deserialization: deserialization of some column types is accelerated with hand-written SSE instructions, which helps when the data hits the UncompressedCache.
PreWhere filtering: ClickHouse syntax supports an additional PreWhere filter condition, which is evaluated before the Where condition. When a user adds a PreWhere condition to a query's filters, the storage scan runs in two phases: first the column values referenced by the PreWhere condition are read and each row is checked against the condition, which further narrows the scan range on top of the Mark Ranges. After the PreWhere columns are scanned and evaluated, ClickHouse adjusts the exact number of rows that remain to be scanned in the Granule corresponding to each mark, effectively discarding stretches at the head and tail of the Granule.
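A minimal PreWhere sketch using the actions table from above:

-- Phase 1 reads only avatar_id and evaluates the PreWhere predicate;
-- phase 2 reads the remaining columns only for the surviving row ranges.
SELECT action_id, time
FROM actions
PREWHERE avatar_id = 100
WHERE time >= '2021-01-01 00:00:00';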
That is all of "how to use the ClickHouse sparse index in MergeTree". Thank you for reading! I hope the content shared here is helpful; if you want to learn more, welcome to follow the industry information channel!