Sparse Indexes in Kafka and ClickHouse
This article analyzes how sparse indexes are applied in two widely used systems, taking Kafka's log segment indexes and ClickHouse's MergeTree primary index as examples.
Sparse Index
In storage systems, typified by databases, an index is a data structure maintained alongside the original data that improves query speed by reducing disk accesses, much like the table of contents of a book. An index usually consists of two parts: an index key (≈ the chapter title) and a pointer to the original data (≈ the page number), as shown in the following figure.
Indexes can be organized in many forms. The sparse index discussed in this article is a simple and commonly used form of ordered index: on top of data that is already sorted by primary key, only a portion (usually a small portion) of the records are indexed. A query first uses the index to narrow the search down to an approximate range, and then finds the target data within that range with a suitable search algorithm. As shown in the following figure, one sparse index entry is built for every three records of raw data.
By contrast, an index that covers every record is called a dense index, as shown in the following figure.
Dense and sparse indexes are really a trade-off between space and time: when the data volume is large, indexing every record also consumes a lot of space, so sparse indexes are very useful in certain scenarios. Two examples follow, starting with a small code sketch to make the lookup concrete.
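As a warm-up, here is a minimal Python sketch of the idea, using made-up data: only every 64th record gets an index entry, a lookup binary-searches the index for the last entry at or before the target key, and then scans sequentially from there.

import bisect

# Records are sorted by key; only every STEP-th record is indexed.
records = [(k, f"row-{k}") for k in range(0, 1000, 3)]    # sorted by key
STEP = 64                                                  # index interval
index = [(records[i][0], i) for i in range(0, len(records), STEP)]

def lookup(key):
    # 1) Binary-search the sparse index for the last entry whose key <= target.
    keys = [k for k, _ in index]
    slot = bisect.bisect_right(keys, key) - 1
    if slot < 0:
        return None
    # 2) Scan sequentially from that position; the scan is bounded by STEP.
    _, start = index[slot]
    for k, value in records[start:start + STEP]:
        if k == key:
            return value
        if k > key:
            break
    return None

print(lookup(300))   # row-300
print(lookup(301))   # None (301 is not a stored key)

The sequential scan is bounded by the indexing step, which is exactly the time cost paid for the space saved.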
Sparse Index in Kafka
We know that within a single Kafka topic partition, message data is split into segments and stored in files with the .log extension. A new segment is rolled when either the size threshold log.segment.bytes (default 1 GB) or the time threshold log.roll.hours (default 168 hours, i.e. 7 days) is reached. Part of a partition's data directory looks like this.
.
├── 00000000000190089251.index
├── 00000000000190089251.log
├── 00000000000190089251.timeindex
├── 00000000000191671269.index
├── 00000000000191671269.log
├── 00000000000191671269.timeindex
├── 00000000000193246592.index
├── 00000000000193246592.log
├── 00000000000193246592.timeindex
├── 00000000000194821538.index
├── 00000000000194821538.log
├── 00000000000194821538.timeindex
├── 00000000000196397456.index
├── 00000000000196397456.log
├── 00000000000196397456.timeindex
├── 00000000000197971543.index
├── 00000000000197971543.log
├── 00000000000197971543.timeindex
.
The log file's name is a 64-bit integer: the base offset, i.e. the offset of the first message stored in that file (equivalently, one more than the offset of the last message in the previous log file), left-padded with zeros to 20 digits. Each log file is accompanied by two index files, .index and .timeindex, holding the offset index and the timestamp index respectively; both are sparse.
You can inspect the contents of the index files with the DumpLogSegments tool that ships with Kafka.
~ kafka-run-class kafka.tools.DumpLogSegments --files /data4/kafka/data/ods_analytics_access_log-3/00000000000197971543.index
Dumping /data4/kafka/data/ods_analytics_access_log-3/00000000000197971543.index
Offset: 197971551 position: 5207
Offset: 197971558 position: 9927
Offset: 197971565 position: 14624
Offset: 197971572 position: 19338
Offset: 197971578 position: 23509
Offset: 197971585 position: 28392
Offset: 197971592 position: 33174
Offset: 197971599 position: 38036
Offset: 197971606 position: 42732
.
~ kafka-run-class kafka.tools.DumpLogSegments --files /data4/kafka/data/ods_analytics_access_log-3/00000000000197971543.timeindex
Dumping /data4/kafka/data/ods_analytics_access_log-3/00000000000197971543.timeindex
Timestamp: 1593230317565 offset: 197971551
Timestamp: 1593230317642 offset: 197971558
Timestamp: 1593230317979 offset: 197971564
Timestamp: 1593230318346 offset: 197971572
Timestamp: 1593230318558 offset: 197971578
Timestamp: 1593230318579 offset: 197971582
Timestamp: 1593230318765 offset: 197971592
Timestamp: 1593230319117 offset: 197971599
Timestamp: 1593230319442 offset: 197971606
.
As the output shows, the .index file stores the mapping from offset values to the corresponding physical positions in the log file, while the .timeindex file stores the mapping from timestamps to the corresponding offset values. With these two files, a message can be located quickly by offset or by timestamp. And because the index files are small, they are easy to memory-map (mmap), which makes accessing them very efficient.
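To make "small and mmap-friendly" concrete, here is a hedged Python sketch that memory-maps an offset index and decodes it. It assumes the conventional layout of Kafka *.index files: fixed 8-byte entries, each a big-endian pair of (offset relative to the segment's base offset, byte position in the log file). The file path is hypothetical.

import mmap
import struct

BASE_OFFSET = 197971543  # taken from the index file's name

with open("00000000000197971543.index", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for entry_start in range(0, len(mm), 8):
            # Each entry: 4-byte relative offset + 4-byte log-file position.
            rel_offset, position = struct.unpack_from(">ii", mm, entry_start)
            if rel_offset == 0 and position == 0 and entry_start > 0:
                break  # active segments may be preallocated with zeroed tails
            print(f"Offset: {BASE_OFFSET + rel_offset} position: {position}")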
Take the offset index as an example. To find the message with offset=197971577, the process is as follows (a code sketch of these steps appears after the list):
Binary-search the sequence of segment files by their base-offset file names to find the segment containing the target offset (00000000000197971543.index).
Binary-search within that index file for the last entry whose offset is no greater than the target, which marks the start of the interval containing it (197971572, at position 19338).
From that position, scan the log file sequentially until the message with the target offset is found.
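A minimal Python sketch of those three steps, using the base offsets and index entries shown above (reading the actual log bytes in step 3 is stubbed out):

import bisect

segment_bases = [190089251, 191671269, 193246592,
                 194821538, 196397456, 197971543]       # from the file names
offset_index = [(197971551, 5207), (197971558, 9927),   # from the .index dump
                (197971565, 14624), (197971572, 19338),
                (197971578, 23509)]

def locate(target_offset):
    # Step 1: binary-search the segment list for the last base <= target.
    seg = segment_bases[bisect.bisect_right(segment_bases, target_offset) - 1]
    # Step 2: binary-search the segment's sparse index for the last
    # indexed offset <= target; its position bounds the scan start.
    idx_offsets = [o for o, _ in offset_index]
    slot = bisect.bisect_right(idx_offsets, target_offset) - 1
    start_offset, start_pos = offset_index[slot]
    # Step 3: from start_pos, scan the log file sequentially until the
    # message with the target offset is reached (omitted here).
    return seg, start_offset, start_pos

print(locate(197971577))  # (197971543, 197971572, 19338)

Both binary searches are cheap because the segment list and the per-segment index are small; only the final sequential scan touches message data, and it is bounded by the index interval.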
Finally, the granularity of the sparse index is controlled by the log.index.interval.bytes parameter (default 4 KB): one index entry is written for roughly every 4 KB of data appended to the log file. Increasing the parameter makes the index sparser; decreasing it makes it denser.
Sparse Index in ClickHouse
In ClickHouse, the index columns of a MergeTree table are specified with the ORDER BY clause when the table is created. The official documentation illustrates this with the following figure.
The figure shows the case where CounterID and Date are the index columns: rows are sorted first by CounterID and then by Date, and the combination of the two columns serves as the index key. The marks and mark numbers are the index entries, and the interval between marks is controlled by the index_granularity setting specified at table creation, with a default value of 8192 rows.
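For concreteness, here is a minimal sketch of such a table definition, issued through the third-party clickhouse-driver Python package; the table and column names echo the documentation's example, and everything else (host, partition key) is an assumption.

from clickhouse_driver import Client  # third-party package: clickhouse-driver

client = Client(host="localhost")  # assumes a reachable ClickHouse server

# ORDER BY (CounterID, Date) makes these two columns the index key, so
# primary.idx stores one (CounterID, Date) sample per index_granularity rows.
client.execute("""
    CREATE TABLE IF NOT EXISTS hits
    (
        CounterID UInt32,
        Date      Date,
        UserID    UInt64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(Date)
    ORDER BY (CounterID, Date)
    SETTINGS index_granularity = 8192
""")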
In the ClickHouse MergeTree engine table, the data for each part is stored roughly in the following structure.
.
├── business_area_id.bin
├── business_area_id.mrk2
├── coupon_money.bin
├── coupon_money.mrk2
├── groupon_id.bin
├── groupon_id.mrk2
├── is_new_order.bin
├── is_new_order.mrk2
.
├── primary.idx
.
Here, each .bin file stores the raw (possibly compressed) data of one column, while the corresponding .mrk2 file stores the mapping from the mark numbers in the figure to byte positions in the .bin file. The primary.idx file stores the actual values of the index columns at each mark. In addition, each part's data lives in its own directory, with a name such as 20200708_92_121_7 that encodes the partition ID, the minimum and maximum block numbers, and the merge level, making parts easy to locate.
To sum up, the sparse primary index is built on the ORDER BY columns. When a query runs, ClickHouse first locates the parts that may contain the target data, then uses primary.idx together with the .mrk2 files to narrow down the byte ranges that actually need to be read from the .bin files, as the sketch below illustrates.
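A hedged Python sketch of this mark-based pruning, with made-up index samples: primary.idx is modeled as one leading-key value per granule of index_granularity rows, and a range predicate on the leading column selects only the granules that can intersect it.

import bisect

GRANULARITY = 8192
primary_idx = [0, 1300, 2900, 4100, 5600, 7800, 9500]  # one CounterID per mark

def marks_for_range(lo, hi):
    # The granule *before* the first sample > lo can still hold matching
    # rows, so start there.
    first = max(bisect.bisect_right(primary_idx, lo) - 1, 0)
    # Granules whose first key is already > hi cannot match at all.
    last = bisect.bisect_right(primary_idx, hi)
    return range(first, last)

# Rows with 3000 <= CounterID <= 5000 can only live in granules 2 and 3,
# i.e. rows [2*8192, 4*8192) of each referenced column's .bin file.
print(list(marks_for_range(3000, 5000)))  # [2, 3]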
Note, however, that ClickHouse's sparse index differs from Kafka's in that it lets users combine index columns freely. This calls for extra care: do not add too many index columns, or the index keys become bulky and storage and lookup costs rise. Likewise, a column with very low cardinality (i.e. poor selectivity) makes a poor index column, because its value is likely to be identical across many consecutive marks, giving the index no pruning power.