In this article, the editor walks you through an in-depth analysis of the workflow, storage mechanism, and partition strategy of the Kafka architecture. The content is rich and approached from a professional angle; I hope you gain something from reading it.
I. Preface
Before starting, be clear that Kafka is a distributed streaming platform that is, at its core, a message queue. When message queues come up, their three classic roles come to mind: asynchronous processing, peak shaving, and decoupling. Kafka is mainly used in real-time big data processing and is relatively easy to use. This article analyzes Kafka's workflow, storage mechanism, and partition strategy, and summarizes them from several angles.
However, note that as the industry moves into 2020 and beyond, Kafka is no longer the only dominant player. Apache Pulsar, a messaging platform with native support for multi-tenancy, cross-region replication, and a unified messaging model, has already replaced Kafka in a number of enterprises. Those interested in Apache Pulsar can follow me; I will summarize it and go deeper later.
II. Kafka Workflow
Kafka classifies messages by topic, and each message has three attributes:
Offset: the offset of the message within its partition. It is a logical value that uniquely identifies a message within the partition; it can simply be thought of as an id.
MessageSize: the size of the message body in bytes.
Data: the actual content (payload) of the message. (A minimal sketch of these three fields follows below.)
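To make the three attributes concrete, here is a conceptual sketch of a message as this article describes it. It is an illustration only, not Kafka's real on-disk record format (which also carries a CRC, timestamp, headers, and so on), and the class name SimpleMessage is made up for the example.

// A conceptual sketch of the three attributes listed above; it only mirrors
// the simplified view used in this article, not Kafka's actual record format.
record SimpleMessage(long offset, int messageSize, byte[] data) {
    // Convenience factory: here the size is simply the length of the payload.
    static SimpleMessage of(long offset, byte[] data) {
        return new SimpleMessage(offset, data.length, data);
    }
}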
In the overall Kafka architecture, producers and consumers follow a publish/subscribe model: producers produce messages, consumers consume them, each doing its own job, and both work against topics. (Note: a topic is a logical concept, while a partition is a physical one. Each partition corresponds to a log file, and that log file stores the data produced by the producer.)
The data produced by Producer is constantly appended to the end of the log file, and each piece of data has its own offset.
Each consumer in a consumer group records, in real time, the offset up to which it has consumed, so that after a failure and recovery it can resume from that offset, avoiding both missed data and duplicate consumption.
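To illustrate how a consumer tracks and commits its offsets, here is a small sketch using the official kafka-clients Java API. The broker address, group id, and topic name are placeholders for this example; with auto-commit disabled, the consumer commits its position explicitly and, after a restart, resumes from the last committed offset.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetTrackingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "csdn-demo-group");            // placeholder consumer group id
        props.put("enable.auto.commit", "false");            // we commit offsets ourselves
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("csdn"));              // the example topic used in this article
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // Persist the consumed position; after a crash the group resumes
                // from the last committed offset instead of re-reading or skipping data.
                consumer.commitSync();
            }
        }
    }
}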
III. File Storage Mechanism
3.1 File storage structure and naming rules
When Kafka was designed, it was recognized that the log file would grow too large as the producer kept appending messages to its end, so a segmenting and indexing mechanism was adopted: each partition is divided into multiple segments, and each segment corresponds to three files: an .index file, a .log file, and a .timeindex file (absent in older versions). The .log and .index files live in a folder named after the topic plus the partition number. For example, if the topic csdn has two partitions, the corresponding folders are csdn-0 and csdn-1.
If we open the csdn-0 folder, we will see the following files:
00000000000000000000.index
00000000000000000000.log
00000000000000150320.index
00000000000000150320.log
Since there are two .log files in this folder, we can conclude that this partition has two segments.
File naming rule: the first segment of a partition starts at offset 0, and each subsequent segment file is named after the base offset of the first message it contains (that is, one past the offset of the last message in the previous segment). The offset is a 64-bit value, and the file name is 20 digits long, left-padded with zeros.
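As a quick illustration of the naming rule, the sketch below derives a segment file name from a base offset. The helper class and method names are made up for this example.

// The .log (and .index) file name is just the segment's base offset,
// zero-padded to 20 digits.
public class SegmentName {
    static String logFileName(long baseOffset) {
        return String.format("%020d", baseOffset) + ".log";
    }

    public static void main(String[] args) {
        System.out.println(logFileName(0L));       // 00000000000000000000.log
        System.out.println(logFileName(150320L));  // 00000000000000150320.log
    }
}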
Note: the entries in an index file do not start from 0 and do not increase by 1 each time. This is because Kafka uses sparse indexing: an index entry is written only after every fixed amount of data. This keeps the index file small enough to be mapped into memory, reducing disk IO during lookups without adding much query time.
(Diagram: older Kafka storage layout, without the .timeindex file.)
3.2 Relationship between the index and log files
The ".index" file stores index entries, the ".log" file stores the actual message data, and each index entry points to the physical position of a message in the corresponding .log file.
3.3 Locating a message by offset
Because each segment file is named after its base offset (the offset of the first message it contains), a message with a given offset can be located by first binary-searching the segment file names to find the segment it belongs to, then looking up its physical position in that segment's index file, and finally reading the message from the .log file.
For example, suppose we want to find the message with offset 6. The search proceeds as follows:
First, determine which segment file the offset falls into. Because the file names are base offsets in sorted order, binary search can be used. The first file is 00000000000000000000 and the second is 00000000000000150320, so the message with offset 6 must be in the first file.
Once the segment is found, the rest is straightforward: look up offset 6 in 00000000000000000000.index to obtain its physical position (9807 in the example diagram) in 00000000000000000000.log, then read the data from there.
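Here is a toy model of that two-step lookup, assuming a sorted list of segment base offsets and a sparse index mapping message offsets to file positions. It is not Kafka's actual code; apart from the 6 -> 9807 pair referenced above, the index entries are made up.

import java.util.TreeMap;

// A toy model of the two-step lookup described above: binary search over
// segment base offsets, then a floor lookup in a sparse index, then a
// forward scan of the .log file from that position.
public class OffsetLookup {

    // Base offsets of the segments in the partition directory (file names without padding).
    static final long[] SEGMENT_BASE_OFFSETS = {0L, 150320L};

    // Sparse index of the first segment: message offset -> physical position in the .log file.
    // Only some offsets appear, because an entry is written every N bytes, not per message.
    static final TreeMap<Long, Long> SPARSE_INDEX = new TreeMap<>();
    static {
        SPARSE_INDEX.put(0L, 0L);
        SPARSE_INDEX.put(6L, 9807L);   // the example entry referenced in this article
        SPARSE_INDEX.put(12L, 20112L); // hypothetical entry
    }

    // Step 1: binary search the sorted base offsets for the segment containing `offset`.
    static long findSegmentBaseOffset(long offset) {
        int lo = 0, hi = SEGMENT_BASE_OFFSETS.length - 1, ans = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (SEGMENT_BASE_OFFSETS[mid] <= offset) { ans = mid; lo = mid + 1; }
            else { hi = mid - 1; }
        }
        return SEGMENT_BASE_OFFSETS[ans];
    }

    // Step 2: find the closest index entry at or below `offset`; the .log file is then
    // scanned forward from that physical position until the exact message is reached.
    static long findStartingFilePosition(long offset) {
        return SPARSE_INDEX.floorEntry(offset).getValue();
    }

    public static void main(String[] args) {
        long offset = 6L;
        System.out.println("segment base offset: " + findSegmentBaseOffset(offset));     // 0
        System.out.println("scan .log from position: " + findStartingFilePosition(offset)); // 9807
    }
}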
IV. Partition Strategy
4.1 Why partition at all
Before looking at the partitioning strategies, it helps to understand why partitioning is done in the first place. There are two main reasons:
It makes the cluster easy to scale: each partition can be placed and sized to fit the machine that hosts it, and a topic can consist of multiple partitions, so the cluster as a whole can handle data of any size.
It improves concurrency, because reads and writes happen at the granularity of a partition.
4.2 Partitioning strategy
First, note that data sent by the producer must be wrapped in a ProducerRecord object. ProducerRecord provides several constructor overloads, illustrated in the sketch below.
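The constructor listing referenced here did not survive formatting, so the following is a small usage sketch based on the standard kafka-clients ProducerRecord API; the topic name csdn (from the earlier example), the partition number, and the key/value strings are placeholders chosen for illustration.

import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerRecordExamples {
    public static void main(String[] args) {
        // 1) Partition specified explicitly (topic, partition, key, value):
        //    the given partition is used as-is.
        ProducerRecord<String, String> withPartition =
                new ProducerRecord<>("csdn", 1, "user-42", "hello");

        // 2) No partition, but a key (topic, key, value):
        //    the partition is derived from the hash of the key.
        ProducerRecord<String, String> withKey =
                new ProducerRecord<>("csdn", "user-42", "hello");

        // 3) Neither partition nor key (topic, value):
        //    the client spreads records across partitions (round-robin).
        ProducerRecord<String, String> valueOnly =
                new ProducerRecord<>("csdn", "hello");

        System.out.println(withPartition);
        System.out.println(withKey);
        System.out.println(valueOnly);
    }
}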
From these constructors we can see that Kafka has three partitioning strategies:
If a partition is specified, that value is used directly as the partition.
If no partition is specified but a key is present, the partition is obtained by hashing the key and taking the result modulo the number of partitions of the topic.
If neither a partition nor a key is given, a random integer is generated on the first call (and incremented on each subsequent call), and this value modulo the number of available partitions of the topic gives the partition. This is commonly called the round-robin algorithm; a sketch of the full decision logic follows below.
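The three branches above can be summarized in a short sketch. This is a simplified illustration of the decision order only, not the real client partitioner, which uses murmur2 hashing and, in newer client versions, a "sticky" strategy for key-less records.

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of the three-way partition-selection logic described above.
public class PartitionChooser {

    // Counter seeded with a random value; incremented on every key-less record.
    private final AtomicInteger counter =
            new AtomicInteger(ThreadLocalRandom.current().nextInt());

    int choosePartition(Integer explicitPartition, byte[] keyBytes, int numPartitions) {
        if (explicitPartition != null) {
            // 1) Partition given explicitly: use it as-is.
            return explicitPartition;
        }
        if (keyBytes != null) {
            // 2) Key present: hash the key and take it modulo the partition count.
            //    (The real client uses murmur2 instead of Arrays.hashCode.)
            int hash = java.util.Arrays.hashCode(keyBytes);
            return (hash & 0x7fffffff) % numPartitions;
        }
        // 3) No partition and no key: round-robin over the available partitions.
        return (counter.getAndIncrement() & 0x7fffffff) % numPartitions;
    }
}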
The above is the editor's deep dive into the workflow, storage mechanism, and partition strategy of the Kafka architecture. If you have had similar questions, I hope the analysis above helps you work through them.