2025-01-16 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 05/31 Report
How does Kafka store its log on disk? This article analyzes the question in detail, in the hope of helping readers who want to understand Kafka's log storage find a simple, practical answer.
Introduction
Messages in Kafka are organized with the topic as the basic unit, and different topics are independent of each other. Each topic can be divided into several partitions (the number of partitions is specified when the topic is created), and each partition stores a portion of the messages. Borrowing an official picture, you can see the relationship between topic and partition directly.
A partition is stored in the file system as a set of files. For example, if you create a topic called page_visits with five partitions, there will be five directories in the Kafka data directory (specified by log.dirs in the configuration file): page_visits-0, page_visits-1, page_visits-2, page_visits-3, page_visits-4. The naming rule is <topic_name>-<partition_id>, and these directories store the data of the five partitions respectively.
Next, this article will analyze the storage format of the files in the partition directory and the location of the related code.
Partition data file
Each message in a partition is identified by its offset within that partition. This offset is not the actual storage location of the message in the partition's data file, but a logical value that uniquely identifies a message in the partition. Therefore, the offset can be thought of as the id of a message within a partition. Each message in a partition consists of the following three attributes:
Offset
MessageSize
Data
Offset is an int64, MessageSize is an int32 indicating how large the data is, and data is the actual content of the message. Its format is consistent with the MessageSet format introduced in the Kafka wire protocol.
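The entry layout described above can be sketched with Python's struct module. The packing below (big-endian int64 offset plus int32 size, followed by the payload) illustrates the format only; the helper names are ours, not Kafka's actual implementation.

```python
import struct

# Sketch of the on-disk entry layout described above:
# 8-byte offset + 4-byte message size + payload. Illustrative only.
LOG_OVERHEAD = 12  # 8 (offset) + 4 (size)

def pack_entry(offset: int, data: bytes) -> bytes:
    """Serialize one entry: big-endian int64 offset, int32 size, then payload."""
    return struct.pack(">qi", offset, len(data)) + data

def unpack_entry(buf: bytes, pos: int):
    """Read the entry at pos; return (offset, payload, position of next entry)."""
    offset, size = struct.unpack_from(">qi", buf, pos)
    start = pos + LOG_OVERHEAD
    return offset, buf[start:start + size], start + size
```

A file holding such entries back to back, sorted by offset, is exactly the partition data file described next.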
The data file of a partition contains a sequence of messages in the above format, arranged in increasing order of offset. Its implementation class is FileMessageSet, and the class diagram is as follows:
Its main methods are as follows:
append: writes the messages from the given ByteBufferMessageSet to the end of this data file.
searchFor: searches from the specified startingPosition for the first message whose offset is greater than or equal to the specified offset, and returns its position in the file. It works by reading 12 bytes from startingPosition, which are the offset and size of the current message. If that offset is less than the target offset, it moves the position forward by LogOverhead + MessageSize (where LogOverhead is the offset field plus the size field, 12 bytes in total).
read: a more accurate name would be slice; it takes a portion of the file and returns a new FileMessageSet. It does not guarantee the integrity of the data at the slice boundaries.
sizeInBytes: returns how many bytes of space this FileMessageSet occupies.
truncateTo: truncates the file to the given size. This method does not guarantee the integrity of the message at the truncation point.
readInto: reads the contents of the file into the given ByteBuffer, starting from the specified relative position.
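The searchFor scan described above can be sketched as follows, assuming the 12-byte header (int64 offset, int32 size) layout; the in-memory buffer and function name are illustrative, not Kafka's code.

```python
import struct

LOG_OVERHEAD = 12  # 8-byte offset + 4-byte message size

def search_for(buf: bytes, target_offset: int, starting_position: int):
    """Scan forward from starting_position; return (offset, position) of the
    first entry whose offset >= target_offset, or None if we reach the end."""
    pos = starting_position
    while pos + LOG_OVERHEAD <= len(buf):
        offset, size = struct.unpack_from(">qi", buf, pos)
        if offset >= target_offset:
            return offset, pos
        pos += LOG_OVERHEAD + size  # skip header and payload
    return None
```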
Let's think about what happens if a partition has only one data file.
Appending new data to the end of the file (via FileMessageSet's append method) is always O(1), no matter how large the data file is.
Finding the message with a given offset (via FileMessageSet's searchFor method) requires a sequential scan. Therefore, if the data file is very large, the search is inefficient.
So how does Kafka solve this search-efficiency problem? With two techniques: 1) segmentation and 2) indexing.
Segmentation of data files
One way Kafka improves query efficiency is by segmenting data files. Suppose there are 100 messages, with offsets ranging from 0 to 99, and the data file is divided into five segments: the first covers offsets 0-19, the second 20-39, and so on. Each segment is stored in a separate data file, named after the smallest offset in that segment. When looking for the message with a given offset, a binary search can then locate which segment the message is in.
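Locating the right segment is a binary search over the segments' base offsets (which are also their file names). A minimal sketch, using hypothetical base offsets matching the example above:

```python
import bisect

# Base offsets of the five hypothetical segments from the example.
segment_base_offsets = [0, 20, 40, 60, 80]  # must be sorted

def find_segment(offset: int) -> int:
    """Return the base offset of the segment that contains `offset`:
    the largest base offset that is <= offset."""
    i = bisect.bisect_right(segment_base_offsets, offset) - 1
    return segment_base_offsets[i]
```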
Index data files
Data file segmentation makes it possible to find the Message of the corresponding offset in a smaller data file, but this still requires sequential scans to find the Message of the corresponding offset. In order to further improve the efficiency of lookup, Kafka establishes an index file for each segmented data file, and the file name is the same as the name of the data file, except that the file extension is .index.
The index file contains several index entries, each representing the index of a message in the data file. Each entry consists of two parts (both 4-byte numbers): the relative offset and the position.
Relative offset: after the data file is segmented, the starting offset of each data file is no longer 0. The relative offset is the message's offset relative to the smallest offset of the data file it belongs to. For example, if a segment's data file starts at offset 20, then a message with offset 25 is recorded in the index file with a relative offset of 25 - 20 = 5. Storing relative offsets reduces the space occupied by the index file.
Position: the absolute position of the message in the data file. Simply open the file and move the file pointer to this position to read the corresponding message.
The index file does not index every message in the data file; instead it uses sparse storage, creating one index entry for every certain number of bytes of data. This prevents the index file from taking up too much space, so the whole index file can be kept in memory. The drawback is that a message that is not indexed cannot be located in the data file in one step; a sequential scan is needed, but the range of this scan is very small.
In Kafka, the implementation class of the index file is OffsetIndex, and its class diagram is as follows:
The main methods are:
append: adds an offset/position pair to the index file, converting the offset to a relative offset.
lookup: uses binary search to find the largest offset that is less than or equal to the given offset.
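A sketch of that lookup logic over hypothetical (relative offset, position) entries; it mimics the floor search described above but is not Kafka's OffsetIndex code, and the entry values are made up for illustration.

```python
import bisect

# Hypothetical sparse index: (relative_offset, file_position) pairs,
# sorted by relative offset.
index_entries = [(0, 0), (3, 150), (6, 300), (9, 450)]

def lookup(relative_offset: int):
    """Return the entry with the largest offset <= relative_offset."""
    offsets = [entry[0] for entry in index_entries]
    i = bisect.bisect_right(offsets, relative_offset) - 1
    return index_entries[max(i, 0)]
```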
Summary
Let's use a few diagrams to summarize how Message is stored in Kafka and how to find the Message of a specified offset.
Messages are organized by topic, and each topic can be divided into multiple partitions. For example, a topic with five partitions has the following directory structure:
Partition is segmented, and each segment is called LogSegment, including a data file and an index file. The following figure shows the files in a partition directory:
As you can see, this partition has four LogSegments.
Borrowing a picture from blogger @lizhitao's blog, here is how a message is found.
For example: to find a Message with an absolute offset of 7:
First, binary search is used to determine which LogSegment it is in; in this case, it is the first segment.
Then open that segment's index file and use binary search to find the entry with the largest offset less than or equal to the target offset. The index entry with offset 6 is the one we want. From the index file, we learn that the message with offset 6 is located at position 9807 in the data file.
Open the data file and scan sequentially from location 9807 until you find the Message with offset 7.
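The three steps above can be combined into a single end-to-end sketch over a toy in-memory log. Everything here (the segment layout, the three-byte payload, an index entry every third message) is an illustrative assumption, not Kafka's actual implementation.

```python
import bisect
import struct

LOG_OVERHEAD = 12  # 8-byte offset + 4-byte size

def make_segment(base_offset: int, count: int, payload: bytes = b"xyz"):
    """Build a toy segment: a packed data file plus a sparse index."""
    data = b"".join(struct.pack(">qi", base_offset + i, len(payload)) + payload
                    for i in range(count))
    step = LOG_OVERHEAD + len(payload)
    # Sparse index: one (relative_offset, position) entry every 3rd message.
    index = [(i, i * step) for i in range(0, count, 3)]
    return {"index": index, "data": data}

segments = {0: make_segment(0, 20), 20: make_segment(20, 20)}
base_offsets = sorted(segments)

def find_message(target_offset: int):
    """Return the position of target_offset in its segment's data file."""
    # 1) Binary search over segment base offsets.
    base = base_offsets[bisect.bisect_right(base_offsets, target_offset) - 1]
    segment = segments[base]
    relative = target_offset - base
    # 2) Binary search the sparse index for the floor entry.
    offsets = [entry[0] for entry in segment["index"]]
    _, pos = segment["index"][bisect.bisect_right(offsets, relative) - 1]
    # 3) Sequential scan of the data file from that position.
    buf = segment["data"]
    while pos + LOG_OVERHEAD <= len(buf):
        offset, size = struct.unpack_from(">qi", buf, pos)
        if offset == target_offset:
            return pos
        pos += LOG_OVERHEAD + size
    return None
```

Note how the scan in step 3 starts not at the beginning of the file but at the position the sparse index supplies, which is what keeps the sequential part short.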
This mechanism relies on offsets being ordered. The index file is mapped into memory, so lookups in it are very fast.
In short, Kafka's message storage achieves high efficiency through partitioning, segmentation into LogSegments, and sparse indexing.
That concludes this analysis of Kafka's log storage. I hope the above content has been of some help to you.