
Kafka-4: Kafka Workflow and File Storage Mechanism


This article introduces Kafka's workflow and file storage mechanism: how partitions, segments, and index files are organized, the producer partitioning strategy, replica synchronization (LEO and HW), Exactly Once semantics and idempotence, why Kafka reads and writes efficiently, the role of Zookeeper, and Kafka transactions.

Kafka Workflow

Messages in Kafka are organized by topic: producers produce messages to topics and consumers consume messages from topics, so both sides are topic-oriented. A topic is a logical concept, while a partition is a physical one. Each partition corresponds to a log file, which stores the data produced by producers. Data produced by a producer is continually appended to the end of the log file, and each record is assigned its own offset. Each consumer in a consumer group records, in real time, the offset up to which it has consumed, so that after a failure it can resume consumption from that position.
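As a minimal illustration of how offsets drive consumption, here is a sketch using the standard Java client; the broker address, topic name "first", and group id are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "demo-group");              // consumer group that tracks its own offsets
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("first"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // Each record carries the partition and the offset at which it was appended.
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
            // Committing records the group's position so consumption resumes here after a restart.
            consumer.commitSync();
        }
    }
}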

Kafka file storage mechanism

Since the messages produced by producers are continually appended to the end of the log file, Kafka uses a segmentation and indexing mechanism to keep any one file from growing so large that locating data becomes inefficient: each partition is divided into multiple segments, and each segment corresponds to two files, an ".index" file and a ".log" file. These files live in a folder named after the topic name plus the partition number. For example, if the topic "first" has three partitions, the corresponding folders are first-0, first-1, and first-2.

00000000000000000000.index
00000000000000000000.log
00000000000000170410.index
00000000000000170410.log
00000000000000239430.index
00000000000000239430.log

The index and log files are named after the offset of the first message in the segment (its base offset). The structure of the index and log files is as follows.

The ".index" file stores a lot of index information, the ".log" file stores a lot of data, and the metadata in the index file points to the physical offset address of the message in the corresponding data file.

Kafka Producer Partition Strategy

1. Reasons for partitioning

It makes it easy to scale out in a cluster: each partition can be sized to fit the machine that hosts it, and a topic can be composed of multiple partitions, so the cluster as a whole can accommodate data of any size;

It improves concurrency, because reads and writes can be performed at the granularity of a partition.

2. Partitioning principles

The data a producer sends must be wrapped in a ProducerRecord object. Which partition a record goes to depends on how the ProducerRecord is constructed: if a partition is specified explicitly, that partition is used; otherwise, if a key is given, the partition is derived from a hash of the key; if neither is given, the producer picks a partition itself (round-robin in older client versions, sticky partitioning in newer ones), as sketched below.
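A minimal sketch of the three constructor forms using the standard Java client (the topic name "first" and broker address are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionStrategyDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // 1. Partition given explicitly: the record goes to partition 0 of topic "first".
            producer.send(new ProducerRecord<>("first", 0, "key", "value-to-partition-0"));
            // 2. Key given, no partition: the partition is derived from a hash of the key.
            producer.send(new ProducerRecord<>("first", "key", "value-by-key-hash"));
            // 3. Neither partition nor key: the producer chooses a partition itself.
            producer.send(new ProducerRecord<>("first", "value-chosen-by-producer"));
        }
    }
}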

Replication Data Synchronization Policy

LEO (Log End Offset): the maximum offset of each replica;

HW (High Watermark): the largest offset visible to consumers, equal to the smallest LEO in the ISR.
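As a worked example with hypothetical numbers: if the ISR contains three replicas whose LEOs are 10, 8, and 7, then HW = min(10, 8, 7) = 7, so consumers can only read up to that watermark even though the leader already holds later messages.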

(1) Follower failure: a follower that fails is temporarily removed from the ISR. After it recovers, the follower reads the last HW it recorded on local disk, truncates the part of its log file above that HW, and synchronizes from the leader starting at the HW. Once the follower's LEO is greater than or equal to the partition's HW, that is, once the follower has caught up with the leader, it can rejoin the ISR.

(2) Leader failure: after the leader fails, a new leader is elected from the ISR. Then, to keep the data consistent across replicas, the remaining followers first truncate the parts of their log files above the HW and then synchronize data from the new leader.

Note: This only guarantees data consistency between replicas, and does not guarantee that data is not lost or duplicated.
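To make the two failure cases concrete with hypothetical numbers: suppose a follower records HW = 8 but has written up to LEO = 12 when it fails. On recovery it truncates offsets 8 through 11 and re-fetches from the leader starting at offset 8, rejoining the ISR once its LEO reaches the partition's HW. If instead the leader fails, the newly elected leader may itself hold data only up to the HW, so the surviving followers truncate to the HW before syncing; the replicas end up consistent, but messages above the old HW may be lost.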

Exactly Once semantics

Setting the producer's acks parameter to -1 ensures that no data is lost between the producer and the server, i.e. At Least Once semantics. Conversely, setting acks to 0 guarantees that the producer sends each message only once, i.e. At Most Once semantics.
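A minimal sketch of the corresponding producer configuration (property names are the standard Java client ones; the broker address is a placeholder):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class AckLevelDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // acks=-1 (equivalently "all"): wait for the leader and all ISR replicas -> At Least Once.
        props.put("acks", "all");
        // acks=0 would mean "do not wait for any acknowledgement" -> At Most Once.
        // props.put("acks", "0");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... send records as usual
        }
    }
}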

At Least Once guarantees that data is not lost, but it cannot guarantee that data is not duplicated; conversely, At Most Once guarantees that data is not duplicated, but it cannot guarantee that data is not lost. However, for some very important information, such as transactional data, downstream consumers require data that is neither duplicated nor lost, i.e. Exactly Once semantics. Before version 0.11, Kafka could do nothing about this: users could only ensure that data was not lost and then deduplicate it globally in the downstream consumers. With multiple downstream applications, each must do its own global deduplication, which has a significant impact on performance.

Version 0.11 of Kafka introduced an important feature: idempotence. Idempotence means that no matter how many times the producer sends duplicate data to the server, the server persists only one copy. Idempotence combined with At Least Once semantics constitutes Kafka's Exactly Once semantics. That is:

At Least Once + idempotence = Exactly Once

To enable idempotence, simply set enable.idempotence to true in the producer configuration. Kafka's idempotence implementation essentially moves the deduplication that would otherwise be done downstream into the upstream. An idempotent producer is assigned a PID when it is initialized, and every message sent to a given partition carries a sequence number. The broker caches these, and when messages arrive with the same <PID, Partition, SeqNumber> primary key, it persists only one.
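A minimal sketch of turning this on with the Java client (the broker address is a placeholder):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class IdempotentProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Enable idempotence; the client then requires acks=all and retries,
        // giving At Least Once + idempotence = Exactly Once within a partition and session.
        props.put("enable.idempotence", "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... send records; duplicates caused by retries are discarded by the broker
        }
    }
}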

However, the PID changes when the producer restarts, and different partitions have different primary keys, so idempotence cannot guarantee Exactly Once across partitions or across sessions.

Kafka Efficient Reads and Writes

1) Sequential disk writes

When a Kafka producer produces data, the data is written to the log file, and the write always appends to the end of the file, i.e. it is a sequential write. According to the official documentation, on the same disk sequential writes can reach about 600 MB/s while random writes reach only about 100 KB/s. This is related to the mechanics of the disk: sequential writing is fast because it saves a great deal of head-seek time.

2) Zero-copy technology

Kafka also relies on the operating system's zero-copy mechanism, which moves data from the page cache to the network socket without copying it through user space.

Zookeeper's role in Kafka

One broker in the Kafka cluster is elected as the Controller, which is responsible for managing brokers going online and offline, assigning partition replicas for all topics, and leader election.

The Controller's management work depends on Zookeeper. Partition leader election is likewise coordinated by the Controller through Zookeeper.

Kafka Transactions

Kafka has supported transactions since version 0.11. Transactions guarantee that, on the basis of Exactly Once semantics, production and consumption across partitions and sessions either all succeed or all fail.

Producer transactions

To implement transactions across partitions and sessions, a globally unique Transaction ID is introduced, and the PID obtained by the producer is bound to this Transaction ID. This way, when the producer restarts, it can recover its original PID through the ongoing Transaction ID. To manage transactions, Kafka introduces a new component, the Transaction Coordinator. The producer obtains the state of the transaction corresponding to its Transaction ID by interacting with the Transaction Coordinator. The Transaction Coordinator is also responsible for writing all transaction state to an internal topic in Kafka, so that even if the entire service restarts, in-progress transaction state can be recovered from that topic and the transaction can continue.
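A minimal sketch of the producer transaction API in the Java client (the transactional.id value, topic name, and broker address are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionalProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // placeholder broker address
        props.put("transactional.id", "demo-transaction-1"); // globally unique Transaction ID
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Registers the Transaction ID with the Transaction Coordinator and binds it to a PID.
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("first", "k1", "v1"));
                producer.send(new ProducerRecord<>("first", "k2", "v2"));
                producer.commitTransaction(); // both records become visible, or neither does
            } catch (Exception e) {
                producer.abortTransaction();  // roll back everything sent in this transaction
            }
        }
    }
}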

Consumer transactions

The transaction mechanism above is designed mainly from the producer's side. For consumers, the transactional guarantee is relatively weak; in particular, it cannot guarantee that committed messages are consumed exactly once. This is because consumers can access messages at arbitrary offsets, different segment files have different life cycles, and messages belonging to the same transaction may already have been deleted by the time the consumer restarts.
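For completeness, a consumer that should only see committed transactional messages can set isolation.level to read_committed in the standard Java client; a minimal sketch (group id, topic, and broker address are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReadCommittedConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "demo-group");              // placeholder group id
        props.put("isolation.level", "read_committed");   // skip records from aborted transactions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("first"));
            consumer.poll(Duration.ofSeconds(1)).forEach(r ->
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}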

This concludes the discussion of Kafka's workflow and file storage mechanism. The best way to consolidate the theory above is to try it out in practice.
