
How Kafka Works

2025-03-01 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

In this issue, the editor walks you through the working principles of Kafka. The article is rich in content and analyzed from a professional point of view; I hope you get something out of it after reading.

Source of this article's content: Lagou Education Java training camp

A few words up front

After working for a few years without studying systematically, I lacked experience with complex production scenarios, so I finally decided to sign up for Lagou's training camp. I have really learned a lot here, and beyond the coursework there are regular internal referrals and more opportunities, which have genuinely helped me improve. Special thanks to the gentle and lovely head teacher "Little Bamboo" and the serious, responsible, and handsome mentor "Old Coke" for their help!

Why do we need a message queue?

I was browsing my phone at home on the weekend when a notification suddenly popped up on an app: "To give back to our loyal customers: girlfriends, buy one get one free, today only!" Buy one get one free... I couldn't pass up a deal that good, so I ordered right away, picking two of the latest models and completing the order and payment in one go. I lay in bed contentedly, so happy at the thought of soon having a girlfriend that I couldn't sleep. The next day, while I was at work, I suddenly got a call from the delivery guy: "Are you xx? Your girlfriend has arrived. I'm downstairs, please come take the package." Me: "Uh... I'm at work. Can you deliver it this evening?" Courier: "No can do, I'm off work in the evening too." So the two of us were deadlocked for a long time. Finally the courier suggested: why don't I leave it at the Xiaofang convenience store downstairs, and you pick it up after work this evening? And the awkward standoff was resolved!

Back to the point: if there were no Xiaofang convenience store, the interaction between the courier and me would look like this:

What would happen? 1. For the sake of this girlfriend, I take leave to go home and collect it (the boss wouldn't approve it). 2. The courier keeps waiting downstairs (he has other packages to deliver). 3. Deliver it on the weekend instead (obviously I can't wait that long). 4. I give up on this girlfriend (absolutely impossible)!

After the Xiaofang convenience store appears, the interaction looks like this:

In the example above, the "delivery guy" and "me, the girlfriend buyer" are two systems that need to interact, and the Xiaofang convenience store is what this article is really about: message middleware. To sum up, the appearance of the Xiaofang convenience store (message middleware) brings the following benefits:

1. Decoupling. The courier has many packages to deliver. Without the store, he has to call each recipient one by one to confirm whether they are free and which time slot works, and only then plan his delivery route; he is completely dependent on the recipients! With even a few more packages, the courier would be overwhelmed. With a convenience store, the courier only needs to drop all packages for the same neighborhood at the same store and notify the recipients to pick them up. The courier and the recipients are now decoupled!

2. Asynchrony. After calling me, the courier would otherwise have to wait downstairs until I took my package, unable to deliver to anyone else in the meantime. Once he can leave the package at the Xiaofang convenience store, he is free to go about his other work without waiting around for me, and his efficiency improves.

3. Peak shaving. Suppose that over the Singles' Day holiday I bought all sorts of goods from different stores, and each store happened to ship with a different courier: ZTO, YTO, STO, and so on. Even more coincidentally, they all arrived at the same time! The ZTO courier called me to pick up my package at the north gate, YTO wanted me at the south gate, and STO at the east gate. For a moment I was completely frazzled...

As you can see, in scenarios where systems need to interact, message queue middleware really does bring many benefits. Building on this idea, there are services more professional than the Xiaofang convenience store, such as Fengchao lockers and Cainiao pickup stations. Finally, the story above is pure fiction.

Message queue communication modes

The example above introduced message middleware and the benefits of message queues. Next we need to introduce the two communication modes of message queues:

I. Point-to-point mode

As shown in the figure above, the point-to-point mode is usually a pull- or polling-based messaging model, characterized by each message sent to the queue being processed by one and only one consumer. After the producer puts a message into the message queue, the consumer actively pulls it for consumption. The advantage of the point-to-point model is that consumers can control the frequency at which they pull messages. The drawback is that the consumer side cannot tell whether there are messages waiting in the queue, so it needs an extra thread to monitor for them.

II. Publish-subscribe mode

As shown in the figure above, the publish-subscribe mode is a push-based messaging model, and this model can have many different subscribers. After the producer puts a message into the queue, the queue pushes it to all consumers subscribed to that kind of message (similar to WeChat official accounts). Because consumers passively receive pushes, they do not need to check whether the queue has messages waiting! However, since consumer1, consumer2, and consumer3 may run on machines of different performance, their message-processing capacity also differs, yet the message queue cannot perceive how fast each consumer consumes. The push rate therefore becomes a problem in the publish-subscribe model! Suppose the three consumers can process 8 MB/s, 5 MB/s, and 2 MB/s respectively. If the queue pushes at 5 MB/s, consumer3 cannot keep up! If it pushes at 2 MB/s, consumer1 and consumer2 waste a great deal of capacity!

Kafka

Having briefly covered why message queues are needed and their two communication modes, it is time for the protagonist of this article: Kafka! Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the activity-stream data of a consumer-scale website, with high performance, persistence, multi-replica backup, and horizontal scalability. We will skip the basic introductions here; there are plenty of them online for readers to look up on their own.

Rather than dwelling on definitions of infrastructure and terminology, let's look at the picture first and use it to sort out the related concepts and the relationships between them:

If this picture confuses you at first, don't worry! Let's go through the concepts one by one.

Producer: the producer, i.e. the sender of messages; the entry point for data.

Kafka cluster:

Broker: a broker is a Kafka instance; each server runs one or more Kafka instances. Let's assume each broker corresponds to one server. Each broker in a Kafka cluster has a unique, non-repeating id, such as broker-0, broker-1, and so on.

Topic: the subject of a message, which can be understood as a message category. Kafka's data is stored in topics, and multiple topics can be created on each broker.

Partition: a partition of a topic; each topic can have multiple partitions. Partitions spread load and improve Kafka's throughput. The data in different partitions of the same topic does not overlap, and a partition takes the form of a folder on disk!

Replication: each partition has multiple replicas, which serve as backups. When the primary partition (the leader) fails, a backup (a follower) is elected and becomes the new leader. The default maximum number of replicas in Kafka is 10, and the number of replicas cannot exceed the number of brokers. A follower and its leader are always on different machines, and for a given partition a machine can store only one replica (including the leader itself).

Message: the body of each message that is sent.

Consumer: the consumer, i.e. the receiver of messages; the exit point for data.

Consumer Group: several consumers can be grouped into a consumer group. In Kafka's design, data in one partition can be consumed by only one consumer within a consumer group. Consumers in the same group can consume data from different partitions of the same topic; this too improves Kafka's throughput!

Zookeeper: the Kafka cluster relies on ZooKeeper to store the cluster's metadata and to ensure the availability of the system.

Workflow analysis

The above introduced Kafka's basic architecture and concepts; I don't know whether you have a rough picture of Kafka after reading it, but don't worry if you are still confused. Next we will analyze Kafka's workflow against the structure diagram above, and at the end we will come back and tie everything together. I believe you will get even more out of it!

Sending data

Looking at the architecture diagram above, the producer is the entry point for data. Notice the red arrows in the picture: when writing data, the producer always finds the leader and never writes directly to a follower! So how is the leader found? And what does the write process look like? Let's look at the following picture:

The sending process is already illustrated in the figure, so we will not list the steps again in the text. One thing worth noting: after a message is written to the leader, the followers actively fetch it from the leader to synchronize! The producer publishes data to the broker in push mode, and each message is appended to its partition and written to disk sequentially, which guarantees that data within the same partition is ordered! The write process is sketched below:

It was mentioned above that data is written to different partitions, so why does Kafka partition at all? You can probably guess the main purposes: 1. Easy scaling. Because a topic can have multiple partitions, we can easily cope with growing data volumes by adding machines. 2. Higher concurrency. With the partition as the unit of reading and writing, multiple consumers can consume data at the same time, which improves message-processing efficiency.

Friends familiar with load balancing will know that when we send requests to a server, the server may balance the load and distribute the traffic across different servers. In Kafka, if a topic has multiple partitions, how does the producer know which partition to send the data to? Kafka follows several rules: 1. When writing, a partition can be specified explicitly; if so, the data is written to that partition. 2. If no partition is specified but the data has a key set, a partition is chosen by hashing the key. 3. If neither a partition nor a key is specified, a partition is chosen by round-robin.
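The three rules above can be sketched in a few lines of Python. This is a hypothetical helper, not the real client's partitioner (the actual Kafka client uses murmur2 hashing, for instance); it only illustrates the decision order: explicit partition wins, then key hash, then round-robin.

```python
import itertools
import zlib

# Shared counter used for round-robin assignment of keyless messages.
_round_robin = itertools.count()

def choose_partition(num_partitions, partition=None, key=None):
    """Pick a partition following Kafka's three rules (illustrative)."""
    if partition is not None:
        # Rule 1: the caller specified a partition explicitly.
        return partition
    if key is not None:
        # Rule 2: hash the key, so the same key always lands
        # in the same partition (CRC32 stands in for murmur2 here).
        return zlib.crc32(key.encode()) % num_partitions
    # Rule 3: no partition, no key -> round-robin across partitions.
    return next(_round_robin) % num_partitions
```

Note that rule 2 is what gives per-key ordering: because all messages with the same key go to one partition, and a partition is written sequentially, they are consumed in order.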

Guaranteeing that messages are not lost is a basic promise of message-queue middleware, so how does the producer ensure messages are not lost when writing to Kafka? This is the ACK acknowledgment mechanism shown in the write flow chart above! When the producer writes data, a parameter can be set to control whether Kafka must confirm that it received the data. This parameter can be set to 0, 1, or all.

0 means the producer sends data to the cluster without waiting for any acknowledgment from the cluster. This does not guarantee the message was sent successfully; it is the least safe but the most efficient.

1 means the producer sends the next message as soon as the leader replies; it only guarantees that the leader received the message successfully.

all means the producer waits until all followers have finished synchronizing the message from the leader before sending the next message; this guarantees that the leader received it and that every replica holds a backup. It is the safest but the least efficient.
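The three acks settings can be made concrete with a small simulation. This is not the real broker protocol, just a hypothetical function that captures when the producer may treat a send as successful under each setting:

```python
def send_considered_successful(acks, leader_received, followers_synced, total_followers):
    """Return True if the producer may treat the send as successful.

    acks=0   : fire and forget, always "successful" from the producer's view
    acks=1   : the leader's acknowledgment is enough
    acks=all : the leader plus every follower must have the message
    """
    if acks == 0:
        return True
    if acks == 1:
        return leader_received
    if acks == "all":
        return leader_received and followers_synced == total_followers
    raise ValueError("acks must be 0, 1, or 'all'")
```

The simulation shows the trade-off directly: with acks=0 a send "succeeds" even if the leader never got it, while acks=all refuses to report success until every follower has caught up.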

One last thing worth noting: if you write data to a topic that does not exist, can the write succeed? Yes: Kafka automatically creates the topic, and under the default configuration the number of partitions and replicas is 1.

Save data

After the producer writes data to Kafka, the cluster needs to save it! Kafka saves data to disk, and in our usual intuition writing to disk is a time-consuming operation, unsuitable for such a highly concurrent component. Kafka handles this by pre-allocating a dedicated stretch of disk space and always appending data to it sequentially (sequential writes are far more efficient than random writes).

Partition structure

As mentioned earlier, each topic can be divided into one or more partitions; if a topic feels abstract, a partition is a very concrete thing! A partition is represented as a folder on the server. Under each partition folder there are multiple groups of segment files, and each group contains a .index file, a .log file, and a .timeindex file (absent in earlier versions). The .log file is where the messages are actually stored, while the .index and .timeindex files are index files used to retrieve messages.

As shown in the figure above, this partition has three groups of segment files. Each segment file has the same size limit, but the number of messages stored is not necessarily the same (messages vary in size). Each file is named after the minimum offset of its segment; for example, 000.index covers the messages with offsets 0 through 368795. By combining segmentation with indexing, Kafka solves the problem of lookup efficiency.

Message structure

It was mentioned above that the .log file is where messages are actually stored, and what we write into Kafka from the producer are messages, so what does a message stored in the log look like? A message mainly includes the message body, message size, offset, compression type... and so on! The key things to know are the following three:

1. Offset: the offset is an ordered id number occupying 8 bytes, which uniquely determines the position of each message within the partition!

2. Message size: the message size occupies 4 bytes and describes how large the message is.

3. Message body: the message body stores the actual message data (possibly compressed), and the space it occupies varies with the message.
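The three fields above can be illustrated with Python's struct module: an 8-byte big-endian offset, a 4-byte size, then the body. Note this is a simplification for intuition only; real Kafka log records carry additional fields (CRC, timestamps, headers, and so on).

```python
import struct

# ">qi" = big-endian 8-byte signed offset + 4-byte signed size (12 bytes total)
HEADER = ">qi"
HEADER_LEN = struct.calcsize(HEADER)  # 12

def pack_record(offset, body):
    """Serialize one simplified log record: offset + size + body."""
    return struct.pack(HEADER, offset, len(body)) + body

def unpack_record(buf):
    """Parse one simplified log record back into its three fields."""
    offset, size = struct.unpack_from(HEADER, buf, 0)
    body = buf[HEADER_LEN:HEADER_LEN + size]
    return offset, size, body
```

Because the size field precedes the body, a reader can walk the log sequentially: read 12 bytes, learn the body length, skip that many bytes, and land on the next record.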

Storage strategy

Kafka keeps all messages, whether or not they have been consumed. So what strategies does it have for deleting old data?

1. Time-based: the default configuration is 168 hours (7 days).

2. Size-based: the default configuration is 1073741824 bytes (1 GB).

It is worth noting that the time complexity of reading a specific message in Kafka is O(1), so deleting expired files does not improve Kafka's read performance!

Consumption data

Once messages are stored in the log files, consumers can consume them. As with producing messages, consumers also pull messages from the leader.

Multiple consumers can form a consumer group, and each consumer group has a group id! Consumers in the same group can consume data from different partitions of the same topic, but multiple consumers in one group cannot consume data from the same partition! Sounds a little convoluted? Let's look at the following picture:

The figure shows the case where the number of consumers in the group is smaller than the number of partitions, so one consumer ends up consuming multiple partitions, and it consumes more slowly than a consumer that handles only one partition! Conversely, if the group has more consumers than partitions, will several consumers consume the same partition? As stated above, this will not happen! The surplus consumers simply consume no partition data at all. Therefore, in practice it is recommended that the number of consumers in a group equal the number of partitions!
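The assignment rule described above can be sketched as follows. This is a simplified round-robin spread, not Kafka's actual rebalancing protocol (which offers range, round-robin, and sticky assignors), but it shows both cases: fewer consumers than partitions, and surplus consumers left idle.

```python
def assign_partitions(consumers, num_partitions):
    """Spread partitions over a consumer group, one owner per partition.

    Each partition gets exactly one consumer from the group;
    a consumer may own several partitions; surplus consumers get none.
    """
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        owner = consumers[p % len(consumers)]
        assignment[owner].append(p)
    return assignment
```

With 2 consumers and 3 partitions, one consumer owns two partitions; with 3 consumers and 2 partitions, the third consumer sits idle, which is why matching consumer count to partition count is the usual recommendation.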

In the section on saving data, we said that a partition is divided into groups of segments, each segment contains .log, .index, and .timeindex files, and each message carries an offset, a message size, and a message body. We have mentioned segments and offsets many times; when looking up a message, how do we use segment + offset to find it? Say we need to find the message with offset 368801. Let's look at the following picture:

1. First find the segment file that contains the message with offset 368801 (located by binary search); here it is the second segment file. 2. Open that segment's .index file (that is, the 368796.index file; its base offset is 368796, so the message we want, with offset 368801, has a relative offset of 368801 - 368796 = 5 within this index). Because the .index file uses a sparse index to store the mapping from relative offsets to the physical positions of messages in the log, there may be no entry with relative offset exactly 5. So we again use binary search to find the largest index entry whose relative offset is less than or equal to the target, and we find the entry with relative offset 4.

3. From the index entry with relative offset 4, we learn that the physical position of that message in the log file is 256. We open the data file and scan sequentially from position 256 until we find the message with offset 368801.
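The three lookup steps above can be modeled in miniature. This is a toy in-memory model, not Kafka's file format: segments are tuples of (base offset, sparse index, message list), and the numbers mirror the example in the text.

```python
import bisect

def find_message(segments, target_offset):
    """Locate a message by offset, mimicking Kafka's three lookup steps.

    segments: sorted list of (base_offset, sparse_index, messages), where
    sparse_index maps a relative offset -> position in the messages list,
    and messages is a list of (offset, body) pairs.
    """
    # Step 1: binary-search the segment base offsets for the segment
    # whose base offset is <= target.
    bases = [s[0] for s in segments]
    base, index, messages = segments[bisect.bisect_right(bases, target_offset) - 1]
    rel = target_offset - base
    # Step 2: binary-search the sparse index for the largest entry
    # whose relative offset is <= the target's relative offset.
    keys = sorted(index)
    nearest = keys[bisect.bisect_right(keys, rel) - 1]
    # Step 3: sequential scan forward from the indexed position.
    for off, body in messages[index[nearest]:]:
        if off == target_offset:
            return body
    return None
```

Even with an index entry only every few messages, the sequential scan in step 3 is short, which is exactly the trade-off a sparse index makes: a smaller index in exchange for a brief final scan.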

This mechanism is built on ordered offsets: segments + ordered offsets + a sparse index + binary search + sequential scan combine to locate data efficiently! At this point, the consumer has the data it needs to process. So how does each consumer record its own consumption position? In earlier versions, consumers stored their consumed offsets in ZooKeeper, reporting them at fixed intervals, which easily led to repeated consumption and performed poorly! In newer versions, the offsets consumed by consumers are maintained directly in the Kafka cluster's __consumer_offsets topic!

This is what the editor has to share about how Kafka works. If you happen to have similar doubts, you may refer to the analysis above. If you want to learn more, you are welcome to follow the industry information channel.
