2025-02-27 Update From: SLTechnology News&Howtos
Today I would like to talk with you about how Kafka works. Many people may not know much about it, so I have put together the following summary; I hope you get something out of this article.
Why we need message queues
Back to the point: if there were no Xiaofang convenience store, the interaction between the courier and me would look like this:
What would happen then?
For the sake of this gift, I take leave from work to go home and sign for it (the boss won't approve it).
The courier keeps waiting downstairs for me (he has other packages to deliver).
He redelivers it on the weekend (obviously it can't wait that long).
I give up on the girlfriend altogether (absolutely impossible)!
After the Xiaofang convenience store appears, the interaction looks like this instead:
In the example above, the "courier" and "me, who bought a gift for my girlfriend" are the two systems that need to interact, and the Xiaofang convenience store is the message middleware this article is about.
To sum up, the appearance of the Xiaofang convenience store (message middleware) brings the following benefits:
Decoupling
The courier has many packages to deliver. Without the store, for every package he has to phone the consignee, confirm whether they are free and which time slot works, and only then plan the delivery; he depends entirely on the consignee! With more than a few packages, the courier would be run off his feet.
With a convenience store, the courier only needs to drop all the packages for a neighborhood at the same store and then notify the consignees to pick them up. At that point, the courier and the consignees are decoupled!
Async
After calling me, the courier has to wait downstairs and cannot deliver to anyone else until I come down for my package.
If he drops the package at the Xiaofang convenience store instead, he can get on with other work; there is no need to stand around waiting for me to arrive, which improves his efficiency.
Peak clipping
Suppose that on Singles' Day I bought all kinds of goods from different stores, and by coincidence they were shipped by different couriers, such as Zhongtong, Yuantong, and Shentong. Even more coincidentally, they all arrived at the same time!
The Zhongtong courier called me to pick up a package at the north gate, Yuantong asked me to go to the south gate, and Shentong asked me to go to the east gate. I was frantic...
As we can see, in scenarios where systems need to interact, message-queue middleware really brings many benefits. Based on this idea, there are services more professional than the Xiaofang convenience store, such as Fengchao smart lockers and Cainiao Post Stations.
(PS: the story above is pure fiction.)
Mode of message queuing communication
Through the example above we introduced message middleware and the benefits of message queues. Now we need to introduce the two communication modes of message queues:
Point-to-point mode
As shown in the figure above, the point-to-point mode is usually based on a pull or polling messaging model; its defining trait is that each message sent to the queue is processed by one and only one consumer.
After the producer puts a message into the queue, the consumer actively pulls it for consumption. The advantage of the point-to-point mode is that consumers control the frequency at which they pull messages for themselves.
However, the consumer side cannot tell whether there are messages waiting in the queue, so it needs extra threads to monitor the queue.
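As a rough illustration (plain Python with an in-memory queue, nothing Kafka-specific; all names here are made up for the sketch), the pull model looks like this: the producer deposits messages, and the consumer decides when and how often to poll.

```python
import queue

# Minimal sketch of the point-to-point pull model: one queue, one consumer
# that polls at its own pace. Each message is delivered to exactly one
# consumer: once pulled, it is gone from the queue.
q = queue.Queue()

# Producer side: put messages into the queue.
for i in range(3):
    q.put(f"msg-{i}")

# Consumer side: actively pull until the queue is empty.
consumed = []
while True:
    try:
        consumed.append(q.get_nowait())
    except queue.Empty:
        break  # nothing left; a real consumer would sleep and poll again

print(consumed)  # ['msg-0', 'msg-1', 'msg-2']
```

Note that the consumer must poll: the queue never notifies it, which is exactly why the text says extra monitoring threads are needed on the consumer side.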
Publish and subscribe model
As shown in the figure above, the publish-subscribe mode is a push-based messaging model that can have many different subscribers.
After the producer puts a message into the queue, the queue pushes the message to every consumer that has subscribed to that kind of message (similar to a WeChat official account).
Because consumers passively receive pushes, they do not need to check whether the queue has messages waiting! However, since Consumer1, Consumer2, and Consumer3 run on machines with different performance, their capacity to process messages also differs, yet the queue cannot perceive how fast each consumer consumes!
So the push speed becomes a problem in the publish-subscribe mode! Suppose the processing speeds of the three consumers are 8 MB/s, 5 MB/s, and 2 MB/s respectively. If the queue pushes at 5 MB/s, Consumer3 cannot keep up!
If the queue pushes at 2 MB/s instead, Consumer1 and Consumer2 waste a great deal of capacity!
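To make the mismatch concrete, here is a back-of-the-envelope calculation (plain Python; the rates are the illustrative numbers from the text): at a fixed push rate, any consumer slower than the push rate accumulates a backlog that grows linearly.

```python
# Push-rate mismatch in the publish-subscribe model (illustrative numbers).
push_rate = 5.0                      # queue pushes at 5 MB/s
consumer_rates = {"Consumer1": 8.0,  # MB/s each consumer can process
                  "Consumer2": 5.0,
                  "Consumer3": 2.0}

# Backlog growth per second: whatever a consumer cannot keep up with piles up.
backlog_growth = {name: max(0.0, push_rate - rate)
                  for name, rate in consumer_rates.items()}

print(backlog_growth)  # Consumer3 falls behind by 3 MB every second
```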
Kafka
The sections above briefly explained why message queues are needed and the two communication modes of message queues; now it is time for our protagonist, Kafka, to make its debut!
Kafka is a high-throughput, distributed publish-subscribe messaging system that can handle all the activity stream data of a consumer-scale website, with high performance, persistence, multi-replica backup, and horizontal scaling.
Infrastructure and terminology
Without further ado, let's look at the picture first; through it we can sort out the related concepts and the relationships between them:
If the picture confuses you, don't worry! Let's first go through the concepts one by one:
Producer: the message producer; the entry point of messages into Kafka.
Kafka Cluster:
Broker: a Kafka instance. Each server can run one or more Kafka instances; for simplicity we assume each Broker corresponds to one server.
Each Broker in a Kafka cluster has a unique, non-repeating number, such as Broker-0, Broker-1, and so on.
Topic: the subject of a message, which can be understood as a category of messages. Kafka's data is stored in Topics, and multiple Topics can be created on each Broker.
Partition: a partition of a Topic. Each Topic can have multiple partitions, whose role is to spread load and increase Kafka's throughput.
The data in different partitions of the same Topic is never duplicated, and on disk each Partition takes the form of a folder!
Replication: each partition has multiple replicas that serve as backups. When the primary replica (Leader) fails, a backup (Follower) is elected to become the new Leader.
In Kafka the default maximum number of replicas is 10, and the number of replicas cannot exceed the number of Brokers. A Follower and its Leader always sit on different machines, and a single machine can store only one replica of a given partition (the Leader counts as one).
Message: the body of each message that is sent.
Consumer: the message consumer; the exit point of messages from Kafka.
Consumer Group: multiple consumers can be grouped into one consumer group. In Kafka's design, data from a single partition can be consumed by only one consumer within a consumer group.
Consumers in the same group can consume data from different partitions of the same Topic in parallel, which also improves Kafka's throughput!
Zookeeper: the Kafka cluster relies on Zookeeper to store cluster metadata and ensure the availability of the system.
Workflow analysis
The above covered Kafka's infrastructure and basic concepts. I don't know whether you have a general impression of Kafka after reading it; if you are still confused, that's fine!
Next we will analyze Kafka's workflow using the architecture diagram above, and at the end we will come back and go through it all again. I'm sure you will get more out of it!
Send data
As the architecture diagram shows, Producer is the producer and the entry point for data. Notice the red arrows in the picture: when writing data, Producer always looks for the Leader; it never writes data directly to a Follower!
So how is the Leader found? What does the write process look like? Let's look at the following picture:
The sending process is already illustrated in the figure, so it is not repeated in the text. One thing worth noting is that after a message is written to the Leader, the Followers actively go to the Leader to synchronize it!
Producer publishes data to the Broker in push mode, and each message is appended to its partition and written to disk sequentially, which guarantees that data within the same partition is ordered!
The write process is sketched below:
We mentioned above that data is written to different partitions, so why does Kafka partition at all? You can probably guess that the main purposes of partitioning are:
Easy scaling. Because a Topic can have multiple Partitions, we can easily cope with growing data volumes by adding machines.
Higher concurrency. With the Partition as the unit of reading and writing, multiple consumers can consume data at the same time, which improves message-processing efficiency.
Friends familiar with load balancing will know that when we send requests to a server, the server may balance the load and distribute the traffic across different servers.
Similarly in Kafka: if a Topic has multiple Partitions, how does Producer know which Partition to send data to?
Kafka follows several rules:
If a Partition is specified when writing, the message is written to that Partition.
If no Partition is specified but the message's Key is set, a Partition is chosen by hashing the value of the Key.
If neither a Partition is specified nor a Key is set, a Partition is chosen by round-robin polling.
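The three rules above can be sketched in a few lines of Python. This is a simplified stand-in, not Kafka's real partitioner (which, for reference, hashes key bytes with murmur2); here a generic CRC32 hash is used purely for illustration, and all names are made up.

```python
import itertools
import zlib

# Simplified partitioner following the three rules above (illustration only;
# Kafka's actual default partitioner uses murmur2 on the key bytes).
_round_robin = itertools.count()

def choose_partition(num_partitions, partition=None, key=None):
    # Rule 1: an explicitly specified partition wins.
    if partition is not None:
        return partition
    # Rule 2: otherwise, hash the key so the same key always lands
    # in the same partition (preserving per-key ordering).
    if key is not None:
        return zlib.crc32(key.encode()) % num_partitions
    # Rule 3: neither given -> round-robin across all partitions.
    return next(_round_robin) % num_partitions

print(choose_partition(4, partition=2))          # 2: explicit partition
print(choose_partition(4, key="user-42"))        # deterministic for this key
print(choose_partition(4), choose_partition(4))  # 0 1: round-robin
```

The key-hash rule is what lets all messages for one entity (say, one user id) stay ordered, since they all land in the same partition.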
Guaranteeing that messages are not lost is a basic duty of message-queue middleware, so how does Producer guarantee this when writing messages to Kafka?
In fact, it is through the ACK response mechanism shown in the write flow chart above! When the producer writes data to the queue, a parameter can be set to decide whether Kafka must confirm receipt of the data. The parameter can take the values 0, 1, and all:
0 means Producer sends the next message without waiting for any reply from the cluster, which does not ensure that a message was sent successfully. This is the least safe but the most efficient setting.
1 means Producer sends the next message as soon as the Leader replies; it only ensures that the Leader received the message successfully.
all means Producer waits for all Followers to finish synchronizing from the Leader before sending the next message; it ensures that the Leader received the message and that all replicas hold a backup. This is the safest setting, but the least efficient.
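A toy simulation (pure Python, no real broker or network; class and return values are invented for the sketch) may help show where each acks level hands control back to the producer:

```python
# Toy model of the acks levels (illustration only, not Kafka's protocol).
class MiniCluster:
    def __init__(self, num_followers=2):
        self.leader = []                              # leader's partition log
        self.followers = [[] for _ in range(num_followers)]

    def send(self, msg, acks):
        if acks == 0:
            # Fire and forget: return before anyone confirms storage.
            self.leader.append(msg)   # in reality this may silently fail
            return "no-ack"
        self.leader.append(msg)       # leader has it -> acks=1 returns here
        if acks == 1:
            return "leader-ack"
        for f in self.followers:      # acks=all: wait for every follower
            f.append(msg)
        return "full-ack"

cluster = MiniCluster()
print(cluster.send("a", acks=0))      # no-ack: fastest, least safe
print(cluster.send("b", acks=1))      # leader-ack: leader confirmed only
print(cluster.send("c", acks="all"))  # full-ack: all replicas in sync
```

The trade-off is visible in the code path: acks=0 returns immediately, acks=1 returns after one append, and acks=all returns only after the loop over all followers completes.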
One last note: if you write data to a Topic that does not exist, can the write succeed? Yes, Kafka automatically creates the Topic, with both the number of partitions and the number of replicas set to 1 by the default configuration.
Save data
After Producer writes data to Kafka, the cluster needs to save it! Kafka saves data on disk, and in our usual understanding writing to disk is time-consuming and ill-suited to such a high-concurrency component.
Kafka sets aside a dedicated stretch of disk space from the start and writes data to it sequentially (which is more efficient than random writes).
① Partition structure
As mentioned earlier, each Topic can be divided into one or more Partitions; if a Topic feels abstract, a Partition is something quite concrete!
A Partition is represented on the server as a folder, and under each Partition folder there are multiple groups of Segment files.
Each group of Segment files contains a .index file, a .log file, and a .timeindex file (absent in earlier versions).
The .log file is where Messages are actually stored, while the .index and .timeindex files are index files used to retrieve messages.
As shown in the figure above, this Partition has three groups of Segment files. Each group of files is the same size, but the number of Messages stored is not necessarily equal (individual Messages vary in size).
Each file is named after the smallest Offset in its Segment. For example, the 000...000.index file covers messages with Offsets 0 through 368795. Kafka uses this combination of segmentation and indexing to solve the lookup-efficiency problem.
② Message structure
As mentioned above, the .log file is where Messages are actually stored, and what Producer writes into Kafka, message by message, is also a Message.
So what does a Message stored in the log look like? A message mainly consists of the message body, the message size, the Offset, the compression type, and so on!
The key things to know are the following three:
Offset: an ordered id number occupying 8 bytes that uniquely determines each message's position within the Partition!
Message size: occupies 4 bytes and describes the size of the message.
Message body: stores the actual message data (possibly compressed); the space it occupies varies with the message.
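Under the stated assumptions (an 8-byte Offset, a 4-byte size, then the body; this is a simplification, since Kafka's real record format also carries fields such as CRC, magic byte, and attributes), the layout can be illustrated with Python's struct module:

```python
import struct

# Simplified on-disk record layout: 8-byte Offset + 4-byte size + body.
# (Kafka's real format also includes CRC, magic byte, attributes, etc.)
def pack_message(offset, body: bytes) -> bytes:
    return struct.pack(">qi", offset, len(body)) + body

def unpack_message(buf: bytes):
    offset, size = struct.unpack_from(">qi", buf, 0)  # first 12 bytes
    body = buf[12:12 + size]
    return offset, size, body

record = pack_message(368801, b"hello kafka")
print(unpack_message(record))  # (368801, 11, b'hello kafka')
```

Because the size field comes right after the fixed-width header, a reader scanning the log sequentially always knows exactly how many bytes to skip to reach the next message.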
③ storage policy
Kafka keeps all messages regardless of whether they have been consumed. So what are the deletion strategies for old data?
Time-based: the default configuration is 168 hours (7 days).
Size-based: the default configuration is 1073741824 bytes (1 GB).
Note that the time complexity for Kafka to read a specific message is O(1), so deleting expired files does not improve Kafka's read performance!
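These strategies correspond to broker settings in server.properties. For reference (hedged against the values in the text): in stock Kafka, log.retention.hours does default to 168, while log.retention.bytes actually defaults to -1 (unlimited); 1073741824 bytes is the default size of a single segment file, log.segment.bytes.

```properties
# Time-based retention: segments older than this become eligible for deletion
log.retention.hours=168
# Size-based retention: per-partition log size cap (example value, 1 GB)
log.retention.bytes=1073741824
```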
Consumption data
Once messages are stored in the log files, consumers can consume them. When discussing the two communication modes of message queues, we covered the point-to-point mode and the publish-subscribe mode.
Kafka uses the point-to-point mode: consumers actively pull messages from the Kafka cluster, and just like Producer, they pull messages from the Leader.
Multiple consumers can form a consumer group (Consumer Group), and each consumer group has a group id!
Consumers in the same consumer group can consume data from different partitions of the same Topic, but multiple consumers in a group cannot consume data from the same partition!
Sounds a little convoluted? Let's look at the following picture:
The figure shows a case where the number of consumers in the group is smaller than the number of Partitions, so one consumer ends up consuming multiple Partitions, and it consumes more slowly than consumers that handle just one Partition!
If the number of consumers in the group exceeds the number of Partitions, will several consumers end up consuming the same Partition's data?
As mentioned above, this will not happen! The surplus consumers simply consume no Partition's data at all.
Therefore, in practice it is recommended that the number of Consumers in a consumer group equal the number of Partitions!
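The relationship can be sketched with a simple round-robin assignment (illustration only; real Kafka offers configurable assignors such as range, round-robin, and sticky, and the function here is invented for the sketch):

```python
# Assign partitions to consumers in a group (simplified round-robin sketch;
# Kafka's actual assignment strategies are range / round-robin / sticky).
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        # Each partition goes to exactly one consumer in the group.
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = ["p0", "p1", "p2", "p3"]

# Fewer consumers than partitions: some consumers handle several partitions.
print(assign(partitions, ["c0", "c1", "c2"]))
# {'c0': ['p0', 'p3'], 'c1': ['p1'], 'c2': ['p2']}

# More consumers than partitions: the extra consumer gets nothing at all.
print(assign(partitions, ["c0", "c1", "c2", "c3", "c4"]))
# {'c0': ['p0'], 'c1': ['p1'], 'c2': ['p2'], 'c3': ['p3'], 'c4': []}
```

The second case makes the recommendation concrete: consumer c4 sits idle, which is why matching the consumer count to the partition count wastes the least capacity.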
In the section on saving data we explained that a Partition is divided into groups of Segments, each Segment containing .log, .index, and .timeindex files, and that each Message contains an Offset, a message size, and a message body.
We have mentioned Segment and Offset many times, but how exactly are Segment + Offset used to locate a message?
Suppose we now need to find the Message with Offset 368801. Let's look at the following picture:
① First locate the Segment file containing the message with Offset 368801 (found by binary search); here it is the second Segment file.
② Open the .index file in that Segment (that is, the 368796.index file, whose base Offset is 368796 + 1).
The Message we are looking for, with Offset 368801, therefore has a relative Offset within this index of 368801 - 368796 = 5.
Because the file uses a sparse index to store the mapping between relative Offsets and the physical positions of the corresponding Messages, an entry with relative Offset 5 may not exist directly.
So binary search is used again, this time to find the largest index entry whose relative Offset is less than or equal to the one specified; here that is the entry with relative Offset 4.
③ From the index entry with relative Offset 4 we learn that the Message is stored at physical position 256.
Open the data file and scan sequentially from position 256 until the Message with Offset 368801 is found.
This mechanism is built on top of ordered Offsets: Segments plus ordered Offsets plus a sparse index plus binary search plus sequential scanning together make data lookup efficient!
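The three steps above can be sketched in Python as a toy model. The numbers follow the example in the text, but the data structures (dicts standing in for on-disk files) and the sample message contents are invented for illustration:

```python
import bisect

# Toy model of Kafka's message lookup. Each segment has a base offset, a
# sparse index mapping relative offset -> physical position, and a "log"
# mapping position -> messages stored from that position onward.
segments = [
    {"base": 0,      "index": [(0, 0)],           "log": {}},
    {"base": 368796, "index": [(0, 0), (4, 256)],  # sparse: gaps exist
     "log": {256: [(368800, "m-368800"), (368801, "m-368801")]}},
]

def find_message(target):
    # Step 1: binary-search for the segment whose base offset <= target.
    bases = [s["base"] for s in segments]
    seg = segments[bisect.bisect_right(bases, target) - 1]
    rel = target - seg["base"]
    # Step 2: binary-search the sparse index for the largest entry <= rel.
    rels = [r for r, _ in seg["index"]]
    _, pos = seg["index"][bisect.bisect_right(rels, rel) - 1]
    # Step 3: sequential scan from that physical position.
    for offset, msg in seg["log"].get(pos, []):
        if offset == target:
            return msg
    return None

print(find_message(368801))  # m-368801
```

For target 368801 the code picks the second segment (base 368796), computes relative offset 5, falls back to the index entry at relative offset 4 pointing to position 256, and scans from there, exactly mirroring steps ① through ③.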
At this point, the consumer has obtained the data it needs to process.
After reading the above, do you have a better understanding of how Kafka works? Thank you for reading, and thank you for your support.