2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
How does Kafka work? This article walks through the principles behind Kafka step by step, in the hope of giving readers who want to understand it a simple and practical path.
Why message queues are needed
While idly scrolling on my phone one weekend, a notification popped up from a shopping app: "To give back to our old customers: girlfriends are buy one, get one free. Today only!" A buy-one-get-one-free deal like that can't be missed, so I ordered right away: I picked two of the latest models and completed the order and payment in one go. Lying in bed, I was so happy at the thought of soon having a girlfriend that I couldn't sleep. The next day at work, I suddenly got a call from the delivery guy: "Is this xx? Your girlfriend has arrived. I'm downstairs, please come and collect her." Me: "Uh... I'm at work. Could you deliver in the evening?" Him: "No can do, I'm off work in the evening too." And so the two of us were deadlocked for a long while.
Finally the delivery guy suggested: how about I leave the package at the Xiaofang convenience store downstairs, and you pick it up after work in the evening? The awkward standoff was thus resolved!
Back to the point: if there were no Xiaofang convenience store, the interaction between the courier and me would look like this:
What will happen?
1. I take leave to go back and collect it for the sake of this girlfriend (the boss won't approve it).
2. The delivery guy keeps waiting downstairs (while his other packages go undelivered).
3. He delivers it on the weekend instead (but obviously I can't wait that long).
4. I give up on this girlfriend (absolutely impossible)!
After the Xiaofang convenience store appears, the interaction looks like this:
In the example above, the delivery guy and me (the girlfriend buyer) are the two systems that need to interact, and the Xiaofang convenience store is the subject of this article: message middleware. To sum up, the Xiaofang convenience store (message middleware) brings the following benefits:
1. Decoupling
The delivery guy has many packages to deliver. Each time, he has to phone the recipient to confirm whether they are free and which time slot works, and only then can he plan his deliveries. He is completely dependent on the recipients!
With more than a few packages, the delivery guy would be hopelessly busy. With a convenience store, he only needs to drop all the packages for a neighborhood at that neighborhood's store and notify the recipients to pick them up. At this point the courier and the recipients are decoupled!
2. Async
After phoning me, the delivery guy has to wait downstairs and cannot deliver anything else until I take my package. Once he can leave the package at the Xiaofang convenience store, he can move on to other work instead of standing around waiting for me to arrive. His efficiency improves.
3. Peak clipping
Suppose that on Singles' Day I bought goods from many different stores, and by coincidence they shipped with different couriers: Zhongtong, Yuantong, Shentong, and so on. Even more coincidentally, everything arrived at the same time! The Zhongtong guy called me to pick up a package at the north gate, Yuantong sent me to the south gate, and Shentong sent me to the east gate. I was frantic. With a convenience store, all the packages simply accumulate there and I collect them when I'm free: the burst of simultaneous deliveries is smoothed out.
As we can see, in scenarios where systems need to interact, message queue middleware really does bring many benefits. Based on the same idea, there are services more professional than the Xiaofang convenience store, such as Feng Nest lockers and Cainiao Post Stations. Finally, the story above is pure fiction.
Mode of message queuing communication
Through the example above, we introduced message middleware and the benefits of message queues. Now we need to introduce the two communication modes of message queues:
I. Point-to-point mode
As shown in the figure above, the point-to-point mode is usually based on a pull or polling messaging model, and its defining characteristic is that each message sent to the queue is processed by one and only one consumer. After the producer puts a message into the queue, the consumer actively pulls it for consumption.
The advantage of the point-to-point mode is that consumers can control the frequency at which they pull messages themselves. The drawback is that the consumer side cannot perceive whether there are messages waiting in the queue, so it needs extra threads to monitor the queue.
II. Publish-subscribe mode
As shown in the figure above, the publish-subscribe mode is a push-based messaging model, and this model can have many different subscribers. After the producer puts a message into the queue, the queue pushes the message to all consumers that have subscribed to that kind of message (similar to a WeChat official account).
Because consumers receive pushes passively, they do not need to check whether the queue has messages waiting! However, consumer1, consumer2, and consumer3 may run on machines with different performance, so their message-processing capacity differs, yet the queue cannot perceive how fast each consumer consumes!
The push rate therefore becomes a problem in the publish-subscribe mode! Suppose the three consumers can process 8M/s, 5M/s, and 2M/s respectively. If the queue pushes at 5M/s, consumer3 cannot keep up! If the queue pushes at 2M/s, then consumer1 and consumer2 waste a great deal of capacity!
Kafka
The above briefly described why message queues are needed and the two communication modes of message queues. Now it's time for our protagonist, Kafka, to make its grand debut!
Kafka is a high-throughput, distributed, publish-subscribe messaging system. It can handle all the activity-stream data of a consumer-scale website, and offers high performance, persistence, multi-replica backup, and horizontal scalability.
We will skip the basic introductions here; there is already plenty of such material available online.
Infrastructure and terminology
Without further ado, let's look at the diagram first and use it to sort out the related concepts and the relationships between them:
If the diagram looks confusing at first glance, don't worry!
Let's first analyze the relevant concepts.
Producer: the producer of messages; the entry point through which data enters Kafka.
Kafka cluster:
Broker: a Kafka instance. There are one or more Kafka instances on each server; for simplicity, assume each broker corresponds to one server. Every broker in a Kafka cluster has a unique, non-repeating number, such as broker-0, broker-1, and so on.
Topic: the subject of a message, which can be understood as a message category. Kafka's data is stored in topics, and multiple topics can be created on each broker.
Partition: a partition of a topic. Each topic can have multiple partitions, whose role is to spread load and improve Kafka's throughput. The data of the same topic does not overlap across partitions. On disk, a partition takes the form of a folder!
Replication: each partition has multiple replicas that serve as backups. When the primary replica (the Leader) fails, a backup (a Follower) is elected to become the new Leader. The default maximum number of replicas in Kafka is 10, and the number of replicas cannot exceed the number of brokers. A follower and its leader are always on different machines, and a machine stores at most one replica of any given partition (including the leader itself).
Message: the body of each message sent.
Consumer: the consumer of messages; the exit point through which data leaves Kafka.
Consumer Group: multiple consumers can be organized into one consumer group. In Kafka's design, data from one partition can only be consumed by a single consumer within a group, but consumers of the same group can consume data from different partitions of the same topic. This, too, improves Kafka's throughput!
Zookeeper: the Kafka cluster relies on zookeeper to store cluster metadata and ensure the availability of the system.
Workflow analysis
The above introduced Kafka's infrastructure and basic terminology. Hopefully you now have a rough picture of Kafka; it doesn't matter if things are still a little hazy.
Next, we will analyze Kafka's workflow against the structure diagram above, and come back at the end to tie everything together. I believe you will get more out of it that way!
Send data
We saw in the architecture diagram above that the producer is the entry point for data. Note the red arrows in the diagram: when writing data, the producer always writes to the leader and never writes directly to a follower! So how does the producer find the leader?
And what does the write process look like? Let's look at the following diagram:
The sending process is already illustrated in the figure, so we won't list it out again in prose. One thing to note: after a message is written to the leader, the followers actively fetch from the leader to synchronize! The producer publishes data to the broker in push mode; each message is appended to its partition and written to disk sequentially, which is what guarantees that data within the same partition is ordered! The write process is sketched below:
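The per-partition ordering guarantee can be sketched in a few lines of Python. This is an illustrative toy, not Kafka's actual implementation: each append to a partition's log receives the next sequential offset, which is exactly why order holds within a single partition.

```python
# Toy model of a partition's append-only log (illustrative only):
# each appended message gets the next sequential offset, so messages
# within one partition are always read back in write order.
class Partition:
    def __init__(self):
        self.log = []  # messages in append order

    def append(self, message):
        offset = len(self.log)  # next sequential offset
        self.log.append(message)
        return offset

p = Partition()
first = p.append("m1")
second = p.append("m2")
```

Note that this guarantee is per partition only; Kafka makes no ordering promise across partitions of the same topic.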
We mentioned above that data is written to different partitions, so why does Kafka partition at all? As you can probably guess, the main purposes of partitioning are:
1. Easy scaling. Because a topic can have multiple partitions, we can easily cope with growing data volumes by adding machines.
2. Better concurrency. With the partition as the unit of reading and writing, multiple consumers can consume data at the same time, which improves message-processing efficiency.
Readers familiar with load balancing will know that when we send requests to a server, a load balancer may spread the traffic across different servers. In Kafka, if a topic has multiple partitions, how does the producer know which partition to send the data to?
Kafka follows several rules:
1. A partition can be specified explicitly when writing; if so, the data is written to that partition.
2. If no partition is specified but the data carries a key, a partition is chosen by hashing the key.
3. If neither a partition nor a key is specified, partitions are chosen by round-robin polling.
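The three rules above can be sketched as a toy partitioner in Python. This is hypothetical code, not the actual Kafka client; the `Partitioner` class and its API are made up for illustration:

```python
import itertools

# Toy partitioner implementing the three routing rules in order:
# explicit partition > hash of key > round-robin polling.
class Partitioner:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def select(self, partition=None, key=None):
        if partition is not None:                  # rule 1: explicit partition
            return partition
        if key is not None:                        # rule 2: hash the key
            return hash(key) % self.num_partitions
        return next(self._round_robin)             # rule 3: round-robin polling

p = Partitioner(3)
```

A practical consequence of rule 2 is that all messages with the same key land in the same partition, and are therefore totally ordered with respect to each other.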
Guaranteeing that messages are not lost is a basic promise of message queue middleware, so how does the producer ensure messages are not lost when writing to Kafka?
This is the ACK response mechanism shown in the write flow above! When the producer writes data, a parameter can be set to determine whether Kafka must confirm receipt of the data. This parameter can be set to 0, 1, or all.
0 means the producer sends data without waiting for any reply from the cluster. It does not ensure the message was delivered; it is the least safe but the most efficient.
1 means the producer waits for the leader's acknowledgement before sending the next message. It only ensures the leader received the data.
all means the producer waits for the leader's acknowledgement and for all followers to finish synchronizing from the leader before sending the next message. It ensures the leader received the data and all replicas hold a backup; it is the safest but the least efficient.
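As a configuration sketch, these three levels correspond to the standard Kafka producer property `acks` (the property name is real; the fragment below only illustrates the trade-offs described above, not a complete producer config):

```properties
# Producer durability trade-off via the `acks` property.
#   acks=0    fire-and-forget: fastest, no delivery guarantee
#   acks=1    wait for the leader's ack only: data is lost if the
#             leader fails before followers synchronize
acks=all    # wait for the leader plus all in-sync replicas: safest, slowest
```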
Finally, one thing to note: if you write data to a topic that does not exist, can the write succeed? Yes: Kafka automatically creates the topic, with the number of partitions and replicas both set to 1 according to the default configuration.
Save data
After the producer writes data to Kafka, the cluster needs to save it! Kafka saves data to disk, and in our usual intuition writing to disk is time-consuming and unsuited to a highly concurrent component. Kafka, however, sets aside a dedicated stretch of disk space up front and writes data to it sequentially (which is far more efficient than random writes).
Partition structure. As mentioned earlier, each topic can be divided into one or more partitions. If a topic feels abstract, a partition is something quite concrete! A partition is represented on the server as a folder. Under each partition folder there are multiple groups of segment files, and each group contains a .index file, a .log file, and a .timeindex file (the last is absent in older versions). The .log file is where the messages are actually stored, while the .index and .timeindex files are indexes used to retrieve messages.
As shown in the figure above, this partition has three groups of segment files. Each group has the same maximum size but does not necessarily hold the same number of messages (individual messages differ in size). Each file is named after the smallest offset in its segment; for example, 000.index covers the messages with offsets 0 through 368795. This combination of segmentation and indexing is how Kafka solves the lookup-efficiency problem.
Message structure. The .log file is where the messages are stored, and what the producer writes to Kafka are messages, so what does a message stored in the log look like? A message mainly consists of the message body, the message size, the offset, the compression type... and so on!
The key things to know are the following three:
1. offset: an ordered id number occupying 8 bytes that uniquely identifies the position of each message within the partition!
2. message size: occupies 4 bytes and describes the size of the message.
3. message body: stores the actual (compressed) message data; the space it occupies varies from message to message.
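These three fields can be illustrated with a simplified binary layout in Python's `struct` module. This is a deliberate simplification for teaching purposes; Kafka's real on-disk record format contains additional fields (CRC, magic byte, timestamp, and so on):

```python
import struct

# Simplified record layout matching the three fields described above:
# 8-byte big-endian offset | 4-byte message size | message body.
def pack_message(offset, body):
    return struct.pack(">qi", offset, len(body)) + body

def unpack_message(raw):
    offset, size = struct.unpack(">qi", raw[:12])
    return offset, size, raw[12:12 + size]

record = pack_message(368801, b"hello")
```

Storing the size up front is what lets a reader scan the log sequentially: after each record it knows exactly how many bytes to skip to reach the next one.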
Storage policy. Kafka keeps all messages, whether or not they have been consumed. So what are the deletion strategies for old data?
1. Time-based: the default configuration is 168 hours (7 days).
2. Size-based: the default configuration is 1073741824 bytes (1 GB). Note that the time complexity for Kafka to read a specific message is O(1), so deleting expired files does not improve Kafka's read performance!
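Both deletion policies can be sketched as a pure-Python function. This is illustrative only: the tuple-based `segments` representation is made up for the sketch, whereas real Kafka operates on segment files on disk:

```python
# Defaults from the text: 168 hours and 1073741824 bytes.
RETENTION_MS = 7 * 24 * 3600 * 1000   # time-based policy (7 days)
RETENTION_BYTES = 1073741824          # size-based policy (1 GB)

def expired_segments(segments, now_ms):
    """Return the segments eligible for deletion (sketch of both policies).
    `segments` is a list of (last_modified_ms, size_bytes) tuples, oldest first."""
    total = sum(size for _, size in segments)
    doomed = []
    for mtime, size in segments:
        too_old = now_ms - mtime > RETENTION_MS
        too_big = total > RETENTION_BYTES
        if too_old or too_big:
            doomed.append((mtime, size))
            total -= size   # the size policy deletes oldest segments first
        else:
            break           # remaining segments are newer and within budget
    return doomed
```

Deletion always proceeds from the oldest segment forward, which is consistent with the log being an append-only structure.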
Consumption data
Once messages are stored in log files, consumers can consume them. When discussing the two communication modes of message queues, we covered the point-to-point mode and the publish-subscribe mode.
Kafka uses the point-to-point mode: consumers actively pull messages from the Kafka cluster. And like the producer, a consumer pulls messages from the leader.
Multiple consumers can form a consumer group, and each consumer group has a group id! Consumers in the same group can consume data from different partitions of the same topic, but multiple consumers within one group cannot consume data from the same partition!
Sounds a little convoluted? Let's look at the following diagram:
The figure shows the case where the group has fewer consumers than there are partitions, so one consumer ends up consuming multiple partitions, and it consumes more slowly than a consumer that handles only a single partition! What if the group has more consumers than partitions: will multiple consumers end up consuming the same partition?
As mentioned above, that does not happen! The surplus consumers simply consume no partition at all. Therefore, in practice it is recommended to make the number of consumers in a group equal to the number of partitions! In the section on saving data, we saw that a partition is divided into groups of segments, that each segment contains .log, .index, and .timeindex files, and that each message records its offset, message size, and message body.
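The partition-to-consumer rule can be sketched as a toy round-robin-style assignor. This is illustrative only; Kafka's real assignment strategies (range, round-robin, sticky) are more involved:

```python
# Toy assignor: each partition goes to exactly one consumer in the group,
# and any surplus consumers receive no partitions at all.
def assign(consumers, num_partitions):
    assignment = {c: [] for c in consumers}
    for part in range(num_partitions):
        assignment[consumers[part % len(consumers)]].append(part)
    return assignment
```

Running it with 2 consumers and 3 partitions gives one consumer two partitions; with 4 consumers and 3 partitions, one consumer sits idle, which is why matching the consumer count to the partition count is recommended.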
We have mentioned segments and offsets many times. When looking up a message, how do we actually use segment + offset to find it? Suppose we need to find the message with offset 368801. Let's look at the following diagram:
1. First, find the segment file that contains the message with offset 368801 (using binary search). Here it is found in the second segment file.
2. Open that segment's .index file (that is, the 368796.index file, whose base offset is 368796; the message we want, at offset 368801, therefore has a relative offset of 368801 - 368796 = 5 within this segment). Because this file is a sparse index, storing only some (relative offset, physical position) pairs, an entry with relative offset exactly 5 may not exist. So we again use binary search, this time to find the largest index entry whose relative offset is less than or equal to our target, and we find the entry with relative offset 4.
3. The index entry with relative offset 4 tells us the message is stored at physical position 256 in the data file. We open the data file and scan sequentially from position 256 until we find the message with offset 368801.
This mechanism is built on the ordered offsets, combining segments + ordered offsets + a sparse index + binary search + sequential scanning to locate data efficiently!
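The lookup steps can be simulated with Python's `bisect` module. The in-memory lists below are toy data standing in for the segment files and sparse index; the numbers follow the worked example above:

```python
import bisect

# Toy stand-ins for on-disk structures: segments are named by their base
# offset, and each holds a sparse index of (relative_offset, position) pairs.
segment_bases = [0, 368796, 737645]
sparse_index = {
    0: [(0, 0)],
    368796: [(0, 0), (4, 256), (8, 512)],
    737645: [(0, 0)],
}

def locate(offset):
    # Step 1: binary-search for the segment whose base offset <= target.
    i = bisect.bisect_right(segment_bases, offset) - 1
    base = segment_bases[i]
    relative = offset - base
    # Step 2: binary-search the sparse index for the greatest entry <= relative.
    entries = sparse_index[base]
    j = bisect.bisect_right([r for r, _ in entries], relative) - 1
    # Step 3: the caller would scan the .log sequentially from this position.
    return base, entries[j][1]
```

`locate(368801)` lands in the 368796 segment at physical position 256, matching steps 1 through 3 above.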
At this point the consumer has the data it needs to process. So how does each consumer keep track of its own consumption position?
In early versions, consumed offsets were maintained in zookeeper, with consumers reporting their positions at fixed intervals. This easily led to repeated consumption and performed poorly! In newer versions, consumed offsets are maintained directly in the __consumer_offsets topic of the Kafka cluster!
This concludes the walkthrough of how Kafka works. We hope the content above has been helpful to you.