The design principle and performance application of Kafka
This article explains the design principles of Kafka and how it performs in practice. The material is straightforward and practical; interested readers may wish to follow along.
Preface
DataGrand, as a provider of big data services, often encounters customers who need to report data. Such requests do not require an immediate processing result; instead, the backend must archive and mine the stream of reported records in a unified way and later present the derived results to the customer. This kind of business calls for a temporary data staging service, that is, a system that mediates between producers (customers reporting data) and consumers (backend data processing): an inter-system messaging system, the classic producer/consumer model.
As it happens, one messaging system was created for exactly this scenario: Kafka. So what kind of system is Kafka? What are its characteristics? What is its actual throughput? With these questions in mind, let's find out together.
First, a brief introduction to Kafka
According to the official website, Kafka is a distributed stream processing platform: a messaging system that supports enterprise publish/subscribe and offers high fault tolerance and timely consumption. So how does it achieve this? Keep reading.
1. Topics and logs:
The topic and log design is a major feature of Kafka. A Kafka cluster can hold multiple topics, and each topic is equivalent to a message queue, so data in different formats can be published to different topics, which reduces the logical difficulty of consuming the data. So what data structure is managed inside each topic? Let's first look at an anatomy of a topic:
Figure 1: anatomy of a topic
As figure 1 shows, when a message is delivered, Kafka ultimately writes it, via load balancing, to a specific partition on disk. Because messages within a partition are stored sequentially, each message carries an offset relative to the start of its partition, so to consume, or to re-consume, we only need to specify the partition and the offset at which to begin.
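As a concrete illustration, here is a minimal kafka-python sketch of offset-based re-consumption; the broker address, topic name and offset are placeholder assumptions:

```python
from kafka import KafkaConsumer, TopicPartition

# Connect to the cluster (broker address is a placeholder).
consumer = KafkaConsumer(bootstrap_servers='localhost:9092')

# Manually take partition 0 of a hypothetical topic and rewind
# to offset 42; consumption resumes from exactly that message.
tp = TopicPartition('report-data', 0)
consumer.assign([tp])
consumer.seek(tp, 42)

for msg in consumer:
    print(msg.partition, msg.offset, msg.value)
```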
1) Although Kafka can increase load capacity by adding partitions, the data is ultimately written to disk, and the write efficiency of a mechanical disk is low. Should we really give a topic more partitions to raise its load capacity?
Mechanical disk throughput is strongly coupled to seek latency: linear reads and writes are much faster than random ones. For example, on a RAID-5 array of six 7200rpm SATA disks, random write speed is about 100 KB/s while linear write speed can reach about 600 MB/s, roughly 6000 times faster. As figure 1 shows, Kafka uses the latter pattern, relying on the operating system's read-ahead and write-behind techniques to greatly improve disk access performance. From the standpoint of disk I/O, more partitions do increase a topic's load capacity, but too many partitions also increase CPU overhead, so the practical answer is to choose the partition count according to the machine's configuration and the business scenario.
2) What is the storage type of the offset? If enough messages accumulate, will the offset wrap back to 0, and if so, will subsequent consumption become disordered?
A Kafka offset is a 64-bit log sequence number, so its capacity is not a concern. How big is that? Even if a partition received 1 TB of logs a day, say one billion messages of 1 KB each, exhausting a 64-bit counter would take on the order of ten billion days, comfortably more than a million years. Since the offset cannot realistically overflow and is never reset to zero, no consumer disorder arises from this.
3) Are logs written to disk retained forever? What do I need to do if I want expired messages deleted?
You can set a message expiration policy with the log.retention.* parameters in the broker configuration file, for example log.retention.hours or log.retention.bytes. Messages beyond the retention limit are deleted by the broker, and deleted messages can no longer be re-consumed.
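For example, typical retention settings in a broker's server.properties might look like the following sketch; the specific values are placeholder assumptions:

```properties
# Delete log segments older than 7 days.
log.retention.hours=168
# Optionally also cap each partition at about 1 GB.
log.retention.bytes=1073741824
# Clean up by deletion rather than log compaction.
log.cleanup.policy=delete
```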
2. Distributed cluster
From the introduction so far, we know that Kafka achieves very high throughput through partitioning and sequential disk I/O, but a single high-throughput machine becomes a disaster for the business the moment it fails. How is this handled? As you have probably guessed: with a cluster, so that when one machine fails, clients can connect to another and the business stays stable. Each partition has one server acting as leader and one or more servers acting as followers. The leader handles all read and write requests, while the followers replicate data from the leader. If a leader fails, the remaining followers automatically elect one of themselves as the new leader. A given server may therefore be the leader for some partitions and a follower for others, a design that spreads load more evenly.
1) Are there any small details worth attention when building a Kafka cluster?
The Kafka official website documents cluster setup in detail, so it will not be repeated here. It is recommended not to use pseudo-clusters (multiple brokers running on the same physical machine) in production, and not to place the ZooKeeper cluster and the Kafka cluster on the same physical machines, since that hurts Kafka's sequential read/write efficiency.
2) If a server in a Kafka cluster fails, how is data integrity guaranteed?
There is a replication factor parameter in the Kafka configuration. Setting it to N means each piece of data is stored N times, backed up onto different servers, so with a replication factor of N the data remains intact even if a server fails (in general, up to N-1 failures can be tolerated).
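As an illustration, here is a minimal sketch of creating a replicated topic with kafka-python's admin client; the topic name, partition count and broker address are placeholder assumptions:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers='localhost:9092')

# 5 partitions, each replicated onto 2 different brokers.
admin.create_topics([
    NewTopic(name='report-data', num_partitions=5, replication_factor=2)
])
```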
3. Producers, consumers, and message ordering:
The purpose of all this is to implement a queue data structure. Queues are familiar territory, so one might imagine that for a topic in Kafka the production and consumption logic looks like this: many producers write data into the topic, and many consumers read data out at the other end (see figure 2).
Figure 2: naive view of producers and consumers
In reality, however, Kafka's producer-consumer model has its own particularities. So how does Kafka actually enqueue and dequeue? Let's take a look.
Producer: a producer, as the name implies, publishes messages to a Kafka topic; it is the enqueuing side. The producer's job is to select a partition within the topic and send data to it. Partition selection is a form of load balancing: you can use round-robin or supply your own partition-selection function. Of course, if you use an encapsulated client such as kafka-python (https://github.com/dpkp/kafka-python), you do not need to worry about this, as a default partitioning function is provided.
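For instance, here is a minimal sketch of a kafka-python producer with a custom partition-selection function; the hashing rule, topic name and key are illustrative assumptions, not kafka-python's defaults:

```python
from kafka import KafkaProducer

def pick_partition(key_bytes, all_partitions, available_partitions):
    # Route messages with the same key to the same partition
    # (a simple illustrative rule).
    if key_bytes is None:
        return available_partitions[0]
    return all_partitions[hash(key_bytes) % len(all_partitions)]

producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         partitioner=pick_partition)
producer.send('report-data', key=b'customer-42', value=b'payload')
producer.flush()
```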
Consumer: a consumer reads data from the queue and processes it, but Kafka consumers have a twist. Kafka adds the concept of a group: a topic can have multiple groups, and a message is delivered to every group, but within a group the message is consumed by only one consumer. A group is therefore the real "logical consumer". The related logic is shown in figure 3.
Message ordering: figure 3 shows how messages are consumed, so what does a message flow look like? Because the load-balancing rule is fixed in the high-level API, two messages published by the same producer are routed to a specific partition according to that rule, and are pulled from the partition under the same rule when consumed. Within a partition, messages are consumed in the order they were written, which is what preserves message ordering.
1) For a topic with multiple consumers, how should groups be set so that a message is consumed by only one consumer, or alternatively by every consumer?
Setting multiple consumers to the same group ensures that a message is consumed by only one of them; setting each consumer to a different group ensures that a message is consumed by all of them.
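A minimal sketch of both arrangements with kafka-python; the group names, topic and broker address are placeholder assumptions:

```python
from kafka import KafkaConsumer

# Consumers in the SAME group split the topic's partitions:
# each message is processed by only one member of the group.
worker = KafkaConsumer('report-data', group_id='archivers',
                       bootstrap_servers='localhost:9092')

# A consumer in a DIFFERENT group independently receives
# every message again.
auditor = KafkaConsumer('report-data', group_id='auditors',
                        bootstrap_servers='localhost:9092')
```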
2) If a batch of data must be consumed strictly in the order it was enqueued, how should producers and consumers be set up?
If the data volume is small, give the topic a single partition. If the data volume is large, the producer can use a fixed (hard-coded) load-balancing function that sends the data to one specific partition, and the consumer can be assigned that same partition and a starting offset, consuming the data sequentially, as sketched below.
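A minimal sketch of strict-order production and consumption pinned to a single partition; the topic, broker address and payloads are placeholder assumptions:

```python
from kafka import KafkaProducer, KafkaConsumer, TopicPartition

producer = KafkaProducer(bootstrap_servers='localhost:9092')
# Pin all order-sensitive records to partition 0.
for i in range(3):
    producer.send('report-data', value=b'record-%d' % i, partition=0)
producer.flush()

# Consume the same partition from the beginning, in write order.
tp = TopicPartition('report-data', 0)
consumer = KafkaConsumer(bootstrap_servers='localhost:9092')
consumer.assign([tp])
consumer.seek_to_beginning(tp)
for msg in consumer:
    print(msg.offset, msg.value)
```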
Figure 3: message flow across multiple consumer groups
Second, Kafka performance test:
Kafka is a cross-language message queue; Java, Python and other language clients are available on GitHub. For simplicity, we use kafka-python (https://github.com/dpkp/kafka-python) as the client to connect to the Kafka cluster for testing.
Test environment:
1. Number of brokers: 3
2. Replication factor: 2
3. Disk: 200 GB ordinary mechanical hard disk
4. CPU: 8 cores, 8 threads
5. Language: Python 2.7
6. Client: kafka-python
7. Number of partitions: 5
A single-process producer sends 10 messages as a test (see figure 4):
Figure 4: latency results for a single producer sending messages
Statistics from the figure show an average latency of 0.004488707 seconds, that is, a QPS reaching 2000, which is impressive. So does Kafka still perform well with multiple processes? Let's start 10 processes and see whether latency changes much under that load, as shown in figure 5 (too many messages are printed to show them all, so only part of the output is captured):
Figure 5: latency results for multiple producers sending messages (partial)
As figure 5 shows, with 10 processes each sending 10 messages, the average latency is 0.00050380466 seconds, for a QPS approaching 200000. Since Kafka supports thousands of clients reading and writing simultaneously, its throughput is remarkable; readers are welcome to run more complete tests.
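For reference, a minimal sketch of the kind of latency measurement used above, written with kafka-python; the message count, topic and broker address are placeholder assumptions:

```python
import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')

latencies = []
for i in range(10):
    start = time.time()
    # .get() blocks until the broker acknowledges the send, so the
    # elapsed time approximates the per-message round-trip latency.
    producer.send('report-data', b'message-%d' % i).get(timeout=10)
    latencies.append(time.time() - start)

print('average latency: %f s' % (sum(latencies) / len(latencies)))
```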
Third, applications of Kafka at DataGrand
1. Application in vertical search:
Search engines need to update documents regularly. If the content to be updated is staged temporarily in Kafka, then at index-update time we only need to resume fetching from the last consumed offset in the corresponding partition, achieving incremental updates, while expired data is cleaned up automatically, which reduces operational redundancy and complexity.
2. Application in user profiling and related recommendation:
Unlike the user click-behavior data reported for user profiling, the massive item data reported for related recommendation demands higher accuracy. Imagine that one item record fails processing and is never stored correctly in the database: that item would then never appear among the related recommendations, so this scenario imposes stricter "rollback" requirements. With Kafka, however, you only need to reset the consumed offset back to the offset at which consumption failed, fix the storage problem, and consume again.
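A minimal sketch of that at-least-once pattern using kafka-python's manual offset management; the topic, group and storage routine are placeholder assumptions:

```python
from kafka import KafkaConsumer, TopicPartition

def store_item(value):
    # Hypothetical storage routine; replace with real database logic.
    print('stored %r' % value)

consumer = KafkaConsumer('item-data', group_id='recommender',
                         bootstrap_servers='localhost:9092',
                         enable_auto_commit=False)

for msg in consumer:
    try:
        store_item(msg.value)
        consumer.commit()   # commit the offset only after a successful store
    except Exception:
        # Rewind to the failed record so it is re-delivered once the
        # storage problem is fixed (a real system would add backoff).
        consumer.seek(TopicPartition(msg.topic, msg.partition), msg.offset)
```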
Of course, Kafka has many wider applications that will not be covered here. According to the official website, Kafka also excels at website activity tracking, data monitoring, stream processing and so on. If you have studied Kafka's internals or have practical experience with it, you are welcome to join the discussion. Thank you!
About DataGrand
DataGrand focuses on enterprise big data technical services, using multi-layer intelligent mining algorithms to analyze and mine massive user behavior and text data in depth, providing enterprises with intelligent text analysis, accurate user behavior modeling, personalized recommendation, intelligent search and other data mining capabilities.