How to reduce the delay of Kafka messages by 10 times

2025-03-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com) 05/31 Report--

This article shows how to reduce the delay of Kafka messages by a factor of 10. The content is straightforward; I hope it helps resolve your doubts as we work through it together.

Business problem

As shown in the figure above, the customer's business was modeled as a concurrent access flow. When a user clicks on the page, an HTTP request is generated and sent to the business production process, which starts a delivery thread (Deliver Thread) that calls Kafka's SDK interface and sends three messages, each 3 KB in size, to DMS (Distributed Message Service). All three messages must be processed before the request response is returned. When a message reaches DMS, the business consumption process calls Kafka's consumer API to fetch it and hands each message to a response thread (Response Thread) for processing. When the response thread finishes, it notifies the delivery thread via an HTTP request, and the delivery thread then returns the response.
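This fan-out-and-wait pattern can be sketched roughly as follows. This is a hypothetical simulation, not the customer's code: `send_and_await_response` is a stand-in for the produce-to-Kafka-and-wait-for-callback path, so the end-to-end latency of one request is the maximum of the three message paths.

```python
# Hedged sketch (not the customer's actual code): the delivery thread fans out
# three messages and only returns the HTTP response once all three consumer
# callbacks have come back.
from concurrent.futures import ThreadPoolExecutor

def send_and_await_response(msg: bytes) -> str:
    # Stand-in for: produce to Kafka, wait for the response thread's callback.
    return f"ack:{len(msg)}"

def handle_request(payload: bytes) -> list[str]:
    # Split the request into three ~3 KB messages, as in the described model.
    messages = [payload[i::3] for i in range(3)]
    with ThreadPoolExecutor(max_workers=3) as pool:
        # The request response is returned only after ALL three acks arrive.
        return list(pool.map(send_and_await_response, messages))

acks = handle_request(b"x" * 9 * 1024)
```

Because the response waits on all three messages, a delay spike on any one of them stalls the whole request.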

At 100 concurrent accesses the latency was 500 ms, which failed to meet the user's business requirements.

The customer's requirement was explicit: every two-core ECS should support 100 concurrent accesses, with an end-to-end delay per message, from the producer to the consumer's response, in the tens of milliseconds. According to the customer's measurements, with the Kafka queue of DMS the delay at 100 concurrent accesses was about 500 ms, sometimes even seconds, far from meeting the business requirement. In comparison, with native Kafka that the customer built themselves in the Pod zone, the measured latency at 100 concurrent accesses was only about 10~20 ms. So the question was: with the same concurrent traffic, why is there such a big latency gap between the DMS Kafka queue and the native Kafka built in the Pod zone? Our DMS architect, Mr. Peng, solved the customer's problem after a series of analyses of the latency. Let's follow his reasoning.

Problem analysis

Following the simulated customer business model, Mr. Peng built a test program in Huawei Cloud's production-like environment and likewise simulated 100 concurrent accesses. The test showed an average latency of about 60 ms under pressure in the production-like environment. Why did this differ so much from the latency the customer measured in the real production environment? The problem became complicated and confusing.

Mr. Peng immediately decided to run the test program on Huawei Cloud to find the cause. The same test program was also deployed on the customer's ECS server, again simulating a concurrency of 100, producing the following latency comparison:

Pre-tuning delay   Live network delay (ms)   Production-like delay (ms)
100 concurrency    500 ~ 4000                40 ~ 80
1 concurrency      31                        6
Ping test          0.9 ~ 1.2                 0.3 ~ 0.4

Table 1: Latency comparison between the Huawei Cloud live network and the production-like environment

From the comparison table, Mr. Peng saw that even under the same concurrency pressure, latency on the Huawei Cloud live network was much worse than in the production-like environment. He realized there were two questions to analyze: why is the live-network latency worse than the production-like latency, and how can the delay of the DMS Kafka queue be brought down to match that of native self-built Kafka? Mr. Peng made the following analysis:

Latency analysis

Returning to the nature of the problem: where exactly does the delay of a DMS Kafka queue come from? Into which controllable components can the end-to-end delay be broken down? Mr. Peng gives the following formula:

Total delay = queuing delay + send delay + write delay + replication delay + pull delay

Let's look at what each term in the formula means in turn.

Queuing delay: after a message enters the Kafka SDK, it first waits in the send queue of its partition and is only sent once it has been packed into a batch.

Send delay: the time for a message to travel from the producer to the server.

Write delay: the time for the message to be written to the Kafka leader.

Replication delay: consumers can only consume messages below the high watermark (that is, messages already saved by multiple replicas), so the replication delay is the time from when a message is written to the Kafka leader until all replicas have written it and the high watermark rises past it.

Pull delay: the time it takes for consumers to pull the data in pull mode, including the pull itself.
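As a sanity check, the formula can be written down directly. The component values below are purely hypothetical, chosen for illustration, and are not measurements from this article:

```python
# Illustrative only: the five components from the formula above, with made-up
# millisecond values, summed into the total end-to-end delay.
def total_delay_ms(queuing, send, write, replication, pull):
    """Total delay = queuing + send + write + replication + pull (all in ms)."""
    return queuing + send + write + replication + pull

# Hypothetical breakdown for one message (numbers are NOT from the article):
example = total_delay_ms(queuing=45, send=5, write=1, replication=8, pull=3)
```

Breaking the total down this way is what lets each component be measured and attacked separately, as the sections below do.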

(1) Queuing delay

Which part contributed the most delay on the live network? Our instrumentation showed that the delay spent queuing and waiting to be sent was very large:

That is, messages were piling up in the producer-side queue faster than they could be sent!
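A toy queuing model (an assumption for illustration, not measured data) shows why this waiting time grows without bound: when messages arrive faster than the sender can drain them, each message waits longer than the one before it.

```python
# Minimal queuing model (an assumption, not measured data): if messages arrive
# every `arrival_ms` but the sender can only ship one every `service_ms`,
# the i-th message waits roughly i * (service_ms - arrival_ms) in the queue.
def queue_wait_ms(n_messages: int, arrival_ms: float, service_ms: float) -> list[float]:
    waits, free_at = [], 0.0
    for i in range(n_messages):
        arrive = i * arrival_ms
        start = max(arrive, free_at)      # wait until the sender is free
        waits.append(start - arrive)      # time spent queued before sending
        free_at = start + service_ms      # sender stays busy for one service time
    return waits

waits = queue_wait_ms(n_messages=5, arrival_ms=1.0, service_ms=3.0)
```

Under sustained overload the queue wait dominates every other term in the delay formula, which matches what the instrumentation showed.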

Now for the other delay components. Since they could not be measured on the live network, we ran the same pressure test in the production-like environment; the other delays are as follows:

(2) Replication delay

The following results are from the production-like environment under a concurrency of 1:

From the logs, the replication delay is included in remoteTime. This time also includes the write delay when the producer is slow, but it still shows that replication delay is one of the factors in the overall latency.
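The high-watermark rule behind the replication delay can be sketched as follows. This is a simplification for illustration; real Kafka tracks in-sync replicas and log-end offsets per partition:

```python
# Sketch of the high-watermark rule (simplified): a consumer may only read
# offsets below the minimum log-end offset across the in-sync replicas, so
# replication lag on any single replica delays consumption of new messages.
def high_watermark(replica_log_end_offsets: list[int]) -> int:
    # The slowest in-sync replica caps what consumers are allowed to see.
    return min(replica_log_end_offsets)

# Hypothetical partition: the leader has written up to offset 100,
# but the slowest follower is still at offset 92.
hw = high_watermark([100, 97, 92])
```

This is why a single slow follower shows up as end-to-end consumer latency even when the leader write itself is fast.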

(3) Write delay

Because the user uses a high-throughput queue and writes are flushed to disk asynchronously, the logs show that the write latency (localTime) is very low, so it can be ruled out as a bottleneck.

Both the send delay and the pull delay are related to network transmission, and optimizing them is mainly a matter of tuning TCP parameters.
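One illustrative TCP knob (an example of this kind of tuning, not necessarily the fix applied in this case) is disabling Nagle's algorithm with TCP_NODELAY, so that small messages are sent immediately instead of being coalesced, trading a little throughput for lower latency:

```python
# Illustrative TCP tuning knob (one of several; not the article's exact fix):
# TCP_NODELAY disables Nagle's algorithm, so small writes go on the wire
# immediately instead of waiting to be coalesced into larger segments.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```

Whether this helps depends on the workload: for small latency-sensitive messages like the 3 KB payloads here it usually does, while bulk-throughput traffic benefits from coalescing.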

The above is the full content of "How to reduce the delay of Kafka messages by 10 times". Thank you for reading! I hope this has helped; if you want to learn more, you are welcome to follow the industry information channel!
