What are the problems with Kafka?


This article walks through the problems we ran into with Kafka in production. The explanations are kept simple and clear and are easy to follow, so please read along with the editor's train of thought.

Message ordering

1. Why should the order of messages be guaranteed?

At first there were very few merchants in our system, and to ship the feature quickly we did not think too hard about the design. Since the systems communicate through the message middleware Kafka, the order system put the full order details in the message body when it sent a message; our kitchen display system only had to subscribe to the topic to get the data and then run its own business logic.

However, this scheme has one key requirement: the messages must be consumed in order.

Why?

An order has many states, such as placed, paid, completed, and cancelled. It must not happen that the "place order" message has not been read yet while the "payment" or "cancel" message is read first; if that really happened, wouldn't the data get messed up?

Well, it seems necessary to ensure the order of the messages.

2. How to ensure the order of messages?

We all know that a Kafka topic as a whole is unordered, but a topic contains multiple partitions, and each partition is internally ordered.

So the idea becomes clear: as long as the producer writes related messages to the same partition according to some rule, and different consumers each read their own partitions, the order of production and consumption can be guaranteed.

That is what we did at the beginning: messages with the same merchant number were written to the same partition, the topic was created with four partitions, and four consumer nodes were deployed as one consumer group, with each partition assigned to one consumer node. In theory, this scheme guarantees message order.
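A minimal sketch of this key-based routing with the plain Java producer client; the topic name order_topic and the method names are our placeholders, not the original code. Kafka's default partitioner hashes the record key, so every message for one merchant lands on the same partition.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderMessageSender {
    private final KafkaProducer<String, String> producer;

    public OrderMessageSender(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
    }

    // Using the merchant number as the record key means the default partitioner
    // always picks the same partition for that merchant, preserving its order.
    public void send(String merchantNo, String orderJson) {
        producer.send(new ProducerRecord<>("order_topic", merchantNo, orderJson));
    }
}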

Everything seemed to be "seamless", so we went online "smoothly".

3. There was an accident

The feature had been live for a while, and at first everything looked normal.

However, the good times did not last long: we soon received complaints from merchants that some orders and dishes could not be seen in the kitchen display client.

I tracked down the reason: during that period the company network was often unstable, business interfaces timed out from time to time, and requests occasionally failed to connect to the database.

The impact of this situation on sequential messages can be said to be devastating.

Why do you say that?

Suppose the order system sends three messages: "place an order", "pay" and "complete".

If our system fails to process the "place order" message because of a network problem, the data from the next two messages cannot be stored either, because only the "place order" message carries the complete order data; the other message types only update the status.

On top of that, we had no retry mechanism for failures at the time, which magnified the problem: once processing of the "place order" message failed, the merchant would never see that order or its dishes.

So how were we to solve this urgent problem?

4. Solution process

At first, our idea was that when the consumer failed to process a message, it would immediately retry 3 to 5 times. But what if some requests do not succeed until the sixth attempt? We cannot retry forever, and this kind of synchronous retry blocks the reading of order messages from other merchants.

Clearly, with such a synchronous retry mechanism, any exception would seriously slow down the consumer and reduce its throughput.

So we had to use an asynchronous retry mechanism instead.

With an asynchronous retry mechanism, messages that fail to process have to be saved to a retry table.

But a new question immediately arises: if only the single failed message is saved, how can order still be guaranteed?

Saving just that one message does not guarantee order: if the "place order" message fails and has not yet been retried asynchronously, the "payment" message will be consumed next, and it cannot be processed normally.

At this time, the "payment" message should be waiting all the time, judging from time to time, whether the news in front of it has been consumed?

If this is really done, two problems will arise:


The "pay" message is preceded by the "place order" message, which is relatively simple. However, if there are more than N messages in front of a certain type of message, how many times do you need to judge? this judgment is too coupled with the order system, which is equivalent to transferring part of the logic of their system to our system.

Affect the consumption speed of consumers

At this point a simpler solution emerged: when the consumer processes a message, it first checks whether that order number already has rows in the retry table. If so, it saves the current message straight to the retry table; if not, it runs the business logic, and if an exception occurs, it saves the message to the retry table.
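A rough sketch of that consumer-side check; retryRepository, orderService, OrderMessage and RetryRecord are illustrative names we introduce here, not the original code.

public void onMessage(OrderMessage msg) {
    // Earlier messages for this order are already parked in the retry table,
    // so park this one too to keep per-order processing in sequence.
    if (retryRepository.existsByOrderNo(msg.getOrderNo())) {
        retryRepository.save(RetryRecord.of(msg));
        return;
    }
    try {
        orderService.handle(msg);                  // normal business processing
    } catch (Exception e) {
        retryRepository.save(RetryRecord.of(msg)); // failed: park for asynchronous retry
    }
}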

Later we built a failure retry mechanism with elastic-job: if a message still fails after seven retries, its status is marked as failed and an email is sent to the developers.
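The retry job might look roughly like the sketch below. The original used an elastic-job trigger; Spring's @Scheduled stands in here, and every name (retryRepository, orderService, alertService) is a placeholder.

@Scheduled(fixedDelay = 60_000)   // the original fired this via elastic-job instead
public void retryFailedMessages() {
    for (RetryRecord record : retryRepository.findPending()) {
        try {
            orderService.handle(record.toMessage());
            retryRepository.markSucceeded(record.getId());
        } catch (Exception e) {
            if (record.getAttempts() + 1 >= 7) {
                retryRepository.markFailed(record.getId());   // give up after 7 attempts
                alertService.emailDevelopers(record);
            } else {
                retryRepository.incrementAttempts(record.getId());
            }
        }
    }
}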

With that, the problem of merchants not being able to see some orders and dishes because of the unstable network was solved. Now, at worst, a merchant occasionally sees a dish with a slight delay, which is far better than never seeing it at all.

Message backlog

As the sales team marketed the product, more and more merchants joined the system. The number of messages grew with them, the consumers could not keep up, and message backlogs became frequent. The impact on merchants was very direct: orders and dishes might not appear on the kitchen display client for as long as half an hour. A delay of a minute or two might be tolerable, but a half-hour delay was too much for some short-tempered merchants, who complained immediately. During that period we frequently received complaints about delayed orders and dishes.

Adding server nodes would have solved the problem, but in line with the company's usual cost-saving practice we had to try system optimization first, and so began our journey of fighting message backlogs.

1. The message body is too large

Although Kafka claims to support millions of TPS, sending a message from the producer to the broker takes one network IO, and the broker writing the data to disk takes one disk IO (a write). The consumer then fetches the message from the broker with one disk IO (a read) followed by one network IO.

So even a simple message needs two network IOs and two disk IOs on its way from production to consumption. If the message body is too large, the IO time inevitably goes up, which slows down both production and consumption; and once the consumers are too slow, messages pile up.

Besides that, an oversized message body also wastes the broker's disk space; if you are not careful, you can run out of disk.

So it was time to optimize the oversized message body.

How to optimize it?

We went over the business again: the kitchen display system does not need the intermediate states of an order, only the final state.

Great, so we could redesign it like this (a sketch follows the list):

The message body sent by the order system contains only key fields such as the order id and status.

After the kitchen display system consumes the message, it fetches the data by calling the order system's order-details query API with that id.

The kitchen display system then checks whether the order already exists in its own database: insert it if not, update it if so.
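A sketch of the slimmed-down flow on the consumer side; OrderEvent, OrderClient, OrderRepository and OrderDetail are illustrative types we assume here rather than the real ones.

public class KitchenDisplayConsumer {

    // Assumed shape of the slimmed-down message: only the order id and final status.
    public static class OrderEvent {
        public Long id;
        public String status;
    }

    private final OrderClient orderClient;         // calls the order system's query API
    private final OrderRepository orderRepository; // the kitchen display system's own storage

    public KitchenDisplayConsumer(OrderClient orderClient, OrderRepository orderRepository) {
        this.orderClient = orderClient;
        this.orderRepository = orderRepository;
    }

    public void onMessage(OrderEvent event) {
        // Pull the full details by id instead of carrying them in the message body.
        OrderDetail detail = orderClient.queryDetail(event.id);
        if (orderRepository.existsById(event.id)) {
            orderRepository.update(detail);   // order already stored: update it
        } else {
            orderRepository.insert(detail);   // first time we see this order: insert it
        }
    }
}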

Sure enough, after this adjustment the message backlog did not reappear for a long time.

2. Unreasonable routing rules

But don't celebrate too early. One day at noon, some merchants complained again that orders and dishes were delayed. As soon as we checked the Kafka topic, we found another message backlog.

This time, though, it was a bit odd: not every partition had a backlog, only one of them.

At first I thought something was wrong with the node consuming that partition, but after investigating I found nothing abnormal.

Strange. What could the problem be?

Later, checking the logs and the database, I found that a few merchants had an especially large volume of orders, and those merchants happened to be routed to the same partition, so that partition carried far more messages than the others.

That was when we realized that routing to partitions by merchant number is unreasonable: some partitions may receive more messages than their consumer can handle, while others may sit almost idle because they receive too few.

In order to avoid this uneven distribution, we need to adjust the routing rules for sending messages.

After some thought, routing by order number is much more even, and a single order will not generate a particularly large number of messages. The only exception would be someone endlessly adding dishes, but adding dishes costs money, so in practice there are not many messages for one order.

After the adjustment, messages are routed to partitions by order number, so messages with the same order number always go to the same partition.
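On the producer side the only change is the record key, roughly as below (topic and variable names are assumed):

// Keying by order number instead of merchant number spreads a busy merchant's
// orders across all partitions, while every message of one order still lands
// on the same partition, so per-order sequence is preserved.
producer.send(new ProducerRecord<>("order_topic", orderNo, orderJson));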

After this change, the message backlog again stayed away for a long time, even as the number of merchants kept growing rapidly.

3. A chain reaction caused by a batch operation

Under high concurrency, message backlogs come with the territory, and there is really no way to eliminate them once and for all. On the surface the problem is solved, but you never know when it will pop up again, as it did this time:

One afternoon, the product manager came over and said: several merchants are complaining that dishes are delayed, please find the cause quickly.

The problem is a little strange this time.

Why do you say that?

First of all, the timing was odd: problems usually hit during the lunch or dinner rush, so why was this happening in the middle of the afternoon?

Based on previous experience, I went straight to the Kafka topic data, and sure enough there was a backlog, but this time every partition had more than 100,000 backlogged messages, hundreds of times more than before. The backlog was unusually large.

I hurried to check the service monitoring to see whether the consumers had gone down; luckily they had not. Then I checked the service logs and found nothing abnormal. By now I was a bit confused, so I asked the order team whether anything had happened that afternoon. They said there had been a promotion, and they had run a JOB to bulk-update the order data of some merchants.

It suddenly dawned on me: the backlog was caused by the flood of messages their JOB produced. Why hadn't they told us? That was really frustrating.

We now knew the cause, but how were we supposed to deal with a backlog of hundreds of thousands of messages?

Simply increasing the number of partitions would not help at this point: the historical messages were already stored in the 4 existing partitions, and only new messages would land on any new partitions. What we had to deal with were the existing ones.

Adding more consumer nodes would not help either: Kafka allows one consumer in a group to consume multiple partitions, but it does not allow a single partition to be consumed by more than one consumer in the same group, so extra nodes would just sit idle and waste resources.

So the only option left was multithreading.

To fix things urgently, I used a thread pool to process the messages, with both the core and the maximum thread count set to 50.
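Roughly, the emergency change handed each polled record to a 50-thread pool, something like the fragment below; the consumer variable, topic and handle() are placeholders. Note that this trades away strict per-partition ordering, which was acceptable while draining the backlog.

ExecutorService workers = new ThreadPoolExecutor(
        50, 50, 0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<>(1_000),
        new ThreadPoolExecutor.CallerRunsPolicy());   // back-pressure when the queue fills up

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        workers.submit(() -> handle(record));   // process records in parallel instead of one by one
    }
}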

After the change, the backlog did indeed start to shrink.

But then a more serious problem appeared: I received an alert email saying that two nodes of the order system had gone down.

Soon, colleagues from the order team came to me and said that the concurrency of our calls to their order query interface had surged to several times the expected level, bringing down two of their service nodes. The query function is deployed as a single service on 6 nodes; 2 were already down, and without intervention the other 4 would follow. The order service is arguably the company's core service, an outage would be very costly, and the situation was extremely urgent.

To relieve the pressure, the only option was to reduce the number of threads.

Fortunately, the thread counts could be adjusted dynamically through ZooKeeper. I changed the core thread count to 8 and the maximum thread count to 10.
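ThreadPoolExecutor does support resizing at runtime; a sketch of the shrink we applied is below, with the ZooKeeper-backed config callback that triggers it omitted.

// When shrinking, lower the core size first so it never exceeds the new maximum,
// otherwise setMaximumPoolSize would reject the change.
workers.setCorePoolSize(8);
workers.setMaximumPoolSize(10);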

Later, after operations restarted the 2 failed order-service nodes, they returned to normal, and 2 more nodes were added just in case. To make sure the order service stayed healthy, we kept the reduced consumption rate, and our backlog monitoring showed things back to normal about an hour later.

Afterwards we held a post-mortem and reached the following conclusions:


1. Before any batch operation, the order system must notify the downstream teams in advance.

2. When a downstream team calls the order query interface from multiple threads, it must run a stress test first.

3. This was a wake-up call for the order query service: as the company's core service it was not well prepared for high-concurrency scenarios and needs to be optimized.

4. Add monitoring for message backlogs.

By the way, for scenarios that require strict message order, the single thread pool can be replaced with multiple queues, each served by a single thread.
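A sketch of that idea: hash the order number onto a fixed set of single-thread executors, so different orders run in parallel but each order is processed serially. The class and names are illustrative, not from the original system.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OrderedDispatcher {
    private final ExecutorService[] lanes;

    public OrderedDispatcher(int laneCount) {
        lanes = new ExecutorService[laneCount];
        for (int i = 0; i < laneCount; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();   // one worker thread per lane
        }
    }

    // Same order number -> same lane -> serial processing for that order.
    public void dispatch(String orderNo, Runnable task) {
        int lane = Math.floorMod(orderNo.hashCode(), lanes.length);
        lanes[lane].submit(task);
    }
}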

4. The table is too large

To keep the backlog from coming back, the consumers kept using multithreading to process messages.

But at noon one day we again received a pile of alert emails about a backlog on the Kafka topic. We were investigating the cause when the product manager came over and said: another merchant is complaining that dishes are delayed, please take a look. This time she looked a little impatient: we really had optimized many times, yet the same problem kept coming back.

To a layperson it looks like: why can the same problem never be fixed once and for all?

They just do not know the pain on the technical side.

On the surface the symptom is always the same, delayed dishes, which they know comes from a message backlog. What they do not know is that a backlog can have many different underlying causes. That is probably a common pain of using message middleware.

I said nothing and had no choice but to hunt down the cause.

Checking the logs, I found that consuming a single message now took as long as 2 seconds. It used to be 500 milliseconds; how had it become 2 seconds?

Oddly, the consumer code had not changed much, so why was this happening?

I checked the production dish table: the single table had grown to tens of millions of rows, and the other dish-related tables were the same. The single tables were simply storing too much data.

Our team went over the business again: the client actually only displays dishes from the last 3 days.

That made it easy: the server was storing far more data than needed, so the excess could be archived. The DBA archived the tables for us, keeping only the last 7 days of data.

After this adjustment the message backlog was resolved and calm was restored.

Primary key conflict

Don't celebrate too soon; there were other problems, for example alert emails frequently reporting a database error: Duplicate entry '6' for key 'PRIMARY', that is, a primary key conflict.

This usually happens when two or more SQL statements insert rows with the same primary key at the same time: after the first insert succeeds, the second reports a primary key conflict, because the table's primary key must be unique and duplicates are not allowed.

I examined the code carefully: the logic first queries the table by primary key to see whether the order exists, updates the status if it does, and inserts the row if it does not. That looks fine.

This check works when concurrency is low. Under high concurrency, though, two requests can both find that the order does not exist; one inserts first, and when the other inserts again, the primary key conflict occurs.

The most common way to solve this problem is to add locks.

That was my first thought too. A database pessimistic lock was out of the question, as it would hurt performance too much. A database optimistic lock based on a version number is generally used for update operations and is basically useless for an insert like this.

That leaves a distributed lock. Our system already uses Redis, so we could add a Redis-based distributed lock keyed on the order number.

But then I thought about it carefully:


1. A distributed lock could also slow down the consumer's message processing.

2. The consumer would come to depend on Redis, and a Redis network timeout would be a disaster for our service.

So I decided against a distributed lock as well.

Instead, I chose MySQL's INSERT INTO ... ON DUPLICATE KEY UPDATE syntax:

INSERT INTO table (column_list) VALUES (value_list) ON DUPLICATE KEY UPDATE c1 = v1, c2 = v2, ...;

It first tries to insert the row, and if the primary key conflicts, it updates the specified fields instead.
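Applied to our case it looks roughly like the following, with a hypothetical table and columns (kitchen_order, id, status) and Spring's JdbcTemplate standing in for whatever data-access layer is actually used:

// Insert the order row; on a duplicate primary key, fall back to updating the status.
String sql = "INSERT INTO kitchen_order (id, status) VALUES (?, ?) "
           + "ON DUPLICATE KEY UPDATE status = VALUES(status)";
jdbcTemplate.update(sql, order.getId(), order.getStatus());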

After the original insert statement was rewritten this way, the primary key conflicts never came back.

Database master-slave delay

Soon afterwards, a merchant complained that after placing an order, the order appeared on the kitchen display client but the dishes were incomplete, and sometimes neither the order nor the dish data could be seen at all.

This problem was different from the earlier ones. Following past experience, we first checked whether the Kafka topic had a backlog, but this time there was none.

Checking the service logs, we found that the order system's interface sometimes returned empty data, and sometimes returned only the order data with no dish data.

That was very strange. I went straight to the order team; they examined their service carefully and found nothing wrong. By then we all suspected the database and went to the DBA. Sure enough, the DBA found that replication from the master database to the slave occasionally lagged because of the network, sometimes by as much as 3 seconds.

If less than 3 seconds pass between the message being sent and it being consumed, calling the order details query API may find no data at all, or may not see the latest data.

This was a serious problem, because it could directly lead to wrong data on our side.

To fix it, we added a retry path here as well: when the API call returns empty data, or returns the order without its dishes, the message is added to the retry table.

After this adjustment the merchant complaints stopped.

Repeated consumption

Kafka supports three modes when consuming messages:

At most once: the offset of each message is committed before the message is processed. Messages may be lost, but they are never duplicated.

At least once: each message is processed successfully before its offset is committed. Messages are not lost, but they may be duplicated.

Exactly once: the offset is handled atomically together with the message, using it as a unique id, so each message is processed exactly once, neither lost nor duplicated. This is the hardest to achieve.

Kafka's default behaviour is at least once, but that can lead to repeated consumption, so our business logic must be idempotent.
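A minimal sketch of an at-least-once consumer loop with manual offset commits; the topic name and handle() are placeholders, and the connection properties are abbreviated. The offset is committed only after processing, so a crash re-delivers, and possibly duplicates, the batch.

Properties props = new Properties();
// bootstrap.servers, group.id and the deserializers are omitted here
props.put("enable.auto.commit", "false");            // commit offsets ourselves
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("order_topic"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        handle(record);        // must be idempotent: re-delivery is possible
    }
    consumer.commitSync();     // commit only after the whole batch has been processed
}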

Our scenario saves data with the INSERT INTO ... ON DUPLICATE KEY UPDATE syntax, inserting when the row does not exist and updating when it does, so it naturally supports idempotency.

Multi-environment consumption problem

At the time, our online setup was split into pre (the pre-release environment) and prod (the production environment). The two environments shared the same database and the same Kafka cluster.

Note that when configuring Kafka topics, a prefix was added to distinguish the environments: pre topics start with pre_, such as pre_order, and production topics start with prod_, such as prod_order, to prevent messages from crossing between environments.

However, while switching nodes in the pre environment, operations misconfigured the topic and pointed it at the prod topic. That very day we had new features deployed in pre, so some prod messages were consumed by the pre consumers, and because the message body format had been changed, the pre consumers failed to process them.

As a result, some messages were effectively lost to the production environment. Fortunately, in the end the production consumers recovered by resetting their offsets and re-reading that range of messages, so the damage was limited.
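One way to do such a rewind with the Java client is sketched below; the exact tooling used at the time is not recorded, the topic name is illustrative, and incidentStartMillis is a placeholder for when the misrouting began.

List<TopicPartition> partitions = new ArrayList<>();
for (PartitionInfo p : consumer.partitionsFor("prod_order")) {
    partitions.add(new TopicPartition("prod_order", p.partition()));
}
consumer.assign(partitions);

Map<TopicPartition, Long> query = new HashMap<>();
for (TopicPartition tp : partitions) {
    query.put(tp, incidentStartMillis);
}
// Look up the first offset at or after that time and rewind each partition to it.
for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : consumer.offsetsForTimes(query).entrySet()) {
    if (e.getValue() != null) {
        consumer.seek(e.getKey(), e.getValue().offset());
    }
}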

Thank you for reading. That covers "what are the problems with Kafka". After working through this article, you should have a deeper understanding of the problems Kafka can bring, though the specifics still need to be verified in practice.
