
How to analyze the investigation of a Kafka stutter incident


Regarding how to analyze the investigation of a Kafka stutter incident, this article walks through the troubleshooting in detail, step by step, in the hope of helping anyone facing the same problem find a simpler, workable approach.

A feature went live, and the data volume dropped sharply. How do you troubleshoot that? Here is the whole process.

1. Confirm that the problem is real?

The data team told us the data volume had dropped sharply, which made the seriousness clear. The problem appeared right after my feature went live, so my first reaction was: what is wrong with my code? Still, process first: compare the numbers across dimensions, requests received versus records actually landed. Confirm the problem!

In truth, during this step we never managed to confirm the drop ourselves. But the data had dropped nonetheless, so all we could do was move on to the next step!

2. Review the code with experienced colleagues and compare it against the original functionality?

Honestly, this step was a bit of a shot in the dark, because step 1 had produced no solid proof that the problem was ours. But ours was the only release in that window, so we had to reflect on ourselves first.

Fortunately, the process really was useful: we did find a pit we had dug for ourselves, and that pit genuinely could cause a drop in data volume. So we fixed it!

I breathed a sigh of relief and thought it was done. In fact, the data volume still would not come back up. Super awkward!

I began to doubt life. Had the code even been deployed? Was something different between production and local? The test environment passed repeatedly. I was sorely tempted to push the test-environment code straight to production. Hey, forget it; many things will not bend to anyone's will. Stay rational! No shortcuts!

3. Why not sit next to the DBA and watch the data volume in real time?

Self-investigation could no longer save us, so off to the DBA: please help count the change in data volume since the release. The result: not much difference. I thought the window might be too short to show a change, so we counted again later. Still no change! Oh no, the blame still sat with us.

Aggregate counts were getting us nowhere, so I tested with my own account. After completing the operation I watched the data, and found that sometimes a record appeared and sometimes it did not! Well, I was speechless.

4. Debug locally, then?

Originally we thought this was a production problem, best handled with an urgent fix online. But the reality exceeded my expectations: verifying directly in production is irresponsible to the users and irresponsible to the data. So we started locally.

Local debugging meant going through the VPN, which was a bit annoying, but it ran anyway. No problem found! Awkward.

Which led to the next question!

5. Is the production configuration different from the test environment?

So we tried to find every difference: even one extra file, or one file with an inconsistent modification time, was worth checking! Of course, for safety's sake we could not verify directly in production unless we had solid evidence that the production configuration was wrong. We never found such evidence; instead we moved everything from production into the test environment and verified there. The result: everything ran without a hitch!
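For reference, a minimal sketch of how such a comparison can be done (the two paths here are hypothetical stand-ins for the production and test config trees):

    # -r recurses into subdirectories, -q prints only the names of files
    # that differ or exist on one side only.
    diff -rq /tmp/prod-conf /opt/app/conf

    # diff ignores timestamps, so list modification times side by side too.
    ls -lR /tmp/prod-conf /opt/app/conf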

There was also a logical reason this road was closed: would a configuration that had been running fine suddenly break by itself? Impossible. No through road!

6. Nothing else works, so the only option is to change the code and debug online?

Debugging step one: logs everywhere! We added complete logging to the parts of the request path that previously lacked it, and deployed again. With logs there is evidence. And it really had been a mistake: the old logs were not written correctly, cheerfully printing a parameter object as its memory address instead of its contents.

After fixing the logs, we tested again, still with my own account. Same as before: sometimes the data got in and sometimes it did not (our monitoring method: have the DBA set up a temporary kafka consumer and pull the data out to look at it)! So what on earth was going on?
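A sketch of such a throwaway consumer with Kafka's stock console tool (broker address and topic name here are hypothetical):

    # Tail the topic under a one-off consumer group and eyeball whether
    # our test records actually arrive.
    bin/kafka-console-consumer.sh \
      --bootstrap-server broker1:9092 \
      --topic user-events \
      --group debug-peek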

Could some machines be broken? A request routed to a bad machine fails; a request routed to a healthy machine succeeds. I spent a long time verifying piles of data along that line. I thought this was the direction, but I was beaten back again.

7. Can't we just capture packets?

Tcpdump, the packet-capture artifact for network traffic, with lsof to assist.

We captured packets to confirm one thing: the client machine was sending requests to the server machine and the network flow was running normally. The capture showed that the client held a large number of long-lived connections to the server and that the data stream was sent and received normally (the SYN handshakes completed). That at least showed nothing was wrong on the client side! Which left one suspect: something wrong on the server side! We firmly believed it; of course, belief still needs evidence.
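A sketch of the capture and the connection check (interface, broker address, and port are hypothetical; 9092 is kafka's default):

    # Show traffic between this client and the broker; -nn prints raw
    # addresses and ports instead of resolving names.
    tcpdump -i any -nn host 10.0.0.5 and port 9092

    # List established TCP connections to the broker port.
    lsof -nP -iTCP:9092 -sTCP:ESTABLISHED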

By the same token, we captured in the reverse direction on the server machine, and the packets from the client arrived perfectly smoothly. Uh...

8. No train of thought left. Why not restart the machine?

No, I mean restarting the services. There had been some changes recently, and by rights whoever changes something restarts it. But that was useless here, because previous releases had already restarted the application n times. Then what? All that was left was to restart the server side, the kafka service, and hope for the best.
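For reference, a sketch of a broker restart with the scripts that ship with kafka (paths are illustrative; many deployments wrap this in systemd or a supervisor instead):

    # Stop the broker, then start it again with the same configuration.
    bin/kafka-server-stop.sh
    bin/kafka-server-start.sh -daemon config/server.properties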

After the restart, we verified again. The result: still sometimes success, sometimes failure!

9. Change an asynchronous request to a synchronous request?

Out of ideas again, and I was not reconciled to it: why is the test environment fine while production is not? Think about the differences again.

The conclusion: production has high concurrency and the test environment does not. Then we noticed that this piece of code sends via an asynchronous thread. Could the problem be here?

Either way, try a synchronous request instead. One more release!

Needless to say, after switching to synchronous sending, user requests became noticeably slower across the board, but the kafka requests really did land. Even if this was the cause, we could not ship it this way: user experience comes first. Rather than keep wrestling with turning asynchronous into synchronous, we changed it back and moved on to something else!
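Our change was in application code, but the same contrast can be sketched with kafka's console producer, which batches sends asynchronously by default and, in the versions we know, accepts a --sync flag to send one request at a time (broker and topic names are hypothetical; newer versions spell the first option --bootstrap-server):

    # Default: asynchronous batching; fast for the caller, failures surface later.
    bin/kafka-console-producer.sh --broker-list broker1:9092 --topic user-events

    # --sync: each send blocks until the broker acknowledges it; slower,
    # but failures are visible immediately.
    bin/kafka-console-producer.sh --broker-list broker1:9092 --topic user-events --sync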

10. Back to the test environment, then, for a concurrency stress test?

After reverting to asynchronous sending, we were back to the original pattern of sometimes success, sometimes failure.

Since high online concurrency was the suspect, why not run a high-concurrency stress test in the test environment? We used shell to quickly write a looping request script and fired a large number of requests at kafka. No anomaly appeared, so pure concurrency was ruled out. (A for loop running nohup a.sh > /dev/null 2>&1 & n times simulates n concurrent requests; see the sketch below.)
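A cleaned-up sketch of that loop (a.sh stands in for whatever single-request script was actually used):

    # Launch 50 copies of the request script in the background to simulate
    # 50 concurrent requesters; all output is discarded.
    for i in $(seq 1 50); do
      nohup ./a.sh > /dev/null 2>&1 &
    done
    wait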

11. Why don't you go over the code again?

I had lost count of how many times we had been over the code, but we went over it again; what else could we do? This time several people read it together!

However, it turned up nothing.

12. Put aside user behavior and issue the request directly from the command line.

User behavior is the most authentic form of verification, but also the most cumbersome.

So we cut out all the intermediate links and made requests directly against the kafka server!

There were two ways: (1) request with our current code, and (2) request with the tooling that ships with kafka. We got two different results: data requested through our code did not go through, while requests through kafka's own tooling came back at the millisecond level. Hey, doesn't this make me suspect the code again?
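A sketch of path (2), timing a single produce with the stock tool (broker and topic are hypothetical; older versions take --broker-list, newer ones --bootstrap-server):

    # Pipe one record through the console producer and time the round trip;
    # against a healthy broker this completes in milliseconds.
    time ( echo "probe-1" | bin/kafka-console-producer.sh \
      --broker-list broker1:9092 --topic user-events )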

13. There is no way out. Let's take another look at the data.

With truly no ideas left, all we could do was watch the data while the time passed.

The break came just when we least expected it: the data had returned to normal! Damn it!

Tracing the timing of the reversal, the recovery was due to the reboot of kafka back in step 8.

Well, the problem was located: it was caused by kafka stuttering. We could not hold on any longer; send the conclusion email and go to bed first.

14. Why did kafka stutter?

This is the root of the problem! We simply did not have the energy to dig further at the time!

The conclusion: the request volume on the topic was too large and its partition count too small, so throughput collapsed. After increasing the number of partitions, everything finally returned to normal!
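A hedged sketch of that fix with kafka's stock admin tool (topic name and target count are hypothetical; note that kafka only allows increasing a partition count, and older versions take --zookeeper instead of --bootstrap-server):

    # Check the current partition count.
    bin/kafka-topics.sh --bootstrap-server broker1:9092 --describe --topic user-events

    # Raise it; new data then spreads across 32 partitions.
    bin/kafka-topics.sh --bootstrap-server broker1:9092 --alter --topic user-events --partitions 32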

That is our answer to how to analyze the investigation of a kafka stutter incident. I hope the above is of some help; if you still have questions, follow the Internet Technology channel for more related knowledge.
