Shulou (Shulou.com) 06/03 Report
This article covers how to handle a production Kafka failure. Many people run into exactly this kind of dilemma in real incidents, so let me walk you through how we dealt with it. I hope you read carefully and take something away from it!
Background
A little after two o'clock one afternoon last week, A Fan was quietly writing code when several alert bots started sending messages about high load on the Kafka cluster. Since it wasn't peak time, he didn't pay much attention at first, assuming things would recover on their own. They didn't; the alerts kept coming and multiplying, so he grabbed his laptop and went to the ops office to investigate. One look was alarming: some topics in the cluster could no longer accept writes! Yet the producer side reported no errors and appeared to be writing normally, while the brokers were logging errors and the consumers received no data at all.
The broker logged the following error:
ERROR [KafkaApi-2] Error when handling request {replica_id=-1, max_wait_time=500, min_bytes=1, topics=[{topic=xxxx, partitions=[{partition=0, fetch_offset=409292609, max_bytes=1048576}]}]} (kafka.server.KafkaApis)
java.lang.IllegalArgumentException: Magic v1 does not support record headers
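This exception generally points to a message-format mismatch: record headers only exist in message format v2 (Kafka 0.11.0 and later), so records carrying headers cannot be converted to format v1 ("Magic v1"). As a hedged sketch of how one might diagnose this, the topic's format override can be inspected, and optionally raised, with the stock CLI tools; the topic name `xxxx` is the placeholder from the log above, and the broker address is an assumption:

```shell
# Sketch, not our exact commands. Older Kafka releases address topic
# configs via --zookeeper instead of --bootstrap-server.

# Show any per-topic override such as message.format.version
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name xxxx --describe

# One possible remedy (an alternative to deleting the topic):
# raise the topic's message format so v2 records with headers are accepted
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name xxxx \
  --add-config message.format.version=0.11.0
```

Note that we did not take this route during the incident; under time pressure we chose to delete the topic, as described below.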
The application code was unlikely to be at fault, since nothing had been upgraded recently, and restarting both the cluster and the services did not make the problem go away. At that point, to keep the business stable, and suspecting the topic itself was the problem, we decided to delete the topic and let it be recreated automatically. Some data would be lost, but that was acceptable; a prolonged write outage would have been far worse.
Resolution
Fortunately, our services use Nacos for configuration and service discovery. We modified the Kafka cluster configuration in Nacos to point temporarily at another cluster, then restarted the services (we do not enable automatic configuration refresh in Nacos). After the switch, data was written to the new cluster normally. We then manually deleted the faulty topic in the old cluster, after which the cluster returned to normal and the error above disappeared. With the errors gone, we switched the cluster configuration back via Nacos, and everything was fine.
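The manual cleanup described above can be sketched with the standard Kafka CLI. The topic name is the placeholder from the error log; the broker address, partition count, and replication factor are assumptions, and deletion requires `delete.topic.enable=true` on the brokers:

```shell
# Delete the faulty topic (on older Kafka releases, use --zookeeper
# instead of --bootstrap-server)
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --delete --topic xxxx

# With auto.create.topics.enable=true the topic is recreated automatically
# on the next produce request; otherwise recreate it explicitly:
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic xxxx --partitions 3 --replication-factor 2
```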
From discovery to resolution, the whole incident took a little over 20 minutes, but because the alert messages were ignored at the beginning, data was affected for close to an hour. Fortunately this data did not have a major impact on the online business, and part of it could be recovered from the temporary cluster and from log data.
After the post-incident review, we summarized the following takeaways to share with you:
Respect production! Check production alert messages immediately and confirm there is no problem.
Protect online data: back it up and switch to a temporary environment promptly. This switch must be dynamically configurable; do not crawl through a full release process. We recommend Nacos for this.
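As a sketch of that dynamic-configuration point, the Kafka address could live in a Nacos data item like the following (the dataId, keys, and host names here are purely illustrative, not our actual configuration):

```properties
# Illustrative Nacos data item, e.g. dataId: kafka-cluster.properties
# Failover becomes a one-line change here plus a service restart,
# since auto-refresh is not enabled in our setup.
kafka.bootstrap.servers=kafka-primary-1:9092,kafka-primary-2:9092
# During the incident we pointed at a standby cluster instead:
# kafka.bootstrap.servers=kafka-standby-1:9092,kafka-standby-2:9092
```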
Hold a review after the event: walk through the whole process, identify what can be optimized, what went wrong, and where time was wasted, and ask whether a similar situation could be resolved faster next time. In production, time is money; every extra minute in an incident is an extra minute of risk, and sometimes one minute changes a great deal.
That is the end of "how to solve the problem of production Kafka failure". Thank you for reading. If you want to learn more about the industry, you can follow this site, where the editor will keep publishing practical articles for you!