
The system is down? Don't panic: refer to the following measures


As a programmer, I believe there is one thing you never want to see: a technical failure in the system running in production (especially on a weekend when you are out having fun :D).

Dealing with this kind of incident is a real test of a person's overall ability, because it involves working under pressure, communicating with people outside the team, and the technical skill needed to troubleshoot the problem. If you have never had the chance to be a core developer, you rarely get this kind of high-pressure experience. Handling things in this situation is genuinely nerve-racking: everyone who uses the system, their bosses, your superiors, your own boss, and so on are all watching.

I still remember one Singles' Day when, as the "chief problem handler", I was urgently dealing with pressure on the servers. My boss silently walked up behind me and asked, "What's the problem? When will it be fixed?" Picture that scene in your head.

As long as you keep working as a programmer, I think you will get the chance to encounter such a scene, thanks to a famous law: Murphy's Law. Murphy's Law: anything that can go wrong will go wrong. If you don't have a clear idea of how to deal with it, then when an online problem occurs you will scurry around like an ant on a hot pan and bump about like a headless fly.

So this time I would like to share some of my experience as a "chief problem handler" over the years. It is experience paid for with a great deal of sweat and brain cells.

When we come across a bug during the daily iterations of project development, the process is almost always the same: locate the bug -> fix the bug. Perhaps a few people will also pause and review after fixing it, to see whether similar bugs lurk elsewhere and get rid of them too.

This "locate -> fix -> review" process also applies to handling online problems, but it cannot stop there. As the saying goes, the hardest part of solving a problem is not the fixing but the locating. For online problems, we cannot afford to wait however long locating the cause takes, so restoring normal use of the system becomes the top priority. The process therefore becomes: "restore -> locate -> fix -> review".

One more thing: some people hold the view that the primary goal of restoring the system should even come at the cost of preserving the scene, which might only take a few extra minutes. I disagree. The time it takes to resolve the problem is indeed a very important metric, but since the failure has already happened, if the root cause can never be found afterwards because the scene was not preserved, the next time the same problem occurs the situation will be even uglier. My view is that preserving the scene comes first. So the process becomes: "preserve the scene -> restore -> locate -> fix -> review".

Of course, preserving the scene does not mean being exhaustive and spending a lot of time on it. Save all the relevant clues you can think of at the moment in the quickest possible way. If the root cause cannot be found afterwards because a clue is missing, that only means you were inexperienced; note which on-site data should be retained next time. Okay, now that the five steps are clear, what exactly should you do in each one? Let me go through them one by one.

/ 01 Preserve the scene /

The most, most, most important thing to preserve is the dump file of the faulty process. With it, you can stop analyzing the problem blind and quickly locate its source.

I used "most" three times to emphasize its importance. If you have not mastered this yet, set aside everything I say later and master it first.
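For a JVM-based service, for example, a heap dump can be captured from outside the process with the JDK's own tools (jmap or jcmd), or from inside it. Below is a minimal in-process sketch using the standard HotSpotDiagnosticMXBean; the output path is only an illustration, and other runtimes have their own equivalents (gcore, procdump, and so on).

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;

public class IncidentHeapDump {
    // From outside the process, the same result comes from the JDK tools:
    //   jmap -dump:live,format=b,file=/tmp/incident.hprof <pid>
    //   jcmd <pid> GC.heap_dump /tmp/incident.hprof
    // This method dumps the heap of the current JVM to the given .hprof file.
    public static void dumpHeap(String hprofPath, boolean liveObjectsOnly) throws IOException {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic", HotSpotDiagnosticMXBean.class);
        diagnostic.dumpHeap(hprofPath, liveObjectsOnly);
    }

    public static void main(String[] args) throws IOException {
        // Illustrative path only; pick a disk with enough free space for the heap.
        dumpHeap("/tmp/incident-" + System.currentTimeMillis() + ".hprof", true);
    }
}
```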

In addition, if the system's monitoring is not comprehensive, quickly save the monitoring data of the operating system and of the third-party components when the problem occurs, even if that means taking screenshots.

When saving monitoring data, pay special attention to network-related data. If you find anything abnormal in it, also save the current network connections with a command, because, relatively speaking, the probability of a network problem is much higher than that of a hardware problem, whether it is caused by the program or by something else. The larger the system, the more this holds.
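On a Linux host, the connection table can be snapshotted with a single command (ss -antp, or netstat -antp on older systems). A small sketch of folding that into your incident tooling, assuming a Linux environment and an illustrative output directory, might look like this:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ConnectionSnapshot {
    // Saves the current TCP connection table to a timestamped file for later analysis.
    // "ss -antp" lists all TCP sockets with numeric addresses and their owning processes;
    // on hosts without ss, "netstat -antp" produces similar output.
    public static Path save(Path outDir) throws IOException, InterruptedException {
        Path out = outDir.resolve("tcp-connections-" + System.currentTimeMillis() + ".txt");
        Process p = new ProcessBuilder("ss", "-antp")
                .redirectErrorStream(true)
                .redirectOutput(out.toFile())
                .start();
        p.waitFor();
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Saved connection snapshot to " + save(Paths.get("/tmp")));
    }
}
```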

/ 02 Restore /

There are many ways to restore system availability. First, there is a trick that works in 80% of cases: restart. Yes, based on years of experience, it really does work most of the time. Precisely because it works so often, many people habitually restart first, forget to preserve the scene, and destroy it. There are also two kinds of restart: a forced restart and a natural (graceful) restart. A graceful restart is of course preferred, since it avoids producing unexpected dirty data. But if the system is consuming resources abnormally, don't wait foolishly for a graceful restart; you can only force it (kill the process).
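Whether you do this from a shell (kill, then kill -9 after a grace period) or from code, the idea is the same: ask the process to exit politely first, and only force it if it does not comply. A minimal sketch using the standard ProcessHandle API (Java 9+), with the PID and grace period as placeholder inputs:

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class RestartHelper {
    // Asks the process to exit gracefully first (so shutdown hooks can run and dirty data
    // is avoided), then forces termination if it has not exited within the grace period.
    public static void stop(long pid, Duration gracePeriod) {
        ProcessHandle.of(pid).ifPresent(handle -> {
            handle.destroy(); // graceful stop (SIGTERM on most platforms)
            try {
                handle.onExit().get(gracePeriod.toSeconds(), TimeUnit.SECONDS);
            } catch (Exception notExitedInTime) {
                handle.destroyForcibly(); // forced stop (SIGKILL on most platforms)
            }
        });
    }
}
```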

The second common method is "rollback". The prerequisite, of course, is that you judge the problem to be caused by the most recent release. Otherwise a blind rollback will not only fail to help, it will make things messier and messier, especially in a distributed system. In a distributed system, once the interfaces between upstream and downstream stop matching, the mild outcome is a flood of errors; the severe outcome is a large amount of bad data that will keep you busy cleaning up for a long time.

The third method is "degradation": pause the faulty module and stop its service. This, of course, requires good communication with the business side about whether degrading a single module will cause problems such as incomplete business flows.
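A degradation switch can be as simple as a flag checked in front of the faulty module, so that callers get a harmless empty result instead of errors while it is paused. A minimal sketch; the module name and downstream call here are hypothetical:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class RecommendationModule {
    // A runtime "kill switch" that lets operators pause a non-core module
    // (flipped via a config center or an admin endpoint) instead of taking the system down.
    private static final AtomicBoolean DEGRADED = new AtomicBoolean(false);

    public static void setDegraded(boolean degraded) {
        DEGRADED.set(degraded);
    }

    public static List<String> recommend(String userId) {
        if (DEGRADED.get()) {
            // Degraded mode: return an empty but valid response so callers keep working.
            return Collections.emptyList();
        }
        return queryDownstreamService(userId); // the troubled dependency being bypassed
    }

    private static List<String> queryDownstreamService(String userId) {
        // Placeholder for the real call into the faulty module.
        return List.of("item-1", "item-2");
    }
}
```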

The fourth method is "rate limiting" or "scaling out". If you find the system cannot handle a sudden surge of traffic, quickly add a few machines and program instances if you can. If you cannot scale out, choose rate limiting and directly refuse service to a certain percentage of requests. After all, compared with being unable to serve anyone, the latter is clearly the better deal.
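Refusing "a certain percentage of requests" can literally be a probabilistic load shedder placed at the entrance of the service, with the reject ratio adjustable at runtime. A minimal sketch (the names are illustrative; a real setup would more likely use the rate limiter built into your gateway or framework):

```java
import java.util.concurrent.ThreadLocalRandom;

public class LoadShedder {
    // Rejects roughly rejectRatio of incoming requests so the remainder stays healthy.
    // rejectRatio = 0.3 means about 30% of requests get an immediate "try again later".
    private volatile double rejectRatio;

    public LoadShedder(double rejectRatio) {
        this.rejectRatio = rejectRatio;
    }

    public void setRejectRatio(double rejectRatio) {
        this.rejectRatio = rejectRatio;
    }

    public boolean allow() {
        return ThreadLocalRandom.current().nextDouble() >= rejectRatio;
    }
}
```

At the request entry point, a handler that sees allow() return false responds immediately with something like HTTP 503 instead of doing any real work.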

There are also some less common methods, such as "switching to a standby" and "fault isolation"; they place more requirements on the environment and conditions. Sometimes the system may not return to a completely normal state: for example, reads are fine but some write operations still fail. In that case, do not rush to locate the problem; first do your best to restore the system to the greatest usable state before taking the next step. After all, users come first.

/ 03 Locate /

As for locating the problem, it is easiest if you have a dump file. Analyze it with a dump-analysis tool and you can quickly pinpoint the offending lines of code, especially when the symptoms are clearly the program's own: thread blocking, out-of-memory, CPU at 100%, and so on.

Different languages have different dump-analysis tools; you can search for tutorials online. The ultimate goal is to locate the stack trace of the anomaly, which is equivalent to pinpointing the problem code directly, wherever it appears.
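On the JVM, for example, the usual tools are jstack for thread dumps, jmap -histo for object histograms, and Eclipse MAT for heap dumps. As a rough in-process illustration of what you are scanning a thread dump for, the sketch below prints the stack of every BLOCKED thread in the current JVM:

```java
import java.util.Map;

public class BlockedThreadScanner {
    // A rough in-process equivalent of reading "jstack <pid>" output by hand:
    // print the stack of every thread currently BLOCKED waiting on a monitor.
    public static void printBlockedThreads() {
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            if (thread.getState() == Thread.State.BLOCKED) {
                System.out.println("BLOCKED thread: " + thread.getName());
                for (StackTraceElement frame : entry.getValue()) {
                    System.out.println("    at " + frame);
                }
            }
        }
    }

    public static void main(String[] args) {
        printBlockedThreads();
    }
}
```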

Analyzing a dump file lets you skip the step of peeling the problem apart layer by layer and get straight to the point. Working through monitoring data and logs layer by layer, by contrast, is slow work. But if the dump file is missing, or the problem simply cannot be read out of the dump, the latter is your only option.

You must keep correlation in mind when reading logs and monitoring data, rather than looking at a single dimension. Sometimes the data in one dimension looks normal on its own, and the problem only shows up once you correlate it with another. For example, the number of TCP connections has halved, yet memory has grown by 100%. Why? There may be a clue to the fault in there.

/ 04 Fix /

Once the problem is located, fixing it is usually simple: change the code where the code needs changing, change the configuration where the configuration needs changing. There is not much more to say here; the situations are too varied, and what each of us encounters may be different.

/ 05 Review /

Everyone knows the benefits of a review, but not many people actually do one. If you don't know where to start, you might as well begin with the following questions.

1. What is the cause of this failure?

2. Was there a faster way to restore the business at the time?

3. How to avoid similar failures?

4. Are there similar potential risks in the current system?

If you can answer these questions, I think the review has done its job; what remains is execution.

Of course, no matter how well you handle a failure, the best outcome is for it not to happen at all. So we need to do more preparation in advance.

/ 06 Know your program /

Many of us know the programs we are responsible for only through the code. Unless the program is a standalone, single application, that is far from enough.

I suggest you go through the following list to get to know your program:

1. What modules does the program contain, and who are their users? Which are the core modules, and which can be "abandoned" if necessary?

2. How does data flow between the modules / systems? (Try drawing a flow chart to deepen your memory.)

3. Which middleware do you rely on and who is responsible for maintaining them?

4. What other programs does it depend on, are those dependencies strong or weak, and who is responsible for maintaining them?

5. What storage sits behind the storage and message queues it depends on, and who is in charge of operating and maintaining that storage?

6. In what environment is the online program deployed? Do you have what you need to deploy and tune it yourself?

/ 07 Build good monitoring /

Most failures do not happen suddenly; they accumulate gradually until they erupt. So the value of monitoring lies not only in looking at the data, but in identifying anomalies.

Monitoring is generally divided into two dimensions, the system dimension and the business dimension, and monitoring metrics into three layers: "environment metrics", "program metrics", and "business metrics". How to do this is explained in my earlier article "Distributed system focus: 360° omni-directional monitoring", so I won't repeat it here.

If it is a distributed system, you can also build a request-tracing system. There are many mature off-the-shelf solutions: CAT, SkyWalking, Zipkin, Pinpoint, and so on.

One more thing: when setting up monitoring and alerting, pay attention to volatility in addition to setting thresholds. For example, for a resource whose daily utilization is around 20%, besides alerting when it exceeds 80%, you should also alert when its swing exceeds 100% (that is, utilization reaches 40%). Otherwise, once it climbs toward 80% at a fast pace, your chance of eliminating the fault before it erupts is very slim.
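The rule in that example can be expressed as a tiny check: alert when the value crosses an absolute ceiling, or when it deviates from its usual baseline by more than a configured swing. A minimal sketch (class name and thresholds are illustrative):

```java
public class UtilizationAlertRule {
    private final double ceiling;     // absolute threshold, e.g. 0.80
    private final double swingRatio;  // relative swing, e.g. 1.0 means "100% above the baseline"

    public UtilizationAlertRule(double ceiling, double swingRatio) {
        this.ceiling = ceiling;
        this.swingRatio = swingRatio;
    }

    // baseline is the usual utilization (e.g. 0.20); current is the latest sample.
    public boolean shouldAlert(double current, double baseline) {
        boolean overCeiling = current >= ceiling;
        boolean overSwing = baseline > 0 && (current - baseline) / baseline >= swingRatio;
        return overCeiling || overSwing;
    }
}
```

With a ceiling of 0.80 and a swing ratio of 1.0, a resource that normally sits at 20% triggers an alert as soon as it reaches 40%, well before the 80% ceiling.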

In addition, preset several response plans for common failures, and run regular failure drills (usually something considered by companies of a certain size, or companies in a rapid-growth stage); this makes the team much calmer when an online failure actually happens.

If you have the misfortune to become the person handling an online failure, and your supervisor is not around, report the problem to them regularly so that they understand its severity and the repair progress and can make decisions.

In any case, even if you don't report it, you will be chased for updates sooner or later. Rather than being chased passively, take the initiative to report.

The original text is from: https://www.linuxprobe.com/system-down-process.html
