How to troubleshoot online faults

2025-07-12 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

This article explains how to troubleshoot online (production) faults. Many people are unsure how to go about it in day-to-day operations, so the editor has gathered material from various sources and distilled it into simple, practical steps. Hopefully it helps answer the question of how to troubleshoot online faults.

Preface

Every programmer has lived through an online failure, and recovering from one is among the fastest ways to improve. Fall into enough pits and you gradually become an expert. It is also a favorite interview question at large companies, because the candidate's answer tells the interviewer at least two things. The first is whether the projects you are usually responsible for are core ones: if you say you look after a back-office management system and simply restart it when something goes wrong, the interview will not go much further. The second is whether you can handle problems systematically: how you stop the bleeding quickly, how you locate the specific problem step by step, whether preparation before the failure was adequate, and whether there is an emergency plan for each risk point.

Next, let's talk about the online troubleshooting process.

Stop the bleeding quickly

In a distributed system, the top priority is to stop the bleeding quickly. Anyone who has worked at an Internet company knows the first questions asked in a dreaded postmortem meeting: why did the failure drag on for half an hour after the business complained, or why did it take 15 minutes to work out that a slow SQL query was to blame? Being grilled over either is not pleasant.

The reason is that faults in a distributed system easily produce a domino effect. For example, a slow SQL query makes an infrastructure service respond slowly. Upstream requests then pile up and their threads cannot be released, which makes the whole online system very slow and produces a large number of errors. This avalanche can unfold very quickly, and when testers, operations staff, and the owners of upstream systems are bombarding you with calls and messages, it is hard to think clearly.

What should you do? If the problem has not been located after a while, fall back on the last-resort measures first:

Restart. When the system is unavailable, restart it to restore availability first, then continue to locate the specific functional problem. It is awkward, though, if a flood of errors comes back a few minutes after the restart.

Rate limiting. Important interfaces should have rate-limiting configuration prepared in advance, so that the QPS limit of an interface can be changed dynamically.

Rollback. If there was a release the day before, nine times out of ten the release caused the fault. If the problem cannot be found right away, roll back first and then gather people to review the newly committed code.

Emergency scale-out. This only works if the service is stateless, supports dynamic scaling, and the bottleneck is in the application service. If the bottleneck is in the DB or anywhere else, adding instances will not help.
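
To make the rate-limiting measure above concrete, here is a minimal sketch of a token-bucket limiter whose QPS can be changed while the service is running. The class name DynamicRateLimiter and the methods allow and set_qps are hypothetical names chosen for illustration, not part of any framework mentioned in this article. In practice set_qps would be driven by whatever configuration mechanism your system already has.

```python
import threading
import time


class DynamicRateLimiter:
    """Token-bucket limiter whose QPS can be changed while it is running.

    Hypothetical illustration: names and burst policy are assumptions.
    """

    def __init__(self, qps, clock=time.monotonic):
        self._lock = threading.Lock()
        self._clock = clock          # injectable so the limiter can be tested
        self._qps = float(qps)
        self._tokens = float(qps)    # start with one second's worth of burst
        self._last = clock()

    def set_qps(self, qps):
        """Change the limit at runtime, e.g. from a configuration callback."""
        with self._lock:
            self._refill()
            self._qps = float(qps)
            # Never keep more tokens than the new one-second capacity.
            self._tokens = min(self._tokens, self._qps)

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        with self._lock:
            self._refill()
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False

    def _refill(self):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Capacity is capped at one second of traffic at the current QPS.
        self._tokens = min(self._qps, self._tokens + elapsed * self._qps)
```

A request handler would call allow() before doing any work and return a "too busy" response when it gets False. Lowering the limit during an incident is then a single set_qps call rather than a redeploy.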

Troubleshooting process

As mentioned earlier, if a release went out the day before and a fault then forces a rollback, nine times out of ten the regression testing was incomplete and the change broke existing logic. In the worst case, a group of people has to go through the code line by line. What we turn to now is what to do when the production service becomes slow and error alerts keep increasing.

Monitoring system

Service monitoring is one of the most important parts of system design. Once a system is online it cannot run unmonitored; otherwise you will not even know what killed it. One recommendation is CAT, the monitoring system open-sourced by Meituan. CAT monitors a range of metrics and the events on each call path in real time, including server CPU load, JVM memory, GC information, thread information, slow URLs, slow SQL, request response time (average, 95th percentile, 99th percentile), and the number of service error alerts per unit of time.
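
The response-time figures listed above (the average plus the 95th and 99th percentiles, sometimes called the "95 line" and "99 line") can be worked through by hand. The sketch below uses the nearest-rank definition of a percentile. That choice is an assumption made for illustration; this article does not say which method CAT uses, and the latency values are made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Average, 95th and 99th percentile of a list of response times in milliseconds."""
    return {
        "avg": sum(latencies_ms) / len(latencies_ms),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Hypothetical data: 98 fast requests and two very slow ones.
latencies = [10] * 98 + [500, 2000]
stats = summarize(latencies)
```

With this data the two slow requests raise the average to 34.8 ms while the 95th percentile is still 10 ms, and only the 99th percentile (500 ms) shows the slow tail. This is why monitoring dashboards show the high percentiles next to the average.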

Troubleshooting commands

Interviewers often like to ask candidates which commands are available for troubleshooting online and how to use them. Generally, the investigation goes from the whole to the parts.

1. Whole machine: start with the top command to check the overall situation. The key indicators are the load average, overall CPU usage, and the CPU and MEM usage of each process.

The uptime command gives a condensed view: the time since boot and the load averages.
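
As a rough illustration of reading the load average, the sketch below pulls the three load figures out of one line of uptime output and normalizes them by core count. The function names and the sample line are hypothetical. Treating a per-core load well above 1.0 as a warning sign is a common rule of thumb, not a threshold stated in this article.

```python
import os
import re


def parse_load_average(uptime_line):
    """Extract the 1-, 5- and 15-minute load averages from one line of uptime output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(x) for x in match.groups())


def load_per_core(load, cores=None):
    """Normalize a load figure by core count (defaults to this machine's core count)."""
    cores = cores or os.cpu_count() or 1
    return load / cores


# Hypothetical uptime output, for illustration only.
line = " 10:15:01 up 12 days,  3:04,  2 users,  load average: 8.12, 4.30, 2.05"
one_min, five_min, fifteen_min = parse_load_average(line)
```

In this made-up sample the 1-minute figure is far above the 15-minute figure, which would suggest the load has risen recently rather than being the machine's normal state.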

2. CPU: check CPU with the vmstat tool. vmstat takes two positional arguments. The first is the sampling interval in seconds, and the second is the number of samples. For example:

vmstat -n 2 3

This samples every 2 seconds, 3 times in total. The -n option prints the header only once.

procs

r: the number of processes running or waiting for a CPU time slice

b: the number of processes blocked waiting for resources such as I/O

cpu

us: the percentage of CPU time consumed by user processes. If it stays above 50% for a long time, the program probably needs to be optimized.

sy: the percentage of CPU time consumed by the kernel
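
The fields above can also be checked programmatically. The following sketch parses vmstat text output into rows keyed by column name and tests whether the us column stays above the 50% figure mentioned above. The sample output is made up, and the function names are hypothetical. Note that the first row vmstat prints reports averages since boot, so in real use you may want to skip it.

```python
def parse_vmstat(output):
    """Turn vmstat text output into a list of dicts keyed by column name (r, b, us, sy, ...)."""
    lines = [line.split() for line in output.splitlines() if line.strip()]
    # The column-name row starts with "r b"; the row above it is only a group banner.
    header_index = next(i for i, cols in enumerate(lines) if cols[:2] == ["r", "b"])
    header = lines[header_index]
    return [dict(zip(header, map(int, cols))) for cols in lines[header_index + 1:]]


def sustained_user_cpu(samples, threshold=50):
    """True if every sample shows user CPU (us) above the threshold."""
    return bool(samples) and all(s["us"] > threshold for s in samples)


# Hypothetical output of `vmstat -n 2 3`, for illustration only.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0      0 812340  10240 934512    0    0     2     8  310  620 62  6 31  1  0
 4  1      0 810112  10240 934600    0    0     0    12  355  701 71  8 20  1  0
 5  0      0 809876  10240 934640    0    0     0     4  340  688 68  7 24  1  0
"""

samples = parse_vmstat(SAMPLE)
busy = sustained_user_cpu(samples)
```

Looking columns up by header name rather than by position keeps the sketch working if a vmstat version adds or reorders columns.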

3. Memory

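Check memory with the free command: free reports in kibibytes by default, free -g in gibibytes, and free -m in mebibytes.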

Generally, look at the ratio of application available memory to system physical memory.
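
The ratio mentioned above can be computed from free output. This sketch assumes that "application available memory" corresponds to the available column and "system physical memory" to the total column of the Mem: row. That mapping is an assumption, the sample output is made up, and no threshold is applied because the article does not give one.

```python
def available_ratio(free_output):
    """Return available / total from the Mem: row of free output."""
    lines = [line.split() for line in free_output.splitlines() if line.strip()]
    header = lines[0]  # total used free shared buff/cache available
    mem = next(cols for cols in lines if cols[0] == "Mem:")
    values = dict(zip(header, map(int, mem[1:])))
    return values["available"] / values["total"]


# Hypothetical output of `free -m`, for illustration only.
FREE_SAMPLE = """\
              total        used        free      shared  buff/cache   available
Mem:           7821        2310        1204         150        4306        5060
Swap:          2047           0        2047
"""

ratio = available_ratio(FREE_SAMPLE)
```

For this sample the ratio is roughly 0.65, meaning about 65% of physical memory is still available to applications.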
