What is the idea of event and fault troubleshooting in IT operation and maintenance?

2025-04-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article introduces an approach to handling events and troubleshooting faults in IT operation and maintenance. It has some reference value for anyone interested in the topic, and I hope you will take something useful from it. Let the editor walk you through it.

Before explaining the ideas behind event and fault handling, let's look at a fault scenario (taking a call center system as an example):

Business staff reported that the call center system was running slowly. Some calls timed out in the self-service voice system, the traffic was transferred to human agents, and the agent lines were overwhelmed.

The operation and maintenance staff got to work: checking resource usage, checking whether services were normal, checking the logs for errors, and checking whether transactions were still flowing. Time slipped away at the keyboard, but the cause had still not been located.

The manager came over to ask about the situation: "Has the system been restored?", "What is the impact of the failure?", "Have transactions been interrupted?" ...

The operation and maintenance staff typed away, writing SQL to check the transaction volume, then typing commands to check system resources and the overall situation.

In the end, the cause turned out to be a feature that did not limit the number of records it returned, resulting in a memory leak.

After this fault, the business side hoped that operation and maintenance staff could restore service more quickly. The manager wanted to optimize the call center fault handling process and asked for the following:

1. Shorten the time spent in the fault handling process: "Don't use the keyboard for work that can be done with the mouse."

2. Detect faults in advance by strengthening monitoring: "Technical staff should discover the problem before the business does, and monitoring should not only raise alarms but also help locate the fault."

3. Improve the emergency response plan: "The emergency plan should be up to date, accurate, simple and clear."

4. Long-term goal, fault self-healing: "Operations that can be standardized should be automated, and whatever can be done by a machine should be done by a machine."

The rest of this article starts with common fault handling methods, then covers fault preparation work (improving monitoring, making emergency plans, and so on) to address the points raised by the manager, and finally puts forward an idea for handling faults in the future.

1. Common methods:

1) Determine the fault phenomenon and make an initial assessment of its impact.

Before dealing with a fault, the operation and maintenance personnel should first understand the fault phenomenon, because it directly determines which emergency measures are chosen. This requires a certain degree of familiarity with the overall function of the application system.

Once the fault phenomenon is confirmed, it guides the operation and maintenance personnel in judging the impact of the fault.

2) Emergency recovery

The most basic indicator of operation and maintenance is system availability, and the timeliness of emergency recovery is a key factor in system availability.

Having judged the fault phenomenon and its impact as described above, we can decide on an emergency operation. There are many kinds of emergency operation, for example:

If the overall performance of a service is degraded or abnormal, consider restarting the service.

If a change has been made to the application, consider whether the change needs to be rolled back.

If resources are insufficient, consider emergency capacity expansion.

For application performance problems, consider adjusting application parameters and log parameters.

If the database is busy, consider optimizing SQL based on database snapshot analysis.

If there is a defect in the design of an application function, consider closing the function menu as an emergency measure.

And so on.

In addition, before taking emergency action, the current system state should be preserved where conditions permit, for example by capturing a core file or a database snapshot before killing a process.

3) Quickly locate the cause of the fault

Is the fault occasional, and can it be reproduced?

Whether the fault phenomenon can be reproduced is very important for solving the problem quickly. If it can be reproduced, there is always some method or tool that can help us locate the cause, and reproducible faults are often caused by service anomalies, changes and similar work.

If the fault is occasional and occurs with very small probability, it is harder to troubleshoot. Whether the root cause can be located then depends on whether the system kept enough on-site information during the fault.

Have any related changes been made?

Most faults are caused by changes. After determining the fault phenomenon, if a change has been made, analyzing whether the change caused the fault helps to locate it quickly and to prepare emergency measures such as rollback.
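As a minimal sketch of checking changes against a fault, the following Python lists the changes applied to the affected systems shortly before the fault began. The `Change` record, the ticket numbers and the 24-hour lookback window are assumptions made for illustration, not part of any particular change management tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    change_id: str       # hypothetical change ticket number
    system: str          # system the change was applied to
    applied_at: datetime


def suspect_changes(changes, fault_start, systems, lookback_hours=24):
    """Return changes applied to the affected systems within the lookback
    window before the fault, most recent first."""
    window_start = fault_start - timedelta(hours=lookback_hours)
    hits = [
        c for c in changes
        if c.system in systems and window_start <= c.applied_at <= fault_start
    ]
    return sorted(hits, key=lambda c: c.applied_at, reverse=True)


changes = [
    Change("CHG-001", "ivr", datetime(2025, 4, 14, 22, 0)),
    Change("CHG-002", "workorder", datetime(2025, 4, 10, 9, 0)),
    Change("CHG-003", "core", datetime(2025, 4, 15, 8, 30)),
]
fault = datetime(2025, 4, 15, 10, 0)
suspects = suspect_changes(changes, fault, {"ivr", "core"})
print([c.change_id for c in suspects])
```

The most recent change comes first because it is usually the first candidate for rollback.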

Can the scope be narrowed down?

On the one hand, application systems favor decoupling, so one transaction flows through several application systems and modules. On the other hand, a fault may originate in the application, system software, hardware, network and so on. Avoid investigating everything at once. It is better to narrow the problem down to a particular link in the chain before coordinating the related teams to troubleshoot.
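One simple way to narrow the scope, sketched below in Python, is to compare the current latency of each link in the transaction path with its normal baseline and pick the link that has degraded the most. The segment names, the numbers and the `min_ratio` threshold are illustrative assumptions.

```python
def slowest_segment(current_ms, baseline_ms, min_ratio=2.0):
    """Return (segment, ratio) for the link whose latency grew the most
    relative to its baseline, or None if no link reaches min_ratio."""
    worst = None
    for segment, now in current_ms.items():
        base = baseline_ms.get(segment)
        if not base:
            continue  # no baseline (or a zero baseline): cannot compare
        ratio = now / base
        if ratio >= min_ratio and (worst is None or ratio > worst[1]):
            worst = (segment, ratio)
    return worst


# Hypothetical average transaction times in milliseconds
baseline = {"ivr": 120, "interface_bus": 40, "core": 200}
current = {"ivr": 130, "interface_bus": 45, "core": 1800}
print(slowest_segment(current, baseline))
```

With a result like this, only the team responsible for the degraded link needs to be called in first.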

Cooperative analysis with related parties

The previous point advises against having all related teams investigate at the same time. Once the scope has been narrowed, however, we should openly ask the related parties to help locate the problem, and we should cooperate just as actively when they ask us.

Are there enough logs?

The most commonly used way to locate the cause of a fault is to analyze the application log. Operators need to know not only which service process a business function corresponds to, but also which application log that service process writes, and they need some basic ability to recognize exceptions and errors in the application log.
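As an illustration of simple log judgment, the sketch below counts error-level lines per module so that the noisiest module stands out. The log line format (`date time LEVEL [module] message`) is an assumption, and real application logs will need their own pattern.

```python
import re
from collections import Counter

# Assumed log format: "2025-04-15 10:01:03 ERROR [ivr] timeout waiting for core"
LINE_RE = re.compile(
    r"^\S+ \S+ (?P<level>[A-Z]+) \[(?P<module>[^\]]+)\] (?P<msg>.*)$"
)


def summarize_errors(lines, levels=("ERROR", "FATAL")):
    """Count error-level lines per module, most frequent first.
    Lines that do not match the assumed format are ignored."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and m.group("level") in levels:
            counts[m.group("module")] += 1
    return counts.most_common()


sample = [
    "2025-04-15 10:01:02 INFO [ivr] call accepted",
    "2025-04-15 10:01:03 ERROR [ivr] timeout waiting for core",
    "2025-04-15 10:01:04 ERROR [ivr] timeout waiting for core",
    "2025-04-15 10:01:05 FATAL [bus] out of memory",
    "garbage line",
]
print(summarize_errors(sample))
```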

Are there files such as core or dump files?

The state of the system during the fault is very important. It is recommended to keep files that record that state, such as core or dump files, or to collect information with a trace, and to back up any logs that may be overwritten before taking emergency action.
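Backing up logs before an emergency operation can be scripted so that it takes seconds. The following Python is only a sketch under assumptions: the function name and the timestamped directory layout are made up for illustration, and capturing core or dump files is left to the platform's own tools.

```python
import shutil
import time
from pathlib import Path


def preserve_scene(log_paths, backup_root):
    """Copy logs that may be overwritten into a timestamped directory.

    Returns the list of backup files created. Files with the same name
    from different directories would overwrite each other in this sketch.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target_dir = Path(backup_root) / f"scene-{stamp}"
    target_dir.mkdir(parents=True, exist_ok=True)

    saved = []
    for src in map(Path, log_paths):
        if not src.is_file():
            continue  # skip logs that do not exist on this host
        dst = target_dir / src.name
        shutil.copy2(src, dst)  # copy2 also preserves file timestamps
        saved.append(dst)
    return saved
```

Running such a step before restarting a service keeps the evidence needed for later root cause analysis.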

The above are common methods for ordinary faults. When a major fault or a fault involving several parties occurs, troubleshooting within a small group is often not conducive to rapid resolution, and the emergency handling process needs to be started. The following communication steps are recommended:

Convene the relevant personnel

Describe the current situation of the fault

Explain the normal application logic flow

State any recent changes

Report troubleshooting progress and present the information collected

Leadership makes the decision

2. Improve monitoring

1) Improve monitoring visualization.

A sound monitoring strategy requires a unified visual operation interface. Once the strategy is in place, fault handlers need to be able to see the relevant operating data quickly, for example the trend over a period of time, how the data behaved during the fault, and performance analysis. These views can be defined in advance so that the analysis results are presented directly to fault handlers, which greatly improves the efficiency of fault handling. Taking the call center system as an example, the following real-time transaction data needs to be configured in advance in order to locate faults:

- Transaction performance data: average transaction time, transaction time of internal modules (IVR transaction time, interface bus transaction time), transaction time of related systems (core transaction time, work order system transaction time, etc.)

- Important transaction indicators: transaction volume, IVR transaction volume, telephone traffic, agent call rate, number of core transactions, transaction volume of the work order system and other systems

- Transaction exception data: transaction success rate, failure rate, transactions with the most error codes

- Transaction data by server: for each server, the number of transactions and the total transaction time consumed by each service

With the above transaction data collected and aggregated at a regular frequency, operators can see with a mouse click when the fault began, whether the problem lies inside the system or in a related system, which transaction stands out most, whether the transaction volume is balanced across servers, and so on.
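The per-server statistics described above can be computed with a small aggregation. The sketch below assumes each transaction record is a `(server, elapsed_ms, ok)` tuple. The record layout and the server names are hypothetical.

```python
from collections import defaultdict


def stats_by_server(transactions):
    """Aggregate (server, elapsed_ms, ok) records into per-server
    transaction count, average time and success rate."""
    acc = defaultdict(lambda: {"count": 0, "total_ms": 0, "failed": 0})
    for server, elapsed_ms, ok in transactions:
        row = acc[server]
        row["count"] += 1
        row["total_ms"] += elapsed_ms
        if not ok:
            row["failed"] += 1

    report = {}
    for server, row in acc.items():
        report[server] = {
            "count": row["count"],
            "avg_ms": row["total_ms"] / row["count"],
            "success_rate": 1 - row["failed"] / row["count"],
        }
    return report


records = [
    ("app-a", 100, True),
    ("app-a", 300, True),
    ("app-b", 2000, False),
    ("app-b", 1000, True),
]
for server, row in stats_by_server(records).items():
    print(server, row)
```

Putting the rows side by side shows at a glance whether one server is slower or failing more often than the others.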

2) Improve monitoring coverage.

The most basic job of monitoring is comprehensive monitoring and management of IT resources such as load balancing equipment, network equipment, servers, storage devices, security equipment, databases, middleware and application software. For application software, we need not only service process and port monitoring, but also monitoring at the business and transaction layers.

Comprehensive application monitoring can give early warning of faults and preserve data about the application's running environment, which shortens fault handling time.

3) Improve monitoring alarms.

A sound monitoring strategy requires clear alarms, so that the personnel on duty can do simple problem location and choose an emergency measure based on the alarm alone. For example, an alarm message similar to the following:

22:00, the [application port: 9080] does not exist in the [front application module] of [application server LC_APPsvrA 10.2.111.111] in the [financial management application system]. This port is used for [providing financial application processing (load-balanced deployment)]. The possible cause is [SERVER1 service stopped abnormally]. The monitoring system has taken the following emergency action: [automatically started the process for the port]. The urgency of the event is [high].

From the SMS content, the administrator can see which system, which application and which module has gone wrong, what the likely cause is, what impact it has on the business, and whether it needs to be dealt with immediately (for example, whether an early-morning alarm can wait until the next day).
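An alarm like the one above is easier to keep consistent when it is built from structured fields rather than free text. The following Python sketch shows one possible shape. The `Alarm` field names and the message layout are assumptions for illustration, not the format of any specific monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    time: str
    system: str
    server: str
    module: str
    symptom: str
    role: str             # what the failing component does for the business
    probable_cause: str
    auto_action: str      # what the monitoring system has already done
    urgency: str          # e.g. "high", "medium", "low"


def render_alarm(a: Alarm) -> str:
    """Render an alarm as one short message suitable for SMS."""
    return (
        f"{a.time} [{a.system}] [{a.server}] [{a.module}]: {a.symptom}. "
        f"Role: {a.role}. Probable cause: {a.probable_cause}. "
        f"Action taken: {a.auto_action}. Urgency: {a.urgency}."
    )


alarm = Alarm(
    time="22:00",
    system="financial management application system",
    server="LC_APPsvrA 10.2.111.111",
    module="front application module",
    symptom="application port 9080 does not exist",
    role="providing financial application processing",
    probable_cause="SERVER1 service stopped abnormally",
    auto_action="automatically started the process for the port",
    urgency="high",
)
print(render_alarm(alarm))
```

Because every field is mandatory, an alarm cannot be sent without saying what failed, why it probably failed and how urgent it is.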

4) Improve monitoring analysis.

A sound monitoring strategy requires not only real-time data alarms but also analytical alarms on aggregated data. The importance of real-time alarms goes without saying, while alarms on aggregated and analyzed data can uncover potential risks and also help in analyzing difficult and complicated problems.
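As one hedged example of an analytical alarm on aggregated data, the sketch below compares the mean of the most recent days with the mean of the earlier days. A slow upward drift in average transaction time may never cross a real-time threshold, but it shows up here. The window size and the 1.5x threshold are arbitrary assumptions.

```python
from statistics import mean


def trend_alarm(daily_values, recent_days=3, threshold=1.5):
    """Return True when the mean of the last `recent_days` values is at
    least `threshold` times the mean of the earlier values."""
    if len(daily_values) <= recent_days:
        return False  # not enough history to form a baseline
    baseline = mean(daily_values[:-recent_days])
    recent = mean(daily_values[-recent_days:])
    return baseline > 0 and recent >= threshold * baseline


# Hypothetical daily average transaction times in milliseconds
steady = [200, 210, 190, 200, 205, 195, 200]
drifting = [200, 210, 190, 200, 320, 340, 360]
print(trend_alarm(steady), trend_alarm(drifting))
```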

5) Make monitoring proactive.

Monitoring is not only about raising alarms. It can do more: as long as we give it rules for actively resolving events, it can handle faults on behalf of administrators.
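The idea of giving monitoring rules to act on can be sketched as a list of (condition, action) pairs, where the first matching rule decides what happens. Everything below is hypothetical: the alarm fields, the rule conditions and the handlers, which only return a description instead of really restarting anything.

```python
def restart_service(alarm):
    # Placeholder: a real handler would invoke the service's restart procedure.
    return f"restart {alarm['service']} on {alarm['server']}"


def notify_on_call(alarm):
    # Placeholder: a real handler would page the person on duty.
    return f"notify on-call about {alarm['type']} on {alarm['server']}"


# Each rule is (predicate over the alarm, handler). The first match wins.
RULES = [
    (lambda a: a["type"] == "port_down" and a.get("auto_restart_allowed"),
     restart_service),
    (lambda a: True, notify_on_call),  # fallback: hand the event to a human
]


def handle(alarm, rules=RULES):
    """Apply the first matching rule and return a description of the action."""
    for matches, action in rules:
        if matches(alarm):
            return action(alarm)
    return None


print(handle({"type": "port_down", "service": "SERVER1",
              "server": "LC_APPsvrA", "auto_restart_allowed": True}))
print(handle({"type": "disk_full", "server": "LC_APPsvrA"}))
```

Keeping a human fallback as the last rule means an event that no automatic rule covers still reaches someone.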

3. Emergency plan

It is necessary to make an emergency plan in advance, but in daily work we run into some problems:

1) The plan lacks continuous maintenance and drills, so its information is out of date and inaccurate.

2) The plan is too large and all-encompassing, which makes it hard to read and use.

3) The plan is more form than substance and is not targeted enough.

4) Attention goes only to the content of the plan, not to whether the operators understand it.

In view of these common problems, the emergency plan needs to do the following:

1) Simplify the content

Many people think that because faults occur in many forms, the emergency plan needs to cover everything. In actual fault handling, however, our emergency measures usually repeat a few commonly used steps. So I think the emergency plan should be focused: if it can deal with 80% of normal fault handling scenarios, the emergency manual is adequate. Pursuing coverage of every aspect of the application system makes the plan hard to read and eventually turns it into a document that exists only to pass inspections. The following is what I think the emergency plan of an application system should contain:

(1) System level:

Know the role the application system plays in the whole transaction flow. When there is a problem in the system itself or in its upstream or downstream systems, know how to cooperate with them in the analysis, for example how the upstream and downstream systems communicate and whether there are unique keywords.

In addition, the system level also covers some basic emergency operations, such as capacity expansion and adjustment of system and network parameters.

(2) Service level:

Know which business the service affects, where the logs, programs and configuration files of the service are, how to check whether the service is normal, how to restart the service, how to adjust application-level parameters, and so on.

(3) Transaction level:

Know how to investigate problems with a particular branch or a particular type of transaction, and whether the problem is widespread, local or occasional. Be able to describe the impact on transactions with data and to locate the error information of a transaction. The most common method here is to use database queries or tools.

Know how to check whether the most important transactions are normal, and know the emergency measures for important scheduled tasks, such as opening, date change and reconciliation, including their time requirements.
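A prepared database query is often the quickest way to describe transaction impact with data. The sketch below uses Python's built-in `sqlite3` module and a made-up `txn_log` table (columns `branch`, `txn_time`, `status`) to compute the failure rate per branch since a given time. A real system will have its own schema and database.

```python
import sqlite3


def failure_rate_by_branch(conn, since):
    """Return [(branch, total, failed, failure_rate)] for transactions at or
    after `since`, with the worst-affected branch first."""
    sql = """
        SELECT branch,
               COUNT(*) AS total,
               SUM(CASE WHEN status != 'OK' THEN 1 ELSE 0 END) AS failed
        FROM txn_log
        WHERE txn_time >= ?
        GROUP BY branch
    """
    rows = [(b, t, f, f / t) for b, t, f in conn.execute(sql, (since,))]
    return sorted(rows, key=lambda r: r[3], reverse=True)


# Demo data in an in-memory database (hypothetical schema and values)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn_log (branch TEXT, txn_time TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO txn_log VALUES (?, ?, ?)",
    [
        ("B01", "2025-04-15 10:00:00", "OK"),
        ("B01", "2025-04-15 10:01:00", "TIMEOUT"),
        ("B02", "2025-04-15 10:02:00", "OK"),
        ("B02", "2025-04-15 09:00:00", "TIMEOUT"),
    ],
)
print(failure_rate_by_branch(conn, "2025-04-15 09:30:00"))
```

If one branch dominates the result, the problem is local. If all branches show similar rates, it is more likely widespread.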

(4) Use of auxiliary tools:

Sometimes tools, including automation tools, are needed to assist in analysis and emergency response, so the plan should describe how to use them.

(5) Communication plan:

The communication plan covers the contact list, including upstream and downstream systems, third-party organizations, business departments and other channels.

(6) Other:

If the above five points are complete, I believe this emergency manual can handle 80% of fault recovery work.

2) Emergency planning is an ongoing task

Once the emergency plan exists, getting operation and maintenance personnel to keep it continuously updated is the difficult part. I think the way to solve this is to have them use the manual frequently. If there is no natural occasion to use the manual, managers need to create one, for example through emergency drills.
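To make "use the manual frequently" checkable, a team could record when each plan was last drilled or reviewed and list the ones that are overdue. This is a minimal sketch. The plan names and the 90-day limit are assumptions.

```python
from datetime import date


def stale_plans(last_exercised, today, max_age_days=90):
    """Return the names of plans not drilled or reviewed within
    max_age_days, oldest first."""
    overdue = [
        (name, when) for name, when in last_exercised.items()
        if (today - when).days > max_age_days
    ]
    return [name for name, _ in sorted(overdue, key=lambda item: item[1])]


# Hypothetical record of the last drill date per emergency plan
last_drill = {
    "call-center": date(2025, 1, 2),
    "work-order": date(2025, 4, 1),
    "core": date(2024, 11, 20),
}
print(stale_plans(last_drill, today=date(2025, 4, 15)))
```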

3) Pay attention to the operation and maintenance staff's understanding of key application information

The first two points are about the manual. The last point, which I think is necessary, is about the people who use it. Some people think that application operation and maintenance personnel are not able to understand the application system thoroughly. This leaves them in an awkward position during fault handling: they have the authority to operate, but they don't know what to do.

In this regard, I agree that application operation and maintenance personnel do not need to master the business functions of the application system, but I think they need the following basic knowledge of the application system itself:

(1) Know what the application system does and what its basic business is.

(2) Know the deployment architecture of the application and the logical relationship between upstream and downstream systems.

(3) Know the services under the application: their functions, ports, service-level emergency handling and logs, and how to find and do simple problem location with this information.

(4) Know the important time points and tasks of the application system, such as opening, closing, date change and scheduled tasks, and how to judge whether these tasks ran correctly.

(5) Know the flow of the most important transactions.

(6) Know the common database table structures and be able to use them.

4. Intelligent event handling

Intelligent event handling requires several modules to work together, including monitoring, a rule engine, configuration tools, a CMDB, an application configuration library and others.
