On Quantifiable Data Center Monitoring Services and Operation Methods

2025-01-19 Update From: SLTechnology News&Howtos shulou


Shulou(Shulou.com)06/01 Report--


After more than a decade of construction and development, operation and maintenance (O&M) management methods have become fairly mature in both old and new data centers, and O&M management tools have become a necessity. Data center O&M comes with many theories, schemes, methods and tools. This article focuses on the problems that active monitoring tools face and how to address them.

The first problem a monitoring system faces is too many alarms, which leads users to regard the system as unreliable, even though the users configured these alarms themselves and often do not realize it. The second problem is how to use the monitoring system and how to assess the on-duty team, so that both the tools and the people are put to their best use. The third problem is how to quantify the monitoring service and demonstrate its value.

Based on my previous project experience, the two main causes of excessive alarms are too many monitoring policies and an overly fine-grained monitoring scope. The solution is mainly to optimize alarms by configuring policies selectively and by limiting repeated alarms, which raises the accuracy of serious alarms to about 80%. Warning-level information remains noisy, however, because thresholds cannot always be set to an appropriate value and the frequent traps in the network cannot be completely suppressed (a trap is a notification triggered by a network device or an operating system). For most products, the reception of invalid traps can be restricted through filtering policies. All of these measures require long-term system maintenance.

The assessment of a monitoring system mainly depends on indicators such as system functionality, coverage of equipment types, monitoring frequency granularity and stability. Fault accuracy is of course a very important indicator as well. But if we regard the tool as belonging to the O&M team itself, defining this indicator for the tool alone has little meaning; this becomes easier to see in the later discussion of continuous tool optimization. Accuracy is tied to the attitude and ability of the O&M team. Across the many projects I have worked on, the importance the O&M team attaches to the monitoring tools directly affects this figure.

The industry has no clear agreement on how to assess a monitoring team; what exists is mainly a summary of long-term O&M experience. It is generally accepted that the main indicators for assessing a monitoring service are response time and the number of alarms, which are mainly used to account for workload and cost, with the alarm count serving as the base for service accounting. However, equipment load, operating time, environment and business systems vary greatly between production environments, so the number and timing of failures are uncertain. For example, a lightly loaded network built from high-end Cisco switches may run without problems all year, whereas a network built years ago on aging equipment fails relatively often.

The assessment indicators for a monitoring service are mainly defined as the false alarm rate, the missed alarm rate and the reporting rate (within 15 minutes). The first two assess the team's ability to operate the monitoring system, as described under alarm quality in the Alarm FAQ below. The O&M team cannot rest easy just because a monitoring system is in place; it must keep optimizing and improving the system, and re-tune monitoring after every network or business system change. The third indicator assesses the team's execution: every alarm must be analyzed and reported in time. Together, these indicators assess the team along two dimensions, work attitude and ability.
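The three indicators can be computed from a simple review log. The sketch below is illustrative only: it reads the second indicator as the missed alarm (underreporting) rate, which is an assumption, and the `Incident` record layout and the choice of denominators are hypothetical rather than taken from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    """One reviewed event in the assessment period (hypothetical layout)."""
    alarm_raised: bool                         # did monitoring raise an alarm?
    real_fault: bool                           # was there an actual fault?
    minutes_to_report: Optional[float] = None  # None if it was never reported


def assessment_rates(incidents, report_limit_minutes=15):
    """Return false alarm rate, missed alarm rate and reporting rate."""
    alarms = [i for i in incidents if i.alarm_raised]
    faults = [i for i in incidents if i.real_fault]

    false_alarms = [i for i in alarms if not i.real_fault]
    missed = [i for i in faults if not i.alarm_raised]
    reported_in_time = [
        i for i in alarms
        if i.minutes_to_report is not None
        and i.minutes_to_report <= report_limit_minutes
    ]

    def ratio(part, whole):
        # Avoid division by zero when a period has no alarms or no faults.
        return len(part) / len(whole) if whole else 0.0

    return {
        "false_alarm_rate": ratio(false_alarms, alarms),    # share of alarms with no fault
        "missed_alarm_rate": ratio(missed, faults),         # share of faults with no alarm
        "reporting_rate": ratio(reported_in_time, alarms),  # share of alarms reported in time
    }
```

Here the false alarm rate and reporting rate are taken over all alarms, and the missed alarm rate over all real faults; a team may reasonably choose other denominators.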

The value of a monitoring service is determined mainly by calculating its cost, a common way of quantifying modern services. Whether you are Party A (the customer) or Party B (the provider), letting the numbers speak is generally accepted. Taking the alarm volume described above as the base for cost accounting, a weighted calculation is then applied according to the severity level of each alarm and the level of the affected business system; alarms of the same severity level have different values for business systems of different levels. Adding response time to these indicators is enough to calculate the value of the service. The formula (which needs the support of a CMDB) is:

M = P * (W1*A1*B1*R1 + W2*A2*B2*R2 + ... + Wn*An*Bn*Rn) + basic service price (verification of false positives, inspections, etc.)

The basic service price is the number of network elements * unit price. A network element is the smallest unit that can be monitored and managed in network management, including software, hardware, applications and other services. The basic price covers regular alarm monitoring and performance reports.

Calculated along the two dimensions above, the formula mainly provides incentives tied to the attitude and ability of the service team.

Abbreviation     Description
M (money)        Service value
W (work)         Warning item
A (alert)        Alarm level
B (business)     Business system level
R (response)     Response time
P (price)        Basic price

For example:

Alarm level      Weight   Business system level    Weight   Response time   Weight
Serious alarm    1.5      XX production system     1.5      5 minutes       1.5
Advanced alarm   1.2      OA system                1.2      10 minutes      1.2
Primary alarm    1.0      Company portal system    1.0      15 minutes      1.0
Warning          1.0      XX test system           1.0      30 minutes      0.9
Primary alarm    0.8      Internal forum           0.8      60 minutes      -1
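As a worked illustration of the formula, the sketch below applies the example weights to a few alarms. Several things here are assumptions rather than part of the original method: W is treated as the number of occurrences of a warning item, P as a unit price per weighted alarm, each response time in the example as the upper bound of a bucket (anything slower than 30 minutes gets -1), and all prices and dictionary keys are hypothetical.

```python
# Example weights taken from the table above (keys are hypothetical labels).
ALARM_WEIGHT = {"serious": 1.5, "advanced": 1.2, "primary": 1.0, "warning": 1.0}
BUSINESS_WEIGHT = {"production": 1.5, "oa": 1.2, "portal": 1.0, "test": 1.0, "forum": 0.8}


def response_weight(minutes):
    """Map a response time to its weight R (bucket bounds are an assumption)."""
    for limit, weight in ((5, 1.5), (10, 1.2), (15, 1.0), (30, 0.9)):
        if minutes <= limit:
            return weight
    return -1.0  # slow responses count against the service value


def service_value(alarms, unit_price, basic_service_price):
    """M = P * sum(W * A * B * R) + basic service price."""
    weighted_sum = sum(
        count
        * ALARM_WEIGHT[level]
        * BUSINESS_WEIGHT[system]
        * response_weight(minutes)
        for count, level, system, minutes in alarms
    )
    return unit_price * weighted_sum + basic_service_price


# (W, alarm level, business system, response time in minutes)
alarms = [
    (1, "serious", "production", 4),   # 1 * 1.5 * 1.5 * 1.5 = 3.375
    (1, "primary", "portal", 12),      # 1 * 1.0 * 1.0 * 1.0 = 1.0
    (1, "warning", "test", 45),        # 1 * 1.0 * 1.0 * -1  = -1.0
]

# Basic service price = number of network elements * unit price (made-up figures).
basic = 50 * 10
value = service_value(alarms, unit_price=100, basic_service_price=basic)
```

With these made-up figures the weighted sum is 3.375, so the value is 100 * 3.375 + 500 = 837.5. The slow response to the test-system warning cancels out the value of one normally handled alarm, which is how the formula rewards timely reporting.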

Among the domestic Internet companies I know of, data center O&M is relatively mature, and O&M is assessed mainly along five dimensions: response time, preparation (prevention mechanisms), attitude and ability, handling results, and follow-up measures. The first three relate to monitoring: timely reporting reflects response time, while continuous optimization of monitoring tools, inspections and drills reflects the degree of preparation and ability.

Alarm FAQ

1. Monitoring has limitations and blind spots. Mitigation: establish monitoring policies at the network, system and application layers to eliminate blind spots as far as possible. This prevents missed alarms.

2. Alarm delay. Between being generated and being received, an alarm passes through the alarm conversion interface and the email or SMS interface, where queuing and blocking are likely. Mitigation: broaden the channels and reduce congestion, for example by sending serious alarms as text messages and other warnings by e-mail or on a display page. This prevents missed alarms.

3. Alarm quality. Monitoring policies and alarm quality keep being improved throughout operation and maintenance. Mitigation: the core idea is to treat monitoring as an ongoing operation. Alarm classification, alarm models and alarm policies should be planned globally, and the number and distribution of alarms should be continuously optimized according to the business and the people involved. This prevents false alarms.
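The channel split suggested for alarm delay can be sketched as a small routing table. This is only an illustration: the channel names, the severity labels and the alarm dictionary layout are hypothetical, and a real system would hand each queue to its own sender so that a slow e-mail gateway cannot hold up text messages.

```python
from collections import defaultdict, deque

# Serious alarms go to SMS; everything else goes to e-mail and the display page.
CHANNELS_BY_SEVERITY = {"serious": ("sms",)}
DEFAULT_CHANNELS = ("email", "dashboard")


def route(alarms):
    """Place each alarm message on one queue per delivery channel."""
    queues = defaultdict(deque)
    for alarm in alarms:
        channels = CHANNELS_BY_SEVERITY.get(alarm["severity"], DEFAULT_CHANNELS)
        for channel in channels:
            queues[channel].append(alarm["message"])
    return queues
```

Keeping one queue per channel means the low-volume SMS queue stays short even when many warning-level messages are waiting on e-mail.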

Alarm model

1. Alarm classification. Classification makes it easier to build the alarm model and to summarize, analyze and locate faults. The most important thing is to have a complete, systematic fault detection and alarm response mechanism.

2. Alarm model. A preprocessor applies defined rules, such as a threshold or a combination of multi-dimensional conditions. For example, generating an alarm only when a threshold is exceeded several times in a row avoids alarms caused by momentary performance spikes.
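A minimal sketch of the consecutive-threshold rule described in the alarm model, assuming samples arrive one at a time. The class name and parameters are hypothetical and not taken from any specific monitoring tool.

```python
class ConsecutiveThresholdRule:
    """Fire only after the threshold is exceeded N samples in a row."""

    def __init__(self, threshold, required_breaches):
        self.threshold = threshold
        self.required_breaches = required_breaches
        self._streak = 0  # consecutive samples above the threshold so far

    def observe(self, value):
        """Feed one sample; return True when an alarm should be raised."""
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # any normal sample resets the streak
        # Fire exactly once, at the moment the streak reaches the limit,
        # so a long-running breach does not raise an alarm on every sample.
        return self._streak == self.required_breaches


rule = ConsecutiveThresholdRule(threshold=90, required_breaches=3)
cpu_samples = [95, 50, 91, 92, 93, 94]
fired = [rule.observe(sample) for sample in cpu_samples]
```

In this run the single spike to 95 is ignored, and the alarm fires only on the third consecutive high sample (93).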

Alarm optimization

1. Converge alarms by frequency: design the alarm policy around how often and how many times an alarm occurs.

2. Merge alarms by responsible person, equipment type or time.

3. Correlate alarms so that related modules do not raise duplicate alarms. (This function is available in the self-developed systems of some Internet companies.)

4. Alarm analysis: continuously analyze, track and optimize alarms during operation to keep the number of alarms within a reasonable range.
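The first two optimization steps can be sketched as follows, assuming each alarm is a dictionary with `time` (seconds), `device`, `check` and `owner` fields. These field names and the 300-second window are hypothetical choices for illustration.

```python
from collections import defaultdict


def suppress_repeats(alarms, window_seconds=300):
    """Converge by frequency: drop an alarm if the same (device, check)
    pair was already let through within the window."""
    last_sent = {}
    kept = []
    for alarm in sorted(alarms, key=lambda a: a["time"]):
        key = (alarm["device"], alarm["check"])
        previous = last_sent.get(key)
        if previous is None or alarm["time"] - previous >= window_seconds:
            kept.append(alarm)
            last_sent[key] = alarm["time"]
    return kept


def merge_by_owner(alarms):
    """Merge alarms so each responsible person gets one combined notice."""
    grouped = defaultdict(list)
    for alarm in alarms:
        grouped[alarm["owner"]].append(alarm)
    return dict(grouped)
```

The same grouping function works for equipment type or time bucket by changing the key. Alarm correlation between related modules needs topology or dependency data (for example from a CMDB) and is not covered by this sketch.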
