How to understand the whole SRE operation and maintenance system

2026-02-12 Update. From: SLTechnology News&Howtos (shulou)

Shulou(Shulou.com)06/01 Report--

This article explains how to understand the SRE operations and maintenance system as a whole, a topic that many people find hard to grasp. The summary below is meant to make it clearer, and I hope you gain something from it.

Building the SRE operations and maintenance system and dividing job responsibilities

Observability system

In any enterprise of a certain scale, once the SRE operations model is adopted, building the observability system becomes especially important. An observability system is usually divided into the following three areas:

Metric monitoring: monitoring of various metrics, such as basic resource metrics, service performance metrics, and business call metrics.

Logs: monitoring the operation logs of devices and services.

Call chain: business-level call chain analysis, which usually helps operations staff and developers quickly identify bottlenecks in the overall call path of a distributed system.

A complete observability system gives you insight into the system and lets you track its health, its availability, and what is happening inside it.

When building the observability system, we need to pay attention to the following two points:

Determine what the quality standard is, and ensure that the system keeps approaching it or stays within its limits.

Attend to this work systematically, instead of just checking the system at random.
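
One way to make a quality standard checkable is to express it as a target ratio and compare measurements against it on a schedule. The sketch below assumes the standard is a minimum request success ratio. The names `QualityStandard`, `success_ratio`, and `within_standard`, and the 99.9% figure in the usage note, are hypothetical and are not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityStandard:
    """A hypothetical quality standard: a minimum success ratio for a service."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def success_ratio(good_requests: int, total_requests: int) -> float:
    """Measured indicator: the fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the standard
    return good_requests / total_requests


def within_standard(standard: QualityStandard, good: int, total: int) -> bool:
    """True when the measured ratio is at or above the target."""
    return success_ratio(good, total) >= standard.target
```

For example, `within_standard(QualityStandard("checkout", 0.999), 9995, 10000)` evaluates to `True`, while 9,980 good requests out of 10,000 falls below the same target.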

I think an enterprise-level observability system should include at least the following features:

Complete metric collection: it can connect to the monitoring metrics of most devices and technology stacks in the enterprise. It also provides ready-made metric sets for common devices, so that new devices and metrics can be onboarded quickly instead of building the monitoring of every device from scratch. It supports log data collection as well.

Massive device support: enterprise IT systems keep growing in number and scale, so the monitoring system needs to monitor far more devices than before.

Monitoring data storage and analysis: monitoring data is the basis of operations analysis, operations automation, and intelligent operations, so storing massive monitoring data and analyzing it visually are basic capabilities of a monitoring system.

The observability system is the foundation of the whole operations and maintenance system, and it needs to provide data support for all of it.

Therefore, an enterprise-level observability system should be a platform. On the one hand, it can take in more operations metrics through configuration or development. On the other hand, it can connect to more specialized operations tools, integrate data from multiple sources, and provide data services for more operations scenarios. Overall, the observability system provides the data foundation for enterprise operations, letting us base more of our decisions about incident response and capacity prediction on data rather than on past experience and gut feeling.

Fault response

If something goes wrong, how do we alert people and respond? Tools can help solve this problem, and we can define rules that alert humans.

Fault response builds on the data produced by the observability system, and its feedback loop helps us strengthen the monitoring of services.

The fault response usually consists of the following actions:

Attention: whether we discover bottlenecks or outliers proactively, or the observability system exposes them to us, we should actively follow up on them.

Communication: promptly notify the relevant parties of the observed risk points, and tell them the scope of impact and the proposed remedial measures.

Recovery: once the parties involved have reached agreement, fix the risk points and outliers according to the remedial measures.

Note that when the observability system is built well from the start, a fault usually first shows up as a simple alarm message or an alarm phone call. An observability system without alerting, however good it is, can only support tracing and troubleshooting after the fact; it cannot surface problems in time. We therefore need to evaluate alarm conditions against the observed data and notify the right people promptly, so that risk points are exposed.

Alerting is only the first step of fault response: it solves the problem of how to find the fault. Most fault response work is about defining handling strategies and providing training so that people know what to do when they receive an alert. This part is usually a distillation of past incidents and operations experience, including abstracting that experience and turning it into tools, so that fault response is efficient and generalizable (that is, it does not depend on any one person's experience).

The alarm system must keep its alarms effective; otherwise it is likely to degrade into a garbage data generator. Effective alarms meet the following two requirements:

Alarm timeliness: if something is wrong with the system, the alarm must reach the operations staff in time for them to handle it.

Alarm accuracy: whenever an alarm fires, there should be a real problem behind it. (Many enterprises have large numbers of useless alarms, for example about disk or memory; how many depends on factors such as the degree of automation, the nature of the business, and the alarm thresholds.)

In day-to-day operations we often find large numbers of irrelevant alarms, which leave operations staff lost in a sea of alerts, while managers outside operations usually look at how well alarms are being responded to. Suppressing and eliminating invalid alarms, so that operations staff are not swamped by alarm storms, is therefore a key part of alarm management.

Usually, once the observability system is built, we can take the monitoring data integrated into the monitoring platform and apply techniques such as trend prediction, short-period detection, intermittent recovery, baseline judgment, and repeat compression to compress and converge alarms and make them more effective.
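
Of the techniques listed, repeat compression is the simplest to illustrate. The following is a minimal sketch, assuming an alarm is identified by its source and rule name and that repeats inside a fixed time window should be suppressed. The `Alarm` type, the `compress_repeats` function, and the window length are hypothetical choices for illustration, not part of any specific monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    timestamp: float  # seconds since an arbitrary epoch
    source: str       # host or service that raised the alarm
    rule: str         # name of the alarm rule that fired


def compress_repeats(alarms: list[Alarm], window_seconds: float) -> list[Alarm]:
    """Keep an alarm only if the same (source, rule) pair was not already
    kept within the previous `window_seconds`."""
    last_kept: dict[tuple[str, str], float] = {}
    kept: list[Alarm] = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.source, alarm.rule)
        previous = last_kept.get(key)
        if previous is None or alarm.timestamp - previous >= window_seconds:
            kept.append(alarm)
            last_kept[key] = alarm.timestamp
    return kept
```

With a 300-second window, four alarms at 0 s, 30 s, and 400 s from `db1` and at 40 s from `web1`, all for the same rule, are reduced to three: the 30-second repeat from `db1` is dropped.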

For front-line operations staff, we also need to model and analyze multiple monitoring metrics of the same system or device together and summarize them as a health score. This gives front-line staff a tiered, health-based evaluation of the system that reflects its running state truthfully and intuitively and helps them quickly localize a problem.

For example, the overall utilization of a basic resource is evaluated by a weighted calculation over several of its metrics. Then, from the utilization of all resources associated with an application, together with a model of the application's operations architecture, a score is calculated to evaluate the health of the application as a whole.
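
The weighted calculation can be sketched as follows. This is only an illustration under stated assumptions: each metric is a utilization between 0.0 and 1.0, the score is 100 minus the weighted average load, and the application score is taken as the minimum over its resources. The function names, the weights, and the roll-up rule are all hypothetical; a real system would choose its own model.

```python
def health_score(utilization: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted health score in the range 0..100 for one resource.

    `utilization` maps a metric name to a value from 0.0 (idle) to 1.0 (saturated).
    `weights` maps the same metric names to non-negative weights.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Clamp each metric into [0, 1] before weighting it.
    weighted_load = sum(
        weights[name] * min(max(utilization[name], 0.0), 1.0) for name in weights
    )
    return round(100.0 * (1.0 - weighted_load / total_weight), 1)


def application_health(resource_scores: list[float]) -> float:
    """One possible roll-up: an application is as healthy as its weakest resource."""
    return min(resource_scores)
```

Taking the minimum is a deliberately pessimistic roll-up; a weighted average over resources would be an equally valid choice when no single resource is critical.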

If this process is mature enough, alarms can be tied to existing internal solutions to close the loop. In a simple scenario, when a disk is full, the alarm first triggers a standardized disk check and deletes data that is known to be discardable. If that does not resolve the alarm, front-line operations staff are contacted directly for manual intervention, and the experience is then folded back into the standard procedure.
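
The disk scenario can be sketched as a small closed loop. The sketch assumes the caller supplies three functions: one that reads disk usage, one that performs the standardized cleanup, and one that contacts a human. The name `handle_disk_alarm`, its parameters, and the 90% threshold are hypothetical.

```python
from typing import Callable


def handle_disk_alarm(
    disk_usage: Callable[[], float],
    cleanup: Callable[[], None],
    escalate: Callable[[str], None],
    threshold: float = 0.9,
) -> str:
    """Try the standardized cleanup first; escalate only if usage stays high.

    Returns "ok", "auto_resolved", or "escalated".
    """
    if disk_usage() < threshold:
        return "ok"
    cleanup()  # standardized step: remove data known to be discardable
    if disk_usage() < threshold:
        return "auto_resolved"
    escalate("disk usage still above threshold after automated cleanup")
    return "escalated"
```

Passing the three actions in as functions keeps the decision logic testable without touching a real disk, and lets the same loop be reused for other alarm types.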

Fault review

Fault review means reviewing and summarizing past service exceptions and outages to ensure that the same problem does not occur again. To keep the team united and working together, we want to build a culture of blameless, transparent postmortems. Individuals should not be afraid of incidents; they should be confident that if an incident happens, the team will respond and improve the system.

Remarks: in domestic SRE culture, usually only large incidents with a significant business impact are reviewed. If time and energy permit, ordinary incidents should also be reviewed on a small scale, because big failures accumulate from a steady stream of minor problems. Individual operations engineers should also review minor failures promptly, in order to keep strengthening their own fault handling and repair skills.

I think a key consensus in SRE is to acknowledge that systems are imperfect; pursuing a system that never goes down is unrealistic. Given an imperfect system, we will inevitably face and experience faults and outages.

So what matters is not finding the person responsible for a failure, but thoroughly reviewing its root cause and working out how to avoid the same failure again. System reliability is a direction the whole team works toward together: recover quickly from failure and learn from it, so that everyone feels safe raising problems, handling downtime, and working to improve the system.

Note: in fault reviews at many enterprises, the people involved may, perhaps without meaning to, turn root cause analysis into assigning blame and a series of punitive measures, and try to force down the number of faults through discipline. This approach is very undesirable. Nobody wants an incident to happen; incidents come from gaps in knowledge or flaws in the rules. No one creates a fault knowing that it will be a fault.

What we need to keep in mind is that failure is something we can learn from, not something to be afraid of and ashamed of!

In daily operations, incidents such as failures are actually a good opportunity to review and learn. We use historical monitoring data to analyze the root cause of an incident, work out follow-up strategies, and use the operations platform to turn these strategies into standardized, reusable, automated operations scenarios that provide a standard, fast solution the next time the same problem occurs. This is the real value of the postmortem process.

Test and release

Testing and release are primarily preventive measures for overall stability and reliability: they try to limit the number of incidents and ensure that infrastructure and services remain stable when new code is released.

For anyone who has worked in operations for a long time, perhaps the biggest fear is the release of a new application version. Apart from hardware and network equipment failures, which are chance events on the level of natural disasters, the day after a new application version is released is usually the period with the highest risk of downtime and incidents. For this reason, some large products impose a change freeze on the eve of holidays and important events, to avoid business bugs caused by launching a new version.

Testing is about finding the right balance between cost and risk. If you take too much risk, you will be worn out dealing with system failures; if you are too conservative, you will not release new things fast enough for the company to survive in the market.

When the error budget is large (that is, failures have caused relatively little downtime over a period of time), we can reduce testing resources somewhat and relax the tests and conditions for going live, so that more features reach production and the business stays agile. When the error budget is small (that is, failures have caused relatively more downtime over the period), we need to increase testing resources and tighten pre-release testing, so that potential risks are surfaced and removed more effectively and downtime is avoided, keeping the system stable. Balancing agility against stability is a responsibility shared by the whole operations and development team.
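
The trade-off can be made concrete with a little arithmetic. The sketch below assumes the error budget is derived from an availability target over a fixed period, which is a common way to define it but is not spelled out in the text above. The function names and the 50% cut-over point in `release_posture` are hypothetical.

```python
def error_budget_minutes(availability_target: float, period_minutes: float) -> float:
    """Downtime the target tolerates over the period, e.g. 0.999 over 30 days."""
    if not 0.0 < availability_target < 1.0:
        raise ValueError("availability_target must be strictly between 0 and 1")
    return (1.0 - availability_target) * period_minutes


def remaining_budget_fraction(
    availability_target: float, period_minutes: float, downtime_minutes: float
) -> float:
    """Share of the error budget still unspent; negative once it is overspent."""
    budget = error_budget_minutes(availability_target, period_minutes)
    return (budget - downtime_minutes) / budget


def release_posture(remaining_fraction: float, relax_above: float = 0.5) -> str:
    """Map the remaining budget to the two postures described in the text."""
    return "relax" if remaining_fraction > relax_above else "tighten"
```

With a 99.9% target over 30 days (43,200 minutes), the budget is about 43.2 minutes of downtime. After 10.8 minutes of downtime, roughly three quarters of the budget remains, which this sketch maps to the relaxed posture.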

Besides testing, application release is also a common responsibility of the operations team. One SRE principle is to turn everything repeatable into code and tools. In addition, the complexity of application release tends to grow with the complexity of the system. Large enterprises have therefore often already begun to build automated application release processes on top of an automation framework.

With automated release tools, we can build a pipeline that automates every operation in the deployment process (such as compiling and packaging, test release, production preparation, alarm suppression, service stop, database execution, application deployment, and service restart).
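
A pipeline of this kind can be modeled as an ordered list of named steps that stops at the first failure. This is a minimal sketch rather than a real release tool: the step names follow the list above, while `RELEASE_STEPS`, `run_pipeline`, and the idea of passing each step as a Python callable are assumptions made for illustration.

```python
from typing import Callable, Optional

# Step names taken from the deployment operations listed in the text.
RELEASE_STEPS = [
    "compile and package",
    "test release",
    "production preparation",
    "alarm suppression",
    "service stop",
    "database execution",
    "application deployment",
    "service restart",
]

Step = tuple[str, Callable[[], None]]


def run_pipeline(steps: list[Step]) -> tuple[list[str], Optional[str]]:
    """Run the steps in order and stop at the first one that raises.

    Returns the names of the completed steps and the name of the failed
    step, or None if every step succeeded.
    """
    completed: list[str] = []
    for name, action in steps:
        try:
            action()
        except Exception:
            return completed, name
        completed.append(name)
    return completed, None
```

Returning the failed step instead of raising lets the caller decide what to do next, for example rolling back the steps that already completed.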

Capacity planning

Capacity planning is about predicting the future and discovering system limits; it is also about ensuring that the system can be improved and enhanced over time.

The main goal of planning is to manage risks and expectations. For capacity planning, this involves scaling capacity along with the whole business. The expectation is how people expect the service to respond as they see the business grow; the risk is the time and money spent on additional infrastructure to cope with that growth.

First of all, capacity planning is predictive analysis and judgment about the future, and its predictions are based on massive operations data. So, besides an architecture and planning team, a comprehensive operations data center is a necessary facility for system capacity planning.

Capacity trend early warning and analysis comprehensively collects, organizes, cleans, and structurally stores operations data from monitoring, process management, and other data sources, integrating the data from the various tools and building a range of data topics.

These data topics are used to help operations staff evaluate questions including:

What is the current capacity?

When will the capacity limit be reached?

How should capacity be changed?

How should the capacity plan be carried out?
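
The second question, when the limit will be reached, can be approximated by fitting a trend to historical usage. The sketch below uses an ordinary least-squares line, which is only one simple forecasting choice and assumes roughly linear growth. The function names `fit_line` and `days_until_limit` and the sample figures in the usage note are hypothetical.

```python
from typing import Optional, Sequence


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two points and equal-length inputs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(
    days: Sequence[float], usage: Sequence[float], limit: float
) -> Optional[float]:
    """Days after the last sample until the fitted trend reaches `limit`.

    Returns None when usage is flat or shrinking, so no limit is in sight.
    """
    slope, intercept = fit_line(days, usage)
    if slope <= 0:
        return None
    crossing_day = (limit - intercept) / slope
    return max(0.0, crossing_day - days[-1])
```

For usage samples of 100, 110, 120, and 130 GB on days 0 to 3 and a 200 GB limit, the fitted trend grows by 10 GB per day and reaches the limit 7 days after the last sample.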

The operations platform needs to provide not only the necessary data support but also the necessary data visualization capabilities, so that operations staff can make better use of operations data to evaluate capacity.

First of all, the operations platform needs strong data retrieval capabilities. The platform stores a large amount of operations data. When operations staff try to set up and verify an exploratory scenario, they often retrieve and query specific data repeatedly. If queries on the data analysis platform are very slow, or the data can be queried from only a few angles, setting up the scenario takes a very long time or becomes impossible. The platform should therefore offer keyword search, statistical functions, single-condition, multi-condition, and fuzzy multi-dimensional search, and answer queries over massive data within seconds, so that operations staff can analyze data more easily.

Second, the platform needs strong data visualization capabilities. As the saying goes, "a picture is worth a thousand words". Operations staff often run statistical analyses on the operations data of each system and generate real-time reports; perform multi-dimensional, multi-angle, in-depth analysis, prediction, and visual display of all kinds of operations data (such as application logs, transaction logs, and system logs); and use the results to communicate their predictions and findings to others.

Automated tool development

SRE involves not only operations but also software development. Here that means developing tools and platforms for operations and for SRE itself. In Google's SRE model, SRE engineers spend about half of their time developing new tools and services; some of these automate manual tasks, while others continually fill gaps in and repair other systems within the SRE system.

By writing code, we free ourselves and others from repetitive work: if a task does not need a human, write code so that no human has to be involved.

SRE rejects repetitive work, moving from the original manual, passive response toward a more efficient, automated operations system.

Automated operation and maintenance framework:

Advantages and necessity of automated operation and maintenance tools:

Improved efficiency: letting programs perform operations automatically effectively reduces the human resources spent on operations and frees operations staff to invest their energy in more important areas.

Standardized operations: many complex, error-prone manual operations are brought behind a unified entry point and given a graphical interface (so-called white-screen operation), which makes them more manageable. It also reduces manual mistakes caused by an operator's mood and avoids tragedies such as "deleting the database and running away".

Passing on operations experience and skills: automation tools condense the experience accumulated by many operations teams into code, in the form of tools that perform operations automatically through a graphical interface. Newcomers to the team can inherit, reuse, and optimize them. This code-based handover turns individual capability into team capability and reduces the impact of staff turnover.

The automated operations system must be built around operations scenarios: the scenarios that are used most often in the enterprise and that are built up there through repeated iteration.

Common operations scenarios include software installation and deployment, application release and delivery, asset management, automatic alarm handling, fault analysis, resource requests, and automatic inspection. The automated operations system should therefore support configuring many different types of automated jobs, so that more scenarios can be implemented through simple script development, scenario configuration, and visually customized workflows.

User experience

The point of the user experience layer is that, for an SRE, the ultimate goal is to ensure the stability and availability of the business from the user's point of view. Operations staff in the traditional sense tend not to pay attention to this, because they usually consider only the stability of the underlying operations systems or resources. What SRE cares about is the stability of the entire business, and business stability and availability usually have to be simulated and measured from the user's point of view.
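
Measuring availability from the user's point of view can be sketched with simulated user requests (probes). The example assumes a probe counts as good only if it both succeeds and returns within a latency the user would tolerate. `ProbeResult`, `user_availability`, and the latency limit are hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    ok: bool           # did the simulated user request succeed?
    latency_ms: float  # how long the user would have waited


def user_availability(results: list[ProbeResult], max_latency_ms: float) -> float:
    """Fraction of probes a user would perceive as working:
    the request succeeded and was fast enough."""
    if not results:
        raise ValueError("no probe results to evaluate")
    good = sum(1 for r in results if r.ok and r.latency_ms <= max_latency_ms)
    return good / len(results)
```

Counting a slow success as a failure is the key design choice here: a server-side metric might report such a request as healthy, while the user experiences it as an outage.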

All the SRE work areas described above, whether monitoring, incident response, review, testing and release, capacity planning, or building automation tools, exist to give the system's users a better business experience. We therefore need to pay attention to the user experience of the system throughout operations work.

In actual operations work, we can often obtain business-related user experience information from application logs, monitoring data, business probes, and similar sources. In the operations data platform, we correlate and chain together this user experience monitoring data to reconstruct the user's end-to-end business call path and the relationship between each application hop and its performance data. Starting from business user experience data, we then link step by step to system status data and device status data, so that the operations system becomes centered on the end-user experience.

This user experience information plays an irreplaceable role in helping the operations team understand customers' overall experience, monitor system availability, and optimize the system in a targeted way.

In fact, the SRE operations and maintenance system takes user experience as its core and uses automation and operations data as the means to ensure business continuity. In this respect it is very different from traditional operations. We are no longer just installation and deployment engineers; we need to ensure the stability and reliability of the business on top through a range of technical means.
