How to solve the two difficult problems of release coordination and monitoring alarm in big data

2025-04-09 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how to tackle two difficult problems in big data operations: release coordination, and monitoring and alerting. It aims to be concise and practical, and I hope the detailed walkthrough is useful to you.

Today I will mainly share how Dataman Cloud (Shurenyun) has put SRE into practice. Because our target customers are mainly in the financial industry, this article starts from the characteristics of ITSM in that industry and then covers release coordination and monitoring and alerting in real scenarios.

Core idea of SRE

SRE is a model that Google refined over more than ten years of running production operations. It distills a great deal of practical experience, differs from traditional operations, and can be seen as a concrete operations practice built on DevOps ideas.

The responsibilities of the SRE position are as follows:

Emergency response

Daily operation and maintenance

Engineering research and development

The job responsibilities of SRE resemble those of traditional operations, but the way of working differs greatly:

1) Emergency response covers monitoring, handling incidents, and writing up a summary afterwards.

2) Daily operation and maintenance includes capacity planning, performance monitoring, and change management.

3) Engineering research and development is where SRE differs most from traditional operations: it involves taking part in development work, defining and guaranteeing SLOs, and automation, which is both a long-term goal and a hot topic.

SRE's working principles:

Embrace change: do not reject change out of fear of risk; discover and solve problems through continual drills and rehearsals.

Service level objectives: assign service levels to different operators.

Reduce toil: free up more time for development.

Automation and simplicity: the main purpose of development work.

ITSM characteristics of financial industry

The main ITSM characteristics of the financial industry are hierarchical management and a working mode in which operations and development are completely separated, which is a sign that DevOps ideas have not yet taken hold. The operations team grows linearly with the number of systems: for example, when a system goes live, one or two operations engineers are assigned to follow it. From networking to resource allocation, their responsibilities center on emergency handling and routine changes.

The CSRC and CBRC impose compliance requirements on operations, such as the "two sites, three data centers" deployment model, which is a distinctive characteristic of the financial industry.

Differences between the traditional model and SRE:

Traditional model: recruitment is easy. Traditional operations staff are first expected to write scripts in Shell, Python and the like; new requirements arise around automated operations tools; and accumulated experience is used to handle problems, such as incidents that have been solved before.

SRE: recruitment is difficult. The position is relatively new, so a perfect match is hard to find. There are development requirements and an emphasis on automation, including writing code for automation tools, as well as teamwork.

Next, I will share how SRE was put into practice in two customer scenarios: an exchange and the credit card center of a commercial bank.

The SRE platform architecture model is shown in the figure above. The resource supply layer is a PaaS platform based on Dataman Cloud, which uses Docker containers for resource management, with application scheduling based on Mesos and Marathon. Dataman Cloud has also open-sourced an application scheduler named Swan (a Mesos scheduler, GitHub address: https://github.com/Dataman-Cloud/swan, stars and forks welcome), with the goal of replacing Marathon. Above that is the software technology architecture layer, which corresponds to the company's architecture department and covers choices such as the RPC framework, caching, and the message center.

The content shared here sits mainly at the DC SRE level. Further up are TISRE in the product line and APP SRE close to the users, so I see this as a long-term construction process.

Practice: Release Coordination

Release coordination is common in daily work, for example in application launches and change management. Guided by SRE, we set up a release coordination team on site at one project. Setting up this team is tied to how systems go live in the financial industry:

The financial industry has many systems, including credit, credit review and many other applications, and their logic is complex. The development and test environments are completely isolated from production, including physically. This differs from the Internet industry, where releases go straight online, the test environment may be the production environment, and grayscale or blue-green releases are used.

Launch coordination has to deal with several outsourcing teams at once, and those teams are relatively hard to control, so communication costs are high.

Large systems have a long launch cycle.

How can these problems be solved so that releases are controllable, the release failure rate drops, and the release cycle shortens?

In the SRE approach, the first step is to establish a release coordination team. At present, SRE engineers can only be trained in-house. Recommended team composition: project manager, architect, operations engineer, development engineer. The team communicates mainly through release and launch meetings, and continuously checks the systems or products.

The team's responsibilities include reviewing new products and internal services to ensure they meet the expected performance and non-performance indicators; owning the technical issues that arise during a release, according to the release task progress; and coordinating the development progress of the outsourcing companies.

Most importantly, the team acts as the gatekeeper of the release process: it decides whether a system goes live.
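The gatekeeper role can be made concrete as a go/no-go check over a release checklist. The following is a minimal sketch, not the team's actual tooling: the `CheckItem` and `GateDecision` types, the `blocking` flag, and the item names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CheckItem:
    """One entry on a release checklist (names are illustrative)."""
    name: str
    passed: bool
    blocking: bool = True  # non-blocking items only produce warnings


@dataclass
class GateDecision:
    approved: bool
    blockers: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


def review_release(items):
    """Approve the release only if every blocking item has passed."""
    blockers = [i.name for i in items if not i.passed and i.blocking]
    warnings = [i.name for i in items if not i.passed and not i.blocking]
    return GateDecision(approved=not blockers, blockers=blockers, warnings=warnings)


# Hypothetical checklist for one release review meeting.
checklist = [
    CheckItem("architecture and dependencies reviewed", passed=True),
    CheckItem("capacity plan approved", passed=False),
    CheckItem("release plan and rollback steps documented", passed=True),
    CheckItem("log format follows the specification", passed=False, blocking=False),
]
decision = review_release(checklist)
print(decision)
```

Separating blocking items from warnings keeps the decision strict where it matters while still recording smaller gaps for follow-up.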

To keep this work going, the team holds review meetings at different stages of the service lifecycle. Following the release checklist, each meeting covers three aspects: architecture and dependencies, capacity planning, and the release plan.

Architecture and dependencies: review the logical architecture and the deployment architecture, including the order of the request data flow, check load balancing, and check that logging follows the specification, with a focus on monitoring requirements. Also check whether calls to third parties are covered by monitoring and test management, and so on.

Capacity planning: estimate capacity and the peak load mainly from the stress test report. For example, when there are many WeChat campaigns, the required resources are estimated from the expected peak and the company budget, and then mapped onto containers in a detailed plan to ensure a successful release.

Release plan: make a release plan to ensure success.
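The capacity-planning step described above comes down to simple arithmetic: divide the expected peak by what one container sustained in the stress test, then add a safety margin. The sketch below assumes that throughput scales roughly linearly with container count; the function name, the 30% headroom, and the minimum of two replicas are illustrative assumptions, not figures from the project.

```python
import math


def containers_needed(peak_qps, qps_per_container, headroom=0.3, min_replicas=2):
    """Estimate container count from a stress-test result and an expected peak.

    peak_qps:          estimated peak requests per second (e.g. during a campaign)
    qps_per_container: sustained throughput of one container in the stress test
    headroom:          spare capacity kept in reserve (0.3 = 30%), an assumption
    min_replicas:      floor so that losing a single container is survivable
    """
    if qps_per_container <= 0:
        raise ValueError("qps_per_container must be positive")
    required = peak_qps * (1 + headroom) / qps_per_container
    return max(min_replicas, math.ceil(required))


# Hypothetical numbers: 1200 QPS peak, 150 QPS per container in the stress test.
print(containers_needed(1200, 150))
```

The result still has to be checked against the budget, since the estimate only says what the peak requires, not what the company can afford.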

Under SRE, everything should be implemented in tools, and the tools should verify that everything is in place. We built a release platform that includes Pipeline and Jenkins; through it we drive the F5 configuration on the load balancer, the configuration center, and the service registry. All release items run on the container cloud platform, and the functional modules include change management, release management, process templates, and release process monitoring.

The figure above is an overview of the release platform, showing how releases performed for each project: success rate and failure rate. Before the release platform existed, managers of the launch process could not see release details in real time, such as whether a release was stuck on the network or on a particular service, so progress was out of control.

With this operations dashboard, the whole release process can be tracked visually, while key nodes still require manual review.

Specific release steps:

First, check the configuration in F5

Second, manual inspection

Third, upload packages and manage configuration items

Finally, restart the container and check it manually
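The steps above mix automated actions with manual checkpoints. A minimal sketch of that pattern follows; the `Step` type and `run_release` function are hypothetical, the automated actions are placeholders that always succeed, and `approve` stands in for the human sign-off that the real platform collects through its UI.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    # An automated action returns True on success; None marks a manual checkpoint.
    action: Optional[Callable[[], bool]] = None


def run_release(steps, approve):
    """Run steps in order and stop at the first failure or rejected checkpoint.

    Returns a log of (step name, status) pairs so progress can be displayed.
    """
    log = []
    for step in steps:
        if step.action is None:
            status = "ok" if approve(step.name) else "rejected"
        else:
            status = "ok" if step.action() else "failed"
        log.append((step.name, status))
        if status != "ok":
            break
    return log


# Placeholder actions; a real platform would call F5, the config center, etc.
steps = [
    Step("check F5 configuration", lambda: True),
    Step("manual inspection"),
    Step("upload package and apply configuration items", lambda: True),
    Step("restart container", lambda: True),
    Step("manual check after restart"),
]
log = run_release(steps, approve=lambda name: True)
for name, status in log:
    print(f"{status:8} {name}")
```

Because the log records every step and its status, it can back the kind of visual release tracking described in this section.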

The whole process reflects the SRE approach. Every step on the release platform can be completed through UI configuration, with manual involvement at the key points. The aim is to raise the launch success rate and to avoid manual configuration mistakes during launch that lead to rollbacks.

With the release coordination team in place, success rate, automation and release efficiency improved significantly. Manual operations on Jenkins and Pipeline configuration items were reduced, and with them the probability of error.

Practice: Monitoring and Alerting

Monitoring is a vertical system that plays an important role in Dataman Cloud's product line, and we understand its importance well. In the SRE team at one financial company there are one or two monitoring specialists whose main responsibility is to maintain the monitoring system: one is an internal employee of the customer (Party A), and the other is a Dataman Cloud colleague.

Monitoring has two main problems to solve. First, problems must be detected in time; timeliness is critical, which is why a monitoring system is needed.

Second, to understand why a failure happened, incidents need to be summarized and faults tracked afterwards.

The figure above shows the architecture of the monitoring system, which is built on the Prometheus time-series database; the red line is the monitoring data flow. Because the platform runs on Mesos, the monitored items on the Mesos compute nodes appear on the left. The CPU, memory and disk information of each host and of all containers on it is collected by containerized cAdvisor components. Alerting uses Prometheus's Alertmanager component, which supports webhooks, and alarms are sent by SMS and email. To support later summaries, some alarm events also need to be stored in a database.
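Storing alarm events for later review can be sketched as a small function that takes the JSON body Alertmanager posts to a webhook receiver and writes one row per alert. This is an illustration under assumptions: the article does not say which database is used, so SQLite stands in here, and the table layout, the `store_alerts` name, the `severity` label and the sample alert are all hypothetical. The payload fields read (`alerts`, `status`, `labels`, `annotations`, `startsAt`, `fingerprint`) follow Alertmanager's webhook format.

```python
import json
import sqlite3


def store_alerts(conn, payload):
    """Persist each alert in an Alertmanager webhook payload; return the count."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS alert_events (
               fingerprint TEXT, status TEXT, alertname TEXT,
               severity TEXT, starts_at TEXT, summary TEXT)"""
    )
    rows = [
        (
            a.get("fingerprint", ""),
            a.get("status", ""),
            a.get("labels", {}).get("alertname", ""),
            a.get("labels", {}).get("severity", ""),
            a.get("startsAt", ""),
            a.get("annotations", {}).get("summary", ""),
        )
        for a in payload.get("alerts", [])
    ]
    conn.executemany("INSERT INTO alert_events VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# A trimmed, made-up example of a webhook body.
body = json.loads("""
{
  "version": "4",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "HostHighCpu", "severity": "critical"},
      "annotations": {"summary": "CPU above 90% for 5 minutes"},
      "startsAt": "2017-05-31T08:00:00Z",
      "fingerprint": "a1b2c3d4"
    }
  ]
}
""")
conn = sqlite3.connect(":memory:")
stored = store_alerts(conn, body)
print(stored, conn.execute("SELECT alertname, status FROM alert_events").fetchall())
```

In a real receiver this function would sit behind an HTTP handler, and the same handler could forward the alert to SMS and email gateways.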

The green line covers log collection and keyword alarms. Containerized Logstash components collect logs from middleware in the containers, such as Tomcat and other web middleware, as well as application logs, and push selected information into Kafka as needed; online log analysis is then done on the big data platform.

The log keyword alarm is an event alarm: a self-developed component filters for keywords while the logs are in transit.
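The self-developed filtering component is not described in detail, so the following is only a sketch of the general idea: scan log lines as they stream past and emit an event for each line that contains a configured keyword. The `keyword_alerts` function, the keywords and the sample lines are all made up for illustration.

```python
import re


def keyword_alerts(lines, keywords):
    """Yield (line_number, keyword, line) for each log line containing a keyword."""
    if not keywords:
        return  # an empty pattern would otherwise match every line
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    for number, line in enumerate(lines, start=1):
        match = pattern.search(line)
        if match:
            yield number, match.group(0), line.rstrip("\n")


sample = [
    "INFO  service started",
    "ERROR java.lang.OutOfMemoryError: Java heap space",
    "WARN  slow response",
    "error: connection refused",
]
hits = list(keyword_alerts(sample, ["OutOfMemoryError", "connection refused"]))
for hit in hits:
    print(hit)
```

Because the function is a generator, it can consume an unbounded stream, such as messages read from a queue, without buffering the whole log.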

The health status of CCS is pushed to Prometheus from Marathon events via the Pushgateway. Finally, monitoring information and alarm configuration are viewed in a self-developed front-end UI. For convenience, Prometheus queries are wrapped in a unified layer to reduce the complexity of using the API.
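A unified wrapper over Prometheus queries might look like the sketch below. It is not the team's actual wrapper: the `PromClient` class and the host name are hypothetical, it covers only instant queries that return a vector, and the `fetch` parameter is there so the example runs without a server. The `/api/v1/query` endpoint and the response shape are those of the Prometheus HTTP API, and `up` is a metric Prometheus generates for every scrape target.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


class PromClient:
    """Thin wrapper that hides URL building and response parsing."""

    def __init__(self, base_url, fetch=None):
        self.base_url = base_url.rstrip("/")
        # `fetch` is injectable so the wrapper can be exercised offline.
        self._fetch = fetch or self._http_get

    @staticmethod
    def _http_get(url):
        with urlopen(url, timeout=5) as resp:
            return resp.read().decode("utf-8")

    def instant_url(self, promql):
        return f"{self.base_url}/api/v1/query?{urlencode({'query': promql})}"

    def query(self, promql):
        """Run an instant query and return [(labels, float_value), ...]."""
        data = json.loads(self._fetch(self.instant_url(promql)))
        if data.get("status") != "success":
            raise RuntimeError(data.get("error", "query failed"))
        return [(r["metric"], float(r["value"][1])) for r in data["data"]["result"]]


# Canned response standing in for a real Prometheus server.
canned = json.dumps({
    "status": "success",
    "data": {"resultType": "vector",
             "result": [{"metric": {"job": "marathon"}, "value": [1700000000, "1"]}]},
})
client = PromClient("http://prometheus.example:9090/", fetch=lambda url: canned)
print(client.query("up"))
```

Callers then deal only with label dictionaries and floats, which is the kind of simplification the wrapper is meant to provide.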

The SRE system explicitly names four golden signals of monitoring: latency, traffic, errors, and saturation:

Latency: monitor the access time of a service's APIs and URLs. A service URL is configured directly, the backend accesses it continuously with an expected access time, and an alarm is raised when that time is exceeded.

Traffic: the number of requests and connections at the load balancer.

Errors: monitor HTTP response codes and exception keywords in the logs.

Saturation: monitor the resource that matters for each system, for example memory utilization for memory-bound systems and IO for systems with heavy IO reads and writes.
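The latency check described above, accessing a configured URL and alarming when the expected time is exceeded, can be sketched as a single probe function. The names, the URL and the threshold are illustrative assumptions; `fetch` and `clock` are injectable so the example runs without network access.

```python
import time
from urllib.request import urlopen


def probe(url, threshold_seconds, fetch=None, clock=time.monotonic, timeout=5.0):
    """Time one request to `url`; flag it if it fails or exceeds the threshold."""
    fetch = fetch or (lambda u: urlopen(u, timeout=timeout).read())
    start = clock()
    try:
        fetch(url)
        error = None
    except Exception as exc:  # any failure also counts as an error signal
        error = str(exc)
    elapsed = clock() - start
    return {
        "url": url,
        "elapsed": elapsed,
        "alert": error is not None or elapsed > threshold_seconds,
        "error": error,
    }


# Offline demonstration with a stand-in fetch; a scheduler would call probe()
# repeatedly against the real service URL.
result = probe("http://service.example/health", 0.5, fetch=lambda u: b"ok")
print(result)
```

A failed request is treated as an alert as well, so the same probe feeds both the latency and the error signal.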

These indicators need continuous tuning during operations. For example, some alarms are caused by network problems: a network switch fault can directly bring down the platform's Marathon components, applications that depend on third-party service calls are visibly affected, and many problems then surface one after another. Common problems need to be aggregated to reduce the number of false alarms. However, reducing false alarms may also suppress valid alarms, which deserves careful thought.
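One simple way to aggregate related alarms is to group those that share a label and arrive within the same time window, and send one notification per group. The sketch below is an assumption about how that could work, not the platform's implementation: the `cluster` label, the five-minute window and the alert names are made up, and fixed windows are a simplification (two related alerts on either side of a window boundary land in different groups).

```python
from collections import defaultdict


def aggregate(alerts, window_seconds=300, key="cluster"):
    """Group alerts that share `key` and fall within the same time window.

    Each alert is a dict with a numeric 'ts' (epoch seconds), a 'name',
    and the grouping label. Returns one summary per group.
    """
    groups = defaultdict(list)
    for alert in alerts:
        bucket = int(alert["ts"] // window_seconds)
        groups[(alert.get(key, "unknown"), bucket)].append(alert)
    return [
        {"group": label, "count": len(items),
         "names": sorted({a["name"] for a in items})}
        for (label, _bucket), items in sorted(groups.items())
    ]


alerts = [
    {"ts": 1000, "name": "MarathonDown", "cluster": "dc1"},
    {"ts": 1010, "name": "AppHealthCheckFailed", "cluster": "dc1"},
    {"ts": 1020, "name": "AppHealthCheckFailed", "cluster": "dc1"},
    {"ts": 1030, "name": "DiskFull", "cluster": "dc2"},
]
summaries = aggregate(alerts)
for s in summaries:
    print(s)
```

Each summary keeps the list of distinct alert names, so a valid alarm inside a noisy group stays visible rather than being silently dropped.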

Fault tracking is similar to the after-the-fact summary. Manual operations prolong recovery time. When an alarm fires it is usually handled by one person, and the response is escalated over time: for example, if the fault is still open after half an hour, more resources are called in step by step. Fault tracking focuses first on solving the online problem, supplemented by a perfectionist's attention to detail: summarize the root cause well and feed it back into the automation tools.
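Time-based escalation can be expressed as a lookup from how long a fault has been open to an escalation level. In this sketch only the half-hour threshold comes from the text; the 60-minute threshold and the contact roles are assumptions for illustration.

```python
from bisect import bisect_right

# Minutes after which each level takes over: level 1 at 0, level 2 at 30, ...
THRESHOLDS = (0, 30, 60)
# Hypothetical roles per level.
CONTACTS = {1: "on-call engineer", 2: "team lead", 3: "operations manager"}


def escalation_level(minutes_open, thresholds=THRESHOLDS):
    """Return the escalation level for a fault open for `minutes_open` minutes."""
    return bisect_right(thresholds, minutes_open)


for minutes in (5, 30, 45, 90):
    level = escalation_level(minutes)
    print(f"open {minutes:3d} min -> level {level}: {CONTACTS[level]}")
```

Keeping the thresholds in one sorted tuple makes the policy easy to adjust after each incident review.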

The incident summary is very important. Resolving the problem is not the end; the cause has to be traced. For example, a switch restart caused problems in Marathon. The summary concluded that the best solution is to notify operations when a switch is to be restarted, stop the relevant components, and start them again once the switch has recovered, so that the business is not affected.

Learn from failure: record alarm events to build a knowledge base that can be searched when problems occur, so that solutions are found quickly. After an incident is summarized and resolved, establish a feedback mechanism and keep feeding findings back to the product teams during SRE monitoring, for example on the use of connection pools. Encourage proactive testing, and try to observe the outcome before a production incident happens.

Target location

SRE goals cover a lot of ground and vary by industry, so we should stay grounded in reality and embrace change. To handle incidents better, we keep running exercises and drills and make product suggestions in incident summaries, which also gives us a say in tool development.

That is how the two difficult problems of release coordination and monitoring and alerting can be approached in big data operations.
