2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
How can the 18 key problems of Linux automated operations be solved? This article analyzes each problem and its solution in detail, in the hope of helping readers who face these problems find a simpler, more workable approach.
Not long ago I shared my experience and tooling for putting enterprise automated operations into practice. Many of the scenarios came from comparing, on the basis of practical experience, how front-line Internet companies and traditional industries work: how should automated operations be integrated as a whole, and how should it be understood and built from a methodological perspective?
This article collects a series of specific questions about automated operations raised by operations enthusiasts, together with the results of the discussion.
I. Risks of the automated operations platform
Question 1: How can the risks of automated operations be controlled?
First, every automation module ultimately comes down to code, so the code behind automated operations functions must be tested, and it should follow a development project management process.
Second, operations that delete or modify resources need a double check and a rollback plan, and operations that cannot be rolled back should not be performed (this is no different from manual operations).
Third, use a grayscale strategy to verify whether the result of an automated operation matches expectations: if it does, continue; if it does not, roll back.
Fourth, work with monitoring: the monitoring system should detect faulty operations and raise an alarm promptly.
Fifth, manage permissions: strictly control who may operate the automated operations platform.
Sixth, systems that integrate through APIs need an authentication mechanism.
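The grayscale strategy with rollback described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function name gray_rollout and the apply, verify, and rollback callbacks are hypothetical and stand in for whatever the platform actually does to a host.

```python
from typing import Callable, Iterable, List


def gray_rollout(
    hosts: Iterable[str],
    apply: Callable[[str], None],
    verify: Callable[[str], bool],
    rollback: Callable[[str], None],
    batch_size: int = 1,
) -> bool:
    """Apply a change batch by batch; undo every touched host if a batch fails."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pending: List[str] = list(hosts)
    touched: List[str] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for host in batch:
            apply(host)
            touched.append(host)
        # Compare the actual result with the expectation before widening the rollout.
        if not all(verify(host) for host in batch):
            # Roll back in reverse order, newest change first.
            for host in reversed(touched):
                rollback(host)
            return False
    return True
```

Hosts in later batches are never touched once a batch fails verification, which is what keeps the blast radius small.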
Question 2: How can the security and permissions of the automated operations platform be controlled?
Personally, I think the following aspects deserve attention:
Control permissions for Web page operations by assigning roles through the AD domain.
For interface (API) calls, a corresponding permission module is required.
The operations platform itself must be prevented from deleting or modifying production resources without authorization.
Run regular security scans to find vulnerabilities in the platform itself.
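A role-based permission check of the kind listed above might look like the sketch below. The role names, the action names, and the rule that destructive production actions need an explicit approval flag are all assumptions made for illustration; a real platform would load roles from AD or its own permission module.

```python
# Hypothetical role-to-action mapping; real roles would come from AD groups.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "operator": {"read", "execute"},
    "admin": {"read", "execute", "modify", "delete"},
}

# Actions treated as destructive when they target production (an assumption).
DESTRUCTIVE_ACTIONS = {"modify", "delete"}


def is_allowed(role: str, action: str, environment: str, approved: bool = False) -> bool:
    """Return True if the role may perform the action in the given environment."""
    if action not in ROLE_PERMISSIONS.get(role, set()):
        return False
    # Deleting or modifying production resources additionally needs authorization.
    if environment == "production" and action in DESTRUCTIVE_ACTIONS:
        return approved
    return True
```

The same function can guard both Web page operations and API calls, so the two entry points cannot drift apart.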
II. Planning the automated operations platform
Question 1: How should the construction of automated operations be planned?
There is no fixed answer; the work has to proceed in steps that fit the specific situation, with the ultimate goal of full end-to-end delivery. Generally speaking, it can be divided into the following stages:
Solve the most urgent pain points first (usually the operations team's own pain points, or problems raised by other teams that have been backlogged for a long time).
Collect the automated operations requirements of other groups in the IT department (development and test teams) and schedule them internally.
Once the first two stages are done, connect the individual points to eliminate the manual work between them.
Find and fill the gaps in the initial automated operations chain to form a positive feedback loop.
Question 2: How should standards be defined when building automated operations?
Standardization has to fit the company's specific situation. Generally speaking, the following aspects need to be standardized (for reference):
Server Pod standardization: how many machines go into a Pod and how they are connected.
Physical server models: compute-intensive, memory-intensive, and I/O-intensive workloads should be mapped to a few standard models from different manufacturers.
Operating system standardization, including operating system version, kernel parameters, disk paths, and so on.
Software installation standardization, including software version, installation path, log path, log rotation, parameter tuning, and so on.
Software deployment standardization: the two nodes of a pair must not be deployed on the same physical machine or in the same cabinet, to avoid host-level and cabinet-level failures.
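The deployment rule above (paired nodes must not share a physical machine or a cabinet) is easy to check mechanically. The sketch below assumes a hypothetical Node record with name, host, and cabinet fields; the field names are illustrative and not taken from any particular CMDB.

```python
from collections import namedtuple
from typing import List, Tuple

# Hypothetical placement record for one node of a service.
Node = namedtuple("Node", ["name", "host", "cabinet"])


def placement_violations(nodes: List[Node]) -> List[Tuple[str, str, str]]:
    """Return (node_a, node_b, reason) for every pair that shares a host or cabinet."""
    violations = []
    for index, first in enumerate(nodes):
        for second in nodes[index + 1:]:
            if first.host == second.host:
                violations.append((first.name, second.name, "same host"))
            elif first.cabinet == second.cabinet:
                violations.append((first.name, second.name, "same cabinet"))
    return violations
```

An empty result means the placement satisfies the standard; the check could run before a deployment is accepted.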
Question 3: In a real operations environment, how should a complete automated operations management plan be drawn up to support automated operations work?
To formulate an automated operations plan, the following aspects need to be considered:
Clarify the purpose of the plan; this is its guiding principle.
Define the roles that the plan serves.
Clarify how each role engages with the automated operations process.
Identify the security issues that need attention when the plan is put into practice (such as fine-grained permissions, call authentication, and operation auditing).
Survey other colleagues to better understand their operations needs.
State in the plan how many stages the platform build is divided into, and distribute the requirements across those stages.
Specify how the plan will be turned into an automated operations platform (in-house development, outsourcing, or secondary development on top of an outsourced product).
Define the positive feedback process for use of the platform.
Question 4: How many stages are needed to build automated operations, and how should they be planned?
There is no fixed answer; the work has to proceed in steps that fit the specific situation, with the ultimate goal of full end-to-end delivery. Generally speaking, it can be divided into the following stages:
Solve the most urgent current pain points.
Collect the automated operations requirements of other groups in the IT department (development and test teams).
Once the first two stages are done, connect the individual points to eliminate the manual work between them.
Find and fill the gaps in the initial automated operations chain.
III. CMDB data acquisition
Question 1: How is automatic discovery achieved when building a CMDB?
Automatic discovery in a CMDB is generally based on the following methods:
Calling the API of the software being collected, for example VMware or EMC storage.
Obtaining configuration information through a protocol (public or private), such as SNMP.
Executing commands on the host and processing the results, for example to capture information about middleware on the host.
Executing the middleware's own commands to obtain information.
In practice, automatic discovery is usually achieved by combining these methods.
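The "execute a command on the host and process the result" method can be sketched as below. This is only an illustration under stated assumptions: the helper names parse_key_value and collect are hypothetical, and the parser assumes the command prints KEY=VALUE lines (as /etc/os-release does on most Linux distributions).

```python
import subprocess
from typing import Dict, List


def parse_key_value(text: str) -> Dict[str, str]:
    """Turn KEY=VALUE lines into a dict, skipping blanks, comments, and other lines."""
    facts: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        facts[key.strip()] = value.strip().strip('"')
    return facts


def collect(command: List[str]) -> Dict[str, str]:
    """Run a command on the local host and parse its standard output."""
    result = subprocess.run(
        command, capture_output=True, text=True, check=True, timeout=10
    )
    return parse_key_value(result.stdout)
```

On a Linux host, collect(["cat", "/etc/os-release"]) would return the operating system facts as a dict that a CMDB collector could then upload; collecting from remote hosts would need an agent or SSH layer that is not shown here.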
Question 2: How should a CMDB be chosen for automatic data collection when building automated operations?
This is a broad question. To collect CMDB data comprehensively, two things have to be considered: the automatic collection capability of the CMDB collection tools, and the fact that some data has to be entered manually through a process. For example, the name of a business system and the people responsible for its operations, development, and testing cannot be collected by automatic tools and have to be maintained manually.
If you need to build a CMDB system, there are three approaches:
Fully in-house development. This requires a team with relatively strong R&D capability and people who understand ITIL processes well, and automatic collection is slow to implement.
Purchasing a commercial CMDB product. The advantages are a quick launch and strong automatic collection; the disadvantage is that some requirements may not be met out of the box and need custom development.
Secondary development on top of an open source product, for example IOP. The advantage is a usable basic framework, but the automatic discovery capability still has to be implemented yourself.
Question 3: How can CMDB data be kept both up to date and consistent?
Timeliness: keeping CMDB data up to date depends on the automatic collection capability of the CMDB tools.
Consistency: consistency requires process control and regular data audits, both of which can be supported by the capabilities of the CMDB platform.
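A regular data audit of the kind mentioned above boils down to comparing what the CMDB records with what automatic collection actually finds. The sketch below assumes both sides are available as dicts keyed by a configuration item name; the function name audit_cmdb and the report keys are hypothetical.

```python
from typing import Any, Dict


def audit_cmdb(
    recorded: Dict[str, Dict[str, Any]],
    discovered: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Compare CMDB records with freshly discovered data and report the differences."""
    report: Dict[str, Any] = {
        # Found on the network but never registered in the CMDB.
        "missing_in_cmdb": sorted(discovered.keys() - recorded.keys()),
        # Registered in the CMDB but no longer discovered.
        "stale_in_cmdb": sorted(recorded.keys() - discovered.keys()),
        "mismatched": {},
    }
    for name in sorted(recorded.keys() & discovered.keys()):
        # Only compare attributes that discovery can actually observe.
        diff = {
            field: (recorded[name].get(field), value)
            for field, value in discovered[name].items()
            if recorded[name].get(field) != value
        }
        if diff:
            report["mismatched"][name] = diff
    return report
```

Manually maintained fields (such as the person responsible for a system) are left alone because only discovered attributes are compared.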
IV. Selection of operations tools
Question 1: What factors should be considered when selecting automated operations tools?
When choosing automated operations tools, I think the following aspects should be considered:
The maturity of the tool, that is, how widely it is used in the industry. Both commercial and open source tools can be evaluated this way.
Whether the tool's functions meet the operations requirements.
For an open source tool, whether its technology stack matches that of the company's staff.
Whether the tool has good security support.
The tool's impact on host performance, and in particular the load on the tool platform's own server under high concurrency, which should be tested.
Whether the tool fits the company's future technology stack.
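One simple way to compare candidate tools against the criteria above is a weighted score. The sketch below is only an illustration: the criterion names, the weights, and the 1 to 5 rating scale are assumptions, and each team would choose its own.

```python
from typing import Dict


def score_tool(ratings: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted average rating of a tool across all weighted criteria."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(ratings[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical weights reflecting the six criteria discussed above.
EXAMPLE_WEIGHTS = {
    "maturity": 2,
    "functional_fit": 3,
    "stack_match": 2,
    "security": 2,
    "performance_impact": 1,
    "future_fit": 1,
}
```

The score is only a conversation aid; a tool that fails a hard requirement (for example, security) should be excluded regardless of its average.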
Question 2: How should operations tools be planned and integrated when building automated operations?
Most companies do have this problem today. In my opinion, the main cause is the lack of an overall plan early on: each team does things its own way with no overall management.
So how should the existing situation be handled? In my opinion, the following should be done:
Set up a governance team that includes the owner of each existing system and is headed by a single leader.
Have each system owner describe why the system was built, which problems it solves, and which problems remain unsolved.
Based on the results of that discussion, merge the systems that can be merged; for systems with overlapping functions that cannot be merged, connect their data and provide unified output.
Have the governance team plan all subsequent system construction to prevent the same situation from recurring.
Question 3: How should automated operations products be chosen?
Automated operations covers a wide range of areas, including resource self-service, monitoring, task scheduling, application release, and so on. Consider the following points when choosing a product:
Sort out your own pain points, that is, the problems that most need solving right now.
Planning: what do you intend to achieve within 3 years?
The product maturity of the platform (how many reference cases exist in the same industry).
How extensible the platform is: whether it allows secondary development or supports functional extension.
Whether the platform's technical framework is a mainstream one.
Run a trial to test how well the product fits your actual environment.
V. Other
Question 1: What is the relationship between AIOps and automated operations?
AIOps is a part of automated operations. With the popularity of AI in recent years, AIOps has begun to appear. Automation involves every aspect of operations, whereas AIOps only applies AI technology to the existing Ops platform, generally in combination with big data technology.
Question 2: Can advanced technologies such as cloud computing and big data be combined to make automated operations more efficient and intelligent?
Combined with cloud computing, the service capacity of the automated operations platform can be expanded quickly; combined with big data and artificial intelligence, the platform can provide more powerful functions, which is the AIOps that many people are now starting to pay attention to.
The risks need manual checks. For example, when an action is triggered automatically on the basis of big data and artificial intelligence, a manual double check is needed in the early period of using the technology, and each action should be graded by priority and importance. Actions of low priority and low importance can be handled automatically.
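The grading rule above can be expressed as a small gate in front of any AI-triggered action. The three-level scale and the names handling_mode, "automatic", and "manual_double_check" are assumptions for illustration; the only rule taken from the text is that low priority plus low importance may run without a human.

```python
LEVELS = {"low", "medium", "high"}


def handling_mode(priority: str, importance: str) -> str:
    """Decide whether an AI-suggested action may run automatically."""
    if priority not in LEVELS or importance not in LEVELS:
        raise ValueError("priority and importance must be low, medium, or high")
    # Only actions graded low on both axes skip the manual double check.
    if priority == "low" and importance == "low":
        return "automatic"
    return "manual_double_check"
```

As confidence in the technology grows, the set of grades that qualify for automatic handling could be widened in one place.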
Question 3: How do traditional enterprises and Internet enterprises differ in their operations focus?
The differences between traditional industries and the Internet industry in operations are as follows:
Operations as code: in traditional industries operations are mostly done by hand on an operations platform, or even purely manually, while Internet companies are more likely to carry out operations through code to avoid manual work. This is why Internet companies require development skills from operations staff.
Points versus lines: traditional industries have purchased many operations platforms at different times, and each platform is independent and discrete. The operations platforms of Internet companies are mostly linear: they are chained together and can deliver end to end.
Different requirements for staff: Internet companies expect operations staff at every level to have some development ability or a deep (code-level) understanding of underlying principles, while traditional industries mostly require operational skills.
Question 4: How can the automated operations platform get closer to the business and identify, in a timely manner, business risks that have occurred or are about to occur?
To get closer to the business, automated operations first needs to collect the business's automation requirements and meet them through the platform. This is the first step.
Second, the business systems need to be monitored. On that basis, agree on risk indicators with the business, quantify them, and configure them in the monitoring system of the automated operations platform, then use the platform's monitoring capability to watch them 7×24. When an indicator reaches its alarm threshold, an alarm is sent by SMS, WeChat, e-mail, and so on.
Finally, the configuration of risk indicators can be improved step by step by combining big data analysis and AI, forming a positive feedback loop suited to each business system.
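Quantified risk indicators with alarm thresholds can be sketched as follows. The indicator names in the usage note, the function names check_indicators and dispatch, and the idea of passing notification channels as plain callables are assumptions for illustration; real SMS, WeChat, or e-mail senders would be plugged in by the platform.

```python
from typing import Callable, Dict, Iterable, List


def check_indicators(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one alarm message for every indicator at or above its threshold."""
    alerts: List[str] = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        # Indicators with no current reading are skipped rather than alarmed.
        if value is not None and value >= limit:
            alerts.append(f"{name}={value} reached alarm threshold {limit}")
    return alerts


def dispatch(alerts: Iterable[str], channels: Iterable[Callable[[str], None]]) -> None:
    """Send every alarm message through every configured notification channel."""
    messages = list(alerts)
    for send in channels:
        for message in messages:
            send(message)
```

For example, check_indicators({"order_failure_rate": 0.07}, {"order_failure_rate": 0.05}) yields one alarm, which dispatch then fans out to each channel.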
Question 5: What is the difference between traditional IT operations and automated operations?
In fact, semi-automated operations emerged because it solves point problems: it turns the manual operation at each point into a scripted or platform-based automatic action. These actions are discrete, essentially points rather than a line, let alone a plane. Real automated operations achieves end-to-end automated delivery: the whole chain from development to testing to operations is automated, eliminating manual work.
For example, to create a Redis middleware instance, the semi-automated approach is:
Apply for a machine on the virtualization platform.
Have the network team assign an IP address (manually).
Initialize the machine with a separate script (run manually).
Install Redis with an installation script (run manually).
Notify the applicant by e-mail or in person.
In the automated approach, the applicant submits a request to create Redis, the automation platform does everything, and it then calls the mail interface to notify the applicant.
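The difference between the two approaches can be illustrated by chaining the five steps into one pipeline, so that a single request drives everything. In the sketch below every step is a hypothetical stub that only records what it would do: no real virtualization, network, Redis, or mail API is called, and the IP address is a documentation address.

```python
from typing import Callable, Dict, List

Context = Dict[str, object]
Step = Callable[[Context], Context]


def apply_for_machine(ctx: Context) -> Context:
    return {**ctx, "machine": f"vm-for-{ctx['applicant']}"}


def assign_ip(ctx: Context) -> Context:
    return {**ctx, "ip": "192.0.2.10"}  # placeholder address, not a real allocation


def initialize_machine(ctx: Context) -> Context:
    return {**ctx, "initialized": True}


def install_redis(ctx: Context) -> Context:
    return {**ctx, "redis_installed": True}


def notify_applicant(ctx: Context) -> Context:
    return {**ctx, "notified": ctx["applicant"]}


REDIS_PIPELINE: List[Step] = [
    apply_for_machine,
    assign_ip,
    initialize_machine,
    install_redis,
    notify_applicant,
]


def run_pipeline(request: Context, steps: List[Step]) -> Context:
    """Run every step in order, passing the growing context from one to the next."""
    context = dict(request)
    completed: List[str] = []
    for step in steps:
        context = step(context)
        completed.append(step.__name__)
    context["completed_steps"] = completed
    return context
```

Calling run_pipeline({"applicant": "alice"}, REDIS_PIPELINE) replaces the five manual hand-offs with one submission; in the semi-automated approach each of these functions exists as a separate script that a person has to trigger.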
Question 6: Where should the boundary of in-house development of automated operations be drawn, so that the company keeps control while also making full use of and developing its employees' abilities?
There are two approaches to keeping control: fully in-house development, and secondary development on top of a purchased automated operations platform.
In the first case, the company's staff need a certain level of development ability. The advantage is that local requirements can be fully accommodated; the disadvantage is that the demands on staff are relatively high and the platform takes shape slowly.
In the second case, the technology stack of the purchased platform needs to match that of the company's developers or operations staff, and the platform must either open its source code or provide rich secondary development interfaces. The advantage is that at least 80% of the requirements can be met quickly; the disadvantage is that the existing code has to be understood and the result is less flexible.
Those are the answers to the 18 questions about putting enterprise automated operations into practice. I hope they are helpful.