Daily work of operation and maintenance

2025-01-28 Update From: SLTechnology News&Howtos

1.1 The main work of Linux operation and maintenance

1. What is Linux operation and maintenance

Operation and maintenance refers to maintaining the network, software and hardware environment that an organization has already built, in order to keep the business running normally.

This maintenance continues for as long as the environment is in operation, and it draws on network, system, database, development, security and monitoring technologies.

There are many kinds of operation and maintenance, including DBA operation and maintenance, website operation and maintenance, virtual operation and maintenance, monitoring operation and maintenance, game operation and maintenance, and so on.

Classification of operation and maintenance:

1) Operation and maintenance development: develops the tools and platforms used by application operation and maintenance.

2) Application operation and maintenance: launches, maintains and troubleshoots business services, using the tools built by operation and maintenance development.

3) System operation and maintenance: provides the business infrastructure for application operation and maintenance, such as systems, network, monitoring and hardware.

2. Common work contents of basic operation and maintenance

Service monitoring: including the research, development and application of monitoring platforms, and ensuring that service monitoring is accurate, real-time and comprehensive.

Service fault management: including designing and automatically executing fault response plans, summarizing faults, and feeding the results back to product / system design to improve product stability.

Service capacity management: measuring service capacity and planning data center construction, capacity expansion, migration, etc.

Service performance optimization: improve service performance, response speed and user experience through network optimization, operating system optimization, application optimization, client optimization, etc.

Global traffic scheduling: traffic to a service is allocated among data centers according to their capacity and service status.

Service security: including service access security, anti-attack, access control, etc.

Automated release and deployment of services: research and development of deployment platforms / tools, and using them to release services safely and efficiently.

Service cluster management: including service server management, large-scale cluster management, etc.

Service cost optimization: reduce the resources used by service operation as much as possible, and reduce the service operation cost.

Database management (DBA): by designing, developing, and managing high-performance database clusters, database services are made more stable, more efficient, and easier to manage.

Platform development: development and management of Docker-like platforms and of service access technology.
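The traffic scheduling item above can be illustrated with a small sketch. It assumes each data center reports a capacity in requests per second and a health flag; the `DataCenter` class, the `allocate_traffic` function and the site names are hypothetical, and real schedulers weigh many more signals (latency, cost, locality).

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str       # hypothetical data center identifier
    capacity: int   # requests per second the site can serve
    healthy: bool   # service status as reported by monitoring


def allocate_traffic(total_rps: int, sites: list[DataCenter]) -> dict[str, int]:
    """Split total_rps across healthy sites in proportion to their capacity."""
    healthy = [s for s in sites if s.healthy]
    total_capacity = sum(s.capacity for s in healthy)
    if total_capacity == 0:
        raise RuntimeError("no healthy capacity available")
    if total_rps > total_capacity:
        raise RuntimeError("demand exceeds healthy capacity")
    shares = {s.name: total_rps * s.capacity // total_capacity for s in healthy}
    # Integer division can leave a small remainder; assign it to the largest site.
    remainder = total_rps - sum(shares.values())
    largest = max(healthy, key=lambda s: s.capacity)
    shares[largest.name] += remainder
    return shares


if __name__ == "__main__":
    sites = [
        DataCenter("dc-a", 600, True),
        DataCenter("dc-b", 400, True),
        DataCenter("dc-c", 1000, False),  # unhealthy, receives no traffic
    ]
    print(allocate_traffic(500, sites))
```

An unhealthy site is simply excluded, so its share moves to the remaining sites, which is the "according to capacity and service status" behaviour in its simplest form.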

1.2 The development process of Linux operation and maintenance

1. Manual management stage

1) Business scale

The business flow is not large, the number of servers is relatively small, and the system complexity is not high.

Day-to-day business management is mostly done by logging in to servers and operating by hand, with each person working in their own way.

Everyone has their own way of operating, and the necessary operating standards and process mechanisms are missing; for example, business directory layouts vary from server to server.

2) Job responsibilities

With few staff, the early operation and maintenance team mainly handled data center construction, basic network construction, server procurement, and server installation and delivery.

It was rarely involved in changing, monitoring or managing online services.

At this time, the operation and maintenance team acts mainly as an infrastructure provider, supplying a simple, usable network and system environment.

2. Tool batch operation phase

1) Business scale

With the increase of server scale and system complexity, the full manual operation mode can no longer meet the needs of the rapid development of business.

As a result, operators gradually began to use batch operation tools, and different scripts appeared for different types of operations.

Efficiency improved somewhat, but it soon hit a bottleneck, and the quality of operations did not improve much.

We began to establish many process rules, such as a review mechanism, observing the first server for 10 minutes after it goes online before continuing with later operations, and observing for at least 20 minutes after an upgrade is completed.

These rules rely mainly on people to supervise and carry them out; in practice they are often not followed properly, and they also reduce work efficiency.
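As an illustration of the batch operation tools described in this stage, the following is a minimal sketch that runs one command on many servers in parallel and collects exit codes. The function names and host names are hypothetical, and the transport is passed in as a `runner` callable rather than fixed, since the article does not say which tool was used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_batch(hosts: list[str], command: str,
              runner: Callable[[str, str], int],
              max_workers: int = 10) -> dict[str, int]:
    """Run command on every host in parallel and return exit codes per host."""
    def run_one(host: str) -> int:
        try:
            return runner(host, command)
        except Exception:
            return -1  # treat any transport error as a failed host

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(run_one, hosts))  # map() keeps the input order
    return dict(zip(hosts, codes))


def failed_hosts(results: dict[str, int]) -> list[str]:
    """Hosts whose command did not exit with status 0."""
    return [host for host, code in results.items() if code != 0]


if __name__ == "__main__":
    # Stand-in runner for demonstration; it pretends web2 is unreachable.
    def fake_runner(host: str, command: str) -> int:
        if host == "web2":
            raise OSError("unreachable")
        return 0

    results = run_batch(["web1", "web2", "web3"], "uptime", fake_runner)
    print(failed_hosts(results))
```

In practice the runner might wrap something like `subprocess.run(["ssh", host, command]).returncode`. The sketch also shows the limitation noted above: it speeds up execution but does nothing to enforce review or observation rules.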

2) Job responsibilities

At this time, the operation and maintenance team also takes on some server monitoring work, and is responsible for LVS (layer 4), Nginx (layer 7) and other load balancing work that is independent of business logic.

Service changes are still mostly manual, or done with some simple batch scripts.

Monitoring focuses on server status and resource usage, with little monitoring of the status of service applications, and mostly relies on open source systems such as Nagios and Cacti.
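The kind of server-level check described here (status and resource usage, not application state) can be sketched with the Python standard library. The metric names and thresholds below are illustrative assumptions, not values from the article, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os
import shutil


def collect_metrics(path: str = "/") -> dict[str, float]:
    """Gather two basic host metrics: disk usage ratio and 1-minute load per CPU."""
    usage = shutil.disk_usage(path)
    load1, _, _ = os.getloadavg()  # Unix only
    return {
        "disk_used_ratio": usage.used / usage.total,
        "load_per_cpu": load1 / (os.cpu_count() or 1),
    }


def find_alerts(metrics: dict[str, float],
                thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that exceeds its threshold."""
    return [
        f"{name}={value:.2f} exceeds {thresholds[name]:.2f}"
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    ]


if __name__ == "__main__":
    # Assumed thresholds: alert above 90% disk usage or load above 1.0 per CPU.
    limits = {"disk_used_ratio": 0.9, "load_per_cpu": 1.0}
    for message in find_alerts(collect_metrics(), limits):
        print("ALERT:", message)
```

Systems such as Nagios and Cacti run checks of this kind on a schedule and add alerting and graphing on top.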

3. Platform management phase

1) Business scale

At this stage, we decided to start building an operation and maintenance platform to carry standards and processes through the platform, so as to liberate manpower and improve quality.

At this stage, service change actions are abstracted into unified standards, covering the operation method, the service directory layout, the way services are run, and so on.

The platform constrains the operation process, for example by enforcing the 10-minute observation of the first server mentioned above and by requiring every program's start-stop interface to include start, stop, reload and so on.

A pause checkpoint is enforced in the platform: after the operation on the first server is completed, operation and maintenance staff must fill in the corresponding check items before subsequent deployment actions can continue.
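The enforced pause checkpoint can be sketched as follows. This is a minimal illustration, not the platform's actual design: `rollout`, `deploy` and `confirm_checkpoint` are hypothetical names, and the check items are reduced to a single yes/no callback.

```python
from typing import Callable


def rollout(hosts: list[str],
            deploy: Callable[[str], None],
            confirm_checkpoint: Callable[[str], bool]) -> list[str]:
    """Deploy to the first host, stop at a checkpoint, then deploy to the rest.

    Returns the hosts that were actually deployed to.
    """
    if not hosts:
        return []
    done = []
    first, rest = hosts[0], hosts[1:]
    deploy(first)
    done.append(first)
    # The platform refuses to continue until the check items are confirmed.
    if not confirm_checkpoint(first):
        return done
    for host in rest:
        deploy(host)
        done.append(host)
    return done


if __name__ == "__main__":
    log = []
    # An operator who rejects the checkpoint stops the rollout after one host.
    print(rollout(["web1", "web2", "web3"], log.append, lambda host: False))
    # An operator who confirms it lets the rollout finish.
    print(rollout(["web1", "web2", "web3"], log.append, lambda host: True))
```

Moving the rule into code is the point of this stage: the observation step can no longer be skipped, because the remaining hosts are unreachable without passing the checkpoint.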

2) Job responsibilities

As business scale and complexity keep growing, the operation and maintenance team gradually splits into two parts: application operation and maintenance, and system operation and maintenance.

Application operation and maintenance begins to take over online business and gradually takes on sorting out service monitoring, data backup and service changes.

As they come to understand the services more deeply, application operation and maintenance engineers become able to make some simple optimizations to them.

At the same time, to cope with the large number of service changes every day, we also began to write all kinds of operation and maintenance tools, which make batch changes easy for specific services.

As business scale grows, infrastructure failures caused by insufficient capacity planning or weak risk resistance become more frequent, forcing operators to devote more energy to multi-data-center disaster recovery and contingency plan management.

4. System self-scheduling phase

1) Working environment

With a larger number of services, more complex service relationships, and a large number of operation and maintenance platforms, the original way of transforming batch operations into platform operations is no longer suitable.

Service changes need to be abstracted to a higher level: each server is abstracted as a container, and the scheduling system deploys services to suitable servers according to resource usage.

The scheduling system also links up automatically with the surrounding operation and maintenance systems, such as the monitoring, log and backup systems.
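Placement by resource usage can be illustrated with a small sketch. It assumes each server advertises free CPU cores and free memory, and it uses a simple "most free resources" rule; the `Node` class and `place` function are hypothetical, and production schedulers use far richer policies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str           # hypothetical server name
    cpu_free: float     # free CPU cores
    mem_free_gb: float  # free memory in GB


def place(cpu: float, mem_gb: float, nodes: list[Node]) -> Optional[str]:
    """Pick the node with the most headroom that fits the service, or None."""
    candidates = [n for n in nodes
                  if n.cpu_free >= cpu and n.mem_free_gb >= mem_gb]
    if not candidates:
        return None
    best = max(candidates, key=lambda n: (n.cpu_free - cpu, n.mem_free_gb - mem_gb))
    # Reserve the resources so the next placement sees the updated usage.
    best.cpu_free -= cpu
    best.mem_free_gb -= mem_gb
    return best.name


if __name__ == "__main__":
    nodes = [Node("node1", 2, 4), Node("node2", 8, 16)]
    for _ in range(3):
        print(place(4, 8, nodes))
```

Choosing the node with the most headroom spreads load; packing services onto the fullest node that still fits would be the opposite trade-off, favouring utilization over spare capacity.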

Through the self-scheduling system, capacity can be scaled dynamically according to how the service is running, and common service failures can be handled automatically.

The work of operation and maintenance staff also moves forward into the product design stage, where they help R&D staff adapt services so that they can be connected to the self-scheduling system.
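Dynamic scaling can be sketched as a proportional rule: the replica count is scaled by the ratio of observed load to target load, then clamped to a range. This is an assumed rule chosen for illustration (it resembles the formula used by the Kubernetes Horizontal Pod Autoscaler); the article does not specify one, and the function name and bounds are hypothetical.

```python
import math


def desired_replicas(current: int, observed_load: float, target_load: float,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Scale the replica count by observed/target load, within fixed bounds.

    Loads are in the same unit, for example average CPU utilization in percent.
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    wanted = math.ceil(current * observed_load / target_load)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 4 replicas at 90% CPU with a 60% target: scale out.
    print(desired_replicas(4, 90, 60))
    # 4 replicas at 30% CPU with a 60% target: scale in.
    print(desired_replicas(4, 30, 60))
```

The upper bound keeps a metric spike from requesting unlimited capacity, and the lower bound keeps the service from scaling to zero.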

2) Job responsibilities

After the business scale reaches a certain extent, the open source monitoring system can no longer meet the business requirements in terms of performance and function.

With a large number of service changes and complex service relationships, the previous methods of manual recording and tool changes can not meet the business requirements in terms of efficiency and accuracy.

In terms of security, there have also been various incidents, large and small, forcing us to devote more energy to security defense.

Gradually, the operation and maintenance team formed the five major job categories described in section 1.3, each of which requires specialized personnel.

At this time, system operation and maintenance pays more attention to building and maintaining infrastructure, provides a stable and efficient network environment, and delivers servers and other resources to application operation and maintenance engineers.

Application operation and maintenance pays more attention to the running status and efficiency of services. Database operation and maintenance is a refinement of application operation and maintenance work, and focuses on automation, performance optimization and security defense in the database field.

Operation and maintenance R&D and operation and maintenance security provide various platforms and tools to further improve the work efficiency of operation and maintenance engineers and make business services run more stably, efficiently and safely.

1.3 Classification of Linux operation and maintenance work

1. Application operation and maintenance (SRE):

The application operation and maintenance staff is responsible for online service change, service status monitoring, service disaster recovery and data backup, routine service troubleshooting, fault emergency handling, etc.

Responsibilities are as follows: design review, service management, resource management, routine inspection, contingency plan management, data backup.

2. System operation and maintenance (SYS):

Responsible for construction of IDC, network, CDN and basic services (LVS, NTP, DNS)

Responsible for asset management, server selection, delivery and maintenance, network construction, LVS load balancing and SNAT construction

3. Operation and maintenance development

It develops operation and maintenance tools and platforms for application operation and maintenance.

The main platforms include: work order system, CMDB, monitoring system, ELK log system, CI/CD, LDAP, FAQ, training system, OpenStack platform

4. Database operation and maintenance (DBA):

Database operation and maintenance staff are responsible for data storage scheme design, database table design, index design and SQL optimization.

They also handle database changes, monitoring, backup and design. The details are as follows:

Design review, capacity planning, data backup and disaster preparedness, database monitoring, database security, database high availability and performance optimization

Automation system construction, operation and maintenance research and development, operation and maintenance platform, monitoring system, automatic deployment system

5. Operation and maintenance Security (SEC):

Operation and maintenance security is responsible for hardening the security of networks, systems and business.

It conducts routine security scanning and penetration testing, researches and develops security tools and systems, and handles security incidents as emergencies.

The contents of the work are as follows: security system establishment, security training, risk assessment, security construction, security compliance, emergency response.

1.4 Software and skills that Linux operators use on a daily basis

1. Operation and maintenance platforms and tools used by operation and maintenance engineers

Web servers: apache, tomcat, nginx

Monitoring: prometheus, zabbix, openfalcon, nagios, cacti

Automatic deployment: ansible, saltstack, puppet

Load balancing: keepalived, lvs, haproxy, nginx

Backup tools: rsync, wget

Problem tracking: netstat, top, tcpdump, last

Containers: docker, k8s, docker-compose, swarm

Security: kerberos, selinux, acl, iptables

Virtualization: openstack, xen, kvm
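As one example of scripting the tools listed above, the following sketch builds and runs an rsync backup command from Python. The paths and function names are hypothetical; the flags used (`-a`, `-z`, `--delete`, `--dry-run`) are standard rsync options, and the sketch defaults to a dry run so that nothing is changed until the command has been reviewed.

```python
import subprocess


def build_rsync_command(source: str, destination: str,
                        dry_run: bool = True) -> list[str]:
    """Build an rsync argument list that mirrors source to destination."""
    # -a preserves permissions and timestamps, -z compresses in transit,
    # --delete removes files at the destination that no longer exist at the source.
    command = ["rsync", "-az", "--delete"]
    if dry_run:
        command.append("--dry-run")  # only report what would be transferred
    command += [source, destination]
    return command


def run_backup(source: str, destination: str, dry_run: bool = True) -> int:
    """Run rsync and return its exit code (0 means success)."""
    return subprocess.run(build_rsync_command(source, destination, dry_run),
                          check=False).returncode


if __name__ == "__main__":
    # Hypothetical paths; print the command instead of running it.
    print(" ".join(build_rsync_command("/srv/app/", "backup-host:/data/app/")))
```

Passing the command as a list avoids shell quoting problems with unusual file names. Because `--delete` mirrors deletions, a dry run before the real transfer is a sensible precaution.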

2. Skills to be mastered by operation and maintenance engineers

Solid basic computer knowledge, including computer system architecture, operating system, network technology, etc.

A broad understanding of operating systems, networks, security, storage, CDN, DB, etc., including the principles behind them.

Programming ability: from developing operation and maintenance tools to developing large-scale operation and maintenance systems / platforms, good programming ability is required.

Data analysis ability: be able to sort out and analyze the data of the system, find problems and find solutions.

Rich system knowledge, including system tools, typical system architecture, common platform selection, etc.
