
Best Practices for Linux Operations and Maintenance

Updated 2025-04-21. Source: SLTechnology News&Howtos (Shulou.com), Servers.

This article collects questions and answers on best practices for Linux operations and maintenance. The content is concise and easy to follow, and I hope you get something out of it.

We face an ever-changing world: business requirements change, technology architectures change, open source tools and commercial systems are deployed side by side, and new tools and technical concepts keep emerging. Only a sound technical methodology can cope with these changes; without one, we are often at a loss when faced with new problems.

Automated operations and maintenance has been a hot topic in recent years, and the technology keeps improving. For technicians, the most important thing is therefore the way of thinking, and the willingness to adapt and change it. After all, technology is not the ultimate pursuit of operations staff; sound thinking is what they should practice throughout their careers.

First, to do good work you must first sharpen your tools: how to choose tools?

1. Can you recommend some open source tools for server security and monitoring? For monitoring there seem to be Nagios, Cacti, and Zabbix. Is there anything else you can recommend? How do you monitor security?

Each monitoring tool has its own focus. Zabbix supports both SNMP and its own agent, as well as custom templates, which makes it a good choice in most scenarios.

In addition, rather than treating Zabbix as a tool that only monitors server information, you can also use custom templates to monitor business-level metrics. Security monitoring divides into active scanning, such as Tenable Nessus, and IDS/IPS.

2. Which distribution and version do you run on servers in Linux operations: CentOS 5, CentOS 6, or Ubuntu? Why that version? What tests have you done?

At present we are mainly on CentOS 6.x. Each Linux distribution has its own characteristics. For example, new versions of Ubuntu are released quickly, so it is worth considering if you want fast kernel upgrades. CentOS has always been our main Linux distribution, mainly because of its stability and our familiarity with it.

3. Do you have any recommendations for caching? We usually use Redis or Codis. Is there any other open source software that is easier to use?

For data such as session IDs that does not need to be persisted, consider memcached, with a consistent hashing algorithm to distribute keys across nodes.
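As a rough illustration of the consistent hashing idea mentioned above, here is a minimal Python sketch of a hash ring. The class name, the node names, and the choice of MD5 with 100 virtual points per node are assumptions made for this example; real memcached client libraries ship their own implementations.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Map keys to nodes so that removing a node only moves that node's keys."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual points per node (assumed value)
        self._ring = []           # sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key):
        if not self._ring:
            raise ValueError("ring is empty")
        # Pick the first point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
    print(ring.get_node("session-12345"))
```

When a cache node is removed, only the session keys that lived on that node are remapped; keys on the other nodes stay where they were, which is the property that makes this scheme attractive for distributed caches.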

4. Besides Jenkins for continuous integration, what other useful tools are there for automated release?

The ones I know of are mainly Hudson and Jenkins, the latter being a fork of the former. These tools have a wealth of plug-ins, and using those plug-ins flexibly is the key.

5. A MySQL question: which of the three variants (official MySQL, Percona Server, MariaDB) do you recommend?

Our team usually uses the official version, mainly because of its support and ecosystem.

6. Are there any good tools for server log collection and analysis? ELK seems a little complicated and I don't know how to use it. Do you have any other recommendations?

ELK is indeed a widely used tool for log collection and analysis. Although it has a learning curve, it is still worth studying and trying.

7. Are the open source tools and scripts in the book available? Where can I download them?

I am still sorting out the scripts from the book; some of them can already be downloaded with git from https://github.com/xufengnju/books.git.

8. Is all your operations work based on Ansible? We used to manage with Chef and Puppet. Ansible seems very popular lately, but we have not used it in practice. Is it different to use?

Have you also worked with an IaaS platform in your operations, and do you have any experience to share?

The various batch management tools each have their own characteristics. Pick one according to your familiarity and actual business needs, and master it thoroughly.

Our current IaaS platform is self-developed and based on KVM.

Second, put it into practice: problems encountered in operations and maintenance

1. What scale of backend servers do LVS and HAProxy handle, for example how many applications and backend servers?

This depends on the type of application. In real business scenarios, you need to watch the number of connections, the packets per second (PPS), and the latency of load balancers such as LVS. If backend throughput is large, consider LVS's DR mode. In general, the load balancer is unlikely to be the bottleneck.

How are the number of connections, PPS, and latency of the load balancer measured?

This is not difficult to do with open source Zabbix templates or custom templates.

Is there a relevant set of commands for the statistics, or a detailed example?

For HAProxy, see the content of "29 HAProxy monitoring" on page 76 of our book. For Zabbix templates, see Chapter 12 of our book. Useful commands include ipvsadm, netstat, and so on.
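To make the PPS figure concrete, here is a minimal Python sketch that derives packets per second from two samples of /proc/net/dev on a Linux host. It is not taken from the book; the function names are hypothetical, and the output could be fed to a custom Zabbix item or any other collector.

```python
import time


def parse_net_dev(text):
    """Return {interface: (rx_packets, tx_packets)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 1 is received packets, field 9 is transmitted packets.
        counters[name.strip()] = (int(fields[1]), int(fields[9]))
    return counters


def packets_per_second(before, after, interval):
    """Compute (rx_pps, tx_pps) per interface from two counter samples."""
    rates = {}
    for iface, (rx1, tx1) in after.items():
        if iface in before:
            rx0, tx0 = before[iface]
            rates[iface] = ((rx1 - rx0) / interval, (tx1 - tx0) / interval)
    return rates


def sample_pps(interval=1.0, path="/proc/net/dev"):
    """Take two samples `interval` seconds apart and return the rates."""
    with open(path) as f:
        before = parse_net_dev(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_net_dev(f.read())
    return packets_per_second(before, after, interval)


if __name__ == "__main__":
    for iface, (rx, tx) in sample_pps().items():
        print(f"{iface}: rx {rx:.0f} pps, tx {tx:.0f} pps")
```

On an LVS director, connection and packet counters per virtual service are also available from ipvsadm (for example `ipvsadm -L -n --stats`), and sampling them twice in the same way gives per-service rates.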

2. Are there any good solutions or ideas for unified management (authentication, configuration, services) across multiple platforms (Unix, Linux, Windows)?

Let's start with authentication. Both Unix and Linux support authentication through OpenLDAP, and this can be considered compatible with Active Directory (AD) on Windows. For configuration and services, consider common open source products such as Ansible or Salt. The self-developed system we currently use is similar to Ansible.

3. How do you monitor services and business status?

Our monitoring system is self-developed. For games, a very important business metric is the number of players online; the monitoring system periodically polls the game servers to collect this number and chart it.

4. How do you manage the systems and configuration of each business module's machines in batch? At present we use Ansible for batch commands and scripts, and the business side uses an online SVN platform to manage programs and configuration. Is your CMDB platform self-developed?

We batch manage servers over SSH, in a way similar to Ansible. The CMDB manages the basic data and is self-developed.
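The SSH-based approach can be sketched in a few lines of Python. This is only an illustration of the general idea, not the self-developed system described above: the `ops` user, the host list, and the function names are assumptions, and key-based login to every host is presumed to be in place.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_ssh_command(host, remote_command, user="ops", timeout=10):
    """Build a non-interactive ssh invocation (the 'ops' user is an assumption)."""
    return [
        "ssh",
        "-o", "BatchMode=yes",              # never prompt for a password
        "-o", f"ConnectTimeout={timeout}",  # fail fast on unreachable hosts
        f"{user}@{host}",
        remote_command,
    ]


def run_on_host(host, remote_command, runner=subprocess.run):
    """Run one command on one host and return (host, exit code, stdout)."""
    cmd = build_ssh_command(host, remote_command)
    result = runner(cmd, capture_output=True, text=True, timeout=60)
    return host, result.returncode, result.stdout


def run_on_all(hosts, remote_command, workers=20, runner=subprocess.run):
    """Run the same command on many hosts in parallel, preserving host order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda h: run_on_host(h, remote_command, runner), hosts))


if __name__ == "__main__":
    # Hypothetical host list; in practice it would come from the CMDB.
    for host, code, out in run_on_all(["10.0.0.1", "10.0.0.2"], "uptime"):
        print(host, code, out.strip())
```

Keeping the host inventory in the CMDB and passing it in as data, rather than hard-coding it, is what lets a thin layer like this grow into a management platform.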

5. Have you ever used traffic mirroring, that is, mirroring online traffic into a test environment so that you can test with real user data? I would like to understand how to implement it from scratch.

For the principles behind traffic mirroring, see the material on NIC promiscuous mode and raw sockets in Chapter 15 of "Linux Operations and Maintenance Best Practices". After reading that section, you should be able to write your own implementation. I have not put this into practice myself; you can look at the tcpcopy project.

6. How should the system and network be tuned on CentOS 6? Take this parameter in /etc/sysctl.conf:

net.ipv4.tcp_max_tw_buckets = 6000

How should it be set? Is bigger always better? Should I set it to 16000?

net.ipv4.tcp_max_tw_buckets = 16000

System tuning should be targeted. tcp_max_tw_buckets caps the number of sockets held in the TIME_WAIT state. If the system has many TIME_WAIT connections, you can consider adjusting net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle (be careful with tcp_tw_recycle: it is known to break connections from clients behind NAT, and it was removed in Linux 4.12). In addition, using persistent connections is an effective way to reduce the number of connections in that state.
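Before changing the limit, it helps to measure how many TIME_WAIT sockets the host actually has. The following Python sketch counts IPv4 TCP sockets by state from /proc/net/tcp; the function names are made up for this example, and the state codes are the standard kernel values.

```python
from collections import Counter

# Hex state codes used in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(text):
    """Count sockets per TCP state from the content of /proc/net/tcp."""
    counts = Counter()
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue
        # Column 3 ("st") holds the state as a two-digit hex code.
        counts[TCP_STATES.get(fields[3].upper(), fields[3])] += 1
    return counts


def read_tcp_states(path="/proc/net/tcp"):
    with open(path) as f:
        return count_tcp_states(f.read())


if __name__ == "__main__":
    states = read_tcp_states()
    print("TIME_WAIT sockets:", states["TIME_WAIT"])
```

If the TIME_WAIT count stays well below the configured tcp_max_tw_buckets value, raising the limit changes nothing; the count is the number to watch before and after any tuning.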

7. With more than 100 servers, most of them serving production traffic, how do you upgrade? Apart from maintenance downtime, is there a better approach?

If the business is well partitioned, for example as a stateless microservice architecture, you can do a grayscale (gradual) upgrade through the front-end load balancer. If the application is not well partitioned, for example a single instance or a centralized database, it is more troublesome.
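A grayscale upgrade behind a load balancer usually proceeds in small batches: take a few servers out of rotation, upgrade them, check their health, and put them back before moving on. The Python sketch below shows only that control flow; the `drain`, `upgrade`, `healthy`, and `restore` callbacks are hypothetical hooks that would wrap whatever load balancer and deployment tooling you actually use.

```python
def rolling_batches(servers, batch_size):
    """Yield the server list in consecutive batches of at most batch_size."""
    for i in range(0, len(servers), batch_size):
        yield servers[i:i + batch_size]


def rolling_upgrade(servers, batch_size, drain, upgrade, healthy, restore):
    """Upgrade servers batch by batch; stop at the first unhealthy batch.

    Returns (upgraded_servers, failed_batch). A failed batch is left out of
    rotation so that it never serves traffic in a broken state.
    """
    done = []
    for batch in rolling_batches(servers, batch_size):
        for server in batch:
            drain(server)    # remove from the load balancer pool
        for server in batch:
            upgrade(server)  # deploy the new version
        if not all(healthy(server) for server in batch):
            return done, batch
        for server in batch:
            restore(server)  # put back into rotation
        done.extend(batch)
    return done, []


if __name__ == "__main__":
    servers = [f"app{i:02d}" for i in range(1, 7)]
    done, failed = rolling_upgrade(
        servers, 2,
        drain=lambda s: print("drain", s),
        upgrade=lambda s: print("upgrade", s),
        healthy=lambda s: True,
        restore=lambda s: print("restore", s),
    )
    print("upgraded:", done, "failed batch:", failed)
```

Stopping at the first unhealthy batch keeps the blast radius to one batch, which is the main reason this pattern works well for stateless services and poorly for a single centralized database.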

8. How many FARM-like groups can LVS and HAProxy each support?

The FARM you mention is probably a term from hardware load balancers, and should correspond to the concept of a load-balanced server group. LVS and HAProxy have no hard limit on the number of such groups, but in practice there are usually not too many, because of maintenance costs and the overhead of master/slave failover in an HA environment.

9. The system is CentOS release 6.5 (Final) with 16 GB of memory, and it does not seem to reclaim memory automatically. I wrote a shell script that reclaims memory whenever it finds less than 1 GB free.

You can look at the swap and swappiness settings in sysctl.
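One thing worth checking before forcing reclamation: on Linux, memory used for buffers and page cache is given back on demand, so a low "free" figure alone does not mean the system is short of memory. The Python sketch below estimates usable memory as MemFree + Buffers + Cached, a rough approximation suited to older kernels such as the one in CentOS 6 that lack the MemAvailable field, and reads the current swappiness. The function names are assumptions for this example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: value}; most values are in kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def estimate_available_kb(info):
    """Rough estimate of memory usable without swapping (an approximation)."""
    return info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)


def read_swappiness(path="/proc/sys/vm/swappiness"):
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print("free kB:", info["MemFree"])
    print("estimated available kB:", estimate_available_kb(info))
    print("vm.swappiness:", read_swappiness())
```

If the estimated available figure is comfortably large while "free" is small, the memory is most likely holding cache, and a script that drops caches at a 1 GB threshold may be costing performance rather than saving it.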

10. With many ECS/VPS instances, usually running CentOS, many bastion hosts offer features such as SSH key distribution and command dispatch, but few support Windows. Are there other open source tools or methods for managing Linux and Windows machines together?

In one of my talks I covered a method for batch management of heterogeneous systems, which you can refer to:

http://www.build.net/greatops/453250.html . In addition, you can look at Ansible or Salt.

Third, automated operations and maintenance, and the engineer's mindset

1. What counts as automated operations and maintenance, and how do we judge that a server environment has achieved it? What does it include? With automated release, can you roll back if something goes wrong?

Operations automation means different things to different people. My understanding is that it needs to connect every step from code development to production launch, including builds, automated testing, automated release, and automated monitoring.

Within this big picture, you can pick one or two pain points to start automating, according to your own working environment and level of automation, and eventually build up a complete system.

2. How should automated operations and maintenance be done, and what aspects need to be considered? What I have in mind are deployment, daily inspection and maintenance, automatic fault handling, and alerting. What else should I pay attention to? Also, with the rapid development of IT, many new technologies have emerged, such as the currently popular OpenStack, Docker, and big data. Is there a basic approach, or a set of rules and processes, that can meet their operations needs? Getting these technologies to work is only a small step; most of the effort is the operations work after launch. What I really want is a way of thinking that lists the problems we have encountered and how to deal with them.

It's a good question, but a big topic. Let me share my understanding first. ITIL, the traditional operations service process, still has value, but it needs to be adapted by incorporating DevOps ideas so as to combine the strengths of both. We should embrace change and approach operations with an open attitude. What stays the same is that creating value for the business is the ultimate goal of operations and maintenance.

3. The most important parts of operations automation are configuration management, state management, and change management. Can you share any good approaches to configuration management?

I think configuration management should be divided into "infrastructure resource configuration management" and "software/application configuration management".

The former falls within the scope of a CMDB in the usual sense; you can adapt an open source CMDB solution to your own business characteristics.

The latter is a combination of systems (for example, integration with version control) and processes (for example, linkage with change management). Our practice involves both aspects.

4. Do you lead the functional design and implementation of the operations automation platform, and is it built with Python? Also, is it developed from scratch or as secondary development on top of something like SaltStack?

At the bottom layer, SSH is used to establish the management channel to servers; on top of it, a management interface written in PHP wraps common operations such as password changes and script distribution and execution. It is completely self-developed.

Fourth, good security measures matter: security-related questions

1. Operations cannot be separated from security, and server security is very important. Does the book cover operations security and how to control it?

The book has a security topic. Security is a huge field, and the book mainly covers measures for securing Linux systems. For other security topics, such as social engineering and intrusion detection, you may need to read more specialized books. You can first see whether "Linux Operations and Maintenance Best Practices" meets your basic security needs. Thank you for your support.

2. Is there an open source solution for web security monitoring that can block likely exploits at the access layer? Suricata, perhaps?

I have not studied or used Suricata. The web server security section in Chapter 11 of "Linux Operations and Maintenance Best Practices" mentions several tools that you can refer to. ModSecurity rules, however, should be tested strictly and carefully before going live, to avoid false positives. In addition, I recommend regular security scans of the production environment, for example with Tenable Nessus. Manual penetration testing by security experts is also necessary.

Fifth, Docker is hot: using it in operations and maintenance

1. Is the recently popular Docker technology used in NetEase's game operations, and where? What problems have you hit and how did you solve them?

We are currently evaluating Docker, and it is used only in a small number of game tests. The network model and storage scheme need to be chosen according to the business model. Docker will change the traditional way of doing operations, so you have to consider the challenges of integrating it with the existing operations systems and of adjusting operational habits. By the way, I am not from NetEase; I currently work at Shanda Games.

2. Does Docker have a far-reaching impact on operations and maintenance?

Docker does affect operations, including through continuous delivery, microservices, and the DevOps philosophy. As operations staff, we should embrace this change and meet these challenges through continuous learning and practice.

3. Why has no mature Docker solution in China been published in detail?

Docker is still new. Every company uses it in different scenarios and patterns, and each tends to build its own management and scheduling systems on top of it.

Sixth, not every comparison does harm: engineers are only looking for the best option

1. What are the differences and similarities among operations for game servers, website servers, and app servers?

This question is very representative. The difference is that website and app operations deal mostly with general-purpose open source software, while game operations deal mostly with self-developed programs.

What they have in common is that all of them require knowledge of operating systems, software, hardware, and networking, as well as troubleshooting methods and capacity planning. All of them also need to adopt the thinking and systems of operations automation. The last two chapters of "Linux Operations and Maintenance Best Practices" describe the systems and technologies related to game operations.

2. For operations staff, what advantages does a scripting language like Python have over shell for system management and monitoring?

As a high-level programming language, Python has very rich libraries, both in the standard library and from third parties, so most of the time there is no need to reinvent the wheel.

It also offers better control and retry mechanisms than shell, such as setting timeouts on sockets.
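As a small illustration of the timeout and retry point, the following Python sketch checks a TCP port with an explicit socket timeout and wraps the check in a retry helper with exponential backoff. The helper names and default values are assumptions for this example.

```python
import socket
import time


def retry(func, attempts=3, delay=0.5, backoff=2.0,
          exceptions=(OSError,), sleep=time.sleep):
    """Call func() up to `attempts` times, sleeping longer after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(delay)
            delay *= backoff


def check_tcp_port(host, port, timeout=3.0):
    """Return True if a TCP connection succeeds within `timeout` seconds."""
    with socket.create_connection((host, port), timeout=timeout):
        return True


if __name__ == "__main__":
    try:
        retry(lambda: check_tcp_port("127.0.0.1", 22))
        print("port reachable")
    except OSError as exc:
        print("port unreachable after retries:", exc)
```

Connection timeouts and refusals are both subclasses of OSError in Python 3, so a single exception tuple covers the common failure modes; doing the same in shell usually means juggling exit codes and external timeout commands.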

3. What are the advantages of CentOS over Ubuntu? Why do most servers use CentOS?

Each Linux distribution has its own characteristics. For example, new versions of Ubuntu are released quickly, so it is worth considering if you want fast kernel upgrades. CentOS has always been our main Linux distribution, because of its stability and our familiarity with it.

When choosing a distribution, you should consider its ecosystem, such as upstream and downstream support. Another point is how easy it is to recruit operations staff: in China, somewhat more people are familiar with CentOS.

4. If there is only one server running multiple applications, is it better to use LVS or Nginx for load balancing? Is there a big difference?

Are the back-end applications based on HTTP or HTTPS? If so, and throughput is not large, Nginx will do. For TCP applications that are not HTTP or HTTPS, LVS is recommended. If HTTP or HTTPS throughput is particularly large, use LVS in DR mode.

Seventh, you need backups: backup-related questions

1. What should a backup system look like at the scale of 1,000 machines?

With 1,000 servers, you first need to distinguish business types. If there is a single type, backup is easier. If there are many types, the things to consider include how often the databases are updated (full plus incremental backups, or full backups only?), the size of the data to back up, and the requirements for centralized archiving of the data.

2. How do you do backups? What is an elegant backup plan for hundreds of terabytes of images and attachments?

For online backup, consider using erasure coding to increase reliability, so that the cost of backup storage does not grow too high. Also consider offline backups, such as tape.
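To show why erasure coding keeps storage cost down, here is a deliberately simple Python sketch using a single XOR parity chunk, which tolerates the loss of any one chunk. Production systems generally use stronger codes such as Reed-Solomon that survive several losses; the function names here are made up for the example.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, k):
    """Split data into k equal chunks (zero-padded) plus one XOR parity chunk."""
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    parity = bytes(chunk_len)
    for chunk in chunks:
        parity = xor_bytes(parity, chunk)
    return chunks + [parity]


def recover(chunks):
    """Rebuild the single missing chunk (marked as None) from the others."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one missing chunk")
    present = [c for c in chunks if c is not None]
    rebuilt = bytes(len(present[0]))
    for chunk in present:
        rebuilt = xor_bytes(rebuilt, chunk)
    repaired = list(chunks)
    repaired[missing[0]] = rebuilt
    return repaired


if __name__ == "__main__":
    data = b"hello backup world"
    shards = encode(data, 4)
    shards[2] = None  # simulate losing one storage node
    restored = recover(shards)
    print(b"".join(restored[:4])[:len(data)])
```

With four data chunks and one parity chunk the storage overhead is 1.25 times the original size, compared with 2 or 3 times for keeping full replicas, which is the cost argument behind using erasure coding for very large image and attachment stores.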

Eighth, the road is long: the career of an operations engineer

1. What do you think will be the core of operations in the future: automation, prediction, or something else?

I think future operations should be intelligent: capacity planning, scaling up and down, and troubleshooting, which people do today, should all be made intelligent. The job of operations staff is to encode their skills into programs and instill them into machines. Of course, the ideal is rosy and the reality is harsh; this will take unremitting effort.

2. I have worked as a tester for more than four years and also know something about operations; sometimes I need operations skills to maintain my own test environments at work. From Windows Server to Linux, I have learned and summarized a lot. I left my last company just as it started to deploy Docker, which is a pity, and I have had no time to practice it since. At present I use Nginx for load balancing with a Tomcat deployment. I am very busy at work but really want to improve my operations skills, and I don't know where to start. Any advice?

From your description, you are currently a part-time operations engineer. I suggest that, besides building environments, you study the principles behind them, and maintain them with automation scripts. I believe you also have some programming experience, which will help your later practice. In addition, read more of the operations case studies written up by others, so as to avoid detours.

3. Operations technology is quite broad and varied. How should we deal with that? It feels as if you need to know a bit of everything. Do you have any learning suggestions for operations engineers with 5-6 years of experience?

As you say, the technical scope required in operations is indeed wide. In my opinion, engineers who have worked for a while can consider several directions:

DevOps practice (strengthen your programming ability, systematically learn a high-level programming language, and automate operations work).

Focus on your own technical weak spots, for example by studying networking systematically.

Read good technical books on operations and learn from other people's practical experience.

4. Because our operations system has comprehensive data collection, automated processing, alerting, and automatic recovery, we have combined operations with BI. We extended the operations tools and architecture, integrated a mature BI system into the operations platform, and freed up the business analysts, who now rely on this platform for routine business analysis, reporting, and data monitoring. In our case, operations has gradually changed from a single-layer platform into a framework that can be applied to whatever scenario needs it. Technology changes all the time, but the most important thing is not technology; it is the idea of using technology to provide services.

Besides BI, what other business scenarios can operations thinking be combined with to generate value in new directions?

I agree with your idea and practice of "using technology to provide services". Personally, I think the ultimate goal of operations may be "self-operating systems without operations engineers", or intelligent operations, which means the deep integration and application of AI in the operations field. Continuously optimized capacity planning algorithms and automatic resource scheduling on public clouds are both part of this. Of course, there is still a long way to go to reach this goal.

5. What changes has DevOps brought to operations? Can you describe them briefly?

The concept of DevOps has been around for a long time. On the whole, I think the change it brings is that continuous delivery requires connecting the whole chain of development, testing, and deployment, which places higher demands on operations automation. To cope with this, we must master an operations automation framework and some programming skills suited to our business scenario. In addition, operations staff need to embrace change and collaborate with an open attitude.

6. Which Linux distribution is the most widely used today? And for Linux operations, do we need to learn a language such as Python to be considered really good at it?

Don't hesitate: start learning to program right away. Perl or Python, either is fine as long as you know one of them well. I will not compare the pros and cons of Perl and Python here. Keep writing your own code (together with other people's frameworks and libraries) to solve repetitive operations problems, and you will grow faster. As for distributions, CentOS is used more often.

Chapter 18 of "Linux Operations and Maintenance Best Practices" covers system automation programming with Perl, which you can look at first. If you are interested, start right away.

7. How did you keep at writing a book? Did you write down the key issues from daily work, a little each day, and then summarize them? Are there any tools for writing books, or is it all written in Word? Can you share how to write an operations book?

This is a very good question, and something I want to share. The material for a book comes from day-to-day accumulation. I suggest writing more well-structured documents in your regular work; for the Word format, you can refer to the layout of our book. The three most important points are:

Keep the original Visio diagrams, not just exported images, because the typesetting may need adjusting.

Record fault scenes in as much detail as possible: the symptoms and the analysis process. Supporting logs and packet capture files should also be kept.

Save scripts by category so that they can be found easily.
