How do six people operate and maintain 10,000 servers?

2025-07-12 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

At the GOPS2017 Beijing event, Zheng Songkuan from Qunar gave a talk entitled "The Evolution of Qunar Application Operation and Maintenance Automation", sharing the obstacles encountered while building automation, how the team overcame them, which pitfalls they hit, and how they got out of them.

I joined Qunar in 2013 and have worked on operation and maintenance development ever since. Operations development at Qunar has one distinctive feature: every developer also acts as PM and QA, and we draw no line between front-end and back-end work. In popular terms, we are all full-stack engineers. My work over the past few years has been fairly scattered; I go wherever I am needed.

In summary, it mainly covers the design, development, and operation of the host management, application management, monitoring, and alarm platforms. Let me briefly introduce our operation and maintenance team.

First, our operation and maintenance team is responsible for the operation and maintenance of all the company's servers, networks and other hardware platforms.

Second, part of the team handles daily operations work, including QVS deployment, Nginx configuration, application launch support, and storage deployment. This work also includes alarm notification and fault notification and tracking.

Third, around 2013, we began to develop our own operation and maintenance platform.

Fourth, the team is responsible for the company's intranet applications, including the OA system, the HR system, and the IT asset management platform.

Qunar application operation and maintenance platform

First, a brief introduction to Qunar's application operation and maintenance platform.

We know that the life cycle of an application, from development to running online, mainly involves four parts:

The first part is resource management for the application. These resources include hosts, the storage (such as object storage) needed for the application's pictures and files, the network bandwidth used for application communication, and the computing resources the application needs.

In the second part, to make application development more efficient and more standardized, the company provides common middleware, including log collection, application configuration and registration, collection of monitoring and alarm metrics, and application call-path tracing.

In the third part, to put an application online, we need to manage its code and build and test the version to be released, which requires continuous integration and continuous delivery (CI/CD).

In the fourth part, once an application is online, we need to monitor, alarm on, and analyze its performance and business metrics, so we need the corresponding monitoring, alarm, and log analysis platforms.

Qunar's business grew step by step, from dozens of machines to tens of thousands. We ran into many problems along the way and adopted different solutions at different stages.

In summary, Qunar has gone through four stages:

In the first stage, the number of machines was small and most of the work was reactive. For example, when we found a problem with an application, we logged in to the relevant machine and ran Linux commands by hand to check resource usage, such as whether CPU usage was too high or whether the disk was full. This stage involved no complex scripts; operation was basically manual, at a scale of dozens of machines.

In the second stage, as the scale grew, we wrote many scripts by hand. With them we could run tasks in batches and deploy applications and monitoring to multiple machines at once. We call this the script operation and maintenance stage. By combining scripts with open source systems, we could operate hundreds of machines.

In the third stage, the scale kept growing and scripts alone were no longer enough. The scripts were scattered and had not been properly orchestrated. Because the order in which scripts run matters, the lack of proper orchestration could cause problems.

So we developed some systems, hooked the relevant scripts into them, and orchestrated the scripts into individual operations. For example, creating or deleting a machine is one operation, which can be built into a system that operators use through an interface.

We call this the discrete-systems stage: data was basically not shared between the systems. The number of hosts that could be operated at this stage was also limited, a few thousand at best.

In the fourth stage, Qunar's fleet exceeded 10,000 machines. At that point we asked whether we could design our operation and maintenance platform from a higher vantage point: provide a one-stop service for operations work and, on that basis, connect the data so that the systems can interact and do automated work. This stage, the construction of the operation and maintenance platform, is what I want to talk about today.

Three key points of the application operation and maintenance platform

While building the operation and maintenance platform, we ran into many difficulties and pitfalls. From them we distilled three key points: host management, monitoring and alarms, and data exchange.

Host management

Qunar's host management system is built on OpenStack and DNSDB. OpenStack schedules and creates virtual machines, and DNSDB is our company's domain name management system. Through DNSDB, we combine a machine's name, department, purpose, and machine room into a unique domain name, and we use that domain name to identify the host.
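
The naming idea can be illustrated with a small sketch. The talk does not give the actual DNSDB naming layout, so the label order, the example.internal suffix, and the names HostIdentity and register are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class HostIdentity:
    name: str        # short machine name, e.g. "web1"
    department: str  # owning department
    purpose: str     # what the machine is used for
    room: str        # machine room code

    def fqdn(self, suffix: str = "example.internal") -> str:
        # Hypothetical layout: name.purpose.department.room.suffix
        labels = [self.name, self.purpose, self.department, self.room]
        for label in labels:
            if not label or "." in label:
                raise ValueError(f"invalid label: {label!r}")
        return ".".join(label.lower() for label in labels) + "." + suffix


def register(registry: Dict[str, HostIdentity], host: HostIdentity) -> str:
    """Add the host to the registry, rejecting duplicate domain names."""
    fqdn = host.fqdn()
    if fqdn in registry:
        raise KeyError(f"duplicate host name: {fqdn}")
    registry[fqdn] = host
    return fqdn
```

Because the domain name is derived from the four fields, two hosts can only collide if all four fields match, and the registry rejects that case.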

On top of OpenStack and DNSDB, we wrote a large number of scripts and tools, organized and wrapped them into operations, and attached permissions to those operations. We store host information, workflow management, permission configuration, and the operation logs used for queries in a database. Finally, we expose the host management system's interface to the operators, who manage our hosts through it.

With the host management platform, operators can easily create and destroy hosts and view host information such as configuration and warranty expiry. Whenever a new machine is added, we attach monitoring and alarms to it by default, and the person in charge is notified when the machine raises an alarm.

In fact there was still a problem, and a big one: the system was built for operators, and developers had no permission to log in to it. If a developer wanted a host created, they had to send an email to OPS. When OPS created the host, they did not record precisely who was responsible for it. They might write it in the remarks, which became inaccurate over time because the person responsible may have left or changed jobs. This happened often.

The department responsible for a machine was not well recorded either. In many cases the department appeared only in the host name, yet a machine might be transferred to another business line while in use, so the department information we obtained was also inaccurate. Another problem was that the system was open only to operators, with little business line participation. This made the host information as a whole inaccurate, because OPS staff are limited in number and cannot keep all of it up to date.

So we came up with a solution: the application tree.

Qunar divides its business lines into BUs by functional area. In the application tree, the BU is the first level; below it are departments, and departments may contain smaller departments, so there can be several such levels. The last level is the application that a department is responsible for. We treat every level as a node, and on each node we can bind hosts, add responsible persons, and add approvers. I will introduce the permissions and role of approvers below. With the application tree, business line developers take part in host management, and the information about responsible persons and departments becomes more accurate.
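
A minimal sketch of such a tree, assuming a simple in-memory structure. The class name Node, its fields, and the helper owners_of are hypothetical and are not taken from Qunar's implementation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    name: str
    owners: List[str] = field(default_factory=list)     # responsible persons
    approvers: List[str] = field(default_factory=list)  # may approve requests
    hosts: List[str] = field(default_factory=list)      # hosts bound to this node
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

    def walk(self) -> Iterator["Node"]:
        # Depth-first traversal of this node and everything below it.
        yield self
        for child in self.children:
            yield from child.walk()

    def all_hosts(self) -> List[str]:
        # Every host bound at or below this node, e.g. for a BU-level count.
        return [host for node in self.walk() for host in node.hosts]


def owners_of(root: Node, host: str) -> List[str]:
    """Return the owners of the first node that has the host bound to it."""
    for node in root.walk():
        if host in node.hosts:
            return node.owners
    return []
```

With a BU node at the root, owners_of answers "who is responsible for this machine", and all_hosts on any department node gives the machine list used for capacity planning.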

When something is wrong with a machine, it is now easy to find the person in charge quickly. If a physical host is about to go out of warranty, I need to find the person in charge of every virtual machine on it and tell them to take action, such as taking the virtual machine or the application offline. This avoids many failures caused by hosts running past their warranty. Because the responsible person is recorded more accurately, alarm notifications for a machine go to that person by default, and they handle the machine's basic hardware alarms.

Resource consumption is tallied every quarter, and machine purchases for the next quarter are planned and budgeted. Given a higher-level node such as a BU, the application tree tells us which machines belong to it and how much it grew this month, so we can predict how many machines to buy next quarter and set a more reasonable budget. With the application tree, the relationship between responsible persons, departments, and machines is fairly clear.

But one problem remained. Applying for resources still went through OPS, and OPS was also responsible for adding accounts. What does a developer do to scale out an application or add an account on a machine? They send an email to the OPS team saying, for example, that they want to add two hosts to the application, or add an account on a certain host. What is wrong with this? OPS cannot be online watching requests in real time, so responses are slow; email is inconvenient to search and can get lost over time; and problems are hard to trace.

To solve this we built two systems: a host application system and an account application system.

Both systems are built on host management, the application tree, and the approval center. They call the interfaces of these three systems to orchestrate the host application and account application workflows. Who has the authority to apply for a host? The person in charge of a node on the application tree can apply for hosts for that department or application, and the approvers on the node can approve requests for hosts under that node. This way OPS does not need to be heavily involved, and developers can apply for hosts and accounts on their own.
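
The permission rule described here (owners apply, approvers approve) can be sketched as follows. This is an illustrative model only; TreeNode, HostRequest, submit, and approve are hypothetical names, and the real systems call the approval center through interfaces that the talk does not describe.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class TreeNode:
    name: str
    owners: Set[str] = field(default_factory=set)
    approvers: Set[str] = field(default_factory=set)


@dataclass
class HostRequest:
    node: TreeNode
    applicant: str
    count: int
    status: str = "pending"


def submit(node: TreeNode, applicant: str, count: int) -> HostRequest:
    # Only a responsible person of the node may apply for its hosts.
    if applicant not in node.owners:
        raise PermissionError(f"{applicant} is not an owner of {node.name}")
    if count <= 0:
        raise ValueError("host count must be positive")
    return HostRequest(node=node, applicant=applicant, count=count)


def approve(request: HostRequest, approver: str) -> HostRequest:
    # Only an approver configured on the node may approve the request.
    if approver not in request.node.approvers:
        raise PermissionError(f"{approver} cannot approve for {request.node.name}")
    if request.status != "pending":
        raise ValueError(f"request is already {request.status}")
    request.status = "approved"
    return request
```

Once a request reaches the approved state, a real system would hand it to host management for creation; that step is omitted here.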

Finally, we built an interface and exposed it to developers, who can apply for hosts and accounts through it. The application tree, host management, host application, and account application platforms form a closed loop. At its core is the application tree node, which connects the four parts.

But application tree nodes can change. For example, a portal application was initially placed under OPS development, and one day we found it was in the wrong place and belonged directly under OPS. In that case we had to move portal from OPS development to OPS.

In addition, as the business grew, the portal application became larger and had to be split into several parts, such as portal-web and portal-api. What do such changes to tree nodes lead to? Every system records application tree nodes, so every system has to synchronize each node change. The application tree node is effectively a stateful module in a distributed system, and that state makes it hard to distribute. Extending application tree nodes to more systems would be very difficult and would keep running into synchronization problems.

How can this be solved? Consider how an ordinary resident's information is shared between the public security system, the household registration system, the banking system, and so on. In real life there is a very good practice: the ID card. An ID card carries a unique ID that identifies the person and never changes. We wanted the same kind of unique, unchanging ID to identify an application.

How do we find such an ID? The first option is to identify applications with an auto-increment ID or a UUID from the database. This guarantees that the application ID is unique and never changes, but because an auto-increment ID or UUID carries no readable meaning, it is hard for developers to get hold of and remember.

To use an auto-increment ID or UUID, you first have to look it up in another system, and only then can you interact with other systems, which is very inconvenient. The second option borrows from the ID card number, in which, for example, 110 stands for Beijing, followed by codes for the county or district and then the date of birth.

Borrowing from the ID card, we identify each application with an Appcode. An Appcode is divided into parts by underscores: the first part is the department the application belongs to, and the second is a description of the application, which can be fairly long. Using an Appcode in place of the application tree node gives us an identifier that is unique and unchanging, easy to remember, and convenient to communicate. We finally chose this second option.
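
A short sketch of how such an identifier might be validated and kept stable while the tree changes. The exact Appcode grammar is not given in the talk, so the regular expression, the example value ops_portal_web, and the AppRegistry class are assumptions.

```python
import re
from typing import Dict, Tuple

# Assumed grammar: lowercase department, then one or more "_"-separated parts.
APPCODE_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")


def parse_appcode(appcode: str) -> Tuple[str, str]:
    """Split an Appcode into (department, description)."""
    if not APPCODE_RE.match(appcode):
        raise ValueError(f"malformed Appcode: {appcode!r}")
    department, description = appcode.split("_", 1)
    return department, description


class AppRegistry:
    """Maps each Appcode to its current position in the application tree."""

    def __init__(self) -> None:
        self._paths: Dict[str, str] = {}

    def register(self, appcode: str, tree_path: str) -> None:
        parse_appcode(appcode)  # validate the format
        if appcode in self._paths:
            raise KeyError(f"Appcode already taken: {appcode}")
        self._paths[appcode] = tree_path

    def move(self, appcode: str, new_tree_path: str) -> None:
        # The tree position changes; the Appcode itself never does.
        if appcode not in self._paths:
            raise KeyError(appcode)
        self._paths[appcode] = new_tree_path

    def path_of(self, appcode: str) -> str:
        return self._paths[appcode]
```

Other systems store only the Appcode, so moving portal from one tree node to another touches this one mapping and nothing else.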

Monitoring and alarm

Let's look at how we do monitoring and alarms on the operation and maintenance platform. For an Internet company, providing 24/7 service is a basic requirement. How do we guarantee it? We need to detect potential problems before they happen and detect real problems promptly when they do. Both require a monitoring and alarm system.

Qunar's monitoring and alarm system also went through a long struggle. At the beginning, each department maintained its own system, built from two components, Cacti and Nagios. What were the problems?

First, Cacti was deployed on a single machine and could not scale horizontally, so performance was poor. If that machine misbehaved or went down, the monitoring and alarm system became completely unavailable, so this was not a highly available solution.

Second, each department maintained its own monitoring system, and larger departments, such as hotels and flight tickets, might maintain several. Each one required dedicated staff, so the operating cost was very high.

Because the earlier systems lacked good permission management, each could only be looked after by a dedicated person. Opening access to others was dangerous: someone might accidentally delete an alarm or change its configuration.

As a result, the communication cost of customizing a monitor or alarm was very high: a developer had to contact the person in charge, who then configured the alarm. Developers found this too troublesome and did little or none of it. Monitoring was therefore incomplete, and some anomalies or even faults were not found in time. How did we solve this? We built Watcher, a company-wide unified monitoring and alarm platform. It has several goals:

The first is high availability: if one or more machines go down, the impact on us is small or nil.

The second is ease of configuration. We built a permission management system, again tree-shaped and modeled on the application tree, and opened the whole Watcher interface to all developers so that everyone can easily configure their own alarms and monitoring.

Let me briefly introduce Watcher. Watcher is built on Graphite with extensive custom development. It supports both basic host monitoring and alarms and business monitoring and alarms on one platform, and developers can view and configure them in a unified interface.

Watcher started around 2014, so it has been running for about three years and is widely adopted in the company. It now serves more than 1,500 applications, more than 20 million metrics, more than 400,000 alarms, and more than 40,000 machines under basic monitoring. What architecture supports Watcher at this scale?

The architecture diagram shows just one of our Watcher clusters. When metrics are written, we decide which cluster each one goes to. How do we tell? By the prefix of the metric name. For example, all test data starts with t and all host data starts with h. We use s.flat to represent the flight ticket department. To write the flight ticket department's metrics, you configure a server, identified by a domain name, which stands for the flight ticket monitoring and alarm cluster.
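
The prefix rule can be sketched as a longest-prefix lookup. The prefixes follow the examples in the talk, but the trailing dots on t. and h., the cluster names, and the function cluster_for are assumptions for illustration.

```python
# Hypothetical mapping from metric-name prefix to cluster name.
CLUSTER_BY_PREFIX = {
    "t.": "test-cluster",        # test data
    "h.": "host-cluster",        # host data
    "s.flat": "flight-cluster",  # flight ticket department
}


def cluster_for(metric: str) -> str:
    """Pick the cluster whose prefix is the longest match for the metric."""
    matches = [prefix for prefix in CLUSTER_BY_PREFIX if metric.startswith(prefix)]
    if not matches:
        raise LookupError(f"no cluster configured for {metric!r}")
    return CLUSTER_BY_PREFIX[max(matches, key=len)]
```

Choosing the longest match keeps the rule unambiguous if a department later gets a more specific prefix of its own.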

In the cluster architecture diagram, the green components at the bottom are original Graphite components, on top of which we developed several of our own. The first is Relay. When a metric comes in, Relay distributes it across multiple machines using consistent hashing.

For reads, we developed our own Graphite-api, which uses the same consistent hashing algorithm. With it we can work out which machine a metric is on, call the Graphite-web API on that machine, and fetch the data.
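
To make the shared-algorithm point concrete, here is a generic consistent hash ring with virtual nodes. It is a textbook sketch, not Graphite's or Watcher's actual implementation; the class name HashRing, the use of SHA-256, and the replica count are assumptions.

```python
import bisect
import hashlib
from typing import List, Tuple


class HashRing:
    """Consistent hash ring; writer and reader must build it identically."""

    def __init__(self, nodes: List[str], replicas: int = 100) -> None:
        self._ring: List[Tuple[int, str]] = []
        for node in nodes:
            # Each node gets several points on the ring to even out the load.
            for i in range(replicas):
                self._ring.append((self._hash(f"{node}:{i}"), node))
        self._ring.sort()
        self._positions = [position for position, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, metric: str) -> str:
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First ring point clockwise from the metric's hash, wrapping around.
        index = bisect.bisect(self._positions, self._hash(metric)) % len(self._ring)
        return self._ring[index][1]
```

If the relay and the read API construct the ring from the same node list, node_for returns the same machine on both paths, and removing a machine only remaps the metrics that lived on it.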

That is the architecture of one cluster, and there are multiple clusters. We needed a unified interface for Watcher. Otherwise, when configuring monitoring in the interface, users would have to select the data source. Whoever writes a metric knows which cluster it is in, but can we give users one unified data source? To do so, we added a database that stores only metric names: whenever data arrives, we write the metric name into the database and record which cluster it is in.

This lets us expose a single unified Graphite-api. Suppose we want a metric that starts with s.flat-xx. We first call the API to find which cluster it is in, find that it is in the flight ticket cluster, and then locate and fetch the metric through the consistent hash. The Dashboard and alarms are built on top of this Graphite-api.
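
The two-step lookup (metric name to cluster, then fetch from that cluster) might look like the sketch below. MetricIndex and UnifiedApi are hypothetical names, an in-memory dict stands in for the metric-name database, and each cluster backend is modeled as a plain callable.

```python
from typing import Callable, Dict, List

# A backend takes a metric name and returns its data points.
Backend = Callable[[str], List[float]]


class MetricIndex:
    """Stand-in for the database that records which cluster holds a metric."""

    def __init__(self) -> None:
        self._cluster_of: Dict[str, str] = {}

    def record(self, metric: str, cluster: str) -> None:
        # Called on the write path whenever a data point arrives.
        self._cluster_of[metric] = cluster

    def cluster_of(self, metric: str) -> str:
        return self._cluster_of[metric]


class UnifiedApi:
    """Single entry point that hides the cluster layout from readers."""

    def __init__(self, index: MetricIndex, backends: Dict[str, Backend]) -> None:
        self._index = index
        self._backends = backends

    def fetch(self, metric: str) -> List[float]:
        cluster = self._index.cluster_of(metric)  # step 1: find the cluster
        return self._backends[cluster](metric)    # step 2: query that cluster
```

Inside each backend, the cluster would then apply its consistent hash to find the machine that stores the metric.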

Having covered the overall Watcher architecture, let's look at how host monitoring is done.

First, there is a hardware management platform that maintains the information related to host monitoring. Its most important job is managing the agent: it maintains the agent's version and configuration, continuously scans hosts, deploys the agent to them, and regularly checks whether metrics are being collected. If a host's metrics show a gap or a problem, it raises an alarm so that we can check whether something is wrong with Collectd, the system, or the network.

Collectd is deployed on each host and collects different metrics according to its configuration, such as CPU usage, memory usage, and network bandwidth usage, all of which are sent to Watcher. Different hosts report the same metrics, so to tell them apart we include the host name in the metric name. Once the metrics are in Watcher, they can be queried through the API and shown on the Dashboard.
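
As an illustration of putting the host name into the metric name, the sketch below builds a metric path and formats it in Graphite's plaintext line format (path, value, Unix timestamp). The h. prefix follows the convention mentioned earlier in the talk; replacing dots in the host name with underscores and the function names are assumptions.

```python
import time
from typing import Optional


def host_metric(hostname: str, name: str) -> str:
    # Graphite uses "." as the path separator, so dots inside the host name
    # are replaced to keep the host as a single path component (assumed rule).
    safe_host = hostname.replace(".", "_")
    return f"h.{safe_host}.{name}"


def plaintext_line(path: str, value: float, timestamp: Optional[float] = None) -> str:
    """Format one data point as '<path> <value> <timestamp>' plus newline."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"
```

Two hosts reporting cpu.usage then produce distinct paths, for example h.web1_ops_cn1.cpu.usage and h.web2_ops_cn1.cpu.usage.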

Business monitoring is similar. After an application is connected, it exposes an API that serves the application's monitoring data for the last minute. Every minute, the Qmonitor server pulls this file from all machines and analyzes the files centrally, then processes the results. For example, it aggregates per application, uses the Appcode to distinguish metrics, and pushes them to Watcher. Once they are in Watcher, you can query them and check the health of the application.
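
The per-minute aggregation step could be sketched like this. The sample tuple layout, summation as the aggregation, and the s. prefix in the resulting metric name are assumptions; the talk only says that metrics are aggregated per application and keyed by Appcode.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One sample pulled from one machine: (appcode, host, metric, value).
Sample = Tuple[str, str, str, float]


def aggregate(samples: Iterable[Sample]) -> Dict[str, float]:
    """Sum the last minute's samples per Appcode and metric across hosts."""
    totals: Dict[str, float] = defaultdict(float)
    for appcode, _host, metric, value in samples:
        # The Appcode in the path distinguishes one application's metrics
        # from another's once they are pushed to the monitoring backend.
        totals[f"s.{appcode}.{metric}"] += value
    return dict(totals)
```

A scheduler would call aggregate once a minute on the freshly pulled samples and push each resulting path and value to Watcher.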

Data exchange

Let's talk about how we achieve data exchange across the whole operation and maintenance platform. We mentioned Appcode under both monitoring and host management. What exactly is an Appcode at Qunar?

It is the unique identifier of an application, where "application" is an abstraction with a broad meaning. An application at Qunar can be a Web service, a GPU cloud instance, a MySQL instance, even a set of switches, or something else.

Why abstract applications? The advantage is that we do not have to consider the specific details of services and resources; an App simply represents a service or a resource, regardless of what the service does or what the resource looks like. We then define common attributes for these broadly defined applications, including the person in charge, the permissions, and the bill.

With these common attributes, we can extend Appcode to many systems and use it to share data among them. What is the effect? Appcode becomes a common language across our systems. With this common language we can connect the data between systems and ultimately achieve data exchange. What are the benefits of data exchange?

First, we associate Appcode with resources in various systems, such as hosts, storage, and computing, which make up the resource part of an application. Appcode is spread over multiple systems that interact with each other. The more systems a piece of data is spread across, the more accurate it needs to be. Because the data is used by multiple systems, the people responsible pay more attention to it and are more willing to keep it accurate.

Once the data is more accurate, it becomes more useful, and each system is more willing to use it, forming a virtuous cycle. Thanks to data exchange, we can build a Portal platform that exposes a unified interface and manages every part of an application in one place.

The second is CI/CD. The hosts an application is released to are also associated with its Appcode, and hosts added by capacity expansion are synchronized automatically. You release to these hosts simply by selecting them, without manually filling in a host list.

The third is monitoring, which has two aspects: basic monitoring and business monitoring. Basic monitoring of the relevant hosts is organized along the Appcode dimension. For business monitoring, the host list is obtained through the Appcode and automatically added to metric collection, so the monitoring metrics and logs of the application's hosts are collected.

The fourth is the alarm system. An Appcode corresponds to some common monitoring and alarm items, such as GC alarms for Java. With Appcode, we can add GC alarms by default to every machine under every Appcode. The alarm contact is the person in charge of the Appcode, and the GC alarm is added automatically for each machine added by expansion. The same goes for log collection: previously the host list had to be maintained manually on the log platform, and with Appcode it is synchronized.
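
A sketch of how default GC alarms could be derived from an Appcode's host list. The rule fields, the metric name jvm.gc.time, and the threshold are invented for illustration; only the idea (one default alarm per host, contacts taken from the Appcode's owners, new hosts picked up automatically) comes from the talk.

```python
from typing import Dict, Iterable, List, Set


def default_gc_alarms(appcode: str, hosts: Iterable[str], owners: List[str],
                      already_covered: Set[str]) -> List[Dict[str, object]]:
    """Build GC alarm rules for hosts of an Appcode that have none yet."""
    rules: List[Dict[str, object]] = []
    for host in sorted(set(hosts) - already_covered):
        rules.append({
            "appcode": appcode,
            "host": host,
            "metric": "jvm.gc.time",   # hypothetical metric name
            "threshold_ms": 1000,      # hypothetical threshold
            "contacts": list(owners),  # owners of the Appcode get notified
        })
    return rules
```

Running this after every capacity expansion, with already_covered set to the hosts that have a rule, creates alarms only for the newly added machines.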

Brief introduction to the Portal platform

Here is a brief introduction to the Portal platform, which is still under development.

Portal is built on Appcode and uses it to connect the various OPS systems: host, account, GPU cloud, ES cloud, application registration, application configuration, application middleware, environment configuration, code repository, testing, release, monitoring, alarm, log collection, and fault management. We aggregate these systems into one Portal interface and expose it to developers, who can do everything they need in one place.

Another advantage of data exchange concerns the host management we discussed earlier. The same application has host lists in several dimensions: a list of hosts for releasing the app, a list for billing, a list for log collection, and a list for monitoring and alarms.

Once the data is connected, we can chain these together. For example, when an application is scaled out by two hosts, we can automatically create accounts on the new hosts for the application's responsible persons, so that they can log in and operate them.

Databases and other systems also have IP whitelist restrictions. With connected data, an application's whitelist configuration no longer needs to record every host; it only needs to record the Appcode.
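
The whitelist idea can be sketched as resolving Appcodes to IP addresses at check time. The function name, the two lookup dictionaries, and the sample addresses (taken from the 192.0.2.0/24 documentation range) are assumptions.

```python
from typing import Dict, Iterable, List, Set


def resolve_whitelist(appcodes: Iterable[str],
                      hosts_by_appcode: Dict[str, List[str]],
                      ip_by_host: Dict[str, str]) -> Set[str]:
    """Expand whitelisted Appcodes into the current set of allowed IPs."""
    allowed: Set[str] = set()
    for appcode in appcodes:
        for host in hosts_by_appcode.get(appcode, []):
            allowed.add(ip_by_host[host])
    return allowed
```

The whitelist entry stays a single Appcode; when the application gains or loses hosts, only hosts_by_appcode changes and the allowed set follows.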

Another advantage of data exchange is that, with Appcode, it is easy to calculate the bill for an application. Why calculate an application's bill?

First, it raises cost awareness, which should also be a factor in technology selection. For example, a business line has some data to record and can choose among systems such as a database or Watcher. If the data is written very rarely, say a few times a day, recording it in Watcher is actually expensive, because data stored in Watcher is heavily inflated, and a database or a log is more cost-effective.

Second, it drives optimization. If an application uses a lot of machine resources because of its algorithm, seeing the bill pushes the team to cut costs. With cost awareness we can also allocate resources more reasonably. For example, some applications are not very important but have requested many machines with low utilization. When the team sees how large a bill such an unimportant application runs up, they give some of the machines back.

We are continuously bringing in more kinds of bills, such as host, network bandwidth, monitoring and alarm, log collection, mass storage, and computing resource bills, and others will follow gradually.
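
A minimal sketch of rolling several kinds of usage up into one bill per Appcode. The resource names, the unit prices, and the function bills_by_appcode are hypothetical; the talk does not describe how Qunar prices its resources.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable, Tuple

# One usage record: (appcode, resource type, quantity used).
Usage = Tuple[str, str, Decimal]


def bills_by_appcode(usage: Iterable[Usage],
                     unit_price: Dict[str, Decimal]) -> Dict[str, Decimal]:
    """Total cost per Appcode across all resource types."""
    totals: Dict[str, Decimal] = defaultdict(Decimal)
    for appcode, resource, quantity in usage:
        totals[appcode] += quantity * unit_price[resource]
    return dict(totals)
```

Because every system reports usage keyed by Appcode, adding a new bill type only means adding one more resource name and price.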

Summary

Finally, to sum up: Qunar's operation and maintenance automation has gone through several stages. We found that once the application fleet reaches a certain scale, an operation and maintenance platform is needed; manual or semi-automated approaches are very labor-intensive and also tend to introduce errors or even failures. Qunar's operation and maintenance automation works well. How does that show?

When I started, there were five or six people doing daily operations, and today there are still six. We have also launched an operations robot, the "seventh person" on the team, but in practice we remain at six people. Meanwhile our scale has grown a hundredfold, from 100 machines to 10,000, without any increase in daily operations staff. That is the benefit of automating on the operation and maintenance platform.

Application availability is safeguarded by the monitoring and alarm system. Before an application goes online, all of its key alarms and monitoring are set up, so that if something goes wrong the team can quickly roll back or debug. Because we have a thorough monitoring and alarm system, Qunar has relatively few faults: on average only two or three a day.

Qunar's definition of a fault may differ from other companies' and is stricter; every fault is recorded and graded. For example, if Watcher stops drawing graphs for more than 5 minutes, we may classify it as a P1 or P2 fault and investigate in depth. Even under such strict requirements our fault count is not high: in the four years since I joined, the cumulative total is only about 3,000.

To keep our whole operations ecosystem developing healthily, we need to connect the data, and for that applications need an ID. With this ID we can share data across the operation and maintenance systems and platforms, forming a virtuous cycle.

About the author: Zheng Songkuan is a senior operation and maintenance engineer at Qunar. He joined Qunar's platform division in 2013 to work on operations development, where he is mainly responsible for developing the company's monitoring system and for the design, development, and operation of the application management platform Portal.

Transferred from: Efficient Operation and Maintenance
