Operations automation grows out of the pain points of day-to-day work. JD.com's database team faces thousands of R&D engineers across the mall business, and this pressure drives us to change constantly. The change was not achieved overnight: it went through a difficult evolution from manual work to scripts, and then to automation, platforms and intelligence. Demand drives the construction of the operations system, and the essence of operations automation is to free operations staff, improve efficiency and reduce human-caused faults; we should learn to cultivate the good habit of being "lazy". JD.com began building its automated database operations system in 2012. Two aspects are introduced below.
1. JD.com database intelligent operation and maintenance platform
JD.com's business grows explosively every year, with a large number of database servers and thousands of product lines. Supporting such a huge business system requires a complete automated operations management platform. JD.com's MySQL database management platform, DBS for short, currently covers a complete asset management system, a database process management system, a database monitoring system, a database fault management system, a database reporting system, an elastic database system and auxiliary operations tools. It touches every aspect of DBA work and realizes automated, self-service, visualized, intelligent and service-oriented management of MySQL by DBAs, avoiding production accidents caused by manual mistakes and ensuring the safe, stable and efficient operation of JD.com's databases. This article focuses on the following core functional components.
1.1. Metadata management
As the cornerstone of automated operations, metadata accuracy directly determines the reliability of the whole database management platform. Starting from the database business side and the working habits of DBAs, JD.com's database management platform covers dimensions such as data center, host, business, cluster, instance, database and table. A minimal, illustrative sketch of these dimensions follows the list.
Data center and host dimensions: mainly record hardware information.
Business dimension: mainly records the business department's name, level and related information.
Cluster dimension: mainly records the MySQL cluster architecture.
Instance dimension: mainly records the relevant MySQL parameters, providing a basis for subsequent automated operations.
Database dimension: mainly records the database name and the contact information of the business owners.
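As a rough sketch (class and field names here are illustrative assumptions, not the actual DBS schema), the dimensions above could be modeled like this:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Host:
    ip: str
    idc: str                 # data center (computer room)
    cabinet: str             # rack / cabinet
    cpu_cores: int
    memory_gb: int
    disk_gb: int

@dataclass
class Instance:
    host: Host
    port: int
    role: str                # "master" or "slave"
    params: dict = field(default_factory=dict)   # key MySQL parameters

@dataclass
class Cluster:
    name: str
    business: str            # owning business department
    level: str               # business level / criticality
    instances: List[Instance] = field(default_factory=list)

@dataclass
class Database:
    cluster: Cluster
    name: str
    owner_contact: str       # business contact for this schema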
1.2. Automated deployment
Faced with complicated operations work such as adding and expanding databases, an automated installation and deployment platform can completely free the DBA. JD.com's automated deployment system currently covers application server and database instance deployment, data synchronization, consistency verification, splitting and switching operations. The whole process runs as a workflow, including approval by the business side and the DBA at each level, finally achieving automated, workflow-driven deployment of the complete MySQL service.
The main function points include the following:
Installs and deploys MySQL instances, builds the architecture and applies for domain names. The allocation rule requires that the master and slave instances of one cluster not be placed in the same cabinet, and prefers hosts with better hardware as the master (a sketch of this placement rule follows the list).
Deploys monitoring and backups, and registers assets.
MySQL services are created from images, relying on the Kubernetes image repository.
Application accounts are created by the application side through the automated go-live system.
Master-slave data consistency checks are usually performed during the nightly business trough.
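A minimal sketch of the placement rule mentioned above, assuming hypothetical Host objects with cpu_cores, memory_gb, disk_gb and cabinet attributes (an illustration, not the actual DBS allocator):

def pick_master_and_slave(candidates):
    """Prefer the host with the best hardware as the master, and never
    place the master and its slave in the same cabinet."""
    ranked = sorted(
        candidates,
        key=lambda h: (h.cpu_cores, h.memory_gb, h.disk_gb),
        reverse=True,                       # best hardware first
    )
    master = ranked[0]
    for host in ranked[1:]:
        if host.cabinet != master.cabinet:  # anti-affinity on the cabinet
            return master, host
    raise RuntimeError("no slave candidate outside the master's cabinet")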
1.3. Intelligent analysis and diagnosis
JD.com's intelligent analysis and diagnosis covers four important parts: collection of database monitoring metrics, diagnostic analysis, fault self-healing and trend analysis.
1.3.1 Monitoring system
The monitoring system provides an accurate data basis for database management and lets operators know the state of the production systems like the back of their hand. The core monitoring metrics include OS load, core MySQL metrics, database logs and so on. By analyzing the collected monitoring data, the system judges the running state of each monitored database, predicts possible problems and suggests optimizations, ensuring the stability and efficiency of the whole platform.
JD.com's distributed monitoring system works in passive mode, and both the server side and the agents are deployed for high availability to prevent single points of failure.
1.3.2 Monitoring performance analysis
Intelligent analysis of database performance is mainly a secondary analysis of the monitoring data, aimed at eliminating hidden risks. In production, some risks never reach the configured alarm threshold and sit just below the alarm line; this is in fact the most dangerous situation, because it can break out at any time. To catch such risks, we group the monitoring data and analyze it by period-over-period comparison, year-over-year comparison and TOP-N indicators, so that hidden risks are found in advance (a minimal sketch follows the list below):
slow SQL analysis
index analysis
disk space analysis and capacity prediction
lock analysis
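A minimal sketch of the period-over-period style of analysis described above; the threshold and the shape of the input are illustrative assumptions:

def find_hidden_risks(metric_series, ratio_threshold=1.5, top_n=10):
    """metric_series maps a metric name to (previous_value, current_value).
    Flags metrics that are still below the alarm line but growing abnormally,
    and returns the top_n fastest-growing ones."""
    risks = []
    for name, (prev, cur) in metric_series.items():
        if prev <= 0:
            continue
        ratio = cur / prev                      # period-over-period growth
        if ratio >= ratio_threshold:
            risks.append((name, ratio))
    # TOP-N ranking: inspect the fastest-growing metrics first.
    return sorted(risks, key=lambda item: item[1], reverse=True)[:top_n]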
1.3.3 Fault self-healing
Failures take many forms, and self-healing relies on the auxiliary analysis of monitoring data. The steps for producing the most accurate root-cause information are as follows (a sketch follows the list):
alarm filtering: filter out unimportant and duplicate alarms
derived alarm generation: generate derived alarms of all kinds according to the association relationships
alarm association: determine whether derived alarms of different types within the same time window are related
weight calculation: calculate the likelihood of being the source alarm according to preset weights for each alarm type
root alarm generation: mark the derived alarm with the highest weight as the root alarm
root alarm merging: if several alarm types compute the same root alarm, merge them
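The steps above can be sketched roughly as follows; the alarm types, weights and data shapes are assumptions used only for illustration:

from collections import defaultdict

# Hypothetical per-type weights: how likely each alarm type is to be the source.
ALARM_WEIGHTS = {"host_down": 100, "mysql_down": 80, "replication_broken": 60, "slow_sql": 20}

def derive_root_alarms(alarms, window_seconds=60):
    """alarms: list of dicts like {"type": ..., "cluster": ..., "ts": ...}."""
    groups = defaultdict(list)
    for a in alarms:
        # Alarm association: same cluster, same time window.
        bucket = (a["cluster"], int(a["ts"] // window_seconds))
        # Alarm filtering: drop duplicates of the same type in the window.
        if a["type"] not in {x["type"] for x in groups[bucket]}:
            groups[bucket].append(a)

    roots = []
    for related in groups.values():
        # Weight calculation + root alarm generation.
        roots.append(max(related, key=lambda a: ALARM_WEIGHTS.get(a["type"], 0)))

    # Root alarm merging: several buckets may point at the same root.
    merged = {(r["cluster"], r["type"]): r for r in roots}
    return list(merged.values())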
1.4. Intelligent switching system
JD.com has a relatively large fleet of database servers, which makes failures correspondingly more likely, while the requirements on system stability are ever stricter. To guarantee high availability of the databases and uninterrupted 24-hour service, our team developed an automated database switching platform. It supports both automatic and semi-automatic switching modes, and supports multi-dimensional switching scenarios at the single-cluster, multi-cluster and data-center level. The switching process covers updating monitoring, asset information, backup strategies and master/slave roles, all completed with one click, avoiding secondary failures caused by human error.
1.4.1 Distributed detection
As the core component of the switching system, distributed detection mainly solves the problem of disaster recovery for the detection itself. Since JD.com's database servers are deployed across multiple data centers, a detection node is deployed in each data center and identified by a dedicated interface domain name. When a switch is triggered, the switching system randomly selects two data-center detection interfaces and calls them with the failed host's IP and other information. If either node reports the host as alive, the host is considered alive; only if both nodes report it as down is the host considered down.
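A minimal sketch of this detection logic, assuming hypothetical HTTP probe endpoints (the real interface domain names are internal to JD.com):

import random
import requests

# One hypothetical probe endpoint per data center.
PROBE_ENDPOINTS = [
    "http://probe.idc-a.example.com/check",
    "http://probe.idc-b.example.com/check",
    "http://probe.idc-c.example.com/check",
]

def host_is_alive(failed_host_ip, timeout=3):
    """Ask two randomly chosen data-center probes whether the host responds.
    The host is alive if any probe can reach it, and down only if both fail."""
    for endpoint in random.sample(PROBE_ENDPOINTS, 2):
        try:
            resp = requests.get(endpoint, params={"ip": failed_host_ip}, timeout=timeout)
            if resp.ok and resp.json().get("alive"):
                return True                 # any live verdict means alive
        except requests.RequestException:
            continue                        # the probe itself may be unreachable
    return False                            # both probes say down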
1.4.2 Master failover
When a master instance fails, the switching system first checks the instance's survival through the distributed detection system. Once the outage is confirmed, it determines the instance's role from the basic metadata and proceeds with either automatic or manual switching. The two modes follow the same principle: a switching task is first created on the switching system (manual switching requires the DBA to press the switch button), and the switching operation verifies each candidate instance's running state by inserting a row of data, to avoid choosing an instance that is hung or whose disk is read-only. If no slave survives, the operation is abandoned and the DBA is notified by email and SMS. The new master is selected local data center first (fewest connections first, then lowest QPS load), and only then from remote locations. After a successful switch, the corresponding metadata is updated. An example follows:
Take a cluster with one master and four slaves, where the master 10.66.66.66 has failed and needs to be switched:
1. The monitoring system detects that the master is down and automatically creates a switching task, which can then be executed automatically or manually. Manual switching is used as the example here.
2. Select the target instance. All four slaves in the example are alive, so by the "local first, then remote" principle the local candidates 10.66.66.68:3366 and 10.66.66.69:3366 are considered first. Their connection counts are compared; since they are equal, the QPS is compared, and 10.66.66.69:3366, which has the lower QPS load, is chosen as the target instance (a sketch of this selection rule follows the example).
3. The switch completes and the result is recorded in the switching task.
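A sketch of the target-instance selection used above ("local first, fewest connections, then lowest QPS"); the attribute names are illustrative assumptions:

def choose_new_master(slaves, master_idc):
    """slaves: objects with .alive, .idc, .connections and .qps attributes
    (hypothetical fields)."""
    alive = [s for s in slaves if s.alive]
    if not alive:
        return None                          # no surviving slave: abandon and notify the DBA
    local = [s for s in alive if s.idc == master_idc]
    candidates = local or alive              # fall back to remote slaves only if needed
    return min(candidates, key=lambda s: (s.connections, s.qps))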
1.4.3 Slave failover
When a slave instance fails, the domain names bound to the failed instance are moved to a healthy instance in the same cluster; the target instance is chosen by the same rule as for master failover. The DBA is notified by email and SMS whether the switch succeeds or fails. After the failed instance recovers, the DBA decides whether to fail back. An example follows:
Take a cluster with one master and four slaves, where the slave 10.88.88.89:3366 has failed and needs to be switched:
The monitoring system automatically creates the task; following the "local first, then remote" principle, the connection counts and QPS are compared and the target instance is determined to be 10.88.88.88:3366. The DBA can view the details in the switching task list.
If the switch succeeds, a failback button is displayed, and the DBA can perform the failback and view its details.
1.4.4 Master-slave planned switching
Master-slave planned switching supports batch switching at the single-cluster and multi-cluster level. When performing a batch switch, you can inspect the steps of each sub-task, and the architecture before and after the switch is shown for comparison. A specific example follows:
Cluster 1
Tasks are created in batch. The selection principle is fewest connections first, then lowest QPS: for 10.66.66.66:3366, the selected target master is 10.88.88.89:3366.
Batch switching
In the sub-task details, you can view each sub-task's switching result, its execution steps, and the architecture before and after the switch.
All functional modules of JD.com's MySQL switching system have been componentized as services, which simplifies the DBA's workflow and shortens database switching time.
1.5. Automatic backup and recovery of database
1.5.1 Architecture Design
JD.com's database backup system was designed to extricate DBAs from tedious backup management, handle backups automatically, reduce human intervention and improve the availability of backup files. To verify that availability, a polling-recovery strategy ensures that every cluster's backup is restored and checked within one cycle.
The architecture has the following characteristics:
1) scheduling trigger diversification:
The scheduling center supports three trigger types: interval, crontab and date (a minimal APScheduler sketch of the three types follows this item).
Interval is a periodic schedule that runs tasks at a fixed interval. It supports units such as weeks, days, hours, minutes and seconds, and supports setting the start and end time of the schedule as well as the time zone.
Crontab is a scheduled trigger, essentially the same as Linux crontab, supporting year, month, day, week, day_of_week, hour, minute and second, plus the schedule's start/end time and time zone.
Date is a one-off trigger that fires at a single point in time and supports time-zone settings.
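A minimal APScheduler sketch of the three trigger types; the job function, schedule values and time zone are placeholders, not the production configuration:

from datetime import datetime
from apscheduler.schedulers.background import BackgroundScheduler

def run_backup(cluster):
    print("backing up", cluster)            # placeholder job body

scheduler = BackgroundScheduler(timezone="Asia/Shanghai")

# interval: periodic schedule with a fixed gap, plus optional start/end time.
scheduler.add_job(run_backup, "interval", hours=6, args=["cluster-demo"],
                  start_date="2018-01-01 00:00:00", end_date="2018-12-31 23:59:59")

# cron: crontab-style schedule (year/month/day/week/day_of_week/hour/minute/second).
scheduler.add_job(run_backup, "cron", hour=2, minute=30, args=["cluster-demo"])

# date: one-off schedule at a specific point in time.
scheduler.add_job(run_backup, "date", run_date=datetime(2019, 6, 1, 3, 0, 0),
                  args=["cluster-demo"])

scheduler.start()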
2) concurrency control:
Because scheduled tasks are not evenly distributed, many tasks may need to be scheduled at the same moment, which can overload the scheduling system; limiting the number of concurrent tasks keeps execution running smoothly.
3) trigger and execute layering:
Triggering a task is lightweight while executing it is generally heavy, so triggering and execution are layered, preventing long-running executions from blocking subsequent triggers.
4) tasks are not lost during maintenance:
After a restart, Linux crontab will not execute the tasks that should have run during downtime maintenance, whereas the APScheduler-based scheduling center will, on startup, run tasks missed within a configured grace interval, reducing executions missed because of maintenance.
5) backup strategy management (add, delete, modify, query):
The company's previous backup system required specifying a concrete IP, and backups often failed because of server maintenance. High availability was therefore built into the backup strategy from the start: a strategy specifies a slave's domain name rather than an IP. When that slave fails, DBS fails its domain name over to another slave in the cluster, and the backup follows to that slave, so a usable backup server is always available.
6) automatic retry on failure:
Backups can fail for incidental reasons, so a retry feature was added: a failed backup task is retried up to 3 times within 6 hours, giving a higher overall backup success rate (a brief sketch follows).
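A simple sketch of this retry policy; the task fields and the do_backup callable are assumptions used only for illustration:

import time

MAX_RETRIES = 3
RETRY_WINDOW_SECONDS = 6 * 3600

def run_backup_with_retry(task, do_backup):
    """task: dict with 'started_at' and 'retries'; do_backup returns True on success."""
    while True:
        if do_backup(task):
            return True
        task["retries"] += 1
        within_window = time.time() - task["started_at"] < RETRY_WINDOW_SECONDS
        if task["retries"] >= MAX_RETRIES or not within_window:
            return False                     # give up; hand the failure to the repair system
        time.sleep(600)                      # wait before the next attempt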
7) automatic recovery detection:
Every step of a backup is strictly verified, but the availability of backup files still cannot be absolutely guaranteed. An automatic recovery-detection mechanism is therefore introduced to help DBAs verify backup files and discover in time any files made unusable by unforeseen circumstances. Recovery detection is also a hard audit requirement, and automating it frees DBAs from heavy manual recovery testing.
1.5.2 Scheduling design
The automatic backup and recovery system consists of the scheduling system, the backup system, the recovery system, the recovery-detection system and the automatic repair system. The scheduling system is the core and coordinates the operation of the others. It can be deployed with a standby for high availability, and the executors are deployed as a cluster for high availability and horizontal scaling.
Before each backup, the backup system checks the health of the instance and the running state of the backup to avoid backing up invalid database instances. The recovery system is used when data recovery or elastic scaling needs to restore a backup file into a running database instance, letting the DBA complete data recovery with a few simple operations. Recovery detection automatically verifies the availability of backup files under the command of the scheduling system, helping DBAs find unusable backup files in time. Some backup failures can be solved by automatic retry, but others cannot, so an automatic repair system was developed to fix backup failures caused by environmental problems.
The scheduling system is the brain of the whole backup and recovery platform. We evaluated several implementations, including Linux crontab, Azkaban and Python's open-source framework APScheduler, and concluded that APScheduler is more flexible and compact, offers more diverse scheduling modes and, being Python, carries a lower maintenance cost, so the scheduling center was built on APScheduler.
1.5.3 System front end
The front end is divided into four modules: backup strategy management, backup details, backup blacklist management and recovery details.
Backup policy management:
The backup strategy management page shows the backup status distribution, storage usage and the current backup strategy of each cluster. An existing strategy can be modified (time, server, backup method), paused/resumed or deleted here; if a cluster has no strategy yet, one can be added.
Backup details:
Backup details show the total number of recent backups, the number of successes, the success rate, the running status of the day's backup tasks, a 24-hour distribution curve of backup tasks, and detailed backup records. The records can be filtered by cluster name, project name and other fields, so that DBAs can better grasp the state of backups.
Recovery detection details:
The recovery detection page shows the number of recovery tests per day in the recent period, the number of successes and the success rate as a bar chart, a pie chart of the day's recovery-detection task status, and the recent recovery-detection completion rate, giving DBAs a clearer picture of the recovery situation.
2. Database transformation
2.1. In the past
JD.com's database services were already containerized before ContainerDB. Although running on Docker containers gave the database service basic capabilities such as fast delivery and automatic failover, which improved stability and efficiency to some extent, the way the databases were operated remained essentially traditional. Typical problems were as follows:
2.1.1 Resource allocation granularity is too large
Database server resource specifications were fixed and too coarse-grained, so only a handful of resource sizes could be offered to database services.
2.1.2 Serious waste of resources
Resource allocation was decided by DBAs based on experience, which is highly subjective and cannot be accurately matched to the actual needs of the business. When allocating resources, DBAs generally assumed the service should not need migration or expansion for 3 years and allocated more resources up front, causing serious waste. Moreover, because the resource specifications were fixed and large, fragmentation on hosts was severe: a host could often hold only one container while its remaining resources fit no specification at all, leaving host utilization low.
2.1.3 Static resources, no scheduling
Once a database service was provisioned, its resources were fixed and could not be rescheduled online according to the database's load. When a database's disk utilization grew too high, the DBA had to intervene manually to expand it, which was inefficient.
2.2. Now
Simple containerization of the database service cannot solve these problems; we needed to make the database service smarter, let database resources move, and deliver resources in stages, so ContainerDB was born. ContainerDB's flexible, load-based scheduling gives intelligence to JD.com's database resources, makes them truly mobile, and has successfully served many 618 and 11.11 promotions.
In ContainerDB, each business application has a logical database. The logical database defines, for the whole business (the KeySpace), the modulo used in the hash-modulo operation on the split key (Sharding Key) of all tables. Multiple tables can be created in each logical database, but every table must define a Sharding Key. Based on the Sharding Key, the data in a table is divided into multiple Shards; each Shard corresponds to a KeyRange, which represents a range of values (Sharding Indexes) produced by the hash-modulo operation on the Sharding Key, and each Shard is backed by a MySQL master-slave group. The application interacts only with the Gate cluster, and Gate automatically routes writes and queries based on the metadata and the SQL statement. The monitoring center in ContainerDB watches the usage of all basic services and resources in real time and, through Hook programs registered with it, automatically performs dynamic scaling, fault self-healing, shard management and so on, all of which is completely imperceptible to the application.
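A rough sketch of the Sharding Key to Shard routing described above; the hash function, modulo and shard boundaries are placeholders (the real Gate routing is driven by Topology metadata):

import hashlib
from bisect import bisect_right

class KeySpace:
    """Maps a sharding key to a shard via hash-modulo and key ranges."""

    def __init__(self, name, modulo, shard_boundaries):
        # shard_boundaries are the exclusive upper bounds of each KeyRange,
        # e.g. [128, 256] splits sharding indexes 0..255 into two shards.
        self.name = name
        self.modulo = modulo
        self.boundaries = shard_boundaries

    def sharding_index(self, sharding_key):
        digest = hashlib.md5(str(sharding_key).encode()).hexdigest()
        return int(digest, 16) % self.modulo        # hash-modulo -> Sharding Index

    def route(self, sharding_key):
        idx = self.sharding_index(sharding_key)
        return bisect_right(self.boundaries, idx)   # which KeyRange/Shard owns this index

# Usage: a keyspace split into two shards over a modulo of 256.
ks = KeySpace("order_db", modulo=256, shard_boundaries=[128, 256])
print(ks.route(10086))                              # prints 0 or 1 depending on the hash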
2.2.1 Continuous, streaming resource delivery
One of the main reasons database services previously wasted resources was that the initial allocation granularity was too large: resources were provisioned three or even five years ahead at the very start of a business. But the resource pool is limited, so not every business could get resources in advance, leaving some businesses with none. ContainerDB instead delivers resources continuously, in a streaming fashion. When a service first comes online it is allocated only a standard 64 GB disk; as the service grows and the data volume increases, the disk capacity keeps growing until it reaches the 256 GB upper limit.
In this way the delivery of database resources is spread over a much longer period, so every service can be given a database before the full three-to-five-year resource budget is in place, greatly improving the platform's capacity to support the business.
2.2.2 Flexible scheduling based on load
The resources used by database services fall into two categories: instantaneous resources and incremental resources.
Instantaneous resources are those whose utilization can fluctuate sharply within a short period, mainly CPU and memory.
Incremental resources are those whose utilization does not fluctuate sharply in the short term but grows slowly; they only ever increase and never decrease, and mainly mean disks. ContainerDB adopts different scheduling strategies for the two kinds. For instantaneous resources, ContainerDB offers three specifications for each database:
lower limit: 2C/4G, upper limit: 4C/8G
lower limit: 4C/8G, upper limit: 8C/16G
lower limit: 8C/16G, upper limit: 16C/32G
The initial resources allocated to each container are the lower limit of its specification. When the database experiences excessive CPU load or insufficient memory, it tries to request CPU or memory beyond the lower limit, but never beyond the upper limit. Once the load subsides, the extra resources are released until the CPU and memory return to the lower limit.
For incremental resources: a 64 GB disk is allocated uniformly when a service first comes online. Whenever the current disk utilization reaches 80% and the disk is still below the 256 GB cap, a vertical upgrade is performed; if the container's disk has already reached the 256 GB cap, online Resharding is performed instead.
Vertical upgrade: first a resource check verifies whether the host has enough free disk for the upgrade. If the check passes, a global resource lock is taken on the host and the disk is expanded vertically by 64 GB. If the check fails, a new container is provided on another host with disk capacity +64 GB and the same CPU and memory as the current container, and the database service is migrated to it. A vertical upgrade completes almost instantly and does not affect the database service.
Online Resharding: two new Shards are requested, whose database containers have exactly the same disk, CPU and memory specifications as the current Shard. MySQL master-slave relationships are rebuilt for all databases in the new Shards according to the master-slave topology of the current Shard, then Schema copy and filtered replication are started. Finally the routing rules are updated, read and write traffic is switched to the new Shards, and the old Shard's resources are taken offline.
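A sketch of the disk (incremental resource) decision flow described above; the 80% threshold, 64 GB step and 256 GB cap follow the text, while the container and host objects are hypothetical:

DISK_STEP_GB = 64
DISK_CAP_GB = 256
USAGE_THRESHOLD = 0.80

def handle_disk_growth(container, host):
    """container/host are hypothetical objects exposing the fields used below."""
    if container.disk_used_gb / container.disk_gb < USAGE_THRESHOLD:
        return "noop"                                  # utilization below 80%, nothing to do

    if container.disk_gb < DISK_CAP_GB:
        # Vertical upgrade: grow by 64 GB in place if the host has room,
        # otherwise migrate to a new container with +64 GB.
        if host.free_disk_gb >= DISK_STEP_GB:
            container.disk_gb += DISK_STEP_GB
            return "vertical_upgrade_in_place"
        return "migrate_to_bigger_container"

    # Disk already at the 256 GB cap: split the shard online instead.
    return "online_resharding"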
Whether for a vertical upgrade or online Resharding, one problem needs attention: on the premise that each Shard's Master stays in the primary data center, resources should not all be concentrated on one host, rack or data center. ContainerDB therefore provides strong affinity/anti-affinity resource allocation, with the following current strategy (a sketch of the rule check follows below):
Each KeySpace has a primary data center, and the resource allocation of the database instances in one Shard (currently one master and two slaves per Shard) should satisfy, as far as possible: the Master must be in the primary data center, no two instances may share a rack, and the three instances must not all sit in the same IDC. This strategy prevents a cabinet power failure from taking down master and slave at the same time, and prevents an IDC failure from making all of a Shard's instances unavailable.
Because the rule is only satisfied "as far as possible", when resources in the pool are unevenly distributed the anti-affinity strategy may not be met at allocation time. ContainerDB therefore runs a resident background process that keeps polling all Shards in the cluster and checks whether each Shard's instance distribution meets the anti-affinity rule; if not, it tries to redistribute the instances, giving priority to moving slave instances first so that online business is not affected.
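A sketch of the anti-affinity check performed by that background process; the instance attributes are illustrative assumptions:

def violates_anti_affinity(shard_instances, primary_idc):
    """shard_instances: one master and two slaves, each with .role, .idc and
    .rack attributes (hypothetical fields). Returns the list of violated rules."""
    violations = []
    master = next(i for i in shard_instances if i.role == "master")
    if master.idc != primary_idc:
        violations.append("master not in the primary data center")
    racks = [i.rack for i in shard_instances]
    if len(set(racks)) < len(racks):
        violations.append("two instances share a rack")
    if len({i.idc for i in shard_instances}) == 1:
        violations.append("all instances in a single IDC")
    return violations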
Based on the flexible scheduling capability, ContainerDB implements the following three functions:
Online capacity expansion: when the load of a Shard's databases reaches the threshold, an online vertical upgrade, migration or Resharding of the Shard is triggered automatically.
Online self-healing: when a MySQL instance in a Shard fails, ContainerDB first determines whether the failed instance is the master. If it is, the slave with the largest GTID is promoted to the new master, the replication relationships are rebuilt and a replacement slave is added; if it is not the master, a replacement slave is added directly.
Online access: ContainerDB lets users start online data migration and access tasks entirely in self-service mode. Data is migrated online from a traditional MySQL database into ContainerDB, the domain name is switched automatically once migration completes, and the business system's data source is migrated online without the business noticing.
ContainerDB realizes the Always Online guarantee of JD.com database service through three major functions: online service capacity expansion, online self-healing and online access.
2.2.3 More than scheduling
Flexible and streaming resource delivery and scheduling are the cornerstones of ContainerDB, but in addition to these two core functions, ContainerDB has also done a lot of work in terms of user ease of use, compatibility and data security, including:
Data protection
In a traditional directly-connected database architecture, when the Master becomes unreachable over the network, the usual approach is to promote a Slave to the new Master and then move the domain name from the old Master to the new one. Under network jitter, however, DNS caching on the AppServer can easily lead to dual masters and dirty writes. As the overall architecture shows, ContainerDB connects to users through Gate. Gate is a clustered service: multiple Gate instances are mapped to one domain name, Gate accesses every MySQL instance directly by IP, and Gate's knowledge of each MySQL role comes entirely from the metadata service Topology. When the Master of a MySQL group in ContainerDB becomes unreachable, a new Master is elected and the routing metadata is updated before the Master switch is completed, which avoids the dual masters and dirty data caused by network jitter and DNS caching and thus strictly protects the data.
Streaming query processing
ContainerDB implements priority-based merge sorting in the Gate layer to provide fast streaming queries. When a query touches a large amount of data, part of the result set can be returned immediately, greatly improving the user experience.
Unaware data migration
ContainerDB provides an online data migration and access tool, JTransfer, which combines copying of the existing (stock) data with catch-up of incremental data inside a sliding window. Through JTransfer, live data in a traditional MySQL database can be migrated into ContainerDB online. When the lag between the data in ContainerDB and the source MySQL drops below 5 seconds, writes to the source MySQL are paused; when the lag reaches 0, the source MySQL's domain name is moved to the Gate cluster. The user's AppServer is unaware of the whole migration process.
Compatible with MySQL protocol
ContainerDB is fully compatible with the MySQL protocol, supports standard MySQL clients and official driver access, and supports most ANSI SQL syntax.
Routing rules are transparent
ContainerDB connects to users through the Gate cluster. From the syntax tree and the query execution plan built from the user's SQL statement, Gate obtains all the tables involved in the query and looks up each table's shard information in the Topology metadata. Finally, combining the join conditions in the JOIN and the predicates in the WHERE clause, it routes the query or write to the correct shard. The whole process is handled automatically by Gate and is completely transparent to users.
Self-service
ContainerDB abstracts database service instantiation, DDL/DML execution, shard upgrade and capacity expansion into independent interfaces and, with the help of a workflow engine, offers a fully self-service access process. After the user successfully applies for a database service, ContainerDB automatically pushes the database access password to the user's mailbox.
3. Prospect
The past is gone and the future is here.
Going forward, we will think more from the user's point of view about the value a database can create. We believe JD.com's future database service will be:
More Smart: we will apply deep learning and cluster analysis to the monitoring data of each database instance's resources, such as CPU, memory and disk, to learn which resources each instance prefers, then intelligently raise the limits of its preferred resources and lower the limits of the others.
More Quick: we will analyze in real time the mapping between hosts and containers, each container's limit parameters and its historical resource growth rate, and defragment the container's host in advance, so that every container can be expanded vertically, greatly speeding up capacity expansion.
More Cheap: we will provide a completely self-developed storage engine. We plan to integrate the query engine with the storage engine, and provide a multi-model database engine to unify multiple data models and greatly save the resources required for database services and the cost of research and development.
More Friendly: whether it is ContainerDB or our self-developed multi-model database, we will be fully compatible with MySQL protocol and syntax, thus making the migration cost of existing applications close to zero.
More Open: after being honed by a wide range of scenarios inside JD.com, ContainerDB will embrace open source, and we hope to keep improving it together with colleagues in the industry. Our subsequent multi-model database will also eventually be contributed to the open-source community, and we look forward to it serving the industry.