JD.com MySQL database master-slave switching automation

2025-04-21 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

1. Background

With the rapid growth of JD.com's business, data has become critical to the company, and the value of the database that stores and serves that data is evident. To better achieve database high availability, ensure continuous service, simplify DBA operations, and shorten database failover time, this automated master-slave switching system was developed.

2. Implementation principle

This system uses MHA for database switching and is customized to the characteristics of database switching at JD.com. MHA (Master High Availability) is a relatively mature solution for MySQL high availability. It was developed by Yoshinori Matsunobu at DeNA, a Japanese company, and handles failover and slave promotion in a MySQL high availability environment. During a MySQL failure, MHA can complete the failover automatically within 30 seconds while preserving data consistency as far as possible and recovering as much data as possible after the failure. Combined with Zabbix monitoring and alarms, this provides high availability in practice. Three checks guard against an incorrect switch: Zabbix detection, detection at task creation, and MHA detection.

3. Functions implemented

This system implements dead cut (slave failover and failback, master failover) and live cut (switching a live master, and master failback), with automatic, self-service, and visualized switching.

4. Implementation details

4.1 Dead cut (failover)

When the Zabbix monitoring system detects a database fault, it automatically calls the failover program, which determines whether the master or a slave has failed and handles each case accordingly. All fault information can be viewed in the DBS system.

4.1.1 Master failure

First, a switching task is created in the DBS system. A DBA can also add the IPs of failed masters in batch on the failover page to create switching tasks. The responsible DBA then clicks the switch button, and the system checks for the cases described below.

4.1.1.1 Important steps and principles for switching

Liveness probing. The probe statement was changed from SELECT to INSERT, so that the probe also catches a hung instance and a read-only disk. If no slave is alive, the operation is abandoned and the DBA is notified by email and SMS to handle the failure manually.
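The difference between the two probe styles can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs anywhere; the `heartbeat` table, the function names, and the use of a read-only connection to imitate a read-only disk are all assumptions made for illustration, not the system's actual probe.

```python
import os
import sqlite3
import tempfile
import time


def select_probe(conn):
    """Old-style probe: passes as long as the server answers a read."""
    try:
        conn.execute("SELECT 1").fetchone()
        return True
    except sqlite3.Error:
        return False


def insert_probe(conn):
    """New-style probe: passes only if a write can actually be committed."""
    try:
        conn.execute("INSERT INTO heartbeat (ts) VALUES (?)", (time.time(),))
        conn.commit()
        return True
    except sqlite3.Error:
        return False


def demo():
    """Return (select_ok, insert_ok) for healthy and for read-only storage."""
    path = os.path.join(tempfile.mkdtemp(), "probe.db")
    healthy = sqlite3.connect(path)
    healthy.execute("CREATE TABLE heartbeat (ts REAL)")  # hypothetical table
    healthy.commit()
    healthy_result = (select_probe(healthy), insert_probe(healthy))
    healthy.close()

    # mode=ro stands in for a disk that has gone read-only.
    read_only = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    read_only_result = (select_probe(read_only), insert_probe(read_only))
    read_only.close()
    return healthy_result, read_only_result
```

On the read-only connection the SELECT probe still passes while the INSERT probe fails, which is the failure mode a read-only probe would miss. Catching a hung instance would additionally need a timeout around the probe, which this sketch leaves out.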

Select the new master. The target instance is chosen local first (physical machine before Docker, then fewer connections, then lower QPS load) and then remote (physical machine before Docker, then fewer connections, then lower QPS load).

Call the MHA interface to perform the failover, and update the information in the failure-handling system:

a. MHA first tries to use the slave selected in the previous step as the new master; otherwise it promotes the slave with the most recent data. It then repoints all other slaves to the new master. After that, the domain name switching API is called to point all domain names of the failed master to the new master's IP. If the MHA switch fails, MHA raises an alarm, or any domain name is not switched successfully, the DBA is notified by email and SMS to handle it manually.

b. When the MHA failover ends, the system removes read_only=1 from the mysql.cnf configuration file of the new master and executes RESET SLAVE ALL or STOP SLAVE on the new master.

c. Call the Zabbix host rename API to update the names of the failed master and the new master in the Zabbix monitoring system.

d. A domain name switch does not take effect immediately, so the system checks whether the domain names have actually taken effect. If a domain name has not taken effect within 2 minutes, the system raises a prompt and the DBA must confirm manually.

e. Finally, update the cluster information in the asset database (the master-slave relationship and the database status) and update the fault information table. At the same time, notify the DBA by email and SMS that the failover is complete.

f. Live cutting supports switching multiple clusters at the same time.
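The domain name check in step d can be sketched as a polling loop with a deadline. This is a minimal sketch under stated assumptions: the function name `wait_for_domain`, the 5-second polling interval, and resolving through `socket.gethostbyname` are illustrative choices; only the 2-minute limit comes from the text. The resolver, clock, and sleep function are parameters so the loop can be exercised without real DNS or real waiting.

```python
import socket
import time


def wait_for_domain(domain, expected_ip, timeout=120.0, interval=5.0,
                    resolve=socket.gethostbyname,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until `domain` resolves to `expected_ip` or `timeout` seconds pass.

    Returns True if the switch became visible in time, and False if a DBA
    should be asked to confirm manually.
    """
    deadline = clock() + timeout
    while True:
        try:
            if resolve(domain) == expected_ip:
                return True
        except OSError:
            pass  # a resolution failure counts as "not yet effective"
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would raise the prompt for manual confirmation whenever the function returns False. Note that a local resolver may cache answers, so a real check might need to query the authoritative DNS server directly.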

4.1.1.2 Example

For example, suppose a cluster has one master and four slaves, and the master 10.66.66.66:3366 fails and needs to be switched, as shown below:

1. Zabbix automatically creates the task, and then the DBA performs the switch

2. Select the target instance

If all four slaves in the example are alive, the local-first principle narrows the candidates to 10.66.66.68:3366 and 10.66.66.69:3366. The system then compares their numbers of connections and, after that, their QPS, and selects 10.66.66.69:3366, which has the lower QPS load, as the target instance.

3. Result after the switch completes

4. Details of the switch

4.1.2 Slave failure (handled automatically by the system)

4.1.2.1 Switching principle

The system first checks three conditions: whether the failed instance has no domain name, whether the failed instance is configured for manual switching, and whether the cluster containing the failed instance has no other healthy instance. If any of these holds, the responsible DBA receives an email and SMS alarm and must handle the failure manually.

In all other cases, the system handles the failure automatically. The target instance is selected local first (fewer connections, then lower QPS load) and then remote (fewer connections, then lower QPS load), and the domain name is switched to it. Whether the switch succeeds or fails, the responsible DBA is notified by email and SMS.

If the slave is switched successfully, the responsible DBA can later fail the instance back.
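The decision described in this subsection can be sketched as follows. This is a minimal sketch, not the system's code: the `Instance` fields, the `Plan` result type, and the function name `plan_slave_failover` are hypothetical, and the ranking is assumed to be a strict ordering (location first, then connections, then QPS). Whether the master counts as a candidate target is left to the caller, who decides what goes into `cluster`.

```python
from dataclasses import dataclass, field
from typing import List, NamedTuple, Optional


@dataclass
class Instance:
    host: str
    local: bool               # same site as the failed instance
    connections: int          # current client connections
    qps: float                # current queries per second
    healthy: bool = True
    domains: List[str] = field(default_factory=list)
    manual_only: bool = False  # instance is configured for manual switching


class Plan(NamedTuple):
    action: str                # "manual" or "auto"
    target: Optional[Instance]
    reason: str


def plan_slave_failover(failed: Instance, cluster: List[Instance]) -> Plan:
    """Decide between alerting the DBA and switching domains automatically."""
    if not failed.domains:
        return Plan("manual", None, "failed instance has no domain name")
    if failed.manual_only:
        return Plan("manual", None, "instance is set to manual switching")
    others = [i for i in cluster if i is not failed and i.healthy]
    if not others:
        return Plan("manual", None, "no other healthy instance in cluster")
    # Local before remote, then fewer connections, then lower QPS.
    target = min(others, key=lambda i: (not i.local, i.connections, i.qps))
    return Plan("auto", target, "target selected")
```

In either outcome the caller would still send the email and SMS notification; the plan only determines whether the domain names are switched automatically.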

4.1.2.2 Example

For example, suppose a cluster has one master and four slaves, and the slave 10.88.88.89:3366 fails and needs to be switched, as shown below:

Zabbix automatically creates the task. Following the principle of comparing the number of connections first and QPS second, the system selects 10.88.88.88:3366 as the target instance and switches automatically. The DBA can view the result in the switching task list; hovering the mouse over the execution status shows the details of the switch.

If the task is switched successfully, the failback button will be displayed, and the failback can be performed.

When the DBA performs the failback, the system creates a failback task, and the details of the failback can be viewed.

4.2 Live cut (routine maintenance shutdown switching)

4.2.1 Batch task creation

Enter any IP that belongs to the project to list all available clusters under that project, then tick the clusters you want to switch and submit them to create the tasks in batch.

When creating a task, you can choose whether the target instance should be local or remote. The system first probes the candidate instances for liveness and then recommends an instance on the principle of physical machine before Docker, then fewer connections, then lower QPS load. Any anomaly is reported.

In addition, you can choose whether the new master library is read only after switching.
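The recommendation rule used at task creation, and the local-first rule used for failover, can both be expressed as one sort key. The sketch below assumes a strict lexicographic ordering, in which QPS only matters when the number of connections is equal; the `Candidate` type, the `pick_target` function, and the host names and load figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    host: str
    port: int
    local: bool        # same site as the old master
    docker: bool       # False means physical machine
    connections: int   # current client connections
    qps: float         # current queries per second
    alive: bool = True


def rank_key(c: Candidate):
    # Tuples compare left to right, so earlier fields dominate later ones.
    # False sorts before True, so "not c.local" puts local instances first.
    return (not c.local, c.docker, c.connections, c.qps)


def pick_target(candidates: Iterable[Candidate],
                location: Optional[str] = None) -> Optional[Candidate]:
    """Return the preferred target, or None when nothing is eligible.

    location=None                -> local first, then remote (failover)
    location="local" or "remote" -> restrict to that side (live cut)
    """
    pool = [c for c in candidates if c.alive]
    if location is not None:
        pool = [c for c in pool if c.local == (location == "local")]
    return min(pool, key=rank_key, default=None)


# Hypothetical hosts and load figures, for illustration only.
pool = [
    Candidate("db-a", 3366, local=True, docker=False, connections=40, qps=900.0),
    Candidate("db-b", 3366, local=True, docker=False, connections=40, qps=300.0),
    Candidate("db-c", 3366, local=True, docker=True, connections=5, qps=50.0),
    Candidate("db-d", 3366, local=False, docker=False, connections=1, qps=1.0),
]
print(pick_target(pool).host)  # db-b
```

Here `db-b` wins because it is local and physical, ties with `db-a` on connections, and has the lower QPS. A production system might instead use thresholds or weighted scores; the text does not say which.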

4.2.2 Task switching

Click "switch" to switch the tasks in batch. In each subtask you can view every step of the switch, including each step performed by MHA. After the switch completes, the system waits 2 minutes to verify whether the domain names have actually been switched.

A comparison of the architecture before and after the switch is displayed.

You can also kill all application connections to the old master.
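One way to kill application connections on the old master is to read the process list and issue one `KILL` per client thread. The sketch below only builds the statements, so it can be run without a database. The exclusion lists and the function name `kill_statements` are assumptions; `information_schema.PROCESSLIST` and `KILL` are standard MySQL, and `own_id` would come from `SELECT CONNECTION_ID()`.

```python
from typing import Iterable, List, Tuple

# Threads that are not application connections (assumed exclusion lists).
SYSTEM_USERS = {"system user", "event_scheduler"}
REPLICATION_COMMANDS = {"Binlog Dump", "Binlog Dump GTID"}

# Query whose rows feed kill_statements().
PROCESSLIST_SQL = "SELECT ID, USER, COMMAND FROM information_schema.PROCESSLIST"


def kill_statements(rows: Iterable[Tuple[int, str, str]],
                    own_id: int,
                    keep_users: Iterable[str] = ()) -> List[str]:
    """Build one KILL statement per application connection.

    rows are (ID, USER, COMMAND) tuples as returned by PROCESSLIST_SQL.
    The caller's own connection, system threads, replication dump threads,
    and any user listed in keep_users are left alone.
    """
    protected = SYSTEM_USERS | set(keep_users)
    return [
        f"KILL {int(pid)}"
        for pid, user, command in rows
        if pid != own_id
        and user not in protected
        and command not in REPLICATION_COMMANDS
    ]
```

Passing DBA and monitoring accounts in `keep_users` keeps the operator's own tooling connected while application sessions are dropped.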

4.2.3 Example

There are two clusters under a Mysql_test project, as follows

Cluster 1

Cluster 2

1. Create the tasks in batch

The target is selected on the principle of local before remote, physical machine before Docker, and number of connections before QPS.

For 10.66.66.66:3366, the target master chosen is 10.88.88.89:3366.

For 10.66.55.55:3366, the target master chosen is 10.88.99.91:3366.

2. Switch in batch

In the subtask details you can view the switching result and execution steps of each subtask, along with the architecture before and after the switch.

5. Summary

For both dead cut and live cut, the system is exposed as a service with a user interface. A switch takes at most two steps (create the task, then perform the switch), or it can be fully automated. Full automation requires the consent of the business side, because some business databases must have the switch confirmed after a failure. Live cut can also be handed over to the business side for self-service switching. The system is currently running well. It saves DBAs a great deal of time, improves database high availability, ensures continuous service, simplifies DBA operations, and shortens database failover time.
