How Does JD.com Zhaopin Cloud MySQL Database Ensure Data Reliability?

2025-04-08 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

How does JD.com Zhaopin Cloud's MySQL database ensure data reliability? Many newcomers are unclear on this point. To help answer the question, this article explains it in detail. We hope you find it useful.

As the most popular relational database today, MySQL plays a central role in systems across many industries. As organizations increasingly recognize the value of their data, data reliability has become one of the most frequently asked questions. How does MySQL ensure data reliability? What optimizations and features has JD.com Zhaopin Cloud RDS-MySQL added to guarantee the reliability and consistency of user data? This article answers these questions one by one.

MySQL's InnoDB storage engine supports the ACID properties (Atomicity, Consistency, Isolation, Durability), and it is precisely the guarantees of consistency and durability that make data reliable. To ensure reliability while maximizing performance, many relational databases use Write-Ahead Logging (WAL), and MySQL is no exception: it first writes data changes to the log, immediately returns success to the client, and then asynchronously flushes the real data to the data files on disk. If the system fails in between, the data can still be recovered as long as the log is not lost, which ensures data reliability.

MySQL writes two kinds of logs: the binlog and the redo log. The following sections describe how each is written.

During transaction execution, MySQL records all changes in the binlog cache and writes them to the binlog file together when the transaction commits.

The binlog cache size is controlled by the parameter binlog_cache_size, which defaults to 32KB. If a transaction is large and its changes exceed the binlog cache, the overflow is written to disk. You can check how many times the binlog cache has spilled to disk with the command show global status like 'Binlog_cache_disk_use';. If that number is large, increasing binlog_cache_size is recommended.
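As a quick sketch of this check (the 1MB value below is illustrative, not a recommendation; tune it to your workload):

```sql
-- How many transactions spilled the binlog cache to a temporary disk
-- file, versus how many fit in memory:
SHOW GLOBAL STATUS LIKE 'Binlog_cache_disk_use';
SHOW GLOBAL STATUS LIKE 'Binlog_cache_use';

-- Current per-session cache size (default 32KB):
SHOW GLOBAL VARIABLES LIKE 'binlog_cache_size';

-- If Binlog_cache_disk_use keeps growing, enlarge the cache,
-- e.g. to 1MB; persist the change in my.cnf so it survives restarts:
SET GLOBAL binlog_cache_size = 1048576;
```

Note that binlog_cache_size is allocated per session, so raising it increases memory use for every connection that writes transactions.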

Each thread allocates its own binlog cache, but all threads share a single binlog file. The flow is as follows:

Writing the log file to the system involves two steps: write and fsync. write copies the data into the operating system cache; fsync persists it to the disk file and consumes the system's IOPS. Their timing is controlled by the parameter sync_binlog.

sync_binlog=0: on transaction commit, only write is performed; fsync timing is left to the operating system. This is the most dangerous setting: if the operating system crashes, all binlog changes not yet flushed are lost.

sync_binlog=1: both write and fsync are performed on every transaction commit. This is the safest setting, but also the most expensive in performance.

sync_binlog=N: write is performed on each commit, and fsync once every N transactions. If the operating system crashes, some of the accumulated binlog changes can be lost.

During transaction execution, redo entries are likewise written first to the in-memory redo log buffer and then to the disk file; the redo log buffer is shared by all threads. As with the binlog, writing the redo log involves the two operations write and fsync, and their timing is controlled by the parameter innodb_flush_log_at_trx_commit.

innodb_flush_log_at_trx_commit=0: on commit, changes are only written to the redo log buffer; a background master thread performs write and fsync to the disk file once per second.

innodb_flush_log_at_trx_commit=1: write and fsync are performed on every commit. This is the safest setting.

innodb_flush_log_at_trx_commit=2: on commit, only write is performed, i.e. the data only reaches the operating system cache, and the background master thread fsyncs to the disk file once per second.

The relationship between this parameter and data reliability is shown in the following table:

innodb_flush_log_at_trx_commit | Database process crash        | Operating system crash
0                              | lose up to 1 second of data   | lose up to 1 second of data
1                              | no data loss                  | no data loss
2                              | no data loss                  | lose up to 1 second of data

The combination sync_binlog=1 and innodb_flush_log_at_trx_commit=1 is what DBAs call the "double one" configuration; it is the safest, most reliable setting for online data.
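The double-one configuration can be applied at runtime like this (persist both settings in my.cnf so they survive a restart):

```sql
-- "Double one": fsync both logs on every transaction commit.
SET GLOBAL sync_binlog = 1;
SET GLOBAL innodb_flush_log_at_trx_commit = 1;

-- Verify:
SHOW GLOBAL VARIABLES LIKE 'sync_binlog';
SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
```

Expect higher commit latency and IOPS consumption in exchange for the no-data-loss guarantee in the table above.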

Comparing the binlog and the redo log:

              | binlog                                 | redo log
Recorded by   | MySQL server layer                     | InnoDB engine
Recorded when | at transaction commit                  | continuously, triggered by multiple conditions
Content       | logical log (row or statement format)  | physical log (data page changes, idempotent)

How do the binlog and redo log work together to guarantee data reliability? The answer is two-phase commit, which guarantees consistency between the two logs. The following figure shows how they are recorded when a transaction commits:

If the system fails at any point in this process, data consistency can still be guaranteed in every state:

State | Handling
InnoDB has already committed, so the binlog necessarily contains the transaction's event | Consistent; no action needed
InnoDB has prepared and the binlog already contains the transaction's event, but InnoDB has not committed | Roll forward: InnoDB commits these transactions
InnoDB has prepared but the event was not recorded in the binlog, so it was never replicated to slaves | Roll back
InnoDB has not completed prepare, and the binlog must not contain the event | Roll back

The parameter innodb_support_xa controls whether two-phase commit is enabled. It is on by default; if disabled, transactions may be written to the binlog in a different order, and crash recovery or xtrabackup-based recovery then risks data inconsistency. The parameter is deprecated as of MySQL 5.7.10, where two-phase commit is always enabled.

As MySQL has evolved, its clustering has progressed from asynchronous master-slave replication through semi-synchronous replication to group replication. All of them are built on the binlog, so MySQL's data replication depends on it, and data consistency across the cluster is tied to it. Two points deserve special attention.

1. Binlog format: statement, row, or mixed. The statement format records SQL statements directly in the binlog file; because the master and slave are two separate servers running in different environments, this risks inconsistency, for example with statements like delete from t limit 100. The row format is therefore recommended for online environments.
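Switching to the recommended row format can be sketched as follows; note that SET GLOBAL only affects sessions opened after the change, so existing connections keep their previous format until they reconnect:

```sql
-- Check the current binlog format:
SHOW GLOBAL VARIABLES LIKE 'binlog_format';

-- Switch new sessions to row-based logging (persist in my.cnf as well):
SET GLOBAL binlog_format = 'ROW';
```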

2. Replication delay. When the slave lags, cluster data becomes inconsistent. There are many possible causes; here are some that frequently occur online:

A) Large transactions. The binlog is written to the file only at transaction commit, and only then can the slave read the changes; so for a large transaction, the slave begins applying it only after the master has committed, producing lag.

B) High concurrency. Versions 5.6 and 5.7 both support parallel replication, but the degree of parallelism is limited; when concurrency on the master is high, the slave falls behind.

C) Table structure. If a table on the master has no primary key and the binlog is in row format, SQL that updates many rows forces the slave to perform repeated full table scans, causing delay.
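A common fix is simply to give such tables a primary key; the table and column names below are hypothetical, for illustration only:

```sql
-- Without a primary key, applying each row-format event on the slave
-- may require a full table scan. Adding a surrogate key avoids that:
ALTER TABLE orders
  ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;
```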

D) Lock waits. The slave usually serves as the backup node, and backing up with xtrabackup executes FLUSH NO_WRITE_TO_BINLOG TABLES and FLUSH TABLES WITH READ LOCK; in some cases these two operations block the replication SQL thread and cause delay.

The JD.com Zhaopin Cloud RDS-MySQL cluster uses a master-slave replication architecture. To ensure the reliability and security of users' stored data, we have optimized and improved a series of key processes, taking user data security as our responsibility and user experience as our focus.

1. Physical environment

Hardware: high-performance NVMe disks on the latest physical machine models.

Network: cross-AZ network latency within 1.2ms, with 10-gigabit network cards.

2. Software environment

On the data plane, drawing on JD.com's experience optimizing high-concurrency, high-reliability business systems, JD.com Zhaopin Cloud has tuned the RDS operating system configuration and MySQL parameters to ensure the reliability of cluster data.

On the control plane, several groups of delay monitors and alarms watch for replication lag; depending on the cause of the delay, different optimization logic is triggered to reduce it automatically.

When a physical machine fails or data must be migrated, the MySQL cluster's high-availability (HA) switchover comes into play. Because of how MySQL replication works, data loss is possible during a switchover; JD.com Zhaopin Cloud RDS-MySQL therefore prioritizes user data consistency, performing the switch only after judging that the cluster data is fully consistent, so that user data is neither lost nor incorrectly written.

The complexity of MySQL HA switchover lies not in the switch itself but in judging when to trigger it. The automatic HA decision process of RDS-MySQL is described below.

A sentinel service checks database and operating system status; if it finds the instance abnormal, it triggers checks and a voting mechanism across multiple groups of sentinel services, confirming that the service is truly unavailable before starting the switchover.

The master reports GTID information in real time. When automatic HA is triggered, i.e. the master service is unavailable, the system first compares the Retrieved_Gtid_Set values of the slaves to confirm that each slave's IO thread has pulled all of the master's binlog.

It then compares each slave's Retrieved_Gtid_Set and Executed_Gtid_Set ranges to confirm that all binlog pulled by the slave has been fully applied.
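The two GTID checks above can be performed manually on a slave with SHOW SLAVE STATUS (MySQL 5.6/5.7 naming):

```sql
-- Run on each slave candidate before promoting it:
SHOW SLAVE STATUS\G
-- Relevant fields in the output:
--   Retrieved_Gtid_Set : GTIDs the IO thread has fetched from the master
--   Executed_Gtid_Set  : GTIDs the SQL thread has already applied locally
-- Promotion is safe when every GTID in Retrieved_Gtid_Set is also
-- contained in Executed_Gtid_Set, i.e. nothing fetched remains unapplied.
```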

After the HA switchover completes, cluster data is checked for consistency and the process of building a new slave is triggered.

Database backups are the last line of defense for data security: in the extreme case where no node in the cluster is usable, backups are what guarantee data reliability and safety. We have optimized the RDS-MySQL backup and recovery process so that during disaster recovery the user system's recovery time is as short as possible and the recovered data is as recent as possible:

Daily full backups plus real-time binlog backup.

All backups are uploaded to object storage, kept in multiple copies, and stored in multiple regions.

Backup data validity is verified regularly.

Important operations such as HA switchover, scaling, and deletion force a database backup first.

Soft delete and single-database/single-table recovery are supported.

JD.com Zhaopin Cloud RDS-MySQL has encountered many data-reliability cases in production. Here is a typical one:

Problem

A user deleted part of an online system's data through human error.

Request

The user submitted a ticket asking to quickly restore the deleted table's data to a specified point in time.

Solution

The console provides fast point-in-time recovery for a single database or single table. Technical support sent the user the documentation for this feature, and the user restored the deleted data through self-service operation.

Takeaway

RDS-MySQL pushes backup and recovery to their full potential: the two backup types map to multiple recovery workflows, letting users meet database recovery needs quickly and safely.

The RDS-MySQL recovery process supports:

1. Instance creation from a point in time

2. In-place recovery of a single database or single table from a point in time

3. Creation from a backup, and in-place overwrite restore from a backup

Did reading the above help you? If you want to learn more or read more related articles, please follow the industry information channel. Thank you for your support.
