What is the solution of highly available master-slave replication delay 07/12 Update SLTechnology News&Howtos

What is the solution of highly available master-slave replication delay

2025-07-12 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/01 Report--

What this article shares with you is about the solution of highly available master-slave replication delay. The editor thinks it is very practical, so I share it with you to learn. I hope you can learn something after reading this article. Let's take a look at it with the editor.

The delay of MySQL master-slave replication has been a perplexing problem in the industry for a long time. The emergence of delay will reduce the value of master-slave read-write separation, which is not conducive to the use of MySQL for services with high real-time data.

UDB is a cloud database service launched by UCloud. It has been online for six years and has operated tens of thousands of UDB MySQL instances. In addition to providing highly available, high-performance, easy-to-use product features, the team also helps users solve the delay of 2-3 master-slave replication of MySQL instances per day. From a large number of practices, we have summarized the various causes and solutions of master-slave replication delay, which are shared here.

The importance of the delay problem

The master-slave replication mechanism is widely used in the internal implementation of UDB: the slave database and master database created by UDB adopt the data replication of "master-slave replication"; in addition, the main product of UDB, "UDB MySQL high availability instance", also uses the "double master mode" of two databases to replicate data, and the core of the double master mode is the master-slave replication mechanism.

If there is a delay between master-slave replication, it will affect the consistency of master-slave data.

In the highly available replication scenario, we consider in the UDB high availability disaster recovery design that if there is inconsistency between master and slave data, the switching of high availability disaster recovery is not allowed by default. Because in the case of inconsistency between master and backup data, disaster recovery switching occurs at this time, and data is written in the new main database, then from a business point of view, it will have unexpected and serious consequences.

Replication delay will not only lead to adverse consequences in the high availability of UDB, but may also have an impact on business if replication delay is generated from the library in the scenario of read-only slave library, such as read-write inconsistency-new / modified data cannot be found, and so on.

Thus it can be seen that the delay of master-slave replication needs special attention in database operation. In general, DBA executes' SHOW SLAVE STATUS', on the library and observes

'The value of Seconds_Behind_Master', you can understand the data replication latency between the current database and its main database. This value is so important that in the UDB monitoring interface, we extract this value separately and design the "synchronization delay from the library" monitoring item, so that the operation and maintenance staff can observe it directly on the console.

Analysis and solution of delay problem in production Environment

We summarize the most common master-slave replication delay cases into several categories, and the following is the phenomenon description, cause analysis and solution summary of the related cases.

◆ case 1: frequent DML requests in the main database

During the peak period of business, especially when there are a large number of write request operations in the database master database, that is, a large number of concurrent operations such as insert, delete, update and so on, master-slave replication delay will occur.

Phenomenon description

By observing the QPS value of the write operation of the master database, we can see that the QPS value of the write operation of the master library suddenly increases. With the increase of the master-slave replication delay, we can judge that it is due to the frequent DML requests of the master library.

As shown in the above picture, you can see that the QPS increases suddenly around 17:58. Check the write-related QPS on the console, and there is also a corresponding improvement. While the time of the sudden increase in QPS, the corresponding delay is also gradually increasing, as shown in the following figure.

Cause analysis

After analysis, we believe that this is due to a large number of write request operations in the main library, resulting in a large number of binlog in a short time. These operations need to be synchronized to the slave library and executed, resulting in master-slave data replication latency.

In-depth analysis of the reason is that during the business peak, the data written by the master database is written concurrently, while the slave library SQL Thread plays back the binlog log in a single thread, which can easily cause relaylog accumulation and delay.

Solution idea

If it is a version below MySQL 5.7, you can do sharding to break up write requests through horizontal expansion (scale out) and improve the parallelism of writing requests to binlog.

In the case of MySQL 5.7 or later, parallel replication based on logical clock (Group Commit) is used in MySQL 5.7. In MySQL 8.0, parallel replication based on Write Set is used. Both of these schemes can improve the performance of playback binlog and reduce the delay.

◆ case 2: the main library executes big transactions

A major event refers to the execution of a transaction, which takes a very long time. Common statements that generate large transactions are:

A large number of slow import data statements are used, such as INSERT INTO $tb, SELECT * FROM $tb, LOAD DATA INFILE, etc.

Use UPDATE, DELETE statements, full table UPDATE and DELETE for a large table, and so on.

When this transaction is performing a playback operation from the slave library, a master-slave replication delay may occur.

Phenomenon description

When we analyze the results of SHOW SLAVE STATUS, we will find that the Exec_Master_Log_Pos field has not changed, and the second_behinds_master continues to increase, while the value of the Slave_SQL_Running_State field is "Reading event from the relay log". At the same time, by analyzing the master database binlog and looking at the transactions currently executed by the master database, we will find that there are some large transactions, which can basically be determined to be the master-slave replication delay caused by the execution of large transactions.

Cause analysis

When a large transaction is recorded in binlog and synchronized to the slave library, it takes a long time to execute the transaction from the slave library, which results in a master-slave replication delay.

For example, if the master database takes 200s to update a large table, if the configuration of the master and slave libraries is similar, the slave database needs to update the large table in almost the same time, and the subsequent events cannot be updated since the database delay begins to accumulate.

Solution idea

For the master-slave replication delay caused by this situation, our improved method is to split the major event statements into several small transactions, so that we can commit in time and reduce the master-slave replication delay.

◆ case 3: the main library executes DDL statements on large tables

DDL, whose full name is Data Definition Language, refers to statements that modify the structure of a table, such as adding a field or an index to a table, and so on. Master-slave replication delays may occur when DDL executes DDL statements on a large table of the master database.

Phenomenon description

From the phenomenon, if you execute the output of SHOW SLAVE STATUS from the library and check that the Exec_Master_Log_Pos has not been moved, then it is possible that the DDL of the large table is being executed if the main library is excluded from performing large transactions. This is combined with the analysis of the main library binlog, which can be confirmed by looking at the transactions currently executed by the main library.

The execution of DDL statements can be further subdivided to better judge:

1.DDL does not start and is blocked. The result of SHOW SLAVE STATUS can check that Slave_SQL_Running_State is waiting for table metadata lock and Exec_Master_Log_Pos remains unchanged.

2.DDL is being executed, and SQL Thread single-threaded applications result in increased latency. In this case, by observing the result of SHOW SLAVE STATU, we can find that Slave_SQL_Running_State is altering table, but Exec_Master_Log_Pos is unchanged.

If this is the case, it is likely that the master library executes the DDL statement on the large table, synchronizes to the slave library and plays back the slave library, resulting in a master-slave replication delay.

Cause analysis

The cause of the master-slave replication delay caused by DDL is similar to that of large transactions, and the master-slave replication delay is also caused by the slow execution of binlog of DDL from the slave library.

Solution idea

In this case, we mainly find the blocking DDL statement through SHOW PROCESSLIST or query against information_schema.innodb_trx, and KILL the related query, so that DDL can execute normally in the slave library.

The delay caused by DDL itself is unavoidable, so it is recommended to consider:

Avoid business peak and arrange for implementation during the business trough as far as possible

After set sql_log_bin=0, manually execute DDL on the master / slave database respectively (this operation will cause data inconsistency in some DDL operations, so be sure to test it strictly). If you use cloud database UDB, you can contact the UCloud UDB operation and maintenance team for assistance.

◆ case 4: the configuration of master library and slave library is inconsistent.

If the master library and slave library use different computing and storage resources, or use different kernel tuning parameters, it may cause master inconsistency.

Phenomenon description

We will compare the performance monitoring data of the master database and the slave database in detail. If we find that there is a great difference in the monitoring data, we can make a clear judgment by viewing the configuration of the master and slave.

Cause analysis

Differences in the configuration of various hardware or resources may lead to differences in master-slave performance, resulting in master-slave replication latency:

Hardware: for example, if the master library instance server uses SSD disks, while the slave library instance servers use ordinary SAS disks, then the writes generated by the master library cannot be digested immediately on the slave library, resulting in master-slave replication delay

Configuration: for example, inconsistent raid card write policy, inconsistent OS kernel parameter settings, inconsistent MySQL disk setting policies, and so on, are all possible reasons.

Solution idea

Consider unifying the configuration of the DB machine (including hardware and option parameters) as much as possible. Even for some OLAP services, the hardware configuration of the slave instance is slightly higher than that of the master database.

◆ case 5: table lacks primary key or appropriate index

If the table of the database lacks a primary key or an appropriate index, a master-slave replication delay may occur when the binlog_format of master-slave replication is set to 'row'.

Phenomenon description

When we check the database, we will find that:

Observe the output of SHOW SLAVE STATUS and find that Slave_SQL_Running_State is Reading event from the relay log

SHOW OPEN TABLES WHERE in_use=1 's table has always existed.

Observe that the Exec_Master_Log_Pos field of SHOW SLAVE STATUS remains unchanged

The CPU of the mysqld process is close to 100% (when there is no read business), there is little pressure on the IO.

In the case of these phenomena, it can be considered that there are tables that lack primary keys or unique indexes.

Cause analysis

When the binlog_format of master-slave replication is set to 'row', for example, there is a scenario where the master database updates 200000 rows of data in a 5 million table. Binlog records 200000 update operations to binlog in row format, that is, one record is updated for each operation. If this statement happens to have a bad execution plan, such as a full table scan, then each update statement requires a full table scan. At this point, SQL Thread playback will be very slow, causing serious master-slave replication latency.

Solution idea

In this case, we will check the table structure to ensure that each table has an explicit self-increasing primary key and assist the user in building the appropriate index.

◆ case 6: too much pressure from the library itself

Sometimes, when the performance of the slave library is under great pressure, it can not keep up with the update speed of the master library, resulting in a master-slave replication delay.

Phenomenon description

When observing the database instance, we will find that the CPU load is too high and the IO utilization is too high, which leads to the slow application of SQL Thread. In this way, it can be judged that the master-slave replication delay is caused by the excessive pressure on the slave library itself.

Cause analysis

Some UCloud users use read-write separation mode for the master and slave of the database, and most of the read requests are executed on the slave database. In the scenario where there are a large number of read requests in the business, the performance pressure of the slave database is much greater than that of the master database. Some users even run the OLAP business which consumes computing resources in the slave library, which also poses a higher performance challenge to the slave library, which will cause the delay of master-slave replication.

Solution idea

In this case, we suggest that users build more slave libraries, break up read requests, and reduce the pressure on existing slave instances. For OLAP business, we can set up a slave library to do OLAP business, and allow appropriate master-slave replication delay for this slave library.

When using MySQL's master-slave replication mode for data replication, master-slave replication delay is a key factor to be considered. It will affect the consistency of the data, and then affect the highly available disaster recovery switching of the database.

In the case of master-slave replication delays between databases, based on past experience, our team has summed up the following methods and processes to assist in troubleshooting:

Check the current slave library through SHOW SLAVE STATUS and SHOW PROCESSLIST. (by the way, you can also rule out similar reasons when backing up from the library)

If the Exec_Master_Log_Pos remains unchanged, consider large transactions, DDL, and no primary key, and check the binlog and position corresponding to the main library.

If the Exec_Master_Log_Pos changes, the delay increases gradually, consider the slave machine load, such as IO, CPU, etc., and consider whether the write operation of the master library and the pressure of the slave library itself is too large.

The above is what the highly available master-slave replication delay solution is, and the editor believes that there are some knowledge points that we may see or use in our daily work. I hope you can learn more from this article. For more details, please follow the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.