Analysis of the causes of master-slave replication delay! 07/12 Update SLTechnology News&Howtos

Analysis of the causes of master-slave replication delay!

2025-07-12 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)06/01 Report--

Write at the front:

Before in the maintenance of online master-slave replication architecture, encountered some master-slave delay problems, the author, is also relatively easy to learn, ! Therefore, after consulting a lot of materials, and then going to the dross, I sorted out this article according to my own understanding and consulting materials. I would like to declare here that the content of this article is the author's own understanding, does not represent authority, and is only for your colleagues to make a reference. I also make a study record. In line with the principle of seeking truth from facts, for many materials on the Internet, the author only believes 5 points, suspects 5 points, even if the official information, some bug is not, only their own practice, can believe, can have the right to speak, other, everything is empty!

MySQL master-slave replication process:

1) the master library Binlog Dump thread actively sends the latest binlog to the slave library when the binlog changes.

2) after receiving the binlog from the master database passively from the library Ipool O thread, record it to the relay log of the slave library, and wait when there is no data coming in. At the same time, the relay log thread replays SQL.

3) when the data from the master database is not received from the database for a long time, and the waiting time exceeds the time defined by slave_net_timeout (default is 3600 seconds), the status of Slave_IO_State will be set to No. After that, every time defined by MASTER_CONNECT_RETRY [Connect_Retry: 30] (the default is 60 seconds), a reconnection will be attempted until the connection is successful or the number of retries is exceeded MASTER_RETRY_COUNT [Master_Retry_Count: 6666] (the default is 86400).

Slave_net_timeout can be modified in the configuration file or set variable online settings

MASTER_CONNECT_RETRY and MASTER_RETRY_COUNT need to be specified in advance when CHANGE MASTER TO establishes a replication relationship.

Change slave_net_timeout online:

SHOW VARIABLES LIKE 'slave_net_timeout'

Variable_name `Value`

Slave_net_timeout 3600

SET GLOBAL slave_net_timeout=1800

Modify the MASTER_CONNECT_RETRY=30,MASTER_RETRY_ count value:

Mysql > change master to MASTER_CONNECT_RETRY=30,MASTER_RETRY_COUNT=6666

ERROR 1198 (HY000): This operation cannot be performed with a running slave; run STOP SLAVE first

Mysql > stop slave

Query OK, 0 rows affected (0.01 sec)

Mysql > change master to MASTER_CONNECT_RETRY=30,MASTER_RETRY_COUNT=6666

Query OK, 0 rows affected (0.01 sec)

Mysql > start slave

Query OK, 0 rows affected (0.02 sec)

How is the MySQL master-slave replication delay formed:

1. The worker thread of the master library works concurrently when writing binlog (parallel writing), while the dump thread of the master database and the IO thread of the slave library both push and pull the binlog with a single thread, especially the SQL thread plays back one by one with the event in the relay log. (version 5.6 enables slave_parallel_workers to support parallel replication in specific cases, and after version 5.7, the delay problem at the replication level has been greatly improved). Therefore, even without considering network latency, consumption in mainstream MySQL versions may not catch up with production in the case of high concurrency, and slave libraries with asynchronous replication may not be able to keep up with the progress of the master library.

2. During the replication period, whether the load of the master or slave database is high (especially the pressure of removing the disk from the database, which is related to the setting of sync_binlog and innodb_flush_log_at_trx_commit) or slow network transmission (especially the synchronization across data centers), master-slave delay will occur, and it is inevitable. If you want to achieve strong consistency, you can use Semi-sync, but the adoption of this plugin does not guarantee continuous strong consistency (rpl_semi_sync_master_timeout can cause degradation of replication mode)

Depending on the cause of the MySQL master-slave replication delay, the following conditions may cause MySQL master-slave replication latency:

1) MASTER has high concurrency, resulting in a large number of transactions

2) the network condition is poor

3) the hardware configuration of slave library is not as good as that of master library.

4) it is originally asynchronous replication.

About the explanation of some parameters at the time of closing:

Sync_binlog:

MySQL official documentation reference: https://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html

Explanation:

Controls how often the MySQL server synchronizes the binary log to disk

Controls how often MySQL synchronizes binary logs (binary log) to disk

Sync_binlog=0:

Disables synchronization of the binary log to disk by the MySQL server. Instead, the MySQL server relies on the operating system to flush the binary log to disk from time to time as it does for any other file. This setting provides the best performance, but in the event of a power failure or operating system crash, it is possible that the server has committed transactions that have not been synchronized to the binary log.

Sync_binlog=1:

Enables synchronization of the binary log to disk before transactions are committed. This is the safest setting but can have a negative impact on performance due to the increased number of disk writes. In the event of a power failure or operating system crash, transactions that are missing from the binary log are only ina prepared state. This permits the automatic recovery routine to roll back the transactions, which guarantees that no transaction is lost from the binary log.

Sync_binlog=N:

Where N is a value other than 0 or 1: The binary log is synchronized to disk after N binary log commit groups have been collected. In the event of a power failure or operating system crash, it is possible that the server has committed transactions that have not been flushed to the binary log. This setting can have a negative impact on performance due to the increased number of disk writes. A higher value improves performance, but with an increased risk of data loss.

Innodb_flush_log_at_trx_commit:

MySQL official documentation reference: https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit

Other reference documents:

Https://blog.csdn.net/codepen/article/details/52160715

Https://www.cnblogs.com/mayipapa/p/4685313.html

Explanation:

It is unique to the InnoDB engine, the refresh mode of ib_logfile.

MySQL log write order:

Log buffer = > MySQL (write) = > log file = > OS Refresh (flush) = > disk

Innodb_flush_log_at_trx_commit value explanation: (referenced from some blogs)

0, delay writing:

Log buffer = > every second = > log file = > OS real-time flush = > disk

1. Write in real time and swipe in real time:

Log buffer = > Real-time = > log file = > OS Real-time flush = > disk

In this way, the database has a very high demand for IO. If the underlying hardware provides poor IOPS, then the concurrency of the MySQL database will soon be unable to improve due to hardware IO problems.

2. Write in real time, delay brushing:

Log buffer = > Real time = > log file = > OS every 1 second = > disk

If only the MySQL database is down, because there is no problem with the file system, then the corresponding transaction data is not lost. Only if the host operating system where the database is located is damaged or suddenly powered off, the transaction data of the database may lose transaction data such as 1 second. This benefit reduces the probability of transaction data loss, and the IO requirements for the underlying hardware are not so high (log buffer is written to the file system, generally only from the memory cache of the file system transferred from log buffer's memory, and there is no pressure on the underlying IO).

Chinese explanation of the official document: (the author's own understanding)

When innodb_flush_log_at_trx_commit is set to 0, the log buffer (log buffer) is written to the log file (log file) once a second, and the log file is flush disk (disk refresh). In this mode, the write to disk is not actively triggered when the transaction commits.

When innodb_flush_log_at_trx_commit is set to 1, when each transaction commits, the log buffer (log buffer) is written to the log file (log file), and the log file is flush disk (flush disk) [simultaneous]

When innodb_flush_log_at_trx_commit is set to 2, the log buffer (log buffer) is written to the log file (log file) when each transaction is committed, but no disk refresh (flush disk) is performed on the log file. In this mode, MySQL performs a flush (flush to disk) operation every second.

However, when the innodb_flush_log_at_trx_ value is 2, the disk refresh (flush disk) of the log file (log file) occurs once a second.

Because of scheduling problems, a refresh per second is not 100% guaranteed. You can achieve better performance by setting the innodb_flush_log_at_trx_commit value not to 1, but if you set this value to 0, the MySQL crash will lose the transaction 1 second before the crash (the best performance in this mode, but not × × whole); if you set this value to 2, the transaction in the last second will be lost only when the operating system crashes or is powered off (it has better performance in this mode and is safer than 0 mode). If you set this value to 0, the mode has the lowest performance, but is the safest. In the case of a MySQL service crash or the operating system crash, binlog loses only one statement or transaction.

Note that many operating systems and some disk hardware can spoof flushing to disk operations (flush disk). Although the refresh is not carried out, the MySQL is told that the refresh has taken place. Even if this value is set to 1, the persistence of the transaction is not guaranteed, and in the worst case, a power outage can even destroy the database. In the SCSI disk controller, or in the disk itself, using a disk cache with a backup battery speeds up file refresh and makes the operation more secure. You can try using the hdparm command to disable disk write caching in the hardware cache, or use some other commands specific to the hardware provider

Finally, the official recommendation of MySQL:

A higher value improves performance, but with an increased risk of data loss

For the greatest possible durability and consistency in a replication setup that uses InnoDB with transactions, use these settings:

Sync_binlog=1

Innodb_flush_log_at_trx_commit=1

How to calculate the MySQL master-slave delay:

The first method of calculation: position (simple and rude, you can only see if there is a delay)

Mysql > show slave status\ G

Read_Master_Log_Pos: 4

Exec_Master_Log_Pos: 0

The second method of calculation: Seconds_Behind_Master

Refer to the MySQL official documentation:

Https://dev.mysql.com/doc/refman/5.6/en/replication-administration-status.html

Https://dev.mysql.com/doc/refman/5.6/en/show-slave-status.html

Refer to other articles:

Https://blog.csdn.net/leonpenn/article/details/76489480

1) when the SQL thread executes event, the time stamp attached to the event is subtracted from the timestamp value of the execution time of the library (timestamp on the main library at that time), the difference between the two

2) once the SQL thread is not executing event, the Seconds_Behind_Master is 0

3) if IO thread or SQL thread is not running, Seconds_Behind_Master is NULL

4) when Seconds_Behind_Master is 0, there is no delay in master slave replication (this is the case in most cases)

When the network is very fast, io thread quickly gets the realy log from binlog to slave from master, and then sql thread replay. At this time, the Seconds_Behind_Master value can really show the number of seconds that slave lags behind master.

In the case of poor network, the binlog of io thread synchronizing master is slow, while sql thread replay is very fast, and Seconds_Behind_Master is also 0, but the real situation is that slave lags far behind master.

If the following three situations occur, although Seconds_Behind_Master still has a non-NULL value, it has become inaccurate

1. Master-slave time is inconsistent (although clock_diff_with_master is introduced to adjust the impact of time difference as far as possible, this value is only obtained when a connection is established between the slave database and the master database, and will not be updated later. If you modify the master-slave time later, the value will be unreliable).

2. There is a network problem between the master and slave libraries, or the IO thread of the slave library does not find that the binlog dump thread of the master database is dead and is still waiting for the data to be transferred, and the SBM lasts for a long time to zero.

3. The main library has a large binlog event before and after execution, the SBM seen from the library will have a big jump (glitches are likely to occur in the monitoring chart)

4. For parallel replication, SMB is based on Exec_Master_Log_Pos and is not accurate.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.