The misuse of mysql to measure the delay time of MYSQL active and standby by SECONDS_BEHIND_MASTER 07/19 Update SLTechnology News&Howtos

The misuse of mysql to measure the delay time of MYSQL active and standby by SECONDS_BEHIND_MASTER

2025-07-19 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Shulou(Shulou.com)06/02 Report--

Link: http://www.woqutech.com/?p=1116

MySQL itself provides Seconds_Behind_Master through show slave status, which is used to measure the replication delay between master and slave. Today, however, we encountered a scenario where the Seconds_Behind_Master is 0 and the show slave status of the slave database shows that the IO/SQL thread is normal, but the changes on the master database of MySQL cannot be synchronized to the slave database for a long time. If there is no human intervention, MySQL will not automatically reconnect to the primary library until an hour later and continue to replicate the changes to the primary library.

Scope of influence: all versions of MySQL, Percona, MariaDB.

Although this scenario is very special and the probability of encounter is not high, I personally feel it necessary to remind DBA people who use MySQL. Through the analysis of this scenario, it also helps us to have a deeper understanding of the MySQL replication retry mechanism.

1. Reproduction steps

Set up master / slave replication, temporarily disconnect the network of the master database, and kill the binlog dump thread of the master library MySQL.

Observe the replication of slave database at this time. In show slave status:

Slave_IO_Running: Yes

Slave_SQL_Running: Yes

Seconds_Behind_Master: 0

But at this time, after you restore the network, if you make any changes in the main database, the standby database will not be able to get data updates. And the show slave status on the repository shows that the IO thread SQL thread is fine, and the replication delay has been zero.

Everything is normal, ordinary monitoring software will not find that there is a data delay in the database.

Second, principle analysis

MySQL's Replication is a key difference from other databases. It is also the basis for scalability and high availability. It itself is very intelligent, as long as we call Change Master to specify the Binlog file name and offset location, we can build a replication relationship from the main library to the standby library.

The MySQL replication thread automatically records the current replication location, automatically connects to the primary library when the primary and standby replication is interrupted, and resumes replication from the location where it was last interrupted. These operations are fully automated and do not require human intervention. This brings a lot of convenience to MySQL DBA, but also hides a lot of details.

To really understand the truth of the previous problem and how to solve it, we still need to really understand the principle of MySQL replication.

2.1 push or pull

First of all, the replication of MySQL is "push", not "pull". "pull" means that the slave database of MySQL constantly asks the master database whether there is any data update, which consumes a lot of resources and is inefficient. "push" means that the master database of MySQL pushes the change to the slave database when it has data updates, which will interact only when the data changes, resulting in low resource consumption. If you are a programmer, you will definitely choose to "push".

Then MySQL exactly how to "push" the column, in fact, when the standby database applies for data change records from the main database, it needs to specify the specific number of bytes from which file (MASTER_LOG_FILE) of the main database Binlog offset position (MASTER_LOG_POS). Accordingly, the main library will start a thread of Binlog dump and send the records of changes to the standby database one by one from this location. The standby database has been listening for changes from the main database and receives one before the data change is applied locally.

2.2 Analysis of the reasons

From the above analysis, we can roughly guess why show slave status shows that everything is normal, but in fact, the changes to the main database cannot be synchronized to the standby database:

When there was a problem, we kill the Binlog dump program. As the listening party, the standby database has not received any changes, it will think that there are no changes on the main database for a long time, resulting in no change data pushed. The standby database cannot tell whether the corresponding Binlog dump thread on the main database has terminated unexpectedly or has not changed any data for a long time. So, in both cases, the repository appears to be normal.

Of course, MySQL will try to avoid this situation. For example:

When Binlog dump is dropped by kill, notify the standby thread that it is dropped by kill. Therefore, when we reproduce, we need to make sure that this notification cannot be sent to the slave database, that is, the key to the recurrence of this problem is that the Binlog dump message caused by kill cannot be sent to the slave database due to network congestion or other reasons.

If the standby library does not receive changes from the main library for a long time, it will reconnect to the main library at regular intervals.

2.3 problem avoidance

Based on the above analysis, we know that MySQL is really unavoidable in this case, so what can we do to avoid columns:

1. Passive processing: modify the monitoring method of delay, find problems and deal with them in time.

two。 Active prevention: correctly set-master-retry-count,-- master-connect-retry,-- slave-net-timeout replication retry parameters.

L passive processing

Most of the delay monitoring of MySQL collects Seconds_Behind_Master from show slave status directly. In this case, Seconds_Behind_Master cannot really measure the replication delay between master and standby. We suggest that the master / slave delay scheme can be obtained by polling the insertion time information in the main database and through the time difference between the copy to the standby database. Percona provides a similar scheme, pt-heartbeat.

After discovering this problem, we only need stop slave; start slave; to restart replication to solve this problem.

L active prevention

MySQL can specify three parameters for the replication thread to reconnect to the main library:-- master-retry-count,-- master-connect-retry,-- slave-net-timeout.

Master-connect-retry and master-retry-count need to be specified when Change Master builds master / slave replication, while slave-net-timeout is a global variable that can be set online when MySQL is running.

The specific retry strategy is: if the slave database has not received the data from the master database after slave-net-timeout seconds, it will start to retry for the first time. Then every master-connect-retry second, the standby library tries to reconnect to the primary library again. It won't give up the retry until it retries master-retry-count several times. If the main library is connected during the retry, it thinks that the current main library is good and will start waiting for slave-net-timeout seconds.

Slave-net-timeout defaults to 3600 seconds, master-connect-retry defaults to 60 seconds, and master-retry-count defaults to 86400 times. In other words, if no data changes are sent to the master database for an hour, the standby database will try to reconnect to the master database. This is why, in our simulated scenario, it takes an hour for the standby database to reconnect to the main database and continue to synchronize data changes.

In this way, if your main library changes frequently, you can consider making the slave-net-timeout setting smaller to prevent the main library Binlog dump thread from terminating and unable to push the latest updates.

Of course, it is also a problem if the slave-net-timeout setting is too small, which will lead to frequent reconnection of the standby database to the main database if there are really few changes to the main database, resulting in a waste of resources.

The delay monitoring of active and standby replication in Warhol Technology's Q Monitor monitoring is not monitored through Seconds_Behind_Master. It uses a way similar to pt-heartbeat to monitor the replication delay between the master and the slave.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.