What are the MySQL failures worth understanding in depth?


This article explains in detail which MySQL failures are worth understanding in depth. The editor shares it here as a reference, and hopes you will have a clearer picture of the relevant topics after reading it.

I. Modes and principles of master-slave replication

MySQL master-slave replication can simply be interpreted as data that can be replicated from a MySQL database server master node to one or more slave nodes. MySQL uses asynchronous replication by default, so that the slave node does not have to access the master server all the time to update its own data. The data update can be carried out on the remote connection, and the slave node can copy all databases or specific databases or tables in the master database.

The main uses of MySQL master-slave replication include read-write separation, real-time data backup (convenient failover when a node in the system fails), architecture scaling, high availability (HA), and so on.

The main forms of MySQL master-slave replication include one master with multiple slaves, multiple masters with one slave, dual-master replication, and cascade replication (some slave nodes do not connect to the master node for data synchronization but instead connect to other slave nodes; if the master has too many slave nodes, replication eats into its performance, so some slave nodes connect to the master while other slave nodes connect to those slaves as second- or third-level replicas, and so on).

MySQL master-slave replication involves three threads: one running on the master node (the log dump thread) and the other two (the I/O thread and the SQL thread) running on the slave node, as described below.

Master node binary log dump thread

When the slave node connects to the master node, the master node creates a log dump thread to send it the contents of the binlog. When reading an event from the binlog, this thread takes a lock on the binlog on the master node; the lock is released as soon as the read completes, before the event is sent to the slave node.

Slave node I/O thread

When the start slave command is executed on the slave node, the slave node creates an I/O thread that connects to the master node and requests the updated binlog from it. After receiving the update from the master node's binlog dump thread, the I/O thread saves it in the local relay log.

Slave node SQL thread

The SQL thread is responsible for reading the contents of the relay log, parsing it into specific operations and performing them, and finally ensuring the consistency of the master-slave data.

To achieve master-slave replication, the binary log of the Master node must be enabled, because the whole replication process is essentially the Slave node fetching the operations recorded in that log from the Master node and then replaying them on itself in the original order.

The I/O thread on the slave node connects to the master node and requests the log contents starting from a specified position in a specified log file.

After the master node receives the I/O request from the slave node, it reads the log information after the specified position in the specified log file and returns it to the slave node. Besides the log content itself, the returned information includes the binlog file name and binlog position of the data being returned. After receiving the content, the slave node's I/O thread writes it to the local relay log and saves the binary log file name and position it has read to the master.info file, so that on the next read it can tell the master node from which binlog position the subsequent logs should start.

When the SQL thread of the slave node detects new content in the relay log, it parses it into the operations that were actually performed on the master node and executes them on its own database.
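As a minimal sketch of how this wiring is set up in practice (the host name, credentials and binlog coordinates below are placeholders, and the pre-8.0 command names used throughout this article are assumed):

-- On the master (my.cnf): binary logging and a unique server id must be enabled
-- [mysqld]
-- server-id = 1
-- log-bin   = mysql-bin

-- On the slave: point the I/O thread at the master and start replication
CHANGE MASTER TO
  MASTER_HOST     = '10.0.0.1',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = '***',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS  = 4;
START SLAVE;

-- Slave_IO_Running / Slave_SQL_Running should both show "Yes"
SHOW SLAVE STATUS\G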

MySQL master-slave replication mode is divided into asynchronous mode, semi-synchronous mode and full synchronous mode.

① Asynchronous mode: the master node does not actively push the binlog to the slave node, so during a failover the slave node may not yet have synchronized the latest binlog locally.

② Semi-synchronous mode: before the master node's thread returns the commit result to the current session, at least one slave node (the number is configurable) must have received the binlog sent by the master, written it to its relay log, and returned an ack signal informing the master node that the log has been received.

③ Full synchronous mode: the slave node receives the binlog sent by the master node, writes it to its relay log, and only after replaying it returns an ack signal to the master node; only then does the master node return success to the client.
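For reference, a hedged sketch of how semi-synchronous replication (mode ②) is typically switched on with the stock semisync plugins on Linux; the timeout value is just an example:

-- On the master:
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 10000;  -- ms; fall back to async if no ack within 10s

-- On each slave:
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;   -- restart the I/O thread so the setting takes effect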

II. How DBbrain judges master-slave delay

From the replication principles described above, it is easy to see that master-slave delay can occur in MySQL under the "asynchronous" and "semi-synchronous" replication modes. Replication delay brings a series of serious problems to the business: a read-write separation architecture cannot serve workloads that require strong real-time consistency, and highly reliable architecture designs struggle to meet their RTO/RPO targets. Detecting, locating and resolving master-slave replication delay has therefore always been one of a DBA's key tasks.

DBbrain provides a 24-hour intelligent database operation and maintenance service for cloud users. How does DBbrain diagnose "master-slave replication delay" failures? Let's uncover this next.

So, first of all, let's briefly introduce how the master-slave delay (replication delay) occurs.

When replication is enabled, the standby database starts two types of threads: the I/O thread connects to the primary database, reads binlog events and writes them to the local relay log; the SQL thread reads events from the relay log and applies them to the standby database. You can view the replication status of the standby database through "show slave status", where the Seconds_Behind_Master value represents the delay in seconds.

Mark the time when the primary executes a SQL statement as T1 and the time when the standby executes the same SQL as T2; the difference between the two is the master/standby delay. However, the delay seen purely from the "show slave status" result may be "inaccurate". Besides the precision problem, this value is also affected by transactions on the primary. If insert/update/delete operations are performed inside an open transaction on the primary but the commit lags by a minute, that time difference also shows up in the standby's replication delay. When the slave delay curve keeps showing fluctuations of around a minute, it is most likely caused by such transactions on the primary; if the delay were calculated from the commit time of the transaction, the large delay would not exist.

To confirm that master and slave data are truly consistent, you need to verify that the binlog file and position match on both sides before acting.

The discovery mechanisms and methods adopted by DBbrain for master-slave delay (replication delay) exception scenarios can be divided into the following three types:

① Using the value of seconds_behind_master

seconds_behind_master in the show slave status result (in essence, this field measures the time difference, in seconds, between the slave SQL thread and the slave I/O thread) can be used to measure the delay between master and slave in seconds. Check whether seconds_behind_master equals 0; if it does, there is essentially no master-slave replication delay. seconds_behind_master is the difference between the timestamp of the event being executed by the sql_thread and the timestamp of the event replicated by the io_thread.

seconds_behind_master can also misrepresent the replication delay in some scenarios, for example:

In a particularly poor network environment, I/O thread synchronization is very slow; each time new events do arrive, the SQL thread can execute them immediately, so the seconds_behind_master observed on the slave is 0, when in fact the slave already lags behind the master by a fair distance.

When no data has been committed for a while, both the slave I/O thread timestamp and the slave SQL thread timestamp stay at an old value, say T (even though the clock on the master has long since moved past T). When the master then commits, the slave I/O thread starts synchronizing the binlog and its timestamp catches up with the master, but the slave SQL thread timestamp is still at T. At that moment seconds_behind_master shows a large value, yet it is not certain that a real delay exists.

② Comparing binlog positions

Master_Log_File and Read_Master_Log_Pos represent the latest position of the master's log that the slave has read; Relay_Master_Log_File and Exec_Master_Log_Pos represent the latest position the standby has executed. If Master_Log_File equals Relay_Master_Log_File and Read_Master_Log_Pos equals Exec_Master_Log_Pos, then all logs that have been received have also been applied.
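In practice this check just reads four fields out of the replication status, along these lines:

SHOW SLAVE STATUS\G
-- Received from the master (I/O thread):  Master_Log_File       / Read_Master_Log_Pos
-- Applied on the standby (SQL thread):    Relay_Master_Log_File / Exec_Master_Log_Pos
-- If both pairs match, every log that has been received has also been executed.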

③ Comparing GTID sets

For database instances with GTID enabled, DBbrain compares GTID sets to detect whether replication delay exists (Auto_Position=1 indicates that the master/standby relationship uses the GTID protocol).

Retrieved_Gtid_Set is the set of GTIDs for all logs received by the standby database.

Executed_Gtid_Set is the set of all GTIDs that have been executed on the standby database.

If the two sets are identical, all logs received by the standby database have been applied. This is more accurate than checking whether seconds_behind_master is 0.
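A small sketch of the GTID comparison using MySQL's built-in GTID set functions; the two set values are placeholders copied from the SHOW SLAVE STATUS output:

-- Returns 1 when everything received (Retrieved_Gtid_Set) is already executed:
SELECT GTID_SUBSET('<Retrieved_Gtid_Set>', '<Executed_Gtid_Set>') AS fully_applied;

-- Lists the GTIDs received but not yet applied (an empty result means no backlog):
SELECT GTID_SUBTRACT('<Retrieved_Gtid_Set>', '<Executed_Gtid_Set>') AS not_yet_applied;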

III. A DBbrain one-click optimization case

Usually the I/O thread does not cause significant replication delay unless network problems cause disconnections, or there are network-latency or bandwidth bottlenecks. Deletion or corruption of the primary's binlog in an automated operations environment is another reason the I/O thread may disconnect. This section mainly analyzes delays in the SQL thread applying events:

Insufficient resources on the standby/read-only instance: besides applying data changes, a standby/read-only instance may also need resources to serve query traffic. When delay is found on the standby, first confirm its resource usage.

Highly concurrent data updates on the primary: during business peaks there are large numbers of concurrent insert, delete and update operations on the primary, QPS rises sharply and a large volume of binlog is produced. The slave cannot apply these events as fast as they are generated, so the delay grows. The way the standby applies events has evolved and been optimized from the original single thread to today's parallel replication. Parallel replication is implemented differently across versions and vendors, for example schema-based, table-based, commit-parent-based or lock-interval-based parallel replication. Enabling group-commit-based parallel replication on the standby can improve binlog replay performance and reduce delay.
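A hedged sketch of enabling group-commit-based parallel replication, assuming MySQL 5.7-style variable names (8.0.26+ uses replica_* equivalents) and an example worker count:

-- On the slave: the SQL thread must be stopped before changing the parallel type
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_type    = 'LOGICAL_CLOCK';  -- parallelize by master group commit
SET GLOBAL slave_parallel_workers = 8;                 -- tune to CPU cores and workload
START SLAVE SQL_THREAD;

-- Optionally, on the master, letting commits group a little longer gives the
-- workers larger batches to parallelize (value in microseconds):
SET GLOBAL binlog_group_commit_sync_delay = 100;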

A large number of updates to a single table on the primary: in row mode, the changes made by one SQL statement on the primary are replicated to the standby as many events. Developers are advised to break large transactions into small ones and commit them promptly. I have also seen a user update the timestamp of 100,000 rows of a single table on the primary; in such scenarios the standby delay grows larger and larger and the standby never catches up, and in the end the only advice is to change the application design.

DDL on the primary: when a large-table DDL statement is replicated to the slave, it breaks parallel replication; subsequent events cannot be applied and the delay accumulates. In this scenario the monitoring curve grows at a 45-degree slope. It is worth mentioning that many developers like to run optimize table frequently on the primary, ignoring the fact that this statement easily causes standby delay. Delay caused by DDL is relatively easy to locate: the executing DDL statement can be seen through a "show processlist" session snapshot. For delay caused by large-table DDL, experienced DBAs choose to disable the sql_log_bin parameter and execute the DDL statement manually on the slave.
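A sketch of that DDL workaround; the table and index names here are hypothetical, and the session-level sql_log_bin switch only affects the current connection:

-- On the master: run the DDL without writing it to the binlog
SET SESSION sql_log_bin = 0;
ALTER TABLE big_table ADD INDEX idx_created_at (created_at);
SET SESSION sql_log_bin = 1;

-- Then run the same ALTER TABLE manually on each slave so the schemas stay in sync.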

Slow SQL execution on the slave: in row mode, the lack of a primary key or a suitable index is the main cause of slow SQL execution. In online cloud environments, if the user does not specify a primary key when creating a table, the database usually adds an implicit primary key automatically to avoid this problem.

Standby transactions blocking replication: replication delay or interruption caused by long-running or uncommitted transactions on standby/read-only instances is easily overlooked. For example, a transaction is opened on the slave and a query is executed without committing; a DDL statement replicated from the primary then waits for the MDL lock, and that waiting DDL in turn blocks all subsequent events from being applied.

The following picks one of these problems and, through a scenario description, briefly reconstructs the logic of the whole optimization process. A transaction is started on the read-only instance for a query and is never committed. (Note: starting a transaction just to run a query is a common misuse. The developer does not necessarily write this explicitly in the code; it is often caused by the framework being used.)

At this point, the monitoring data shows that delay has appeared on the slave.

On the read-only instance, a series of commands reveals the cause of the replication delay. In the slave replication status information, the current SQL thread state is "Waiting for table metadata lock".

In addition, the currently blocked DDL statement can also be seen directly in the session snapshot.

The transactions that have not been committed for a long time can then be viewed on the instance.
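A sketch of the manual equivalent of this diagnosis, assuming a 60-second threshold for "long uncommitted":

-- Sessions stuck behind the replicated DDL show this state:
SHOW PROCESSLIST;   -- look for State = 'Waiting for table metadata lock'

-- Transactions that have been open for a long time (likely MDL holders):
SELECT trx_id, trx_started, trx_mysql_thread_id, trx_query
FROM   information_schema.innodb_trx
WHERE  trx_started < NOW() - INTERVAL 60 SECOND;

-- If the offender is an idle, uncommitted transaction, commit it or kill the session:
-- KILL <trx_mysql_thread_id>;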

DBbrain, the intelligent database housekeeper, proactively discovers the cause; once the transaction is committed or the session is killed, the delay disappears immediately.

IV. The ingenious use of master-slave delay

Although master-slave delay (replication delay) has a negative impact on the business in most scenarios, in some scenarios deliberately configuring a "delay" neatly solves special business needs. For example, replicating the log to the standby promptly but intentionally not applying it immediately is a technique often used in disaster-recovery systems: the probability of an actual disaster-recovery switchover is very low, but the existing resources can be used to "roll back" a mistaken operation in time, so the replication delay becomes a very valuable undo option. If someone accidentally drops a MySQL database or table, it can easily be recovered from a delayed MySQL slave. MySQL itself supports the MASTER_DELAY parameter to achieve this.
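A minimal sketch of a deliberately delayed replica using MASTER_DELAY; the one-hour value is just an example:

-- On the slave: the SQL thread must be stopped before changing the delay
STOP SLAVE SQL_THREAD;
CHANGE MASTER TO MASTER_DELAY = 3600;   -- apply events no earlier than 1 hour after the master
START SLAVE SQL_THREAD;

-- SHOW SLAVE STATUS then reports SQL_Remaining_Delay, i.e. how long the next event will still wait.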

V. Tencent Cloud MySQL optimizations based on master-slave replication

The time consumed by MySQL under synchronous replication consists mainly of three parts: the SQL layer, the storage engine, and replication. Compared with asynchronous replication, our optimization focuses on the latency of the third part. Replication delay itself has two main components: the network transmission time of the binlog, and the time the slave takes to persist the binlog. The binlog transmission time depends on the network RTT, so our optimization focuses on the time the slave spends persisting the binlog.

During this optimization we ran a test for quantitative analysis. With asynchronous MySQL and data fully in cache, a single transaction takes 3.37 ms, i.e. SQL plus the storage engine together take 3.37 ms; under semi-synchronous replication the latency rises to 8.33 ms, and with an RTT of 2.6 ms the slave spends about 1.9 ms persisting the binlog. Is 1.9 ms reasonable? A further test simulating the slave's binlog-persisting operation showed that only 0.13 ms is actually needed, so there is close to 1.8 ms of optimization headroom.

The second question is how to improve system throughput. When the latency of a single transaction comes down, does the throughput of the whole system necessarily go up? Not necessarily: overall throughput depends on two factors, the achievable concurrency and the latency of a single transaction. If there is heavy contention on shared resources, the concurrency may not scale, and we found considerable room for optimization in the master's binlog send/response threads. So we optimized system throughput from both of these angles.

How do we reduce the time the slave spends persisting the binlog? Our analysis found three main bottlenecks in the MySQL slave's I/O thread when receiving the binlog: first, lock conflicts, such as metadata and file locks contended between the I/O and SQL threads; second, small-I/O overhead, where the I/O thread's scattered small disk writes consume too many IOPS; and third, serialization, because the I/O thread's receive and flush-to-disk operations run serially.

That is all the sharing on which MySQL failures are worth understanding in depth. I hope the content above is helpful and that you have learned something from it. If you found the article useful, feel free to share it so more people can see it.
