How to analyze and deal with the problem of MySQL backup failure

Shulou(Shulou.com)05/31 Report--

This article shows how to analyze and deal with a MySQL backup failure. The content is concise and easy to follow, and I hope you can take something away from the detailed walkthrough below.

Today my colleagues and I dealt with a strange MySQL space problem, and the process illustrates some useful ways of approaching this kind of issue.

The background is that backups of one instance kept failing. After several rounds of troubleshooting, the backup was shelved to keep the Slave available. I took advantage of the past few days to finish the investigation and write it up.

The error message for backup failure is:

innobackupex: Error writing file '/tmp/xbtempevLQbf' (Errcode: 28 - No space left on device)
xtrabackup: Error: write to logfile failed
xtrabackup: Error: xtrabackup_copy_logfile() failed.

The problem looks straightforward enough: the device ran out of space, so at first glance it is just a matter of space allocation.
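
Before going further, the obvious checks can be done directly on the backup host. A minimal sketch, with only the mount points involved here (credentials and paths are the ones already shown in this article):

# Free space and inode usage on /tmp and on the root filesystem
df -h /tmp /
df -i /tmp /
# Size of the xtrabackup temporary files while a backup is in progress
ls -lh /tmp/xbtemp* 2>/dev/null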

However, when reproducing the issue locally, I used the following command to start the test on the host itself.

/usr/local/mysql_tools/percona-xtrabackup-2.4.8-Linux-x86_64/bin/innobackupex --defaults-file=/data/mysql_4308/my.cnf --user=xxxx --password=xxxx --socket=/data/mysql_4308/tmp/mysql.sock --stream=tar /data/xxxx/mysql/xxxx_4308/2020-02-11 > /data/xxxx/mysql/xxxx_4308/2020-02-11.tar.gz

It turned out that there was no space problem in the /tmp directory at all; instead, the anomaly was in the space usage of the root directory. A screenshot of the space anomaly was captured during the test.

As soon as the exception is thrown, the backup fails and the space utilization immediately drops back to normal.

Based on the information gathered so far, my intuition was that the problem was not directly related to /tmp; some other directory under the root filesystem had to be consuming the space.

So I started a second test, this time focusing on the overall usage of the root directory to see which directory was abnormal. Embarrassingly, despite quickly putting together a collection script, I did not find any space anomaly under the directories we usually suspect.
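
The quick collection script was roughly along these lines. This is a reconstruction rather than the original script, and the sampling interval and log path are made up for illustration:

#!/bin/bash
# Sample per-directory usage under / every few seconds while the backup runs,
# so the directory whose usage spikes can be identified afterwards.
LOG=/data/space_watch.log
while true
do
    date >> "$LOG"
    (cd / && du -sh ./*) >> "$LOG" 2>&1
    df -h / >> "$LOG"
    sleep 5
done

One of the snapshots collected while the backup was running looked like this: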

332K    ./home
411M    ./lib
26M     ./lib64
16K     ./lost+found
4.0K    ./media
4.0K    ./misc
4.0K    ./mnt
0       ./net
184M    ./opt
du: cannot access `./proc/40102/task/40102/fd/4': No such file or directory
du: cannot access `./proc/40102/task/40102/fdinfo/4': No such file or directory
du: cannot access `./proc/40102/fd/4': No such file or directory
du: cannot access `./proc/40102/fdinfo/4': No such file or directory
0       ./proc
2.3G    ./root
56K     ./tmp

So, judging from the current situation, the space anomaly should be in something related to /proc.
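
Since /proc itself does not occupy real disk space, one common way to chase this kind of "invisible" usage is to look for files that have been deleted but are still held open by a process. This is a general-purpose check, not the exact command used at the time:

# Unlinked-but-open files keep consuming space until the holding process
# closes them; lsof lists them when asked for link count < 1.
lsof +L1 | sort -n -k 7 | tail -20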

Beyond that, there did not seem to be many options left at this point.

I checked the backup script and the parameter file, and compared with other environments there was no obvious difference. One detail did catch my attention, though: top showed that this instance was using about 6G of memory (the server has 8G), while the buffer pool was configured at roughly 3G. This is a slave environment with no application connections, so excessive per-connection memory consumption is unlikely. On the whole, it looked related to abnormal memory usage on the server.
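
A quick way to cross-check that memory picture (socket path as in the backup command above, credentials omitted; this assumes a single mysqld instance on the host):

# Configured buffer pool size versus the actual resident memory of mysqld
mysql -S /data/mysql_4308/tmp/mysql.sock \
      -e "SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';"
# Resident set size (VmRSS) of the mysqld process, reported in kB
grep VmRSS /proc/$(pidof mysqld)/status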

At this point I tried an online resize of the buffer pool and found there was no room to shrink it. Since this is a slave instance, I decided to restart the slave service.
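
For reference, the online shrink attempt was along these lines; this assumes MySQL 5.7 or later, where innodb_buffer_pool_size can be changed dynamically, and the target value is only illustrative:

# Try to shrink the buffer pool online; the resize happens in chunks and
# progress can be followed through the resize status variable.
mysql -S /data/mysql_4308/tmp/mysql.sock \
      -e "SET GLOBAL innodb_buffer_pool_size = 2 * 1024 * 1024 * 1024;"
mysql -S /data/mysql_4308/tmp/mysql.sock \
      -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_resize_status';"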

Unexpectedly, the restart got stuck. After about two minutes I had only seen a couple of lines of progress dots and there was still no response. The error log showed nothing, so I moved to plan B: kill the process and restart the service.

This time the kill took effect, and after a while the service came back up. But then a replication error was reported on the slave.

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
Master_Server_Id: 190
Master_UUID: 570dcd0e-f6d0-11e8-adc3-005056b7e95f
...
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp: 200211 14:20:57
Retrieved_Gtid_Set: 570dcd0e-f6d0-11e8-adc3-005056b7e95f:821211986-2157277214
Executed_Gtid_Set: 570dcd0e-f6d0-11e8-adc3-005056b7e95f:1-820070317:821211986-2157277214

The error message is clear: the binlogs on the master have already been purged, so the slave cannot fetch the events it needs and replication fails.

Why would such a strange problem occur? The master keeps binlogs for several days by default, so a binlog from just an hour ago could not possibly have been deleted.
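
The retention setting and the binlog files still present on the master can be confirmed directly; the connection options below are placeholders:

# How long the master keeps binlogs, and which binlog files are still on disk
mysql -h xxxxx -P 4308 -u dba_repl -p \
      -e "SHOW GLOBAL VARIABLES LIKE 'expire_logs_days'; SHOW BINARY LOGS;"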

Some GTID-related variable values are as follows:

Retrieved_Gtid_Set: 570dcd0e-f6d0-11e8-adc3-005056b7e95f:821211986-2157277214

Executed_Gtid_Set: 570dcd0e-f6d0-11e8-adc3-005056b7e95f:1-820070317:821211986-2157277214

Gtid_purged: 570dcd0e-f6d0-11e8-adc3-005056b7e95f:1-820070317:821211986-2131381624

The GTID_Purged on the Master side is:

Gtid_purged: 570dcd0e-f6d0-11e8-adc3-005056b7e95f:1-2089314252

Taken together, the slave's GTID set is not contiguous with the master's, which means some operations were done directly on the Slave at some point, so the GTID sets on the Master and on this server have diverged.
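
The divergence can be made explicit with GTID_SUBTRACT: subtracting the slave's executed set from the master's purged set shows which transactions have been purged on the master without ever being applied on the slave. A sketch using the sets listed above (connection options omitted):

# Transactions purged on the master but never executed on this slave
mysql -e "SELECT GTID_SUBTRACT(
            '570dcd0e-f6d0-11e8-adc3-005056b7e95f:1-2089314252',
            '570dcd0e-f6d0-11e8-adc3-005056b7e95f:1-820070317:821211986-2157277214'
          ) AS missing_on_slave\G"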

And the missing range of changes around 570dcd0e-f6d0-11e8-adc3-005056b7e95f:821211986 is, conservatively estimated, more than a month old, so the corresponding binlogs are certainly no longer kept.

Let's fix this replication problem for the time being.

Stopping the slave ran into yet another problem: the seemingly simple STOP SLAVE operation took more than a minute.

>>stop slave;

Query OK, 0 rows affected (1 min 1.99 sec)

I tried reducing the buffer pool configuration, restarting, and running STOP SLAVE again; the operation was still slow. So this direction could be ruled out: the latency had little to do with the buffer pool and much more to do with GTID.
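
To see what the replication threads are doing while STOP SLAVE hangs, their states can be watched from another session; a rough sketch (socket path as above, credentials omitted):

# Thread states during the slow STOP SLAVE (run from a second connection)
mysql -S /data/mysql_4308/tmp/mysql.sock -e "SHOW PROCESSLIST;"
mysql -S /data/mysql_4308/tmp/mysql.sock -e "SHOW SLAVE STATUS\G" \
      | grep -E 'Slave_IO_State|Slave_SQL_Running_State|Seconds_Behind_Master'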

The steps to repair the Slave end are as follows:

reset master;
stop slave;
reset slave all;
SET @@GLOBAL.GTID_PURGED='570dcd0e-f6d0-11e8-adc3-005056b7e95f:1-2157277214';
CHANGE MASTER TO MASTER_USER='dba_repl', MASTER_PASSWORD='xxxx', MASTER_HOST='xxxxx', MASTER_PORT=4308, MASTER_AUTO_POSITION=1;

The configuration of GTID_PURGED is the key.
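
After the CHANGE MASTER, replication can be restarted and verified; a minimal check sequence along these lines (socket path as above, credentials omitted):

# Restart replication and confirm both threads run and the lag drops to 0
mysql -S /data/mysql_4308/tmp/mysql.sock -e "START SLAVE;"
mysql -S /data/mysql_4308/tmp/mysql.sock -e "SHOW SLAVE STATUS\G" \
      | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_Error'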

After the repair, the replication delay on the Slave side was resolved, and when the backup was retried there was no longer any abnormal space consumption in the root directory.

The main goal throughout was to solve the problem quickly, so the logs captured at some steps are not as rich and detailed as they could be, and the analysis still lacks some more convincing evidence. As for the root cause, in essence an unreasonable condition (such as a bug or an abnormal configuration) led to an unreasonable phenomenon.

What is worth borrowing from this write-up is the overall train of thought of the analysis, not the specific problem itself.

The above is how to analyze and deal with a MySQL backup failure. Have you picked up any knowledge or skills from it? If you want to learn more, you are welcome to follow the industry information channel.