
How to Repair a Crashed MySQL MGR Cluster


This article walks through how to repair a MySQL MGR cluster after a crash. The method is simple, fast, and practical; follow along and then try it yourself.

We are about to deploy a MySQL MGR environment in production for a banking project. MGR had run on the test system for two months without a single problem. Once, a developer unfamiliar with MySQL internals issued a big transaction and MGR stalled; after a few minutes of handling, everything was back to normal, and queries over millions of rows still completed within a second, no worse than the response of other database products.

Things have gone wrong elsewhere recently, so any system going live now has to be strictly controlled. Before MySQL MGR goes online it must be tested hard enough to find out how much data and load the cluster can actually hold. Only then can we give operations and development a parameter and data-tolerance range and avoid nasty surprises.

We set up sysbench against the ProxySQL middleware. The stress test ran fine at first, so we raised the load: 30 tables of 500 MB each, 30 threads, for 5 minutes. During the prepare phase the system stopped responding at around the 10th table. A memory check showed swap already in use, so memory had almost certainly been exhausted. Attempts to stop the sysbench threads failed, so we ran SHOW PROCESSLIST.

It showed a large number of INSERT INTO statements stuck and clogging the thread pool; killing them by hand had no effect.
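
For reference, a minimal sketch of what that cleanup attempt looks like in SQL; the 60-second threshold and the session ID are illustrative assumptions, not values from the actual incident.

-- List sessions that have been running INSERTs for a long time
-- (the 60-second threshold is an assumed example value).
SELECT ID, USER, HOST, TIME, STATE, LEFT(INFO, 80) AS query_head
FROM information_schema.PROCESSLIST
WHERE INFO LIKE 'insert into%' AND TIME > 60
ORDER BY TIME DESC;

-- Kill one of the offending sessions by its ID (example ID only).
KILL 12345;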

A casual check of disk space then showed the real trouble: the UNDO LOG and RELAY LOG volumes had blown up to 100% utilization.
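
If you want to spot this from inside MySQL rather than at the OS level, a rough sketch, assuming MySQL 8.0 where undo tablespaces are visible in INFORMATION_SCHEMA.FILES (actual disk utilization still needs an OS-level check):

-- Undo tablespace files known to InnoDB.
SELECT TABLESPACE_NAME, FILE_NAME
FROM INFORMATION_SCHEMA.FILES
WHERE FILE_TYPE = 'UNDO LOG';

-- Where the relay logs live; their on-disk size has to be checked at the OS level.
SHOW VARIABLES LIKE 'relay_log%';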

Querying the cluster status at that point showed the cluster had already failed (the screenshot was not saved; MEMBER_STATE in the output below should have read OFFLINE).
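
The cluster state comes from performance_schema; this is presumably the query behind the missing screenshot. A healthy member reports ONLINE, while a member that has dropped out shows OFFLINE or ERROR.

SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
FROM performance_schema.replication_group_members;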

By then the cluster had fallen apart. The logs, it turned out, had been raising alarms since the day before (the stress test was run the day before yesterday, but other urgent work meant nobody followed up on it; only a few days into testing did we notice that something had gone wrong).

Skipping thorough system testing, whether of hardware or software, only stores up future "blows".

Fortunately, we did test this time, so the problem was caught before the system went live.

Now for the repair. After the operations team added disk space, the cluster was rebuilt and restarted. (MySQL MGR is actually quite robust; beyond restarting the cluster, the only work is re-adding the failed nodes.)
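
For the restart itself, a rough sketch assuming a standard single-primary MGR setup: the bootstrap flag is set only on the first node brought back, and the remaining nodes simply rejoin.

-- On the first node only: bootstrap a new group.
SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;

-- On each remaining node: rejoin the group.
START GROUP_REPLICATION;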

At this point the second problem appeared: because of the RELAY LOG disk-space issue, a large backlog of relay logs had piled up.

The easiest fix would be to back up the primary, restore it onto the secondary, and rejoin it to the MGR group, but that would mean passing up the chance to repair MGR at a deeper level. So we first force-started the secondary node and let it start executing the backlog of relay logs. Because of the disk-space problem at the time, the test database had already been dropped on both the primary and the secondary. (For experimental purposes only; never do this in production.)

When the flood of relay-log events hit the standby node, errors appeared because the logged operations no longer matched the current state of the data. Replication kept going for a while, then stopped suddenly: a DROP DATABASE had gotten stuck. (This depends on which replication errors MySQL tolerates at that point; a failed DDL is not a tolerable error.) From then on the secondary node's database was completely blocked.
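
The exact error behind the stuck DROP DATABASE can be read from the replication applier tables; a quick check, assuming performance_schema is enabled:

-- Show the last error reported by the group replication applier workers.
SELECT CHANNEL_NAME, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker
WHERE LAST_ERROR_NUMBER <> 0;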

All right, let's start repairing the cluster.

(If you are not familiar with how MGR and GTIDs work, the steps below will be a struggle.)

1 First, check the executed GTID set of the primary node:

2620d5df-07e7-11e9-8fa8-005056ad4145:1-2740

81f38f19-09d2-4144-8925-484ad1cb5c6f:1-1002

bf79695c-0e78-11e9-9c14-005056ad1469:1

The primary's binary log has reached this GTID set and is currently at MYSQL-BIN.000012.

2 Next, check the status of the secondary node.

Sure enough, its executed GTID set is different, as it should be; otherwise the cluster would not be in a failed state.
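
A quick way to compare the two nodes, using MySQL's built-in GTID functions; the set passed as the first argument below is the primary's executed GTID set shown above.

-- On each node, read the executed GTID set.
SELECT @@GLOBAL.gtid_executed;

-- On the secondary: transactions the primary has executed that this node has not.
SELECT GTID_SUBTRACT(
  '2620d5df-07e7-11e9-8fa8-005056ad4145:1-2740,81f38f19-09d2-4144-8925-484ad1cb5c6f:1-1002,bf79695c-0e78-11e9-9c14-005056ad1469:1',
  @@GLOBAL.gtid_executed
) AS missing_on_secondary;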

3 Begin the repair

First, stop the secondary node (the full sequence is sketched after these steps).

Stop group_replication

Clear the node's current GTID_EXECUTED set

Reset master

Set the same EXECUTED_GTID_SET as the master node

Start replication

Start group_replication
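
Put together, the sequence on the broken secondary looks roughly like this. It assumes the secondary's data already matches the primary (here both had the test database dropped), and the GTID set used is the primary's set shown earlier.

-- 1. Stop group replication on the secondary.
STOP GROUP_REPLICATION;

-- 2. Clear the secondary's own GTID history and binary logs.
RESET MASTER;

-- 3. Mark the primary's executed GTID set as already applied here, so the
--    stale relay-log backlog will never be re-executed.
SET GLOBAL gtid_purged = '2620d5df-07e7-11e9-8fa8-005056ad4145:1-2740,81f38f19-09d2-4144-8925-484ad1cb5c6f:1-1002,bf79695c-0e78-11e9-9c14-005056ad1469:1';

-- 4. Rejoin the group.
START GROUP_REPLICATION;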

Check the status of the repaired secondary node again: it is now ONLINE.

Check the corresponding logs: it has synchronized with the primary. (If you understand how GTIDs work, you will know why.)

Repair the second node using the same method.

The repair is successful and the cluster resumes work.

Verify once more whether any backlog of transaction logs remains.

No, of course not.
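
For completeness, one way to confirm this from the server itself, assuming the standard group_replication_applier channel: anything received but not yet applied shows up as a non-empty GTID set.

SELECT GTID_SUBTRACT(RECEIVED_TRANSACTION_SET, @@GLOBAL.gtid_executed) AS backlog
FROM performance_schema.replication_connection_status
WHERE CHANNEL_NAME = 'group_replication_applier';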

This method only works under the right conditions. Judge by the current state: if the situation is complicated, recover with the traditional backup-and-restore approach. But if you know exactly what your MySQL MGR cluster and the systems around it have been doing, it can be used in an emergency or a test environment, and working through it teaches a great deal about MGR internals and emergency repair.

At this point you should have a deeper understanding of how to repair a crashed MySQL MGR cluster; the best way to cement it is to try it in practice.
