What to do when a failed Galera MySQL cluster node cannot rejoin the cluster

2025-04-17 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)05/31 Report--

This article covers what to do when a failed node in a Galera MySQL cluster cannot reconnect to the cluster: the symptoms, the logs, and a workaround.

Galera Cluster is a multi-master cluster for MySQL.

We currently have a 3-node test cluster.

The first round of testing turned up a problem: after a node failed and went offline, it could not rejoin the cluster.

I then tried adding the whole node as a new node, which also failed. I spent two days on it and got nothing but a headache. That round ended in failure.

The error message is as follows:

170609 16:55:59 [Note] WSREP: Read nil XID from storage engines, skipping position init

170609 16:55:59 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera-3/libgalera_smm.so'

170609 16:55:59 [Note] WSREP: wsrep_load(): Galera 3.20 (r7e383f7) by Codership Oy loaded successfully.

170609 16:55:59 [Note] WSREP: CRC-32C: using hardware acceleration.

170609 16:55:59 [Note] WSREP: Found saved state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824276, safe_to_bootsrap: 0

170609 16:55:59 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.11.152; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 300m; gcache.recover = no; gcache.size = 300m; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc

170609 16:55:59 [Note] WSREP: GCache history reset: old (51391c6d-4bff-11e7-a1c3-b797743e8629:0) -> new (51391c6d-4bff-11e7-a1c3-b797743e8629:824276)

170609 16:55:59 [Note] WSREP: Assign initial position for certification: 824276, protocol version: -1

170609 16:55:59 [Note] WSREP: wsrep_sst_grab()

170609 16:55:59 [Note] WSREP: Start replication

170609 16:55:59 [Note] WSREP: Setting initial position to 51391c6d-4bff-11e7-a1c3-b797743e8629:824276

170609 16:55:59 [Note] WSREP: protonet asio version 0

170609 16:55:59 [Note] WSREP: Using CRC-32C for message checksums.

170609 16:55:59 [Note] WSREP: backend: asio

170609 16:55:59 [Note] WSREP: gcomm thread scheduling priority set to other:0

170609 16:55:59 [Warning] WSREP: access file (/var/lib/mysql//gvwstate.dat) failed (No such file or directory)

170609 16:55:59 [Note] WSREP: restore pc from disk failed

170609 16:55:59 [Note] WSREP: GMCast version 0

170609 16:55:59 [Warning] WSREP: Failed to resolve tcp://192.168.11.98:4567

170609 16:55:59 [Warning] WSREP: Failed to resolve tcp://192.168.12.75:4567

170609 16:55:59 [Note] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567

170609 16:55:59 [Note] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') multicast:, ttl: 1

170609 16:55:59 [Note] WSREP: EVS version 0

170609 16:55:59 [Note] WSREP: gcomm: connecting to group 'mycluster', peer '192.168.11.152:,192.168.11.98:,192.168.12.75:'

170609 16:55:59 [Note] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') connection established to 753e6ee4 tcp://192.168.11.152:4567

170609 16:55:59 [Warning] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') address 'tcp://192.168.11.152:4567' points to own listening address, blacklisting

170609 16:56:02 [Warning] WSREP: no nodes coming from prim view, prim not possible

170609 16:56:02 [Note] WSREP: view (view_id (NON_PRIM,753e6ee4,1) memb {

753e6ee4,0

} joined {

} left {

} partitioned {

})

170609 16:56:02 [Note] WSREP: (753e6ee4, 'tcp://0.0.0.0:4567') connection to peer 753e6ee4 with addr tcp://192.168.11.152:4567 timed out, no messages seen in PT3S

170609 16:56:03 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50193S), skipping check

170609 16:56:32 [Note] WSREP: view ((empty))

170609 16:56:32 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)

at gcomm/src/pc.cpp:connect():158

170609 16:56:32 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)

170609 16:56:32 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1404: Failed to open channel 'mycluster' at 'gcomm://192.168.11.152,192.168.11.98,192.168.12.75?gmcast.segment=0&evs.max_install_timeouts=1': -110 (Connection timed out)

170609 16:56:32 [ERROR] WSREP: gcs connect failed: Connection timed out

170609 16:56:32 [ERROR] WSREP: wsrep::connect(gcomm://192.168.11.152,192.168.11.98,192.168.12.75?gmcast.segment=0&evs.max_install_timeouts=1) failed: 7

170609 16:56:32 [ERROR] Aborting

170609 16:56:32 [Note] WSREP: Service disconnected.

170609 16:56:33 [Note] WSREP: Some threads may fail to exit.

170609 16:56:33 [Note] /usr/sbin/mysqld: Shutdown complete

After this, the node could no longer join the cluster.

I was baffled. I even wondered how the largest clusters in China cope with this problem.

I deleted all the test VMs, reinstalled the OS, and started all over again.

Over the past two days I started testing this problem again, repeating the same test case.

After a node was removed, the same problem reproduced.

Whether I cleared all the data before rejoining or kept the original data, every attempt failed with the same error message as above.

Once again, no solution.

The frustration set in again. This should not be happening, so I started analyzing the error messages. Judging from the log, the node always seems to connect to the first node in the list, which is the node itself.

It reports that the connection cannot be made, repeats this 7 times, then times out and quits.

Our cluster has 3 nodes, so this should not happen: if the first address cannot be reached, the node should try the remaining nodes in round-robin fashion.

But the log shows no sign of that.
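The same reading of the log can be done mechanically. Below is a minimal Python sketch, not part of the original test setup, that scans a mysqld log for the messages quoted above: the "points to own listening address" warning, the "connection established" notes, and the "failed to reach primary view" error. The function name `summarize_join_attempt` and the regular expressions are illustrative assumptions based only on the wording of the log in this article; other Galera versions may phrase these messages differently.

```python
import re

# Patterns derived from the log messages quoted in this article (assumed wording).
OWN_ADDR = re.compile(
    r"address\s*'\s*(tcp://[^']+?)\s*'\s*points to own listening address"
)
ESTABLISHED = re.compile(r"connection established to (\w+) (tcp://\S+)")
NO_PRIMARY = re.compile(r"failed to reach primary view")


def summarize_join_attempt(log_text):
    """Report which peers were reached and whether the node gave up."""
    own = set()
    reached = set()
    failed = False
    for line in log_text.splitlines():
        match = OWN_ADDR.search(line)
        if match:
            own.add(match.group(1))
        match = ESTABLISHED.search(line)
        if match:
            reached.add(match.group(2))
        if NO_PRIMARY.search(line):
            failed = True
    return {
        "own_addresses": sorted(own),
        # Peers other than the node itself that accepted a connection.
        "other_peers_reached": sorted(reached - own),
        "failed_to_reach_primary": failed,
    }
```

Run against the failing log above, a summary like this would show an empty `other_peers_reached` list together with `failed_to_reach_primary` set to true, which matches the observation that the node only ever talked to itself.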

I began to suspect a flaw in the design of this part of the software.

There was no need to read the source code, though; I could simply change the configuration.

So I edited wsrep_cluster_address and moved the first node's IP (the node's own IP) to the end of the list.

Then I restarted the database, and a miracle happened.

170609 16:57:09 [Note] WSREP: Read nil XID from storage engines, skipping position init

170609 16:57:09 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera-3/libgalera_smm.so'

170609 16:57:09 [Note] WSREP: wsrep_load(): Galera 3.20 (r7e383f7) by Codership Oy loaded successfully.

170609 16:57:09 [Note] WSREP: CRC-32C: using hardware acceleration.

170609 16:57:09 [Note] WSREP: Found saved state: 51391c6d-4bff-11e7-a1c3-b797743e8629:-1, safe_to_bootsrap: 0

170609 16:57:09 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.11.152; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 300m; gcache.recover = no; gcache.size = 300m; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc

170609 16:57:09 [Note] WSREP: GCache history reset: old (51391c6d-4bff-11e7-a1c3-b797743e8629:0) -> new (51391c6d-4bff-11e7-a1c3-b797743e8629:824276)

170609 16:57:09 [Note] WSREP: Assign initial position for certification: 824276, protocol version: -1

170609 16:57:09 [Note] WSREP: wsrep_sst_grab()

170609 16:57:09 [Note] WSREP: Start replication

170609 16:57:09 [Note] WSREP: Setting initial position to 51391c6d-4bff-11e7-a1c3-b797743e8629:824276

170609 16:57:09 [Note] WSREP: protonet asio version 0

170609 16:57:09 [Note] WSREP: Using CRC-32C for message checksums.

170609 16:57:09 [Note] WSREP: backend: asio

170609 16:57:09 [Note] WSREP: gcomm thread scheduling priority set to other:0

170609 16:57:09 [Warning] WSREP: access file (/var/lib/mysql//gvwstate.dat) failed (No such file or directory)

170609 16:57:09 [Note] WSREP: restore pc from disk failed

170609 16:57:09 [Note] WSREP: GMCast version 0

170609 16:57:09 [Warning] WSREP: Failed to resolve tcp://192.168.12.75:4567

170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567

170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') multicast:, ttl: 1

170609 16:57:09 [Note] WSREP: EVS version 0

170609 16:57:09 [Note] WSREP: gcomm: connecting to group 'mycluster', peer '192.168.11.98:,192.168.12.75:,192.168.11.152:'

170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') connection established to 9f2dfc7e tcp://192.168.11.152:4567

170609 16:57:09 [Warning] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') address 'tcp://192.168.11.152:4567' points to own listening address, blacklisting

170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') connection established to 017c00ff tcp://192.168.11.98:4567

170609 16:57:09 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.12.75:4567

170609 16:57:10 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') connection established to 325d47d6 tcp://192.168.12.75:4567

170609 16:57:10 [Note] WSREP: declaring 017c00ff at tcp://192.168.11.98:4567 stable

170609 16:57:10 [Note] WSREP: declaring 325d47d6 at tcp://192.168.12.75:4567 stable

170609 16:57:10 [Note] WSREP: Node 017c00ff state prim

170609 16:57:10 [Note] WSREP: view (view_id (PRIM,017c00ff,13) memb {

017c00ff,0

325d47d6,0

9f2dfc7e,0

} joined {

} left {

} partitioned {

})

170609 16:57:10 [Note] WSREP: save pc into disk

170609 16:57:10 [Note] WSREP: gcomm: connected

170609 16:57:10 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636

170609 16:57:10 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)

170609 16:57:10 [Note] WSREP: Opened channel 'mycluster'

170609 16:57:10 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3

170609 16:57:10 [Note] WSREP: Waiting for SST to complete.

170609 16:57:10 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.

170609 16:57:10 [Note] WSREP: STATE EXCHANGE: sent state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0

170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 0 (11.98)

170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 1 (12.75)

170609 16:57:10 [Note] WSREP: STATE EXCHANGE: got state msg: 9f7acc66-4cf1-11e7-878b-a2d0231f71b0 from 2 (11.152)

170609 16:57:10 [Note] WSREP: Quorum results:

Version = 4

Component = PRIMARY

Conf_id = 12

Members = 3/3 (joined/total)

Act_id = 824276

Last_appl. = -1

Protocols = 0/7/3 (gcs/repl/appl)

Group UUID = 51391c6d-4bff-11e7-a1c3-b797743e8629

170609 16:57:10 [Note] WSREP: Flow-control interval: [28, 28]

170609 16:57:10 [Note] WSREP: Restored state OPEN -> JOINED (824276)

170609 16:57:10 [Note] WSREP: New cluster view: global state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824276, view# 13: Primary, number of nodes: 3, my index: 2, protocol version 3

170609 16:57:10 [Note] WSREP: SST complete, seqno: 824276

170609 16:57:10 [Note] WSREP: Member 2.0 (11.152) synced with group.

170609 16:57:10 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 824276)

170609 16:57:10 [Note] Plugin 'FEDERATED' is disabled.

170609 16:57:10 InnoDB: The InnoDB memory heap is disabled

170609 16:57:10 InnoDB: Mutexes and rw_locks use InnoDB's own implementation

170609 16:57:10 InnoDB: Compressed tables use zlib 1.2.3

170609 16:57:10 InnoDB: Using Linux native AIO

170609 16:57:10 InnoDB: Initializing buffer pool, size = 122.0m

170609 16:57:10 InnoDB: Completed initialization of buffer pool

170609 16:57:10 InnoDB: highest supported file format is Barracuda.

170609 16:57:11 InnoDB: Waiting for the background threads to start

170609 16:57:12 InnoDB: 5.5.54 started; log sequence number 6024720364

170609 16:57:12 [Note] Server hostname (bind-address): '0.0.0.0'; port: 3306

170609 16:57:12 [Note] - '0.0.0.0' resolves to '0.0.0.0'

170609 16:57:12 [Note] Server socket created on IP: '0.0.0.0'.

170609 16:57:12 [Note] Event Scheduler: Loaded 0 events

170609 16:57:12 [Note] /usr/sbin/mysqld: ready for connections.

Version: '5.5.54' socket: '/var/lib/mysql/mysql.sock' port: 3306 MySQL Community Server (GPL), wsrep_25.19.20170106.aa7e07d

170609 16:57:12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

170609 16:57:12 [Note] WSREP: REPL Protocols: 7 (3,2)

170609 16:57:12 [Note] WSREP: Assign initial position for certification: 824276, protocol version: 3

170609 16:57:12 [Note] WSREP: Service thread queue flushed.

170609 16:57:12 [Note] WSREP: Synchronized with group, ready for connections

170609 16:57:12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

170609 16:57:13 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') connection to peer 9f2dfc7e with addr tcp://192.168.11.152:4567 timed out, no messages seen in PT3S

170609 16:57:13 [Note] WSREP: (9f2dfc7e, 'tcp://0.0.0.0:4567') turning message relay requesting off

The node connects smoothly and joins the cluster.
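To confirm a successful join without reading the whole log, one can extract only the state transitions. The following Python sketch is an illustration, not an official tool: the helper names `state_transitions` and `reached_synced` are hypothetical, and the pattern assumes the "Shifting X -> Y" and "Restored state X -> Y" wording seen in the log above.

```python
import re

# Matches e.g. "WSREP: Shifting JOINED -> SYNCED" and
# "WSREP: Restored state OPEN -> JOINED"; whitespace around "->" is optional.
TRANSITION = re.compile(
    r"WSREP: (?:Shifting|Restored state)\s+([A-Z/]+)\s*->\s*([A-Z/]+)"
)


def state_transitions(log_text):
    """Return the (from_state, to_state) pairs in the order they were logged."""
    return [match.groups() for match in TRANSITION.finditer(log_text)]


def reached_synced(log_text):
    """True if the last logged transition ended in the SYNCED state."""
    transitions = state_transitions(log_text)
    return bool(transitions) and transitions[-1][1] == "SYNCED"
```

For the successful start shown above, the extracted sequence would be CLOSED to OPEN, OPEN to JOINED, and JOINED to SYNCED, while the failed start logs no transition at all.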

Then I tested again with all the data files emptied: the node joined the cluster smoothly and completed data synchronization automatically.

The data synchronization at that time can be seen in another log:

170608 12:05:48 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3

170608 12:05:48 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.

170609 16:42:43 [Note] WSREP: Quorum results:

Version = 4

Component = PRIMARY

Conf_id = 4

Members = 2/3 (joined/total)

Act_id = 824275

Last_appl. = 824274

Protocols = 0/7/3 (gcs/repl/appl)

Group UUID = 51391c6d-4bff-11e7-a1c3-b797743e8629

170609 16:42:43 [Note] WSREP: Flow-control interval: [28, 28]

170609 16:42:43 [Note] WSREP: New cluster view: global state: 51391c6d-4bff-11e7-a1c3-b797743e8629:824275, view# 5: Primary, number of nodes: 3, my index: 0, protocol version 3

170609 16:42:43 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

170609 16:42:43 [Note] WSREP: REPL Protocols: 7 (3,2)

170609 16:42:43 [Note] WSREP: Assign initial position for certification: 824275, protocol version: 3

170609 16:42:43 [Note] WSREP: Service thread queue flushed.

170609 16:42:43 [Note] WSREP: Member 1.0 (11.152) requested state transfer from '*any*'. Selected 0.0 (11.98) (SYNCED) as donor.

170609 16:42:43 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 824275)

170609 16:42:43 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

170609 16:42:43 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address '192.168.11.152[...]' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --gtid '51391c6d-4bff-11e7-a1c3-b797743e8629:824275''

170609 16:42:43 [Note] WSREP: sst_donor_thread signaled with 0

170609 16:42:43 [Note] WSREP: Flushing tables for SST...

170609 16:42:43 [Note] WSREP: Provider paused at 51391c6d-4bff-11e7-a1c3-b797743e8629:824275 (831018)

170609 16:42:43 [Note] WSREP: Tables flushed.

At this point, my conjecture was basically confirmed.

When a node rejoins after leaving the cluster, if the failed node's own IP is the first IP in wsrep_cluster_address in its own configuration file, the node can never join the cluster again.

What should we do? Move its IP out of the first position in this configuration item, and the problem is solved.

Further testing showed that the problem occurs when the node is the master, that is, the node that was started with --wsrep-new-cluster, and its own IP comes first.

Once such a node has rejoined the cluster through the steps above, it no longer holds the master role.

From then on the problem does not occur: the node can join the cluster even with its own IP in first place.

This looks like a bug.

After further verification, it can be submitted as a bug report.

The way to avoid this problem: when configuring wsrep_cluster_address on a node, do not put the node's own IP in first place.
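As an illustration of the workaround, the Python sketch below reorders a wsrep_cluster_address value so that the node's own IP ends up last. It only computes the new string; writing it back to my.cnf and restarting mysqld is left to the reader. The function name `move_own_ip_last` is hypothetical, and the `gcomm://host,host?options` layout is assumed from the `gcs_open` error line in the log above. IPv6 addresses are not handled.

```python
def move_own_ip_last(cluster_address, own_ip):
    """Return the gcomm:// address with own_ip moved to the end of the peer list."""
    prefix = "gcomm://"
    if not cluster_address.startswith(prefix):
        raise ValueError("expected an address starting with gcomm://")

    # Split off any "?key=value&..." options so they are preserved untouched.
    hosts_part, sep, options = cluster_address[len(prefix):].partition("?")
    hosts = [h.strip() for h in hosts_part.split(",") if h.strip()]

    def is_own(host):
        # Compare only the IP part, ignoring an optional ":port" suffix.
        return host.split(":")[0] == own_ip

    others = [h for h in hosts if not is_own(h)]
    own = [h for h in hosts if is_own(h)]
    return prefix + ",".join(others + own) + sep + options


if __name__ == "__main__":
    # Addresses taken from the test cluster described in this article.
    print(move_own_ip_last(
        "gcomm://192.168.11.152,192.168.11.98,192.168.12.75",
        "192.168.11.152",
    ))
```

Keeping the node's own address in the list, rather than deleting it, mirrors what was done in the test above, where the IP was moved to the end.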
