2025-04-06 Update From: SLTechnology News&Howtos
Symptom:
The application layer reported that queries to Redis were failing.
Cluster composition:
3 masters and 3 slaves; each node holds about 8GB of data
Machine distribution:
In the same rack
xx.x.xxx.199
xx.x.xxx.200
xx.x.xxx.201
Redis-server process status:
Checked with the command: ps -eo pid,lstart | grep $pid
The process had been running for about 3 months.
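For reference, a small Python helper (illustrative, not part of the original troubleshooting) that turns the `lstart` column printed by `ps -eo pid,lstart` into an uptime in days:

```python
from datetime import datetime

def uptime_days(lstart, now=None):
    """lstart is ps's long start-time format, e.g. 'Thu Jun  9 18:30:00 2016'."""
    started = datetime.strptime(lstart, "%a %b %d %H:%M:%S %Y")
    now = now or datetime.now()
    return (now - started).days

# Example: a redis-server started in June, observed on 9 Sep 2016
# (dates here are made up to match the ~3 months mentioned above).
print(uptime_days("Thu Jun  9 18:30:00 2016", datetime(2016, 9, 9, 18, 57)))  # 92
```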
Node status of the cluster before the failure:
Xx.x.xxx.200:8371 (bedab2c537fe94f8c0363ac4ae97d56832316e65) master
Xx.x.xxx.199:8373 (792020fe66c00ae56e27cd7a048ba6bb2b67adb6) slave
Xx.x.xxx.201:8375 (5ab4f85306da6d633e4834b4d3327f45af02171b) master
Xx.x.xxx.201:8372 (826607654f5ec81c3756a4a21f357e644efe605a) slave
Xx.x.xxx.199:8370 (462cadcb41e635d460425430d318f2fe464665c5) master
Xx.x.xxx.200:8374 (1238085b578390f3c8efa30824fd9a4baba10ddf) slave
The log analysis follows.
Step 1:
Master node 8371 loses connection to slave node 8373:
46590:M 09 Sep 18:57:51.379 # Connection with slave xx.x.xxx.199:8373 lost.
Step 2:
Master nodes 8370/8375 decide that 8371 is lost:
42645:M 09 Sep 18:57:50.117 * Marking node bedab2c537fe94f8c0363ac4ae97d56832316e65 as failing (quorum reached)
Step 3:
Slave nodes 8372/8373/8374 receive a message from master 8375 that 8371 is lost:
46986:S 09 Sep 18:57:50.120 * FAIL message received from 5ab4f85306da6d633e4834b4d3327f45af02171b about bedab2c537fe94f8c0363ac4ae97d56832316e65
Step 4:
Master nodes 8370/8375 grant slave 8373 authorization to fail over and become master:
42645:M 09 Sep 18:57:51.055 # Failover auth granted to 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 for epoch 16
Step 5:
The original master node 8371 modified its configuration to become the slave node of 8373:
46590:M 09 Sep 18:57:51.488 # Configuration change detected. Reconfiguring myself as a replica of 792020fe66c00ae56e27cd7a048ba6bb2b67adb6
Step 6:
Master nodes 8370/8375/8373 clear the FAIL state of 8371:
42645:M 09 Sep 18:57:51.522 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: master without slots is reachable again.
Step 7:
The new slave 8371 starts its first full synchronization from the new master 8373:
8373 Log:
4255:M 09 Sep 18:57:51.906 * Full resync requested by slave xx.x.xxx.200:8371
4255:M 09 Sep 18:57:51.906 * Starting BGSAVE for SYNC with target: disk
4255:M 09 Sep 18:57:51.941 * Background saving started by pid 5230
8371 Log:
46590:S 09 Sep 18:57:51.948 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440721826993
Step 8:
Master nodes 8370/8375 decide that 8373 (the new master) is lost:
42645:M 09 Sep ... * Marking node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6 as failing (quorum reached).
Step 9:
Master nodes 8370/8375 decide that 8373 (the new master) has recovered:
60295:M 09 Sep ... * Clear FAIL state for node 792020fe66c00ae56e27cd7a048ba6bb2b67adb6: is reachable again and nobody is serving its slots after some time.
Step 10:
The master node 8373 completes the BGSAVE operation required for full synchronization:
5230:C 09 Sep 18:59:01.474 * DB saved on disk
5230:C 09 Sep 18:59:01.491 * RDB: ... MB of memory used by copy-on-write
4255:M 09 Sep 18:59:01.877 * Background saving terminated with success
Step 11:
Slave 8371 starts receiving data from master 8373:
46590:S 09 Sep 18:59:02.263 * MASTER <-> SLAVE sync: receiving ... bytes from master
Step 12:
Master 8373 finds that slave 8371 has exceeded its output buffer limit and closes the connection:
4255:M 09 Sep 19:00:19.014 # Client id=14259015 addr=xx.x.xxx.200:21772 fd=844 name= age=148 idle=148 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16349 oll=4103 omem=95944066 events=rw cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
4255:M 09 Sep 19:00:19.015 # Connection with slave xx.x.xxx.200:8371 lost.
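Plugging the numbers from that client line into the slave output-buffer limits in effect (client-output-buffer-limit slave 256mb 64mb 60, shown later in this article) suggests it was the soft limit that fired. A quick illustrative check in Python (the connection age is used here as an upper bound on how long the buffer stayed over the limit):

```python
# Fields taken from the Step 12 log line
omem = 95944066          # output buffer memory used, in bytes
age = 148                # connection age in seconds

# Configured slave limits: 256mb hard, 64mb soft for 60 seconds
hard_limit = 256 * 1024 * 1024
soft_limit = 64 * 1024 * 1024
soft_seconds = 60

over_hard = omem >= hard_limit                      # ~91.5 MB < 256 MB
over_soft = omem >= soft_limit and age >= soft_seconds
print(over_hard, over_soft)  # False True -> disconnected via the soft limit
```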
Step 13:
Slave 8371 fails to synchronize from master 8373; the connection drops and the first full synchronization fails:
46590:S 09 Sep 19:00:19.018 # I/O error trying to sync with MASTER: connection lost
46590:S 09 Sep 19:00:20.102 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:20.102 * MASTER <-> SLAVE sync started
Step 14:
Slave 8371 retries synchronization; the connection fails because the connection count on master 8373 is exhausted:
46590:S 09 Sep 19:00:21.103 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:21.103 * MASTER <-> SLAVE sync started
46590:S 09 Sep 19:00:21.104 * Non blocking connect for SYNC fired the event.
46590:S 09 Sep 19:00:21.104 # Error reply to PING from master: 'ERR max number of clients reached'
Step 15:
Slave 8371 reconnects to master 8373 and starts the second full synchronization:
8371 Log:
46590:S 09 Sep 19:00:49.175 * Connecting to MASTER xx.x.xxx.199:8373
46590:S 09 Sep 19:00:49.175 * MASTER <-> SLAVE sync started
46590:S 09 Sep 19:00:49.175 * Non blocking connect for SYNC fired the event.
46590:S 09 Sep 19:00:49.176 * Master replied to PING, replication can continue...
46590:S 09 Sep 19:00:49.179 * Partial resynchronization not possible (no cached master)
46590:S 09 Sep 19:00:49.501 * Full resync from master: d7751c4ebf1e63d3baebea1ed409e0e7243a4423:440780763454
8373 Log:
4255:M 09 Sep 19:00:49.176 * Slave xx.x.xxx.200:8371 asks for synchronization
4255:M 09 Sep 19:00:49.176 * Full resync requested by slave xx.x.xxx.200:8371
4255:M 09 Sep 19:00:49.176 * Starting BGSAVE for SYNC with target: disk
4255:M 09 Sep 19:00:49.498 * Background saving started by pid 18413
18413:C 09 Sep 19:01:52.466 * DB saved on disk
18413:C 09 Sep 19:01:52.620 * RDB: 2124 MB of memory used by copy-on-write
4255:M 09 Sep 19:01:53.186 * Background saving terminated with success
Step 16:
Slave 8371 receives the data successfully and starts loading it into memory:
46590:S 09 Sep 19:01:53.190 * MASTER <-> SLAVE sync: receiving 2637183250 bytes from master
46590:S 09 Sep 19:0?:51.485 * MASTER <-> SLAVE sync: Flushing old data
46590:S 09 Sep 19:05:58.695 * MASTER <-> SLAVE sync: Loading DB in memory
Step 17:
The cluster returns to normal:
42645:M 09 Sep 19:05:58.786 * Clear FAIL state for node bedab2c537fe94f8c0363ac4ae97d56832316e65: slave is reachable again.
Step 18:
Slave 8371 finishes synchronizing; the whole process took about 7 minutes:
46590:S 09 Sep 19:0?:19.303 * MASTER <-> SLAVE sync: Finished with success
Analysis: why node 8371 lost contact
Since the machines are in the same rack, a network interruption is unlikely. Checking the slow query log with the SLOWLOG GET command shows that a KEYS command had been executed, taking 8.3 seconds. The cluster node timeout turned out to be 5s (cluster-node-timeout 5000).
Why the node was judged lost:
The client executed a command that took 8.3 seconds.
2016-09-09 18:57:43: the KEYS command starts executing
2016-09-09 18:57:50: 8371 is judged lost (per the Redis log)
2016-09-09 18:57:51: the KEYS command finishes
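The timeline above can be checked with a few lines of Python (timestamps taken from the text; the comparison just restates the cluster-node-timeout logic):

```python
from datetime import datetime, timedelta

keys_start  = datetime(2016, 9, 9, 18, 57, 43)  # KEYS begins; the server is blocked
marked_fail = datetime(2016, 9, 9, 18, 57, 50)  # cluster marks 8371 as failing
keys_end    = datetime(2016, 9, 9, 18, 57, 51)  # KEYS returns after ~8.3 s

node_timeout = timedelta(seconds=5)             # cluster-node-timeout 5000

# 8371 stopped answering at 18:57:43; 5 s later it was eligible to be
# flagged, and the log shows the FAIL mark 7 s in -- before KEYS finished.
assert marked_fail - keys_start >= node_timeout
assert keys_end > marked_fail
```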
To sum up, the following problems were found:
1. Because cluster-node-timeout was set too short, the slow KEYS query caused the cluster to judge node 8371 lost.
2. Because 8371 was judged lost, 8373 was promoted to master and master-slave synchronization began.
3. Because of the configured client-output-buffer-limit, the first full synchronization failed.
4. Because of a problem in the PHP client's connection pool, clients reconnected frantically, producing an effect similar to a SYN attack.
5. After the first full synchronization failed, it took the slave 30 seconds to reconnect to the master (the maximum number of connections, more than 10,000, had been exhausted).
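Problem 1 is ultimately a slow-query problem: KEYS walks the entire keyspace in one blocking call. The usual replacement is the cursor-based SCAN command, which the server answers in small slices. A minimal Python sketch of that pattern follows; the FakeClient below is an in-memory stand-in for a redis-py-style client, used only so the example is self-contained:

```python
import fnmatch

class FakeClient:
    """In-memory stand-in for a redis-py client (illustration only)."""
    def __init__(self, keys):
        self._keys = sorted(keys)
    def scan(self, cursor=0, match="*", count=10):
        # Return (next_cursor, batch); cursor 0 means the scan is complete.
        end = cursor + count
        batch = [k for k in self._keys[cursor:end] if fnmatch.fnmatch(k, match)]
        return (0 if end >= len(self._keys) else end, batch)

def scan_iter(client, pattern="*", count=10):
    """Yield matching keys incrementally instead of one blocking KEYS call."""
    cursor = 0
    while True:
        cursor, batch = client.scan(cursor=cursor, match=pattern, count=count)
        yield from batch
        if cursor == 0:
            break

client = FakeClient([f"user:{i}" for i in range(25)] + ["cfg:a"])
print(sum(1 for _ in scan_iter(client, "user:*")))  # 25
```

With a real redis-py client the same loop applies; each scan call returns quickly, so no single command can block the node past cluster-node-timeout.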
About the client-output-buffer-limit parameter:
# The syntax of every client-output-buffer-limit directive is the following:
#
# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>
#
# A client is immediately disconnected once the hard limit is reached, or if
# the soft limit is reached and remains reached for the specified number of
# seconds (continuously).
# So for instance if the hard limit is 32 megabytes and the soft limit is
# 16 megabytes / 10 seconds, the client will get disconnected immediately
# if the size of the output buffers reach 32 megabytes, but will also get
# disconnected if the client reaches 16 megabytes and continuously overcomes
# the limit for 10 seconds.
#
# By default normal clients are not limited because they don't receive data
# without asking (in a push way), but just after a request, so only
# asynchronous clients may create a scenario where data is requested faster
# than it can read.
#
# Instead there is a default limit for pubsub and slave clients, since
# subscribers and slaves receive data in a push fashion.
#
# Both the hard or the soft limit can be disabled by setting them to zero.
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
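To make the hard/soft semantics above concrete, here is a small Python simulation of the disconnect rule (illustrative logic only, not the actual Redis implementation):

```python
def should_disconnect(samples, hard, soft, soft_seconds):
    """samples: ordered (t_seconds, buffer_bytes) observations.
    Disconnect when the buffer hits the hard limit, or stays at or above
    the soft limit continuously for soft_seconds."""
    over_since = None
    for t, size in samples:
        if size >= hard:
            return True                      # hard limit: immediate
        if size >= soft:
            if over_since is None:
                over_since = t               # start of the continuous overrun
            if t - over_since >= soft_seconds:
                return True                  # soft limit held too long
        else:
            over_since = None                # overrun interrupted; reset
    return False

MB = 1024 * 1024
# Buffer sits at 70 MB (over the 64 MB soft limit) for more than 60 s.
samples = [(t, 70 * MB) for t in range(0, 120, 10)]
print(should_disconnect(samples, 256 * MB, 64 * MB, 60))  # True
```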
Measures taken:
1. Cut each instance down to below 4GB; otherwise a master-slave switch takes too long.
2. Adjust the client-output-buffer-limit parameter so full synchronization does not fail midway.
3. Adjust cluster-node-timeout; it should not be less than 15s.
4. Forbid any slow query that takes longer than cluster-node-timeout, since it triggers a master-slave switch.
5. Fix the client's frantic, SYN-attack-like reconnection behavior.
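Measures 2 and 3 translate into redis.conf settings along these lines (the values here are illustrative, not the ones used in the original incident; size the slave buffer limits to your RDB size and network):

```conf
# Give the cluster more tolerance for brief stalls (measure 3)
cluster-node-timeout 15000

# Leave enough slave output buffer to survive a full resync (measure 2)
client-output-buffer-limit slave 512mb 256mb 120
```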
Summary
That is the whole of this detailed analysis of a Redis cluster failure. I hope it is helpful to you; if there are deficiencies, please leave a message and they will be corrected in time. Thank you for your support!