Shulou (Shulou.com), SLTechnology News & Howtos. Updated 2025-02-27.
Many inexperienced users are at a loss about how to recover Redis quickly after a crash, so this article summarizes the causes of the problem and its solutions; I hope it helps you solve it.
Let's look at how Redis implements automatic failure recovery, which builds on the data persistence and data replicas discussed earlier.
As a very popular in-memory database, Redis must deliver not only very high performance but also high availability, minimizing the impact when a failure occurs. For this, Redis provides a complete failure-recovery mechanism: Sentinel.
Let's take a concrete look at how Redis failure recovery is done and how it works.
Deployment modes
Redis can be deployed in several ways, and each corresponds to a different level of availability.
Single-node deployment: one node serves both reads and writes. If this node goes down, the service stops and in-memory data is lost, directly affecting the business.
Master-slave deployment: two nodes form a master-slave pair; writes go to the master and reads to the slave. Read-write separation improves access performance, but after a master outage a slave must be promoted to master manually, and the business impact depends on how long that manual promotion takes.
Master-slave + Sentinel deployment: the master-slave pair is the same as above, plus a group of sentinel nodes that check the master's health in real time. After a master outage, a slave is automatically promoted to the new master, minimizing the unavailable time; the impact on the business is brief.
These deployment modes show that the key to improving Redis availability is multi-replica deployment plus automatic failure recovery, and multiple replicas rely on master-slave replication.
High availability in practice
Redis natively provides master-slave data replication to keep the slave's data consistent with the master's.
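For reference, pointing a slave at its master takes only a couple of configuration directives; a minimal sketch with placeholder addresses:

```conf
# Minimal sketch of a slave's configuration; host and port are placeholders.
# (Newer Redis versions also accept "replicaof" as an alias for "slaveof".)
port 6380
slaveof 127.0.0.1 6379
```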
When the master fails, a slave must be promoted to master so the service can continue. If that promotion is manual, timeliness cannot be guaranteed, so Redis provides sentinel nodes to manage the master and slaves and to perform failure recovery automatically when the master has a problem.
The entire failover is done automatically by Redis Sentinel.
Introduction to the Sentinel
Sentinel is a highly available solution for Redis. It is a service tool for managing multiple Redis instances. It can monitor, notify, and automatically fail over Redis instances.
When deploying Sentinel, we only need to list the master nodes to be managed in the configuration file; the sentinel nodes then manage those Redis nodes accordingly to achieve high availability.
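For example, a minimal sentinel configuration might look like this ("mymaster" is just the name we give the monitored master; all addresses and timeouts are illustrative):

```conf
# Minimal sentinel.conf sketch; values are illustrative.
port 26379
# Monitor the master named "mymaster"; 2 sentinels must agree it is down.
sentinel monitor mymaster 127.0.0.1 6379 2
# Consider the master subjectively down after 5s without a ping reply.
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
```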
Generally we deploy multiple sentinel nodes, because in a distributed setting a single machine's verdict on whether a node on another machine has failed may be wrong: the network between the two machines may have failed while the node itself is fine.
Therefore, for health detection, multiple nodes usually probe at the same time, spread across different machines, and their number is odd to avoid split decisions caused by network partitions. By exchanging their probe results, the sentinel nodes can jointly confirm whether a node has really failed.
Once the sentinel nodes are deployed and configured, they automatically manage the configured master and slaves and, when the master fails, promptly promote a slave to the new master to keep the service available.
So how does it work?
The working principle of Sentinel
The workflow of Sentinel is mainly divided into the following stages:
State awareness
Heartbeat detection
Elect Sentinel leaders
Select a new master
Fault recovery
Client is aware of the new master
These stages are described in detail below.
State awareness
After startup, a sentinel knows only the master's address. To recover from a master failure it must know each master's slaves, and a master may have more than one, so the sentinel needs the complete topology of the cluster. How does it get this information?
The sentinel sends the info command to each master node every 10 seconds. The reply contains the master-slave topology, including the address and port of every slave. With this information, the sentinel remembers the topology of these nodes and can select a suitable slave for failure recovery when a later failure occurs.
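As an illustration, the reply to info contains a replication section in a well-known line-oriented format; the sketch below pulls the slave addresses out of a sample payload shaped like a real reply (parse_slaves is an invented helper):

```python
# Sketch: extract slave addresses from the "Replication" section of an INFO
# reply, as a sentinel does to learn the topology. parse_slaves is invented;
# the sample payload mimics the format of a real reply.
def parse_slaves(info_text):
    slaves = []
    for line in info_text.splitlines():
        # slave lines look like: slave0:ip=...,port=...,state=...,offset=...,lag=...
        if line.startswith("slave") and "ip=" in line:
            fields = dict(kv.split("=", 1) for kv in line.split(":", 1)[1].split(","))
            slaves.append((fields["ip"], int(fields["port"])))
    return slaves

sample = """# Replication
role:master
connected_slaves:2
slave0:ip=10.0.0.2,port=6380,state=online,offset=12345,lag=0
slave1:ip=10.0.0.3,port=6380,state=online,offset=12300,lag=1"""

print(parse_slaves(sample))  # [('10.0.0.2', 6380), ('10.0.0.3', 6380)]
```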
Besides sending info to the master, the sentinel also publishes the master's current state and its own information to a dedicated pubsub channel on each master node. The other sentinel nodes can subscribe to this channel to receive every sentinel's messages.
This is done for two main purposes:
It lets a sentinel discover newly added sentinels, which enables communication among the sentinel nodes and lays the groundwork for their later joint negotiation.
It exchanges master status information with the other sentinel nodes, providing the basis for later judging whether the master has failed.
Heartbeat detection
When a failure occurs, the recovery mechanism must start immediately, so how is timeliness ensured?
Each sentinel node sends a ping command to the master, the slaves, and the other sentinel nodes every second. If a node responds within the specified (configurable) time, it is considered healthy and alive; if it does not, the sentinel considers that node subjectively down.
Why is it called subjectively down?
Because only the current sentinel's probe went unanswered: the network between the two machines may have failed while the master node itself is fine, so concluding from one probe that the master has failed could be wrong.
To confirm whether the master node has really failed, multiple sentinel nodes must confirm it together.
Each sentinel node asks the other sentinel nodes for their view of this master's status, and together they confirm whether the node has really failed.
If more than the configured number of sentinel nodes consider the node subjectively down, the node is marked objectively down.
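The subjective-to-objective transition amounts to a simple quorum check; a toy sketch (the function and names are invented, and the quorum value comes from the sentinel configuration):

```python
# Toy model of the subjective -> objective offline decision: each sentinel
# contributes its own ping verdict, and the node is marked objectively down
# only when at least `quorum` sentinels agree.
def is_objectively_down(verdicts, quorum):
    """verdicts: one boolean per sentinel, True = 'subjectively down'."""
    return sum(verdicts) >= quorum

# 3 sentinels, quorum 2: one isolated sentinel cannot mark the master down alone.
print(is_objectively_down([True, False, False], quorum=2))  # False
print(is_objectively_down([True, True, False], quorum=2))   # True
```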
Elect Sentinel leaders
Once the node's failure is confirmed, the failure-recovery phase begins, and recovery itself goes through a series of steps.
First, a sentinel leader must be elected; this single sentinel performs the recovery operation, so the other sentinels need not participate. Electing the leader requires the sentinel nodes to negotiate with one another.
In distributed systems, this negotiation process is called consensus, and the algorithm that drives it is called a consensus algorithm.
A consensus algorithm solves the problem of how multiple nodes in a distributed system agree on a single decision.
There are many consensus algorithms, such as Paxos, Raft, and Gossip; interested readers can look them up, so I won't cover them here.
The process of selecting a leader by the Sentinel is similar to the Raft algorithm, which is simple and easy to understand.
To put it simply, the process is as follows:
Each sentinel sets a random timeout; when it expires, the sentinel sends a request to the other sentinels to apply for leadership.
Each sentinel replies to, and thereby votes for, only the first request it receives.
The first sentinel to collect a majority of confirmations becomes the leader.
If, after the replies are counted, no sentinel reaches a majority, the election is rerun until a leader is chosen.
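The steps above can be sketched as a toy simulation (all names are invented; the real protocol runs over sentinel messages and epochs, which are omitted here):

```python
import random

# Toy model of the Raft-like election: each sentinel wakes after a random
# timeout and requests leadership; each sentinel grants its vote only to the
# first request it sees; the first candidate with a majority becomes leader.
def elect_leader(sentinel_ids, rng):
    order = sorted(sentinel_ids, key=lambda _: rng.random())  # random timeouts
    votes = {}  # voter -> the candidate it has already promised its vote to
    for candidate in order:
        for voter in sentinel_ids:
            votes.setdefault(voter, candidate)  # only the first request wins the vote
        if sum(1 for c in votes.values() if c == candidate) > len(sentinel_ids) // 2:
            return candidate
    return None  # no majority: the real protocol retries with fresh timeouts

winner = elect_leader(["s1", "s2", "s3"], random.Random(42))
print(winner in {"s1", "s2", "s3"})  # True
```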
After the Sentinel leader is selected, subsequent recovery operations are carried out by the Sentinel leader.
Select a new master
For the failed master node, the sentinel leader must select one of its slaves to take its place.
Selecting the new master follows a priority order. With multiple slaves, the priority is: slave-priority configuration > data completeness > lower runid.
In other words, the slave with the smallest slave-priority is preferred; if all slaves have the same priority, the slave with the most complete data is chosen; and if the data is equally complete, the slave with the smaller runid wins.
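This priority chain can be expressed as a single sort key; a sketch with invented field names, where offset stands in for data completeness (the replication offset):

```python
# Sketch of new-master selection: lowest slave-priority first, then the most
# complete data (largest replication offset), then the smaller runid.
def pick_new_master(slaves):
    return min(slaves, key=lambda s: (s["priority"], -s["offset"], s["runid"]))

slaves = [
    {"runid": "bbb", "priority": 100, "offset": 5000},
    {"runid": "aaa", "priority": 100, "offset": 5000},
    {"runid": "ccc", "priority": 100, "offset": 6000},
]
print(pick_new_master(slaves)["runid"])  # 'ccc': same priority, most data wins
```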
Promote the new master
After the candidate master has been chosen by priority, the next step is the actual master-slave switchover.
The sentinel leader sends the slaveof no one command to the candidate node, making it a master.
It then sends the slaveof $newmaster command to all slaves of the failed node, making them slaves of the new master and starting their data synchronization from it.
Finally, the sentinel leader demotes the failed node to a slave and records this in its own configuration file; when the failed node recovers, it automatically becomes a slave of the new master.
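The three steps above can be sketched as a command sequence (send is a stand-in for delivering a command to a node; addresses are placeholders, and in reality the demotion of the failed node is recorded in the sentinel's configuration and applied when the node returns):

```python
# Sketch of the failover command sequence issued by the sentinel leader.
def failover(new_master, other_slaves, failed_master, send):
    send(new_master, "SLAVEOF NO ONE")         # 1. promote the chosen slave
    for s in other_slaves:                     # 2. repoint the remaining slaves
        send(s, f"SLAVEOF {new_master} 6379")
    send(failed_master, f"SLAVEOF {new_master} 6379")  # 3. demote the failed node

log = []
failover("10.0.0.2", ["10.0.0.3"], "10.0.0.1",
         lambda node, cmd: log.append((node, cmd)))
print(log[0])  # ('10.0.0.2', 'SLAVEOF NO ONE')
```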
At this point, the entire failover is complete.
Client is aware of the new master
Finally, how does the client get the latest master address?
After the failover completes, the sentinel writes a message to a designated pubsub channel on its own node; a client subscribed to this channel is notified of the master change. The client can also actively query a sentinel node for the latest master address.
In addition, Sentinel provides a "hook" mechanism: script logic configured in the sentinel configuration file is triggered when failover completes, notifying the client that a switch has happened so it can re-fetch the latest master address from the sentinel.
Generally, the first method is recommended; many client SDKs already integrate fetching the latest master from the sentinel nodes and can be used directly.
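For the active-query path, the command a client sends is SENTINEL get-master-addr-by-name with the master's name. The sketch below only builds the raw RESP bytes for that command, without any network I/O; encode_resp is an invented helper and "mymaster" is a placeholder name:

```python
# Encode a Redis command in RESP (an array of bulk strings), as a client
# library would before writing it to a sentinel's socket.
def encode_resp(*parts):
    out = [("*%d\r\n" % len(parts)).encode()]
    for p in parts:
        b = p.encode()
        out.append(b"$%d\r\n" % len(b) + b + b"\r\n")
    return b"".join(out)

msg = encode_resp("SENTINEL", "get-master-addr-by-name", "mymaster")
print(msg[:4])  # b'*3\r\n'
```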
As you can see, to keep Redis highly available, the sentinel nodes must judge failures accurately and quickly select a new node to serve, and the intermediate process is fairly complex.
Distributed consensus and distributed negotiation are involved in order to guarantee the accuracy of the failover.
Understanding how Redis high availability works lets us use Redis more accurately.
After reading the above, have you mastered how Redis recovers quickly after a crash? Thank you for reading!