
How to understand Redis distributed locks


This article focuses on how to understand Redis distributed locks; interested readers may want to take a look. The approach described here is simple and practical. Let's walk through it.

Do you really need distributed locks?

Reaching for a distributed lock means that multiple processes are running into the problem of accessing the same resource at the same time.

In general, there are two reasons for preventing concurrent access to the same resource:

To improve efficiency. For example, when multiple nodes work through the same batch of tasks, if one node is already computing a task, the other nodes should not repeat the computation, so that computing resources are not wasted. Doing the work twice is still acceptable, though, and causes no further damage. In other words, occasional lock failure is allowed.

To ensure correctness. Here the requirements on the lock are much stricter: if the computation is repeated, correctness is affected, so lock failure is not allowed.

Introducing a distributed lock means introducing third-party infrastructure such as MySQL, Redis, or Zookeeper. If that infrastructure has problems, the business is affected too, so before reaching for a distributed lock it is worth asking whether the problem can be solved without a lock at all.

That question is outside the scope of this article, which assumes the locking requirement is reasonable and leans toward the second case above. Why only "leans toward"? Because there is no 100% reliable distributed lock, as the rest of this article will show.

Start with a simple distributed lock implementation

Distributed locks are very commonly implemented on Redis; it is easy to implement yourself and there are third-party libraries, or at least it seems easy. Below is the simplest reliable Redis implementation.

The simplest implementation

The implementation is a classic; only two key points are listed here:

The lock and the unlock must be performed by the same party. A common solution is to give every lock a "key" (a unique ID) that is generated when locking and verified when unlocking.

A resource must not stay locked forever. A common solution is to give the lock an expiration time; there are other options, which we will come back to later.

A copy-and-paste implementation is as follows:

Lock

public static boolean tryLock(String key, String uniqueId, int seconds) {
    return "OK".equals(jedis.set(key, uniqueId, "NX", "EX", seconds));
}

This is really just a call to SET key value EX seconds NX.

The full form of the command is SET key value [EX seconds|PX milliseconds] [NX|XX] [KEEPTTL]; see the reference: https://redis.io/commands/set

Unlock

public static boolean releaseLock(String key, String uniqueId) {
    String luaScript = "if redis.call('get', KEYS[1]) == ARGV[1] then " +
            "return redis.call('del', KEYS[1]) else return 0 end";
    return jedis.eval(
            luaScript,
            Collections.singletonList(key),
            Collections.singletonList(uniqueId)
    ).equals(1L);
}

The essence of this implementation is the small lua script: it checks that the unique ID matches before deleting, and the check and the delete happen atomically.

Is it reliable?

What's wrong with such an implementation?

Single point of failure. The implementation above needs only one master node; the single point here is that single master. Even with a cluster, if the master fails before the lock has been replicated to a slave, a failover lets another client lock the same resource, so the same resource ends up locked by multiple clients.

The execution time exceeds the lock's expiration time. As noted above, an expiration time is added so that a resource is never locked forever, and the lock is released automatically, but what if the task has not finished by then? With GC pauses or network delays stretching the task, it is hard to guarantee that the task completes within the lock's TTL.

How do we solve these two problems? Let's try a more complex implementation.

Redlock algorithm

For the first problem, the single point of failure, the next thing that comes to mind within the Redis world is Redlock. To get rid of the single machine, Redlock requires multiple (more than 2) independent redis master nodes, with no data synchronization between them.

The implementation of Redlock is as follows:

1) Get the current time.

2) Acquire the lock on each of the N nodes in turn. Each node is locked the same way as in the simple implementation above. One detail: the expiration time passed to each node is different, because you subtract the time already spent acquiring locks on the previous nodes:

For example, say the requested lock TTL is 500ms:

Acquiring the lock on the first node took 1ms, so the expiration time used for the first node is 499ms.

Acquiring the lock on the second node took 2ms, so the expiration time used for the second node is 497ms.

If the remaining expiration time ever drops to 0 or below, the whole acquisition has timed out and the operation fails.

3) Determine whether the lock was acquired. If the client has locked at least N/2 + 1 nodes (a majority) in the steps above, and the remaining expiration time is greater than 0, the acquisition succeeds; otherwise it fails, and the lock is released.

4) Release the lock. The release command is sent to all nodes, each using the same logic as the simple implementation above. Why all nodes? Because in a distributed setting, failing to acquire the lock from a node does not mean the lock was not taken on that node: the lock may actually have been set, with only the reply lost or delayed by network jitter. A sketch of this acquisition flow follows.
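To make the flow concrete, here is a minimal sketch of steps 1-4 under a few assumptions: each element of masters is a Jedis client for one independent master, the old-style jedis.set(key, value, "NX", "PX", millis) overload is available, and releaseLockOn is a hypothetical per-node variant of the releaseLock helper shown earlier. It illustrates the flow and is not a production Redlock client.

// Hedged sketch of Redlock acquisition; helper names are illustrative, not a real library API.
public static boolean redlockTryLock(List<Jedis> masters, String key, String uniqueId, long ttlMillis) {
    long start = System.currentTimeMillis();                       // step 1: record the current time
    int acquired = 0;
    for (Jedis node : masters) {                                   // step 2: lock each node in turn
        long remaining = ttlMillis - (System.currentTimeMillis() - start);
        if (remaining <= 0) {
            break;                                                 // the whole acquisition has timed out
        }
        if ("OK".equals(node.set(key, uniqueId, "NX", "PX", remaining))) {
            acquired++;
        }
    }
    boolean majority = acquired >= masters.size() / 2 + 1;         // step 3: need a majority of nodes
    boolean timeLeft = ttlMillis - (System.currentTimeMillis() - start) > 0;
    if (majority && timeLeft) {
        return true;
    }
    for (Jedis node : masters) {                                   // step 4: on failure, release on ALL nodes
        try {
            releaseLockOn(node, key, uniqueId);                    // hypothetical per-node variant of releaseLock
        } catch (Exception ignored) {
            // a node we could not reach may still hold the lock; its TTL will eventually clean it up
        }
    }
    return false;
}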

That is the usual description of Redlock. At first glance it looks like no more than a multi-master version of the simple lock; if it were really that simple, life would be easy. Next, we analyze how the algorithm breaks down in various scenarios.

Distributed lock pit

Problems in high concurrency scenarios

The following problems can occur even when concurrency is low, but they are far more likely to appear under high concurrency.

Performance. The problems here come from two aspects.

1) The time spent acquiring the lock. With redlock under high concurrency there are N master nodes to request one by one, which takes time and hurts performance. This is easy to fix: from the description above it is clear that the per-node acquisitions do not have to be sequential; they can be issued concurrently so that all nodes are tried at the same time. Even with parallel acquisition, you still need to estimate the acquisition time and make sure that the lock TTL > lock acquisition time + task processing time.

2) The locked resource is too large. Locking trades concurrency for correctness, and the amount of concurrency sacrificed grows with the size of the locked resource. In that case, consider splitting the resource; there are two ways to do it:

Split the locked resource into segments along business lines and lock each segment separately. For example, if I have to perform several operations on a merchant and currently lock the whole merchant beforehand, I can instead split the work into several independent steps and lock each step separately, improving concurrency.

Use bucketing: split the resource into multiple buckets, and if locking one bucket fails, immediately try the next. For example, in a batch-processing scenario with tasks for 2 million merchants, you use multiple threads to speed things up, and each thread grabs 100 merchants to process, which means locking those 100 merchants. Without further care it is hard to guarantee that the merchants locked by two threads never overlap. Instead, divide the merchants into buckets along some dimension, such as a tag, and let each task process one bucket at a time, finishing one bucket before moving to the next, which reduces contention. A small sketch of this bucketing idea follows.
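As a rough illustration of the bucketing idea (the bucket count, key format and helper names are made up for this example; tryLock and releaseLock are the helpers from the simple implementation above):

// Hedged sketch: lock one bucket of merchants at a time instead of an arbitrary batch of 100.
static final int BUCKET_COUNT = 64;                       // illustrative bucket count

static int bucketOf(long merchantId) {
    return (int) (merchantId % BUCKET_COUNT);             // bucket by a stable dimension, e.g. id or tag
}

static String bucketLockKey(int bucket) {
    return "lock:merchant-task:bucket:" + bucket;         // illustrative key format
}

// Each worker scans the buckets; if one bucket is already locked it immediately tries the next.
static void processNextFreeBucket(String uniqueId) {
    for (int bucket = 0; bucket < BUCKET_COUNT; bucket++) {
        if (tryLock(bucketLockKey(bucket), uniqueId, 60)) {
            try {
                // ... process all merchants that fall into this bucket ...
            } finally {
                releaseLock(bucketLockKey(bucket), uniqueId);
            }
            return;
        }
    }
}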

Retries. Whether it is the simple implementation or redlock, there will be retry logic. Following the algorithm naively, multiple clients may try to acquire the same lock at almost the same moment, each locking some of the nodes, so that no client reaches a majority. The usual fix is to stagger the clients on retry by adding a random delay to the retry interval. This does not cure the problem, but it alleviates it effectively; see the sketch below.
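A minimal sketch of retrying with random jitter (the base delay, random bound and retry count are arbitrary; tryLock is the helper from the simple implementation above):

// Hedged sketch: add a random delay between retries so competing clients do not retry in lock-step.
static boolean lockWithRetry(String key, String uniqueId, int seconds, int maxRetries) throws InterruptedException {
    for (int attempt = 0; attempt < maxRetries; attempt++) {
        if (tryLock(key, uniqueId, seconds)) {
            return true;
        }
        // fixed base delay plus a random component staggers the clients
        long sleepMillis = 50 + ThreadLocalRandom.current().nextLong(100);
        Thread.sleep(sleepMillis);
    }
    return false;
}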

Node downtime

With a single master node and no persistence, a crash loses the lock, so the implementation must tolerate the operation being executed again; that is, it must be idempotent. A rough sketch follows.
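As a rough sketch of what that idempotency can look like (taskStore, isAlreadyProcessed, markProcessed and doBusinessWork are all placeholders, not a real API):

// Hedged sketch: a re-run of the same task after a crash or lost lock becomes a no-op.
static void processTaskIdempotently(String taskId) {
    if (taskStore.isAlreadyProcessed(taskId)) {   // completion flag kept in durable business storage
        return;                                   // the task was already handled once, skip the repeat
    }
    doBusinessWork(taskId);                       // placeholder for the real work
    taskStore.markProcessed(taskId);              // record completion so later repeats are skipped
}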

For multi-master scenarios, such as redlock, let's take a look at such a scenario:

Suppose there are five redis nodes, A, B, C, D and E, none of them persisted.

Client1 acquires the lock on nodes A, B and C, so client1 holds the lock.

Node C goes down and comes back, losing its lock data since nothing is persisted.

Client2 then acquires the lock on C, D and E, so client2 also holds the lock: client1 and client2 hold the same lock at the same time, and redlock is broken.

How do we solve this? The obvious idea is to turn on persistence. Persistence can record every redis command, but doing so has such a big performance cost that it is generally not used; with anything weaker, a node that crashes will lose a small window of recent data, and our lock may be in that window.

Another option is delayed restart: after a node is brought back, it does not rejoin immediately, but waits for a period longer than the maximum TTL of any lock that existed at the moment it went down.

Even this does not fully solve the problem. If both B and C go down in step 3 above, only A, D and E are left; if another client then manages to acquire the lock on D and E, we are in trouble again. The only mitigation is to increase the total number of master nodes, which improves stability but also raises cost, so a tradeoff has to be made.

Task execution time exceeds locked TTL

We have had production incidents where, because of network delays, a task ran far longer than expected, the lock expired, and the task ended up being executed by multiple threads.

This problem is faced by all distributed locks, including ones built on zookeeper or a database: the contradiction between the lock expiring and the client not knowing that it has expired.

When locking, we usually give the lock a TTL so that if the client crashes after locking, the lock does not stay held forever. But all such approaches share the same weakness: there is no guarantee that the client finishes within the lock's TTL. Most programmers optimistically assume this will never be a problem; I thought so too, until reality slapped me in the face again and again.

Martin Kleppmann raised the same objection:

Client1 acquires the lock.

Client1 starts the task, then a stop-the-world (STW) GC pause occurs that lasts longer than the lock's expiration time.

Client2 acquires the lock and starts the task.

When Client1's GC ends and its task resumes, both Client1 and Client2 believe they hold the lock and both process the task, producing an error.

Martin Kleppmann used GC as his example; what I ran into was network latency. Either way, it is undeniable that this situation cannot be ruled out, and it is confusing to debug once it happens.

How do we solve it? One approach is not to set a TTL yourself, but to attach a watchdog to the lock once it is acquired: as long as the lock has not been released, the watchdog renews it whenever it is about to expire. That is a bit abstract, so let's look at the redisson source code:

public class RedissonLock extends RedissonExpirable implements RLock {
    ...
    @Override
    public void lock() {
        try {
            lockInterruptibly();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    @Override
    public void lock(long leaseTime, TimeUnit unit) {
        try {
            lockInterruptibly(leaseTime, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
    ...
}

Redisson's two commonly used locking APIs are the two above. The first takes no TTL; in that case redisson maintains the lock itself and actively renews it. The second takes a TTL, and redisson will not renew it for us (you could also pass -1 as the leaseTime to get automatic renewal, but that style is not recommended when the ready-made no-argument API already exists).
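A minimal usage sketch of the no-argument API, assuming a Redisson 3.x-style client (the address and lock name are placeholders):

// Hedged usage sketch: no leaseTime is passed, so redisson's watchdog keeps renewing the lock.
Config config = new Config();
config.useSingleServer().setAddress("redis://127.0.0.1:6379");   // placeholder address
RedissonClient redisson = Redisson.create(config);

RLock lock = redisson.getLock("myLock");                         // placeholder lock name
lock.lock();                                                     // acquired without a TTL of our own
try {
    // ... do the work while holding the lock ...
} finally {
    lock.unlock();                                               // releasing also stops the renewal
}
redisson.shutdown();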

Next, let's analyze the locking logic of the no-argument method:

public class RedissonLock extends RedissonExpirable implements RLock {
    ...
    public static final long LOCK_EXPIRATION_INTERVAL_SECONDS = 30;
    protected long internalLockLeaseTime = TimeUnit.SECONDS.toMillis(LOCK_EXPIRATION_INTERVAL_SECONDS);

    @Override
    public void lock() {
        try {
            lockInterruptibly();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    @Override
    public void lockInterruptibly() throws InterruptedException {
        lockInterruptibly(-1, null);
    }

    @Override
    public void lockInterruptibly(long leaseTime, TimeUnit unit) throws InterruptedException {
        long threadId = Thread.currentThread().getId();
        Long ttl = tryAcquire(leaseTime, unit, threadId);
        // lock acquired
        if (ttl == null) {
            return;
        }

        RFuture<RedissonLockEntry> future = subscribe(threadId);
        commandExecutor.syncSubscription(future);

        try {
            while (true) {
                ttl = tryAcquire(leaseTime, unit, threadId);
                // lock acquired
                if (ttl == null) {
                    break;
                }

                // waiting for message
                if (ttl >= 0) {
                    getEntry(threadId).getLatch().tryAcquire(ttl, TimeUnit.MILLISECONDS);
                } else {
                    getEntry(threadId).getLatch().acquire();
                }
            }
        } finally {
            unsubscribe(future, threadId);
        }
        // get(lockAsync(leaseTime, unit));
    }

    private Long tryAcquire(long leaseTime, TimeUnit unit, long threadId) {
        return get(tryAcquireAsync(leaseTime, unit, threadId));
    }

    private RFuture<Long> tryAcquireAsync(long leaseTime, TimeUnit unit, final long threadId) {
        if (leaseTime != -1) {
            return tryLockInnerAsync(leaseTime, unit, threadId, RedisCommands.EVAL_LONG);
        }
        RFuture<Long> ttlRemainingFuture = tryLockInnerAsync(LOCK_EXPIRATION_INTERVAL_SECONDS, TimeUnit.SECONDS, threadId, RedisCommands.EVAL_LONG);
        ttlRemainingFuture.addListener(new FutureListener<Long>() {
            @Override
            public void operationComplete(Future<Long> future) throws Exception {
                if (!future.isSuccess()) {
                    return;
                }

                Long ttlRemaining = future.getNow();
                // lock acquired
                if (ttlRemaining == null) {
                    scheduleExpirationRenewal(threadId);
                }
            }
        });
        return ttlRemainingFuture;
    }

    private void scheduleExpirationRenewal(final long threadId) {
        if (expirationRenewalMap.containsKey(getEntryName())) {
            return;
        }

        Timeout task = commandExecutor.getConnectionManager().newTimeout(new TimerTask() {
            @Override
            public void run(Timeout timeout) throws Exception {
                RFuture<Boolean> future = commandExecutor.evalWriteAsync(getName(), LongCodec.INSTANCE, RedisCommands.EVAL_BOOLEAN,
                        "if (redis.call('hexists', KEYS[1], ARGV[2]) == 1) then " +
                            "redis.call('pexpire', KEYS[1], ARGV[1]); " +
                            "return 1; " +
                        "end; " +
                        "return 0;",
                        Collections.singletonList(getName()), internalLockLeaseTime, getLockName(threadId));

                future.addListener(new FutureListener<Boolean>() {
                    @Override
                    public void operationComplete(Future<Boolean> future) throws Exception {
                        expirationRenewalMap.remove(getEntryName());
                        if (!future.isSuccess()) {
                            log.error("Can't update lock " + getName() + " expiration", future.cause());
                            return;
                        }

                        if (future.getNow()) {
                            // reschedule itself
                            scheduleExpirationRenewal(threadId);
                        }
                    }
                });
            }
        }, internalLockLeaseTime / 3, TimeUnit.MILLISECONDS);

        if (expirationRenewalMap.putIfAbsent(getEntryName(), task) != null) {
            task.cancel();
        }
    }
    ...
}

As you can see, the locking logic eventually reaches org.redisson.RedissonLock#tryAcquireAsync, and after the lock is acquired it enters scheduleExpirationRenewal, which sets up a timer whose delay is internalLockLeaseTime / 3. In redisson, internalLockLeaseTime is 30s, so the lock is renewed every 10s, for 30s each time.

With a zookeeper-based distributed lock, zookeeper itself can check whether the client node is still alive, which achieves the same effect as renewal. I haven't used zookeeper distributed locks, so I won't go into detail.

However, even this cannot guarantee that only one client holds the lock at a time. If renewal fails, for example because of the STW GC that Martin Kleppmann described, or because the client loses contact with the redis cluster, then multiple clients can end up holding the lock at once. In my own case I reduced the granularity of the lock, and redisson's renewal mechanism was sufficient.

If you need to be stricter, you have to add logic that terminates the task when renewal fails. I implemented this in Python code in the past; in Java I have not yet needed to be that strict. A rough sketch of the idea is shown below.
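A rough sketch of that idea, reusing the jedis, key, uniqueId and ttlMillis names from the simple implementation above (the worker Future, doBusinessWork and the schedule are illustrative; a real version would also use a dedicated redis connection for the watchdog thread):

// Hedged sketch: renew the lock periodically and cancel the running task if renewal ever fails.
ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
Future<?> worker = Executors.newSingleThreadExecutor().submit(() -> doBusinessWork());  // placeholder task
watchdog.scheduleAtFixedRate(() -> {
    String renewScript =
            "if redis.call('get', KEYS[1]) == ARGV[1] then " +
            "return redis.call('pexpire', KEYS[1], ARGV[2]) else return 0 end";
    Object renewed = jedis.eval(renewScript,
            Collections.singletonList(key),
            Arrays.asList(uniqueId, String.valueOf(ttlMillis)));
    if (!Long.valueOf(1L).equals(renewed)) {
        worker.cancel(true);          // we can no longer claim the lock, so stop the task
        watchdog.shutdown();
    }
}, ttlMillis / 3, ttlMillis / 3, TimeUnit.MILLISECONDS);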

It is also worth mentioning Martin Kleppmann's own proposal here. I don't find it convincing; the reasons come later.

His solution is to have the locked resource itself maintain a mechanism that guarantees multiple clients cannot damage the resource by writing to it at the same time when the lock fails.

When a client acquires the lock, it also receives a fencing token for the resource, and this token increases monotonically. On every write, the resource checks whether the presented token is older than one it has already seen; if so, the write is rejected. In the GC scenario above, Client1 acquires the lock and is given token 33; Client2 later acquires the lock and is given token 34. During Client1's GC pause, Client2 already writes the resource, so the highest token seen is 34. When Client1 comes back from GC and tries to write with token 33, the write is rejected because the token is stale. This requires a monotonically increasing token generator, and the resource side has to perform the check. A minimal sketch of the resource-side check follows.
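A minimal sketch of the check on the storage side (the class and method names are made up for illustration):

// Hedged sketch: the storage remembers the highest fencing token it has accepted and rejects older ones.
public class FencedStore {
    private long highestToken = -1;                       // highest token seen so far
    private final Map<String, String> data = new HashMap<>();

    public synchronized boolean write(long token, String key, String value) {
        if (token < highestToken) {
            return false;                                 // stale token: this writer's lock was superseded
        }
        highestToken = token;
        data.put(key, value);
        return true;
    }
}

With such a check, Client1's late write with token 33 is rejected because the store has already seen 34.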

I have a few questions about this fencing scheme:

It cannot guarantee transactionality. In Kleppmann's example only token 34 touches the storage, but in a real task there may be several accesses to the storage, and they have to be atomic as a group. Suppose client1 writes to storage with token 33 before its GC pause, then the GC happens, client2 acquires the lock and writes with token 34. Is the data written by the two clients together still guaranteed to be correct? If not, the scheme is flawed, unless the storage has some other mechanism such as transactions to guarantee it; and if it does, the token is redundant and the fencing scheme is superfluous.

It is impractical under high concurrency. Because only the holder of the largest token may write at any moment, access to the storage is effectively serialized. Under high concurrency this severely limits throughput, and high concurrency is exactly where distributed locks are mostly used, which makes the design rather self-contradictory.

Besides, the expired-lock problem exists for all distributed locks, and the fencing scheme is a general remedy that could be paired with Redlock or any other lock. So as I understand it, it is simply an orthogonal solution that has nothing specifically to do with Redlock.

System clock drift

This is a problem I have only thought about and never actually hit in a project, but since it is possible in theory, it is worth mentioning here.

The expiration time of redis depends on the system clock. If the clock drift is too large, it will affect the calculation of the expiration time.

Why does the system clock drift? Start with system time itself: linux provides two clocks, clock realtime and clock monotonic. Clock realtime is the xtime/wall-clock time; it can be changed by the user and adjusted by NTP, gettimeofday returns it, and redis uses it when calculating expirations.

Clock monotonic, literally "monotonic time", cannot be set by the user, although NTP can still adjust it.
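The same distinction exists in Java, by way of analogy (this is standard JDK behavior, not redis code): System.currentTimeMillis() follows the wall clock and jumps if the clock is changed, while System.nanoTime() is monotonic and only useful for measuring elapsed time.

// Wall-clock time: can jump if the system clock is changed (the analogue of clock realtime).
long wall = System.currentTimeMillis();

// Monotonic time: unaffected by clock changes, only meaningful for intervals (the analogue of clock monotonic).
long begin = System.nanoTime();
doSomething();                                            // placeholder for the timed work
long elapsedMillis = (System.nanoTime() - begin) / 1_000_000;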

Ideally, the clocks of all systems are synchronized with the NTP server at all times, but this is clearly impossible. There are two reasons for system clock drift:

The system clock falls out of sync with the NTP server. There is no particularly good solution for this; we can only trust the ops team.

Clock realtime is changed by hand. When implementing a distributed lock, clock realtime should not be used; unfortunately, redis does use it. I checked the Redis 5.0 source and it still uses clock realtime. Antirez has said it should move to clock monotonic, but it had not been changed yet. In other words, manually changing the time on a redis server can break redis's expiration behavior.

Summary

This article went from a simple redis-based distributed lock to the more complex Redlock implementation, and covered some of the pitfalls we have stepped into while using distributed locks, along with their workarounds.
