Foreword:
At present, most articles on the Internet about ZooKeeper- and Redis-based distributed locks are not comprehensive enough: they either deliberately avoid the cluster case or do not consider it fully, leaving readers confused. Frankly, it is hard to say anything new on this old topic, so the blogger is treading carefully, as if walking on thin ice. If anything in this article is imprecise, criticism is welcome.
This article contains very little code; it is mainly analysis.
(1) For Redis, there is an open-source Redisson jar package you can use.
(2) For ZooKeeper, there is an open-source Curator jar package you can use.
Because open-source jar packages already exist, there is no need to wrap your own; you can simply look up their APIs, so there is no need to list a pile of implementation code here (a minimal usage sketch of both libraries follows for orientation).
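For orientation only, here is a minimal sketch of how a lock is usually taken with these two libraries. The class and method names come from Redisson's and Curator's public APIs, but the addresses, lock names and paths are made-up placeholders, and error handling is omitted.

// Minimal sketch: taking a lock with Redisson (Redis) and Curator (ZooKeeper).
// Addresses, lock names and paths below are placeholder values.
import org.redisson.Redisson;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LockQuickStart {
    public static void main(String[] args) throws Exception {
        // Redisson: a reentrant lock stored in Redis
        Config config = new Config();
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        RedissonClient redisson = Redisson.create(config);
        RLock redisLock = redisson.getLock("pageLock");
        redisLock.lock();                       // blocks until acquired
        try { /* operate on the shared resource */ } finally { redisLock.unlock(); }
        redisson.shutdown();

        // Curator: a mutex backed by a ZooKeeper znode
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "127.0.0.1:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();
        InterProcessMutex zkLock = new InterProcessMutex(zk, "/locks/page");
        zkLock.acquire();                       // blocks until acquired
        try { /* operate on the shared resource */ } finally { zkLock.release(); }
        zk.close();
    }
}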
It is worth noting that Google has a coarse-grained distributed lock service called Chubby. However, Chubby is not open source, and we can only learn its details from its paper and other related documents. Fortunately, Yahoo! drew on Chubby's design ideas to develop ZooKeeper and open-sourced it, so this article will not discuss Chubby. As for Tair, it is a distributed K-V storage solution open-sourced by Alibaba. We mostly use Redis in our work, so distributed locks implemented with Tair are not representative and will not be discussed either.
Therefore, this article mainly analyzes the distributed locks implemented with Redis and ZooKeeper.
Article structure
This article draws on two articles by well-known engineers: "Is Redlock safe?" by antirez, the author of Redis, and "How to do distributed locking" by Martin Kleppmann, a distributed systems expert, and adds the blogger's own humble opinions. The structure of the article is as follows:
(1) Why use distributed locks?
(2) Comparison in the stand-alone case
(3) Comparison in the cluster case
(4) Comparison of other lock characteristics
Main text
Let's start with the conclusion:
ZooKeeper is much more reliable than Redis, but somewhat less efficient. If concurrency is not very high, ZooKeeper is the first choice; if efficiency is the priority, a Redis-based implementation is preferred.
Why use distributed locks?
The purpose of using distributed locks is to ensure that only one client can operate on shared resources at a time.
But Martin points out that, depending on the purpose of the lock, it can be further divided into two categories:
(1) Multiple clients are allowed to operate on the shared resource.
In this case, the operation on the shared resource must be idempotent: no matter how many times it is performed, the result is the same. The lock here merely avoids repeating the operation on the shared resource, in order to improve efficiency.
(2) Only one client is allowed to operate on the shared resource.
In this case, the operation on the shared resource is generally non-idempotent. If multiple clients operate on the shared resource, the data may become inconsistent or be lost.
Round one: comparison in the stand-alone case
(1) Redis
Let's start with locking. According to the Redis official documentation, the following command is used to acquire the lock:
SET resource_name my_random_value NX PX 30000
my_random_value is a random string generated by the client; it serves as the client's token proving that it holds the lock.
NX means the SET succeeds only if the key resource_name does not already exist, so only the first client to request the lock can acquire it.
PX 30000 means the lock automatically expires after 30 seconds (30000 milliseconds).
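As a hedged illustration (not part of the original article), the same SET command can be issued from Java with the Jedis client. The method names are from Jedis's public API; the helper class, key name and TTL are placeholders.

// Sketch: acquiring the lock, equivalent to SET resource_name my_random_value NX PX 30000.
import java.util.UUID;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RedisLockAcquire {
    // Returns the random value (the client's ownership token) on success, or null if the lock is taken.
    public static String tryAcquire(Jedis jedis, String resourceName, long ttlMillis) {
        String randomValue = UUID.randomUUID().toString();
        String reply = jedis.set(resourceName, randomValue,
                SetParams.setParams().nx().px(ttlMillis));
        return "OK".equals(reply) ? randomValue : null;
    }
}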
As for unlocking, to prevent the lock acquired by client 1 from being released by client 2, the following Lua script is used to release the lock:
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
When executing this Lua script, KEYS[1] is resource_name and ARGV[1] is my_random_value. The principle is to first get the my_random_value associated with the lock and check that it equals the value passed in by the client, which prevents your lock from being released by someone else. In addition, running this in a Lua script guarantees atomicity; if the GET check and the DEL were not atomic, the lock could expire and be acquired by another client between the two steps, and the DEL would then delete the other client's lock.
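Correspondingly, here is a hedged sketch (again not from the original article) of releasing the lock by sending the script above through Jedis's eval; because the script runs server-side, the GET check and the DEL happen atomically. The helper class is hypothetical.

// Sketch: releasing the lock with the Lua script above via EVAL.
import java.util.Collections;
import redis.clients.jedis.Jedis;

public class RedisLockRelease {
    private static final String UNLOCK_SCRIPT =
        "if redis.call('get', KEYS[1]) == ARGV[1] then " +
        "  return redis.call('del', KEYS[1]) " +
        "else " +
        "  return 0 " +
        "end";

    // Returns true only if the lock was still ours and has now been deleted.
    public static boolean release(Jedis jedis, String resourceName, String randomValue) {
        Object result = jedis.eval(UNLOCK_SCRIPT,
                Collections.singletonList(resourceName),   // KEYS[1]
                Collections.singletonList(randomValue));   // ARGV[1]
        return Long.valueOf(1L).equals(result);
    }
}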
Analysis: this Redis locking and unlocking mechanism looks perfect, but there is an unavoidable sore point: how to set the expiration time. If the client blocks for a long time while operating on the shared resource and the lock expires in the meantime, then access to the shared resource is no longer safe.
However, some people will say:
After finishing the operation on the shared resource, the client can check whether the lock still belongs to it. If it does, the client commits the result and releases the lock; if it does not, the client does not commit.
OK, doing so can only reduce the probability of multiple clients operating on the shared resource at the same time; it does not solve the problem.
In order to make it easier for readers to understand, the blogger presents a business scenario.
Business scenario: we have a content editing page. To prevent multiple clients from modifying the same page at the same time, we use a distributed lock: only the client that holds the lock may modify the page. The normal flow of modifying a page is shown in the following figure.
Note that step (3) --> step (4.1) above is not an atomic operation. That is, the check at step (3) may report that the lock is still valid, but because of network delay and other reasons, by the time step (4.1) runs the lock has already expired. At that point the lock can be acquired by another client, and two clients end up operating on the shared resource at the same time.
Think about it: no matter what compensation measures you use, you can only reduce the probability of multiple clients operating on the shared resource; you cannot eliminate it. For example, a long GC pause may occur during step (4.1), and the lock may time out and expire during the pause, so it can again be acquired by another client. You can think these cases through yourself.
(2) ZooKeeper
Let's first briefly describe the principle. According to the documentation, ZooKeeper's distributed lock relies on the characteristics of ephemeral (EPHEMERAL) nodes.
When a znode is declared EPHEMERAL, if the client that created it crashes, the znode is deleted automatically. This avoids the problem of having to set an expiration time.
Each client tries to create a znode, for example /lock. The first client to create it succeeds, which is equivalent to acquiring the lock; the other clients fail to create it (the znode already exists) and therefore fail to acquire the lock. A minimal sketch of this approach follows.
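A hedged sketch of this create-an-ephemeral-znode approach with the plain ZooKeeper client (the helper class and lock path are placeholders; session handling and error handling are omitted):

// Sketch: trying to take a lock by creating an EPHEMERAL znode.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkEphemeralLock {
    public static boolean tryLock(ZooKeeper zk, String lockPath) throws Exception {
        try {
            // Only the first client succeeds; the node vanishes automatically if our session dies.
            zk.create(lockPath, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            return false;  // another client already holds the lock
        }
    }
}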
Analysis: although this avoids the problem of setting an expiration time, it is still possible for multiple clients to operate on the shared resource.
You should know that if ZooKeeper does not receive the client's heartbeat for a long time (longer than the session timeout), it considers the session expired, and all ephemeral znodes created by that session are deleted automatically.
In that case, the following situation can occur.
As shown in the figure above, when client 1 has a long GC pause, ZooKeeper cannot detect its heartbeat, and it is again possible for multiple clients to operate on the shared resource at the same time. Of course, you may say that we can tune the JVM to avoid GC pauses. Note, however, that such measures can only reduce the chance of multiple clients operating on the shared resource; they cannot eliminate it completely.
Round two: comparison in the cluster case
In production we usually deploy clusters, so the stand-alone case discussed in round one was just a warm-up.
(1) Redis
For high availability, a Redis node usually has a slave attached, with Sentinel handling master-slave failover. However, because Redis master-slave replication is asynchronous, the following can happen during data synchronization: the master goes down, and a slave that has not yet received the lock data is promoted to master, so the lock is lost. The specific sequence is as follows:
(1) Client 1 acquired the lock on the master.
(2) The master went down before the key storing the lock was synchronized to the slave.
(3) The slave was promoted to master.
(4) Client 2 acquired the lock for the same resource from the new master.
To deal with this situation, antirez, the author of Redis, proposed the RedLock algorithm, whose steps are as follows (the steps come from the official documentation; a minimal client-side sketch is given after the steps). Assume we have N master nodes (the official documentation sets N to 5; in practice any N greater than or equal to 3 will do).
(1) Get the current time in milliseconds.
(2) Request the lock on the N nodes in turn, using the same key and random value on every node. In this step, the client uses a timeout for each request that is much smaller than the lock's total auto-release time. For example, if the lock's auto-release time is 10 seconds, the per-node request timeout might be in the range of 5-50 milliseconds. This prevents the client from blocking for too long on a master node that is down; if a node is unavailable, move on to the next one as soon as possible.
(3) The client calculates how much time step 2 took. Only if the client successfully acquired the lock on a majority of the master nodes (3 in this case), and the total time consumed does not exceed the lock's release time, is the lock considered successfully acquired.
(4) If the lock was acquired successfully, its remaining effective time is the initial release time minus the time spent acquiring it.
(5) If acquisition fails, whether because fewer than a majority of the nodes were locked or because the total elapsed time exceeded the lock's release time, the client releases the lock on every master node, even on nodes where it believes the lock was not acquired.
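As a hedged illustration of what this looks like from application code (not part of the official steps), Redisson ships a RedLock implementation; the three client instances and the lock key below are assumed placeholders for connections to three independent Redis masters.

// Sketch: RedLock via Redisson's RedissonRedLock over three independent masters.
import org.redisson.RedissonRedLock;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;

public class RedLockExample {
    public static void withRedLock(RedissonClient redisson1,
                                   RedissonClient redisson2,
                                   RedissonClient redisson3) {
        RLock lock1 = redisson1.getLock("resource_name");
        RLock lock2 = redisson2.getLock("resource_name");
        RLock lock3 = redisson3.getLock("resource_name");
        RedissonRedLock redLock = new RedissonRedLock(lock1, lock2, lock3);
        redLock.lock();   // internally acquires the lock on a majority of the instances
        try {
            // operate on the shared resource
        } finally {
            redLock.unlock();
        }
    }
}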
Analysis: on closer inspection, the RedLock algorithm still has the following problems.
Node crash-restart problem: if a node crashes and restarts, multiple clients can end up holding the lock.
Suppose there are five Redis nodes: A, B, C, D, E. Imagine the following sequence of events:
(1) Client 1 successfully locked A, B, and C and thus acquired the lock (D and E were not locked).
(2) Node C crashes and restarts, and the lock client 1 added on C was not persisted, so it is lost.
(3) After node C restarts, client 2 locks C, D, and E and also acquires the lock.
Now client 1 and client 2 hold the lock for the same resource at the same time.
To deal with lock failures caused by node restarts, antirez proposed the concept of delayed restart: after a node crashes, it is not restarted immediately, but only after a delay longer than the lock's effective time. In this way, all locks the node participated in will have expired before it restarts, and the restart will not affect existing locks. Again, this is a manual compensation measure that reduces, but does not eliminate, the probability of inconsistency.
Time jump problem
Suppose there are five Redis nodes: A, B, C, D, E. Imagine the following sequence of events:
(1) Client 1 successfully acquired the lock from Redis nodes A, B, C (a majority). Communication with D and E failed due to network problems.
(2) The clock on node C jumps forward, causing the lock held on it to expire early.
(3) Client 2 successfully acquired the lock for the same resource from Redis nodes C, D, E (a majority).
(4) Now client 1 and client 2 both believe they hold the lock.
To deal with lock failures caused by clock jumps, antirez proposed forbidding manual changes to the system time and using an ntpd configuration that does not "jump" the clock when adjusting system time. This, too, is a manual compensation measure that only reduces the probability of inconsistency.
Timeout leads to lock failure
The RedLock algorithm does not solve the problem of the lock expiring because the operation on the shared resource takes too long. Recall the RedLock process, shown in the following figure.
As the figure shows, we divide it into two parts. For the steps in the upper box, RedLock can handle delays whatever their cause: the client will not end up with a lock that it believes is valid but has actually expired. However, for the steps in the lower box, if a delay causes the lock to expire, client 2 can still acquire the lock. So the RedLock algorithm does not solve this problem.
(2) ZooKeeper
When ZooKeeper is deployed as a cluster, the number of nodes is generally odd and at least 3. Let's first recall how ZooKeeper writes data.
As shown in the figure (the blogger was too lazy to draw this one and borrowed it from another article).
The steps of the write flow are as follows:
1. The Client sends a write request to a Follower.
2. The Follower forwards the request to the Leader.
3. When the Leader receives it, it initiates a proposal and asks the Followers to vote.
4. The Followers send their votes to the Leader; as long as more than half return an ACK, the proposal is considered passed.
5. After the Leader tallies the results, if the write is approved, it performs the write, notifies the Followers of the write operation, and then commits.
6. The Follower returns the result of the request to the Client.
In addition, ZooKeeper applies a global ordering (serialization) to write operations.
OK, now let's start the analysis.
Cluster synchronization
The Client writes data to a Follower, but the Follower goes down. Will the data be inconsistent? No: in that case the Client simply fails to create the node and does not get the lock at all.
The Client writes data to a Follower, the Follower forwards the request to the Leader, and the Leader goes down. Will there be inconsistency? No: ZooKeeper elects a new Leader and continues the write process described above.
In short, when ZooKeeper is used for distributed locks, either you fail to acquire the lock, or, once you acquire it, the data on the nodes is consistent; there is no data loss caused by asynchronous replication as with Redis.
Time jump problem
ZooKeeper does not depend on a global clock, so this problem does not arise.
Timeout leads to lock failure
ZooKeeper does not depend on a lock expiration time, so this problem does not arise.
Round three: comparison of other lock characteristics
(1) The read/write performance of Redis is much better than that of ZooKeeper. If ZooKeeper is used for distributed locks in high-concurrency scenarios, lock acquisition may fail and a performance bottleneck will appear.
(2) ZooKeeper can implement read-write locks; Redis cannot.
(3) ZooKeeper's watch mechanism. When a client tries to create a znode and finds that it already exists, the creation fails and the client enters a waiting state. When the znode is deleted, ZooKeeper notifies the waiting client through the watch mechanism, and the client can then retry the create operation (acquire the lock). This lets a distributed lock be used on the client side like a local lock: a failed acquisition blocks until the lock is obtained. Redis cannot provide this mechanism. A minimal sketch of this wait-and-retry pattern follows.
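A hedged sketch of this blocking behaviour with the plain ZooKeeper client: set a watch on the znode and retry the create when it is deleted. The helper class and lock path are placeholders, and the loop is deliberately simplified (no fairness, no error handling).

// Sketch: block until the lock znode disappears, then try to create it again.
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkBlockingLock {
    public static void lock(ZooKeeper zk, String lockPath) throws Exception {
        while (true) {
            try {
                zk.create(lockPath, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return;  // we now hold the lock
            } catch (KeeperException.NodeExistsException e) {
                CountDownLatch deleted = new CountDownLatch(1);
                // exists() registers a one-shot watch; ZooKeeper fires it when the node is deleted.
                if (zk.exists(lockPath, event -> {
                        if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                            deleted.countDown();
                        }
                    }) != null) {
                    deleted.await();  // wait for the current holder to release
                }
                // If exists() returned null, the node is already gone; loop and retry the create.
            }
        }
    }
}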
Author: Lonely Smoke. Source: http://rjzheng.cnblogs.com/
Summary
OK, the article has been rather long-winded. In fact, I only want to make two points: both Redis and ZooKeeper have reliability issues, but ZooKeeper's distributed locks are much more reliable than Redis's. However, ZooKeeper's read/write performance is not as good as Redis's, and it can become a performance bottleneck. If you use either in production, evaluate the trade-off for yourself.