
How to understand the consistent hash algorithm and its implementation


This article explains how to understand the consistent hash algorithm and its implementation. The editor finds it very practical, so it is shared here in the hope that you will get something out of it after reading. Without further ado, let's take a look.

What is the consistent hash algorithm?

The consistent hash algorithm was proposed at MIT in 1997 and is mainly used in distributed caching.

It effectively solves the problems caused by dynamically adding and removing nodes in a distributed storage structure.

Consistent hashing is used in Memcached, key-value stores, the BitTorrent DHT, and LVS; it is fair to call it the preferred load-balancing algorithm for distributed systems.

Disadvantages of the traditional hash algorithm

The commonly used approach is to take the remainder of the hash result (hash(key) mod N): with machines numbered 0 to N-1, compute each request's hash with a custom hash algorithm, take it modulo N to get a remainder i, and dispatch the request to the machine numbered i. However, this algorithm has a fatal flaw. If a machine goes down, the requests that should land on it can no longer be handled correctly, and the failed server has to be removed from the algorithm; at that point the cached data on (N-1)/N of the servers must be recalculated. If a new machine is added, the cached data on N/(N+1) of the servers must be recalculated. This jolt is usually unacceptable for the system, because it means a large number of caches are invalidated or a lot of data has to be moved.

With the traditional approach, when the number of cache nodes grows from 3 to 4, the cache miss rate is 75%. Calculation: take 12 keys whose hash values are 1 through 12, map them onto 3 nodes and then onto 4, and compare; only 3 of the 12 keys map to the same node as before, so 75% of the cached entries become invalid, which can trigger a cache avalanche.
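
To make the arithmetic concrete, here is a minimal, self-contained Java check; the keys 1 through 12 stand in for hash values, as in the example above.

import java.util.stream.IntStream;

// Route the keys 1..12 with hash(key) mod N for N = 3 and N = 4
// and count how many keys still land on the same node afterwards.
public class ModuloRemapCheck {
    public static void main(String[] args) {
        long unchanged = IntStream.rangeClosed(1, 12)
                .filter(key -> key % 3 == key % 4) // same node before and after
                .count();
        System.out.println("unchanged: " + unchanged + " of 12"); // prints 3 of 12
        // The other 9 keys (75%) move to a different node, so their
        // cached entries all miss after the resize.
    }
}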

Consistent hash algorithm

First, we map the range of the hash algorithm onto a space of 2^32 buckets, that is, the number space 0 to 2^32 - 1. Now connect these numbers end to end so that they form a closed ring.

Each cache key can be converted by the hash algorithm into a 32-bit number, which corresponds to a position in the ring space. In this way all cache keys are mapped to different positions on the ring.

Each cache node uses the same hash algorithm, for example hashing its IP address or hostname, and is likewise mapped onto the ring space, as shown in the following figure.

How do keys correspond to cache nodes? Quite simply: the nearest node clockwise from each key is the node that owns it. So in the figure, key1 is stored on node1; key2 and key3 are stored on node2; and key4 is stored on node3.
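
Here is a minimal Java sketch of the clockwise lookup; the ring positions 100, 200 and 300 are made-up hash values for illustration, not the output of a real hash function.

import java.util.Map;
import java.util.TreeMap;

// The node owning a key is the first ring position at or after the
// key's hash; if nothing lies clockwise, wrap to the smallest position.
public class ClockwiseLookup {
    public static void main(String[] args) {
        TreeMap<Integer, String> ring = new TreeMap<>();
        ring.put(100, "node1");
        ring.put(200, "node2");
        ring.put(300, "node3");

        int keyHash = 250; // falls between node2 and node3 on the ring
        Map.Entry<Integer, String> owner = ring.ceilingEntry(keyHash);
        if (owner == null) {
            owner = ring.firstEntry(); // nothing clockwise: wrap around
        }
        System.out.println(owner.getValue()); // -> node3
    }
}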

The advantage of consistent hashing becomes apparent when cache nodes are added or removed. Let's look at the details:

Add nodes

When a node is added to the cache cluster, the mapping of the whole ring space still follows the clockwise rule of consistent hashing, so only a small number of keys have their ownership affected.

Which keys are affected? In the figure, a new node node4 is added between node1 and node2. By the clockwise rule, the keys between node1 and node4 no longer belong to node2 but to the new node4. Therefore only key2 is affected.

Finally, key2's cached data is migrated from node2 to node4, forming a new cache structure that again satisfies the consistent hashing rule.

Delete nodes

When a node needs to be removed from the cache cluster (for example, because it has gone down), the mapping of the whole ring space still follows the clockwise rule of consistent hashing, and again only a small number of keys have their ownership affected.

Which keys are affected? In the figure, node3 is removed. By the clockwise rule, the cached data originally owned by node3 is handed over to node3's clockwise successor, node1. Therefore only key4 is affected.

Finally, key4's cached data is migrated from node3 to node1, forming a new cache structure that again satisfies the consistent hashing rule.

Note: the "migration" mentioned here is not a direct data transfer. On lookup, the key simply resolves to the clockwise successor node, and because the cache misses there, the entry is refreshed from the backing store.

Calculation: assuming the node hashes are spread uniformly around the ring (hashing is random, so in practice the spread is not ideal), with consistent hashing, when the number of cache nodes grows from 3 to 4, somewhere between 0 and 33% of the cache is invalidated. Moreover, the new node does not disturb the load on all of the original nodes, only on its clockwise successor.

Consistent hashing is already a big improvement over the traditional hash algorithm, but can it be improved further? And what if the distribution is uneven? As shown in the following figure, under the clockwise rule all of the keys can end up belonging to a single node.

Consistent hash algorithm + virtual node

To mitigate the imbalance caused by having too few nodes, the consistent hash algorithm introduces the concept of virtual nodes.

A virtual node is one of N child nodes derived from an original physical node; all of these child nodes are then mapped onto the ring space.

The more virtual nodes there are, the more even the distribution. Using consistent hashing with virtual nodes, when the number of cache nodes grows from 3 to 4 the cache invalidation rate is about 25%, and every node carries an even share of the load.
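
A small Java sketch of how virtual nodes might be generated follows; the "#i" replica suffix and the VIRTUAL_COPIES count are illustrative choices, and String.hashCode() is only a placeholder here (a better hash function is discussed in the next section).

import java.util.TreeMap;

// Each physical node is inserted into the ring several times under
// derived names, so its ranges are scattered around the ring.
public class VirtualNodeSketch {
    static final int VIRTUAL_COPIES = 160; // more copies -> smoother spread

    static void addNode(TreeMap<Integer, String> ring, String node) {
        for (int i = 0; i < VIRTUAL_COPIES; i++) {
            // every replica maps back to its physical node;
            // hashCode() is a stand-in for a proper hash function
            ring.put((node + "#" + i).hashCode(), node);
        }
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> ring = new TreeMap<>();
        addNode(ring, "node1");
        addNode(ring, "node2");
        System.out.println(ring.size() + " ring positions for 2 nodes");
    }
}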

Implementation of consistent hash algorithm + Virtual Node

Once the principle is understood, the implementation is not difficult; it mainly comes down to a few details:

Choice of hash algorithm. Do not use Java's built-in hashCode method: its results are not dispersed well enough, and it can return negative values that have to be handled. There are many hash functions to choose from, such as CRC32_HASH, FNV1_32_HASH, KETAMA_HASH, and so on. KETAMA_HASH is the default consistent-hash function recommended by Memcached, but other hash functions can also be used; FNV1_32_HASH, for example, is more efficient.
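
For illustration, here is a commonly seen Java rendering of FNV1_32_HASH; the constants are the standard 32-bit FNV prime and offset basis, and the final mask folds the result into a non-negative ring position.

// FNV1_32_HASH: xor-and-multiply over the characters of the string,
// plus a final avalanche of shifts to scramble the low bits.
public final class Fnv1Hash {
    public static int getHash(String str) {
        final int p = 16777619;       // 32-bit FNV prime
        int hash = (int) 2166136261L; // 32-bit FNV offset basis
        for (int i = 0; i < str.length(); i++) {
            hash = (hash ^ str.charAt(i)) * p;
        }
        hash += hash << 13;
        hash ^= hash >>> 7;
        hash += hash << 3;
        hash ^= hash >>> 17;
        hash += hash << 5;
        return hash & 0x7FFFFFFF;     // clear the sign bit
    }
}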

Choice of data structure. From the principle of the algorithm, the data structure has several requirements:

It must store entries sorted by hash value.

Lookups over the sorted entries must be fast (which rules out a List).

The sorted structure must also be easy to modify (which rules out an array).

In addition, a plain binary search tree can become extremely unbalanced, so a red-black tree is the safest implementation choice. In Java you can simply use TreeMap.
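
Putting the pieces together, here is a minimal end-to-end sketch: a TreeMap-backed ring, virtual nodes, and the clockwise (ceiling) lookup. The VIRTUAL_COPIES count and the "#i" replica suffix are illustrative choices rather than part of the algorithm, and the hash function is the FNV1_32_HASH variant shown above.

import java.util.Map;
import java.util.TreeMap;

public class ConsistentHashRouter {
    private static final int VIRTUAL_COPIES = 160;
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        for (int i = 0; i < VIRTUAL_COPIES; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < VIRTUAL_COPIES; i++) {
            // remove(key, value) avoids evicting a colliding replica
            // that belongs to a different node
            ring.remove(hash(node + "#" + i), node);
        }
    }

    public String route(String key) {
        if (ring.isEmpty()) return null;
        Map.Entry<Integer, String> owner = ring.ceilingEntry(hash(key));
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }

    private static int hash(String str) {
        final int p = 16777619;
        int h = (int) 2166136261L;
        for (int i = 0; i < str.length(); i++) {
            h = (h ^ str.charAt(i)) * p;
        }
        h += h << 13; h ^= h >>> 7; h += h << 3; h ^= h >>> 17; h += h << 5;
        return h & 0x7FFFFFFF;
    }

    public static void main(String[] args) {
        ConsistentHashRouter router = new ConsistentHashRouter();
        router.addNode("node1");
        router.addNode("node2");
        router.addNode("node3");
        System.out.println("key42 -> " + router.route("key42"));
        router.addNode("node4"); // only keys in node4's ranges change owners
        System.out.println("key42 -> " + router.route("key42"));
    }
}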

The above is how to understand the consistent hash algorithm and its implementation. The editor believes these are knowledge points you may see or use in your daily work; I hope you can learn more from this article.
