
How to understand the consistent hash algorithm


This article introduces what you need to know about "how to understand the consistent hash algorithm". In practice, many people run into difficulties with this topic, so let the editor walk you through how to handle these situations. I hope you will read carefully and come away having learned something!

To understand consistent hashing, we must first understand the traditional hash and its limitations in large-scale distributed systems. Simply put, a hash table is a store of key-value pairs: given a key, the associated value can be found very efficiently. Suppose we want to look up the name of a street in a city by its zip code. One of the easiest ways to implement this is to store the information in a hash dictionary, as in the sketch below.
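A minimal sketch in Java; the zip codes and street names are made-up values for illustration:

import java.util.HashMap;
import java.util.Map;

public class ZipLookup {
    public static void main(String[] args) {
        // key-value store: zip code -> street name (sample data)
        Map<String, String> streetsByZip = new HashMap<>();
        streetsByZip.put("10001", "W 34th St");
        streetsByZip.put("94103", "Mission St");

        // given a key, the value is found in O(1) time on average
        System.out.println(streetsByZip.get("94103")); // prints: Mission St
    }
}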

The problem becomes more interesting when the data is too large to be stored on a single node or machine, and multiple nodes or machines are needed to store it, for example in a system that uses several web caching servers. How do we determine which key is stored on which node? The simplest solution is hash modulo: given a key, first hash it, take the hash modulo the number of nodes in the system, and store the key on that node. To read a key back, hash it, take the hash modulo the number of nodes again, and fetch the value from that node. The hash algorithm corresponding to this process is defined as follows:

node_number = hash(key) % N  # where N is the number of nodes

The figure below depicts the traditional hash modulo algorithm in a multi-node system; based on it, simple load balancing can be achieved.

1. Limitations of the traditional hash modulo algorithm

Let's analyze the traditional hash and its limitations in large-scale distributed systems. Here we directly reuse the SimpleHash class defined in my previous article, hash the keys "semlinker", "kakuqo" and "test", and take the remainder modulo the number of nodes. The specific code is as follows:

public class SimpleHash {
    private int cap;
    private int seed;

    public SimpleHash(int cap, int seed) {
        this.cap = cap;
        this.seed = seed;
    }

    public int hash(String value) {
        int result = 0;
        int len = value.length();
        for (int i = 0; i < len; i++) {
            result = seed * result + value.charAt(i);
        }
        // cap must be a power of two for this mask to act like a modulo
        return (cap - 1) & result;
    }

    public static void main(String[] args) {
        // the constructor arguments were truncated in the source; 2 << 10 and 31 are assumed values
        SimpleHash simpleHash = new SimpleHash(2 << 10, 31);
        System.out.println("node_number=hash(\"semlinker\") % 3 -> " + simpleHash.hash("semlinker") % 3);
        System.out.println("node_number=hash(\"kakuqo\") % 3 -> " + simpleHash.hash("kakuqo") % 3);
        System.out.println("node_number=hash(\"test\") % 3 -> " + simpleHash.hash("test") % 3);
    }
}

The output reported in the original article (the exact numbers depend on the constructor arguments) is:

node_number=hash("semlinker") % 3 -> 1
node_number=hash("kakuqo") % 3 -> 2
node_number=hash("test") % 3 -> 0

Based on the above output, we can create the following table:

key         node number (hash % 3)
semlinker   1
kakuqo      2
test        0

1.1 Scenarios with reduced nodes

In distributed multi-node systems, failures are very common: any node may go down without prior notice, and in that case we expect the system's performance to degrade gracefully without losing normal function. What happens to our original example when a node fails? There were three nodes; suppose one of them fails, so the number of nodes drops from 3 to 2, and the mapping table changes accordingly:

Clearly, reducing the number of nodes changes the mapping between keys and nodes. New keys are unaffected, but existing keys now map to the wrong nodes. Take "semlinker" as an example: before the change the system has 3 nodes and the key maps to node 1; after the failure the node count drops to 2 and the same key maps to node 0.

1.2 Scenarios with added nodes

In distributed multi-node systems, scenarios such as holiday promotions require adding service nodes to absorb sudden traffic. What happens to our original example when a node is added? There were 3 nodes; suppose one node is temporarily added for capacity expansion, so the number of nodes grows from 3 to 4, and the mapping table changes again:

Clearly, adding nodes also changes the mapping between keys and nodes. Again, new keys are unaffected, but existing keys map to the wrong nodes. Taking "semlinker" as an example once more: before the change the system has 3 nodes and the key maps to node 1; after the node is added the count rises to 4 and the same key maps to node 2.

When the number of nodes in the cluster changes, the previous mapping rules break. If every machine in the cluster provided an identical stateless service, this would not matter; but for systems such as distributed caches, broken mapping rules mean the previous cache entries are effectively invalidated. If a large number of cache entries fail at the same time, a "cache avalanche" can occur, with disastrous consequences.
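To see how disruptive this is, here is a minimal sketch that counts how many keys land on a different node when the cluster grows from 3 to 4 nodes; the key set is made up, and Java's built-in String.hashCode stands in for the hash function:

public class RemapDemo {
    public static void main(String[] args) {
        int keyCount = 10_000;
        int moved = 0;
        for (int k = 0; k < keyCount; k++) {
            // mask off the sign bit so the hash is non-negative
            int hash = ("key-" + k).hashCode() & 0x7fffffff;
            // a key must move whenever its node number differs before and after scaling
            if (hash % 3 != hash % 4) moved++;
        }
        // with hash % N, roughly three quarters of all keys change nodes when N goes from 3 to 4
        System.out.printf("moved %d of %d keys (%.1f%%)%n", moved, keyCount, 100.0 * moved / keyCount);
    }
}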

To resolve this, we would have to redistribute all existing keys across the remaining nodes, which can be a very expensive operation and may adversely affect the running system. Of course, besides reallocating all existing keys, there is a better solution: the consistent hash algorithm.

2. The consistent hash algorithm

The consistent hash algorithm, proposed by Karger et al. at the Massachusetts Institute of Technology in 1997, is a special hash algorithm: when a server is removed or added, it changes the mapping between existing service requests and the servers handling them as little as possible. Consistent hashing solves the dynamic scaling problem of the simple hash algorithm in distributed hash tables (Distributed Hash Table, DHT).

2.1 Advantages of the consistent hashing algorithm

Scalability. The consistent hash algorithm ensures that adding or removing servers changes the placement of stored data as little as possible, greatly reducing the cost of data movement compared with the traditional hash algorithm.

Better adaptation to rapid data growth. When data distributed with the consistent hashing algorithm keeps growing, some virtual nodes may come to hold much more data than others, so the data is unevenly distributed across virtual nodes. A virtual node holding too much data can then be split; the split merely divides the original virtual node in two and does not require rehashing and repartitioning all of the data.

If the load on the physical servers is still uneven after a virtual node is split, it suffices to move some virtual nodes between servers. In this way, the number of physical servers can be expanded dynamically as the data grows, at a cost far below that of redistributing all data under the traditional hashing algorithm.

2.2 The relationship between the consistent hash algorithm and the hash algorithm

The consistent hash algorithm builds on the plain hash algorithm. In a dynamic distributed environment, a hash algorithm should satisfy three conditions: balance, monotonicity, and dispersion.

Balance: the hash results should be distributed evenly across the nodes, which solves the load-balancing problem algorithmically.

Monotonicity: when nodes are added or removed, existing keys either stay on their current node or move to the new node, but never shuffle among the remaining old nodes, so the system keeps operating normally.

Dispersion: data should be spread across the nodes of the distributed cluster (nodes may hold backups of one another) without requiring every node to store all of the data.

3. The principle of the consistent hashing algorithm

The consistent hash algorithm is implemented with a data structure called the consistent hash ring. The ring starts at 0 and ends at 2^32 - 1, with the start point joined to the end point, so the ring covers the integer range [0, 2^32 - 1], as shown in the figure below:
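As a side note, here is one way a 32-bit Java hash can be mapped onto this range: treating the bits as unsigned yields a position in [0, 2^32 - 1]. The implementation in section 4 instead forces the hash positive with Math.abs, which also works but uses only half the ring. The key below is illustrative:

public class RingPosition {
    public static void main(String[] args) {
        int hash = "semlinker".hashCode(); // may be negative as a signed int
        // interpret the same 32 bits as an unsigned value: a position in [0, 2^32 - 1]
        long position = hash & 0xFFFFFFFFL;
        System.out.println(position);
    }
}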

3.1 Placing objects on the hash ring

Suppose we have four objects, "semlinker", "kakuqo", "lolo" and "fer", abbreviated o1, o2, o3 and o4 respectively. We use the hash function to compute each object's hash value, which falls in the range [0, 2^32 - 1]:

The mapping of the objects in the figure is as follows:

hash(o1) = K1; hash(o2) = K2; hash(o3) = K3; hash(o4) = K4

3.2 Placing servers on the hash ring

Next, using the same hash function, we place the servers on the hash ring. We can hash each server's IP address or hostname as the key, so that every server gets its own position on the ring. Assume we have three cache servers, cs1, cs2, and cs3:

The mapping of the servers in the figure is as follows:

hash(cs1) = T1; hash(cs2) = T2; hash(cs3) = T3; # Cache Server

3.3 Selecting a server for an object

With the objects and the servers placed on the same hash ring, we search clockwise from an object's hash value for the nearest machine; that machine is the one the object belongs to. Take the o2 object as an example: the nearest machine found clockwise is cs2, so server cs2 caches the o2 object. Server cs1 caches the o1 and o3 objects, and server cs3 caches the o4 object.
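This clockwise search maps naturally onto a sorted map: pick the first server whose position is greater than or equal to the object's hash, and wrap around to the first server when none lies ahead. Below is a minimal sketch with made-up positions; the full implementation in section 4 uses tailMap to the same effect:

import java.util.Map;
import java.util.TreeMap;

public class ClockwiseLookup {
    public static void main(String[] args) {
        // ring position -> server; the positions are made-up stand-ins for T1, T2, T3
        TreeMap<Integer, String> ring = new TreeMap<>();
        ring.put(1000, "cs1");
        ring.put(5000, "cs2");
        ring.put(9000, "cs3");

        int objectHash = 4200; // a made-up stand-in for K2, between T1 and T2
        // ceilingEntry returns the first server at or after the object's position (clockwise)
        Map.Entry<Integer, String> entry = ring.ceilingEntry(objectHash);
        // if no server lies ahead on the ring, wrap around to the first one
        String server = (entry != null) ? entry.getValue() : ring.firstEntry().getValue();
        System.out.println(server); // prints: cs2
    }
}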

3.4 Adding a server

Suppose that business needs require us to add a server, cs4. After the same hash computation, cs4 falls between the positions T1 and T2 on the ring, as shown in the figure below:


In this case, only the objects lying between T1 and T2 need to be reassigned; in the example above, only the o3 object must be reassigned, that is, redirected to the new cs4 server. As we analyzed earlier, with the simple modulo approach most of the cache may be invalidated when a server is added; with the consistent hashing algorithm this situation improves greatly, because only a small number of objects need to be reallocated.
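Continuing the sketch from section 3.3, we can watch this happen: after a fourth server is inserted on the ring, only an object in the arc it claims changes owner, while lookups outside that arc are untouched. Positions are again made up for illustration:

import java.util.Map;
import java.util.TreeMap;

public class AddServerDemo {
    // find the nearest server clockwise from the given ring position
    static String lookup(TreeMap<Integer, String> ring, int pos) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(pos);
        return (e != null) ? e.getValue() : ring.firstEntry().getValue();
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> ring = new TreeMap<>();
        ring.put(1000, "cs1");
        ring.put(5000, "cs2");
        ring.put(9000, "cs3");

        int obj = 2500; // made-up object position inside the arc the new server will claim
        System.out.println("before: " + lookup(ring, obj)); // prints: cs2
        ring.put(3000, "cs4"); // the new server lands between positions 1000 and 5000
        System.out.println("after:  " + lookup(ring, obj)); // prints: cs4
    }
}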

3.5 Removing a server

Suppose the cs3 server fails and goes offline. The object o4, originally stored on cs3, is then reassigned to the cs2 server, while all other objects remain on their original machines.

3.6 Virtual nodes

This covers the fundamentals of consistent hashing, but the server addition above still leaves a problem: the new cs4 server only relieves the load of the cs1 server, while cs2 and cs3 gain nothing from its arrival. If cs4's performance is comparable to or even better than that of the original servers, this outcome is not what we hoped for.

To address this, we can introduce virtual nodes to even out the load. Each physical server is virtualized into a group of virtual servers, and it is the virtual servers that are placed on the hash ring. To determine the server for an object, we first find the object's virtual server on the ring, and that virtual server in turn determines the physical server.

In the figure, o1 and o2 are objects, v1 through v6 are virtual servers, and s1 through s3 are physical servers.
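Since the implementation in the next section deliberately leaves out virtual nodes, here is a minimal sketch of how they could be layered on top: each physical server is hashed onto the ring several times under derived names such as "cs1#0", "cs1#1", and the ring maps every virtual position straight back to its physical server. The naming scheme and replica count are illustrative assumptions:

import java.util.Map;
import java.util.TreeMap;

public class ConsistentHashingWithVirtualNodes {
    private static final int VIRTUAL_COPIES = 100; // virtual nodes per physical server (tunable)
    private static final TreeMap<Integer, String> ring = new TreeMap<>();

    // place every virtual node of a physical server on the ring
    public static void addServer(String server) {
        for (int i = 0; i < VIRTUAL_COPIES; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    // clockwise search, identical to the plain version; the value is already the physical server
    public static String getServer(String key) {
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null) ? entry.getValue() : ring.firstEntry().getValue();
    }

    // FNV1_32_HASH, the same mixing as in the section 4 implementation
    private static int hash(String str) {
        final int p = 16777619;
        int h = (int) 2166136261L;
        for (int i = 0; i < str.length(); i++) h = (h ^ str.charAt(i)) * p;
        h += h << 13; h ^= h >> 7; h += h << 3; h ^= h >> 17; h += h << 5;
        return h < 0 ? Math.abs(h) : h;
    }

    public static void main(String[] args) {
        addServer("cs1"); addServer("cs2"); addServer("cs3");
        System.out.println("semlinker -> " + getServer("semlinker"));
    }
}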

4. Implementation of the consistent hashing algorithm

Here we only present an implementation of the consistent hashing algorithm without virtual nodes:

import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashingWithoutVirtualNode {
    // servers to be added to the hash ring
    private static String[] servers = {"192.168.0.1:8888", "192.168.0.2:8888", "192.168.0.3:8888"};

    // key: the server's hash value, value: the server itself
    private static SortedMap<Integer, String> sortedMap = new TreeMap<>();

    // initialize the program and put all servers into sortedMap
    static {
        for (int i = 0; i < servers.length; i++) {
            int hash = getHash(servers[i]);
            System.out.println("[" + servers[i] + "] joined the ring, hash value: " + hash);
            sortedMap.put(hash, servers[i]);
        }
    }

    // get the node that the key should be routed to
    private static String getServer(String key) {
        // hash value of the key
        int hash = getHash(key);
        // all entries whose hash is greater than or equal to the key's hash
        SortedMap<Integer, String> subMap = sortedMap.tailMap(hash);
        if (subMap.isEmpty()) {
            // no node has a larger hash, so wrap around to the first node on the ring
            Integer i = sortedMap.firstKey();
            // return the corresponding server
            return sortedMap.get(i);
        } else {
            // the first key is the nearest node clockwise
            Integer i = subMap.firstKey();
            // return the corresponding server
            return subMap.get(i);
        }
    }

    // compute the hash value using the FNV1_32_HASH algorithm
    private static int getHash(String str) {
        final int p = 16777619;
        int hash = (int) 2166136261L;
        for (int i = 0; i < str.length(); i++)
            hash = (hash ^ str.charAt(i)) * p;
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;
        // negative hashes must be made positive so they fall on the ring
        if (hash < 0)
            hash = Math.abs(hash);
        return hash;
    }

    // the original main method was truncated in the source; this demo is an assumption
    public static void main(String[] args) {
        String[] keys = {"semlinker", "kakuqo", "test"};
        for (String key : keys)
            System.out.println("[" + key + "] hash: " + getHash(key) + ", routed to node [" + getServer(key) + "]");
    }
}
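Two details of this implementation are worth noting. sortedMap.tailMap(hash) returns the portion of the map whose keys are greater than or equal to hash, which is exactly the clockwise search from section 3.3, with the isEmpty branch supplying the wrap-around to the start of the ring. And FNV1_32_HASH is used rather than String.hashCode because similar keys such as sequential IP addresses would otherwise hash to nearby positions on the ring, hurting the balance property from section 2.2.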
