How to understand Hash algorithm in distributed system 04/29 Update SLTechnology News&Howtos

How to understand Hash algorithm in distributed system

2025-04-29 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Shulou(Shulou.com)06/03 Report--

This article mainly introduces "how to understand the hash algorithm in the distributed system". In the daily operation, I believe that many people have doubts about how to understand the hash algorithm in the distributed system. I hope it will be helpful to answer the question of "how to understand the hash algorithm in the distributed system". Sort out the simple and easy-to-use operation methods! Next, please follow the editor to study!

Hash

Hash, also known as hash, hash, the principle is that any length of string as input, and then through the Hash algorithm into a fixed-length output. Hash is a mapping process, so conflicts are bound to occur. Chain address method, open addressing method and other methods are generally used to solve hash conflicts.

Distributed hash

In distributed scenarios, in order to solve the problem of data and request orientation, we often use the hash algorithm. Next, we will introduce several hash algorithms that are often used in distributed environments.

Ordinary hash

Even in a distributed environment, we can use the simplest hash algorithm, by setting the corresponding result value of each server, calculating when the request or data comes in, and mapping the data to the corresponding server respectively. Because the calculation rules are consistent, no matter how many calculations are carried out, the data mapping will not change.

This common hash method has distinct advantages and disadvantages, and its advantage is that it is easy to implement and clear. The disadvantage is that the nodes in the distributed system are full of uncertainty, which may lead to capacity reduction or expansion or node downtime. In these cases, it means that the mapping of the hash will change. At the same time, the previously mapped data needs to be migrated so that it can be accessed correctly later. The amount of data migration generated by this way of hashing in this case will be very large.

The above figure shows a hash mapping relationship of a common distributed system. Three servers accept requests with a hash value of 0meme 1 ~ 2 respectively (usually the value of% 2 is calculated).

So when we add a server, the original three servers become four, the hash mapping needs to be modified, and a large amount of data needs to be migrated.

Consistent hash

Consistent hashing is to solve the problem of massive data migration caused by hashing mentioned earlier.

Compared with the ordinary hash, the consistent hash also has a certain mapping relationship, but the difference is that we will create a hash ring at the beginning, with a large number of node values distributed on the ring, with a general range of 0 ~ 2 ^ 32-1.

Then we will drop the server node on the ring according to certain rules, as shown in the following figure

After that, the logic is relatively simple. When the request is sent to the server, it is calculated to find its corresponding location on the ring, and then check whether there is a corresponding server node in that location, and if so, forward the request. If not, search clockwise along the hash ring until the node location is found.

The benefits of this design are obvious. If we need to add or delete nodes, it will only affect the data of up to 2 nodes at a time, and the consumption is obviously less than the previous normal hash. And when some servers suddenly go down due to failure, the request can also be deferred to the next node for processing.

Problem of uneven distribution of nodes

The characteristic of consistent hash determines that if the distribution of nodes is not uniform enough, it will lead to excessive pressure on some nodes, and some nodes have a lot of idle resources. The figure below is as follows

The An and B nodes in the figure are obviously unbalanced, and more requests will be sent to A nodes, while B nodes can only get requests from A nodes about 1 to 3.

So consistent hashing often introduces the concept of virtual nodes. Although our servers are not evenly distributed, we can assume that we can create some virtual nodes and create corresponding mappings to help the virtual nodes forward requests to the actual nodes.

As shown in the figure, we can create a corresponding virtual node, called Achievement, and then forward the request to B'to B 'and forward the request to B' to A, so that there will be no imbalance.

Hash slot

A typical hash slot is the distributed implementation of redis.

In the distributed implementation of redis, all servers are confirmed when the cluster is started, and then 16384 hash slots are evenly distributed to all master servers, and all the data is stored in the specified nodes.

There are similarities and differences between the implementation of redis's hash slot and the consistent hash. The main reason is that redis adopts different high availability strategies. Consistent hash redirects traffic to the next server when the server goes down, but redis is different. Redis's cluster mode ensures the active / standby mode of the server node. The backup node will not directly participate in the allocation of the hash slot, but when the master node goes down, the slave node will take the place of the master node to handle the task.

Distributed hash table

Distributed hash table (DHT) is a distributed hash method. Unlike consistent hashes, DHT does not need a central node to allocate the flow of data. He has his own mechanism to ensure that no matter which server the data starts from, he can find the right server he needs to go to.

Kademlia algorithm

Kademlia algorithm is a typical distributed hash table algorithm, which is mostly used in the construction of P2P network, which is jointly created by Petar Maymounkov and David Mazieres. Source address of Kademlia essay: Kademlia

The difficulty of hash table in distributed environment lies in the following points:

In a distributed environment, it is impossible for each server to grasp the situation of all servers, so how to ensure that your request can find the corresponding server without the location of the central node is a big difficulty.

Similarly, because the information of the server in the distributed environment is limited, how to join and exit the server can be known by the cluster is also a big difficulty.

So let's look at how the Kademlia algorithm solves these problems.

XOR operation

Kademlia uses XOR to calculate the distance. Let's first look at the definition of XOR.

The algorithm of XOR is: 0 ⊕ 0 ⊕ 1 ⊕ 0 1 ⊕ 1 ⊕ 1 0 (same is 0, different is 1)

Then let's see why XOR is used to calculate distance.

A ⊕ b = b ⊕ a / / XOR obeys the exchange law, the distance from a node to b node is the same as the distance from b node to node a ⊕ a = 0 / / the distance between oneself and itself is 0a ⊕ b > = 0 / / the distance between two nodes is greater than 0a ⊕ b + b ⊕ c > = a ⊕ c / a to b and then to c is greater than or equal to the distance directly to c

According to some of the above XOR rules, we can find that some features of XOR and distance can be said to be a perfect match. I really admire the author of the algorithm to come up with such an exquisite design.

Construction of binary tree

After determining the distance using XOR, how does a specific cluster build and store information so that the correct information can be found?

In the theory of Kademlia algorithm, each node in the cluster will store the information of some nodes (it is impossible to store the information of all nodes, because the efficiency will be low and the real-time performance can not be guaranteed).

All nodes are built into a unique binary tree as shown below:

First of all, after a certain hash of the id of each node, we get a string of 01 strings of the node to indicate its position in the tree. From the high position, 1 goes to the left subtree and 0 to the right subtree until the end. You can see that the hash value of the black node in the figure is 0011.

Split of binary tree

Each node in the binary tree of Kademlia can split the binary tree according to its own perspective.

The splitting rule is to split the subtrees that do not contain themselves in turn from the root node, and so on, leaving only themselves at last. The previous binary tree is split as shown in the previous figure. For black nodes, the outside has split four subtrees that do not contain their own.

K-bucket mechanism

After the binary tree is split, each split subtree actually corresponds to a bucket. The distance of each bucket to the current node is different, and the farther the distance is, the higher the distance is, so the greater the XOR result gap (the farther the distance):

K-bucket distance interval 0 [2 ^ 0,2 ^ 1) 1 [2 ^ 1,2 ^ 2) 2 [2 ^ 2,2 ^ 3) 3 [2 ^ 3, 2 ^ 4) 4 [2 ^ 4,2 ^ 5)

So in fact, after each node is split, you only need to store a copy of the bucket in each bucket to traverse the entire binary tree (cluster). Of course, in order to be fault-tolerant, generally speaking, there are several nodes in each bucket, not just one.

Node query

With a general understanding of the principle, we look back at how the node is located for each request.

The first request is to enter a server in the cluster. Then we calculate the distance between the id of the destination server with the request and the id of the current server. A value is then calculated, and the corresponding bucket (that is, the bucket corresponding to this distance range) is found from the server's bucket list. Our target server can be locked in the scope of that bucket, and then find the K service nodes closest to that node in the bucket (this parameter can be resized by yourself), and redirect the request to these nodes. Then repeat the above steps, and if there is really a target node in the cluster, you can return successfully.

The basic mechanism is like this, of course, in the actual environment, we take into account the reality of the request will do timeout processing, to avoid a large number of queries between nodes causing unnecessary load.

Node change

A new node wants to join the network, first of all, there is a prerequisite: it needs to have the information of a node in the network before it can start the joining process.

Join the process:

The new node A starts with the existing node T, adds it to its own K-bucket, and generates its own node id

Node An item Node B makes a request, using its own id as a parameter to request the node to locate its location.

Then there is the process of querying nodes. Each passing node will find the nearest node stored in its own node, and then A will put these nodes into its own bucket to improve its routing table. At the same time, these routing nodes will also put node An into their own routing table for later use.

When most of the nodes return, the routing table of A has been established, and some nodes have added A node to their own routing table. At this point, node A joined the network successfully.

Parameters of the algorithm

Some of the parameters we use in the algorithm can actually be defined by ourselves:

The k in k-bucket defines that each layer of bucket stores k node information.

Each request sends a message to s nodes

The length of id can also be customized. Note that the length of id determines the height of the binary tree.

At this point, the study on "how to understand the hash algorithm in distributed systems" is over. I hope to be able to solve everyone's doubts. The collocation of theory and practice can better help you learn, go and try it! If you want to continue to learn more related knowledge, please continue to follow the website, the editor will continue to work hard to bring you more practical articles!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.