2025-04-05 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article analyzes the data distribution algorithms used in a Redis cluster, working up from the plain hash algorithm through consistent hashing and its variants to the virtual slot partitioning that Redis Cluster actually uses.
Hash algorithm
Hash algorithms are widely used in distributed architectures, not only for data storage but also for load balancing and other purposes. The idea is simple and resembles the hash function in HashMap: a hash function maps each piece of data to a number, and that number determines which server the data lives on. The hash function here is usually straightforward: take the value of some key (or the hash of the key) modulo the number of available master nodes, and the result identifies the server. The service structure of the hash algorithm is shown in the following figure:
Hash algorithm structure model diagram
Using the sample data assumed above, let's run the hash algorithm as an experiment to deepen our understanding of how it works in a distributed system. Assume the hash function is `id % number of master nodes`: data with result 0 is stored on server1, result 1 on server2, and result 2 on server3.
After hashing, the ids 3 and 6 taken modulo the number of master nodes both yield 0 (3 % 3 = 0, 6 % 3 = 0), so those two records are stored on server1. Likewise, the data with id=1 and id=4 goes to server2, and the data with id=2 and id=5 goes to server3. The resulting placement is shown below:
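The placement above can be sketched in a few lines. This is a minimal illustration of modulo-based sharding for the article's six records and three servers; the `route` helper is an assumed name, not anything Redis provides.

```python
# Modulo-based sharding sketch for the article's example:
# six records (id 1..6) spread over three servers.
servers = ["server1", "server2", "server3"]

def route(record_id: int, nodes: list) -> str:
    # id % node_count picks the shard: result 0 -> server1, 1 -> server2, 2 -> server3
    return nodes[record_id % len(nodes)]

placement = {i: route(i, servers) for i in range(1, 7)}
# id=3 and id=6 land on server1, id=1 and id=4 on server2, id=2 and id=5 on server3
```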
Server stores data
This is the hash algorithm's role in distribution. It is simple, and as long as the hash function is well designed, data spreads evenly across the servers. But the hash algorithm has a fatal drawback: very poor scalability. Suppose server3 in our cluster goes down, leaving only two usable machines; the hash function then becomes `id % 2`, so all data must be recalculated and moved to new storage nodes. Every time a server crashes or a machine is added, a large amount of data must migrate, which hurts the system's availability and stability.
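To see how bad the churn is, we can count how many keys change owner when the modulus drops from 3 to 2. This is a small sketch under the same `id % n` assumption as above:

```python
# How many keys must migrate when a 3-node modulo-sharded cluster shrinks to 2?
ids = range(10_000)

# A key stays put only if its shard index is the same under both moduli.
moved = sum(1 for i in ids if i % 3 != i % 2)
fraction = moved / 10_000
# Roughly two thirds of all keys change owner and must be migrated.
```

In every run of 6 consecutive ids, 4 of them map to a different shard index under `% 2` than under `% 3`, so about 66% of the data moves for the loss of a single node.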
Consistent hash algorithm
The consistent hash algorithm can be seen as an upgraded version of plain hashing that solves its poor scalability. Instead of mapping keys directly to servers, it maps both servers and data onto a hash ring (a circle joined end to end) through the hash function; a storage node is typically placed on the ring by hashing its IP address. Once a piece of data is mapped onto the ring, it searches clockwise: the first storage node found clockwise from the data's position on the ring is where it is stored.
Let's store our data with the consistent hashing algorithm. I drew a diagram to simulate one possible result:
Consistency algorithm simulates storage graph
Let's interpret this diagram. Following the rules of consistent hashing, each piece of data searches clockwise for a node: the data with id=4 lands on server1, the data with id=2 on server2, and the data with id=3, id=1, id=5 and id=6 all land on server3. A sharp eye will already spot the weakness: our six records are distributed unevenly rather than two per server, and the gap is significant. This is the main drawback of consistent hashing: it can produce uneven data distribution, also called data skew, and as in our figure an overloaded node may eventually go down. Two situations lead to uneven distribution:
First, the hash function itself: after hashing, the servers are not spread evenly around the ring, so the arcs between them differ in length, which skews the data.
Second, when a server goes down, its successor node must absorb the data that belonged to the failed machine, which also skews the data.
We said earlier that consistent hashing solves the scalability problem of plain hashing. How so? When a storage node joins or leaves, only the successor relationship at that point of the ring is affected. For example, if we add a storage node server4 between server3 and server2, only server3 is affected: some of the data originally stored on server3 falls to server4, while server1 and server2 are untouched. Mass data migration no longer occurs, so scalability is much better.
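The ring lookup described above can be sketched with a sorted list and binary search. This is a minimal illustration, not the implementation any particular system uses; the MD5-based `h` function and the `HashRing` class are assumptions for the sketch.

```python
import bisect
import hashlib

def h(key: str) -> int:
    # Illustrative 32-bit hash for placing keys and nodes on the ring
    # (an assumption for this sketch, not the hash Redis uses).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    """Minimal consistent hash ring: the first node clockwise stores the key."""

    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)
        self.points = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        # Find the first node point clockwise from the key's position,
        # wrapping around to the start of the ring if necessary.
        i = bisect.bisect_right(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]
```

The scalability claim is easy to check with this sketch: if you build a second ring with one extra node, every key whose owner changed must now belong to the new node; no key moves between the old nodes.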
Consistent hash algorithm with limited load
Because of consistent hashing's uneven data distribution, Google proposed consistent hashing with bounded loads in 2017 to solve this problem. The idea is simple: each storage node is given an upper limit on how much data it may store, which controls the imbalance caused by nodes joining or leaving. When a piece of data finds its node via consistent hashing, we check whether that node has reached its storage limit; if it has, we continue clockwise to the next node and try to store the data there.
Let's improve the storage above using consistent hashing with bounded loads. We cap each server at 2 records, and insert the data in ascending id order. As before, I drew a simulation diagram:
Consistent hash algorithm with limited load
Let's analyze this diagram together. Because we insert in ascending id order, the first four records are placed without trouble; no server exceeds its limit. The data with id=5 falls between server2 and server3 and should be stored on server3, but server3 already holds id=1 and id=3, so id=5 continues clockwise to the next server, server1. Server1 holds only id=4 and has not reached its limit, so id=5 is stored there. The data with id=6 likewise maps to server3, which is full, then reaches server1, which is now also full, so it continues on to server2 and is stored there. In this way, consistent hashing with bounded loads solves the uneven data distribution of plain consistent hashing.
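The "walk clockwise past full nodes" rule can be added to the ring sketch with a per-node counter. This is again an illustration under assumed names (`BoundedRing`, the MD5-based `h`), not a production implementation:

```python
import bisect
import hashlib

def h(key: str) -> int:
    # Illustrative hash for ring placement (an assumption for this sketch).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class BoundedRing:
    """Consistent hashing with bounded loads: skip nodes that are at capacity."""

    def __init__(self, nodes, capacity: int):
        self.ring = sorted((h(n), n) for n in nodes)
        self.points = [p for p, _ in self.ring]
        self.capacity = capacity
        self.load = {n: 0 for n in nodes}

    def insert(self, key: str) -> str:
        i = bisect.bisect_right(self.points, h(key)) % len(self.ring)
        # Walk clockwise until a node below its cap is found.
        for step in range(len(self.ring)):
            node = self.ring[(i + step) % len(self.ring)][1]
            if self.load[node] < self.capacity:
                self.load[node] += 1
                return node
        raise RuntimeError("every node is at capacity")
```

With three nodes capped at 2, inserting six keys always ends with exactly two keys per node, regardless of how skewed the raw ring positions are.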
Consistent Hash algorithm with Virtual Node
Consistent hashing with bounded loads still has a problem: the servers may differ in performance. If the cap is set low, a powerfully configured server is partly wasted. Such differences between servers are called heterogeneity, and to handle it, consistent hashing with virtual nodes was introduced. Its core idea is to assign each physical node a different number of virtual nodes according to its performance, map those virtual nodes onto the hash ring, and then map and store data by the usual consistent hashing rules.
To demonstrate consistent hashing with virtual nodes, assume server3 has the weakest configuration and use it as the baseline: server2 is twice as powerful as server3, and server1 three times as powerful. With that premise we can set up the virtual nodes: server3 gets one virtual node, server3_1; server2 gets two, server2_1 and server2_2; and server1 gets three, server1_1, server1_2 and server1_3. I drew a simulation as before:
Data that falls on a virtual node is stored on the corresponding physical server, so after consistent hashing with virtual nodes the result is: the data with id=2, id=3 and id=5 is stored on server1, the data with id=1 on server2, and the data with id=4 and id=6 on server3.
Virtual nodes let better-configured servers store more data, which solves the heterogeneity problem. At the same time, because there are many virtual nodes, migrating data spreads across different physical machines, which reduces the pressure on any single server during migration and keeps the system stable.
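Weighted virtual nodes are a small extension of the ring sketch: each physical node is hashed onto the ring several times, once per virtual node. The names here (`VirtualRing`, the `node#i` labeling scheme, the MD5-based `h`) are assumptions for illustration:

```python
import bisect
import hashlib

def h(key: str) -> int:
    # Illustrative hash for ring placement (an assumption for this sketch).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class VirtualRing:
    """Consistent hashing with weighted virtual nodes."""

    def __init__(self, weights):
        # weights: physical node -> virtual node count, e.g. the article's
        # 3:2:1 split for server1, server2 and server3.
        self.ring = sorted(
            (h(f"{node}#{i}"), node)      # each virtual node keeps a back-pointer
            for node, count in weights.items()
            for i in range(count)
        )
        self.points = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        i = bisect.bisect_right(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = VirtualRing({"server1": 3, "server2": 2, "server3": 1})
```

A lookup still returns a physical server; the virtual nodes exist only as extra points on the ring, so server1's three points claim, on average, three times the arc length of server3's single point.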
Virtual slot partition
Virtual slot partitioning is the default data distribution technique in Redis Cluster. It makes clever use of the hash space: a well-dispersed hash function maps every key into a fixed range of integers, defined as slots, and the number of slots is generally far greater than the number of nodes.
Redis Cluster has 16384 slots (numbered 0 to 16383), evenly allocated across the masters. When storing data, the CRC16 algorithm is used; the formula slot = CRC16(key) % 16384 calculates which slot a key belongs to. In our cluster environment, storing or looking up a key might proceed as shown in the following figure:
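The slot calculation is small enough to write out in full. The Redis Cluster specification uses the CRC16-CCITT (XMODEM) variant, implemented below bit by bit; the helper names are my own:

```python
def crc16(data: bytes) -> int:
    # CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0, no reflection.
    # This is the CRC16 variant named by the Redis Cluster specification.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: bytes) -> int:
    # slot = CRC16(key) % 16384, i.e. the low 14 bits of the checksum.
    return crc16(key) % 16384

slot = key_slot(b"user:1000")
```

One refinement is omitted here: when a key contains a hash tag such as `{user}:1000`, Redis hashes only the substring between the braces, so related keys can be forced into the same slot.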
Virtual slot partitioning decouples data from nodes. The slots it introduces become the basic unit of data management and migration in the cluster, which simplifies scaling up and down: you only care which slot a piece of data is in, not which node it is on. Virtual slot partitioning can therefore be said to reconcile even data distribution with scalability.
That concludes our study of the data distribution algorithms in a Redis cluster. Pairing the theory with practice will help it stick, so go and try it!