2025-03-26 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article shows how to implement a Bloom filter in Redis in a concise, easy-to-follow way. I hope the detailed introduction below proves useful.
A brief introduction to Bloom filter
A Bloom filter (BloomFilter) consists of a very long binary vector and a series of random mapping functions; it can be thought of as an imprecise set structure. It is a probabilistic data structure with efficient insertion and lookup, and it can tell you that something "definitely does not exist or may exist." Compared with traditional data structures such as List, Set, and Map, it is faster and takes up less space. The drawback is that its results are probabilistic, but this need not be a concern: as long as the parameters are set sensibly, the accuracy can be kept high enough.
Applicable scenarios:
Membership tests on very large data sets, such as filtering out videos a user has already seen when browsing Douyin.
Preventing cache penetration: if requests keep asking for data that does not exist, they bypass the cache and hit the database directly every time. A Bloom filter can reject such requests before they reach the database.
Deduplicating URLs so that a web crawler does not fetch the same content twice, and similar cases.
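The cache scenario above can be sketched in Python as follows. This is only an illustration under assumptions: `fetch_with_filter`, `might_exist`, `cache`, and `load_from_db` are hypothetical names, and `might_exist` stands in for whatever Bloom filter lookup is actually used.

```python
from typing import Callable, Optional


def fetch_with_filter(
    key: str,
    might_exist: Callable[[str], bool],  # hypothetical Bloom filter lookup
    cache: dict,
    load_from_db: Callable[[str], Optional[str]],  # hypothetical database read
) -> Optional[str]:
    """Look up key, skipping cache and database when the filter rules it out."""
    # A negative answer from a Bloom filter is definitive: the key was never added.
    if not might_exist(key):
        return None
    if key in cache:
        return cache[key]
    # "May exist" can be a false positive, so the database may still return None.
    value = load_from_db(key)
    if value is not None:
        cache[key] = value
    return value
```

Requests for keys the filter has never seen return immediately, so repeated lookups of non-existent data no longer reach the database.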
Basic use of Bloom filter
The Bloom filter has two basic commands: bf.add adds an element, and bf.exists checks whether an element exists. They are used much like sadd and sismember on a set. Note that bf.add adds only one element at a time; to add several elements at once, use bf.madd. Likewise, to check several elements at once, use bf.mexists.
> bf.add user user1
(integer) 1
> bf.add user user2
(integer) 1
> bf.add user user3
(integer) 1
> bf.exists user user1
(integer) 1
> bf.exists user user4
(integer) 0
> bf.madd user user4 user5 user6
1) (integer) 1
2) (integer) 1
3) (integer) 1
> bf.mexists user user4 user5 user6 user7
1) (integer) 1
2) (integer) 1
3) (integer) 1
4) (integer) 0
The Bloom filter used above has the default parameters; it is created automatically the first time we call add. Redis also supports Bloom filters with custom parameters, which must be created explicitly with bf.reserve before the first add. If the key already exists, bf.reserve reports an error. bf.reserve takes three parameters: key, error_rate (the error rate), and initial_size:
The lower the error_rate, the more space is needed
initial_size is the number of elements you expect to put in. Once the actual number exceeds this value, the false positive rate rises, so set a sufficiently large value in advance.
If bf.reserve is not used, the default error_rate is 0.01 and the default initial_size is 100.
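The same commands can be sent from Python. The sketch below assumes a client object that exposes `execute_command(*args)`, as redis-py's `redis.Redis` does, and a Redis server with the Bloom filter commands available. The helper names are hypothetical.

```python
def create_filter(client, key, error_rate, initial_size):
    """Create a filter explicitly; Redis reports an error if key already exists."""
    return client.execute_command("BF.RESERVE", key, error_rate, initial_size)


def add_many(client, key, *items):
    """Add several elements in one round trip using bf.madd."""
    return client.execute_command("BF.MADD", key, *items)


def exists_many(client, key, *items):
    """Check several elements in one round trip using bf.mexists."""
    return client.execute_command("BF.MEXISTS", key, *items)
```

With redis-py installed, `client = redis.Redis()` followed by `create_filter(client, "user", 0.001, 100000)` would reserve a filter with a 0.1% error rate for an expected 100,000 elements. Those values are examples only.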
The realization principle of Bloom filter
Add operation
In Redis, each Bloom filter is backed by a large bit array and several different unbiased hash functions. "Unbiased" means the hash values of elements are distributed fairly evenly. When a key is added to the Bloom filter, each hash function hashes the key to an integer, which is then taken modulo the length of the bit array to get a position; each hash function yields a different position. Setting all of these positions in the bit array to 1 completes the add operation.
Exists operation
The exists operation, like add, computes all the hash positions and checks whether every one of those positions in the bit array is 1. If even one bit is 0, the key is definitely not in the Bloom filter. If they are all 1, the key is not guaranteed to exist; it is only very likely to exist, because those bits may have been set to 1 by other keys.
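The add and exists operations described above can be sketched in pure Python. This is a minimal illustration, not how Redis implements it: the class name is hypothetical, and the several hash functions are simulated by salting SHA-256 with an index, which is an assumption made here for simplicity.

```python
import hashlib


class SimpleBloomFilter:
    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, key: str):
        # Simulate k hash functions by prefixing the key with the function index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            # Hash to an integer, then take it modulo the bit array length.
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        # Set every computed position to 1.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def exists(self, key: str) -> bool:
        # A single 0 bit proves absence; all 1s only means "may exist".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )
```

Every key that has been added always reports True; a key that was never added usually reports False but can occasionally report True.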
If the bit array is sparse, a positive answer is very likely to be correct; if the bit array is crowded, that probability drops. In practice, do not let the actual number of elements grow far beyond the initial size. Once it starts to exceed the initial size, rebuild the Bloom filter: allocate a new filter with a larger size and batch add all the historical elements (which requires keeping a record of all historical elements in other storage). Because the error rate does not jump sharply the moment the count is exceeded, there is some leeway in when the rebuild happens.
The formula for calculating the space occupancy is:
k = 0.7 * (l / n)  # approximately equal
f = 0.6185 ^ (l / n)  # ^ denotes exponentiation, that is, math.pow
The Bloom filter has two input parameters: the expected number of elements n and the error rate f. From these two inputs the formula gives two outputs. The first is the length of the bit array l, which is the storage space required (in bits); the second is the optimal number of hash functions k. The number of hash functions directly affects the error rate, and the optimal number gives the lowest error rate.
It can be seen from the formula:
1. The longer the bit array relative to n, the lower the error rate f, which matches intuition.
2. The longer the bit array relative to n, the larger the optimal number of hash functions k, which affects computational efficiency.
3. When each element gets an average of 1 byte (8 bit) of fingerprint space (l/n = 8), the error rate is about 2%.
Error rate    Average space required per element    Rounded up
10%           4.792 bit                             5 bit
1%            9.585 bit                             10 bit
0.1%          14.377 bit                            15 bit
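The figures in the table can be reproduced by inverting f = 0.6185 ^ (l / n) to get l / n = ln(f) / ln(0.6185). The following is a small sketch; the function names are hypothetical.

```python
import math


def bits_per_element(error_rate: float) -> float:
    """Solve f = 0.6185 ^ (l / n) for the ratio l / n."""
    return math.log(error_rate) / math.log(0.6185)


def optimal_hash_count(error_rate: float) -> float:
    """Approximate optimal number of hash functions, k = 0.7 * (l / n)."""
    return 0.7 * bits_per_element(error_rate)


def error_rate_for(bits: float) -> float:
    """Error rate when each element gets the given number of bits."""
    return math.pow(0.6185, bits)


for rate in (0.1, 0.01, 0.001):
    ratio = bits_per_element(rate)
    print(f"{rate:.1%}  {ratio:.3f} bit  {math.ceil(ratio)} bit")
```

Multiplying the per-element figure by the expected number of elements n gives the total length l of the bit array.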
Some people may ask how the error rate changes when the actual number of elements exceeds the expected number. Will the error rate become very high? We introduce a formula:
f = (1 - 0.5 ^ t) ^ k
# limit approximation; k is the optimal number of hash functions
# t is the ratio of the actual number of elements to the expected number
When the configured error rate is 10% and the ratio t is 2, the actual error rate rises to nearly 40%.
When the configured error rate is 1% and the ratio t is 2, the actual error rate rises to about 15%.
When the configured error rate is 0.1% and the ratio t is 2, the actual error rate rises to about 5%.
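These three figures can be checked with a short calculation that combines f = (1 - 0.5 ^ t) ^ k with k = 0.7 * (l / n). This is a sketch, and the function name is hypothetical.

```python
import math


def overfilled_error_rate(configured_error_rate: float, t: float) -> float:
    """Approximate error rate once the filter holds t times the expected elements."""
    # Optimal hash count for the configured rate: k = 0.7 * (l / n).
    k = 0.7 * math.log(configured_error_rate) / math.log(0.6185)
    # Limit approximation: f = (1 - 0.5 ^ t) ^ k.
    return math.pow(1 - math.pow(0.5, t), k)


for rate in (0.1, 0.01, 0.001):
    print(f"{rate:.1%} at t=2 -> {overfilled_error_rate(rate, 2):.1%}")
```

The same function can be evaluated at other ratios, for example t = 1.5, to decide how far a filter may overfill before it should be rebuilt.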
The above is how to implement a Bloom filter in Redis.
© 2024 shulou.com SLNews company. All rights reserved.