How to realize the Bloom filter in Redis 07/09 Update SLTechnology News&Howtos

How to realize the Bloom filter in Redis

2025-07-09 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

This article mainly introduces "how to realize the Bloom filter in Redis". In the daily operation, I believe that many people have doubts about how to realize the Bloom filter in Redis. The editor consulted all kinds of data and sorted out a simple and easy-to-use operation method. I hope it will be helpful to answer the doubt of "how to realize the Bloom filter in Redis". Next, please follow the editor to study!

Overview

Bloom Filter filter is a data structure proposed by Burton Howard Bloom in 1970. It is actually a very long binary vector and a series of random mapping functions. [related recommendation: Redis video tutorial]

Bloom filters can be used to efficiently retrieve whether an element is in a collection. Its advantage is that the space efficiency and query time are far better than the general algorithm, but the disadvantage is that it has a certain error recognition rate, and it is difficult to delete (generally not supported, need additional implementation).

The false recognition rate means that you can judge that the element is definitely not in the set, the judging element may be in the set, and it is impossible to judge that the element must be in the set.

The Bloom filter is efficient because it is a probabilistic data structure that confirms that the element must not be in the collection, or that the element may be in the collection. It is possible because it has a certain error recognition rate, making it impossible to be 100% sure that the element must be in the collection.

The question leads to

A common requirement in daily work is to determine whether an element is in the collection. For example, the following scenarios

Given an IP blacklist library, check whether the specified IP is in the blacklist.

When receiving mail, determine whether an email address is spam?

In the word processing software, check whether an English word is spelled correctly?

When faced with this problem, intuition usually tells us that we should use the data structure of collection. For example, first store all the IP of the IP blacklist library in a collection, and then take the specified IP to the collection to check whether it exists. If so, the IP hits the blacklist.

Use a piece of code to simulate the storage and inspection of the IP blacklist library.

Public class IPBlackList {public static void main (String [] args) {Set set = new HashSet (); set.add ("192.168.1.1"); set.add ("192.168.1.2"); set.add ("192.168.1.4") System.out.println (set.contains ("192.168.1.1")); System.out.println (set.contains ("192.168.1.2")); System.out.println (set.contains ("192.168.1.3")); System.out.println (set.contains ("192.168.1.4"));}}

The interior of the collection is usually implemented using a hash table. Its advantage is that the query is very efficient, and the disadvantage is that it consumes more storage space.

Generally, when the amount of data is relatively small, we will use collections for storage. Exchanging space for time not only takes up less space, but also improves the query efficiency.

However, when the amount of data stored is relatively large, consuming a lot of space will be a problem. Because this data is usually stored in process memory to speed up the query efficiency. The memory of the machine is usually limited and should be used as efficiently as possible.

On the other hand, hash tables need to be balanced in terms of space and efficiency. Storing the same number of elements, the smaller the hash table capacity, the higher the probability of conflicts, and the more time it takes to resolve conflicts, which affects performance.

The generation of Bloom filter (Bloom Filter) can solve this problem very well. On the one hand, it can store data with less memory, on the other hand, it can achieve very efficient query performance.

Basic principles

First, create a binary vector and set all bits to 0

Then, K hash functions are selected to hash the element K times to calculate the bit subscript of the vector.

Add element

When adding an element to the collection, K hash functions act on the element respectively, generate K values as subscript, and set the corresponding bit of the vector to 1.

Check element

If you want to check whether an element exists in the collection, use the same hash method to generate K subscript and check that the corresponding bits of the vector are all 1.

If all 1, the element is likely to be in the collection; otherwise (as long as one or more bits are 0), the element is definitely not in the collection.

Demo

Suppose you have a Bloom filter with a capacity of 15 bits and two hash functions, as shown below.

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |-| | 0 |

Add a string aPol 2 hashes to get the subscript 4 and 10, and mark the bits of 4 and 10 from 0 to 1.

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |-| 1 | 1 | |

Add a string bBME 2 hashes to get the subscript 11 and mark the corresponding bit of 11 from 0 to 1.

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |-| 1 | 1 | 1 |

Add a string cPol 2 hashes to get the subscript 11 and 12, and mark the corresponding bits from 0 to 1.

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |-| 1 | 1 | 1 | 1 |

Finally, add a string sam,2 secondary hash to get the subscript 0 and 7, and mark the corresponding bits from 0 to 1.

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 11 | 12 | 13 | 14 | 15 | | 1 | 1 | | 1 | | 1 | 1 | 1 | |

Above, we added four strings, each with two hashes, the corresponding two bits marked as 1, and the total number of bits marked as 1 is 6 instead of 8.

This shows that the positions obtained after hashing of different elements may overlap. The more elements, the higher the probability of overlap. If two elements overlap, we cannot tell that either element must be in the collection.

If you want to check whether the element exists in the collection, you only need to hash it twice in the same way, and the resulting two subscripts are found in the corresponding bits in the Bloom filter. If the corresponding 2 bits are not all 1, the element is definitely not in the collection. If the corresponding two bits are all 1, the element may or may not exist in the collection.

For example, check whether the string b exists in the collection, and the two subscripts obtained by the hash are both 11. It is found that the corresponding bit of 11 is 1. However, this does not mean that b must be in the collection. This is because the subscript after the hash of the string c also contains 11, it is possible that the string c is in the collection, but b does not exist, which results in false recognition, also known as false positive.

Then check the string foo, and the subscript of the hash is 8 and 13, respectively, and the corresponding bits are 0. Therefore, the string foo is definitely not in the collection.

Mathematical principle

The mathematical principle behind the Bloom filter is

The probability of collision between two completely random numbers is very small, so a large amount of information can be stored with very little space under the condition of very small misrecognition rate.

False recognition rate

The error recognition rate (FPP,false positive probabilistic) is calculated as follows.

Assuming the Bloom filter size is m bits, n elements are stored, and the k-th hash function is used to calculate the storage location of the elements.

If one element is added, the probability that any bit is 1 is 1, and the probability that any bit is 0 is 1-1.

After adding 1 element and performing k hashes, the probability of any bit being 0 is (1-1) ^ k, and the probability of any bit being 1 is 1-(1-1 image) ^ k.

If n elements are added, the probability of any bit being 0 is (1-1) ^ kn, and the probability of any bit being 1 is 1-(1-1) ^ kn.

If you add a new element to a Bloom filter that already has n elements, the probability that any bit is already 1 is the same as above, with a probability of 1-(1-1 image) ^ kn. Therefore, the probability that all k bits are 1 is (1-(1-1) kn) ^ k, which is the error recognition rate of the newly inserted element.

When n is relatively large, $(1-(1-1 kn/m) ^ {kn}) ^ k $is approximately equal to $(1-e ^ {- color}) ^ k $.

Assuming that the Bloom filter size m is 16 bits, k is 8, and the storage element n is 1, the probability of false recognition (false positive) is 5/10000 (510 000).

At this point, m/n=16, in fact, this means that an element uses 16 bits of space to store.

Therefore, when k is the same, mplan has the same error recognition rate when it is equal to 16,32pm and 64pm.

The website Bloom Filters-the math gives the error recognition rate table of the Bloom filter, which can easily find and deal with the error recognition rate under different values given by mjournal nforme k.

Solve the misrecognition rate

Common methods to solve the misrecognition rate include

White list

Retrospective confirmation

White list

A common way to solve the misrecognition rate is to create a small whitelist to store data that may be misidentified.

Take spam filtering as an example. Suppose we have a junk mail library to filter out spam when receiving mail.

At this time, you can first store the spam inventory in the Bloom filter, and when you receive the mail, you can filter out most of the normal mail efficiently through the Bloom filter.

For a small number of those who hit (may be) spam, some of them may be normal mail.

Then create a whitelist library, and when it is determined in the Bloom filter that it may be spam, query the whitelist to confirm whether it is a normal email.

Messages that are not on the whitelist will be moved to the dustbin by default. By means of manual identification, when a normal message is found in the dustbin, it is moved to the whitelist.

Origin-pull confirmation

In many cases, Bloom filters are used to intercept scenarios where a large amount of data is not in the collection at low cost and high efficiency. For example

Google Bigtable,Apache HBase and Apache Cassandra and PostgreSQL use Bloom filters to reduce disk lookups for rows or columns that do not exist. Avoiding expensive disk lookups can greatly improve the performance of database query operations.

The Google browser is used in web browsers that use Bloom filters to identify malicious URL. All URL is checked against the local Bloom filter first, and a full check of the executed URL is performed only if the Bloom filter returns a positive result (if the result also returns a positive result, the user issues a warning).

Block a large number of non-IP blacklist requests, and query the blacklist database again for a small number of IP that may be in the blacklist.

This is a very typical application scenario of the Bloom filter, which filters out most of the requests and then processes only a small number of ambiguous requests.

The difference between this method and the white list database is that there is no need to create another set of libraries to deal with, but to use pre-existing data and logic.

For example, Google Bigtable queries data rows that need to be checked, but uses a Bloom filter to intercept most unnecessary requests. Whether IP is blacklisted also needs to be queried, and the Bloom filter is also used to block most of the IP.

The above spam treatment, for possible spam cases, is not confirmed by re-query through the complete spam database, but by adding whitelist to judge. Because in general, white-name single libraries are smaller and easier to cache.

The origin-pull mentioned here is actually a request that may be mistakenly identified, and finally go back to the data source or logically confirm it.

Implementation of Bloom Container in Guava

An implementation of Bloom Filter is provided in Guava.

Introduction of guava package

To use Bloom Filter, you need to introduce the guava package

Com.google.guava guava 23.0 error rate test

The misjudgment rate of the Bloom container is tested in two steps.

Put a million into the filter, and then verify whether the one million can pass the filter.

Find another 10,000 to check the number of missing fish.

/ * Test Bloom filter (available for redis cache traversal) * * @ author APC * / public class TestBloomFilter {private static int total = 1000000; private static BloomFilter bf = BloomFilter.create (Funnels.integerFunnel (), total); / / private static BloomFilter bf = BloomFilter.create (Funnels.integerFunnel (), total, 0.001) Public static void main (String [] args) {/ / initializes 1000000 pieces of data to the filter for (int I = 0; I < total; iTunes +) {bf.put (I);} / / matches the values already in the filter, and whether there is a mismatch of for (int I = 0; I < total) ITunes +) {if (! bf.mightContain (I)) {System.out.println ("some bad guys escaped ~");}} / / matches 10000 values that are not in the filter, how many match int count = 0; for (int I = total; I < total + 10000) ) {if (bf.mightContain (I)) {count++;}} System.out.println ("number of accidental injuries:" + count);}}

Running result

Number of accidental injuries: 320

The running results show that when traversing the one million numbers in the filter, they are all identified. Of the ten thousand not in the filter, 320 were accidentally injured, with an error rate of about 0.03.

BloomFilter source code analysis public static BloomFilter create (Funnel

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.