How to understand the implementation principle of Bloom filter algorithm


This article mainly explains how to understand the implementation principle of the Bloom filter algorithm. The explanation is kept simple and clear so that it is easy to learn and understand. Let's study the implementation principle of the Bloom filter algorithm together!

Some of the concepts of the Bloom filter include:

Brief introduction

Algorithm description

Related parameters

Advantages and disadvantages

Brief introduction to the Bloom filter

The Bloom filter is "a space-efficient probabilistic data structure" (quoting the encyclopedia entry), proposed by Burton Howard Bloom in 1970, "to test whether an element is a member of a set". A Bloom filter query may return a false positive (this is a proper term that can be understood as a misjudgment; the English term is kept in the rest of this article). In other words, when queried, the Bloom filter returns either "possibly in the set" or "definitely not in the set".

Description of Bloom filter algorithm

In a complex web crawler scenario, the URLs of the crawled pages may reference each other in cycles. For example, page URL-1 links to URL-2, and page URL-2 links back to URL-1. In this case, a solution is needed to record the URLs that have already been visited and to check new URLs against that record. You might think of the following options:

Option 1: use a database to store the URLs that have been visited, for example by creating a unique index on the URL column of a MySQL table or by using the Redis SET data type

Option 2: use a HashSet (in fact, not limited to HashSet; linked lists, trees, hash tables and other data structures would also work) to store the URLs that have been visited

Option 3: an optimization of options 1 and 2: store a digest of each URL rather than the URL itself, generating the digest from the URL string with an algorithm such as MD5 or SHA-n

Option 4: process each URL with a hash function to generate a hash code, and then map the hash code through a mapping function to a bit in a fixed-capacity BitSet

For options 1, 2 and 3, a huge amount of storage space (disk or memory) is consumed once a large number of URLs has been visited. For option 4, if there are 10 billion URLs and the collision probability is to be kept down to about 1%, the capacity of the BitSet needs to be set to roughly 1 trillion bits.
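As a rough sketch of the idea behind option 4 (the class name SingleHashUrlSet, the example URLs and the use of String.hashCode() are illustrative assumptions, not part of the original article):

import java.util.BitSet;

// minimal sketch of option 4: a single hash function mapped onto a fixed-capacity BitSet
public class SingleHashUrlSet {

    private final int capacity;
    private final BitSet bits;

    public SingleHashUrlSet(int capacity) {
        this.capacity = capacity;
        this.bits = new BitSet(capacity);
    }

    private int index(String url) {
        // map the hash code into [0, capacity); any collision here is unrecoverable,
        // which is exactly the weakness that the Bloom filter's k hash functions mitigate
        return Math.floorMod(url.hashCode(), capacity);
    }

    public void markVisited(String url) {
        bits.set(index(url));
    }

    public boolean possiblyVisited(String url) {
        return bits.get(index(url));
    }

    public static void main(String[] args) {
        SingleHashUrlSet visited = new SingleHashUrlSet(1 << 20);
        visited.markVisited("https://example.com/page-1");
        System.out.println(visited.possiblyVisited("https://example.com/page-1")); // true
        System.out.println(visited.possiblyVisited("https://example.com/page-2")); // very likely false
    }
}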

All four options therefore have obvious shortcomings. The basic idea of the Bloom filter algorithm is similar to option 4; the biggest difference is that option 4 uses only one hash function, while the Bloom filter uses k (k >= 1) independent, efficient, low-collision hash functions.

An initialized Bloom filter is a bit array of length m with all bits set to 0, that is, the same concept as a Bit Array, a Bit Set, or the Bit Map in Redis. It also needs k different hash functions: when a new element is added, it is processed by the k hash functions and mapped to k of the m bits of the bit array, and those k hit bits are set to 1, ideally with a uniform random distribution. In general, k is a small constant that depends on the required misjudgment rate, while the Bloom filter capacity m is positively correlated with the number of hash functions k and the number of elements to be added.

After all the elements that need to be added to the Bloom filter have been added, many bits of the bit array are set to 1. To judge whether an element exists in the Bloom filter, you only need to compute the k array indices with the k hash functions and then check whether the bits at those indices are 1. If at least one of the k bits is 0, the element is definitely not in the set represented by the Bloom filter. If all k bits are 1, the element may exist in the set, or this may happen to be a False Positive. For the error rate analysis, see the "Related parameters of the Bloom filter" section below. A False Positive arises in the following situation:

When the number of elements added to the Bloom filter is relatively large and the capacity of the Bloom filter is set unreasonably (too small), it is easy for several different elements to be mapped by the k hash functions onto the same k bits (for example, the bits at indices 1, 3 and 9), and it then becomes impossible to tell which specific element set those k bits. You can also consider an extreme case: if the Bloom filter capacity is 24 and there is only one hash function, then after adding at most 25 different elements, at least two different elements must have been mapped to the same bit.
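To make the False Positive scenario concrete, here is a minimal sketch in which the bit indices are hypothetical hash outputs chosen purely for illustration (they are not produced by any real hash function): two added elements together set a group of bits, and a third element that was never added happens to map only onto bits that are already set, so the query wrongly reports it as present.

import java.util.BitSet;

public class FalsePositiveDemo {

    public static void main(String[] args) {
        // m = 16 bits, k = 3 hash functions; the index arrays below are hypothetical hash outputs
        BitSet bits = new BitSet(16);

        // adding element A: assume its 3 hash functions map it to bits 1, 3 and 9
        int[] a = {1, 3, 9};
        for (int i : a) bits.set(i);

        // adding element B: assume its 3 hash functions map it to bits 4, 9 and 12
        int[] b = {4, 9, 12};
        for (int i : b) bits.set(i);

        // querying element C, which was never added: assume its hash functions map it to bits 1, 4 and 12
        int[] c = {1, 4, 12};
        boolean mayContain = true;
        for (int i : c) mayContain &= bits.get(i);

        // all three bits were already set by A and B, so C is reported as "may exist": a False Positive
        System.out.println(mayContain); // true
    }
}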

Related parameters of Bloom filter

As mentioned in the algorithm description section, the Bloom filter has the following main parameters:

Initial bit array capacity m

Number of hash functions k

Misjudgment rate ε (the Greek letter epsilon, standing for the False Positive Rate)

Number of elements to be added to the Bloom filter n

For reasons of length, the derivation of the relationships between these values is skipped here; only the results are summarized, and a short worked example follows them.

The estimated value of the error rate ε is: [1 - e^(-kn/m)]^k

The estimated optimal number of hash functions k: for given m and n, the error rate is lowest when k = (m / n) * ln2.

The required initial bit array capacity m: when k = (m / n) * ln2, m >= n * log2(e) * log2(1 / ε).
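As a quick worked example of these relations (the values n = 10^8 and ε = 0.01 are chosen purely for illustration):

m >= 10^8 * log2(e) * log2(1 / 0.01) ≈ 10^8 * 1.44 * 6.64 ≈ 9.6 * 10^8 bits ≈ 114 MB

k ≈ (m / n) * ln2 ≈ 9.6 * 0.693 ≈ 7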

The reference materials include a table of the relationship between m/n, k and the False Positive Rate.

From the largest parameters in that table we can calculate the required space: assuming n is 1 billion and m/n = 32, then m is 32 billion bits and k is 24, the misjudgment rate is 2.17e-07 (0.000000217), and the space required is 3814.69727 MB (roughly 3.7 GB). The general rules are:

When k is fixed, the larger m/n is, the smaller the misjudgment rate.

When m/n is fixed, increasing k beyond the optimal value increases the misjudgment rate.

In practice k is usually fixed, but the exact value of n cannot always be determined in advance. It is best to estimate the growth trend and calculate a relatively large value of m ahead of time to reduce the misjudgment rate; of course, this has to be weighed against the extra space consumed by an overly large m.

Now that all the relationships of the parameters have been derived, you can write a parameter generator based on the relationship:

import java.math.BigDecimal;
import java.math.RoundingMode;

public class BloomFilterParamGenerator {

    public BigDecimal falsePositiveRate(int m, int n, int k) {
        double temp = Math.pow(1 - Math.exp(-1.0 * k * n / m), k);
        return BigDecimal.valueOf(temp).setScale(10, RoundingMode.FLOOR);
    }

    public BigDecimal kForMinFalsePositiveRate(int m, int n) {
        BigDecimal k = BigDecimal.valueOf((double) m / n * Math.log(2));
        return k.setScale(10, RoundingMode.FLOOR);
    }

    public BigDecimal bestM(int n, double falsePositiveRate) {
        double temp = log2(Math.exp(1)) * log2(1 / falsePositiveRate);
        return BigDecimal.valueOf(n).multiply(BigDecimal.valueOf(temp)).setScale(10, RoundingMode.FLOOR);
    }

    public double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    public static void main(String[] args) {
        BloomFilterParamGenerator generator = new BloomFilterParamGenerator();
        System.out.println(generator.falsePositiveRate(2, 1, 2));      // 0.3995764008
        System.out.println(generator.kForMinFalsePositiveRate(32, 1)); // 22.1807097779
        System.out.println(generator.bestM(1, 0.3995764009));          // ≈ 1.9093
    }
}

The calculations here do not handle rounding and truncation strictly, so there may be some deviation from the exact results; the class is only meant as a reference example.

Advantages and disadvantages of Bloom filter

Advantages of Bloom filter:

Compared with other data structures, the Bloom filter has great advantages in both time and space: it occupies little memory, and the time complexity of both query and insertion is O(k).

It can judge with certainty that an element does not exist in the Bloom filter.

Hash functions can be designed independently

The Bloom filter does not need to store the elements themselves, so it is suitable for scenarios where the data is sensitive and must be kept strictly confidential.

Disadvantages of Bloom filter:

It cannot judge with certainty that an element exists in the Bloom filter; there is a misjudgment rate, and when k and m are fixed, the more elements are added, the higher the misjudgment rate becomes.

It does not store the elements themselves, so it is not suitable for scenarios that require exact queries or exact statistics.

Native Bloom filters cannot safely delete elements

Here's a simple question for the reader: why can't native Bloom filters safely delete elements? (you can check the previous introduction to False Positive)

Implementation of Bloom filter algorithm

The well-known Java utility library Guava ships with a Bloom filter implementation that is still marked as beta. Here, drawing on its source code and the algorithm description above, we implement a Bloom filter ourselves. Consider the hash functions first; a simple approach is to borrow the design of a JavaBean hashCode() method:

// the following method comes from java.util.Arrays#hashCode
public static int hashCode(Object a[]) {
    if (a == null)
        return 0;

    int result = 1;

    for (Object element : a)
        result = 31 * result + (element == null ? 0 : element.hashCode());

    return result;
}

The 31 in the method above can be treated as an input seed. Each hash function is given its own independent seed, chosen as a prime number, and the result is accumulated over each char of the input string, so the results of different hash functions are relatively independent:

import java.util.Objects;

public class HashFunction {

    /**
     * Bloom filter capacity
     */
    private final int m;

    /**
     * seed
     */
    private final int seed;

    public HashFunction(int m, int seed) {
        this.m = m;
        this.seed = seed;
    }

    public int hash(String element) {
        if (Objects.isNull(element)) {
            return 0;
        }
        int result = 1;
        int len = element.length();
        for (int i = 0; i < len; i++) {
            result = seed * result + element.charAt(i);
        }
        // mask so that the calculated index stays within [0, m); this works cleanly when m is a power of two
        return (m - 1) & result;
    }
}

Then implement the Bloom filter:

import java.util.BitSet;
import java.util.Objects;

public class BloomFilter {

    private static final int[] K_SEED_ARRAY = {5, 7, 11, 13, 31, 37, 61, 67};
    private static final int MAX_K = K_SEED_ARRAY.length;

    private final int m;
    private final int k;
    private final BitSet bitSet;
    private final HashFunction[] hashFunctions;

    public BloomFilter(int m, int k) {
        this.k = k;
        if (k <= 0 || k > MAX_K) {
            throw new IllegalArgumentException("k = " + k);
        }
        this.m = m;
        this.bitSet = new BitSet(m);
        hashFunctions = new HashFunction[k];
        for (int i = 0; i < k; i++) {
            hashFunctions[i] = new HashFunction(m, K_SEED_ARRAY[i]);
        }
    }

    public void addElement(String element) {
        for (HashFunction hashFunction : hashFunctions) {
            bitSet.set(hashFunction.hash(element), true);
        }
    }

    public boolean contains(String element) {
        if (Objects.isNull(element)) {
            return false;
        }
        boolean result = true;
        for (HashFunction hashFunction : hashFunctions) {
            result = result && bitSet.get(hashFunction.hash(element));
        }
        return result;
    }

    public int m() {
        return m;
    }

    public int k() {
        return k;
    }

    public static void main(String[] args) {
        BloomFilter bf = new BloomFilter(24, 3);
        bf.addElement("throwable");
        bf.addElement("throwx");
        System.out.println(bf.contains("throwable")); // true
    }
}

The hash algorithm and the limited range of k values here are not enough for complex scenarios; the point is only to show how a Bloom filter can be implemented, and the native Bloom filter algorithm is, generally speaking, fairly simple. For complex production scenarios you can use an off-the-shelf library such as the Bloom filter API in Guava, the Bloom filter module for Redis, or the Bloom filter API in Redisson (an advanced Redis client).

Application of Bloom filter

It mainly includes:

API in Guava

API in Redisson

Usage scenarios

Use the Bloom filter API in Guava

Introduce Guava dependencies:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>30.1-jre</version>
</dependency>

Use the Bloom filter:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

public class GuavaBloomFilter {

    @SuppressWarnings("UnstableApiUsage")
    public static void main(String[] args) {
        BloomFilter<String> bloomFilter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.US_ASCII), 10000, 0.0444D);
        bloomFilter.put("throwable");
        bloomFilter.put("throwx");
        System.out.println(bloomFilter.mightContain("throwable")); // true
        System.out.println(bloomFilter.mightContain("throwx"));    // true
    }
}

The static factory method of BloomFilter with the most parameters is BloomFilter.create(Funnel funnel, long expectedInsertions, double fpp, BloomFilter.Strategy strategy); its parameters are as follows:

funnel: converts data of any type into a hash code; it is a top-level interface with a number of built-in implementations, see Funnels

expectedInsertions: the number of elements expected to be inserted

fpp: presumably short for false positive probability, i.e. the misjudgment rate, written as a decimal rather than a percentage; the default value is 0.03

strategy: the mapping strategy; currently only MURMUR128_MITZ_32 and MURMUR128_MITZ_64 (the default strategy) are available

The parameters can be customized based on the actual scenario by referring to the above table or the guidance of the parameter generator.

Use the Bloom filter API in Redisson

The advanced Redis client Redisson provides an encapsulation based on the Redis bitmap data structure, hiding the complex implementation logic so that it can be used out of the box. Introduce the Redisson dependency:

<dependency>
    <groupId>org.redisson</groupId>
    <artifactId>redisson</artifactId>
    <version>3.15.1</version>
</dependency>

Use the Bloom filter API in Redisson:

import org.redisson.Redisson;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class RedissonBloomFilter {

    public static void main(String[] args) {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        RedissonClient redissonClient = Redisson.create(config);
        RBloomFilter<String> bloomFilter = redissonClient.getBloomFilter("ipBlockList");
        // the first parameter expectedInsertions is the expected number of inserted elements,
        // the second parameter falseProbability is the expected error rate, written as a decimal
        bloomFilter.tryInit(100000L, 0.03D);
        bloomFilter.add("127.0.0.1");
        bloomFilter.add("192.168.1.1");
        System.out.println(bloomFilter.contains("192.168.1.1")); // true
        System.out.println(bloomFilter.contains("192.168.1.2")); // false
    }
}

The Bloom filter interface RBloomFilter provided by Redisson is simple:

The common methods are tryInit() (initialization), add() (adding an element) and contains() (checking whether an element may exist). Compared with Guava's in-memory Bloom filter implementation, Redisson provides a "distributed Bloom filter" backed by Redis, which makes Bloom filters usable across a distributed cluster.
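As a short follow-up sketch (the class name RedissonBloomFilterParams is made up for this example): after tryInit(), Redisson derives the size of the underlying bit array and the number of hash functions from expectedInsertions and falseProbability. The accessors used below (getSize(), getHashIterations(), count()) are, to the best of my knowledge, part of the RBloomFilter interface in Redisson 3.x, but treat them as assumptions to verify against the version you actually use.

import org.redisson.Redisson;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class RedissonBloomFilterParams {

    public static void main(String[] args) {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        RedissonClient redissonClient = Redisson.create(config);

        RBloomFilter<String> bloomFilter = redissonClient.getBloomFilter("ipBlockList");
        bloomFilter.tryInit(100000L, 0.03D);

        System.out.println(bloomFilter.getSize());           // m: size of the underlying bit array
        System.out.println(bloomFilter.getHashIterations()); // k: number of hash functions
        System.out.println(bloomFilter.count());             // approximate number of added elements

        redissonClient.shutdown();
    }
}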

Bloom filter usage scenarios

In fact, the typical uses of the Bloom filter can be summed up by the diagram in the encyclopedia entry; some scenarios based on it are listed as follows:

URL deduplication in web crawler applications (a URL that is not in the Bloom filter is definitely a URL that has not been crawled yet)

IP blacklist checks in firewall applications (not limited to IP blacklists; general blacklist checks can use a Bloom filter, since an IP that is not in the Bloom filter is definitely not on the blacklist)

Avoiding cache penetration (a key that is not in the Bloom filter definitely does not exist in the cache or the backing store; a sketch of this guard follows the list)
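Here is a minimal sketch of the cache-penetration guard described in the last item, reusing Guava's BloomFilter introduced earlier; the class name CachePenetrationGuard and the HashMap stand-ins for the real cache and database are illustrative assumptions.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

@SuppressWarnings("UnstableApiUsage")
public class CachePenetrationGuard {

    // hypothetical stand-ins for a real cache and a real database
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> database = new HashMap<>();

    private final BloomFilter<String> keyFilter =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 100000, 0.01D);

    public void save(String key, String value) {
        database.put(key, value);
        keyFilter.put(key); // record every key that really exists
    }

    public String query(String key) {
        // a key the filter has never seen definitely does not exist,
        // so the request never reaches the cache or the database
        if (!keyFilter.mightContain(key)) {
            return null;
        }
        return cache.computeIfAbsent(key, database::get);
    }

    public static void main(String[] args) {
        CachePenetrationGuard guard = new CachePenetrationGuard();
        guard.save("user:1", "throwable");
        System.out.println(guard.query("user:1"));   // throwable
        System.out.println(guard.query("user:999")); // null
    }
}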

Bloom filter variants

There are many variants of Bloom filter, mainly to solve some defects or disadvantages in the Bloom filter algorithm. Common variants are as follows:

Counting Bloom Filter: replaces each bit of the native Bloom filter with a small counter (Counter); the so-called counter is just a small integer

Compressed Bloom Filter: compresses the bit array

Hierarchical Bloom Filters: layered, composed of multiple levels of Bloom filters

Spectral Bloom Filters: an extension of CBF that supports querying the occurrence frequency of set elements

Bloomier Filters: stores function values rather than just doing bit mapping

Time-Decaying Bloom Filters: replaces the bit vector with a counter array and optimizes the minimum space each counter needs to store its value

Variants listed without a description: Space Code Bloom Filter, Filter Banks, Scalable Bloom filters, Split Bloom Filters, Retouched Bloom filters, Generalized Bloom Filters, Distance-sensitive Bloom filters, Data Popularity Conscious Bloom Filters, Memory-optimized Bloom Filter, Weighted Bloom filter, Secure Bloom filters

Let's expand a little on the Counting Bloom Filter (CBF for short) variant. The basic data structure of the native Bloom filter is a bit vector. CBF extends this basic data structure: each element of the underlying array is a small (4-bit) counter that records how many times that slot has been hit when elements are added. When a new element is inserted, it is mapped by the k hash functions to k specific counters, and each of those counters is incremented by 1. When an element is deleted, it is mapped by the k hash functions to k specific counters, and each of those counters is decremented by 1. When using a CBF to determine whether an element is in the set:

If the element is mapped by the k hash functions to k specific counters and at least one of those counters has the value 0, then the element is definitely not in the set.

If the element is mapped by the k hash functions to k specific counters and all of those counters have values greater than 0, then the element may be in the set (a minimal sketch of these rules follows).
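Here is a minimal CBF sketch following these rules; it reuses the HashFunction class from the implementation section, and the class name CountingBloomFilter plus the use of a plain int[] in place of the 4-bit counters of the original proposal are simplifications for illustration only.

import java.util.Objects;

public class CountingBloomFilter {

    // one small counter per slot instead of one bit
    private final int[] counters;
    private final HashFunction[] hashFunctions;

    public CountingBloomFilter(int m, int[] seeds) {
        this.counters = new int[m];
        this.hashFunctions = new HashFunction[seeds.length];
        for (int i = 0; i < seeds.length; i++) {
            hashFunctions[i] = new HashFunction(m, seeds[i]);
        }
    }

    public void add(String element) {
        for (HashFunction f : hashFunctions) {
            counters[f.hash(element)]++;   // increment the k mapped counters
        }
    }

    public void remove(String element) {
        // only safe if the element was actually added before
        for (HashFunction f : hashFunctions) {
            int index = f.hash(element);
            if (counters[index] > 0) {
                counters[index]--;         // decrement instead of clearing a possibly shared bit
            }
        }
    }

    public boolean contains(String element) {
        if (Objects.isNull(element)) {
            return false;
        }
        for (HashFunction f : hashFunctions) {
            if (counters[f.hash(element)] == 0) {
                return false;              // any zero counter means "definitely not in the set"
            }
        }
        return true;                       // all counters > 0 means "may be in the set"
    }

    public static void main(String[] args) {
        CountingBloomFilter cbf = new CountingBloomFilter(64, new int[]{5, 7, 11});
        cbf.add("throwable");
        System.out.println(cbf.contains("throwable")); // true
        cbf.remove("throwable");
        System.out.println(cbf.contains("throwable")); // false, so deletion becomes possible
    }
}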

To summarize the basic behavior of the Bloom filter in one sentence: "if it says an element does not exist, the element definitely does not exist; if it says an element exists, the element does not necessarily exist."

When a Bloom filter is used to determine whether an element belongs to a set, there is a certain misjudgment rate: an element that does not belong to the set may be wrongly judged as belonging to it, which is called a False Positive; but an element that does belong to the set will never be judged as not belonging to it (the opposite case, wrongly judging an element that belongs to the set as not belonging to it, would be called a False Negative). The introduction of the False Positive rate, that is, the error or misjudgment rate, is the key trade-off against space efficiency in the design of the Bloom filter.

Thank you for reading. The above is the content of "how to understand the implementation principle of the Bloom filter algorithm". After studying this article, I believe you have a deeper understanding of the implementation principle of the Bloom filter algorithm; the specific usage still needs to be verified in practice.
