
How to use Java Bloom filter

2025-07-09 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains in detail how to use a Bloom filter in Java. The steps are laid out clearly, and I hope it helps resolve your questions.

What do you usually use to judge whether an element exists or not?

Many people think of HashMap.

It is true that you can store the values as HashMap keys and get an answer in O(1) time, which is extremely efficient. But HashMap also has disadvantages, chiefly its memory footprint: because of the load factor, the allocated space is never fully used, and once you have a very large number of values, say hundreds of millions, the memory a HashMap occupies becomes considerable.

I. Brief introduction to the Bloom filter

The Bloom filter was proposed by Burton Howard Bloom in 1970. It consists of a very long binary vector and a series of random mapping functions (hash functions). A Bloom filter can be used to test whether an element is in a collection.

If you want to determine whether an element is in a collection, the general idea is to save all the elements in the collection and then decide by comparison. Linked lists, trees, and hash tables all work this way. But as the number of elements in the collection grows, we need more and more storage space, and retrieval also gets slower. The retrieval time complexity of these three structures is O(n), O(log n), and O(1), respectively.

The principle of the Bloom filter is that when an element is added to the set, K hash functions map the element to K points in a bit array, and those points are set to 1. When searching, we only need to check whether these points are all 1 to know whether the element may be in the collection: if any of these points is 0, the element is definitely not there; if they are all 1, the element is probably there. This is the basic idea of the Bloom filter. (Quoted from Wikipedia, the free encyclopedia.)
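To make the principle concrete, here is a minimal sketch in Java. It is not taken from any library: the class name SimpleBloomFilter, the way the K indexes are derived from String.hashCode(), and the mixing constant are all assumptions chosen for illustration. A real implementation would use stronger hash functions.

```java
import java.util.BitSet;

/** Minimal Bloom filter sketch: m bits, k index positions per element. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // length of the bit array
    private final int k; // number of hash functions

    SimpleBloomFilter(int m, int k) {
        if (m <= 0 || k <= 0) {
            throw new IllegalArgumentException("m and k must be positive");
        }
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derives the i-th index from two base hashes (an illustrative choice,
    // not a production-quality hash family).
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1; // forced odd
        return Math.floorMod(h1 + i * h2, m);
    }

    /** Sets the k bits that belong to the value. */
    void add(String value) {
        for (int i = 0; i < k; i++) {
            bits.set(index(value, i));
        }
    }

    /** Returns false if the value is definitely absent, true if it may be present. */
    boolean mightContain(String value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(value, i))) {
                return false; // one zero bit proves the value was never added
            }
        }
        return true;
    }
}
```

Note that the element itself is never stored, only the bits it maps to, which is why a positive answer can only mean "probably".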

The Bloom filter is essentially a clever probabilistic data structure that supports efficient insertion and querying and can tell you that something "definitely does not exist or may exist". Compared with traditional data structures such as List, Set, and Map, it is more efficient and takes up less space, but the disadvantage is that its results are probabilistic rather than exact.

When you insert new data into a simple array or list, the index value of the insert is not determined based on the value of the insert. This means that there is no direct relationship between the index value of the new insert and the data value. In this way, when you need to search for corresponding values in an array or list, you must traverse an existing collection. If there is a large amount of data in the collection, it will affect the efficiency of data search.

To solve this problem, you can consider using a hash table. Using the hash table, you can hash the "value" to get the key or index value corresponding to the value, and then store the value in the corresponding index location in the list. This means that the index value is determined by the value of the insert, and when you need to determine whether the value exists in the list, you only need to hash the value and search at the appropriate index location, which is very fast.

II. Structure of the Bloom filter

By definition, the Bloom filter can check whether a value is "probably in the set" or "definitely not in the set". "Probably" means there is a certain probability of error, that is, a certain false positive rate. Why do false positives occur? Let's analyze the cause.

The Bloom filter essentially consists of a bit vector of length m (a list that contains only 0 or 1 bit values), with all values initially set to 0, as shown in the following figure.

To add a data item to the Bloom filter, we feed it to K different hash functions and set the bit at each resulting position to 1. In the hash table mentioned earlier, we use a single hash function, so we get a single index value. For the Bloom filter, we use multiple hash functions, which produce multiple index values.

As shown in the figure above, when "semlinker" is entered, the three preset hash functions output 2, 4, and 6, and we set the corresponding bits to 1. Suppose another input, "kakuqo", makes the hash functions output 3, 4, and 7. You may have noticed that index 4 was already set by "semlinker". At this point, we have populated the bit vector with "semlinker" and "kakuqo". The current state of the bit vector is: bits 2, 3, 4, 6, and 7 are 1, and all other bits are 0.

When searching for a value, similar to a hash table, we use the same three hash functions to hash the searched value and look at the resulting index values. Suppose that when we search for "fullstack", the three hash functions output 2, 3, and 7, respectively.

As you can see from the figure above, the corresponding bits are all set to 1, which means we can only say that "fullstack" may have been inserted into the collection. In fact, this is a false positive: "fullstack" was never inserted, but hash collisions with other elements happened to set all three of its bits.
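The walk-through above can be replayed with java.util.BitSet. This sketch hard-codes the index values given in the text instead of computing real hashes; the class name BloomWalkthrough and the vector length of 8 are assumptions made for illustration.

```java
import java.util.BitSet;

/** Replays the "semlinker" / "kakuqo" / "fullstack" example with fixed indexes. */
final class BloomWalkthrough {
    /** Sets every listed bit, as an insert would. */
    static void insert(BitSet bits, int... indexes) {
        for (int i : indexes) {
            bits.set(i);
        }
    }

    /** Returns true only when every listed bit is already 1. */
    static boolean mightContain(BitSet bits, int... indexes) {
        for (int i : indexes) {
            if (!bits.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(8);
        insert(bits, 2, 4, 6); // "semlinker"
        insert(bits, 3, 4, 7); // "kakuqo"
        System.out.println(bits);                        // {2, 3, 4, 6, 7}
        System.out.println(mightContain(bits, 2, 3, 7)); // true, although "fullstack" was never inserted
    }
}
```

Bits 2 and 3 and 7 were set by two different elements, which is exactly how the false positive for "fullstack" arises.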

So how do we choose the number of hash functions and the length of the Bloom filter? Obviously, if the Bloom filter is too small, all of its bits will soon be 1, so every query will return "possibly present" and the filter will no longer filter anything. The length of the Bloom filter directly affects the false positive rate: the longer the filter, the lower the false positive rate.

The number of hash functions is also a trade-off. The more hash functions there are, the faster the bits of the Bloom filter are set to 1 and the slower each operation becomes; but with too few, the false positive rate is higher.

How do we choose k and m values that are appropriate for the business? Fortunately, the Bloom filter has a predictable false positive probability (FPP), which depends on three quantities:

n is the number of elements that have been added

k is the number of hash functions

m is the length of the Bloom filter (that is, the size of the bit array)
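The formula itself does not appear in the text here. Written with the symbols defined above, the commonly cited approximation (which assumes the hash functions spread elements uniformly and independently over the m bits) is:

```latex
\mathrm{FPP} \approx \left(1 - e^{-kn/m}\right)^{k}
```

For a fixed m, the rate rises as n grows, which matches the observation that a filter that is too small soon answers "possibly present" for everything.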

In extreme cases, when the Bloom filter has no free space (full), each query returns true. This means that the choice of m depends on the number of elements expected to be added n, and m needs to be much greater than n.

In practice, the length m of the Bloom filter can be calculated from the target error rate (FPP) and the number of elements n expected to be added.
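The standard sizing formulas, given here as a reference under the same uniform-hashing assumption and writing p for the target FPP, are:

```latex
m = -\frac{n \ln p}{(\ln 2)^{2}}, \qquad k = \frac{m}{n} \ln 2
```

As a worked example, for n = 1000000 and p = 0.03 this gives m ≈ 7.3 × 10^6 bits (a little under 1 MB) and k ≈ 5.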

After understanding the above, we can draw a conclusion: when we search for a value, if any of the bits at the indexes produced by the K hash functions is 0, the value is definitely not in the set. However, if all of those bits are 1, we can only say that the value may exist in the collection.

III. Application of Bloom filter

In practice, the common application scenarios of Bloom filter are as follows:

Web crawlers deduplicate URLs to avoid crawling the same URL more than once

Anti-spam systems check whether an email address appears in spam lists with billions of entries

Google Chrome uses a Bloom filter to identify malicious URLs

Medium uses Bloom filters to avoid recommending articles that users have already read

Google BigTable, Apache HBase, and Apache Cassandra use Bloom filters to reduce lookups for rows and columns that do not exist

In addition to the scenarios above, another application of the Bloom filter is solving the problem of cache penetration. Cache penetration means that callers keep querying data that is not in the cache, so every call falls through to the database. If there are many such requests, the pressure on the database increases and the cache loses its purpose.

Using a Bloom filter, we can load the primary keys used for lookups, such as user IDs or article IDs, into the filter in advance. When querying data by ID, we first check whether the ID may exist, and if so, proceed to the next step. If it does not exist, we return immediately so that no database query is triggered. Note that this does not eliminate cache penetration entirely, because false positives still reach the database; it only keeps the problem within a tolerable range.
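The guard described above can be sketched as follows. Everything here is hypothetical and for illustration only: the class name GuardedLookup, the in-memory Map standing in for the database, the filter size, and the simple index derivation. In a real service the filter would typically come from a library such as Guava.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Sketch: consult a Bloom filter of known IDs before touching the database. */
final class GuardedLookup {
    private static final int M = 1 << 16; // bit array size (illustrative)
    private static final int K = 3;       // number of hash functions (illustrative)

    private final BitSet knownIds = new BitSet(M);
    private final Map<String, String> database = new HashMap<>(); // stand-in for a real database
    int databaseQueries = 0; // counts how often the "database" is actually hit

    // Illustrative index derivation, not a production-quality hash.
    private static int index(String id, int i) {
        int h1 = id.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, M);
    }

    /** Stores a row and records its ID in the filter. */
    void save(String id, String row) {
        database.put(id, row);
        for (int i = 0; i < K; i++) {
            knownIds.set(index(id, i));
        }
    }

    /** Returns early when the filter proves the ID was never saved. */
    Optional<String> find(String id) {
        for (int i = 0; i < K; i++) {
            if (!knownIds.get(index(id, i))) {
                return Optional.empty(); // definitely absent: skip the database
            }
        }
        databaseQueries++; // possibly present: fall through to the database
        return Optional.ofNullable(database.get(id));
    }
}
```

A false positive still costs one database query, but the result stays correct because the database has the final word.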

IV. Advantages and disadvantages of Bloom filter

Advantages

Compared with other data structures, the Bloom filter has great advantages in both space and time. Its storage space is fixed, and its insert and query times are constant (O(k)). In addition, the hash functions are independent of each other, so they are easy to evaluate in parallel in hardware. The Bloom filter does not store the elements themselves, which is an advantage in situations with strict confidentiality requirements.

A Bloom filter can represent the universal set, which other data structures cannot.

When k and m are the same, the intersection and union of two Bloom filters that use the same set of hash functions can be computed with bitwise operations.
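As a small sketch of those bitwise operations, the two bit vectors can be combined with BitSet.or and BitSet.and. The class name BloomSetOps and the sample bit positions are assumptions for illustration. The OR result equals the filter that would be built from the union of the two sets, while the AND result is only an approximation of the intersection's filter and may give more false positives.

```java
import java.util.BitSet;

/** Combines two Bloom filter bit vectors of equal length and hash functions. */
final class BloomSetOps {
    /** Bitwise OR: the filter for the union of the two sets. */
    static BitSet union(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    /** Bitwise AND: an approximation of the filter for the intersection. */
    static BitSet intersection(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet(8);
        a.set(2); a.set(4); a.set(6);
        BitSet b = new BitSet(8);
        b.set(3); b.set(4); b.set(7);
        System.out.println(union(a, b));        // {2, 3, 4, 6, 7}
        System.out.println(intersection(a, b)); // {4}
    }
}
```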

Disadvantages

But the disadvantages of the Bloom filter are as obvious as its advantages. The false positive rate is one of them: as the number of stored elements increases, the false positive rate increases. And if the number of elements is small, a hash table is sufficient.

In addition, elements generally cannot be removed from a Bloom filter. It is tempting to turn the bit array into an array of integer counters, incrementing the counters for each inserted element so that they can be decremented when the element is deleted. However, deleting elements safely is not that simple. First, we have to be sure that the element being deleted really is in the Bloom filter, and the filter alone cannot guarantee this. In addition, counter wraparound (overflow) can also cause problems.
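The counter idea can be sketched as a counting Bloom filter. The class name CountingBloomFilter and the index derivation are assumptions for illustration, and the sketch deliberately ignores counter overflow. It also shows the limitation described above: remove can only refuse values the filter proves absent, so removing a false positive would still corrupt the counters.

```java
/** Counting Bloom filter sketch: integer counters instead of single bits. */
final class CountingBloomFilter {
    private final int[] counters;
    private final int k; // number of hash functions

    CountingBloomFilter(int m, int k) {
        this.counters = new int[m];
        this.k = k;
    }

    // Illustrative index derivation, not a production-quality hash.
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, counters.length);
    }

    /** Increments the k counters that belong to the value. */
    void add(String value) {
        for (int i = 0; i < k; i++) {
            counters[index(value, i)]++;
        }
    }

    /** Returns false if the value is definitely absent, true if it may be present. */
    boolean mightContain(String value) {
        for (int i = 0; i < k; i++) {
            if (counters[index(value, i)] == 0) {
                return false;
            }
        }
        return true;
    }

    /** Decrements the counters; refuses when the filter proves the value was never added. */
    boolean remove(String value) {
        if (!mightContain(value)) {
            return false;
        }
        for (int i = 0; i < k; i++) {
            counters[index(value, i)]--;
        }
        return true;
    }
}
```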

A lot of work has gone into reducing the false positive rate, which has produced many variants of the Bloom filter.

V. Bloom filter in practice

There are many implementations and optimizations of Bloom filters, and the well-known Guava library developed by Google provides one (BloomFilter). To use the Bloom filter provided by Guava in a Maven-based Java project, you only need to add the following dependency:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.0-jre</version>
</dependency>

After importing the Guava library, we create a new BloomFilterDemo class. In the main method, we use the BloomFilter.create method to create a Bloom filter. Then we put 1 million values into the filter, and then check those values plus 10000 values that were never added to see how many the Bloom filter reports as present:

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomFilterDemo {
    public static void main(String[] args) {
        int total = 1000000; // total quantity
        BloomFilter<CharSequence> bf =
                BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), total);

        // put 1000000 values into the filter
        for (int i = 0; i < total; i++) {
            bf.put("" + i);
        }

        // check the inserted values plus 10000 values that were never added
        int count = 0;
        for (int i = 0; i < total + 10000; i++) {
            if (bf.mightContain("" + i)) {
                count++;
            }
        }
        System.out.println("Matched quantity " + count);
    }
}

When the above code runs, the console outputs the following results:

Matched quantity 1000309

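The output clearly contains false positives: 309 more elements matched than were inserted. All 309 come from the 10000 values that were never added, so the false positive rate is:

309 / 10000 = 0.0309

That is about 3%. Measured against all 1010000 queries, the share of wrong answers is 309 / (1000000 + 10000) * 100 ≈ 0.0306%.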

If we want to improve the matching accuracy, we can set the error rate fpp when creating the Bloom filter:

BloomFilter<CharSequence> bf = BloomFilter.create(
        Funnels.stringFunnel(Charsets.UTF_8), total, 0.0002
);

Within BloomFilter, the default value for false positive rate fpp is 0.03:

// com/google/common/hash/BloomFilter.class
public static BloomFilter create(Funnel
