
How to implement a Bloom filter in C++

2025-07-12 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how to implement a Bloom filter in C++. It covers the background, how the filter works, its advantages and disadvantages, and a sample implementation.

Bloom filter

I. Historical background

The Bloom filter was proposed by Burton Howard Bloom in 1970. It consists of a very long binary vector and a series of random mapping (hash) functions, and it is used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general-purpose algorithms. Its disadvantages are a certain false positive rate and the difficulty of deleting elements, and these shortcomings are unavoidable. However, a Bloom filter never produces false negatives: if an element is in the set, the filter will never report it as absent, so nothing is missed.

Many applications need to determine whether an element is in a set: at the FBI, whether a suspect's name is already on the suspect list; in a web crawler, whether a URL has already been visited; and so on. The most direct approach is to store every element of the set and compare each new element against the stored ones. Sets in computers are generally stored in a hash table, which is fast and accurate but costs storage space. This is not a problem when the set is small, but the storage inefficiency of hash tables becomes apparent when the set is large.

For example, an email provider such as Yahoo, Hotmail, or Gmail always needs to filter out mail from spammers. One way to do this is to record the addresses of those who send spam. Because spammers keep registering new addresses, there are at least billions of spam addresses worldwide, and storing them all requires a large number of servers. With a hash table, every 100 million email addresses need 1.6 GB of memory. (The specific approach is to convert each email address into an eight-byte fingerprint and store the fingerprints in the hash table. Because the storage efficiency of a hash table is generally only 50%, each address occupies 16 bytes, so 100 million addresses need about 1.6 GB, that is, 1.6 billion bytes of memory.) Storing billions of addresses may therefore require hundreds of gigabytes of memory, more than an ordinary server can hold unless it is a supercomputer [2].
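The arithmetic above can be checked at compile time. The sketch below is only an illustration: the constant names are made up for this example, and the Bloom filter figure assumes 10 bits per element, the ratio used by the BoolFilter constructor later in this article.

```cpp
#include <cstdint>

// Figures from the text: 8-byte fingerprints in a hash table at 50% efficiency.
constexpr std::uint64_t kAddresses        = 100'000'000;  // 100 million addresses
constexpr std::uint64_t kFingerprintBytes = 8;
constexpr std::uint64_t kEfficiencyPct    = 50;

// Bytes per address once the unused half of the table is counted.
constexpr std::uint64_t kBytesPerAddress = kFingerprintBytes * 100 / kEfficiencyPct;
constexpr std::uint64_t kHashTableBytes  = kAddresses * kBytesPerAddress;

// Assumption for comparison: a Bloom filter sized at 10 bits per element.
constexpr std::uint64_t kBitsPerElement   = 10;
constexpr std::uint64_t kBloomFilterBytes = kAddresses * kBitsPerElement / 8;

static_assert(kBytesPerAddress == 16, "8 bytes at 50% efficiency is 16 bytes");
static_assert(kHashTableBytes == 1'600'000'000, "about 1.6 GB");
static_assert(kBloomFilterBytes == 125'000'000, "about 125 MB");
```

Under these assumptions the bit array is more than twelve times smaller than the hash table, at the cost of occasional false positives.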

II. Principle, advantages, and disadvantages of the Bloom filter

To determine whether an element is in a set, the usual idea is to store all the elements and then decide by comparison. Linked lists, trees, hash tables, and similar data structures all work this way. But as the number of elements grows, more and more storage is needed, and lookups tend to slow down as well.

A Bloom filter is a space-efficient probabilistic data structure. It can be regarded as an extension of the bitmap. Its principle is:

When an element is added to the set, K hash functions map it to K positions in a bit array, and those positions are set to 1. To look an element up, we only need to check whether all of its positions are 1 to know (approximately) whether it is in the set:

If any of these positions is 0, the element is definitely not in the set.

If all of them are 1, the element is probably in the set.
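The add and lookup rules above can be sketched in a few lines. This is a minimal illustration, not the implementation given later in this article: the class name SimpleBloom is hypothetical, and deriving the K positions from two std::hash values is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical minimal Bloom filter: m bits, k probe positions per key.
class SimpleBloom {
public:
    SimpleBloom(std::size_t m, std::size_t k) : bits_(m, false), k_(k) {
        assert(m > 0 && k > 0);
    }

    // Set the k bits selected for this key.
    void add(const std::string& key) {
        for (std::size_t i = 0; i < k_; ++i) bits_[position(key, i)] = true;
    }

    // false: the key was definitely never added.
    // true:  the key was probably added (false positives are possible).
    bool might_contain(const std::string& key) const {
        for (std::size_t i = 0; i < k_; ++i)
            if (!bits_[position(key, i)]) return false;
        return true;
    }

private:
    // Assumption: the i-th position is h1 + i * h2 rather than the output
    // of k unrelated hash functions.
    std::size_t position(const std::string& key, std::size_t i) const {
        const std::size_t h1 = std::hash<std::string>{}(key);
        const std::size_t h2 = std::hash<std::string>{}(key + "#") | 1;
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t k_;
};
```

A key that has been added always passes might_contain, while a key that has not been added is usually rejected but may occasionally pass.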

Advantages:

Compared with general-purpose algorithms, a Bloom filter is far better in both space efficiency and query time. Insertion and query both take O(K) time, and the bit array has a fixed size, so neither depends on how many elements have been stored. In addition, the hash functions are independent of one another, which makes parallel hardware implementation convenient. A Bloom filter does not store the elements themselves, which is an advantage where confidentiality requirements are strict.

Disadvantages:

1. The disadvantages of the Bloom filter are as obvious as its advantages. The false positive rate is one of them: it rises as more elements are stored. If the number of elements is small, an ordinary hash table is sufficient.

2. In general, elements cannot be deleted from a Bloom filter. It is tempting to turn the bit array into an array of integers, incrementing the corresponding counters for each inserted element and decrementing them when an element is deleted. However, deleting elements safely is not that simple. First, we must be sure that the element being deleted really is in the Bloom filter, which the filter alone cannot guarantee. In addition, counter wraparound can cause problems.
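The counter idea from the second point might look like the sketch below. It is an illustration under stated assumptions: the class name CountingBloom, the 8-bit counter width, and the choice to saturate rather than wrap are all choices made for this example, and the position helper again derives positions from two std::hash values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical counting variant: each position holds a small counter.
class CountingBloom {
public:
    CountingBloom(std::size_t m, std::size_t k) : counters_(m, 0), k_(k) {
        assert(m > 0 && k > 0);
    }

    void add(const std::string& key) {
        for (std::size_t i = 0; i < k_; ++i) {
            std::uint8_t& c = counters_[position(key, i)];
            if (c < kMax) ++c;  // saturate instead of wrapping around to 0
        }
    }

    bool might_contain(const std::string& key) const {
        for (std::size_t i = 0; i < k_; ++i)
            if (counters_[position(key, i)] == 0) return false;
        return true;
    }

    // Returns false and changes nothing if the key cannot be present.
    // A false positive still passes this check, so removing a key that was
    // never added can corrupt the counts of other keys.
    bool remove(const std::string& key) {
        if (!might_contain(key)) return false;
        for (std::size_t i = 0; i < k_; ++i) {
            std::uint8_t& c = counters_[position(key, i)];
            // A saturated counter has lost its exact count, so leave it alone.
            if (c > 0 && c < kMax) --c;
        }
        return true;
    }

private:
    static constexpr std::uint8_t kMax = 255;

    std::size_t position(const std::string& key, std::size_t i) const {
        const std::size_t h1 = std::hash<std::string>{}(key);
        const std::size_t h2 = std::hash<std::string>{}(key + "#") | 1;
        return (h1 + i * h2) % counters_.size();
    }

    std::vector<std::uint8_t> counters_;
    std::size_t k_;
};
```

The sketch uses eight times the memory of a plain bit array, and the presence check in remove only narrows the problem described above without solving it.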

III. Example

The Google Chrome browser has used a Bloom filter to identify malicious links. A large set of URLs can be represented in a small amount of storage because each URL is mapped to only a few bits, and the error rate is below 1/10000.
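How many bits per element a given error rate costs can be estimated with the standard approximation p = (1 - e^(-K/(m/n)))^K, where m/n is the number of bits per stored element and K is the number of hash functions. The function names below are hypothetical, and the formula treats the hash functions as ideal, so the results are estimates only.

```cpp
#include <cassert>
#include <cmath>

// Approximate false positive rate for m/n bits per element and k hashes.
double false_positive_rate(double bits_per_element, int k) {
    assert(bits_per_element > 0.0 && k > 0);
    return std::pow(1.0 - std::exp(-k / bits_per_element), k);
}

// Hash count that minimises the rate for a given m/n: (m/n) * ln 2, rounded.
int optimal_hash_count(double bits_per_element) {
    return static_cast<int>(std::lround(bits_per_element * std::log(2.0)));
}
```

Under this approximation, 10 bits per element with 5 hash functions (the sizing used by the implementation in this article) gives a rate of roughly 0.94%, while getting below 1/10000 takes about 20 bits per element with 14 hash functions.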

C++ implementation

Bit_set.h

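#pragma once
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-size bit set that packs 32 bits into each word.
class Bitset {
public:
    explicit Bitset(size_t value) { _a.resize((value >> 5) + 1, 0); }

    // Sets bit num; returns false if it was already set.
    bool set(size_t num) {
        size_t index = num >> 5;
        size_t pos = num % 32;
        if (_a[index] & (1u << pos)) return false;
        _a[index] |= (1u << pos);
        return true;
    }

    // Clears bit num; returns false if it was not set.
    // (The method name is lost in the source; reset is assumed.)
    bool reset(size_t num) {
        size_t index = num >> 5;
        size_t pos = num % 32;
        if (!Text(num)) return false;
        _a[index] &= ~(1u << pos);
        return true;
    }

    // Tests whether bit num is set.
    bool Text(size_t num) const {
        size_t index = num >> 5;
        size_t pos = num % 32;
        return (_a[index] & (1u << pos)) != 0;
    }

private:
    std::vector<uint32_t> _a;
};

Hash.h

#pragma once
#include <cstddef>

// String hash functions. The bodies of BKDRHash, SDBMHash and RSHash are
// missing from the source; they are assumed to follow the widely published
// forms of these functions.
inline size_t BKDRHash(const char* str) {
    size_t hash = 0;
    while (size_t ch = (unsigned char)*str++) hash = hash * 131 + ch;
    return hash;
}

inline size_t SDBMHash(const char* str) {
    size_t hash = 0;
    while (size_t ch = (unsigned char)*str++) hash = 65599 * hash + ch;
    return hash;
}

inline size_t RSHash(const char* str) {
    size_t hash = 0;
    size_t magic = 63689;
    while (size_t ch = (unsigned char)*str++) {
        hash = hash * magic + ch;
        magic *= 378551;
    }
    return hash;
}

inline size_t APHash(const char* str) {
    size_t hash = 0;
    size_t ch;
    for (long i = 0; (ch = (unsigned char)*str++) != 0; i++) {
        if ((i & 1) == 0) hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
        else hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
    }
    return hash;
}

inline size_t JSHash(const char* str) {
    if (!*str) return 0;
    size_t hash = 1315423911;
    while (size_t ch = (unsigned char)*str++) hash ^= ((hash << 5) + ch + (hash >> 2));
    return hash;
}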

Bloom_Filter.h

#pragma once
#include <string>
#include "Bit_set.h"
#include "Hash.h"

// One functor per hash function, so each can be passed as a template argument.
template <class T>
struct _HashFunk1 {
    size_t operator()(const T& key) { return BKDRHash(key.c_str()); }
};

template <class T>
struct _HashFunk2 {
    size_t operator()(const T& key) { return SDBMHash(key.c_str()); }
};

template <class T>
struct _HashFunk3 {
    size_t operator()(const T& key) { return RSHash(key.c_str()); }
};

template <class T>
struct _HashFunk4 {
    size_t operator()(const T& key) { return APHash(key.c_str()); }
};

template <class T>
struct _HashFunk5 {
    size_t operator()(const T& key) { return JSHash(key.c_str()); }
};

// The template parameter list is missing from the source; the defaults
// below are assumed.
template <class K = std::string,
          class HashFunk1 = _HashFunk1<K>,
          class HashFunk2 = _HashFunk2<K>,
          class HashFunk3 = _HashFunk3<K>,
          class HashFunk4 = _HashFunk4<K>,
          class HashFunk5 = _HashFunk5<K>>
class BoolFilter {
public:
    // Reserves 10 bits per expected element; n must be greater than 0.
    explicit BoolFilter(size_t n) : _a(n * 10), _range(n * 10) {}

    // Sets the five bits selected by the five hash functions.
    void set(const K& key) {
        _a.set(HashFunk1()(key) % _range);
        _a.set(HashFunk2()(key) % _range);
        _a.set(HashFunk3()(key) % _range);
        _a.set(HashFunk4()(key) % _range);
        _a.set(HashFunk5()(key) % _range);
    }

    // Returns false if the key is definitely absent, true if it is probably present.
    bool Text(const K& key) {
        if (!_a.Text(HashFunk1()(key) % _range)) return false;
        if (!_a.Text(HashFunk2()(key) % _range)) return false;
        if (!_a.Text(HashFunk3()(key) % _range)) return false;
        if (!_a.Text(HashFunk4()(key) % _range)) return false;
        if (!_a.Text(HashFunk5()(key) % _range)) return false;
        return true;
    }

private:
    Bitset _a;
    size_t _range;
};
