Principle and implementation of Bloom filter (Bloom Filter) 07/15 Update SLTechnology News&Howtos

Principle and implementation of Bloom filter (Bloom Filter)

2025-07-15 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Network Security >

Shulou(Shulou.com)06/01 Report--

Under what circumstances do you need a Bloom filter?

Let's first look at some of the more common examples.

In word processing software, you need to check whether an English word is spelled correctly.

In FBI, whether the name of a suspect is already on the suspect list

In a web crawler, has a web address been visited?

Yahoo, gmail and other email spam filtering functions

These examples have a common feature: how to determine whether an element exists in a collection?

Conventional train of thought

Array

Linked list

Tree, balanced binary tree, Trie

Map (Red Black Tree)

Hash table

Although these data structures described above are combined with common sorting, binary search can quickly and efficiently deal with most of the requirements to determine whether elements exist in the set. But when the number of elements in the collection is large enough, what if there are 5 million or even 100 million records? At this time, the problem of conventional data structure is highlighted. Data structures such as arrays, linked lists and trees will store the contents of elements. once the amount of data is too large, the memory consumed will grow linearly and eventually reach the bottleneck. Some students may ask, isn't the hash table very efficient? The query efficiency can reach O (1). But the memory consumption of hash tables is still high. The consumption of using a hash table to store 100 million junk email addresses? Hash table practice: first, the hash function maps an email address to an 8-byte information fingerprint; considering that the storage efficiency of the hash table is usually less than 50% (hash conflict); therefore, the memory consumed: 8 * 2 * 100 million bytes = 1.6 GB of memory, an ordinary computer cannot provide such a large amount of memory. At this time, the Bloom filter (Bloom Filter) came into being. Before continuing to introduce the principle of the Bloom filter, we will first explain the preliminary knowledge about the hash function.

Hash function

The concept of a hash function is a function that converts data of any size into data of a specific size, and the converted data is called a hash value or hash code. Here is a schematic diagram:

It is obvious that the original data is called hash coding after the mapping of the hash function, and the data is compressed. The hash function is the basis for implementing hash tables and Bloom filters.

Introduction of Bloom filter

Barton。 Bloom proposed in 1970

A very long binary vector (bit array)

A series of random functions (hash)

High space efficiency and query efficiency

There is a certain misjudgment rate (the hash table is an exact match)

Principle of Bloom filter

The core implementation of the Bloom filter (Bloom Filter) is a very large array of bits and several hash functions. Suppose the length of the bit array is m and the number of hash functions is k.

The above figure is an example of the specific operation flow: suppose there are three elements {x, y, z} in the set, and the number of hash functions is 3. First initialize the bit array and set the bit 0 for each bit in it. For each element in the set, the elements are mapped through three hash functions in turn, and each mapping produces a hash value that corresponds to a point on the bit array, and then marks the corresponding position of the bit array as 1. When querying whether the W element exists in the collection, the same method maps W to three points on the array by hashing. If one of the three points is not 1, you can judge that the element must not exist in the collection. Conversely, if all three points are 1, the element may exist in the collection. Note: it is not possible to determine whether the element must exist in the collection, there may be a certain misjudgment rate. You can see from the figure: suppose an element corresponds to these three points with a subscript of 4, 5, 5, 6 by mapping. Although these three points are all 1, it is obvious that these three points are the positions obtained by hashing different elements, so this situation shows that although the elements are not in the set, they may all correspond to 1, which is the reason for the existence of misjudgment rate.

Bloom filter add element

Give k hash functions to the elements to be added

Get the k positions corresponding to the bit array

Set the k positions to 1

Bloom filter query element

Give k hash functions to the elements to be queried

Get the k positions corresponding to the bit array

If one of the k locations is 0, it is definitely not in the collection

If all k locations are 1, you may be in the collection

Implementation of Bloom filter

The following is the implementation of python, using the murmurhash algorithm

Import mmh4from bitarray import bitarray# zhihu_crawler.bloom_filter# Implement a simple bloom filter with murmurhash algorithm.# Bloom filter is used to check wether an element exists in a collection, and it has a good performance in big data situation.# It may has positive rate depend on hash functions and elements count.BIT_SIZE = 5000000class BloomFilter: def _ _ init__ (self): # Initialize bloom filter Set size and all bits to 0 bit_array = bitarray (BIT_SIZE) bit_array.setall (0) self.bit_array = bit_array def add (self, url): # Add a url, and set points in bitarray to 1 (Points count is equal to hash funcs count.) # Here use 7 hash functions. Point_list = self.get_postions (url) for b in point_list: self.bit_ array [b] = 1 def contains (self, url): # Check if a url is in a collection point_list = self.get_postions (url) result = True for b in point_list: result = result and self.bit_ array [b] return result def get_postions (self Url): # Get points positions in bit vector. Point1 = mmh4.hash (url, 41)% BIT_SIZE point2 = mmh4.hash (url, 42)% BIT_SIZE point3 = mmh4.hash (url, 43)% BIT_SIZE point4 = mmh4.hash (url, 44)% BIT_SIZE point5 = mmh4.hash (url, 45)% BIT_SIZE point6 = mmh4.hash (url, 46)% BIT_SIZE point7 = mmh4.hash (url, 47)% BIT_SIZE return [point1, point2 Point3, point4, point5, point6, point7]

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.