In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-04-10 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Network Security >
Share
Shulou(Shulou.com)06/01 Report--
Under what circumstances do you need a Bloom filter?
Let's first look at some of the more common examples.
In word processing software, you need to check whether an English word is spelled correctly.
In FBI, whether the name of a suspect is already on the suspect list
In a web crawler, has a web address been visited?
Yahoo, gmail and other email spam filtering functions
These examples have a common feature: how to determine whether an element exists in a collection?
Conventional train of thought
Array
Linked list
Tree, balanced binary tree, Trie
Map (Red Black Tree)
Hash table
Although these data structures described above are combined with common sorting, binary search can quickly and efficiently deal with most of the requirements to determine whether elements exist in the set. But when the number of elements in the collection is large enough, what if there are 5 million or even 100 million records? At this time, the problem of conventional data structure is highlighted. Data structures such as arrays, linked lists and trees will store the contents of elements. once the amount of data is too large, the memory consumed will grow linearly and eventually reach the bottleneck. Some students may ask, isn't the hash table very efficient? The query efficiency can reach O (1). But the memory consumption of hash tables is still high. The consumption of using a hash table to store 100 million junk email addresses? Hash table practice: first, the hash function maps an email address to an 8-byte information fingerprint; considering that the storage efficiency of the hash table is usually less than 50% (hash conflict); therefore, the memory consumed: 8 * 2 * 100 million bytes = 1.6 GB of memory, an ordinary computer cannot provide such a large amount of memory. At this time, the Bloom filter (Bloom Filter) came into being. Before continuing to introduce the principle of the Bloom filter, we will first explain the preliminary knowledge about the hash function.
Hash function
The concept of a hash function is a function that converts data of any size into data of a specific size, and the converted data is called a hash value or hash code. Here is a schematic diagram:
It is obvious that the original data is called hash coding after the mapping of the hash function, and the data is compressed. The hash function is the basis for implementing hash tables and Bloom filters.
Introduction of Bloom filter
Barton。 Bloom proposed in 1970
A very long binary vector (bit array)
A series of random functions (hash)
High space efficiency and query efficiency
There is a certain misjudgment rate (the hash table is an exact match)
Principle of Bloom filter
The core implementation of the Bloom filter (Bloom Filter) is a very large array of bits and several hash functions. Suppose the length of the bit array is m and the number of hash functions is k.
The above figure is an example of the specific operation flow: suppose there are three elements {x, y, z} in the set, and the number of hash functions is 3. First initialize the bit array and set the bit 0 for each bit in it. For each element in the set, the elements are mapped through three hash functions in turn, and each mapping produces a hash value that corresponds to a point on the bit array, and then marks the corresponding position of the bit array as 1. When querying whether the W element exists in the collection, the same method maps W to three points on the array by hashing. If one of the three points is not 1, you can judge that the element must not exist in the collection. Conversely, if all three points are 1, the element may exist in the collection. Note: it is not possible to determine whether the element must exist in the collection, there may be a certain misjudgment rate. You can see from the figure: suppose an element corresponds to these three points with a subscript of 4, 5, 5, 6 by mapping. Although these three points are all 1, it is obvious that these three points are the positions obtained by hashing different elements, so this situation shows that although the elements are not in the set, they may all correspond to 1, which is the reason for the existence of misjudgment rate.
Bloom filter add element
Give k hash functions to the elements to be added
Get the k positions corresponding to the bit array
Set the k positions to 1
Bloom filter query element
Give k hash functions to the elements to be queried
Get the k positions corresponding to the bit array
If one of the k locations is 0, it is definitely not in the collection
If all k locations are 1, you may be in the collection
Implementation of Bloom filter
The following is the implementation of python, using the murmurhash algorithm
Import mmh4from bitarray import bitarray# zhihu_crawler.bloom_filter# Implement a simple bloom filter with murmurhash algorithm.# Bloom filter is used to check wether an element exists in a collection, and it has a good performance in big data situation.# It may has positive rate depend on hash functions and elements count.BIT_SIZE = 5000000class BloomFilter: def _ _ init__ (self): # Initialize bloom filter Set size and all bits to 0 bit_array = bitarray (BIT_SIZE) bit_array.setall (0) self.bit_array = bit_array def add (self, url): # Add a url, and set points in bitarray to 1 (Points count is equal to hash funcs count.) # Here use 7 hash functions. Point_list = self.get_postions (url) for b in point_list: self.bit_ array [b] = 1 def contains (self, url): # Check if a url is in a collection point_list = self.get_postions (url) result = True for b in point_list: result = result and self.bit_ array [b] return result def get_postions (self Url): # Get points positions in bit vector. Point1 = mmh4.hash (url, 41)% BIT_SIZE point2 = mmh4.hash (url, 42)% BIT_SIZE point3 = mmh4.hash (url, 43)% BIT_SIZE point4 = mmh4.hash (url, 44)% BIT_SIZE point5 = mmh4.hash (url, 45)% BIT_SIZE point6 = mmh4.hash (url, 46)% BIT_SIZE point7 = mmh4.hash (url, 47)% BIT_SIZE return [point1, point2 Point3, point4, point5, point6, point7]
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.