
Bloom filter in practice [preventing cache penetration]


Why introduce it

In our business we often run into requests penetrating through to the database, which can usually be solved with a cache. But if there are many data dimensions and the resulting data set is large, caching alone is not effective. So, to solve the penetration problem, we introduce the Bloom Filter.

Suitable scenarios

Preventing database penetration. Google Bigtable, Apache HBase, Apache Cassandra, and PostgreSQL use BloomFilter to reduce disk lookups for rows or columns that do not exist. Avoiding costly disk lookups greatly improves the performance of database queries. This matches the business scenario at the start: when the amount of data is too large to fit comfortably in the cache, you need to intercept requests before they reach the database.

Cache downtime. Using a Bloom filter as a fallback when the cache is down causes a certain degree of misjudgment. Besides the Bloom filter's own false-positive rate, the cache before the outage may not have covered all the data in the DB, so a request for data never requested before can be misjudged after the outage. Still, as an emergency measure during a cache outage, this should be tolerable.

Deduplicating requests in a web interceptor to resist attacks. When a user makes a request for the first time, the request parameters are put into the BloomFilter; on subsequent requests, the parameters are first checked against the BloomFilter. This also improves the cache hit rate.

Malicious address detection. The Chrome browser checks whether an address is malicious. Each URL is first checked against a local BloomFilter; only if the BloomFilter returns a positive result is the full check of the URL performed, and the user is warned only if that check also comes back positive.

Bitcoin acceleration. Bitcoin uses BloomFilter to speed up wallet synchronization.

Open source project address: https://github.com/luw2007/bloomfilter

Let's take a look at the general business caching process:

Query the cache first; on a cache miss, query the database, then put the result into the cache. Even when the data does not exist, a cache entry still needs to be created, to stop requests from piercing through to the database. The two cases should be distinguished: when the data does not exist, the cache time can be set relatively short, so that problems such as master-slave synchronization delay are not amplified.
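To make the flow concrete, here is a minimal sketch using the redis-py client. The key layout, TTL values, and the db_query stub are illustrative assumptions, not part of the original article:

import redis

r = redis.Redis()

TTL_HIT = 3600    # normal cache time for data that exists
TTL_MISS = 60     # deliberately short cache time for data that does not exist

def db_query(uid):
    # hypothetical database lookup; returns None when the row does not exist
    return None

def get_user(uid):
    key = 'user:%s' % uid                    # hypothetical key layout
    val = r.get(key)
    if val is not None:
        return None if val == b'' else val   # b'' marks a cached miss
    row = db_query(uid)
    if row is None:
        r.set(key, '', ex=TTL_MISS)          # cache the miss, but only briefly
        return None
    r.set(key, row, ex=TTL_HIT)
    return row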

The weak point in this flow is that when the user base is large, we end up caching a lot of empty data, and a wave of cold users can then trigger an avalanche effect. For that case we have a second version of the flow: filtering cold users with redis.

We put every user that actually exists in the database into a redis set, with no expiry. This makes redis an index of the database: one query against redis tells you whether the data exists at all. If redis says it does not exist, return the result directly; if it does, continue with the general caching flow above.
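A minimal sketch of this second flow, again with redis-py; the set name user:index is an assumed convention, and get_user is the caching function sketched above:

import redis

r = redis.Redis()

def get_data(uid):
    # the set acts as an index of the database: membership means "exists in DB"
    if not r.sismember('user:index', uid):
        return None          # definitely not in the database, return directly
    return get_user(uid)     # otherwise fall through to the general caching flow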

If you think about it, more questions come up:

Redis itself can serve as a cache, so why not just return the data from it directly?

If the amount of data is fairly large, will a single set run into performance problems?

For non-critical business, putting all the data in redis eats a lot of server memory. Is the cost out of proportion to the benefit?

Problem 1 depends on the business scenario. If the resulting data is small, we can use redis as the cache directly and return the data. If the result is large, it is not suitable to store in redis: in UGC content, for example, a single comment can run to tens of thousands of words, with many business fields.

Problem 2 comes down to redis technique. A bigkey does considerable harm: whether through the memory allocation and release caused by growing or shrinking it, or through a careless query command returning a huge amount of data, it threatens the stability of redis. We won't dwell on the causes and hazards here. The fix for a bigkey is simple: use a hash function to split it into buckets, spreading the data across multiple keys. This shrinks each individual key without hurting query efficiency, as sketched below.
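A sketch of the bucketing idea; the bucket count and key naming are illustrative assumptions:

from zlib import crc32
import redis

r = redis.Redis()
BUCKETS = 1024   # split one big set into 1024 small ones

def bucket_key(uid):
    # the hash function decides which small set a member lives in
    return 'user:index:%d' % (crc32(str(uid).encode()) % BUCKETS)

def index_add(uid):
    r.sadd(bucket_key(uid), uid)

def index_contains(uid):
    return r.sismember(bucket_key(uid), uid)

Queries still touch exactly one key, so lookup efficiency is unaffected.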

Problem 3 is that redis storage takes up too much memory, so we need to reduce memory usage. Rethink the purpose of introducing redis: here redis acts as a set, and the whole business boils down to verifying whether the request parameter is in that set. The structure is like the two-way valve on a bath mixer, hot water on one side and cold water on the other: every input is routed one of exactly two ways.

Most programming languages have a built-in filter. Take python as an example: the filter function filters a sequence, dropping the elements that fail a predicate and returning a list of the elements that pass.

Let's look at an example:

$ python2
Python 2.7.10 (default, Oct  6 2017, 22:29:07)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = {2, 4}
>>> filter(lambda x: x in s, [0, 1, 2])
[2]

The set s contains the two numbers 2 and 4, and we want to know which numbers in the list [0, 1, 2] are in s. lambda x: x in s constructs an anonymous function that tests whether its argument x is in s. filter applies this anonymous function to each number in the list in turn, finally returning the list [2].

Redis implements set with two underlying structures: intset and hash table. Non-numeric members, or a large number of members, force the encoding down to a hash table. So: is there a good algorithm that can save the space of that hash table?

In fact, the Bloom filter (Bloom Filter) was proposed by Burton Howard Bloom as early as 1970. It is essentially a very long binary vector plus a series of random mapping functions, and it can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general algorithms; its disadvantages are a certain false-positive rate and difficulty with deletion.

BloomFilter principle

A common approach is to md5 the business field and put the digest into a collection. md5 produces a fixed-length 128-bit digest. If we represented membership with a bitmap indexed by the digest, we would need

2^128 = 340282366920938463463374607431768211456 bits

and checking whether a value exists becomes checking whether its position in the bitmap is 1. But all the storage in the world could not hold such a bitmap, so we can only allocate a limited space for storage. For example:

from zlib import crc32

# experiment: insert `sample` values into a bitmap of `size` slots using
# `hash_size` hash functions, and count how many insertions collide
def my_hash(value, seed, size):
    # one hash function per seed, derived from crc32
    return crc32(('%s:%s' % (seed, value)).encode()) % size

size = 10          # number of slots in the bitmap
hash_size = 1      # number of hash functions per element
sample = 5         # how many elements to insert
collision = 0
s = set()
for i in range(sample):
    k = set(my_hash(i, j, size) for j in range(hash_size))
    if k & s:                      # some slot already taken: a collision
        collision += 1
    s |= k
print(collision)

With only one hash function, collisions occur easily.

In the run above, both 1 and 2 hash to 7: a collision. What happens if we add more hash functions?

We tested with more hash functions and larger data sets, producing the following table.

The figures show that adding hash functions effectively reduces the probability of collision. The better-performing configurations are as follows.

However, each added hash function also lowers space efficiency: once the set occupies about 25% of the total space, adding more hash functions no longer helps noticeably.

Using multiple hash functions to reduce collisions, as above, is the core idea of BloomFilter.
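To make the idea concrete, here is a minimal in-memory sketch (my own illustration, not the redis-backed implementation discussed below): k seeded crc32 hashes map each value to k bit positions, and a value is reported present only if all k bits are set.

from zlib import crc32

class SimpleBloom(object):
    def __init__(self, size, hash_count):
        self.size = size
        self.hash_count = hash_count
        self.bits = bytearray(size // 8 + 1)   # the bitmap

    def _positions(self, value):
        # derive hash_count different hashes from crc32 by varying a seed
        for seed in range(self.hash_count):
            yield crc32(('%d:%s' % (seed, value)).encode()) % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        # present only if every position is set; any zero bit means "definitely absent"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))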

Advantages of the algorithm:

The space footprint is small, and the data itself need not be stored.

The disadvantages of the algorithm itself are:

Elements can be added to the collection, but cannot be deleted.

Only a negative match is certain: "definitely not in the set". A positive match does not guarantee that the value is actually in the set.

When the set is nearly full, that is, close to its estimated capacity, the probability of false positives rises.

Space usage is enlarged, though modestly: in general, a 1% false-positive probability costs fewer than 10 bits per element, regardless of the size or number of elements in the set. Queries also slow down as hash functions are added, since each lookup must examine one bit position per hash function to confirm a match.
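These numbers follow from the standard sizing formulas m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2 (not given in the article); a quick sanity check of the 1% figure:

import math

def bloom_params(n, p):
    # optimal bitmap size m (bits) and hash count k for n elements at error rate p
    m = -n * math.log(p) / (math.log(2) ** 2)
    k = m / n * math.log(2)
    return int(math.ceil(m)), int(round(k))

print(bloom_params(10 ** 6, 0.01))   # -> (9585059, 7), i.e. about 9.6 bits per element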

Against these advantages of BloomFilter, the disadvantages can be ignored. After all, storing N elements takes only kN bits of space; the space efficiency is excellent.

How to use BloomFilter

BloomFilter requires a large bitmap for storage, and given the current situation of the company, the best storage container is redis. A quick survey of the github topic bloom-filter turns up the options below.

Ways to integrate BloomFilter with redis:

Native python, calling setbit to build the BloomFilter

Lua script

Rebloom, a Bloom Filter module for redis (note: redis modules were introduced in redis 4.0)

pyreBloom, which talks to redis through hiredis

The native python method is too slow, and lua scripts and module deployment are cumbersome. So we recommend pyreBloom, whose core is implemented in C on top of hiredis; a usage sketch follows, after which we look at its source.
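A quick usage sketch, assuming a redis server on localhost with default settings; the key name, capacity, and error rate are illustrative:

from pyreBloom import pyreBloom

p = pyreBloom('bloom_test', 100000, 0.01)   # redis key, capacity, error rate
p.extend(['a', 'b', 'c'])
print(p.contains('a'))           # True
print(p.contains(['a', 'x']))    # only the members actually present come back

Now to the source layout: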

pyreBloom:master λ ls
Makefile  bloom.c  bloom.h  bloom.o  bloom.pxd  main.c  murmur.c  pyreBloom.c  pyreBloom.pyx

As the file names suggest, the bloom core is written in C, and pyreBloom is written in cython.

bloom.h implements the core logic of the BloomFilter: the interaction with the redis server, the hash functions, and the add / check / delete operations:

int init_pyrebloom(pyrebloomctxt * ctxt, char * key, uint32_t capacity, double error, char * host, uint32_t port, char * password, uint32_t db);
int free_pyrebloom(pyrebloomctxt * ctxt);
int add(pyrebloomctxt * ctxt, const char * data, uint32_t len);
int add_complete(pyrebloomctxt * ctxt, uint32_t count);
int check(pyrebloomctxt * ctxt, const char * data, uint32_t len);
int check_next(pyrebloomctxt * ctxt);
int delete(pyrebloomctxt * ctxt);

pyreBloom.pyx defines the Python-facing class; its skeleton, as recoverable from the source, is:

import math
import random

cimport bloom

class pyreBloomException(Exception):
    pass

cdef class pyreBloom(object):
    cdef bloom.pyrebloomctxt context
    cdef bytes key

    property bits:      # size of the underlying bitmap
    property hashes:    # number of hash functions

    def __cinit__(self, key, capacity, error, host, port, password, db):
        # calls bloom.init_pyrebloom(); raises pyreBloomException on failure
    def __dealloc__(self):
        # calls bloom.free_pyrebloom()
    def delete(self):
        # calls bloom.delete(), dropping the filter's redis keys
    def put(self, value):       # add a single value or a list of values
    def add(self, value):
    def extend(self, values):
    def contains(self, value):  # check a single value or a list of values
    def keys(self):             # the redis keys backing the filter

Because pyreBloom uses the hiredis library, it has no reconnection logic of its own, so we add a simple wrapper around it that catches errors and reconnects.

import logging

from pyreBloom import pyreBloom, pyreBloomException

from BloomFilter.utils import force_utf8


def catch_error(func):
    # hiredis does not reconnect by itself, so on a pyreBloom error we log,
    # rebuild the connection, and re-raise
    def wrapper(self, *args, **kwargs):
        try:
            return func(self, *args, **kwargs)
        except pyreBloomException as error:
            logging.warning('pyreBloom error: %s, reconnecting', error)
            self.reconnect()
            raise
    return wrapper


class BloomFilter(object):
    PREFIX = 'bloom'        # redis key prefix (per-business setting)
    CAPACITY = 1000000      # expected number of elements
    ERROR_RATE = 0.01       # acceptable false-positive rate

    def __init__(self, redis=None):
        self._bf_conn = None
        self._conf = {'host': '127.0.0.1', 'password': '',
                      'port': 6379, 'db': 0}
        if redis:
            for k in self._conf:
                if k in redis:
                    self._conf[k] = redis[k]
        # force_utf8 is the author's helper that utf8-encodes string values
        self._conf = {k: force_utf8(v) for k, v in self._conf.items()}

    @property
    def bf_conn(self):
        # create the underlying pyreBloom connection lazily
        if not self._bf_conn:
            prefix = force_utf8(self.PREFIX)
            logging.debug('pyreBloom connect: %s:%s db=%s key=%s',
                          self._conf['host'], self._conf['port'],
                          self._conf['db'], prefix)
            self._bf_conn = pyreBloom(
                prefix, self.CAPACITY, self.ERROR_RATE,
                self._conf['host'], self._conf['port'],
                self._conf['password'], self._conf['db'])
        return self._bf_conn

    def reconnect(self):
        # dropping the cached connection forces bf_conn to rebuild it
        logging.debug('pyreBloom reconnect')
        self._bf_conn = None

Advanced: counting filter (Counting Filter)

Counting filters provide a way to delete from a BloomFilter without recreating the filter. In a counting filter, each array position (bucket) is extended from a single bit to an n-bit counter. In fact, a regular Bloom filter can be seen as a counting filter with a bucket size of one bit.

The insert operation is extended to increment the value of each bucket, and a lookup checks that every required bucket is non-zero. Deletion then decrements each of those buckets.

Arithmetic overflow of the buckets is a problem, and buckets should be large enough to make overflow rare. If it does occur, the increment and decrement operations must leave the counter pinned at its maximum possible value, in order to retain the BloomFilter's properties. A sketch follows.
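A minimal counting-filter sketch in the same style as before (my own illustration), using 4-bit-style counters clamped at 15; the clamping implements the overflow rule above, so a saturated counter is never changed again:

from zlib import crc32

class CountingBloom(object):
    def __init__(self, size, hash_count):
        self.size = size
        self.hash_count = hash_count
        self.buckets = [0] * size          # counters instead of single bits

    def _positions(self, value):
        for seed in range(self.hash_count):
            yield crc32(('%d:%s' % (seed, value)).encode()) % self.size

    def add(self, value):
        for pos in self._positions(value):
            if self.buckets[pos] < 15:     # clamp: saturated counters stay at max
                self.buckets[pos] += 1

    def remove(self, value):
        for pos in self._positions(value):
            if 0 < self.buckets[pos] < 15: # never decrement a saturated (or empty) counter
                self.buckets[pos] -= 1

    def __contains__(self, value):
        return all(self.buckets[pos] > 0 for pos in self._positions(value))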

Counters are usually 3 or 4 bits wide, so a counting Bloom filter takes 3 to 4 times the space of a static one. By contrast, the data structures of Pagh, Pagh and Rao (2005) and Fan et al. (2014) also allow deletion, yet use less space than a static BloomFilter.

Another problem with counting filters is limited scalability. Because the counting Bloom filter table cannot be expanded, the maximum number of keys to be stored simultaneously must be known in advance. Once the table's designed capacity is exceeded, the false-positive rate grows rapidly as more keys are inserted.

Bonomi et al. (2006) introduced a data structure based on d-left hashing that is functionally equivalent but uses roughly half the space of a counting BloomFilter. It avoids the scalability problem: once the designed capacity is exceeded, the keys can be reinserted into a new hash table of double the size.

The space-efficient variant of Putze, Sanders and Singler (2007) can also be used to implement a counting filter, supporting insertions and deletions.

Rottenstreich, Kanizo and Keslassy (2012) introduced a general method based on variable increments that significantly improves the false-positive probability of counting Bloom filters and their variants, while still supporting deletion. Instead of a unit increment, each insertion increments the hashed counters by a variable amount. A query must then consider the exact values of the counters, not merely whether they are positive: if the sum a counter holds cannot be composed from the variable increments of the queried element, a negative answer can be returned.

Original author: Lu Wei, a senior back-end engineer
