What is the algorithm formula for the size of a Redis Bloom filter

2025-07-06 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Today, the editor will share what you need to know about the algorithm formula for the size of a Redis Bloom filter. The content is detailed and the logic is clear. Most people are probably not very familiar with this topic, so this article is shared for your reference. I hope you get something out of it. Let's take a look.

1. Brief introduction

Client: does this key exist?

Server: it does not exist / it may exist

In essence, a Bloom filter is an ingenious probabilistic data structure characterized by efficient insertion and query. When we want to check whether a key exists in a collection, a Bloom filter quickly tells us that "this key definitely does not exist, or it may exist." Compared with traditional data structures such as List, Set and Map, it is more efficient and takes up less space, but the results it returns are probabilistic rather than exact.

A Bloom filter is only used to test membership in a set. A classic use is to avoid expensive disk (or network) lookups for keys that do not exist. A Bloom filter looks up a key in O(k) time, where k is the number of hash functions. This does not depend on the number of stored elements, so testing for non-existent keys is very fast.

2. Application scenarios

2.1 Cache penetration

To improve access efficiency, we put some data in the Redis cache. A query can then fetch the data from the cache without reading the database, which effectively improves performance.

When querying data, we first check whether the data is in the cache and, if so, return it directly from the cache.

If it is not, we have to fetch the data from the database and put it in the cache. If a large number of requests miss the cache, the database comes under heavy pressure and may crash. With a Bloom filter, a request for a key that does not exist can be rejected immediately, which protects both the cache and the database.
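The lookup order described above can be sketched as follows. This is a minimal illustration, not how Redis implements it: the `BloomFilter` class, the `database` and `cache` dictionaries and the `get` function are all hypothetical stand-ins, and `hashlib.blake2b` is used only to keep the demo deterministic and dependency-free.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter (one byte per bit, for readability)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, key):
        # Salting one hash with the index i stands in for k hash functions.
        for i in range(self.k):
            digest = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


database = {"user:1": "alice", "user:2": "bob"}  # hypothetical backing store
cache = {}                                       # hypothetical cache layer
stats = {"db_reads": 0}

bloom = BloomFilter(m=1024, k=3)
for existing_key in database:
    bloom.add(existing_key)


def get(key):
    if not bloom.might_contain(key):
        return None                  # definitely absent: skip cache and database
    if key in cache:
        return cache[key]            # cache hit
    stats["db_reads"] += 1           # cache miss (or a rare false positive)
    value = database.get(key)
    if value is not None:
        cache[key] = value
    return value


print(get("user:1"))    # first call reads the database and fills the cache
print(get("user:1"))    # second call is served from the cache
print(get("user:999"))  # unknown key never yields a value
print(stats)
```

The filter has to be kept in sync with the database: every key written to the database must also be added to the filter, otherwise existing keys would be rejected.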

2.2 Determining whether an item exists in a large data set

HBase stores huge amounts of data. To determine whether a row key or a column exists, you can use a Bloom filter to find out quickly. There is a certain false positive rate, but when the filter says a key does not exist, that answer is always accurate.

3. The problem with HashMap

In fact, a HashMap is a very efficient way to determine whether an element exists: by storing the value as a key of the HashMap, a lookup takes O(1) constant time.

However, if the amount of data is very large (for example, hundreds of millions of items), a HashMap consumes a very large amount of memory, and it may be impossible to load all of the data into memory at once.
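A rough back-of-the-envelope comparison makes the memory gap concrete. The figures below are assumptions chosen for illustration (100 million keys, 32 bytes per key, a 1% false alarm rate), the HashMap's per-entry overhead is ignored, and the filter size uses the standard Bloom filter sizing formula m = -n ln p / (ln 2)^2.

```python
import math

n = 100_000_000      # assumed number of keys
avg_key_bytes = 32   # assumed average key size, ignoring per-entry overhead
p = 0.01             # assumed acceptable false alarm rate

# Storing the keys themselves, as an exact structure would have to.
raw_bytes = n * avg_key_bytes

# Bloom filter size in bits from the standard sizing formula.
bloom_bits = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
bloom_bytes = bloom_bits / 8

print(f"keys alone:       {raw_bytes / 2**30:.2f} GiB")
print(f"Bloom filter:     {bloom_bytes / 2**20:.1f} MiB")
print(f"bits per element: {bloom_bits / n:.2f}")
```

The filter's size depends only on n and p, not on how long the keys are, which is why it stays small even when the keys are large.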

4. Understanding the Bloom filter

Working schematic diagram:

A Bloom filter is a bit array (a binary vector) of length m.

Each element of this array stores either 0 or 1.

For each key, k mutually independent hash functions are computed. The result of each hash function is taken modulo the array length m, and the bit at that index is set to 1 (a blue cell).

Setting the cells in this way for every key produces the "Bloom filter".
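The insertion step can be sketched in a few lines. This is an illustration under assumptions: the `positions` helper is hypothetical, the sizes m = 16 and k = 3 are arbitrary, and salting a single `hashlib.blake2b` hash with an index stands in for k independent hash functions.

```python
import hashlib


def positions(key: str, k: int, m: int) -> list[int]:
    """Return the k array indexes for key: each hash result taken modulo m."""
    result = []
    for i in range(k):
        # Salting one hash with the index i stands in for k independent functions.
        digest = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
        result.append(int.from_bytes(digest, "big") % m)
    return result


m, k = 16, 3        # arbitrary small sizes so the array is easy to read
bits = [0] * m      # every cell starts at 0

for key in ["apple", "banana"]:
    for pos in positions(key, k, m):
        bits[pos] = 1   # the "blue cell" in the diagram

print(bits)
```

Two keys can land on the same cell, so the number of 1 bits is at most k times the number of keys, and usually a little less.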

5. Querying elements with a Bloom filter

Suppose we are given a key. We hash it with the same k hash functions and get k positions.

Check whether all k positions are blue. If even one is not blue, the key definitely does not exist.

If they are all blue, the key may exist (the Bloom filter can give a false positive).

This is because, when many objects are inserted into a relatively small array, most positions end up blue. A key that was never inserted can then find all k of its positions already set to blue by other keys, and the filter mistakenly reports that the key is in the collection.
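A small experiment shows both properties: inserted keys are always found, while keys that were never inserted are sometimes reported as present. The class below is a hypothetical sketch, and the filter is deliberately undersized (2048 bits for 1000 keys) so that false positives are easy to observe.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter with add and query operations."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)   # one byte per bit, for readability

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # A single 0 bit proves the key was never added.
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter(m=2048, k=3)      # deliberately too small for 1000 keys
inserted = [f"key-{i}" for i in range(1000)]
for key in inserted:
    bf.add(key)

# No false negatives: every inserted key is reported as possibly present.
found = sum(bf.might_contain(key) for key in inserted)

# False positives: some keys that were never inserted also pass the check.
unseen = [f"other-{i}" for i in range(1000)]
false_positives = sum(bf.might_contain(key) for key in unseen)

print(f"inserted keys found: {found} / {len(inserted)}")
print(f"false positives:     {false_positives} / {len(unseen)}")
```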

Example:

6. Can elements be deleted?

Traditional Bloom filters do not support deletion, because clearing a bit could also erase other keys that share it. A variant called the Counting Bloom filter replaces each bit with a small counter. It supports deleting elements and can also be used to test whether an element's count is definitely below a certain threshold. For details, see the article on the principle and implementation of the Counting Bloom Filter, which covers it thoroughly.
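The counting variant can be sketched by swapping each bit for an integer counter. This is a simplified illustration, not the implementation from the referenced article: the class and method names are hypothetical, and counters here are unbounded Python integers rather than the small fixed-width counters a real implementation would use.

```python
import hashlib


class CountingBloomFilter:
    """Illustrative Counting Bloom filter: counters instead of bits."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, key):
        return [
            int.from_bytes(
                hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest(), "big"
            ) % self.m
            for i in range(self.k)
        ]

    def add(self, key):
        for pos in self._positions(key):
            self.counters[pos] += 1

    def might_contain(self, key):
        return all(self.counters[pos] > 0 for pos in self._positions(key))

    def remove(self, key):
        # Only remove keys the filter believes are present. Removing a key
        # that was never added (a false positive) would corrupt the counters.
        if not self.might_contain(key):
            return False
        for pos in self._positions(key):
            self.counters[pos] -= 1
        return True


cbf = CountingBloomFilter(m=1024, k=3)
cbf.add("alice")
cbf.add("bob")
cbf.remove("alice")

print(cbf.might_contain("bob"))   # bob's counters are untouched by the removal
cbf.remove("bob")
print(sum(cbf.counters))          # every increment has been undone
```

The price of deletion is memory: each position needs several bits for its counter instead of one.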

7. How to choose the number of hash functions and the length of the Bloom filter

Obviously, if the Bloom filter is too small, all bits will soon be 1, and every query will return "may exist", which defeats the purpose of filtering. The length of the Bloom filter directly affects the false alarm rate: the longer the filter, the lower the false alarm rate.

The number of hash functions is also a trade-off. The more hash functions there are, the faster bits are set to 1 and the slower each operation becomes. With too few, the false alarm rate is higher.

As can be seen from the above figure, increasing the number of hash functions k greatly reduces the error rate p at first. Beyond an optimal k, adding more hash functions raises it again.
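The effect of k can be checked numerically with the standard approximation p ≈ (1 - e^(-kn/m))^k. The sketch below assumes a filter with 10 bits per element (m = 10,000 and n = 1,000 are arbitrary illustrative values) and the helper name `false_alarm_rate` is hypothetical.

```python
import math


def false_alarm_rate(m: int, n: int, k: int) -> float:
    """Standard approximation of the false alarm rate of a Bloom filter."""
    return (1 - math.exp(-k * n / m)) ** k


m, n = 10_000, 1_000   # assumed sizes: 10 bits per element

for k in range(1, 13):
    print(f"k={k:2d}  p={false_alarm_rate(m, n, k):.5f}")

# The k with the lowest rate sits close to (m / n) * ln 2.
best_k = min(range(1, 13), key=lambda k: false_alarm_rate(m, n, k))
print("best k:", best_k)
```

The rate drops quickly for the first few hash functions, flattens near the optimum, and then climbs again as the array fills up.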

Given the size of the filter m, the number of hash functions k and the number of inserted elements n, the false alarm rate p can be estimated as:

p ≈ (1 - e^(-kn/m))^k

This may look intimidating, but don't worry: what we actually need to determine are m and k. So how do we choose the k and m values suitable for the business? If we set the acceptable false alarm rate p and the expected number of elements n ourselves, we can calculate these parameters as follows.

Formula:

m = -(n * ln p) / (ln 2)^2

k = (m / n) * ln 2

Here k is the number of hash functions, m is the length of the Bloom filter in bits, n is the number of elements inserted, and p is the false alarm rate.
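As a worked example, the standard sizing formulas (m = -n ln p / (ln 2)^2 and k = (m/n) ln 2) can be wrapped in small helpers. The function names and the sample values n = 1,000,000 and p = 1% are assumptions for illustration. Rounding k to an integer means the achieved rate lands near, not exactly on, the target.

```python
import math


def optimal_m(n: int, p: float) -> int:
    """Filter length in bits for n elements at false alarm rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def optimal_k(m: int, n: int) -> int:
    """Number of hash functions that minimizes the false alarm rate."""
    return max(1, round(m / n * math.log(2)))


def false_alarm_rate(m: int, n: int, k: int) -> float:
    """Estimated false alarm rate after inserting n elements."""
    return (1 - math.exp(-k * n / m)) ** k


n, p = 1_000_000, 0.01   # assumed capacity and target false alarm rate
m = optimal_m(n, p)
k = optimal_k(m, n)

print(f"m = {m} bits ({m / 8 / 2**20:.2f} MiB)")
print(f"k = {k}")
print(f"estimated p = {false_alarm_rate(m, n, k):.4f}")
```

In Redis, the RedisBloom module's `BF.RESERVE key error_rate capacity` command takes p and n directly and sizes the filter itself; its internal sizing may differ in detail from this sketch.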

As for how this formula is derived, I have covered it in an article published on Zhihu. If you are interested, you can take a look at it. If you are not interested, just remember the above formula.

One more important point. Since the whole purpose of a Bloom filter is fast lookup, we cannot use slow hash functions. Cryptographic hash functions such as SHA-1 and MD5 are not a good choice for Bloom filters because they are relatively slow. Better choices among the faster hash functions are MurmurHash, the FNV family, Jenkins hashes, and HashMix.
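To give a feel for how simple a fast non-cryptographic hash can be, here is a sketch of 32-bit FNV-1a in pure Python, combined with double hashing to derive k positions from two hash values. The double-hashing step and the way the second hash is salted are illustrative choices, not something this article prescribes, and a production filter would use a native implementation rather than a Python loop.

```python
FNV_OFFSET_32 = 0x811C9DC5   # FNV-1a 32-bit offset basis
FNV_PRIME_32 = 0x01000193    # FNV 32-bit prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV_OFFSET_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h


def positions(key: str, k: int, m: int) -> list[int]:
    """Derive k array indexes from two hashes: (h1 + i * h2) mod m."""
    data = key.encode()
    h1 = fnv1a_32(data)
    # Second hash: the same function over salted input (an illustrative choice),
    # forced odd so that the step between positions is never zero.
    h2 = fnv1a_32(data + b"\x01") | 1
    return [(h1 + i * h2) % m for i in range(k)]


print(hex(fnv1a_32(b"a")))
print(positions("apple", k=4, m=1000))
```

Double hashing computes only two hashes per key regardless of k, which keeps lookups cheap even when k is large.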

More application scenarios

As you have seen in the given example, we can use it to warn users who enter a weak password.

You can use Bloom filters to prevent users from visiting malicious websites.

You can first use a Bloom filter as a cheap lookup, without querying the SQL database, to see whether a user with a specific e-mail address exists. If the e-mail address doesn't exist, great! If it may exist, an additional query to the database is needed to confirm. You can do the same for a "user name is already taken" check.

You can keep a Bloom filter based on the IP addresses of your visitors to check whether the users of your site are "returning users" or "new users". A few false positives for "returning users" won't hurt you, right?

You can also check spelling by tracking dictionary words using Bloom filters.
