
How to Implement a Bloom Filter in Java

2025-04-04 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article walks through how to implement a Bloom filter in Java. The content is detailed and the logic is clear. Many readers are probably not very familiar with the topic, so it is shared here for reference. Let's take a look.

BitMap

Modern computers use the binary digit (bit) as the basic unit of information, and one byte equals 8 bits. For example, the string big consists of 3 bytes, but it is stored in the computer in binary. The ASCII codes of b, i and g are 98, 105 and 103, which in binary are 01100010, 01101001 and 01100111 respectively.

Many programming languages provide bit-manipulation operations, and using bits sensibly can improve memory utilization and development efficiency.

The basic idea of a bitmap is to use a single bit to mark the value associated with an element, with the element itself acting as the key. Because data is stored at the granularity of bits, a great deal of storage space can be saved.

In Java, an int occupies 4 bytes, that is, 32 bits. If each of those 32 bits is used to represent one number, a single int can represent 32 numbers, so the space needed is reduced by a factor of 32.

1 Byte = 8 Bit, 1 KB = 1024 Byte, 1 MB = 1024 KB, 1 GB = 1024 MB

Suppose a website has 100 million users and 50 million distinct users visit every day. Compare storing each day's active users in a collection type and in a BitMap:

1. If the user id is an int (4 bytes, 32 bits), the collection type occupies 50,000,000 * 4 / 1024 / 1024 ≈ 191 MB, or roughly 200 MB.

2. If the ids are stored as bits, 50 million numbers take 50 million bits, and the space occupied is 50,000,000 / 8 / 1024 / 1024 ≈ 6 MB.
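The two estimates above can be reproduced with a few lines of Java. This is only a sketch of the arithmetic: the class and method names are made up for illustration, and it counts raw payload bytes only, ignoring any per-object overhead a real collection would add.

```java
class ActiveUserMemoryEstimate {
    static final double BYTES_PER_MB = 1024.0 * 1024.0;

    // Space needed when every id is stored as a 4-byte int.
    static double intStorageMb(long count) {
        return count * 4 / BYTES_PER_MB;
    }

    // Space needed when every id is represented by a single bit.
    static double bitmapStorageMb(long bitCount) {
        return bitCount / 8.0 / BYTES_PER_MB;
    }

    public static void main(String[] args) {
        long activeUsers = 50_000_000L;
        System.out.printf("int storage: %.2f MB%n", intStorageMb(activeUsers));
        System.out.printf("bitmap:      %.2f MB%n", bitmapStorageMb(activeUsers));
    }
}
```

The exact results are slightly below the rounded figures in the text because 1 MB is taken as 1024 * 1024 bytes.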

So how do you represent a number in BitMap?

As mentioned above, a bit marks the value associated with an element, and the element is the key. We can think of a BitMap as an array of bits in which each cell stores only 0 or 1 (0 means the number is absent, 1 means it is present). The array index is called the offset in a BitMap. For example, to represent the four numbers {1, 3, 5, 7}, the bits at offsets 1, 3, 5 and 7 are set to 1 and all other bits stay 0.

What if the number 65 also has to be stored? You only need to allocate an int[N / 32 + 1] array to hold the data (where N is the maximum value in the data set), that is:

int[0]: can represent 0~31

int[1]: can represent 32~63

int[2]: can represent 64~95

To determine whether an arbitrary integer M is in the list, M / 32 gives the array index and M % 32 gives the bit position within that element, for example:

65 / 32 = 2 and 65 % 32 = 1, so 65 is stored at bit 1 of int[2].
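The index and bit-position arithmetic above can be sketched as a small Java class. This is a minimal illustration rather than a production bitmap: the class name IntArrayBitMap is hypothetical, and it assumes all values are non-negative and no larger than the maximum passed to the constructor.

```java
class IntArrayBitMap {
    private final int[] words;

    // maxValue is the largest number the bitmap must hold (N in the text).
    IntArrayBitMap(int maxValue) {
        this.words = new int[maxValue / 32 + 1];
    }

    // Set the bit for m: m / 32 selects the int, m % 32 selects the bit inside it.
    void add(int m) {
        words[m / 32] |= 1 << (m % 32);
    }

    // Clear the bit for m.
    void remove(int m) {
        words[m / 32] &= ~(1 << (m % 32));
    }

    // A number is present when its bit is 1.
    boolean contains(int m) {
        return (words[m / 32] & (1 << (m % 32))) != 0;
    }

    public static void main(String[] args) {
        IntArrayBitMap bitMap = new IntArrayBitMap(65);
        for (int n : new int[] {1, 3, 5, 7, 65}) {
            bitMap.add(n);
        }
        System.out.println(bitMap.contains(65)); // true
        System.out.println(bitMap.contains(64)); // false
    }
}
```

For real code, java.util.BitSet in the JDK already provides this behavior with automatic resizing.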

Bloom filter

A Bloom filter is essentially a data structure, a rather ingenious probabilistic one, characterized by efficient insertion and lookup. It can tell you that "something definitely does not exist or may exist."

Compared with traditional data structures such as List, Set and Map, it is more efficient and takes up less space. The drawback is that its results are probabilistic rather than exact.

Bloom filters are widely used in web blacklist systems, spam filtering systems, crawler URL deduplication systems and so on. Google's well-known distributed database Bigtable uses Bloom filters to detect rows or columns that do not exist, which reduces the number of disk IO lookups. The Google Chrome browser uses a Bloom filter to speed up its Safe Browsing service.

Bloom filters are also used in many key-value systems to speed up queries, such as HBase, Accumulo and LevelDB. Generally the values are stored on disk, and disk access is slow. A Bloom filter can quickly determine whether the value for a given key might exist, so many unnecessary disk IO operations can be avoided.

A hash function maps an element to a position in a bit array. To check whether the element is in the set, we only need to see whether that position is 1. This is the basic idea of the Bloom filter.

Application scenarios

1. Given 1 billion natural numbers in random order that need to be sorted on a 32-bit machine with a 2 GB memory limit, how can this be done?

2. Given a blacklist of 100 million URLs (64 bytes per URL on average), how do you quickly determine whether a URL is in the blacklist?

3. Analyzing users' login behavior to determine how active they are.

4. Web crawlers: how do you tell whether a URL has already been crawled?

5. Quickly looking up user attributes (blacklist, whitelist and so on).

6. If the data is stored on disk, how do you avoid a large number of useless IO operations?

7. Determining whether an element exists among hundreds of millions of records.

8. Cache penetration.

Shortcomings of traditional data structures

Generally speaking, it is enough to store the web page URLs in a database for lookup, or to build a hash table for lookup.

When the amount of data is small, this reasoning is correct. Values can be mapped to HashMap keys and results are returned in O(1) time, which is extremely efficient. However, the HashMap implementation has drawbacks, such as high storage overhead. Because of the load factor, the allocated space usually cannot be fully used. For example, how much space does a HashMap with 10 million entries occupy, with Key=String (no more than 16 characters long, with very few duplicates) and Value=Integer? About 1.2 GB.

By comparison, 10 million primitive int values need only about 40 MB (10,000,000 * 4 / 1024 / 1024 ≈ 38 MB), roughly 3% of that figure, and 10 million Integer objects need about 161 MB, roughly 13.3%.

It can be seen that once there are a great many values, for example hundreds of millions, the memory occupied by a HashMap becomes enormous.

Moreover, if the web page blacklist system contains 10 billion URLs, searching the database is very time-consuming, and if each URL takes 64 B, holding them all in memory requires 640 GB, which an ordinary server can hardly provide.

Implementation principle

Suppose we have a set A. Using k hash functions, each element of A is mapped to positions in a bit array B of length a, and the bits at those positions are set to 1. If, for an element being checked, all k positions produced by the k hash functions hold 1, the element probably belongs to set A. Otherwise, it definitely does not belong to set A.

For example, we have three URLs {URL1, URL2, URL3} and map them through a hash function into a bit array of length 16 in which every bit is initially 0.

If the current hash function is Hash2(), each URL is mapped into the array by hashing it. Suppose Hash2(URL1) = 3, Hash2(URL2) = 6 and Hash2(URL3) = 6. The bits at positions 3 and 6 are then set to 1.

Therefore, to determine whether URL1 is in the set, we compute its index with the hash function and check that bit. A value of 1 means it exists.

Hash functions have collisions: URL2 and URL3 above land on the same position. Assume the hash function is good. If the array has m positions and we want to reduce the collision rate to, say, 1%, the hash table can hold only m/100 elements. Obviously the space utilization becomes low, so this approach is not space-efficient.

The solution is also simple: use multiple hash algorithms. If any one of them says the element is not in the set, it is definitely not there, as follows:

Hash2(URL1) = 3, Hash3(URL1) = 5, Hash4(URL1) = 6

Hash2(URL2) = 5, Hash3(URL2) = 8, Hash4(URL2) = 14

Hash2(URL3) = 4, Hash3(URL3) = 7, Hash4(URL3) = 10

This is what a Bloom filter does: it uses k hash functions so that each string corresponds to k bits, which reduces the probability of collisions.
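The k-hash-function idea can be sketched in Java on top of java.util.BitSet. This is a minimal illustration, not the article's or any library's implementation. The class name SimpleBloomFilter is hypothetical, and deriving the k positions from two base hashes (String.hashCode and an FNV-1a hash) is an assumption made here to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits in the array
    private final int hashCount; // k, the number of hash functions

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Set the k bits that the element hashes to.
    void add(String value) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(value, i));
        }
    }

    // false: definitely not added. true: possibly added.
    boolean mightContain(String value) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    // i-th position, derived from two base hashes (an assumption of this sketch).
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = fnv1a(value);
        return Math.floorMod(h1 + i * h2, size);
    }

    // 32-bit FNV-1a hash over the UTF-8 bytes of the string.
    private static int fnv1a(String value) {
        int hash = 0x811C9DC5;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
        filter.add("URL1");
        filter.add("URL2");
        filter.add("URL3");
        System.out.println(filter.mightContain("URL1"));    // true: added elements are never reported absent
        System.out.println(filter.mightContain("URL1000")); // usually false, but a false positive is possible
    }
}
```

Note that the filter has no remove operation: clearing a bit could also erase evidence of another element that shares it.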

Misjudgment (false positives)

This approach also has a problem. As more and more values are added, more and more bits are set to 1. Even if a value has never been stored, the program will judge that it exists if all three bits returned by the hash functions have already been set to 1 by other values. For example, suppose URL1000 has never been stored, and hashing it gives the following bits:

Hash2(URL1000) = 7, Hash3(URL1000) = 8, Hash4(URL1000) = 14

These bits were already set to 1 when URL1, URL2 and URL3 were added, so the program will judge that URL1000 exists.

This is the misjudgment phenomenon of the Bloom filter: an element the filter reports as present may not actually exist, but an element it reports as absent definitely does not exist.

A Bloom filter represents a set and answers membership queries with an accuracy that depends on how the user designs it. Reaching 100% correctness is impossible. Its advantage is that high accuracy can be achieved with very little space.
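How the design determines accuracy can be made concrete with the standard Bloom filter sizing formulas: for n expected elements and a target false positive rate p, a bit array of about m = -n * ln(p) / (ln 2)^2 bits with k = (m / n) * ln 2 hash functions is commonly used, and the resulting rate is approximately (1 - e^(-k * n / m))^k. The sketch below evaluates these formulas. The class and method names are hypothetical, and it is not a claim about how any particular library computes its sizes.

```java
class BloomFilterSizing {
    // Bits needed for n elements at false positive rate p: m = -n * ln(p) / (ln 2)^2
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Number of hash functions for m bits and n elements: k = (m / n) * ln 2
    static int optimalHashCount(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Approximate false positive rate: (1 - e^(-k * n / m))^k
    static double expectedFalsePositiveRate(long n, long m, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long n = 100_000L; // expected number of elements
        double p = 0.03;   // target false positive rate
        long m = optimalBits(n, p);
        int k = optimalHashCount(n, m);
        System.out.println("bits: " + m + ", hash functions: " + k);
        System.out.println("estimated rate: " + expectedFalsePositiveRate(n, m, k));
    }
}
```

For 100,000 elements at a 3% rate this comes to a little over 7 bits per element, which shows why a Bloom filter is so much smaller than storing the elements themselves.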

Implementation: Redis bitmap

This approach builds on the commands of the Redis bitmap data structure, such as SETBIT and GETBIT.

RedisBloom

A Bloom filter can be implemented with the bitmap operations in Redis. It was not until Redis 4.0 introduced modules (plug-ins) that an official Bloom filter became available. The Bloom filter is loaded into the Redis server as a module, and the official website recommends RedisBloom as the Bloom filter module for Redis.

Guava BloomFilter

When the Guava project released version 11.0, one of the newly added features was the BloomFilter class.

Redisson

Under the hood, Redisson implements its Bloom filter on top of Redis bitmaps.

import org.redisson.Redisson;
import org.redisson.api.RBloomFilter;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

// The wrapper class name is illustrative; the original listing shows only the main method.
public class RedissonBloomFilterDemo {
    public static void main(String[] args) {
        Config config = new Config();
        // Stand-alone (single server) environment
        config.useSingleServer().setAddress("redis://192.168.153.128:6379");
        // Construct the Redisson client
        RedissonClient redisson = Redisson.create(config);
        RBloomFilter<String> bloomFilter = redisson.getBloomFilter("nameList");
        // Initialize the Bloom filter: 100000 expected elements and a 3% error rate.
        // The size of the underlying bit array is calculated from these two parameters.
        bloomFilter.tryInit(100000L, 0.03);
        // Insert 10086 into the Bloom filter
        bloomFilter.add("10086");
        // Check whether the following numbers are in the Bloom filter
        System.out.println(bloomFilter.contains("10086")); // true
        System.out.println(bloomFilter.contains("10010")); // false
        System.out.println(bloomFilter.contains("10000")); // false
        // Release the client's connections when done
        redisson.shutdown();
    }
}

Resolving cache penetration

Cache penetration means querying data that does not exist at all, so neither the cache layer nor the storage layer gets a hit. Since the data cannot be found in the storage layer, nothing is written back to the cache layer.

As a result, every query for non-existent data reaches the storage layer, and the cache loses its purpose of protecting the back-end storage. Cache penetration can increase the load on the back-end storage, and because many back-end stores cannot handle high concurrency, it may even bring them down.

We can use a Bloom filter to solve this problem. The existing keys are saved in a Bloom filter in advance, and it acts as a first layer of interception before the cache layer and the storage layer are accessed.

For example, a recommendation system has 400 million user ids. Every hour the algorithm engineers compute recommendation data from each user's historical behavior and put it into the storage layer. The newest users have no historical behavior, so requests for them cause cache penetration. To prevent this, all users that have recommendation data can be put into a Bloom filter. If the Bloom filter reports that a user id does not exist, the storage layer is not accessed, which protects the storage layer to some extent.

Note: a Bloom filter may misjudge and let a few requests through, but this does not affect the overall result, so this scheme is currently a very effective way to handle this kind of problem.
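The interception order described above (Bloom filter first, then the cache, then storage) can be sketched as follows. Everything here is a stand-in chosen for illustration: the class name CachePenetrationGuard is hypothetical, the two HashMaps play the role of the cache layer and the storage layer, and the small BitSet-based filter with its simple hash derivation is an assumption of this sketch rather than a recommended design.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class CachePenetrationGuard {
    private static final int BITS = 1 << 16;
    private static final int HASHES = 3;

    private final BitSet knownKeys = new BitSet(BITS);          // Bloom filter bits
    private final Map<String, String> cache = new HashMap<>();   // stand-in for the cache layer
    private final Map<String, String> storage = new HashMap<>(); // stand-in for the storage layer
    private int storageReads = 0;

    // Writing data also registers the key in the Bloom filter.
    void save(String key, String value) {
        storage.put(key, value);
        for (int i = 0; i < HASHES; i++) {
            knownKeys.set(index(key, i));
        }
    }

    Optional<String> get(String key) {
        // First layer: a key the filter has never seen cannot exist, so stop here.
        if (!mightExist(key)) {
            return Optional.empty();
        }
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached);
        }
        // Only keys that pass the filter (including rare false positives) reach storage.
        storageReads++;
        String stored = storage.get(key);
        if (stored != null) {
            cache.put(key, stored);
        }
        return Optional.ofNullable(stored);
    }

    int storageReads() {
        return storageReads;
    }

    private boolean mightExist(String key) {
        for (int i = 0; i < HASHES; i++) {
            if (!knownKeys.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Simple position derivation for the sketch: base hash plus i times an odd step.
    private static int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, BITS);
    }
}
```

A false positive only costs one extra storage read that finds nothing, which is why the occasional misjudgment does not affect the overall result.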

That concludes this article on implementing a Bloom filter in Java. Thank you for reading!
