Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

What is the design principle of RoaringBitmaps?

2025-01-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Share

Shulou(Shulou.com)05/31 Report--

RoaringBitmaps design principle is what, I believe many inexperienced people are helpless, for this reason this article summarizes the causes and solutions of the problem, through this article I hope you can solve this problem.

Just contact programming that will remember to use Bitmap 0 and 1 bits to identify whether the data exists, mainly for sorting;

Later, we found that Bitmap can be used to judge the existence of an object in a large object set.

Later, I learned that Bloom used Hash algorithm combined with Bitmap to implement BoomFilter, which was used in mass data processing scenarios. I think Bloom filter is invincible in data filtering.

Later, someone asked me, although Bloom filter solved the problem of data filtering, but it does not support data modification and deletion, in addition, if the data is sparse, only the lowest bit is 1, others are 0, or will cause resource waste. What should I do?

The following is a brief introduction to RoaringBitmaps in combination with your own understanding.

RoaringBitmaps

RoaringBitmaps is referred to as RBM, which mainly divides 32-bit unsigned integers into buckets according to the highest 16 bits, that is, there may be up to 2^16=65536 buckets, which is called container in the paper. When storing data, find the container according to the high 16 bits of the data, if not found, create a new one, and then put the low 16 bits into the container. In other words, an RBM is a collection of containers.

design idea

The main idea of RBM is not complicated. In summary, there are the following points:

Divide the 32-bit range ([0, n)) into 2^16 buckets, and each bucket has a Container to store the lower 16 bits of a value;

When storing and querying values, we divide a value k into high 16 bits (k % 2^16) and low 16 bits (k mod 2^16), take the high 16 bits to find the corresponding bucket, and then store it in the corresponding Container in the low 16 bits;

RBM uses three container structures: Array Container, Bitmap Container and RunContainer. Array containers store sparse data, Bitmap containers store dense data. That is, if the number of integers in a Container is less than 4096, an ordered array of Short type is used to store the values. If it is greater than 4096, Bitmap is used to store the value;RunContainer is used to store continuous data. For example, a sequence of consecutive integers 11, 12, 13, 14, 15, 27, 28, 29 would be compressed by RLE into two binary pairs 11, 4, 27, 2, meaning that 11 is followed by four successively increasing values and 27 is followed by two successively increasing values.

exemplified

As shown in the figure above, it is an example given by the official website. The three containers represent three data sets respectively:

The first 1000 stores multiples of 62, the second stores values from 1 to 99, and the third stores all even numbers.

The above example only illustrates the flexibility of RBM to classify data. You don't have to take other representations seriously. In addition, you may look at the high 16 bits stored in the array and the low 16 bits stored in the Container, and suddenly look at it and compare it with a daze. Let me give you an example.

821697800 corresponds to a hexadecimal number 30FA 1D08, of which the upper 16 bits are 30FA and the lower 16 bits are 1D08.

First, use binary search to find the container with the value of 30FA from the first-level index (i.e. Container Array)(if the container does not exist, then create a new one). From the figure, we can see that the container is a Bitmap container.

After finding the corresponding container, look at the lower 16-bit value 1D08, which is equivalent to 7432, so find the corresponding position in Bitmap and set it to 1.

Seeing this, you may still wonder if the number of Integer in a Container is less than 4096, so use an ordered array of Short type to store values. If it is greater than 4096, Bitmap is used to store the value. What was the intention?

The reason why the threshold of 4096 is used here is probably because the lower 16 bits of int are 2 bits, which is 2 Byte * 4096 = 8KB in Arrary Container; for Bitmap Container, 2^16 bits are equivalent to 8KB. It's basically a complementary, opposite division point.

Those who have used jdk1.8 HashMap know that in order to further optimize HashMap, a red-black tree has been introduced. When the list length is too long (default exceeds 8), the list is converted to a red-black tree. We can use the red-black tree to quickly add and delete special points to improve the performance of HashMap. When the number of red-black tree nodes is less than 8, the red-black tree will be converted into a chain list. Because red-black trees are balanced on smaller data volumes, the performance advantage is not obvious compared to linked lists. wcc, if not entangled in details, RPM implementation is actually so similar to HashMap design ideas?

Operation conversion process creation

When creating a new container, RBM will be stored in ArrayContainer by default if only one element is inserted. If you insert an element sequence, you will first calculate the space occupied by ArrayContainer and RunContainer according to the above method, and select the smaller one for storage.

find

When searching for an element, the binary algorithm first searches for the ArrayContainer with the highest 16 bits. If it exists and the data amount is lower than 4096, the binary algorithm continues to search for specific fixed data, otherwise, the bit operation is used. In terms of time complexity, BitmapContainer only involves bit operations, which is obviously O(1). ArrayContainer and RunContainer both require binary search to locate elements in an ordered array, so O(logN).

conversion

If you read the article carefully, you may still have questions. At the beginning, only one element was inserted, which must be ArrayContainer. As the amount of data increased later, how did BitMapContainer appear later? That's because the algorithm itself supports transformations, as explained below.

When the ArrayContainer capacity exceeds 4096, it will automatically be converted to BitmapContainer storage. 4096 is a threshold below which ArrayContainer saves space and above which BitmapContainer saves space. That is to say, ArrayContainer stores sparse data, BitmapContainer stores dense data, which can minimize memory waste.

RBM can compare the memory footprint of ArrayContainer/BitmapContainer with its equivalent RunContainer by calling a specific API called optimize, and convert once RunContainer has a smaller footprint. That is, the second ArrayContainer in the example above can be transformed into a RunContainer with only one binary 0, 100, taking up a further reduction to 10200 bytes.

Of course, it also involves comparison and merging operations between multiple containers, such as: when two RBMs are set operations, the processing methods of bit operations between different types of containers, such as ArrayContainer AND BitmapContainer, BitmapContainer OR RunContainer, etc.; support for 64-bit integers when 32 bits are sometimes insufficient; ability to write RBM data out of the heap, i.e. memory mapping; support for Kryo serialization, etc. This involves a lot of displacement operations, a complex batch, I do not understand.

After reading the above content, do you know what the design principle of RoaringBitmaps is? If you still want to learn more skills or want to know more related content, welcome to pay attention to the industry information channel, thank you for reading!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Servers

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report