Redis accurate de-repetition counting method (growling bitmap) 07/12 Update SLTechnology News&Howtos

Redis accurate de-repetition counting method (growling bitmap)

2025-07-12 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)06/01 Report--

Preface

If you want to count the number of readings of an article, you can directly use the incr instruction of Redis to do it. If you require that the number of readings must be re-read by the user, you can use set to record the id of all users who have read this article, and the length of the set collection is the number of de-rereads. But if the popular style articles are read too much, set will waste too much storage space. At this time, we will use the HyperLogLog data structure provided by Redis instead of set, which will only take up a maximum of 12k of storage space to complete massive deduplication statistics. But it sacrifices accuracy, it is a fuzzy count, and the error rate is about 0.81%.

So is there an accurate counting method that doesn't waste much space? The first thing that comes to mind is a bitmap, which can be used to represent a user id. If a user's id is 32 bytes, then using a bitmap takes up only 1Tap 256 space to complete an accurate count. But how do you map the user id to the location of the map? This is easy if the user id is a continuous integer, but usually the user id of the user system is not an integer, but a string or a large integer with a certain degree of randomness.

We can forcibly assign a sequence of integers to each user id, and then store the correspondence between the user id and the integer in redis.

$next_user_id = incr user_id_seqset user_id_xxx $next_user_id$next_user_id = incr user_id_seqset user_id_yyy $next_user_id$next_user_id = incr user_id_seqset user_id_zzz $next_user_id

Here you may ask, you say it is to save space, so it is not a waste of space to store the mapping between user id and integers? This question is very good, but at the same time, we should also see that the mapping relationship can be reused. It can count the number of articles read, the number of daily active users and monthly active users who sign in. It can also be used in many other statistical situations where users need to be duplicated. This is the meaning of the so-called "success lies in the present age, and benefit lies in a thousand years".

With this mapping relationship, we can easily construct a reading management bitmap for each article, and for a user, the corresponding position in the corresponding bitmap will be one. If the bit changes from 0 to 1, then you can add 1 to the number of readings. In this way, the reading number of the article can be easily obtained.

And we can dynamically calculate the number of public users who have read the two articles. Do an AND calculation of the two bitmaps, and then count the number of bits 1 in the bitmap. Similarly, there are OR calculations, XOR calculations, and so on that are all feasible.

Here we go again! The bitmap of Redis is a dense bitmap. What does it mean? If there is a very large bitmap, only the last bit is 1 and the others are zero, the bitmap will still take up all the memory space, which is not a general waste. You can imagine that most articles don't read much, but they take up a very similar amount of space, similar to the memory occupied by popular articles.

It seems that this plan will not work, we need to think of other plans! Then the roaring bitmap (RoaringBitmap) comes.

It divides the whole large bitmap into blocks, and if the whole block is zero, then the whole block does not need to be saved. But if bit 1 is scattered and there are 1 in each block, although there are few 1 in a single block, it is not enough to divide into blocks, what should we do? Let's think again, for a single block, can we continue to optimize? If there is only a small number of parts in a single block, we can only store all the intra-block offsets (integers) of bit 1, that is, a list of integers, then the storage within the block can also be reduced. This is the sparse storage form of a single block bitmap-a list of storage offset integers. Sparse storage is converted to dense storage at one time only if bit 1 within a single block exceeds a threshold.

Roaring bitmap can not only greatly save space, but also reduce the computational efficiency of AND and OR alleles. In the past, you needed to calculate the entire bitmap, but now you only need to calculate some blocks. If the block is very sparse, then only the list of small integers need to be set AND, OR operations, such as the amount of computation can continue to reduce.

Here, neither space for time, nor time for space, but logical complexity for space and time at the same time.

The bit length of the roaring bitmap is up to 2 ^ 32, and the corresponding space is 512m (normal bitmap). The bit offset is divided into high 16 bits and low 16 bits, the high 16 bits represent the block offset, the low 16 bits represent the position within the block, and a single block can express the bit length of 64k, that is, 8K bytes. There will be a maximum of 64k blocks. The L1 cache of modern processors is generally larger than 8K, which ensures that all individual blocks can be put into the L1 Cache, which can significantly improve performance.

If all the bits in a single block are zero, then it does not need to be stored. Whether a block exists or not can also be expressed by a bitmap, when there are few blocks, it is expressed by a list of integers, and when there are more blocks, they can be converted into ordinary bitmaps. The integer list takes up less space, and it also has a dynamic expansion mechanism similar to ArrayList to avoid repeated expansion to copy array contents. When the number in the list exceeds 4096, it immediately changes to a normal bitmap.

The data structure used to express the existence of a block and the structure used to express the data of a single block can be the same, because the existence of a block is essentially 0 and 1, which is a common bit mark.

But Redis doesn't natively support roaring bitmaps as a data structure? How should we use it?

It's true that Redis doesn't have a native, but roaring bitmap Redis Module does.

Github.com/aviggiano/r...

The number of star in this project is not very large. Let's take a look at its official performance comparison.

OPTIME/OP (us) ST.DEV. (us) R.SETBIT31.8928.49SETBIT29.9829.23R.GETBIT29.9014.60GETBIT28.6314.58R.BITCOUNT32.130.10BITCOUNT192.380.96R.BITPOS70.270.14BITPOS87.700.62R.BITOP NOT156.663.15BITOP NOT364.465.62R.BITOP AND81.560.48BITOP AND492.978.32R.BITOP OR107.032.44BITOP OR461.688.42R.BITOP XOR69.072.82BITOP XOR440.757.90

Obviously what is compared here is a sparse bitmap, and only a sparse bitmap can present such a good-looking number. If it is a dense bitmap, the performance of the roaring bitmap is certainly slightly weaker than the normal bitmap, but it is usually not much weaker.

Let's take a look at the source code to see what its internal structure looks like.

/ / single block typedef struct roaring_array_s {int32_t size; int32_t allocation_size; void * * containers; / / points to an array of integers or a normal bitmap uint16_t * keys; uint8_t * typecodes; uint8_t flags;} roaring_array_t;// all blocks typedef struct roaring_bitmap_s {roaring_array_t high_low_container;} roaring_bitmap_t

It is obvious that the presence or absence of the block and the data within the block are expressed using the same data structure, which is roaring_bitmap_t. There are a variety of encoding forms in this structure, and the type is represented by the typecodes field.

# define BITSET_CONTAINER_TYPE_CODE 1#define ARRAY_CONTAINER_TYPE_CODE 2#define RUN_CONTAINER_TYPE_CODE 3#define SHARED_CONTAINER_TYPE_CODE 4

Looking at the type definition here, we find that it is not only in the form of normal bitmap and array list mentioned earlier, but also RUN and SHARED. The RUN form is a compressed form of a bitmap. For example, several consecutive bits 101102103104105106107108109 are represented as RUN, which means 101Power8 (1 followed by 8 self-increasing integers), so that it can be significantly compressed in space. Under normal circumstances, there are no blocks of type RUN inside the roaring bitmap. Only the optimized API that shows the roaring bitmap is converted to RUN format, and this API is roaring_bitmap_run_optimize.

The SHARED type is used to share blocks among multiple roaring bitmaps, and it also provides write copy capabilities. A new copy will be copied when the block is modified.

There are more details about the computational logic of the roaring bitmap, which we will continue to discuss later.

Summary

The above is the whole content of this article. I hope the content of this article has a certain reference and learning value for everyone's study or work. Thank you for your support.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.