What is the function of the internal data structure intset in Redis

2025-07-08 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains the role of intset, one of the internal data structures of Redis: how it is laid out in memory, how lookup and insertion work on it, and how Redis sets are built on top of it.

Brief introduction of intset data structure

As its name implies, intset is a set of integers. More precisely, it is an ordered set of integers, so a binary search can quickly determine whether an element belongs to it. Its memory layout is somewhat similar to ziplist: it occupies one contiguous block of memory and uses different encodings for large and small integers (by absolute value) to keep memory usage as low as possible.

The data structure of intset is defined as follows (from intset.h and intset.c):

    typedef struct intset {
        uint32_t encoding;
        uint32_t length;
        int8_t contents[];
    } intset;

    #define INTSET_ENC_INT16 (sizeof(int16_t))
    #define INTSET_ENC_INT32 (sizeof(int32_t))
    #define INTSET_ENC_INT64 (sizeof(int64_t))

Each field has the following meanings:

encoding: the data encoding, that is, how many bytes are used to store each element of the intset. It has three possible values: INTSET_ENC_INT16 means each element is stored in 2 bytes, INTSET_ENC_INT32 means 4 bytes, and INTSET_ENC_INT64 means 8 bytes. Integers stored in an intset can therefore occupy at most 64 bits.

length: the number of elements in the intset. The encoding and length fields together make up the header of the intset.

contents: a flexible array member, indicating that the data elements follow immediately after the header. The total size of this array in bytes is encoding * length. Flexible array members appear in the definitions of many Redis data structures (such as sds, quicklist and skiplist), where they serve to express an offset. The space for contents is not included in sizeof(intset); it must be allocated in addition to the header.
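The size rule above (header plus encoding * length bytes) can be sketched in a few lines of C. This is only an illustration under stated assumptions: the sketch_ names are hypothetical and are not part of intset.c, and the 8-byte header size assumes a typical platform with no padding between the two uint32_t fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the layout quoted above; the sketch_ prefix marks it as illustrative. */
typedef struct sketch_intset {
    uint32_t encoding;   /* bytes per element: 2, 4 or 8 */
    uint32_t length;     /* number of elements */
    int8_t contents[];   /* element bytes follow the header */
} sketch_intset;

/* Total bytes needed for one block holding the header plus `length` elements.
 * sizeof(sketch_intset) covers only the header, not the flexible array. */
size_t sketch_intset_bytes(uint32_t encoding, uint32_t length) {
    assert(encoding == 2 || encoding == 4 || encoding == 8);
    return sizeof(sketch_intset) + (size_t)encoding * length;
}
```

For example, an empty set needs only the header, while five elements stored with a 4-byte encoding need the header plus 20 bytes.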

It is important to note that intset may change its data encoding as data is added:

A newly created intset starts with INTSET_ENC_INT16 (value 2), the encoding that uses the least memory.

Each time a new element is added, the encoding is upgraded if the element does not fit in the current one.
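How the required encoding follows from the size of a value can be sketched as below. The function name and the SKETCH_ constants are hypothetical; the article later refers to the real helper as _intsetValueEncoding, and this sketch only assumes that it picks the narrowest signed width that can hold the value.

```c
#include <stdint.h>

#define SKETCH_ENC_INT16 ((uint8_t)sizeof(int16_t))
#define SKETCH_ENC_INT32 ((uint8_t)sizeof(int32_t))
#define SKETCH_ENC_INT64 ((uint8_t)sizeof(int64_t))

/* Return the smallest encoding (in bytes) able to hold v. */
uint8_t sketch_value_encoding(int64_t v) {
    if (v < INT32_MIN || v > INT32_MAX) return SKETCH_ENC_INT64;
    if (v < INT16_MIN || v > INT16_MAX) return SKETCH_ENC_INT32;
    return SKETCH_ENC_INT16;
}
```

With this rule, 13 and -32768 need 2 bytes, 32768 and 100000 need 4 bytes, and anything outside the 32-bit range needs 8 bytes.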

The following figure shows a specific example of adding data.

In the image above:

The newly created intset consists of the header only, 8 bytes in total, with encoding = 2 and length = 0.

After the two elements 13 and 5 are added, the encoding is unchanged and its value is still 2, because both are small integers that fit in 2 bytes.

When 32768 is added, it can no longer be represented in 2 bytes (2 bytes can represent the range -2^15 to 2^15-1, and 32768 equals 2^15, which is out of range), so the encoding must be upgraded to INTSET_ENC_INT32 (value 4), that is, 4 bytes per element.

As each element is added, intset always keeps the elements sorted in ascending order.
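The upgrade step in the walk-through above (2-byte elements becoming 4-byte elements) can be sketched as an in-place widening. This is a simplified illustration, not the code from intset.c: the function name is hypothetical, it assumes the buffer has already been grown to hold the wider elements, and it uses the host byte order. Walking from the last element to the first ensures that no narrow element is overwritten before it has been read.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen `length` int16_t elements at the start of buf into int32_t elements,
 * in place. buf must already be at least length * sizeof(int32_t) bytes. */
void sketch_widen_16_to_32(void *buf, uint32_t length) {
    unsigned char *bytes = buf;
    for (uint32_t i = length; i-- > 0; ) {
        int16_t narrow;
        int32_t wide;
        /* Read element i in the old 2-byte layout... */
        memcpy(&narrow, bytes + (size_t)i * sizeof narrow, sizeof narrow);
        wide = narrow;
        /* ...and write it to slot i of the new 4-byte layout. */
        memcpy(bytes + (size_t)i * sizeof wide, &wide, sizeof wide);
    }
}
```

After widening, the new element that forced the upgrade still has to be placed at the front or the back of the array, a step this sketch leaves out.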

Like ziplist, intset stores its data in little-endian byte order (see the Wikipedia entry Endianness). For example, after all the data in the figure above has been added, the four bytes of the encoding field should be interpreted as 0x00000004, and the fifth element should be interpreted as 0x000186A0 = 100000.

intset differs from ziplist in the following ways:

ziplist can store arbitrary binary strings, while intset can only store integers.

ziplist is unordered, while intset is sorted in ascending order. A lookup on a ziplist therefore has to traverse it, while a lookup on an intset can use binary search, which is faster.

ziplist can encode each data item differently (each item is preceded by a length field len), while intset uses one uniform encoding for the whole set.

Find and add operations of intset

To understand the implementation details of intset, it is enough to look at its two key operations: lookup (intsetFind) and insertion (intsetAdd).

The key code for intsetFind is as follows (from intset.c):

    uint8_t intsetFind(intset *is, int64_t value) {
        uint8_t valenc = _intsetValueEncoding(value);
        return valenc <= intrev32ifbe(is->encoding) && intsetSearch(is,value,NULL);
    }

    static uint8_t intsetSearch(intset *is, int64_t value, uint32_t *pos) {
        int min = 0, max = intrev32ifbe(is->length)-1, mid = -1;
        int64_t cur = -1;

        /* The value can never be found when the set is empty */
        if (intrev32ifbe(is->length) == 0) {
            if (pos) *pos = 0;
            return 0;
        } else {
            /* Check for the case where we know we cannot find the value,
             * but do know the insert position. */
            if (value > _intsetGet(is,intrev32ifbe(is->length)-1)) {
                if (pos) *pos = intrev32ifbe(is->length);
                return 0;
            } else if (value < _intsetGet(is,0)) {
                if (pos) *pos = 0;
                return 0;
            }
        }

        while(max >= min) {
            mid = ((unsigned int)min + (unsigned int)max) >> 1;
            cur = _intsetGet(is,mid);
            if (value > cur) {
                min = mid+1;
            } else if (value < cur) {
                max = mid-1;
            } else {
                break;
            }
        }

        if (value == cur) {
            if (pos) *pos = mid;
            return 1;
        } else {
            if (pos) *pos = min;
            return 0;
        }
    }

With regard to the above code, we need to pay attention to:

intsetFind looks up the element value in the given intset. It returns 1 if the element is found and 0 if it is not.

The _intsetValueEncoding function computes the encoding that value requires (that is, how many bytes are needed to store it) from the range it falls in. If value requires a larger encoding than the current encoding of the intset, it must lie outside the range the intset can currently store (it is either too large or too small), so intsetFind returns 0 immediately. Otherwise it calls intsetSearch to run a binary search.

intsetSearch looks up the element value in the given intset. If it is found, the function returns 1 and sets pos to the position of the element. If it is not found, the function returns 0 and sets pos to the position where the element could be inserted.

intsetSearch is an implementation of binary search and consists of roughly three parts. First, it handles the empty intset as a special case. Second, it handles two boundary cases: value is larger than the last element, or smaller than the first. Neither special case is strictly required for a binary search, but they allow the function to fail fast. Third, it performs the actual binary search. Note that when the value is not found, the insertion position is the one given by min.

The intrev32ifbe macro that appears in the code performs an endianness conversion when one is needed. As mentioned earlier, the data in an intset is stored in little-endian order, so on a big-endian machine intrev32ifbe does the corresponding conversion.

The overall time complexity of this lookup is O(log n).

The key code for intsetAdd is as follows (from intset.c):

    intset *intsetAdd(intset *is, int64_t value, uint8_t *success) {
        uint8_t valenc = _intsetValueEncoding(value);
        uint32_t pos;
        if (success) *success = 1;

        /* Upgrade encoding if necessary. If we need to upgrade, we know that
         * this value should be either appended (if > 0) or prepended (if < 0),
         * because it lies outside the range of existing values. */
        if (valenc > intrev32ifbe(is->encoding)) {
            /* This always succeeds, so we don't need to curry *success. */
            return intsetUpgradeAndAdd(is,value);
        } else {
            /* Abort if the value is already present in the set.
             * This call will populate "pos" with the right position to insert
             * the value when it cannot be found. */
            if (intsetSearch(is,value,&pos)) {
                if (success) *success = 0;
                return is;
            }

            is = intsetResize(is,intrev32ifbe(is->length)+1);
            if (pos < intrev32ifbe(is->length)) intsetMoveTail(is,pos,pos+1);
        }

        _intsetSet(is,pos,value);
        is->length = intrev32ifbe(intrev32ifbe(is->length)+1);
        return is;
    }

With regard to the above code, we need to pay attention to:

intsetAdd adds a new element value to the intset. If value is already present, it is not added again and the success parameter is set to 0. If value is not present, it is inserted at the appropriate position and success is set to 1.

If value requires a larger encoding than the current encoding of the intset, intsetUpgradeAndAdd is called to upgrade the encoding and then insert value.

Otherwise intsetSearch is called, and if the value is found it is not added again.

If it is not found, intsetResize is called to grow the memory of the intset so that it can hold the new element. Because an intset occupies one contiguous block, this triggers a realloc (see http://man.cx/realloc), which may copy the data. If the insertion position is not at the end, intsetMoveTail is then called to shift the elements after the insertion position back by one slot, which is another data copy. intsetMoveTail uses memmove for this; unlike memcpy, memmove copies correctly even when the source and destination regions overlap, as described in http://man.cx/memmove.

intsetUpgradeAndAdd also calls intsetResize to grow the memory. During an encoding upgrade, it takes each element out of the original intset and writes it back to its new location using the new encoding.

Note the return value of intsetAdd: it returns a new intset pointer, which may or may not be the same as the pointer is that was passed in. The caller must replace its old intset variable with the pointer returned here. This interface pattern is common in the Redis code base; sds and ziplist use it as well.

Although finding the insertion position takes only O(log n), growing the memory and shifting the tail may copy every element, so the overall time complexity of intsetAdd is O(n).
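The search-then-shift pattern described above can be tried out on a much simpler structure. The sketch below is not the Redis implementation: it uses a hypothetical sketch_set with fixed 64-bit elements in host byte order and no encoding upgrade, and its function names are invented for illustration. It does keep the two ideas from the text: a binary search that also reports the insertion position, and an insert that grows the block with realloc and shifts the tail with memmove.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical simplified set: sorted ascending, no duplicates. */
typedef struct sketch_set {
    uint32_t length;
    int64_t *items;
} sketch_set;

/* Binary search. Returns 1 and stores the index in *pos when value is found;
 * otherwise returns 0 and stores the position where it would be inserted. */
int sketch_search(const sketch_set *s, int64_t value, uint32_t *pos) {
    uint32_t lo = 0, hi = s->length;          /* search window is [lo, hi) */
    assert(pos != NULL);
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (s->items[mid] < value) lo = mid + 1;
        else hi = mid;
    }
    *pos = lo;
    return lo < s->length && s->items[lo] == value;
}

/* Returns 1 if value was inserted, 0 if it was already present,
 * -1 if memory could not be allocated. */
int sketch_add(sketch_set *s, int64_t value) {
    uint32_t pos;
    int64_t *grown;
    if (sketch_search(s, value, &pos)) return 0;
    grown = realloc(s->items, ((size_t)s->length + 1) * sizeof *grown);
    if (grown == NULL) return -1;
    s->items = grown;
    /* Shift the tail back by one slot; the regions overlap, hence memmove. */
    memmove(&s->items[pos + 1], &s->items[pos],
            (size_t)(s->length - pos) * sizeof *grown);
    s->items[pos] = value;
    s->length++;
    return 1;
}
```

Starting from an empty set (length 0, items NULL), adding 13, 5 and 32768 leaves the array as 5, 13, 32768, and a second attempt to add 5 reports that it is already present. The caller is responsible for freeing items.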

Redis sets

To better understand the set data structures exposed by Redis, let's first take a look at some of the key commands of set. Here are some examples of commands:

What the above commands mean:

sadd is used to add elements to the sets S1 and S2. The elements added include both numbers and non-numeric values ("a" and "b").

sismember is used to determine whether a given element exists in a set.

sinter, sunion and sdiff are used to calculate the intersection, union and difference of sets, respectively.

As we mentioned earlier, the underlying implementation of set varies depending on whether the element type is an integer and how many elements are added. For example, during the execution of the above command, the underlying data structure of collection S1 changes as follows:

After sadd S1 13 5 is executed, the underlying structure of S1 is an intset with encoding = 2, because only small integers have been added.

After sadd S1 32768 10 100000 is executed, the underlying structure of S1 is still an intset, but its encoding is upgraded from 2 to 4.

After sadd S1 a b is executed, the underlying implementation of S1 becomes a dict, because the added elements are no longer integers.

We know that dict is a data structure used to maintain the mapping relationship between key and value, so when the underlying set is represented by dict, what are its key and value, respectively? In fact, key is the collection element to add, and value is NULL.

Besides the addition of non-numeric elements mentioned above, two other situations cause the underlying structure of a set to be converted from intset to dict:

A number is added that cannot be represented as a signed 64-bit integer. The range of integers that intset can represent is -2^63 to 2^63-1, so adding a number outside this range also causes the intset to be converted to a dict.

The number of elements in the set exceeds the value of the set-max-intset-entries configuration option (for the exact trigger condition, see the setTypeAdd code in t_set.c).
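The three conversion conditions can be combined into one check, sketched below. This is an illustration under assumptions, not the logic of setTypeAdd: the function name and parameters are hypothetical, and integer parsing is delegated to the standard strtoll, which is more lenient than a strict parser (it accepts leading whitespace and a leading plus sign) and assumes long long is 64 bits wide.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns 1 if adding `member` could keep the set in intset form:
 * the member is a signed 64-bit integer and the element count after
 * the add stays within max_entries. Returns 0 otherwise. */
int sketch_fits_intset(const char *member, size_t current_len, size_t max_entries) {
    char *end = NULL;
    if (member == NULL || *member == '\0') return 0;
    errno = 0;
    (void)strtoll(member, &end, 10);
    if (errno == ERANGE) return 0;   /* outside the signed 64-bit range */
    if (*end != '\0') return 0;      /* not purely numeric, e.g. "a" */
    return current_len + 1 <= max_entries;
}
```

A member such as "13" passes, while "a", a number just above 2^63-1, or any member added to a set that is already at the limit would all force a conversion to dict.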

The main reason for using intset to store small collections is to save memory. Especially when the number of elements stored is small, the memory overhead caused by dict is much greater (including two hash tables, linked list pointers, and a large number of other metadata). So, when storing a large number of small collections and the collection elements are numbers, using intset can save a considerable amount of memory space.

In fact, in terms of time complexity, the average performance of intset is not as high as dict. Take lookup as an example, intset is O (log n), while dict can be considered O (1). However, this has little impact because the number of collection elements is relatively small when using intset.

Union, intersection and difference algorithm of Redis set

The code that implements union, intersection and difference for Redis sets is in t_set.c. Intersection is computed by sinterGenericCommand, while union and difference are computed by sunionDiffGenericCommand. All of them can operate on more than two sets at once. Taking the difference of multiple sets means computing the difference between the first set and the second, then subtracting the third set from that result, and so on.

Here we briefly introduce the implementation ideas of the three algorithms.

Intersection

The process of calculating intersection can be divided into three parts:

Check each collection and treat a collection that does not exist as an empty set. Once there is an empty set, there is no need to continue the calculation, the final intersection is the empty set.

Sort the sets by number of elements, from fewest to most. This sort helps the later steps: the calculation starts from the smallest set, so fewer elements need to be processed.

Traverse the first set (that is, the smallest set) after sorting, and look for each of its elements in all subsequent collections in turn. Only those elements that can be found in all collections are added to the final result set.

Note that step 3 above performs lookups in the sets, whose time complexity is O(log n) for intset storage and O(1) for dict storage. However, since only small sets use intset, intset lookups can be roughly treated as constant time too. Therefore, as stated in the official Redis documentation (http://redis.io/commands/sinter), the time complexity of the sinter command is:

O(N*M) worst case where N is the cardinality of the smallest set and M is the number of sets.
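The three intersection steps above can be sketched on plain sorted arrays. This is not the code of sinterGenericCommand: the sketch_iset type and function names are hypothetical, every set is represented as a sorted array searched with the standard bsearch, and the standard qsort orders the sets by size.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical set representation: items sorted ascending, no duplicates. */
typedef struct sketch_iset {
    const int64_t *items;
    size_t len;
} sketch_iset;

static int cmp_by_len(const void *a, const void *b) {
    const sketch_iset *x = a, *y = b;
    return (x->len > y->len) - (x->len < y->len);
}

static int cmp_i64(const void *a, const void *b) {
    int64_t x = *(const int64_t *)a, y = *(const int64_t *)b;
    return (x > y) - (x < y);
}

/* Writes the intersection into out and returns the number of elements.
 * out must have room for as many elements as the smallest set holds.
 * The sets array is reordered by size as a side effect. */
size_t sketch_intersect(sketch_iset *sets, size_t nsets, int64_t *out) {
    size_t count = 0;
    if (nsets == 0) return 0;
    /* Step 1: any empty set makes the whole intersection empty. */
    for (size_t i = 0; i < nsets; i++)
        if (sets[i].len == 0) return 0;
    /* Step 2: smallest set first. */
    qsort(sets, nsets, sizeof sets[0], cmp_by_len);
    /* Step 3: keep elements of the smallest set found in every other set. */
    for (size_t i = 0; i < sets[0].len; i++) {
        int64_t v = sets[0].items[i];
        size_t j = 1;
        while (j < nsets &&
               bsearch(&v, sets[j].items, sets[j].len, sizeof v, cmp_i64) != NULL)
            j++;
        if (j == nsets) out[count++] = v;
    }
    return count;
}
```

Intersecting {1, 2, 3, 4, 5}, {2, 4, 6} and {0, 2, 4, 8} this way iterates over the three-element set only and yields 2 and 4.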

Union

Calculating the union is the easiest: simply traverse all the sets and add each element to the final result set. Adding elements to a set automatically removes duplicates.

Because every element of every set has to be traversed, the official Redis documentation (http://redis.io/commands/sunion) gives the time complexity of the sunion command as:

O (N) where N is the total number of elements in all given sets.

Note that, as discussed earlier in the intersection calculation, the process of inserting elements into the result set ignores the case of intset and considers the time complexity to be O (1).

Difference

There are two possible algorithms for calculating difference sets, and their time complexity is different.

The first algorithm:

Traverse the first set and look up each of its elements in all subsequent sets in turn. Only elements that are not found in any of the subsequent sets are added to the final result set.

The time complexity of this algorithm is O(N*M), where N is the number of elements of the first set and M is the number of sets.

The second algorithm:

Add all the elements of the first collection to an intermediate collection.

Iterate through all the subsequent sets and delete every element encountered from the intermediate set.

Finally, the elements remaining in the intermediate set form the difference.

The time complexity of this algorithm is O (N), where N is the sum of the elements of all sets.

Before calculating the difference, Redis estimates the expected cost of both algorithms and selects the cheaper one. There are two more points to note:

To a certain extent, priority is given to the first algorithm, because it involves fewer operations: it only adds elements, while the second algorithm adds elements and then deletes them.

If the first algorithm is selected, Redis first sorts all sets from the second one onward by number of elements, from largest to smallest. With larger sets checked first, an element is more likely to be found early, which ends the search for it sooner.
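The choice between the two difference algorithms can be sketched as a comparison of estimated work. The function name is hypothetical, and the factor of two used to favour the first algorithm is an assumption made for illustration; the text above only says that the first algorithm is preferred to a certain extent.

```c
#include <assert.h>
#include <stddef.h>

/* sizes[0] is the element count of the first set, sizes[1..] the others.
 * Returns 1 or 2, the number of the difference algorithm to use. */
int sketch_pick_diff_algorithm(const size_t *sizes, size_t nsets) {
    size_t work_one = 0, work_two = 0;
    assert(sizes != NULL && nsets > 0);
    for (size_t i = 0; i < nsets; i++) {
        work_one += sizes[0];   /* algorithm 1: N * M lookups in total */
        work_two += sizes[i];   /* algorithm 2: touches every element once */
    }
    work_one /= 2;              /* assumed bias towards algorithm 1 */
    return work_one <= work_two ? 1 : 2;
}
```

With a small first set and large later sets (for example sizes 3, 1000 and 1000) the first algorithm wins; with a large first set and tiny later sets (100, 1 and 1) the second one is cheaper.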
