2025-04-06 Update From: SLTechnology News & Howtos > Database
Shulou (Shulou.com) 05/31 report --
Today, let's talk about how to do 100-million-level data statistics in Redis. The topic may be unfamiliar to many readers, so the key points are summarized below; I hope you get something out of this article.
We often face a huge number of users and visits: millions or tens of millions of users, and tens of millions or even hundreds of millions of access records.
We therefore need collection types that can count very large data sets (hundreds of millions of entries) efficiently.
To choose the right collection type, we must first understand the common statistical patterns, and then pick a suitable data type to solve the practical problem.
Four statistical types:
Binary state statistics
Aggregate statistics
Ranking statistics
Cardinality statistics.
Besides String, Set, Zset, List, and Hash, this article also uses the extended data types Bitmap and HyperLogLog.
All commands in this article can be tried out in the online Redis client at try.redis.io/, which is very convenient.
A quick note
Share more and give more; create value for others early on without worrying about returns. In the long run, these efforts will repay you exponentially.
Especially when you start to cooperate with others, don't obsess over short-term returns; they don't mean much. What matters more is exercising your vision, perspective, and problem-solving skills.
Binary state statistics
Brother Ma, what is binary state statistics?
It means the elements in the collection take only two values, 0 and 1. For check-ins and login status, you only need to record signed-in (1) or not (0), logged-in (1) or not (0).
Take judging whether a user is logged in. If we used Redis's String type (key = userId, value = 0 for offline, 1 for online), storing the login status of 1 million users would require 1 million strings, which is far too expensive.
For binary-state scenarios we can use Bitmap instead. For example, with one bit per user's login status, 100 million users occupy only about 100 million bits ≈ (100,000,000 / 8 / 1024 / 1024) ≈ 12 MB of memory.
The approximate formula for calculating the space occupation is: ($offset/8/1024/1024) MB
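As a sanity check, the formula can be evaluated directly. A minimal Python sketch (the function name bitmap_size_mb is mine, not a Redis API):

```python
def bitmap_size_mb(max_offset: int) -> float:
    """Approximate memory (MB) of a Redis Bitmap whose highest offset is max_offset."""
    return max_offset / 8 / 1024 / 1024

# 100 million users, one bit each: roughly 12 MB
print(round(bitmap_size_mb(100_000_000), 1))  # → 11.9
```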
What is Bitmap?
Under the hood, Bitmap uses the String type's SDS data structure to hold the bit array. Redis uses all eight bits of each byte in the array, and each bit represents the binary state of one element (either 0 or 1).
Think of Bitmap as an array in units of bits. Each cell of the array stores only 0 or 1, and the array subscript is called the offset.
For an intuitive picture, imagine each byte of the buf array as one row of 8 cells, each cell being one bit of that byte, as shown in the figure below:
Eight bits make up one byte, so Bitmap saves a great deal of storage space. That is the advantage of Bitmap.
Judge the user's login status
How do we use Bitmap to judge, among a large number of users, whether a given user is online?
Bitmap provides the GETBIT and SETBIT operations, which read and write the bit at position offset in the bit array. Note that offset starts from 0.
A single key such as login_status can store the whole login-status collection: use the user ID as the offset, set the bit to 1 when the user comes online and to 0 when they go offline, and use GETBIT to check whether a user is online. 50 million users need only about 6 MB of space.
SETBIT command
SETBIT
Sets or clears the bit at offset in the string value stored at key (the bit is either 0 or 1).
GETBIT command
GETBIT
Gets the bit value at offset in the string value stored at key; returns 0 if key does not exist.
If we want to check the login status of the user with ID = 10086:
The first step is to execute the following instruction to mark the user as logged in.
SETBIT login_status 10086 1
The second step is to check whether the user is logged in; a return value of 1 means logged in.
GETBIT login_status 10086
The third step is to log the user out by setting the bit at that offset back to 0.
SETBIT login_status 10086 0
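The three steps above can be simulated without a server. Below is a minimal in-memory sketch of SETBIT/GETBIT semantics; the Bitmap class is illustrative, not Redis code:

```python
class Bitmap:
    """Minimal in-memory stand-in for Redis SETBIT/GETBIT (illustration only)."""

    def __init__(self):
        self.buf = bytearray()

    def setbit(self, offset: int, value: int) -> None:
        byte, bit = divmod(offset, 8)
        if byte >= len(self.buf):                # grow the byte array on demand, like SDS
            self.buf.extend(b"\x00" * (byte - len(self.buf) + 1))
        if value:
            self.buf[byte] |= 1 << (7 - bit)     # Redis numbers bits from the MSB
        else:
            self.buf[byte] &= ~(1 << (7 - bit))

    def getbit(self, offset: int) -> int:
        byte, bit = divmod(offset, 8)
        if byte >= len(self.buf):
            return 0                             # a missing key/offset reads as 0
        return (self.buf[byte] >> (7 - bit)) & 1

login_status = Bitmap()
login_status.setbit(10086, 1)       # SETBIT login_status 10086 1
print(login_status.getbit(10086))   # GETBIT login_status 10086 → 1
login_status.setbit(10086, 0)       # SETBIT login_status 10086 0 (logout)
print(login_status.getbit(10086))   # → 0
```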
Monthly check-in status of users
For check-in statistics, each user's daily check-in is recorded with 1 bit; a whole year of check-ins needs only 365 bits. A month has at most 31 days, so only 31 bits are needed.
For example, how do we record the check-ins of user 89757 in May 2021?
The key can be designed as uid:sign:{userId}:{yyyyMM}, and (day of month - 1) is used as the offset (because offset starts from 0, offset = date - 1).
The first step is to execute the following instruction to record that the user signed in on May 16, 2021.
SETBIT uid:sign:89757:202105 15 1
The second step is to check whether user 89757 signed in on May 16, 2021.
GETBIT uid:sign:89757:202105 15
The third step is to count the user's check-ins in May using the BITCOUNT instruction, which counts the number of bits set to 1 in a given bit array.
BITCOUNT uid:sign:89757:202105
This way we get the user's monthly check-in record. Isn't that great?
How do we get the date of the user's first check-in this month?
Redis provides the BITPOS key bitValue [start] [end] instruction, which returns the first offset in the Bitmap whose value is bitValue.
By default the command scans the whole bitmap; you can restrict the range with the optional start and end parameters.
So we can get the date of user 89757's first check-in in May 2021 by executing the following instruction:
BITPOS uid:sign:89757:202105 1
Note that we need to add 1 to the returned value to get the day of the first check-in, because offset starts from 0.
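Putting BITCOUNT and BITPOS together, the month's logic can be sketched in Python; the sign-in days are made-up sample data:

```python
# offsets (day - 1) on which user 89757 signed in during May (hypothetical data)
signed_offsets = [2, 15, 16, 20]

bits = [0] * 31                      # one bit per day of the month
for off in signed_offsets:
    bits[off] = 1                    # SETBIT uid:sign:89757:202105 <off> 1

month_checkins = sum(bits)           # BITCOUNT uid:sign:89757:202105
first_offset = bits.index(1)         # BITPOS uid:sign:89757:202105 1
print(month_checkins)                # → 4
print(first_offset + 1)              # first check-in day of the month → 3
```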
Total number of consecutive check-in users
Given the check-in data of 100 million users over 7 consecutive days, how do we count the total number of users who signed in on all 7 days?
We use each day's date as a Bitmap key and the userId as the offset; if the user checks in, the bit at that offset is set to 1.
Each bit of the collection under that key is then one user's check-in record for that date.
There are seven such Bitmaps. If we AND the corresponding bits of these seven Bitmaps:
The same userId always maps to the same offset, so when a userId's bit is 1 at that offset in all 7 Bitmaps, the user checked in for 7 consecutive days.
We save the result to a new Bitmap and count the bits equal to 1 with BITCOUNT, which gives the total number of users who checked in for 7 consecutive days.
Redis provides the BITOP operation destkey key [key ...] instruction, which performs bit operations across one or more Bitmaps identified by key.
operation can be AND, OR, NOT, or XOR. When BITOP processes strings of different lengths, the missing part of the shorter string is treated as 0.
A nonexistent key is also treated as a string of zeros.
It is easy to understand, as shown in the following figure:
Three Bitmaps have their corresponding bits ANDed, and the result is saved to a new Bitmap.
The following instructions AND the three Bitmaps, save the result to destmap, and then run BITCOUNT on destmap to count the bits equal to 1:
BITOP AND destmap bitmap:01 bitmap:02 bitmap:03
BITCOUNT destmap
A quick calculation of the memory cost: a Bitmap of 100 million bits takes about 12 MB (10^8 / 8 / 1024 / 1024), so 7 days of Bitmaps cost about 84 MB. It is also best to set an expiration time on each Bitmap, letting Redis delete expired check-in data automatically to save memory.
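The AND-then-count pipeline can be sketched with plain integers standing in for the daily Bitmaps; the three bit patterns below are made up:

```python
# one bitmap per day; bit i is 1 if user i signed in that day (hypothetical data)
day_bitmaps = [0b10110110, 0b11110010, 0b10111110]

dest = day_bitmaps[0]
for day in day_bitmaps[1:]:
    dest &= day                        # BITOP AND destmap day1 day2 day3

consecutive = bin(dest).count("1")     # BITCOUNT destmap
print(consecutive)                     # users set in all three bitmaps → 4
```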
Summary
The way of thinking matters most. When a scenario only needs the binary state of the data, such as whether a user exists, whether an IP is blacklisted, or check-in statistics, we can consider using Bitmap.
Only one bit is needed per element to represent 0 or 1, which greatly reduces the memory footprint when counting large amounts of data.
Cardinality statistics
Cardinality statistics means counting the number of distinct elements in a set, most commonly to compute the number of unique visitors (UV).
The most direct implementation uses the Set data structure: an element is added to the set when it has never appeared before; if it has, the set stays unchanged.
But when page visits are huge, the counting needs a very large Set, which wastes a lot of space.
Besides, this kind of data does not need to be perfectly accurate. Is there a better solution?
Good question. Redis provides the HyperLogLog data structure precisely for this kind of counting problem.
HyperLogLog is an approximate, deduplicating cardinality-counting scheme. Its statistics are probabilistic, with a standard error of 0.81%, which is accurate enough for UV statistics.
The principle behind HyperLogLog is quite involved; if you want to dig deeper, see:
https://www.zhihu.com/question/53416615
https://en.wikipedia.org/wiki/HyperLogLog
The UV of the website
Implemented through Set
A user who visits a website several times in one day should still be counted only once, so it is natural to reach for Redis's Set.
When user 89757 visits the "Why Redis is so fast" page, we add the user ID to the page's Set:
SADD why-redis-so-fast:uv 89757
When user 89757 visits the page repeatedly, Set's deduplication ensures the same user ID is not recorded twice.
Use the SCARD command to count the page's UV; it returns the number of elements (user IDs) in the set.
SCARD why-redis-so-fast:uv
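A Python set behaves the same way as the Redis Set here; a minimal sketch with sample user IDs:

```python
uv = set()                                     # stands in for the page's UV Set
for user_id in [89757, 89757, 12345, 89757, 67890]:
    uv.add(user_id)                            # SADD dedupes repeat visits
print(len(uv))                                 # SCARD → 3
```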
Implemented through Hash
We can also implement this with the Hash type: use the user ID as the field of a Hash, and execute HSET with the value 1 when the user visits the page.
Even if the user visits repeatedly and the command runs again, the userId's value is simply set to 1 again.
Finally, the HLEN command counts the number of fields in the Hash, which is the UV.
As follows:
HSET redis-cluster:uv userId:89757 1
HLEN redis-cluster:uv
HyperLogLog, the ace solution
Set works, but if an article goes viral and reaches tens of millions of readers, a single Set has to store tens of millions of user IDs, and with more pages the memory consumed becomes enormous. The same applies to the Hash type. What should we do?
Use the advanced HyperLogLog data structure provided by Redis (don't stop at the five basic Redis data types). It is a collection type built for cardinality statistics: even when the data volume is huge, the space needed to compute the cardinality stays fixed.
Each HyperLogLog costs at most 12 KB of memory to count the cardinality of close to 2^64 elements.
Redis also optimizes HyperLogLog storage: when the count is small, it uses a sparse representation, which takes very little space.
Only when the count grows large and the sparse representation exceeds a threshold does it convert to the dense representation that occupies the full 12 KB.
PFADD
Add each user ID that visits the page to the HyperLogLog:
PFADD redis-master-slave:uv userID1 userID2 userID3
PFCOUNT
Use PFCOUNT to get the UV value of the "Redis master-slave synchronization principle" page:
PFCOUNT redis-master-slave:uv
PFMERGE usage scenario
Besides PFADD and PFCOUNT above, HyperLogLog also provides PFMERGE, which merges multiple HyperLogLogs into one new HyperLogLog value.
Syntax
PFMERGE destkey sourcekey [sourcekey...]
Usage scenario
For example, a website has two pages with similar content, and the operations team wants the data of the two pages merged.
The pages' UV counts also need merging, and this is where PFMERGE comes in handy: the same user visiting both pages is counted only once.
As shown below, two HyperLogLog collections, Redis:data and MySQL:data, hold the user-visit data of the two pages respectively.
PFADD Redis:data user1 user2 user3
PFADD MySQL:data user1 user2 user4
PFMERGE database Redis:data MySQL:data
PFCOUNT database // return value = 4
Merging multiple HyperLogLogs produces one HyperLogLog whose cardinality is close to that of the union of all the observed sets of the inputs.
user1 and user2 visited both pages but are each counted only once.
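The merge semantics can be illustrated with exact Python sets standing in for the two HyperLogLogs (a real HLL would return an approximate count; keys and users are sample data):

```python
redis_page = {"user1", "user2", "user3"}   # PFADD Redis:data user1 user2 user3
mysql_page = {"user1", "user2", "user4"}   # PFADD MySQL:data user1 user2 user4

merged = redis_page | mysql_page           # PFMERGE database Redis:data MySQL:data
print(len(merged))                         # PFCOUNT database → 4
```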
Ranking statistics
Of the four collection types of Redis (List, Set, Hash, Sorted Set), List and Sorted Set are ordered.
List: sorted by the order in which elements are inserted. Typical scenarios: message queues, latest-item lists, rankings.
Sorted Set: sorted by each element's score, which we assign ourselves. Typical scenarios: rankings, e.g., by view count or like count.
List of latest comments
We can use List's insertion order to implement a latest-comments list.
For example, in the comment list of a WeChat official account (just an example, don't nitpick), each account corresponds to one List, and that List stores all user comments on the account.
Whenever a user comments, use LPUSH key value [value ...] to insert the comment at the head of the List.
LPUSH magebyte 1 2 3 4 5 6
Then use LRANGE key start stop to get the elements in the specified range of the list.
> LRANGE magebyte 0 4
1) "6"
2) "5"
3) "4"
4) "3"
5) "2"
Note that not every latest-item list can be implemented with List: for frequently updated lists, List-based paging can duplicate or omit elements.
For example, suppose the current comment List = {A, B, C, D}, with the newest comment on the left and D the earliest.
LPUSH magebyte D C B A
Show the latest 2 comments on the first page, getting A and B:
> LRANGE magebyte 0 1
1) "A"
2) "B"
By the logic we want, the second page should get C and D via LRANGE magebyte 2 3.
If a new comment E arrives before the second page is displayed, LPUSH magebyte E inserts E at the head of the List, making List = {E, A, B, C, D}.
Now executing LRANGE magebyte 2 3 for the second page, we find that B appears again.
> LRANGE magebyte 2 3
1) "B"
2) "C"
The reason is that List sorts by element position: once a new element is inserted, List = {E, A, B, C, D}.
All the original data shift back one position in the List, so we read an old element.
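The duplicated-element problem can be reproduced with a plain Python list, modeling LPUSH as an insert at index 0:

```python
comments = []
for c in ["D", "C", "B", "A"]:
    comments.insert(0, c)            # LPUSH D C B A → [A, B, C, D]

page1 = comments[0:2]                # LRANGE 0 1 → ['A', 'B']
comments.insert(0, "E")              # a new comment E arrives: LPUSH E
page2 = comments[2:4]                # LRANGE 2 3 → ['B', 'C']: B is read again
print(page1, page2)
```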
Summary
Only lists that need no paging (for example, always taking just the first five elements) or that update infrequently (for example, statistics refreshed every morning) are suitable for the List type.
For lists that need paging and are updated frequently, use the ordered collection type Sorted Set instead.
In addition, a latest-item list that must be queried by time range cannot be implemented with List either; it needs Sorted Set, for example an order list queried by transaction-time range.
Ranking
For latest-list scenarios both List and Sorted Set work, so why still use List? Wouldn't it be better to use Sorted Set directly, since it can also set score weights and sort more flexibly?
The reason is that the Sorted Set type occupies several times more memory than the List type; when the list is simple enough, List will do.
For example, for a weekly music chart we need to update play counts in real time and display the chart in pages.
Moreover, the ranking is by play count, so List cannot satisfy this.
We can save the music IDs in a Sorted Set with each song's play count as its score; every time a song is played, score = score + 1.
ZADD
For example, we add "Blue and White Porcelain" and "Hua Tian Cuo" to the musicTop collection:
ZADD musicTop 100000000 "Blue and White Porcelain" 8999999 "Hua Tian Cuo"
ZINCRBY
Every time "Blue and White Porcelain" is played, use the ZINCRBY instruction to add 1 to its score.
> ZINCRBY musicTop 1 "Blue and White Porcelain"
"100000001"
ZRANGEBYSCORE
Finally, we need to fetch the top 10 songs of musicTop by play count. Suppose the current maximum play count is N; then the following instruction fetches them (N-9 here means the value N minus 9):
ZRANGEBYSCORE musicTop N-9 N WITHSCORES
Brother 65: But how do we get this N?
ZREVRANGE
You can use the ZREVRANGE key start stop [WITHSCORES] instruction.
The elements are sorted by decreasing score values (from large to small).
Members with the same score are arranged in reverse lexicographical order.
> ZREVRANGE musicTop 0 0 WITHSCORES
1) "Blue and White Porcelain"
2) "100000000"
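The chart's behavior can be sketched with a dict of scores; the function names zadd/zincrby/zrevrange only mimic the commands, and this sketch ignores Sorted Set's lexicographic tie-breaking:

```python
music_top: dict[str, int] = {}                  # member → score

def zadd(score: int, member: str) -> None:
    music_top[member] = score

def zincrby(amount: int, member: str) -> int:
    music_top[member] = music_top.get(member, 0) + amount
    return music_top[member]

def zrevrange(start: int, stop: int) -> list[tuple[str, int]]:
    ranked = sorted(music_top.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[start:stop + 1]               # stop is inclusive, as in Redis

zadd(100_000_000, "Blue and White Porcelain")
zadd(8_999_999, "Hua Tian Cuo")
zincrby(1, "Blue and White Porcelain")          # one more play
print(zrevrange(0, 0))                          # → [('Blue and White Porcelain', 100000001)]
```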
Summary
Even if the elements in the collection are updated frequently, Sorted Set can get the ordered data accurately through the ZRANGEBYSCORE command.
When faced with scenarios such as the need to display the latest lists and rankings, if the data is updated frequently or needs to be displayed by pages, it is recommended to give priority to using Sorted Set.
Aggregate statistics
Aggregate statistics means computing aggregate results over multiple sets, for example:
Counting the data common to multiple sets (intersection statistics)
Counting the elements unique to one of two sets (difference statistics)
Counting all the elements of multiple sets (union statistics).
So what kinds of scenarios call for intersection, difference, and union?
Redis's Set type supports adding, deleting, updating, and querying within a collection. Its underlying Hash data structure makes both insertion and removal O(1) operations.
It also supports intersection, union, and difference operations across multiple sets, and these set operations solve the statistical problems above.
Intersection: common friends
For example, mutual friends on QQ are an intersection in aggregate statistics. We use the account as the key and the account's friends as members of a Set.
Simulate the friend collections of two users:
SADD user:magebyte "R" "Linux god" "father of PHP"
SADD user:boss "Linux god" "Python god" "C++ rookie"
Counting the common friends of the two users only requires intersecting the two Sets, as in the following command:
SINTERSTORE user:common_friends user:magebyte user:boss
After the command executes, the intersection of the sets user:magebyte and user:boss is stored in the collection user:common_friends.
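The same intersection in Python sets, with friend names as sample data:

```python
magebyte_friends = {"R", "Linux god", "father of PHP"}    # SADD user:magebyte ...
boss_friends = {"Linux god", "Python god", "C++ rookie"}  # SADD user:boss ...

common = magebyte_friends & boss_friends   # SINTERSTORE user:common_friends ...
print(common)                              # → {'Linux god'}
```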
Difference: daily new users
For example, to count an App's number of newly registered users each day, just take the difference of the total registered-user sets of the two most recent days.
For example, the total registered users as of 2021-06-01 are stored in the Set with key = user:20210601, and the total as of 2021-06-02 in the Set with key = user:20210602.
The following instruction computes the difference and stores the result in the user:new collection.
SDIFFSTORE user:new user:20210602 user:20210601
After execution, the user:new collection holds the users newly added on 2021-06-02.
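The difference step can be checked the same way, with sample user IDs:

```python
total_0601 = {"u1", "u2", "u3"}          # user:20210601: all users as of June 1
total_0602 = {"u1", "u2", "u3", "u4"}    # user:20210602: all users as of June 2

new_users = total_0602 - total_0601      # SDIFFSTORE user:new user:20210602 user:20210601
print(new_users)                         # → {'u4'}
```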
In addition, QQ's "People You May Know" feature can also be implemented with a difference: subtract your own friends from the set of your friends' friends, and the remainder are people you may know.
Union: total new users over two days
Following the difference example, just take the union of the two sets to get the total number of new users added on 2021-06-01 and 2021-06-02:
SUNIONSTORE userid:new user:20210602 user:20210601
After execution, the userid:new collection holds all the users added over the two days.