
The principle of the HyperLogLog algorithm and how Redis applies it


In this issue, the editor brings you an explanation of the principle of the HyperLogLog algorithm and how Redis applies it. The article is rich in content and analyzes the topic from a professional point of view. I hope you get something out of reading it.

Problem prototype

Suppose you want to implement the following feature:

Count how many users click into a given page of an APP or website each day, where repeated clicks by the same user are counted as 1.

You may immediately think of a data structure like HashMap, which also satisfies the de-duplication requirement. Indeed, that is one solution, but there are others.

Although the problem is not difficult, once the variables involved reach a certain order of magnitude, even a simple problem becomes hard. Suppose the APP's daily active users reach a million or more; the HashMap approach would then consume a large amount of memory.

Let's estimate the memory footprint of a HashMap for the problem above. Suppose the HashMap's key is of type string and its value is a bool: the key is the user's id, and the value records whether the user has clicked in. Clearly, when a million different users visit, the memory footprint of this HashMap is: 1 million * (string + bool).
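To make the approach concrete, here is a minimal sketch of the HashMap solution in Java. The class and method names are illustrative, not from any particular codebase:

import java.util.HashMap;
import java.util.Map;

public class HashMapUv {
    // Key: the user's id; value: whether the user has clicked in today.
    private final Map<String, Boolean> seen = new HashMap<>();

    // Record one click; repeated clicks by the same user still count as 1.
    public void click(String userId) {
        seen.put(userId, Boolean.TRUE);
    }

    // Number of distinct users who clicked in.
    public int count() {
        return seen.size();
    }
}

Every distinct user id becomes a key, which is exactly why the memory footprint grows linearly with the number of users.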

Choosing a solution

Among the existing solutions to this problem, HashMap consumes the most memory. If the number of items to count is small, it works well enough and is easy to implement.

Beyond that, there are B+ trees, Bitmap (bitmaps), and the HyperLogLog algorithm that this article focuses on.

If, for huge amounts of data, an error rate within an acceptable range is allowed, say, counting 10 million page views with an undercount of 10,000 to 20,000, then the HyperLogLog algorithm can solve counting problems like the one above.

HyperLogLog

HyperLogLog, hereinafter referred to as HLL, is an improved version of the LogLog algorithm that provides approximate de-duplicated counting. It has the following characteristics:

The code is difficult to implement.

It can count huge amounts of data with very little memory: the HyperLogLog implemented in Redis needs only 12 KB of memory to count up to 2^64 items.

The count has a certain error, but the overall error rate is low: the standard error is 0.81%.

The error can be reduced by setting an auxiliary calculation factor.

Anyone with a little knowledge of the memory footprint of basic data types should be surprised that only 12 KB of memory suffices to count 2^64 items. Why? Let's work through an example:

In Java, a long occupies 8 bytes, and a byte has 8 bits, that is, 1 byte = 8 bits, so the largest number the long type can represent is 2^63 - 1. Now suppose, corresponding to the 2^64 figure above, that we had to store all the numbers from 0 to 2^63 - 1, each as a long. With 1 KB = 1024 bytes, the total memory would be ((2^63 - 1) * 8 / 1024) KB, an enormous number, far more than 12 KB of storage. Yet HyperLogLog can count them within 12 KB.

Bernoulli trials

Before understanding why HyperLogLog can count huge amounts of data with so little memory, you need to know about Bernoulli trials.

The Bernoulli trial is a concept from probability theory, classically illustrated by flipping a coin.

A coin has two sides, each landing face-up with 50% probability. Suppose we flip a coin repeatedly until it lands heads, and record that as one complete trial. A trial may take one flip, or four; no matter how many flips, as long as heads eventually appears, it counts as one trial. This is the Bernoulli trial.

Now suppose we perform many Bernoulli trials, say n of them, meaning heads appeared n times. Let the number of flips in each trial be k: the first trial takes k_1 flips, and so on, up to k_n for the nth.

Among these n trials, there must be a maximum number of flips k; for example, if one trial took 12 flips before heads appeared, that value is called k_max, the maximum number of flips.

Two conclusions follow easily from the Bernoulli trials:

No trial in the n-trial process takes more flips than k_max.

In the n-trial process, at least one trial takes exactly k_max flips.

Finally, combined with the maximum likelihood estimation method, one finds an estimation relationship between n and k_max: n = 2^(k_max). This way of predicting a property of the whole data stream from local information seems to defy basic intuition, but the relationship can be derived and verified by probabilistic and statistical methods.

For example, it looks like this:

Trial 1: it took 3 flips for heads to appear; so far k_max = 3.
Trial 2: it took 2 flips for heads to appear; k_max is still 3.
Trial 3: it took 6 flips for heads to appear; k_max becomes 6.
...
Trial n: it took 12 flips for heads to appear; k_max becomes 12. At this point we estimate n = 2^12.

Now suppose the experiment consists of only the first three trials above; then k_max = 6 while n = 3. Plugging into the estimation formula, clearly 3 ≠ 2^6. In other words, when the number of trials is very small, the error of this estimation method is very large.
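To see this behavior for yourself, here is a small simulation sketch in Java (the trial count and names are chosen purely for illustration): it runs n Bernoulli trials, tracks k_max, and prints the estimate 2^k_max next to the true n:

import java.util.Random;

public class BernoulliEstimate {
    public static void main(String[] args) {
        Random random = new Random();
        int n = 100_000;  // number of complete trials, i.e. the true count
        int kMax = 0;
        for (int i = 0; i < n; i++) {
            int k = 1;                       // flips in this trial
            while (!random.nextBoolean()) {  // false = tails: keep flipping
                k++;
            }
            kMax = Math.max(kMax, k);
        }
        System.out.println("n = " + n + ", k_max = " + kMax
                + ", estimate 2^k_max = " + Math.pow(2, kMax));
    }
}

Running it a few times shows the estimate lands in the right order of magnitude but fluctuates heavily, which motivates the optimizations below.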

Optimization of estimation

Call the three trials in the example above one round of estimation. With only one round, the error rate shrinks as n grows large, but it is still not small enough.

So can we conduct multiple rounds? For example, run 100 or more rounds, take the k_max of each round, and average them: (k_max_1 + ... + k_max_100) / 100. Then estimate n. Below is LogLog's estimation formula:

DV_LL = constant * m * 2^R̄

In the formula above, DV_LL corresponds to n; constant is a correction factor whose specific value is not fixed and can be chosen by branching on the actual situation; m is the number of rounds; and R̄ (R with a bar over it) is the average: (k_max_1 + ... + k_max_m) / m.

Optimizing the algorithm by increasing the number of rounds and then averaging the k_max values is LogLog's approach. HyperLogLog differs from LogLog in that it uses not the arithmetic mean but the harmonic mean. The advantage of the harmonic mean over the arithmetic mean is that it is less easily skewed by large values. Here's an example:

Calculate the average salary:

A earns 1000/month and B earns 30000/month. The arithmetic mean is (1000 + 30000) / 2 = 15500.

The harmonic mean is: 2 / (1/1000 + 1/30000) ≈ 1935.48.

Clearly, the harmonic mean reflects the typical value better than the arithmetic mean here. In general, the harmonic mean is calculated as follows (∑ is the summation symbol):

H = m / ∑(1/x_j) = m / (1/x_1 + 1/x_2 + ... + 1/x_m)
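As a quick check, a harmonic mean sketch in Java (names illustrative) reproduces the salary example:

public class HarmonicMean {
    // H = m / (1/x_1 + 1/x_2 + ... + 1/x_m)
    static double harmonicMean(double... xs) {
        double sum = 0;
        for (double x : xs) {
            sum += 1.0 / x;
        }
        return xs.length / sum;
    }

    public static void main(String[] args) {
        System.out.println(harmonicMean(1000, 30000)); // prints about 1935.48
    }
}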

Making the connection

We now know that in the coin-flip setting, n can be estimated from k_max in a Bernoulli experiment.

So how does this estimation method relate to the following question?

Count how many users click into a given page of an APP or website each day, where repeated clicks by the same user are counted as 1.

This is what HyperLogLog does. For each piece of input data, it performs the following steps:

1. Bit string

Through a hash function, the data is converted into a bit string; for example, the input 5 is converted to 101. Why transform it this way?

Because it corresponds to a coin flip: in the bit string, 0 represents tails and 1 represents heads. If a piece of data is converted to 10010000, then reading from right to left, from low bit to high bit, the first 1 we meet counts as the first heads.

So, based on the estimation conclusion above: just as the total number of trials can be estimated from the maximum number of flips needed to reach heads, the amount of data stored can be estimated from the maximum position k_max of the first 1 across the converted bit strings.
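In code, finding the position of the first 1 from the low end is a one-liner. A sketch in Java, assuming a 64-bit hash value (a hash of 0 is ignored here for simplicity):

public class BitString {
    // Position of the first 1, counting from the low end, starting at 1.
    static int firstOnePosition(long hash) {
        return Long.numberOfTrailingZeros(hash) + 1;
    }

    public static void main(String[] args) {
        long hash = 0b10010000L;                    // the example bit string above
        System.out.println(firstOnePosition(hash)); // prints 5
    }
}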

2. Split into buckets

Splitting into buckets means deciding how many rounds there are. Mapped onto computer storage: there is a large bit array S of length L (in bits); S is divided evenly into m groups, where m corresponds to the number of rounds; and each group occupies the same number of bits, say p. The following relationships are easy to derive:

L = S.length

L = m * p

Memory occupied by S (in KB) = L / 8 / 1024

In Redis, HyperLogLog is configured with m = 16384 and p = 6, so L = 16384 * 6. The memory occupied = 16384 * 6 / 8 / 1024 = 12 KB.

Visualize as:

Group 0     Group 1     ...     Group 16383
[000 000]   [000 000]   ...     [000 000]

3. Correspondence

Now let's return to the original problem of counting users on the APP page.

Let the key of the APP home page be: main

The users' ids are: id_n, with n -> 0, 1, 2, 3, ...

In this statistics problem, each distinct user id identifies one user, so we can use the user's id as the hash input. That is:

Hash (id) = bit string

Different user ids necessarily yield different bit strings, and every bit string must have a 1 in at least one position. We treat each bit string as one Bernoulli trial.

Now we split into rounds, that is, buckets. We take some number of the lowest bits of each bit string, convert them to decimal, and that value is the index of the bucket the string belongs to. Suppose the lowest two bits of the bit string are used to compute the bucket index, and some user's id hashes to the bit string 1001011000011. Its bucket index is 11 (binary) = 1 * 2^1 + 1 * 2^0 = 3, so it falls into bucket 3, that is, round 3.

In the example above, after removing the bucket bits, the remaining bit string is 10010110000; reading from low to high, the first 1 appears at position 5. That is, for bucket 3, in that round of the test, k_max = 5. The binary representation of 5 is 101, and since each bucket has p bits, as long as p >= 3, the value 101 can be stored.
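The worked example can be verified with a short Java sketch (the 2-bit bucket split is this article's toy setup, not Redis's):

public class BucketExample {
    public static void main(String[] args) {
        long bits = 0b1001011000011L;
        int bucket = (int) (bits & 0b11);   // lowest 2 bits -> bucket index
        long rest = bits >>> 2;             // remaining bit string: 10010110000
        int position = Long.numberOfTrailingZeros(rest) + 1;
        System.out.println("bucket = " + bucket + ", first 1 at position " + position);
        // prints: bucket = 3, first 1 at position 5
    }
}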

Repeating this process, many different user ids are scattered into different buckets, and each bucket ends up with its own k_max. Then, when counting the number of distinct users who clicked into the main page, we make an estimate: combining the k_max values of all the buckets and substituting them into the estimation formula yields the estimated count.

Below is HyperLogLog's harmonic-mean estimation formula, with the variables interpreted the same way as in LogLog's:

DV_HLL = constant * m^2 / ∑(2^(-k_max_j)) = constant * m^2 / (2^(-k_max_1) + ... + 2^(-k_max_m))
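Putting the formula into code, here is a toy sketch of the estimate in Java, assuming registers[] holds the k_max of each bucket. This illustrates the formula only; it is not Redis's actual implementation, which adds further corrections:

public class HllEstimate {
    static double estimate(int[] registers) {
        int m = registers.length;
        double sum = 0;
        for (int r : registers) {
            sum += Math.pow(2, -r);  // accumulate 2^(-k_max_j)
        }
        double constant = 0.7213 / (1 + 1.079 / m); // correction factor for large m
        return constant * m * m / sum;
    }
}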

Application of HyperLogLog in Redis

In Redis, HyperLogLog is one of its advanced data structures. It provides, among others, the following two commands:

pfadd key value: add a value to the HyperLogLog stored at key.

pfcount key: return the approximate number of distinct values added to key.

Recall the original problem of counting users on the APP page. If the key is the page name and the value is the user id, the problem maps onto these commands exactly.
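For example, counting the main page's distinct visitors might look like this in redis-cli (the ids are illustrative):

127.0.0.1:6379> pfadd main id1
(integer) 1
127.0.0.1:6379> pfadd main id2
(integer) 1
127.0.0.1:6379> pfadd main id1
(integer) 0
127.0.0.1:6379> pfcount main
(integer) 2

Adding id1 a second time changes nothing, and pfcount returns the de-duplicated count.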

HyperLogLog principle in Redis

As noted, the Redis implementation has 16384 buckets, that is, 2^14 = 16384; each bucket has 6 bits, so the largest number each bucket can express is 2^5 + 2^4 + ... + 2^0 = 63, binary 111111.

For the command: pfadd key value

When a value is saved, it is hashed into 64 bits, that is, a 64-bit bit string. The first 14 bits are used to select which bucket the value's first-1 position will be stored in; in other words, the first 14 bits do the bucketing. Let index denote the position where the first 1 appears in the remaining bits. When index = 5, it looks like: ...10000 [01 0000 0000 0000]

The reason 14 bits are chosen for the bucket number is that there are 16384 buckets and 2^14 = 16384 exactly, so every bucket can be addressed with nothing wasted. Suppose the first 14 bits of a bit string are 00 0000 0000 0010 (read from right to left), with decimal value 2. Then index will be converted and stored in the bucket numbered 2.

Conversion rules for index:

First, since the complete value bit string has 64 bits, subtracting the 14 bucket bits leaves 50 bits. In the extreme case, the first 1 sits at the 50th bit, that is, the position is 50, so index = 50. Converting index to binary gives 110010.

Since there are 16384 buckets and each bucket consists of 6 bits, 110010 fits exactly and is stored in bucket 2. Note that 50 is already the worst case and it can be accommodated, so anything smaller can certainly be accommodated.

A pfadd key can be given multiple values. For example:

pfadd lgh golang
pfadd lgh python
pfadd lgh java

Following the procedure above, different values are set into different buckets. If two values land in the same bucket, that is, their first 14 bits are the same but the positions of their first subsequent 1 differ, then the stored index is compared with the new index: if the new index is larger, it replaces the old one; otherwise the old one is kept.
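In code form, this per-bucket rule is just a max update. A sketch in Java, with registers as an illustrative bucket array:

static void update(int[] registers, int bucket, int index) {
    // A bucket keeps only the largest first-1 position it has ever seen.
    if (index > registers[bucket]) {
        registers[bucket] = index;
    }
}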

In the end, the 16384 buckets corresponding to a key have recorded many values, and each bucket holds its own k_max. When pfcount is called, the estimation method described earlier computes approximately how many distinct values have been set for the key, that is, the statistic.

Each value is converted into a 64-bit bit string and eventually recorded in a bucket as described above. A 64-bit hash can take 2^64 distinct values, yet HyperLogLog uses only 16384 * 6 / 8 / 1024 = 12 KB of storage to count up to 2^64 items.

Bias correction

In the estimation formula, the constant is not a fixed value; it is set by branching on the actual configuration, as shown below. Note that the code folds the m^2 factor of the formula into constant.

Suppose m is the number of buckets and p is the base-2 logarithm of m.

switch (p) {
    case 4:
        constant = 0.673 * m * m;
        break;
    case 5:
        constant = 0.697 * m * m;
        break;
    case 6:
        constant = 0.709 * m * m;
        break;
    default:
        constant = (0.7213 / (1 + 1.079 / m)) * m * m;
}

The above is the principle of the HyperLogLog algorithm and how Redis applies it. If you have similar doubts, the analysis above should help you understand. If you want to learn more, you are welcome to follow the industry information channel.
