Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

Redis site Traffic Statistics how to use HyperLogLog

2025-03-31 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/01 Report--

This article will explain in detail how to use HyperLogLog on Redis site traffic statistics. The editor thinks it is very practical, so I share it for you as a reference. I hope you can get something after reading this article.

When we do site traffic statistics, we usually count the page UV (unique visitor: unique visitor) and PV (page views: page view). Then our most common processing method is for users to insert a piece of data into the database with one click, and query the table to achieve the effect of statistical traffic.

If we deal with it through redis, we can use the string type and then increase the count to achieve the statistical PV. The statistical UV can use set. Each user's id is unique and can be put into this collection. For statistics, you only need to execute scard to obtain the set size.

These two ways can achieve site traffic statistics, but if the site traffic is very large tens of millions of visits a day, then the database may be divided into tables and libraries, redis add enough data after the memory consumption is also very amazing, generally speaking, it is not cost-effective to do so, then our redis has a special algorithm to deal with such data, HyperLogLog.

Basic usage of HyperLogLog

HyperLogLog provides two instructions, pfadd and pfcount, which are easy to understand literally: one is to increase the count, and the other is to get the count. The use of sadd for pfadd and set collections is the same, and the use of pfcount and scard is the same, getting the count value directly.

HyperLogLog also provides a merge instruction pfmerge to add up two pf to form a new pf, which is very useful if we need to merge the traffic from each page.

> pfadd user mango (integer) 1 > pfadd user zhangsan (integer) 1 > pfadd user lisi (integer) 1 > pfadd user mango # repeat does not count (integer) 0 > pfcount user (integer) 3

> pfadd paper mango (integer) 1 > pfadd paper zhangsan (integer) 1 > pfmerge pv user paper # merge OK > pfcount pv (integer) 3

Note:

There is a certain error in the counting statistics of HyperLogLog, the maximum error is less than 1%.

HyperLogLog data structure needs to occupy the storage space of 12KB, so when we use it, we should be aware that if the amount of data is very small, we can choose other ways, but if the amount of data is very large, then our data structure is very valuable. However, we should not worry too much about its consumption. At the beginning, it was not so large. When the count was relatively small, he used sparse matrix for storage, and the space occupancy rate was still very small. Only after reaching a certain threshold will it be transformed into a dense matrix space to reach 12KB.

The Origin of the Command pf prefix

It's a little strange when we use this HyperLogLog command, because we all start with h when we use hash, hadd,set starts with sadd,zset, and starts with z, so why is our HyperLogLog a pf? Philippe Flajolet, the inventor of the data structure HyperLogLog, pf is his acronym. Principle of HyperLogLog implementation

HyperLogLog does not actually store the value of each element, it uses a probability algorithm to calculate the number of elements by storing the position of the first 1 of the element's hash value.

The random value we generated for the first time gets its binary, counting the number of consecutive zeros from right to left, and each time we get the probability of 1 and 0, the probability of occurrence of 1 and 0 in any round is (1) ^ k, so it can be inferred that n = 2 ^ k, but generally speaking, the value of n will wander between 2 ^ k and 2 ^ k, in order to make this algorithm more accurate, the concept of bucket is introduced. Calculate the weighted average of m barrels so that you can get a more accurate answer (in fact, other corrections are needed). The final formula is shown in the figure

Where m is the number of buckets, const is the normal number, its value will change according to m. P=log2m

Switch (p) {case 4: constant = 0.673 * m * m; case 5: constant = 0.697 * m * m; case 6: constant = 0.709 * m * m; default: constant = (0.7213 / (1 + 1.079 / m)) * m * m;}

There are three HyperLogLog-related commands:

Pfadd hll ele: add ele to the cardinality calculation of hll. Process:

First get hash from ele (using an algorithm called MurMurHash)

Take the lower 14 bits of hash (because there are a total of 2 to the power of 14 barrels) as the barrel number, select the bucket, and note that the current value in the bucket is count

Count zeros from the 15th bit of hash, assuming that there are n consecutive zeros (that is, leading zeros) from the 15th bit of

If n is greater than count, set the value of the selected bucket to n, otherwise it will not change

Pfcount hll: calculates the cardinality of hll. Is to use the DV formula given above to calculate the cardinality according to the values in the bucket.

Pfmerge hll3 hll1 hll2: merge hll1 and hll2 into hll3. The merge method is to merge all the HyperLogLog objects into an object called max. Max uses a dense storage structure. If the merged object is also a dense storage structure, each count value is looped and the larger one is stored in the max.

All HyperLogLog structures of Redis have fixed 16384 barrels (2 to the power of 14) and have two storage formats:

Sparse format: at the beginning of the HyperLogLog algorithm, most buckets are actually 0, and the sparse format greatly reduces the amount of memory needed at the beginning of HyperLogLog by storing the number of consecutive zeros instead of saving each 0 once.

Compact format: use 6 bit to represent a bucket, which takes up 12KB memory

Why is the memory footprint of pf 12k?

We use 1024 buckets for independent counting in the above algorithm, but in the HyperLogLog implementation of Redis, we use 16384 buckets, that is, 2 ^ 14. The maxbits of each bucket needs 6 bits to store, and the maximum can represent maxbits=63, so the total memory consumption is 2 ^ 14 * 6x8 = 12k bytes. On "Redis site traffic statistics HyperLogLog how to use" this article is shared here, I hope the above content can be of some help to you, so that you can learn more knowledge, if you think the article is good, please share it out for more people to see.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report