2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article shares how to use HyperLogLog in spark-alchemy. The editor finds it very practical and shares it here for your reference; I hope you will get something out of it.
Reaggregation Challenge
Reaggregation rests on one prerequisite: the pre-computed values must themselves be aggregable again. Sums, maximums, and minimums can all be reaggregated, but distinct count cannot, because of double counting: the sum of each site's distinct visitors does not equal the number of distinct visitors across all sites, since a single visitor can visit multiple sites.
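To make this concrete, here is a minimal Python sketch (the site names and visitor ids are invented for illustration) showing that sums reaggregate cleanly while distinct counts do not:

```python
# Per-site pre-aggregated data: page views (a sum) and the set of visitor ids.
# Site names and visitor ids are made up for illustration.
site_views = {"news": 120, "sports": 80}
site_visitors = {
    "news": {"alice", "bob", "carol"},
    "sports": {"bob", "carol", "dave"},
}

# Sums reaggregate: the total is just the sum of the per-site sums.
total_views = sum(site_views.values())                        # 200

# Distinct counts do not: summing per-site distinct visitors double-counts
# anyone who visited more than one site.
naive_total = sum(len(v) for v in site_visitors.values())     # 6
true_total = len(set.union(*site_visitors.values()))          # 4 (bob and carol counted once)
```

The naive reaggregation overcounts by exactly the visitors shared between sites, which is why a correct distinct count normally has to go back to the raw rows.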
Because distinct count is not reaggregable, a system that computes it must access the finest-grained data: every query has to scan every row.
In the big data field, distinct count has a further problem: the memory required during computation is proportional to the number of distinct values in the result set. To avoid these problems, in recent years big data platforms such as Apache Spark and analytics-oriented databases such as Amazon Redshift have introduced cardinality estimation, which uses the HyperLogLog (HLL) algorithm to estimate distinct counts. To use it in Spark, simply replace COUNT(DISTINCT x) with approx_count_distinct(x [, rsd]), where the optional parameter rsd is the tolerable relative error. In Databricks' test report, HLL aggregation runs 2-8x faster than exactly computing the distinct count while the error rate stays above 1%. Users can ask for higher precision, but then the HLL run time may exceed that of the exact computation.
The price of the 2-8x speedup is an error rate that is always above 1%, which some scenarios cannot accept. And what can we do about the fact that 2-8x is negligible next to the 1000x improvement that pre-aggregation can bring? Let's first look at the HLL algorithm itself.
HyperLogLog algorithm
In Spark's processing flow, the HLL algorithm breaks down into the following stages:
Map stage
Initialize the HLL sketch data structure
Add each input value to its HLL sketch
Emit the HLL sketches
Reduce
Merge all HLL sketches into an aggregate sketch
Finalize
Compute the estimated distinct count from the aggregate sketch
The key point is that HLL sketches are reaggregable: merging sketches in the reduce phase yields another HLL sketch. Users can serialize that sketch and persist it as part of a pre-aggregation, providing the input for subsequent distinct-count queries; this is what brings the 1000x improvement.
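The map/reduce/finalize flow above can be sketched in pure Python. This is a toy HyperLogLog, not spark-alchemy's implementation: the register count, hash function, and bias correction are simplified assumptions, but it shows init, add, merge (the reaggregation step), and the final cardinality estimate:

```python
import hashlib
import math

P = 12                              # 2^12 = 4096 registers (a toy choice)
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)    # standard HLL bias-correction constant

def hll_init():
    """Map stage: initialize the HLL sketch data structure."""
    return [0] * M

def hll_add(sketch, value):
    """Map stage: fold one input value into the sketch."""
    h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
    idx = h & (M - 1)                     # low P bits choose a register
    w = h >> P                            # remaining 64 - P bits
    rank = (64 - P) - w.bit_length() + 1  # leading zeros + 1
    sketch[idx] = max(sketch[idx], rank)

def hll_merge(a, b):
    """Reduce stage: merging two sketches yields another sketch (reaggregable)."""
    return [max(x, y) for x, y in zip(a, b)]

def hll_cardinality(sketch):
    """Finalize: estimate the distinct count from the aggregate sketch."""
    est = ALPHA * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if est <= 2.5 * M and zeros:          # small-range (linear counting) correction
        est = M * math.log(M / zeros)
    return round(est)

# Two overlapping "partitions", as a map/reduce job would produce them.
s1, s2 = hll_init(), hll_init()
for i in range(50_000):
    hll_add(s1, i)
for i in range(30_000, 80_000):
    hll_add(s2, i)

merged = hll_merge(s1, s2)
estimate = hll_cardinality(merged)        # true distinct count is 80,000
```

Because merging is just an element-wise max, a sketch serialized and persisted during pre-aggregation can be merged with future sketches at query time without ever revisiting the raw rows, which is the property the 1000x claim rests on.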
This approach has another advantage: it lets users drive the error rate below 1%. Because pre-aggregation brings a 1000x improvement, we can afford to spend more time computing the HLL sketches to achieve a smaller error. Spending 2-5x longer in the pre-aggregation phase is acceptable for most users: the performance gain comes with almost no other sacrifice.
Spark-Alchemy introduction: HLL function
Because the Spark community does not expose HLL sketches to users, Swoop open-sourced these functions as part of the spark-alchemy library; users can refer to the examples in its HLL documentation. spark-alchemy provides a richer feature set than, for example, BigQuery's HLL support.
The figure in the original post shows how the spark-alchemy HLL functions handle sketch initialization and aggregation (hll_init_agg), reaggregation (hll_merge), and presentation of the final result (hll_cardinality).
If users worry about the storage overhead of HLL sketches, a simple rule of thumb applies: doubling the precision (halving the error) quadruples the storage cost of the sketch. In most applications, the reduction in the number of stored rows far outweighs the extra cost of the HLL sketches.
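The rule of thumb follows from the standard HLL error formula, error ≈ 1.04 / √m for m registers. The sketch below estimates dense-sketch size under the simplifying assumptions that m is rounded up to a power of two and each register takes about one byte:

```python
import math

def hll_registers(rsd):
    """Registers needed for a target relative standard deviation.

    Standard HLL error: rsd ~= 1.04 / sqrt(m); round m up to a power of two.
    """
    m = (1.04 / rsd) ** 2
    return 1 << math.ceil(math.log2(m))

def sketch_bytes(rsd):
    # Simplifying assumption: dense encoding, roughly one byte per register.
    return hll_registers(rsd)

for rsd in (0.04, 0.02, 0.01):
    print(f"rsd={rsd:.0%}: ~{sketch_bytes(rsd):,} bytes")
```

Because m scales with 1/rsd², halving the error from 2% to 1% takes the sketch from ~4 KB to ~16 KB, i.e. a 4x cost for a 2x precision gain.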
HyperLogLog interoperability
Switching between exact distinct counts and estimates, and persisting HLL sketches as column data, spares users from scanning every row at query time, although the system still has to read all rows once when building the sketches. Moreover, there is no unified standard for serializing HLL sketches, so sketches cannot be shared across different systems. This lack of interoperability raises the cost and complexity of interactive analysis systems.
Interactive analysis demands fast response times, which has never been a core design goal of big data systems; this is why interactive analysis still tends to run on relational or NoSQL databases. Without interoperable HLL sketches, users may fall back to their original interactive-query workflow.
To solve this problem, when developing its HLL functions spark-alchemy adopted a sketch storage format that is natively compatible with Postgres-compatible databases, so that Spark can serve as a unified data pre-processing platform for systems that require fast response times. The benefits of this architecture are as follows:
More than 99% of the data is managed through Spark, with no copies
More than 99% of the processing happens in Spark, including pre-aggregation
Interactive queries run faster and consume far fewer resources
The above is how to use HyperLogLog in spark-alchemy. The editor believes these are knowledge points you may see or use in daily work, and hopes you can learn more from this article. For more details, please follow the industry information channel.
© 2024 shulou.com SLNews company. All rights reserved.