This article describes how to use HyperLogLog functions in Spark; it is shared here as a practical reference.
The Challenge of Re-aggregation
Pre-aggregation is a powerful technique in data analysis, as long as the metric being computed is re-aggregable. Re-aggregable operations, as the name implies, are associative: their results can be aggregated further. Counts can be re-aggregated with SUM, minimums with MIN, and maximums with MAX. Distinct counts, however, are a special case that cannot be re-aggregated. For example, the sum of the distinct visitor counts of individual websites is not equal to the distinct count of visitors across all of those websites. The reason is simple: the same user may visit more than one website, so summing the per-site counts would count that user multiple times.
The fact that distinct counts are not re-aggregable has far-reaching consequences: computing a distinct count requires access to the data at its finest granularity. Put differently, a query that computes a distinct count must read every row of data.
When this problem meets big data, a new challenge arises: the memory required during computation is proportional to the number of distinct values. In recent years, big data systems such as Apache Spark and analytical databases such as Amazon Redshift have introduced approximate distinct counting, also known as cardinality estimation, implemented with the HyperLogLog (HLL) probabilistic data structure. To use approximate counting in Spark, simply replace COUNT(DISTINCT x) with approx_count_distinct(x [, rsd]), where the optional parameter rsd is the maximum allowed relative error, with a default of 5%. The HLL performance analysis published by Databricks shows that, as long as the allowed error is at least 1%, approximate counting is 2 to 8 times faster than exact counting. If a smaller error is required, however, approximate counting may take even longer than exact counting.
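As a minimal Scala sketch of the built-in approximation, the example below compares exact and approximate distinct counts. The DataFrame name visits, the column user_id, and the input path are assumptions made for illustration, not part of the original article.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{approx_count_distinct, countDistinct}

object ApproxDistinctExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hll-approx").getOrCreate()

    // Hypothetical dataset: one row per page view, with a user_id column.
    val visits = spark.read.parquet("/data/visits")

    // Exact distinct count: must read every row of the finest-grained data.
    visits.agg(countDistinct("user_id").as("exact_users")).show()

    // Approximate distinct count; the second argument is the maximum allowed
    // relative standard deviation (rsd). Omitting it uses the default of 5%.
    visits.agg(approx_count_distinct("user_id", 0.01).as("approx_users")).show()

    spark.stop()
  }
}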
An 8x performance improvement is considerable, but giving up accuracy (an error of 1% or more) may not be acceptable in some cases. Moreover, a 2x to 8x improvement is negligible next to the thousand-fold improvement that pre-aggregation can bring. So what can be done?
Revisiting the HyperLogLog algorithm, the answer in fact lies in the algorithm itself. The pseudocode for how Spark executes HLL in MapReduce style, partition by partition, is as follows:
Map (per partition)
  Initialize an HLL data structure, called an HLL sketch
  Add each input row to the sketch
  Emit the sketch
Reduce
  Merge all sketches into a single aggregate sketch
Finalize
  Compute the distinct count approximation from the aggregate sketch
It is worth noting that HLL sketches are themselves re-aggregable: the result of merging sketches in the reduce step is again an HLL sketch. If we can serialize sketches as data, we can persist them during the pre-aggregation phase, and later computations of the distinct count approximation can start from the persisted sketches, yielding a thousand-fold performance improvement!
In addition, this approach brings another, equally important benefit: we no longer have to trade estimation accuracy (an error of 1% or more) for performance. Because pre-aggregation yields a thousand-fold performance improvement, we can build HLL sketches with a very low estimation error; spending 2 to 5 times longer in the pre-aggregation phase is a small price for thousand-fold faster queries. This is about as close to a free lunch as big data gets: a huge performance improvement with no negative impact on most business users.
Introduction to spark-alchemy: Native HLL Functions
Because Spark does not provide the corresponding functions, Swoop open-sourced a high-performance toolkit of native HLL functions as part of the spark-alchemy project; see the HLL docs for concrete usage examples. It provides the most complete set of HyperLogLog processing tools in the big data field, surpassing even BigQuery's HLL support.
spark-alchemy covers initial aggregation (via hll_init_agg), re-aggregation (via hll_merge), and presentation (via hll_cardinality). Regarding the memory usage of HLL sketches, a useful rule of thumb is that every doubling of the HLL cardinality estimation precision requires four times as much memory for the sketch. In most scenarios, the benefit of working with far fewer rows of data greatly outweighs the additional storage taken up by the HLL sketches.
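A minimal Scala sketch of this workflow follows, reusing the same hypothetical visits dataset as above. The function names hll_init_agg, hll_merge, and hll_cardinality come from the article; the import path, grouping columns, and output paths are assumptions based on the spark-alchemy HLL docs and should be verified there.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
// Native HLL functions from the spark-alchemy project (import path per its HLL docs).
import com.swoop.alchemy.spark.expressions.hll.functions._

object HllPreAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hll-preagg").getOrCreate()

    // Same hypothetical page-view dataset as before.
    val visits = spark.read.parquet("/data/visits")

    // Initial aggregation: build one HLL sketch per (date, site) group and
    // persist the serialized sketches instead of the raw, finest-grained rows.
    visits
      .groupBy("date", "site")
      .agg(hll_init_agg(col("user_id")).as("user_hll"))
      .write.mode("overwrite").parquet("/data/visits_hll_daily")

    // Re-aggregation and presentation: merge the per-site sketches up to the
    // date level, then turn each merged sketch into an approximate count.
    spark.read.parquet("/data/visits_hll_daily")
      .groupBy("date")
      .agg(hll_merge(col("user_hll")).as("user_hll"))
      .select(col("date"), hll_cardinality(col("user_hll")).as("approx_users"))
      .show()

    spark.stop()
  }
}

Because the persisted sketches are re-aggregable, the same pre-aggregated table can serve many later roll-ups (by week, by month, across sites) without ever touching the raw rows again.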
HyperLogLog Interoperability
Replacing exact computation with approximate distinct counts and persisting HLL sketches as column data means the final query phase no longer needs to process each row of the finest-grained data. However, an implicit requirement remains: the system that uses the HLL data still needs access to all of the finest-grained data, because there is no industry standard for serializing HLL data structures. Most implementations, such as BigQuery's, use opaque, undocumented binary data, which makes cross-system interoperability difficult. This interoperability problem greatly increases the cost and complexity of interactive analysis systems.
A key requirement of interactive analysis systems is fast query response. This is not a core strength of big data systems such as Spark or BigQuery, so in many scenarios interactive analysis queries are served by relational or NoSQL databases instead. If HLL sketches offer no data-level interoperability, we are back to square one.
To solve this problem, the spark-alchemy project uses an open storage format for HLL sketches, with built-in support for Postgres-compatible databases and for JavaScript. This allows Spark to act as a universal data pre-processing platform for systems that need fast query responses, such as portals and dashboards (a serving-side sketch follows the list below). Such an architecture brings huge benefits:
99% of the data is managed only through Spark, with no duplication
99% of the data is processed by Spark during the pre-aggregation stage
Interactive query response times drop dramatically, as does the amount of data the serving system has to process
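To make the serving side concrete, here is a hedged Scala sketch of an interactive query answered directly from a Postgres-compatible database via JDBC, using the postgresql-hll extension's hll_union_agg and hll_cardinality functions. The table daily_user_hll, its columns, and the connection details are all assumptions for illustration; the article itself only states that spark-alchemy's sketches interoperate with Postgres-compatible databases.

import java.sql.DriverManager

object ServeApproxUsers {
  def main(args: Array[String]): Unit = {
    // Hypothetical connection details; daily_user_hll is assumed to be a table
    // Spark populated with an hll-typed column named user_hll, one row per (date, site).
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/analytics", "analytics", "secret")

    // Re-aggregation and presentation happen entirely inside Postgres using the
    // postgresql-hll extension: hll_union_agg merges sketches and
    // hll_cardinality turns the merged sketch into an approximate count.
    val sql =
      """SELECT date, hll_cardinality(hll_union_agg(user_hll)) AS approx_users
        |FROM daily_user_hll
        |GROUP BY date
        |ORDER BY date""".stripMargin

    val stmt = conn.createStatement()
    val rs = stmt.executeQuery(sql)
    while (rs.next()) {
      println(s"${rs.getDate("date")}: ~${rs.getDouble("approx_users").round} users")
    }
    rs.close(); stmt.close(); conn.close()
  }
}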