What Is Distinct Count Used For?

2025-03-31 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Distinct count is widely used but often poorly understood, so this article summarizes how to compute it in Hive, Pig, and Spark. The steps are laid out in detail and should serve as a practical reference.

Big data, an IT industry term, refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. It is a massive, fast-growing, and diverse information asset that requires new processing models to deliver stronger decision-making power, insight, and process optimization.

Hive

In big data reporting, an important metric is UV (Unique Visitor): the number of distinct users in a given period. For example, to see how users are distributed across apps in a given week, the HiveQL query is:

select app, count(distinct uid) as uv
from log_table
where week_cal = '2016-03-27'
group by app
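To make the semantics concrete, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database instead of Hive. The table and column names follow the query above; the sample rows are invented for illustration, and SQLite is only a stand-in, so this says nothing about how Hive executes the query.

```python
import sqlite3

# Hypothetical sample rows: (app, uid, week_cal). The data is invented.
rows = [
    ("news", "u1", "2016-03-27"),
    ("news", "u1", "2016-03-27"),   # repeat visit, must not be counted twice
    ("news", "u2", "2016-03-27"),
    ("music", "u1", "2016-03-27"),
    ("music", "u3", "2016-03-20"),  # different week, filtered out by the where clause
]

conn = sqlite3.connect(":memory:")
conn.execute("create table log_table (app text, uid text, week_cal text)")
conn.executemany("insert into log_table values (?, ?, ?)", rows)

# count(distinct uid) counts each uid once per app; group by app gives one row per app.
uv = dict(conn.execute(
    "select app, count(distinct uid) as uv "
    "from log_table "
    "where week_cal = '2016-03-27' "
    "group by app"
).fetchall())
conn.close()

print(uv)
```

The repeated ("news", "u1") row contributes only once, which is exactly the difference between count(distinct uid) and a plain count(uid).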

Pig

The equivalent in Pig:

-- all users
define DISTINCT_COUNT(A, a) returns dist {
    B = foreach $A generate $a;
    unique_B = distinct B;
    C = group unique_B all;
    $dist = foreach C generate SIZE(unique_B);
};
A = load '/path/to/data' using PigStorage() as (app, uid);
B = DISTINCT_COUNT(A, uid);

-- users per app
A = load '/path/to/data' using PigStorage() as (app, uid);
B = distinct A;
C = group B by app;
D = foreach C generate group as app, COUNT($1) as uv;
-- suitable for small cardinality scenarios
D = foreach C generate group as app, SIZE($1) as uv;

DataFu provides a cardinality-estimation UDF for Pig, datafu.pig.stats.HyperLogLogPlusPlus, which uses the HyperLogLog++ algorithm to compute an approximate distinct count more quickly:

define HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
A = load '/path/to/data' using PigStorage() as (app, uid);
B = group A by app;
C = foreach B generate group as app, HyperLogLogPlusPlus($1) as uv;
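For readers curious about what such an estimator does internally, the following is a rough sketch of plain HyperLogLog in Python. It is not DataFu's implementation and omits the bias corrections and sparse representation that distinguish HyperLogLog++. The class name, the default p=12, and the use of SHA-1 as the hash are all assumptions made for this illustration.

```python
import hashlib
import math
from collections import defaultdict


class HyperLogLog:
    """Plain HyperLogLog (not the ++ variant), for illustration only."""

    def __init__(self, p=12):
        self.p = p                       # number of bits used to pick a register
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash taken from SHA-1 (an arbitrary choice for this sketch).
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # rank = 1-based position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # small-range correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw


# One sketch per app, mirroring "group A by app" in the Pig script.
sketches = defaultdict(HyperLogLog)
for i in range(60_000):
    sketches["news"].add(f"user-{i % 20_000}")   # 20,000 distinct uids, each seen 3 times
for i in range(15_000):
    sketches["music"].add(f"user-{i % 5_000}")   # 5,000 distinct uids, each seen 3 times

estimates = {app: round(s.count()) for app, s in sketches.items()}
print(estimates)
```

Each sketch keeps only 4,096 small registers no matter how many uids it sees, which is why an estimator can be much cheaper than materializing every distinct value. The price is an estimate that is typically off by a percent or two at this register count.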

Spark

In Spark, after the data is loaded, a chain of RDD transformations (map, distinct, reduceByKey) produces the distinct count:

rdd.map { row => (row.app, row.uid) }
  .distinct()
  .map { line => (line._1, 1) }
  .reduceByKey(_ + _)

// or
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .mapValues { _ => 1 }
  .reduceByKey(_ + _)

// or
rdd.map { row => (row.app, row.uid) }
  .distinct()
  .map(_._1)
  .countByValue()
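The same three steps can be traced on a small in-memory list. This is a plain Python sketch of the logic only, with no Spark involved; the sample rows are invented, and each line is annotated with the RDD operation it stands in for.

```python
from collections import Counter

# Hypothetical log rows as (app, uid) pairs.
rows = [
    ("news", "u1"), ("news", "u1"), ("news", "u2"),
    ("music", "u1"), ("music", "u3"), ("music", "u3"),
]

pairs = set(rows)                       # distinct(): drop repeated (app, uid) pairs
ones = [(app, 1) for app, _ in pairs]   # map / mapValues: one marker per unique pair

uv = Counter()
for app, n in ones:                     # reduceByKey(_ + _): sum the markers per app
    uv[app] += n

# Third variant: keep only the key, then count occurrences (countByValue()).
uv_alt = Counter(app for app, _ in pairs)

print(dict(uv))
```

The deduplication has to happen on the full (app, uid) pair before counting. Deduplicating on uid alone would wrongly drop a user who appears in more than one app.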

Spark also provides an API for an approximate distinct count:

rdd.map { row => (row.app, row.uid) } .countApproxDistinctByKey(0.001)

The implementation is based on the HyperLogLog algorithm:

The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available here.
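The argument passed to countApproxDistinctByKey is the target relative standard deviation, and smaller values cost more memory. As a back-of-the-envelope sketch, the textbook HyperLogLog standard error is roughly 1.04 / sqrt(m) for m registers. The helper below (a hypothetical name, not part of Spark) inverts that formula; Spark's own mapping from relativeSD to register count may use slightly different constants.

```python
import math


def registers_for(relative_sd):
    """Smallest power-of-two register count m with 1.04 / sqrt(m) <= relative_sd.

    Returns (p, m) where m = 2**p. Uses the textbook HyperLogLog error
    estimate, which is an assumption here, not Spark's exact formula.
    """
    m_needed = (1.04 / relative_sd) ** 2
    p = math.ceil(math.log2(m_needed))
    return p, 1 << p


for sd in (0.05, 0.01, 0.001):
    p, m = registers_for(sd)
    print(f"relativeSD={sd}: p={p}, m={m}, achieved error={1.04 / math.sqrt(m):.5f}")
```

Under this estimate, a relativeSD of 0.001 calls for about two million registers per key (p=21), while 0.01 needs only 16,384. When there are many keys, a looser target is usually the more practical choice.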

Alternatively, convert the RDD (which carries a schema) to a DataFrame, register it as a temporary table with registerTempTable, and run a SQL query:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // brings rdd.toDF() into scope

val df = rdd.toDF()
df.registerTempTable("app_table")
val appUsers = sqlContext.sql("select app, count(distinct uid) as uv from app_table group by app")
