Updated 2025-03-31. From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
Many people are unfamiliar with the details of "Distinct Count", so this article summarizes how to compute it in Hive, Pig, and Spark, using both exact and approximate methods.
Big data is an IT industry term for data sets that conventional software tools cannot capture, manage, and process within an acceptable time frame. Such data is massive, fast-growing, and diverse, and it takes new processing models to turn it into stronger decision-making, insight, and process optimization.
Hive
In big data scenarios, an important report metric is UV (Unique Visitors), that is, the number of distinct users in a given period. For example, to get the number of distinct users of each app in a given week, the HiveQL is:
    select app, count(distinct uid) as uv
    from log_table
    where week_cal = '2016-03-27'
    group by app
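To make the semantics of the query concrete, here is a minimal Scala sketch of an exact per-app distinct count on an in-memory collection. The `ExactUv` object, the `LogRow` case class, and the sample rows are hypothetical and only mirror the `app` and `uid` columns used above; this is an illustration, not how Hive executes the query.

```scala
// Minimal sketch: exact distinct count per app on an in-memory collection.
// ExactUv and LogRow are hypothetical names used for illustration only.
object ExactUv {
  final case class LogRow(app: String, uid: String)

  // Same idea as: select app, count(distinct uid) ... group by app
  def uvByApp(rows: Seq[LogRow]): Map[String, Int] =
    rows
      .groupBy(_.app) // one bucket of rows per app
      .map { case (app, rs) =>
        app -> rs.map(_.uid).toSet.size // a Set drops duplicate uids
      }
}
```

For example, `ExactUv.uvByApp(Seq(ExactUv.LogRow("a", "u1"), ExactUv.LogRow("a", "u1"), ExactUv.LogRow("b", "u1")))` counts one user for each app. Keeping every distinct uid in memory is what makes exact counting expensive at scale.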
Pig
Similarly, in Pig:
    -- all users
    define DISTINCT_COUNT(A, a) returns dist {
        B = foreach $A generate $a;
        unique_B = distinct B;
        C = group unique_B all;
        $dist = foreach C generate SIZE(unique_B);
    }
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = DISTINCT_COUNT(A, uid);

    -- users per app
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = distinct A;
    C = group B by app;
    D = foreach C generate group as app, COUNT($1) as uv;
    -- suitable for small cardinality scenarios
    D = foreach C generate group as app, SIZE($1) as uv;
DataFu provides the UDF datafu.pig.stats.HyperLogLogPlusPlus for cardinality estimation in Pig. It uses the HyperLogLog++ algorithm to compute an approximate Distinct Count faster:
    define HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = group A by app;
    C = foreach B generate group as app, HyperLogLogPlusPlus($1) as uv;
Spark
In Spark, after the data is loaded, Distinct Count is computed through a series of RDD transformations (map, distinct, reduceByKey):
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .map { line => (line._1, 1) }
      .reduceByKey(_ + _)

    // or
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .mapValues { _ => 1 }
      .reduceByKey(_ + _)

    // or
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .map(_._1)
      .countByValue()
Spark also provides an API for approximate Distinct Count:
    rdd.map { row => (row.app, row.uid) }
      .countApproxDistinctByKey(0.001)
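The argument 0.001 is the target relative standard deviation (Spark names this parameter `relativeSD`). As a rough guide to what such an accuracy costs, the sketch below assumes the textbook HyperLogLog error estimate of about 1.04 / sqrt(m) for m registers and computes the smallest power-of-two register count that meets a given target. `HllSizing` and its method names are hypothetical, and the precision Spark actually picks internally may differ from this estimate.

```scala
// Rough sizing sketch, assuming the textbook HyperLogLog error of ~1.04 / sqrt(m).
// HllSizing is a hypothetical helper, not part of Spark.
object HllSizing {
  // Smallest precision p (number of index bits) with 1.04 / sqrt(2^p) <= relativeSD.
  def precisionFor(relativeSD: Double): Int = {
    require(relativeSD > 0.0 && relativeSD < 1.0, "relativeSD must be in (0, 1)")
    val minRegisters = math.pow(1.04 / relativeSD, 2)
    math.ceil(math.log(minRegisters) / math.log(2.0)).toInt
  }

  // Number of registers for a given precision: m = 2^p.
  def registers(p: Int): Long = 1L << p
}
```

Under this assumption, a tighter target such as 0.001 needs far more registers per key than a looser one such as 0.05, so the accuracy argument is a direct trade-off against memory.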
The implementation is based on the HyperLogLog algorithm:
The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available here.
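To give a feel for how this family of estimators works, here is a minimal sketch of the basic HyperLogLog algorithm in plain Scala. It is not HyperLogLog++, and it is not the code used by Spark, streamlib, or DataFu: the class name `SimpleHll`, the 32-bit MurmurHash3 string hash, and the default precision of 12 are all assumptions made for illustration, and the large-range correction is omitted.

```scala
import scala.util.hashing.MurmurHash3

// Minimal HyperLogLog sketch for illustration only (hypothetical class).
// p index bits give m = 2^p registers; standard error is roughly 1.04 / sqrt(m).
final class SimpleHll(p: Int = 12) {
  require(p >= 7 && p <= 16, "this sketch supports 7 <= p <= 16")

  private val m = 1 << p
  private val registers = new Array[Byte](m)
  // Bias-correction constant, valid for m >= 128.
  private val alpha = 0.7213 / (1.0 + 1.079 / m)

  def add(value: String): Unit = {
    val h = MurmurHash3.stringHash(value) // 32-bit hash of the value
    val idx = h >>> (32 - p)              // top p bits select a register
    val rest = h << p                     // remaining 32 - p bits, left-aligned
    // Rank = 1-based position of the first 1-bit in the remaining bits,
    // capped when those bits are all zero.
    val rank = math.min(Integer.numberOfLeadingZeros(rest) + 1, 32 - p + 1)
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  def estimate: Double = {
    val sum = registers.map(r => math.pow(2.0, -r.toDouble)).sum
    val raw = alpha * m * m / sum
    val zeros = registers.count(_ == 0)
    // Small-range correction: use linear counting while registers are still empty.
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
  }
}
```

Each register only keeps the largest rank it has seen, so adding the same uid again never changes the estimate, and the memory used stays fixed at m bytes no matter how many users are counted.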
Alternatively, after converting an RDD that has a schema to a DataFrame, you can call registerTempTable and then run a SQL query:
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // brings rdd.toDF() into scope
    val df = rdd.toDF()
    df.registerTempTable("app_table")
    val appUsers = sqlContext.sql("select app, count(distinct uid) as uv from app_table group by app")
That covers the main ways to compute Distinct Count in Hive, Pig, and Spark.
© 2024 shulou.com SLNews company. All rights reserved.