Many inexperienced readers don't know how to implement this kind of aggregation in Spark, so this article summarizes the problem and its solution. I hope that after reading it, you can solve the same problem yourself.
An Internet company interview question: count each user's total number of visits, and the number of visits after removing duplicate URLs. Randomly create a few rows of sample data with four fields, id, name, vtm, url (the vtm field is unused; never mind it):

```
id1,user1,2,http://www.hupu.com
id1,user1,2,http://www.hupu.com
id1,user1,3,http://www.hupu.com
id1,user1,100,http://www.hupu.com
id2,user2,2,http://www.hupu.com
id2,user2,1,http://www.hupu.com
id2,user2,50,http://www.hupu.com
id2,user2,2,http://touzhu.hupu.com
```

Given this data set, we can write an HQL implementation:

```sql
select id, name, count(0) as ct, count(distinct url) as urlcount
from table
group by id, name;
```

The result should be:

```
id1,user1,4,1
id2,user2,4,2
```

The task is to implement this aggregation in Spark. To briefly describe the MR parsing of the problem: in the map stage, id and name are combined into the key and url is the value; in the reduce stage, the visit count is len(urls) and the de-duplicated count is len(set(urls)). Because I don't write MR, the interview got awkward. I wanted to show off by writing it in Spark instead, and found that genuinely difficult, because there were indeed many functions I wasn't familiar with.
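Before the full program, a minimal warm-up sketch of that map/reduce outline in Spark may help. This sketch is my own illustration, not the interview answer; it assumes the sample data above has been saved to the same local test.txt used later. It uses groupByKey for directness, even though that ships every url to the reducer:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Warm-up sketch: the MR outline translated literally to Spark.
object MrStyleDemo {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("mrstyle").setMaster("local"))
    val data = sc.textFile("/Users/jiangzl/Desktop/test.txt")

    val byUser = data.map { line =>
      val r = line.split(",")
      ((r(0), r(1)), r(3))           // map stage: key = (id, name), value = url
    }
    val result = byUser.groupByKey().mapValues { urls =>
      (urls.size, urls.toSet.size)   // reduce stage: (len(urls), len(set(urls)))
    }
    result.collect.foreach(println)  // ((id1,user1),(4,1)) and ((id2,user2),(4,2))
    sc.stop()
  }
}
```

The full program below replaces groupByKey with aggregateByKey, which builds the per-key accumulator incrementally instead of materializing each group.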
The full code is as follows:
```scala
import org.apache.spark.SparkContext._
import org.apache.spark._

object SparkDemo2 {
  def main(args: Array[String]) {
    case class User(id: String, name: String, vtm: String, url: String)
    // val rowkey = (new RowKey).evaluate(_)
    // val HADOOP_USER = "hdfs"                            // user name used to access spark
    // System.setProperty("user.name", HADOOP_USER)
    // System.setProperty("HADOOP_USER_NAME", HADOOP_USER) // user name used to access hadoop

    val conf = new SparkConf().setAppName("wordcount").setMaster("local")
    // .setExecutorEnv("HADOOP_USER_NAME", HADOOP_USER)
    val sc = new SparkContext(conf)

    val data = sc.textFile("/Users/jiangzl/Desktop/test.txt")

    // Parse each CSV line into a User record.
    val rdd1 = data.map(line => {
      val r = line.split(",")
      User(r(0), r(1), r(2), r(3))
    })

    // Key by (id, name), matching the GROUP BY columns of the HQL.
    val rdd2 = rdd1.map(r => ((r.id, r.name), r))

    // seqOp folds one User into a partition-local accumulator of
    // (visit count, list of urls seen so far).
    val seqOp = (a: (Int, List[String]), b: User) => a match {
      case (0, List()) => (1, List(b.url))
      case _           => (a._1 + 1, b.url :: a._2)
    }

    // combOp merges two partial accumulators across partitions.
    val combOp = (a: (Int, List[String]), b: (Int, List[String])) => {
      (a._1 + b._1, a._2 ::: b._2)
    }

    println("----------------------------------------")
    // For each key: (key, total count, number of distinct urls).
    val rdd3 = rdd2.aggregateByKey((0, List[String]()))(seqOp, combOp).map(a => {
      (a._1, a._2._1, a._2._2.distinct.length)
    })
    rdd3.collect.foreach(println)
    println("----------------------------------------")

    sc.stop()
  }
}
```
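The pivotal call here is aggregateByKey, the function I wasn't familiar with. To restate its contract (my summary, not part of the original post): it takes a zero value for the per-key accumulator, a seqOp that folds one record into the accumulator within a partition, and a combOp that merges accumulators across partitions. Here is a minimal self-contained sketch with hypothetical inline data instead of test.txt, using a Set so the accumulator stays de-duplicated as it grows:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of aggregateByKey in isolation: per key, compute
// (total count, distinct url count). The inline data is hypothetical.
object AggregateByKeyDemo {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("aggdemo").setMaster("local"))
    val visits = sc.parallelize(Seq(
      ("id1", "http://www.hupu.com"),
      ("id1", "http://www.hupu.com"),
      ("id2", "http://touzhu.hupu.com")
    ))
    val agg = visits.aggregateByKey((0, Set[String]()))(
      // seqOp: fold one url into a partition-local (count, urls) accumulator
      (acc, url) => (acc._1 + 1, acc._2 + url),
      // combOp: merge two partial accumulators from different partitions
      (a, b) => (a._1 + b._1, a._2 ++ b._2)
    )
    // Prints (id1,(2,1)) and (id2,(1,1)).
    agg.mapValues { case (ct, urls) => (ct, urls.size) }.collect.foreach(println)
    sc.stop()
  }
}
```

Using a Set instead of the List in the main program keeps the accumulator no larger than the number of distinct urls, at the cost of doing the de-duplication during the fold rather than once at the end.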
Running this at first failed with a Scala version problem:

```
Exception in thread "main" java.lang.NoSuchMethodError: scala.runtime.VolatileObjectRef.zero()Lscala/runtime/VolatileObjectRef;
```

The fix is to change the Scala version from 2.11.7 to 2.10.4. The prebuilt Spark 1.4.1 distribution is compiled against Scala 2.10, so a jar built with Scala 2.11 hits this binary incompatibility at runtime.
simple.sbt:
Name: = "SparkDemo Project" version: = "1.0" scalaVersion: =" 2.11.7 "libraryDependencies + =" org.apache.spark "%" spark-core "%" 1.4.1 "--modified to:-- name: =" SparkDemo Project "version: = "1.0" scalaVersion: = "2.10.4" libraryDependencies + = "org.apache.spark"% "spark-core"% "1.4.1"
Running process
With the Scala 2.11 build:

```
jiangzhongliandeMacBook-Pro:spark-1.4.1-bin-hadoop2.6 jiangzl$ ./bin/spark-submit --class "SparkDemo2" ~/Desktop/tmp/target/scala-2.11/simple-project_2.11-1.0.jar
Exception in thread "main" java.lang.NoSuchMethodError: scala.runtime.VolatileObjectRef.zero()Lscala/runtime/VolatileObjectRef;
	at SparkDemo2$.main(tmp_spark.scala)
	at SparkDemo2.main(tmp_spark.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

After the modification, with the Scala 2.10 build:

```
jiangzhongliandeMacBook-Pro:spark-1.4.1-bin-hadoop2.6 jiangzl$ ./bin/spark-submit --class "SparkDemo2" ~/Desktop/tmp/target/scala-2.10/sparkdemo-project_2.10-1.0.jar
16/04/29 12:40:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
----------------------------------------
((id1,user1),4,1)
((id2,user2),4,2)
----------------------------------------
```

After reading the above, have you mastered how to implement the aggregation function in Spark? Thank you for reading!
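Postscript: for comparison, here is a sketch of the same aggregation through the DataFrame API that ships with Spark 1.4. This is my own addition, not part of the interview answer; it expresses the original HQL almost verbatim with groupBy plus count and countDistinct, assuming the same test.txt:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{count, countDistinct}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: the interview's HQL expressed with the DataFrame API (Spark 1.4+).
object SqlStyleDemo {
  // Defined outside main so Scala 2.10 can derive the TypeTag that toDF needs.
  case class Visit(id: String, name: String, vtm: String, url: String)

  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("sqlstyle").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.textFile("/Users/jiangzl/Desktop/test.txt").map { line =>
      val r = line.split(",")
      Visit(r(0), r(1), r(2), r(3))
    }.toDF()

    // select id, name, count(0) as ct, count(distinct url) as urlcount
    // from table group by id, name
    df.groupBy("id", "name")
      .agg(count($"url").as("ct"), countDistinct($"url").as("urlcount"))
      .show()

    sc.stop()
  }
}
```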