Spark MaprLab-Auction Data Example Analysis

This article explains the "Spark MaprLab-Auction Data" example. The material is simple and clear; follow the steps below to set up the environment, load the auction data into HDFS, and analyze it in spark-shell.
I. Environment installation
1. Install Hadoop
2. Install Spark
3. Start Hadoop
4. Start Spark (a minimal start-up sketch follows this list)
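The sketch below is a quick reference only, assuming a standalone setup with the Spark layout used later in this article; <hadoop-home> is a placeholder, so adjust the paths to your own installation.

[hadoop@hftclclw0001 ~]$ <hadoop-home>/sbin/start-dfs.sh               # start HDFS (NameNode and DataNodes)
[hadoop@hftclclw0001 ~]$ <hadoop-home>/sbin/start-yarn.sh              # start YARN (only needed for Spark on YARN)
[hadoop@hftclclw0001 ~]$ spark-1.5.1-bin-hadoop2.6/sbin/start-all.sh   # start the Spark standalone master and workers
[hadoop@hftclclw0001 ~]$ jps                                           # verify the daemons are running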
II. Example analysis
1. Data preparation
Download DEV360DATA.zip from the MapR website and upload it to the server.
[hadoop@hftclclw0001 spark-1.5.1-bin-hadoop2.6]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6
[hadoop@hftclclw0001 spark-1.5.1-bin-hadoop2.6]$ cd test-data/DEV360Data
[hadoop@hftclclw0001 DEV360Data]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6/test-data/DEV360Data
[hadoop@hftclclw0001 DEV360Data]$ ll
total 337940
-rwxr-xr-x 1 hadoop root    575014 Jun 24 16:18 auctiondata.csv   <== the test data used in this example
-rw-r--r-- 1 hadoop root  57772855 Aug 18 20:11 sfpd.csv
-rwxrwxrwx 1 hadoop root 287692676 Jul 26 20:39 sfpd.json
[hadoop@hftclclw0001 DEV360Data]$ more auctiondata.csv
8213034705,95,2.927373,jake7870,0,95,117.5,xbox,3
8213034705,115,2.943484,davidbresler2,1,95,117.5,xbox,3
8213034705,100,2.951285,gladimacowgirl,58,95,117.5,xbox,3
8213034705,117.5,2.998947,daysrus,10,95,117.5,xbox,3
8213060420,2,0.065266,donnie4814,5,1,120,xbox,3
8213060420,15.25,0.123218,myreeceyboy,52,1,120,xbox,3
...
# The columns are:
# auctionid, bid, bidtime, bidder, bidderrate, openbid, price, itemtype, daystolive

# Upload the data to HDFS
[hadoop@hftclclw0001 DEV360Data]$ hdfs dfs -mkdir -p /spark/exer/mapr
[hadoop@hftclclw0001 DEV360Data]$ hdfs dfs -put auctiondata.csv /spark/exer/mapr
[hadoop@hftclclw0001 DEV360Data]$ hdfs dfs -ls /spark/exer/mapr
Found 1 items
-rw-r--r-- 2 hadoop supergroup 575014 2015-10-29 06:17 /spark/exer/mapr/auctiondata.csv
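For reference, the nine columns can be pinned down as a Scala case class. This is a hypothetical sketch, not part of the original lab: the Auction class and the parseAuction helper are named here for illustration, and the field types are inferred from the sample rows above.

// Hypothetical schema for auctiondata.csv, inferred from the sample rows above
case class Auction(
  auctionid: String,   // id of the auction
  bid: Float,          // amount of the bid
  bidtime: Float,      // time of the bid, in days from the auction start
  bidder: String,      // bidder's user name
  bidderrate: Int,     // bidder's rating
  openbid: Float,      // opening bid
  price: Float,        // final selling price
  itemtype: String,    // item type, e.g. xbox
  daystolive: Int      // length of the auction in days
)

// Parse one CSV line (assumes well-formed input with no quoted commas)
def parseAuction(line: String): Auction = {
  val c = line.split(",")
  Auction(c(0), c(1).toFloat, c(2).toFloat, c(3), c(4).toInt,
          c(5).toFloat, c(6).toFloat, c(7), c(8).toInt)
}

With this in hand, sc.textFile(...).map(parseAuction) would give an RDD[Auction], so the tasks below could use field names instead of positional indices; the original lab sticks to positional indices, which is what the transcripts below show.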
2. Run spark-shell and analyze the data in Scala, working through the following tasks:
a. How many items were sold?
b. How many bids per item type?
c. How many different kinds of item type?
d. What was the minimum number of bids?
e. What was the maximum number of bids?
f. What was the average number of bids?
[hadoop@hftclclw0001 spark-1.5.1-bin-hadoop2.6]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6
[hadoop@hftclclw0001 spark-1.5.1-bin-hadoop2.6]$ ./bin/spark-shell
......
# Load the data from HDFS first to create an RDD
scala> val originalRDD = sc.textFile("/spark/exer/mapr/auctiondata.csv")
...
# Look at the type of originalRDD: RDD[String], which can be thought of as an array of lines, Array[String]
scala> originalRDD
res26: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

# Split each line on "," using map
scala> val auctionRDD = originalRDD.map(_.split(","))
# The type of auctionRDD is RDD[Array[String]]: still array-like, but each element is itself an Array[String]
scala> auctionRDD
res17: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:23
a. How many items were sold?
==> val count = auctionRDD.map(bid => bid(0)).distinct().count()
Deduplicate on auctionid: split each record on ",", take the auctionid column, remove the duplicates, then count.
# Take the first column, i.e. the auctionid, again with map.
# To understand the next line: auctionRDD is RDD[Array[String]], so the argument passed to map
# is an Array[String]; auctionid is the first element of that array, hence _(0) -- note the (), not [].
scala> val auctionidRDD = auctionRDD.map(_(0))
...
# The type of auctionidRDD is RDD[String]: think of it as an Array[String] holding every auctionid
scala> auctionidRDD
res27: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at map at <console>:26
# Deduplicate auctionidRDD
scala> val auctionidDistinctRDD = auctionidRDD.distinct()
# Count
scala> auctionidDistinctRDD.count()
...
b. How many bids per item type?
==> auctionRDD.map(bid => (bid(7),1)).reduceByKey((x,y) => x + y).collect()
# For each row, map to (itemtype, 1): the itemtype column is at index 7.
# Think of the output as an RDD of (String, Int) pairs.
scala> auctionRDD.map(bid => (bid(7), 1))
res30: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[26] at map at <console>:26
...
# reduceByKey reduces the values key by key:
# (xbox,1) (xbox,1) (xbox,1) ... (xbox,1) ==> reduceByKey ==> (xbox, (((1+1)+1)+...+1))
scala> auctionRDD.map(bid => (bid(7), 1)).reduceByKey((x, y) => x + y)
# Again (String, Int) pairs: the String is the itemtype, the Int is now the total bid count for that itemtype
res31: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[28] at reduceByKey at <console>:26
# collect() brings the result back as an Array
scala> auctionRDD.map(bid => (bid(7), 1)).reduceByKey((x, y) => x + y).collect()
res32: Array[(String, Int)] = ...
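The original write-up stops after task b, so here is a minimal sketch of tasks c through f in the same spark-shell style. The intermediate names bidsPerAuction and bidCounts are introduced here for illustration; the approach (count the bids on each auctionid, then aggregate those counts) follows directly from the task statements.

// c. How many different kinds of item type?
scala> auctionRDD.map(bid => bid(7)).distinct().count()

// d/e/f. First count the bids on each auction, then aggregate the counts.
scala> val bidsPerAuction = auctionRDD.map(bid => (bid(0), 1)).reduceByKey((x, y) => x + y)
scala> val bidCounts = bidsPerAuction.map(_._2)
scala> bidCounts.min()    // d. minimum number of bids on any one auction
scala> bidCounts.max()    // e. maximum number of bids
scala> bidCounts.mean()   // f. average number of bids

As with the examples above, these are best verified by running them in spark-shell yourself.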