How to run the wordcount project on Spark platform by command line

2025-01-30 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article gives a detailed walkthrough of how to run the wordcount project on the Spark platform from the command line, hoping to help readers who want to solve this problem find a simpler, easier approach.

Created by Wang, Jerry, last modified on Sep 22, 2015

Stand-alone mode, that is, local mode

Local mode is very simple to run: assuming the current directory is $SPARK_HOME, just run the following command:

MASTER=local bin/spark-shell

"MASTER=local" indicates that you are currently running in stand-alone (local) mode.

scala> val textFile = sc.textFile("README.md")

scala> val textFile = sc.textFile("jerry.test")

15-08-08 19:14:32 INFO MemoryStore: ensureFreeSpace(182712) called with curMem=664070, maxMem=278302556
15-08-08 19:14:32 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 178.4 KB, free 264.6 MB)
15-08-08 19:14:32 INFO MemoryStore: ensureFreeSpace(17237) called with curMem=846782, maxMem=278302556
15-08-08 19:14:32 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 16.8 KB, free 264.6 MB)
15-08-08 19:14:32 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:37219 (size: 16.8 KB, free: 265.3 MB)
15-08-08 19:14:32 INFO SparkContext: Created broadcast 7 from textFile at <console>:21

textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at textFile at <console>:21

Then:

scala> textFile.filter(_.contains("Spark")).count

Or:

scala> textFile.flatMap(_.split(" ")).map((_, 1))

15-08-08 19:16:27 INFO FileInputFormat: Total input paths to process: 1
15-08-08 19:16:27 INFO SparkContext: Starting job: count at <console>:24
15-08-08 19:16:27 INFO DAGScheduler: Got job 0 (count at <console>:24) with 1 output partitions (allowLocal=false)
15-08-08 19:16:27 INFO DAGScheduler: Final stage: ResultStage 0 (count at <console>:24)
15-08-08 19:16:27 INFO DAGScheduler: Parents of final stage: List()
15-08-08 19:16:27 INFO DAGScheduler: Missing parents: List()
15-08-08 19:16:27 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at <console>:24), which has no missing parents
15-08-08 19:16:27 INFO MemoryStore: ensureFreeSpace(3184) called with curMem=156473, maxMem=278302556
15-08-08 19:16:27 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 265.3 MB)
15-08-08 19:16:27 INFO MemoryStore: ensureFreeSpace(1855) called with curMem=159657, maxMem=278302556
15-08-08 19:16:27 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1855.0 B, free 265.3 MB)
15-08-08 19:16:27 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:42648 (size: 1855.0 B, free: 265.4 MB)
15-08-08 19:16:27 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
15-08-08 19:16:27 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at <console>:24)
15-08-08 19:16:27 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15-08-08 19:16:27 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1415 bytes)
15-08-08 19:16:27 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15-08-08 19:16:27 INFO HadoopRDD: Input split: file:/root/devExpert/spark-1.4.1/README.md:0+3624
15-08-08 19:16:27 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15-08-08 19:16:27 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15-08-08 19:16:27 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15-08-08 19:16:27 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15-08-08 19:16:27 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15-08-08 19:16:27 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1830 bytes result sent to driver
15-08-08 19:16:27 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 80 ms on localhost (1/1)
15-08-08 19:16:27 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15-08-08 19:16:27 INFO DAGScheduler: ResultStage 0 (count at <console>:24) finished in 0.093 s
15-08-08 19:16:27 INFO DAGScheduler: Job 0 finished: count at <console>:24, took 0.176689 s

res0: Long = 19
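The flatMap/map pipeline shown earlier stops at (word, 1) pairs; to finish the wordcount it still needs an aggregation step. Below is a minimal sketch continuing the same spark-shell session. The aggregation and printing lines are my addition, not part of the original transcript, though reduceByKey and collect are standard RDD operations:

```scala
// Sketch: completing the wordcount in the same spark-shell session.
// reduceByKey(_ + _) sums the per-word 1s produced by map((_, 1)).
scala> val counts = textFile
         .flatMap(_.split(" "))   // break each line into words
         .map((_, 1))             // pair every word with a count of 1
         .reduceByKey(_ + _)      // add up the counts for each word

scala> counts.collect().foreach(println)   // prints (word, count) pairs
```

Because this is a shell transcript, it requires a running spark-shell with `textFile` already defined as above; it is not a standalone program.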

That is how to run the wordcount project on the Spark platform from the command line. I hope the above content is of some help to you.
