How to write Spark programs in SparkShell and IDEA


This article explains how to write Spark programs in spark-shell and IDEA. The content is straightforward and easy to follow; work through it step by step.

spark-shell is an interactive shell program that ships with Spark and makes interactive programming convenient: at its prompt, users can write Spark programs in Scala. spark-shell is generally used for testing and practicing Spark programs. It is itself a special Spark application, and applications can be submitted from within it.
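For example, a word count can be run line by line directly at the spark-shell prompt. This is a minimal sketch: the input path is an assumed example, and sc is the SparkContext that spark-shell creates automatically.

val lines = sc.textFile("dir/wordcount")   // assumed path; an HDFS URI works as well
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)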

spark-shell can be started in two modes: local mode and cluster mode.

Local mode:

spark-shell

Local mode starts only a SparkSubmit process locally and does not establish a connection with the cluster. Although the SparkSubmit process exists, nothing is submitted to the cluster.

Cluster mode:

spark-shell \
--master spark://hadoop01:7077 \
--executor-memory 512m \
--total-executor-cores 1

The last two options are optional; the --master option is required (unless the master is already specified inside the application itself, in which case it can be omitted; otherwise it must be given here).

Exit shell

Exit spark-shell correctly with :quit; exiting with Ctrl+C is wrong. If you did exit with Ctrl+C, find the leftover process by checking the listening port with netstat -apn | grep 4040, then kill it with kill -9 followed by its pid.

Comparison between the spark 2.2 shell and the spark 1.6 shell:

PS: if spark-shell is started in cluster mode, the web UI will show an application that keeps running the whole time.

Create a Spark project through IDEA

PS: the steps before project creation are omitted here, as they were already covered in the Scala setup. By default, just create the project directly.

Configure the pom.xml file in the project:

<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.2.0</spark.version>
    <hadoop.version>2.7.1</hadoop.version>
    <scala.compat.version>2.11</scala.compat.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>
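If Maven is used to compile the Scala sources, a build plugin is also needed. A minimal sketch assuming the scala-maven-plugin (the version shown is illustrative):

<build>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <goals>
                        <!-- compile both main and test Scala sources -->
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>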

Implementing the WordCount program in Spark

Scala version

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dir/wordcount").setMaster("local[*]")
    // create a SparkContext object
    val sc = new SparkContext(conf)
    // data can be processed through the SparkContext object
    // read the file; the parameter is a String holding the path
    val lines: RDD[String] = sc.textFile("dir/wordcount")
    // split the data
    val words: RDD[String] = lines.flatMap(_.split(" "))
    // turn each word into a tuple (word, 1)
    val tuples: RDD[(String, Int)] = words.map((_, 1))
    // Spark provides the reduceByKey operator, which groups values by key and sums them
    val sumed: RDD[(String, Int)] = tuples.reduceByKey(_ + _)
    // sort the current result; Spark's sortBy takes different parameters from Scala's sortBy
    // the default is ascending; false means descending
    val sorted: RDD[(String, Int)] = sumed.sortBy(_._2, false)
    // when the job is submitted to a cluster, foreach prints on the executors, so no value comes back to the driver
    sorted.foreach(println)
    // release resources: stop sc and end the task
    sc.stop()
  }
}
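As the comment above notes, when the job runs on a cluster, foreach(println) prints on the executors rather than in the driver's console. A minimal sketch of two common alternatives, assuming the result fits in driver memory and the output directory does not already exist:

// collect the sorted result back to the driver and print it there
val result: Array[(String, Int)] = sorted.collect()
result.foreach(println)
// or write the result out to a directory (the path is an assumed example)
sorted.saveAsTextFile("dir/output")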

Java version

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class JavaWordCount {
    public static void main(String[] args) {
        // 1. Create the conf object; the main configuration is the application name and the run mode
        SparkConf conf = new SparkConf().setAppName("JavaWordCount").setMaster("local");
        // 2. Create the context object
        JavaSparkContext jsc = new JavaSparkContext(conf);
        JavaRDD<String> lines = jsc.textFile("dir/file");
        // split the data; FlatMapFunction is the concrete implementation class
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String s) throws Exception {
                List<String> splited = Arrays.asList(s.split(" "));
                return splited.iterator();
            }
        });
        // turn the data into tuples
        // the first generic type is the input type; the last two are the output tuple's types
        JavaPairRDD<String, Integer> tuples = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) throws Exception {
                return new Tuple2<>(s, 1);
            }
        });
        // aggregation
        JavaPairRDD<String, Integer> sumed = tuples.reduceByKey(new Function2<Integer, Integer, Integer>() {
            // both Integer arguments are values that share the same key
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });
        // the Java API does not provide a sortBy operator, so swap the elements of each tuple, sort, then swap back
        // the first swap is for sorting
        JavaPairRDD<Integer, String> swaped = sumed.mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() {
            @Override
            public Tuple2<Integer, String> call(Tuple2<String, Integer> tup) throws Exception {
                return tup.swap();
            }
        });
        // sort
        JavaPairRDD<Integer, String> sorted = swaped.sortByKey(false);
        // the second swap restores the (word, count) order for the final result
        JavaPairRDD<String, Integer> res = sorted.mapToPair(new PairFunction<Tuple2<Integer, String>, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(Tuple2<Integer, String> tuple2) throws Exception {
                return tuple2.swap();
            }
        });
        System.out.println(res.collect());
        res.saveAsTextFile("out1");
        jsc.stop();
    }
}

Thank you for reading. That concludes "how to write Spark programs in SparkShell and IDEA". After studying this article you should have a deeper understanding of the topic, though the specifics still need to be verified in practice. More related articles will follow; welcome to keep following.
