How to use Spark Shell


This article mainly explains how to use the Spark shell. The content is simple, clear, and easy to learn; please follow the editor's train of thought to study how to use the Spark shell.

## Interactive Analysis with the Spark Shell ##

## Basics ##

Spark's shell provides a simple way to learn the API, as well as a powerful tool for interactive data analysis. It is available in both Scala (which runs on the Java VM and is therefore a good way to use existing Java libraries) and Python. Start it by running the following in the Spark directory:

./bin/spark-shell

Spark's main abstraction is a distributed collection of items called a resilient distributed dataset (RDD). RDDs can be created from Hadoop input formats (such as HDFS files) or by transforming other RDDs. Let's create a new RDD from the text of the README file in the Spark source directory:

scala> val textFile = sc.textFile("README.md")

textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let's start with a few actions:

scala> textFile.count() // Number of items in this RDD

res0: Long = 126

scala> textFile.first() // First item in this RDD

res1: String = # Apache Spark

Now let's use a transformation. We will use the filter transformation to return a new RDD containing a subset of the items in the file:

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))

linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09

We can chain transformations and actions together:

scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?

res3: Long = 15

## More on RDD Operations ##

RDD actions and transformations can be used for more complex computations. Let's say we want to find the line with the largest number of words:

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)

res4: Int = 15

This first maps a line to an integer value, creating a new RDD; reduce is then called on that RDD to find the largest line count. The arguments to map and reduce are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. We'll use the Math.max() function to make this code easier to understand:

scala> import java.lang.Math
import java.lang.Math

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15

One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8

Here we combined the flatMap, map and reduceByKey transformations to compute the number of occurrences of each word in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the collect action:

scala> wordCounts.collect()

res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)

## Caching ##

Spark also supports caching datasets in cluster-wide memory. This is useful when data needs to be accessed repeatedly, for example when querying a small "hot" dataset or when running an iterative algorithm like PageRank. As a simple example, let's cache our linesWithSpark dataset:

scala> linesWithSpark.cache()
res7: spark.RDD[String] = spark.FilteredRDD@17e51082

scala> linesWithSpark.count()
res8: Long = 15

scala> linesWithSpark.count()
res9: Long = 15

## Standalone Applications ##

Now suppose we want to develop a standalone application using the Spark API. Such a simple application can be written in Scala (using SBT), Java (using Maven) or Python; here we will use the Java version.

This example will use Maven to compile an application JAR, but any similar build tool will work as well.

We will create a very simple Spark application, SimpleApp.java.

import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
    public static void main(String[] args) {
        String logFile = "YOUR_SPARK_HOME/README.md";
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("a"); }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    }
}

This program simply counts the number of lines containing "a" and the number containing "b" in a text file. Note that you must replace "YOUR_SPARK_HOME" with the location where Spark is installed. As in a Scala program, we initialize a SparkContext, though here we use the special JavaSparkContext class to get a Java-friendly one. We also create RDDs (represented by JavaRDD) and run transformations on them. Finally, we pass functions to Spark by creating classes that implement spark.api.java.function.Function. The Spark programming guide describes these differences in more detail.
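Since a Scala (SBT) version of the same program is mentioned above but not shown, here is a minimal sketch of what an equivalent SimpleApp.scala could look like; it assumes the same YOUR_SPARK_HOME placeholder and the Spark 1.x API used in the Java example, and is an illustration rather than part of the original example.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object SimpleApp {
  def main(args: Array[String]) {
    // Replace YOUR_SPARK_HOME with the location where Spark is installed
    val logFile = "YOUR_SPARK_HOME/README.md"
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile).cache()

    // Count lines containing "a" and lines containing "b", as in the Java version
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()

    println("Lines with a: " + numAs + ", lines with b: " + numBs)
  }
}

In Scala, the function literals passed to filter take the place of the anonymous Function classes required in Java, and a plain SparkContext is used instead of JavaSparkContext.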

To build this project, we write a pom.xml file for Maven that lists Spark as a dependency. Note that Spark artifacts are tagged with a Scala version.

<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <repositories>
    <repository>
      <id>Akka repository</id>
      <url>http://repo.akka.io/releases</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency> <!-- Spark dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.0.2</version>
    </dependency>
  </dependencies>
</project>
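For the SBT route mentioned earlier, a small build definition along the following lines would declare the same Spark dependency. The file name simple.sbt and the exact Scala version shown are assumptions for illustration, chosen to match the spark-core_2.10 artifact in the pom.xml above:

// simple.sbt -- sketch of an SBT build declaring the same Spark dependency
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.2"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"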

We lay out these files according to the canonical Maven directory structure:

$ find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java

Now we can package the application using Maven and execute it using ./bin/spark-submit.

# Package a jar containing your application
$ mvn package
...
[INFO] Building jar: {..}/{..}/target/simple-project-1.0.jar

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/simple-project-1.0.jar
...
Lines with a: 46, Lines with b: 23

## Where to Go from Here ##

Congratulations on running your first Spark application!

For an in-depth overview of the API, start with the Spark programming guide, or see the "Programming Guides" menu for other components.

For running applications on a cluster, head to the deployment overview.

Finally, Spark includes several samples in the examples directory (Scala, Java, Python). You can run them as follows:

For Scala and Java, use run-example:
./bin/run-example SparkPi

For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py

Thank you for reading. The above is the content of "How to use Spark Shell"; after studying this article, you should have a deeper understanding of how to use the Spark shell, though the specifics still need to be verified in practice. The editor will push more articles on related topics for you; welcome to follow!
