
How to use Spark Shell for Interactive Analysis


In this article, the editor explains in detail how to use the Spark shell for interactive analysis. The content is detailed, the steps are clear, and the details are handled carefully. We hope this article helps you resolve your doubts about the topic.

Basics

The Spark shell provides an easy way to learn the API, as well as a powerful tool for analyzing data interactively. It is available in either Scala (which runs on the Java virtual machine and can call existing Java class libraries) or Python. Start it by running the following command in the Spark directory:


./bin/spark-shell
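When the shell starts, it already creates a SparkSession for you as the variable spark (and a SparkContext as sc), so you do not need to build one yourself. As a quick check, assuming a standard Spark 2.x shell, you can inspect them directly:

scala> spark          // the pre-created SparkSession
scala> sc             // the pre-created SparkContext
scala> spark.version  // prints the version string of the running Spark build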

Spark's main abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Let's create a new Dataset from the README file in the Spark source directory:

scala> val textFile = spark.read.textFile("README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]

You can get values directly from the Dataset by calling actions, or transform it to obtain a new Dataset. See the API doc for more details.

scala> textFile.count()   // Number of items in this Dataset
res0: Long = 126          // May be different from yours, as README.md changes over time; similarly for other outputs

scala> textFile.first()   // First item in this Dataset
res1: String = # Apache Spark
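Besides count and first, a few other simple actions are useful for inspecting a Dataset in the shell. The following is a small sketch; the exact rows returned depend on your copy of README.md:

scala> textFile.take(3)        // First three items, returned as a local Array[String]

scala> textFile.show(3, false) // Print the first three rows without truncating long lines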

Now let's transform this Dataset into a new one. We call filter to return a new Dataset containing a subset of the items in the file.

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]

We can chain transformations and actions together:

scala> textFile.filter(line => line.contains("Spark")).count()   // How many lines contain "Spark"?
res3: Long = 15

More operations on Dataset

Dataset actions and transformations can be used for more complex computations. For example, say we want to find the line with the most words:


scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15

This first maps a line to an integer value, creating a new Dataset; reduce is then called on that Dataset to find the largest word count. The arguments to map and reduce are Scala function literals (closures), and they can use any Scala/Java language feature or library. For example, we can call functions declared elsewhere. We will use Math.max to make this code easier to understand:

scala> import java.lang.Math
import java.lang.Math

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
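The same result can also be expressed with Spark SQL's aggregate functions instead of a hand-written reduce. The following is only a sketch of an alternative, assuming the single column of the mapped Dataset keeps its default name "value":

scala> import org.apache.spark.sql.functions.max
import org.apache.spark.sql.functions.max

scala> textFile.map(line => line.split(" ").size).agg(max("value")).show()
// prints a one-row table whose single cell is the maximum word count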

A common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).groupByKey(identity).count()
wordCounts: org.apache.spark.sql.Dataset[(String, Long)] = [value: string, count(1): bigint]

Here, we call flatMap to transform a Dataset of lines into a Dataset of words, and then combine groupByKey and count to compute the per-word counts in the file as a Dataset of (String, Long) pairs. To collect the word counts in our shell, we can call collect:

scala> wordCounts.collect()
res6: Array[(String, Long)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)
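To see which words are most frequent, we can sort the counts before collecting. The snippet below is a small sketch; it renames the columns to friendlier names first, and the actual top words will depend on the README contents:

scala> import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.functions.desc

scala> wordCounts.toDF("word", "count").orderBy(desc("count")).show(5)
// prints the five most frequent words and their counts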

Caching

Spark also supports pulling datasets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, for example when querying a small "hot" dataset or running an iterative algorithm like PageRank. As a simple example, let's mark our linesWithSpark dataset to be cached:


scala> linesWithSpark.cache()
res7: linesWithSpark.type = [value: string]

scala> linesWithSpark.count()
res8: Long = 15

scala> linesWithSpark.count()
res9: Long = 15
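When you no longer need the cached data, you can release it again. This is a minimal sketch using the Dataset API's unpersist method:

scala> linesWithSpark.unpersist()
// removes the dataset's cached blocks from the executors' memory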

It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large datasets, even when they are spread across tens or hundreds of nodes. You can also do this interactively by connecting bin/spark-shell to a cluster, as described in the programming guide.

Self-contained applications

Suppose we want to write a self-contained application using the Spark API. We will walk through a simple application in Scala (sbt), Java (Maven), and Python.


We will create a very simple Spark application in Scala; so simple, in fact, that it is named SimpleApp.scala:

/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}

Note that applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
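For illustration, this is the discouraged shape (the object name here is hypothetical); scala.App uses delayed initialization, which is commonly cited as the reason such programs can misbehave:

// Discouraged: may not work correctly with Spark
object SimpleAppViaApp extends App {
  // the body runs via scala.App's delayed initialization rather than an explicit main()
}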

This program simply counts the number of lines containing 'a' and the number containing 'b' in the Spark README. Note that you need to replace YOUR_SPARK_HOME with the location where Spark is installed. Unlike the earlier examples, which run in the Spark shell with its pre-initialized session, here we initialize the SparkSession ourselves as part of the application.

We call SparkSession.builder to construct a SparkSession, set the application name, and finally call getOrCreate to obtain a SparkSession instance.
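The builder accepts further options before getOrCreate. The following is only a sketch of common additions for local testing (the master URL and config key shown here are illustrative); when launching with spark-submit, the master is usually supplied on the command line instead:

val spark = SparkSession.builder
  .appName("Simple Application")
  .master("local[*]")                          // run locally using all available cores (testing only)
  .config("spark.sql.shuffle.partitions", "8") // example of setting a Spark SQL config value
  .getOrCreate()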

Our application depends on the Spark API, so we also include an sbt configuration file, build.sbt, which declares the dependency on Spark. The file also adds a repository that Spark depends on:

Name: = "Simple Project" version: = "1.0" scalaVersion: =" 2.11.8 "libraryDependencies + =" org.apache.spark "%" spark-sql "%" 2.2.0 "

For sbt to work correctly, we need to lay out SimpleApp.scala and build.sbt according to the typical directory structure. Once that is in place, we can create a JAR package containing the application's code and use the spark-submit script to run our program.

# Your directory layout should look like this
$ find .
./build.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala

# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.11/simple-project_2.11-1.0.jar

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.11/simple-project_2.11-1.0.jar
...
Lines with a: 46, Lines with b: 23

Congratulations on running your first Spark application successfully!

For a more in-depth overview of the API, start with the RDD programming guide and the SQL programming guide, or see the other components listed in the programming guide menu.

To run applications on a cluster, see the deployment overview.

Finally, Spark includes several examples (Scala, Java, Python, R) in its examples directory. You can run them as follows:

# For Scala and Java, use run-example:
./bin/run-example SparkPi

# For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py

# For R examples, use spark-submit directly:
./bin/spark-submit examples/src/main/r/dataframe.R

This article on "how to use Spark Shell for interactive analysis" ends here. To truly master its content, you still need to practice and apply it yourself. If you would like to read more articles like this, you are welcome to follow the industry information channel.
