
How to Get Started with Spark 1.6.0 for Beginners


In this article, the editor brings you a quick-start guide to Spark 1.6.0 for beginners. The content is rich and presented from a professional point of view; I hope you get something out of it after reading.

Using the Spark Interactive Shell

Basics

Spark's interactive shell provides an easy way to learn the Spark API, as well as a powerful tool for interactive data processing. The Spark shell supports both Scala and Python. Start the Scala version of the shell as follows:

./bin/spark-shell
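As noted above, the shell also supports Python. A minimal sketch of the corresponding command, assuming a standard Spark 1.6.0 distribution layout:

./bin/pyspark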

One of Spark's most important abstractions is the resilient distributed dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. The following example creates an RDD by loading the README.md file in the Spark directory:

scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

RDDs support two kinds of operations:

Actions: return a value computed from the RDD

Transformations: return a reference to a new RDD

Here are examples of actions:

scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark

The following transformation example uses filter to return a new RDD consisting of the subset of items in the file that satisfy the filter condition:

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09

Spark also supports chaining transformations and actions together:

scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15

More on RDD Operations

RDD actions and transformations can be combined for more complex computations. The following example finds the number of words in the line of README.md with the most words:

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Long = 15

In the code above, the map operation splits each line on spaces and counts its words, mapping each line to an integer and producing a new RDD of these integer values. reduce is then called on that RDD to find the largest word count. The arguments to map and reduce here are Scala function literals, in the functional programming style of Scala. Spark programs can be written in Scala, Java, or Python, and can use Scala/Java libraries. For example, using the Math.max() function makes the program more concise and readable:

scala> import java.lang.Math
import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15

With the popularity of Hadoop, MapReduce has become a common data-flow pattern. Spark can implement MapReduce flows easily, and writing such programs with Spark is even simpler:

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8

In the example above, the flatMap, map, and reduceByKey operations are combined to compute how many times each word appears in the file, producing an RDD of (String, Int) pairs. The collect action can then be used to gather the word counts:

scala> wordCounts.collect()
res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)
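As a small extension that is not part of the original example, the same wordCounts RDD can be sorted by count to surface the most frequent words. A minimal sketch, assuming the same shell session (sortBy is a standard RDD transformation and take is an action):

scala> // Sort the (word, count) pairs by count in descending order and take the top 5
scala> wordCounts.sortBy(_._2, false).take(5)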

Caching

Spark supports caching data in the cluster's distributed memory. When data will be accessed repeatedly, caching it in memory reduces access time and improves efficiency; the effect is especially noticeable when the data is spread across dozens or hundreds of nodes. The following example caches the linesWithSpark data in memory:

scala> linesWithSpark.cache()
res7: spark.RDD[String] = spark.FilteredRDD@17e51082
scala> linesWithSpark.count()
res8: Long = 19
scala> linesWithSpark.count()
res9: Long = 19
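Beyond the default cache(), which keeps data in memory only, an explicit storage level can be chosen with persist(). This is a sketch that is not in the original article, assuming the same shell session; StorageLevel comes from the standard Spark API:

scala> import org.apache.spark.storage.StorageLevel
scala> linesWithSpark.unpersist()                            // drop the earlier in-memory cache
scala> linesWithSpark.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk if memory is short
scala> linesWithSpark.count()                                // recomputes and re-caches the data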

Standalone Applications

Suppose we want to write a standalone application using the Spark API. We can write Spark applications easily in Scala, Java, or Python. Below is a simple example.

Scala

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

The program above counts the number of lines containing the character 'a' and the number containing 'b' in README.md. Unlike the earlier Spark shell examples, where the SparkContext is created for us, here we need to initialize the SparkContext ourselves.

We create a SparkConf object containing basic information about the application, and pass it to the SparkContext constructor.
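As an illustrative sketch that is not part of the original example, the master URL can also be set directly on the SparkConf instead of being passed to spark-submit with --master; setAppName and setMaster are both standard SparkConf methods:

import org.apache.spark.{SparkConf, SparkContext}

// Run locally with 4 worker threads; in production the master is usually
// left unset here and supplied by spark-submit instead.
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[4]")
val sc = new SparkContext(conf)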

Because the application depends on the Spark API, we also need an sbt configuration file, simple.sbt, that declares Spark as a dependency. The configuration file below adds the spark-core library as that dependency:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"

For sbt to work correctly, SimpleApp.scala and simple.sbt must be laid out according to the directory structure sbt expects. Once the layout is correct, you can generate the application's JAR package and run the program with the spark-submit command.
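A sketch of what that layout and the packaging commands might look like, assuming sbt is installed and Spark 1.6.0 is available at YOUR_SPARK_HOME; the exact JAR name under target/ depends on the project name and Scala version declared in simple.sbt:

# Expected directory layout
$ find .
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala

# Package a JAR containing the application
$ sbt package

# Use spark-submit to run the application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar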

Java

/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
  public static void main(String[] args) {
    String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> logData = sc.textFile(logFile).cache();

    long numAs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("a"); }
    }).count();

    long numBs = logData.filter(new Function<String, Boolean>() {
      public Boolean call(String s) { return s.contains("b"); }
    }).count();

    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
  }
}

The logic of this example is the same as the preceding Scala code. As in the Scala example, we first create a SparkConf and use it to initialize a JavaSparkContext, then create RDDs and apply transformations. Finally, functions are passed to Spark as instances of classes that implement the org.apache.spark.api.java.function.Function interface.

Here we compile with Maven; the Maven pom.xml is as follows:

<project>
  <groupId>edu.berkeley</groupId>
  <artifactId>simple-project</artifactId>
  <modelVersion>4.0.0</modelVersion>
  <name>Simple Project</name>
  <packaging>jar</packaging>
  <version>1.0</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
</project>

The files are laid out according to the directory structure Maven expects:

$ find .
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java

Now you can package the application with Maven and run it with ./bin/spark-submit. For example:

# Package a JAR containing your application
$ mvn package
...
[INFO] Building jar: {..}/{..}/target/simple-project-1.0.jar

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/simple-project-1.0.jar
...
Lines with a: 46, Lines with b: 23

This has been a quick-start guide for beginners getting started with Spark 1.6.0. If you have similar questions, the analysis above may help you work through them. If you want to learn more, you are welcome to follow the industry information channel.
