1. Brief introduction and installation:
(1) Spark is written in Scala and runs on the JVM (Java Virtual Machine), so you need to install a JDK before installing Spark. After installing Java, download the installation package (a compressed file) from the official website: http://spark.apache.org/downloads.html. The version used here is spark-1.6.1-bin-hadoop2.4.tgz.
(2) Decompress and view the contents of the directory:
tar -zxvf spark-1.6.1-bin-hadoop2.4.tgz
cd spark-1.6.1-bin-hadoop2.4
In this way, we can run Spark in stand-alone mode on a single machine; Spark can also run on Mesos, YARN, and other cluster managers.
2. Spark Interactive Shell:
(1) Spark only provides Scala and Python shells. To get a hands-on feel for the Spark shell, we can follow the quick-start tutorial on the official website: http://spark.apache.org/docs/latest/quick-start.html
First, there are two different ways to start the Spark shell, Scala and Python (below, we take Scala as the example):
Scala:
./bin/spark-shell
Parameters can be passed when starting the Scala shell, for example:
./bin/spark-shell --name "axx" --conf spark.cores.max=5 --conf spark.ui.port=4041
Python:
./bin/pyspark
After startup, the shell prints a welcome banner and presents the interactive prompt.
If you need to change the log level displayed, modify the $SPARK_HOME/conf/log4j.properties file.
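For example, a common way to quiet the INFO output (a minimal sketch, assuming the stock log4j.properties.template that ships with Spark) is to copy the template and lower the root logger level:
cp conf/log4j.properties.template conf/log4j.properties
# in conf/log4j.properties, change the root logger level from INFO to, e.g., WARN:
log4j.rootCategory=WARN, console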
(2) The first important term in Spark: RDD (Resilient Distributed Dataset).
In Spark, RDD is used for distributed computing. RDD is the basic abstraction of Spark for distributed data and distributed computing.
RDDs support two types of operations: actions and transformations.
Action operations (actions): produce a value. An action computes a result on an RDD and either returns it to the driver program (for example, when we enter an instruction on the shell command line, Spark returns the resulting value to us) or stores it in an external storage system such as HDFS (we will see rdd.saveAsTextFile() later).
Transformation operations (transformations): produce a new RDD.
val lines = sc.textFile("file:///spark/spark/README.md")
This defines an RDD by reading a file. By default, textFile reads the file from HDFS; adding the file:// prefix tells it to read from the local path. (sc is the SparkContext that the shell creates for us.)
lines.count()
lines.first()
Above are two actions that return, respectively, the number of rows in the RDD and its first row.
val linesWithSpark = lines.filter(line => line.contains("spark"))
Above is a transformation that generates a new RDD, a subset of lines containing only the rows that contain "spark".
3. Spark Core Concepts:
Each Spark application contains a driver program that launches parallel computation on the cluster. In the previous example, the driver is the spark shell itself. The driver accesses Spark through the SparkContext object, which represents a connection to the computing cluster.
To run RDD operations, the driver manages worker processes called executors. When running on a distributed cluster, the driver sends tasks to executors on the worker nodes.
4. Stand-alone applications:
In addition to running in the shell, you can also run stand-alone applications. The main difference from the Spark shell is that when developing a stand-alone application you need to initialize the SparkContext yourself.
4.1 Initialize SparkContext
First, you need to create a SparkConf object to configure the application, and then create a SparkContext through SparkConf. Initialize the SparkContext object:
SparkConf conf = new SparkConf().setAppName("wc_ms");
JavaSparkContext sc = new JavaSparkContext(conf);
setAppName sets the name of this stand-alone application, so we can later monitor the application in the Web UI.
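As a side note, when testing locally without spark-submit, the master can also be set directly on the SparkConf (a minimal sketch under that assumption; in this article the master is instead passed on the command line in section 4.3):
SparkConf conf = new SparkConf()
        .setAppName("wc_ms")
        .setMaster("local[*]");   // assumption: run locally, using all available cores
JavaSparkContext sc = new JavaSparkContext(conf);
// ... define and run RDD operations here ...
sc.close();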
4.2 Develop the WordCount program:
To build a Spark program through Maven, pom only needs to introduce a dependency (depending on the specific version of Spark):
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.1</version>
</dependency>
WordCount.java
package com.vip.SparkTest;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        String inputfile = args[0];
        String outputfile = args[1];

        // Get the SparkContext
        SparkConf conf = new SparkConf().setAppName("wc_ms");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load the file into an RDD
        JavaRDD<String> input = sc.textFile(inputfile);

        // flatMap (inherited from the JavaRDDLike interface) splits the file into words
        // (separated by spaces); this is a transformation that produces a new RDD.
        JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String content) throws Exception {
                return Arrays.asList(content.split(" "));
            }
        });

        // First convert each word to a key-value tuple: (word, 1), (word2, 1), ...
        // then aggregate the counts per word with reduceByKey.
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(new PairFunction<String, String, Integer>() {
                    @Override
                    public Tuple2<String, Integer> call(String arg0) throws Exception {
                        return new Tuple2<String, Integer>(arg0, 1);
                    }
                })
                .reduceByKey(new Function2<Integer, Integer, Integer>() {
                    @Override
                    public Integer call(Integer x, Integer y) throws Exception {
                        return x + y;
                    }
                });

        counts.saveAsTextFile(outputfile);
        sc.close();
    }
}
The corresponding steps are annotated in the code comments above.
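For comparison, if the project is compiled with Java 8, the core of the same program can be written more compactly with lambdas instead of anonymous inner classes (a sketch under that assumption; the RDD operations are the same as in the program above):
// Split each line into words (a transformation producing a new RDD).
JavaRDD<String> words = input.flatMap(line -> Arrays.asList(line.split(" ")));
// Map each word to (word, 1), then sum the counts per word.
JavaPairRDD<String, Integer> counts = words
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile(outputfile);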
4.3 Submit the application to Spark (single machine or cluster):
(1) First, package the developed program:
mvn package
Get the jar package: SparkTest-0.0.1-SNAPSHOT.jar
(2) Upload the relevant files to the server:
Upload the text file to be counted and the jar file to the server.
(3) Use spark-submit to start the application:
$SPARK_HOME/bin/spark-submit \
  --class "com.vip.SparkTest.WordCount" \
  --master local \
  ./SparkTest-0.0.1-SNAPSHOT.jar "input directory" "output directory"
Description:
--class specifies the main class of the program.
--master specifies the Spark master URL; since we are running locally, local is used (other values include local[K] to use K worker threads, spark://HOST:PORT for a standalone cluster, or yarn).
Input directory: contains all the input text files (one or more files).
Output directory: pay special attention here: first, this is a directory, not a file; second, the directory must not exist in advance, otherwise an error will be reported.
(4) Running result:
After successful execution, two files are generated in the output directory:
part-00000 and _SUCCESS
The part-00000 file contains the results, one (word, count) pair per line.
At this point, our WordCount program is complete.