
A First Look at Spark


Characteristics of Spark

Spark is an Apache top-level project. Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Like Hadoop MapReduce, it is a general parallel framework, open-sourced by UC Berkeley's AMP Lab (the AMP Lab of the University of California, Berkeley). Spark has the advantages of Hadoop MapReduce, but unlike MapReduce, the intermediate output of a job can be kept in memory, so there is no need to repeatedly read and write HDFS. This makes Spark better suited to iterative algorithms such as those used in data mining and machine learning.
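
As a rough illustration of why keeping intermediate results in memory matters, here is a minimal PySpark sketch (the input path, the field layout, and the loop are invented purely for illustration): the dataset is loaded and cached once, and each subsequent pass reuses the in-memory copy instead of re-reading it from storage.

from pyspark import SparkContext

sc = SparkContext("local[2]", "iterative-demo")

# Load the data once and keep it in memory; without cache(), every
# iteration of the loop below would re-read the file from storage.
points = sc.textFile("hdfs:///data/points.txt") \
           .map(lambda line: float(line.split(",")[0])) \
           .cache()

total = 0.0
for i in range(10):            # stand-in for an iterative ML training loop
    total += points.sum()      # each pass works against the cached RDD

print(total)
sc.stop()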

Spark is also much faster than MapReduce: it has an advanced DAG execution engine that supports acyclic data flow and in-memory computing. According to the official website, it runs up to 100 times faster when working in memory and about 10 times faster when working from disk.

Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark integrates tightly with Scala, which lets you manipulate distributed datasets as easily as local collection objects.

Spark is also easier to use than MapReduce and supports development in Java, Scala, Python, R, and other languages. It provides more than 80 high-level operators that make it easy to build parallel applications, and it can be used interactively from the Scala, Python, and R shells.

Spark has four main characteristics:

The high-level API strips away concerns about the cluster itself, so Spark application developers can concentrate on the computation their application needs to perform. Below is an example of using the Spark API from Python:
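
A minimal sketch, assuming the sc SparkContext that the pyspark shell creates automatically and an example input path:

# Count the lines that mention "spark"; `sc` is the SparkContext provided
# by the pyspark shell, and the file path is only an example.
lines = sc.textFile("file:///data/hello.txt")
spark_lines = lines.filter(lambda line: "spark" in line)
print(spark_lines.count())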

Spark is fast, supporting interactive computation and complex algorithms as well as acyclic data flow and in-memory computing. The official website compares the computation speed of MapReduce and Spark on a regression workload.

Spark is a very general-purpose computing engine that can be used for a wide variety of workloads, including SQL queries, text processing, machine learning, and so on. Before Spark appeared, one generally had to learn a different engine for each of these needs.
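
As a rough sketch of that versatility (the data, column names, and parameters below are invented purely for illustration), a single PySpark application can mix SQL queries and machine learning without switching engines:

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("one-engine-demo").getOrCreate()

# The same engine answers SQL queries...
df = spark.createDataFrame(
    [(1, 10.0, 1.0), (2, 20.0, 2.0), (3, 30.0, 3.0)],
    ["id", "x", "y"])
df.createOrReplaceTempView("points")
big = spark.sql("SELECT id, x, y FROM points WHERE x > 15")

# ...and runs machine learning on the result, in the same application.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(big)
model = KMeans(k=2, seed=1).fit(features)
print(model.clusterCenters())

spark.stop()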

Spark can run on a variety of platforms, such as Hadoop, Mesos, Kubernetes, and standalone clusters, or in the cloud, and it can access a variety of data sources, including HDFS, Cassandra, HBase, and S3.

Spark official website address:

http://spark.apache.org/

Deep comparison between Spark and Hadoop

The Spark ecosystem is referred to as BDAS (the Berkeley Data Analytics Stack).

Hadoop ecosystem vs. Spark BDAS:

Hadoop vs. Spark:

MapReduce vs. Spark:

Introduction to Spark Development Languages and Run Modes

Development languages supported by Spark:

Python, Scala (recommended), Java, R

Spark run modes:

Standalone (built-in), YARN (recommended), Mesos, Local

Scala & Maven Installation

Before installing Scala you need to prepare a JDK environment; I have already set up JDK 1.8 here.

Download address of Scala official website:

http://www.scala-lang.org/download/

Download Scala:

[root@study-01 ~]# cd /usr/local/src
[root@study-01 /usr/local/src]# wget https://downloads.lightbend.com/scala/2.12.5/scala-2.12.5.tgz

Decompress:

[root@study-01 /usr/local/src]# tar -zxvf scala-2.12.5.tgz -C /usr/local/
[root@study-01 /usr/local/src]# cd ../
[root@study-01 /usr/local]# ls
bin  etc  games  include  lib  lib64  libexec  sbin  scala-2.12.5  share  src
[root@study-01 /usr/local]# cd scala-2.12.5/
[root@study-01 /usr/local/scala-2.12.5]# ls
bin  doc  lib  man
[root@study-01 /usr/local/scala-2.12.5]#

Configure environment variables:

[root@study-01 ~]# vim .bash_profile    # change the following content:
export SCALA_HOME=/usr/local/scala-2.12.5
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$SCALA_HOME/bin
export PATH
[root@study-01 ~]# source .bash_profile
[root@study-01 ~]# scala    # test whether the scala command can be executed
Welcome to Scala 2.12.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161).
Type in expressions for evaluation. Or try :help.

scala>

Download address of Maven official website:

https://maven.apache.org/download.cgi

Download and extract:

[root@study-01 ~]# cd /usr/local/src/
[root@study-01 /usr/local/src]# wget http://mirror.bit.edu.cn/apache/maven/maven-3/3.5.2/binaries/apache-maven-3.5.2-bin.tar.gz
[root@study-01 /usr/local/src]# tar -zxvf apache-maven-3.5.2-bin.tar.gz -C /usr/local
[root@study-01 /usr/local/src]# cd ../apache-maven-3.5.2/
[root@study-01 /usr/local/apache-maven-3.5.2]# ls
bin  boot  conf  lib  LICENSE  NOTICE  README.txt
[root@study-01 /usr/local/apache-maven-3.5.2]#

Configure environment variables:

[root@study-01 ~]# vim .bash_profile    # change the following content:
export MAVEN_HOME=/usr/local/apache-maven-3.5.2
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$SCALA_HOME/bin:$MAVEN_HOME/bin
[root@study-01 ~]# source .bash_profile
[root@study-01 ~]# mvn --version    # test whether the mvn command can be executed
Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T15:58:13+08:00)
Maven home: /usr/local/apache-maven-3.5.2
Java version: 1.8.0_161, vendor: Oracle Corporation
Java home: /usr/local/jdk1.8/jre
Default locale: zh_CN, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-327.el7.x86_64", arch: "amd64", family: "unix"
[root@study-01 ~]#

Spark Environment Construction and WordCount Case Implementation

Download address of Spark official website:

http://spark.apache.org/downloads.html

What I download here is the 2.1.0 source package. The official documentation on compiling and installing Spark is at:

http://spark.apache.org/docs/2.1.0/building-spark.html

From the introduction on the official website, we know that:

Java 7 or later is required; as of Spark 2.0.0, Java 7 is marked deprecated (it still works), and starting with Spark 2.2.0, support for Java 7 is removed. Maven 3.3.9 or later is required.

Download the Spark 2.1.0 source package:

Download and extract:

[root@study-01 /usr/local/src]# wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0.tgz
[root@study-01 /usr/local/src]# tar -zxvf spark-2.1.0.tgz -C /usr/local
[root@study-01 /usr/local/src]# cd ../spark-2.1.0/
[root@study-01 /usr/local/spark-2.1.0]# ls
appveyor.yml  common           data      external  licenses     NOTICE   R          scalastyle-config.xml  yarn
assembly      conf             dev       graphx    mesos        pom.xml  README.md  sql
bin           CONTRIBUTING.md  docs      launcher  mllib        project  repl       streaming
build         core             examples  LICENSE   mllib-local  python   sbin       tools
[root@study-01 /usr/local/spark-2.1.0]#

After downloading the source, we still need to build Spark using the make-distribution.sh script under dev/ in the Spark source directory. The official example build command is:

./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.4 -Phive -Phive-thriftserver -Pmesos -Pyarn

Parameter description:

--name: the name of the Spark distribution package produced by the build
--tgz: package the result as a .tgz archive
-Psparkr: build Spark with R language support
-Phadoop-2.4: build against the hadoop-2.4 profile (the available profiles are listed in the pom.xml at the root of the source tree)
-Phive and -Phive-thriftserver: build Spark with support for operating on Hive
-Pmesos: build Spark with support for running on Mesos
-Pyarn: build Spark with support for running on YARN

We can then build Spark for our specific environment. For example, the Hadoop version we use is 2.6.0-cdh5.7.0, and we need Spark to run on YARN and to support Hive. Our Spark source build command is therefore:

[root@study-01 /usr/local/spark-2.1.0]# ./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0-cdh5.7.0

Before executing this command, however, we need to edit the pom.xml file to add the CDH Maven repository:

[root@study-01 /usr/local/spark-2.1.0]# vim pom.xml    # inside the <repositories> tag, add the following content:
<repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
[root@study-01 /usr/local/spark-2.1.0]#

You also need to change the mvn path used by the build script, because compiling with the script's default mvn is a bit slow:

[root@study-01 /usr/local/spark-2.1.0]# vim dev/make-distribution.sh
MVN="$MAVEN_HOME/bin/mvn"
[root@study-01 /usr/local/spark-2.1.0]#

Once these changes are done, you can run the build command. The build is fairly slow (it took me more than half an hour here), and you should give the machine as much memory as possible to avoid the build being interrupted by insufficient memory.

When the build finishes, a new .tgz file appears in the Spark source directory; unpack it into the /usr/local/ directory:

[root@study-01 /usr/local/spark-2.1.0]# ls | grep *.tgz
spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz
[root@study-01 /usr/local/spark-2.1.0]# tar -zxvf spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz -C /usr/local
[root@study-01 /usr/local/spark-2.1.0]# cd ../spark-2.1.0-bin-2.6.0-cdh5.7.0/
[root@study-01 /usr/local/spark-2.1.0-bin-2.6.0-cdh5.7.0]# ls
bin  conf  data  examples  jars  LICENSE  licenses  NOTICE  python  README.md  RELEASE  sbin  yarn
[root@study-01 /usr/local/spark-2.1.0-bin-2.6.0-cdh5.7.0]#

At this point, our Spark installation is complete. Next, let's try to start the Spark shell:

[root@study-01 /usr/local/spark-2.1.0-bin-2.6.0-cdh5.7.0]# ./bin/spark-shell --master local[2]

Command description:

--master specifies the mode in which to start; local means start in local mode, and the number in square brackets indicates how many threads to start.

The official documentation on launching spark shell:

http://spark.apache.org/docs/2.1.0/submitting-applications.html

Started successfully:

After the shell starts successfully, let's implement the wordcount example. The official quick-start documentation is at:

http://spark.apache.org/docs/2.1.0/quick-start.html

There is now a file with the following contents:

[root@study-01 /data]# cat hello.txt
hadoop welcome
hadoop hdfs mapreduce
hadoop hdfs
hello hadoop
spark vs mapreduce
[root@study-01 /data]#

Now complete the word count for this file in the spark shell:

scala> val file = sc.textFile("file:///data/hello.txt")    // read the file
file: org.apache.spark.rdd.RDD[String] = file:///data/hello.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> file.collect    // print the data that was read
res1: Array[String] = Array(hadoop welcome, hadoop hdfs mapreduce, hadoop hdfs, hello hadoop, spark vs mapreduce)

scala> val a = file.flatMap(line => line.split(" "))    // split each line on spaces
a: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26

scala> a.collect
res2: Array[String] = Array(hadoop, welcome, hadoop, hdfs, mapreduce, hadoop, hdfs, hello, hadoop, spark, vs, mapreduce)

scala> val b = a.map(word => (word, 1))    // map operation: attach a 1 to each word
b: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:28

scala> b.collect
res3: Array[(String, Int)] = Array((hadoop,1), (welcome,1), (hadoop,1), (hdfs,1), (mapreduce,1), (hadoop,1), (hdfs,1), (hello,1), (hadoop,1), (spark,1), (vs,1), (mapreduce,1))

scala> val c = b.reduceByKey(_ + _)    // reduce operation: sum the values of identical keys
c: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:30

scala> c.collect
res4: Array[(String, Int)] = Array((mapreduce,2), (hello,1), (welcome,1), (spark,1), (hadoop,4), (hdfs,2), (vs,1))

scala>

As you can see, a few lines of interactive code are enough to compute the word frequencies of the file. These methods can also be chained, so the whole wordcount fits in a single line of code, as shown in the following example:

scala> sc.textFile("file:///data/hello.txt").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect
res5: Array[(String, Int)] = Array((mapreduce,2), (hello,1), (welcome,1), (spark,1), (hadoop,4), (hdfs,2), (vs,1))

scala>
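
Since the pyspark shell is also available, the same one-line word count can be written in Python. A sketch, assuming the same /data/hello.txt file:

# In the pyspark shell, `sc` already exists, just as in spark-shell.
counts = (sc.textFile("file:///data/hello.txt")
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)
            .collect())
print(counts)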

We can also view job execution information in Spark's web UI by visiting port 4040 on the host's IP address.
