
Basic Overview of Spark (I)


[TOC]

I. Overview of Spark

1. What is Spark

Spark is a fast, general-purpose, and scalable big data analytics engine. It was born in 2009 at AMPLab at the University of California, Berkeley, was open-sourced in 2010, became an Apache incubator project in June 2013, and became a top-level Apache project in February 2014. The Spark ecosystem has since grown into a collection of sub-projects, including Spark SQL, Spark Streaming, GraphX, MLlib, and others. Spark is a parallel big data computing framework based on in-memory computation: computing in memory improves the real-time performance of data processing in big data environments while guaranteeing high fault tolerance and high scalability, and it allows users to deploy Spark on large numbers of cheap machines to form a cluster. Spark is backed by many big data companies, including Hortonworks, IBM, Intel, Cloudera, MapR, Pivotal, Baidu, Alibaba, Tencent, JD.com, Ctrip, and Youku Tudou. Baidu currently applies Spark in services such as Fengnao, big search, direct account, and Baidu big data; Alibaba has built large-scale graph computation and graph mining systems with GraphX and implemented many recommendation algorithms for production systems; Tencent's Spark cluster has reached 8,000 nodes, currently the largest known Spark cluster in the world.

2. Advantages of Spark

Spark is inevitably compared with MapReduce. MapReduce's biggest performance weakness is that during the shuffle process intermediate results are written out to disk (that is, to HDFS), and at least six rounds of I/O are generated in the process. It is this frequent I/O that makes MapReduce's performance unsatisfactory.

For Spark, intermediate results are kept in memory (unless checkpointing is added), so a large amount of I/O-induced performance loss is avoided. Of course, this is only one of the points; more will be discussed later.
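As a rough sketch of this point (my own illustration, not from the original article), the snippet below caches an intermediate RDD so that a second action reuses the in-memory data instead of re-reading from HDFS; the input path and object names are assumptions:

import org.apache.spark.{SparkConf, SparkContext}

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cacheDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical input file; replace with a real path on your cluster.
    val words = sc.textFile("hdfs://bigdata121:9000/data/data.txt")
      .flatMap(_.split(" "))

    // Keep the intermediate result in memory so later actions reuse it.
    words.cache()

    println(words.count())             // first action: reads HDFS and fills the cache
    println(words.distinct().count())  // second action: served from memory

    sc.stop()
  }
}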

3. Characteristics of Spark

3.1 Fast

Compared with Hadoop MapReduce, Spark's in-memory operations are more than 100 times faster, and even its disk-based operations are about 10 times faster. Spark implements an efficient DAG execution engine, so data flows can be processed efficiently in memory.

3.2 Easy to use

Spark provides APIs for Java, Python, and Scala, as well as more than 80 high-level operators, allowing users to build different applications quickly. Spark also supports interactive Python and Scala shells, so you can easily use a Spark cluster from these shells to verify your approach to a problem.
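As a small illustration of this interactive style (my own example, not the article's), a quick sanity check in the Scala shell might look like the following, assuming spark-shell has already provided the sc context:

// Inside spark-shell, where sc (the SparkContext) is already available.
val nums = sc.parallelize(1 to 1000)          // distribute a local range
val evenSum = nums.filter(_ % 2 == 0).sum()   // transformation + action
println(evenSum)                              // expected: 250500.0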

3.3 General-purpose

Spark provides a unified solution: it can be used for batch processing, interactive queries (Spark SQL), real-time stream processing (Spark Streaming), machine learning (Spark MLlib), and graph computation (GraphX), and these different kinds of processing can be combined seamlessly in the same application. A unified stack is very attractive; after all, every company wants a single platform for the problems it encounters, which reduces the labor cost of development and maintenance as well as the material cost of deploying multiple platforms.

In addition, Spark integrates well with the Hadoop ecosystem: it can operate on HDFS directly, and it provides Hive on Spark and Pig on Spark to integrate with Hadoop.
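To make the "several kinds of processing in one application" point concrete, here is a minimal hypothetical sketch (not from the original article) that mixes RDD-style batch processing with a Spark SQL query in one program; the input path and column names are assumptions:

import org.apache.spark.sql.SparkSession

object UnifiedDemo {
  def main(args: Array[String]): Unit = {
    // One SparkSession gives access to both the RDD API and Spark SQL.
    val spark = SparkSession.builder()
      .appName("unifiedDemo")
      .master("local[*]")
      .getOrCreate()

    // Batch-style RDD processing: word counts from a (hypothetical) text file.
    val counts = spark.sparkContext
      .textFile("hdfs://bigdata121:9000/data/data.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    // Interactive-query-style processing on the same data via Spark SQL.
    import spark.implicits._
    val df = counts.toDF("word", "count")
    df.createOrReplaceTempView("wc")
    spark.sql("SELECT word, count FROM wc ORDER BY count DESC LIMIT 10").show()

    spark.stop()
  }
}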

3.4 Compatibility

Spark can be easily integrated with other open source products. For example, Spark can use Hadoop's YARN or Apache Mesos as its resource manager and scheduler, and it can process all the data supported by Hadoop, including HDFS, HBase, Cassandra, and so on. This is particularly important for users who have already deployed a Hadoop cluster, because no data migration is needed to take advantage of Spark's processing power. Spark can also run without any third-party resource manager and scheduler: it implements Standalone as its built-in resource management and scheduling framework, which further lowers the barrier to using Spark and makes it easy for everyone to deploy and use. In addition, Spark provides tools to deploy a Standalone Spark cluster on EC2.

4. Spark components

The Spark ecosystem:

Spark Core: the most important component; at its core is the RDD (Resilient Distributed Dataset)

Spark SQL: similar to Hive; uses SQL statements to operate on DataFrames (tables) built on top of RDDs

Spark Streaming: stream processing (see the sketch after this list)

The first three are used most often; the last two are used as needed.

Spark MLlib: Spark's machine learning library

Spark GraphX: graph computation
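For the Spark Streaming item above, here is a minimal hypothetical sketch (not from the original article) of a streaming word count over a socket source; the host, port, and batch interval are assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDemo {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one to receive data, one to process it.
    val conf = new SparkConf().setAppName("streamingDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

    // Hypothetical source: text lines arriving on a TCP socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()   // print each batch's counts to the driver log

    ssc.start()
    ssc.awaitTermination()
  }
}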

II. Spark Architecture and Deployment

1. Spark cluster architecture

Figure 2.1: Spark architecture

Spark has a few major runtime components: the driver, the master (cluster manager), and the workers.

Figure 2.2: Spark job and task workflow

The figure above illustrates the function of each component.

2. Deployment

Spark can be deployed in the following environments:

Standalone: Spark's built-in resource manager

YARN: Hadoop's resource manager

Mesos

Amazon EC2

The Scala version used here is 2.11.8 and the Spark version is spark-2.1.0-bin-hadoop2.7.

The JDK version is 1.8 and the Hadoop version is 2.8.4.

2.1 Pseudo-distributed setup

After unpacking the Spark distribution, enter the extracted directory and modify the configuration files:

cd conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
# add the following:
export JAVA_HOME=/opt/modules/jdk1.8.0_144
# specify the master node hostname and port (here the master is bigdata121)
export SPARK_MASTER_HOST=bigdata121
export SPARK_MASTER_PORT=7077

cp slaves.template slaves
vim slaves
# configure the worker node hostname(s) according to the actual hosts/IPs:
bigdata121

After the configuration is complete, start the cluster:

cd sbin
./start-all.sh

# use jps to check whether the Master and Worker processes are running:
jps
20564 JobHistoryServer
127108 Jps
51927 Worker
41368 ResourceManager
11130 SecondaryNameNode
10875 NameNode
41467 NodeManager
51868 Master
10973 DataNode

2.2 Fully distributed setup

It is basically the same as the pseudo-distributed setup: simply list a few more worker nodes in the conf/slaves file, then start the cluster.

After the setup is complete, you can open http://masterIP:8080 to view the cluster status.

3. Master node HA deployment

In Spark, the master node manages the entire cluster; it is a single point and therefore prone to single points of failure, so to ensure the availability of the master node we need to implement HA for it.

3.1 Single-point recovery based on the file system

This mode is mainly used in development or test environments. Spark is given a directory in which it saves the registration information of Spark applications and workers and writes their recovery state. If the Master fails, restarting the Master process (sbin/start-master.sh) restores the registration information of the running applications and workers.

File-system-based single-point recovery is enabled mainly by setting SPARK_DAEMON_JAVA_OPTS in spark-env.sh as follows:

# specify two parameters:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/root/training/spark-2.1.0-bin-hadoop2.7/recovery"

Where:
spark.deploy.recoveryMode: set to FILESYSTEM to enable single-point recovery (default: NONE)
spark.deploy.recoveryDirectory: the directory in which Spark saves the recovery state

Note that in essence there is still only one master node in this approach; its benefit is that worker and application information can be restored automatically when the master is restarted, so that tasks do not lose their execution state when the master goes down and the previous tasks do not have to be re-executed from scratch after the restart.

3.2 HA based on ZooKeeper

ZooKeeper provides a leader-election mechanism that ensures that although there are multiple Masters in the cluster, only one is Active and the rest are Standby. When the Active Master fails, another Standby Master is elected. Because the cluster's information, including Workers, Drivers, and Applications, has been persisted to ZooKeeper, the switchover only affects the submission of new jobs and has no impact on jobs that are already running.

Here we configure two hosts as master nodes while keeping a single worker node (for convenience). First, make sure the ZooKeeper service is running normally; that is not repeated here, see the earlier ZooKeeper article. Below is the Spark configuration.

Modify the spark-env.sh configuration file:

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=bigdata121:2181,bigdata123:2181,bigdata122:2181 -Dspark.deploy.zookeeper.dir=/spark"

Where:
spark.deploy.recoveryMode: set to ZOOKEEPER to enable recovery via ZooKeeper (default: NONE)
spark.deploy.zookeeper.url: the address list of the ZooKeeper cluster
spark.deploy.zookeeper.dir: the directory in ZooKeeper where Spark information is saved (default: /spark)

In addition, the following two lines must be commented out on every node, because the active master's address and port are selected automatically and must not be hard-coded:

# export SPARK_MASTER_HOST=bigdata121
# export SPARK_MASTER_PORT=7077

Make sure the above configuration is identical on all master and worker nodes in the entire Spark cluster.

After the configuration is complete, start the cluster

# start the entire cluster from any one of the master nodes:
sbin/start-all.sh
# then start a master separately on the other master node:
sbin/start-master.sh

After startup, you can check the status on the management pages of the two masters:

Open http://masterip1:8080 and http://masterip2:8080; if everything works properly, one page will normally show ACTIVE and the other STANDBY.

Then let's take a look at what information is stored on zookeeper:

A /spark node is created in ZooKeeper, and it contains two directories:

master_status: child nodes named after each worker, i.e. the worker information.
leader_election: heartbeat nodes for the hosts running the active/standby master nodes. All of these are ephemeral nodes; if a heartbeat is lost, the corresponding node disappears.

For example, the worker information nodes:

[zk: localhost:2181(CONNECTED) 0] ls /spark/master_status
[worker_worker-20190822120853-192.168.50.121-59531]

And the state nodes of the two master nodes (they disappear if the heartbeat is lost):

[zk: localhost:2181(CONNECTED) 1] ls /spark/leader_election
[_c_dcc9ec86-80f9-4212-a5db-d1ec259add80-latch-0000000003, _c_fa42411d-0aa0-4da8-bbe9-483c198ee1f9-latch-0000000004]

III. Running Spark's Example Programs

Spark provides some example programs

[root@bigdata121 spark-2.1.0-bin-hadoop2.7]# ls examples/jars/
scopt_2.11-3.3.0.jar  spark-examples_2.11-2.1.0.jar

Spark provides two tools for submitting and executing spark tasks, namely spark-shell and spark-submit

1. spark-submit

Commonly used in the production environment to submit tasks to the cluster for execution

Example: estimating Pi with the Monte Carlo method

spark-submit --master spark://bigdata121:7077 --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.1.0.jar 100

Where:
--master specifies the master address.
--class specifies the full name of the class to run (the class whose main function is executed). It is followed by the location of the jar package containing that class, and then by the program's input parameters (in fact, the args of the main function).

If additional jar packages are needed, write:

spark-submit --master xxx --jars jar1 jar2 ... --class <full class name> <jar containing the class to run> <parameters>

--jars specifies the paths of jar packages other than the one containing the class to run, such as dependency jars. If a driver such as mysql-connector is needed, add the option --driver-class-path xxxx, or, if that is more convenient, simply put the jar into Spark's jars directory.

In general, in a production environment, you write the Spark program in an IDE, package it into a jar, upload it to the cluster, and then submit it for execution with the spark-submit command shown above.

2. spark-shell

spark-shell is an interactive shell that ships with Spark and makes interactive programming convenient: users can write Spark programs in Scala on this command line. It is generally used for testing.

There are two modes of operation:

(1) Local mode: no connection to a Spark cluster is needed; everything runs directly on the local machine, which is handy for testing.

Startup: run bin/spark-shell with no parameters, which means local mode.

Spark context available as 'sc' (master = local[*], app id = local-1553936033811).

Here local means local mode, and local[*] means using all available CPU cores.

(2) Cluster mode:

bin/spark-shell --master spark://...

Spark context available as 'sc' (master = spark://node3:7077, app id = app-20190614091350-0000).

Special notes:
"Spark session available as 'spark'": provided since Spark 2.0; the session object is used to access all Spark components (core, SQL, ...).
"Spark context available as 'sc'": sc is the context object of the job.
Both objects, spark and sc, can be used directly in the shell.
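As a small illustration (my own, not from the original article) of using both objects inside spark-shell:

// Inside spark-shell, sc and spark are already defined.

// Using the SparkContext (sc): classic RDD operations.
val squares = sc.parallelize(1 to 5).map(n => n * n).collect()
// squares: Array(1, 4, 9, 16, 25)

// Using the SparkSession (spark): DataFrame / SQL operations.
val df = spark.range(1, 6).toDF("n")
df.createOrReplaceTempView("nums")
spark.sql("SELECT sum(n) AS total FROM nums").show()   // prints total = 15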

Example: writing WordCount programs in Spark Shell

The program is as follows:

sc.textFile("hdfs://bigdata121:9000/data/data.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile("hdfs://bigdata121:9000/output/spark/wc")

Note: the file hdfs://bigdata121:9000/data/data.txt must be uploaded to your HDFS cluster first, and make sure the output directory does not exist.

Explanation:
sc is the SparkContext object, the entry point of the Spark program.
textFile("hdfs://bigdata121:9000/data/data.txt") reads the data from HDFS.
flatMap(_.split(" ")) first splits and flattens the lines into words.
map((_, 1)) pairs each word with 1 to form a tuple.
reduceByKey(_ + _) reduces by key, adding up the values.
saveAsTextFile("hdfs://bigdata121:9000/output/spark/wc") writes the result back to HDFS.

3. Writing Spark programs in IDEA

First, you need to configure the Scala development environment in IDEA:

Install the Scala plugin from the plugin marketplace.

Create a Maven project, then add Scala support to the project via Add Framework Support.

Add a Scala source folder in the Project Structure settings.

Finally, right-clicking should now show the option to create a Scala class.

Note: Scala and the JDK must be installed locally.

After configuring the Scala environment, add the Maven dependencies for Spark to pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>king</groupId>
    <artifactId>sparkTest</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <spark.version>2.1.0</spark.version>
        <scala.version>2.11.8</scala.version>
        <hadoop.version>2.7.3</hadoop.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>2.1.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.1.0</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>2.1.0</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.11</artifactId>
            <version>1.6.3</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.12</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>1.2.1</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.19</version>
                <configuration>
                    <skip>true</skip>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

Remember not to leave out the build section of the configuration above. Here is a small pitfall I ran into before.

The pitfall:

After I packaged the jar with Maven and ran it on Linux, I got an error saying that the specified main class could not be found in the jar. Repackaging several times did not help. I then added the packaged jar as a project dependency in IDEA and looked inside it, only to find that the code I had written had not been packaged at all. Java code was packaged fine, so I guessed that Maven was simply ignoring the Scala code. A quick search confirmed that the build configuration above is required; with it in place, the Scala code is packaged correctly.

WordCount example code:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Create the Spark configuration object: set the app name and the master address;
    // "local" means local mode. When the program is submitted to a cluster, the master
    // is usually not hard-coded here, because the same jar may run on several clusters.
    val conf = new SparkConf().setAppName("wordCount").setMaster("local")

    // Create the Spark context object.
    val sc = new SparkContext(conf)

    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))
  }
}
