How to get started with Spark


This article shows you how to get started with Spark. The content is concise and easy to understand, and I hope you get something out of the detailed introduction that follows.

1. Brief introduction to Spark

Spark was born in 2009 at the AMPLab at the University of California, Berkeley. At first, Spark was just an experimental project with a very small amount of code, a lightweight framework. In 2010, Berkeley officially open-sourced the Spark project. In June 2013, Spark became a project under the Apache Foundation and entered a period of rapid development, with third-party developers contributing a lot of code and activity running very high. In February 2014, Spark became an Apache top-level project at breakneck speed. At the same time, the big data company Cloudera announced that it would increase its investment in the Spark framework to replace MapReduce; in April 2014, the big data company MapR joined the Spark camp, and Apache Mahout abandoned MapReduce in favor of Spark as its computing engine. Spark 1.0.0 was released in May 2014. Since 2015, Spark has become more and more popular in the domestic IT industry, and more and more companies have begun to deploy or use Spark to replace traditional big data parallel computing frameworks such as MR2, Hive, and Storm.

2. What is Spark?

Apache Spark™ is a unified analytics engine for large-scale data processing. In other words, Spark is a general-purpose, memory-based parallel computing framework for large-scale data sets whose goal is to make data analysis faster. Spark includes the computing frameworks commonly used in the big data field:

Spark Core (offline computing)

Spark SQL (interactive query)

Spark Streaming (real-time computing)

Spark MLlib (machine learning)

Spark GraphX (graph computing)
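
As a rough illustration of what "unified" means in practice, here is a minimal Scala sketch (assuming the Spark 2.x SparkSession API, a local master, and a hypothetical input file) that uses Spark Core (RDDs) and Spark SQL (DataFrames) in the same program:

import org.apache.spark.sql.SparkSession

object SparkComponentsSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point in Spark 2.x; its sparkContext
    // exposes the core RDD API, while spark.sql() runs SQL queries.
    val spark = SparkSession.builder()
      .appName("components-sketch")
      .master("local[*]")            // hypothetical: run locally for illustration
      .getOrCreate()

    // Spark Core: an RDD word count over a hypothetical comma-separated file
    val counts = spark.sparkContext
      .textFile("/tmp/words.txt")    // hypothetical path
      .flatMap(_.split(","))
      .map((_, 1))
      .reduceByKey(_ + _)

    // Spark SQL: the same data queried interactively as a DataFrame
    import spark.implicits._
    counts.toDF("word", "count").createOrReplaceTempView("word_counts")
    spark.sql("SELECT word, count FROM word_counts ORDER BY count DESC").show()

    spark.stop()
  }
}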

3. Can Spark replace hadoop?

Not exactly.

Because we can only use Spark Core in place of MR for offline computing; the storage of the data still depends on HDFS.

The combination of Spark and Hadoop is the hottest and most promising pairing in the big data field for the future.
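
To make the point concrete, here is a minimal spark-shell sketch (the HDFS paths are hypothetical): the computation moves from MapReduce to Spark, but the input and output still live on HDFS.

val logs = sc.textFile("hdfs://ns1/input/logs")      // read from HDFS
val errors = logs.filter(_.contains("ERROR"))        // compute in Spark instead of MR
errors.saveAsTextFile("hdfs://ns1/output/errors")    // write results back to HDFS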

4. Characteristics of Spark

Speed

In-memory computation is up to 100 times faster than MR, and on-disk computation is more than 10 times faster than MR.

Easy to use

Provides APIs for Java, Scala, Python, and R.

One-stop solution

Spark Core (offline computing), Spark SQL (interactive query), Spark Streaming (real-time computing).

Runs on multiple platforms

YARN, Mesos, and standalone mode

5. Shortcomings of Spark

The JVM memory overhead is too large: 1 GB of data usually consumes about 5 GB of memory (Project Tungsten is trying to solve this problem).

There is no effective shared-memory mechanism between different Spark applications (Project Tachyon is trying to introduce distributed memory management so that different Spark applications can share cached data).
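
This is not the Tungsten or Tachyon work itself, but as a hedged illustration of the memory-overhead point, here is a spark-shell sketch (hypothetical path) that persists an RDD in serialized form instead of as deserialized Java objects, trading some CPU for a much smaller footprint:

import org.apache.spark.storage.StorageLevel

// the default cache stores deserialized Java objects, which is part of the JVM
// overhead mentioned above; MEMORY_ONLY_SER keeps partitions as serialized bytes
val data = sc.textFile("hdfs://ns1/sparktest/big-input")   // hypothetical path
data.persist(StorageLevel.MEMORY_ONLY_SER)
println(data.count())                                      // first action materializes the cache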

6. Spark vs MR

6.1 limitations of mr

The level of abstraction is low; code has to be written by hand, which makes it difficult to use.

Only two operations, Map and Reduce, are provided, which lacks expressiveness.

A Job has only two phases, Map and Reduce; complex computations require a large number of Jobs, and the dependencies between Jobs have to be managed by the developers themselves.

Intermediate results (the output of reduce) are also stored in the HDFS file system.

High latency: it is only suitable for batch data processing, and its support for interactive, real-time, and iterative data processing is insufficient.

6.2 which problems in mr have been solved by Spark?

The level of abstraction is low, and code has to be written by hand, which makes it difficult to use.

Spark addresses this with the RDD (Resilient Distributed Dataset) abstraction.

Only two operations, Map and Reduce, are provided, which lacks expressiveness.

Spark provides many more operators.

A Job has only two stages, Map and Reduce.

A Spark job can consist of multiple stages.

Intermediate results are also stored in the HDFS file system (slow).

In Spark, intermediate results are kept in memory and, when spilled, are written to the local disk instead of HDFS.

High latency: MR is only suitable for batch data processing, and its support for interactive and real-time data processing is insufficient.

Spark SQL and Spark Streaming solve this problem.

The performance of iterative data processing is poor.

Spark improves the performance of iterative computation by caching data in memory (see the sketch after this list).
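
A hedged spark-shell sketch of that last point (the HDFS path is hypothetical): cache() materializes the dataset in memory on the first action, so every later pass reuses it instead of re-reading from HDFS.

val points = sc.textFile("hdfs://ns1/sparktest/points")   // hypothetical path
  .map(_.toDouble)
  .cache()                                                // kept in memory after the first action

var total = 0.0
for (i <- 1 to 10) {
  // every iteration reuses the in-memory copy instead of re-reading HDFS
  total += points.sum()
}
println(total)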

Therefore, it is the trend of technological development that Hadoop MapReduce will be replaced by a new generation of big data processing platforms, and among that new generation, Spark is currently the most widely recognized and supported.

7. Version of Spark

Spark 1.6.3: Scala 2.10.5

Spark 2.2.0: Scala 2.11.8 (the Spark 2.x line is recommended for new projects)

Hadoop 2.7.5
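
For reference, a hedged build.sbt sketch that pins these versions (the project name and the "provided" scope are just illustrative choices):

// build.sbt (sketch): Scala and Spark versions consistent with the list above
name := "spark-getting-started"
version := "0.1"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.2.0" % "provided"
)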

8. Installation of Spark stand-alone version

Prepare the installation package spark-2.2.0-bin-hadoop2.7.tgz

tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C /opt/
mv /opt/spark-2.2.0-bin-hadoop2.7/ /opt/spark

Modify spark-env.sh

export JAVA_HOME=/opt/jdk
export SPARK_MASTER_IP=uplooking01
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=4
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=2g
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

Configure environment variables

# configure the environment variables for Spark
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Start the stand-alone spark

start-all-spark.sh

Check that it started in the web UI

http://uplooking01:8080

9. Installation of Spark distributed cluster

Configure spark-env.sh

[root@uplooking01 /opt/spark/conf]
export JAVA_HOME=/opt/jdk
# configure the master host
export SPARK_MASTER_IP=uplooking01
# configure the master communication port
export SPARK_MASTER_PORT=7077
# configure the number of cpu cores spark uses on each worker
export SPARK_WORKER_CORES=4
# configure one worker instance per host
export SPARK_WORKER_INSTANCES=1
# configure 2g of memory for each worker
export SPARK_WORKER_MEMORY=2g
# hadoop configuration file directory
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop

Configure slaves

[root@uplooking01 /opt/spark/conf]
uplooking03
uplooking04
uplooking05

Distribute spark

[root@uplooking01 /opt/spark/conf]
scp -r /opt/spark uplooking02:/opt/
scp -r /opt/spark uplooking03:/opt/
scp -r /opt/spark uplooking04:/opt/
scp -r /opt/spark uplooking05:/opt/

Distribute environment variables configured on uplooking01

[root@uplooking01 /]
scp -r /etc/profile uplooking02:/etc/
scp -r /etc/profile uplooking03:/etc/
scp -r /etc/profile uplooking04:/etc/
scp -r /etc/profile uplooking05:/etc/

Start spark

[root@uplooking01 /] start-all-spark.sh

10. Spark high availability cluster

Stop the running spark cluster first

Modify spark-env.sh

# comment out the following two lines
# export SPARK_MASTER_IP=uplooking01
# export SPARK_MASTER_PORT=7077

Add content

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=uplooking03:2181,uplooking04:2181,uplooking05:2181 -Dspark.deploy.zookeeper.dir=/spark"

Distribute the modified configuration

scp /opt/spark/conf/spark-env.sh uplooking02:/opt/spark/conf
scp /opt/spark/conf/spark-env.sh uplooking03:/opt/spark/conf
scp /opt/spark/conf/spark-env.sh uplooking04:/opt/spark/conf
scp /opt/spark/conf/spark-env.sh uplooking05:/opt/spark/conf

Start the cluster

[root@uplooking01 /] start-all-spark.sh
[root@uplooking02 /] start-master.sh

11. The first Spark-Shell program

spark-shell --master spark://uplooking01:7077
# spark-shell can specify the resources (total number of cores, memory used on each worker) that the spark-shell application uses at startup
spark-shell --master spark://uplooking01:7077 --total-executor-cores 6 --executor-memory 1g
# if not specified, it uses all cores on each worker and 1g of memory on each worker by default

Sc.textFile ("hdfs://ns1/sparktest/"). FlatMap (_ .split (","). Map ((_, 1)). ReduceByKey (_ + _). Collect

12. Roles in Spark

Master

The master is responsible for receiving submitted job requests and for scheduling resources (starting CoarseGrainedExecutorBackend on the workers).

Worker

The executors on the workers are responsible for executing tasks.

Spark-Submitter ==> Driver

Submits the Spark application to the master.
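
Putting the roles together, here is a hedged sketch of a driver program (the application name, class name, and output path are hypothetical): the SparkConf points at the standalone master, which then schedules executors on the workers to run the tasks. It would typically be packaged as a jar and handed to spark-submit with the same --master address.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {                          // hypothetical application
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("word-count-app")
      .setMaster("spark://uplooking01:7077")   // the master configured above
    val sc = new SparkContext(conf)            // the driver talks to the master through this context

    sc.textFile("hdfs://ns1/sparktest/")
      .flatMap(_.split(","))
      .map((_, 1))
      .reduceByKey(_ + _)                      // tasks for these stages run in executors on the workers
      .saveAsTextFile("hdfs://ns1/sparktest-out")   // hypothetical output path

    sc.stop()
  }
}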

The above is how to get started with Spark. Hopefully you have picked up some new knowledge or skills along the way.
