Getting started, installation and configuration of Spark

2025-07-15 Update From: SLTechnology News&Howtos shulou

Shulou (Shulou.com) 06/03 Report

The following big data material was compiled by Oldboy Education. Please cite the source when reprinting: http://www.oldboyedu.com

Hadoop

Hadoop is a distributed computing framework with four modules: Common, HDFS, MapReduce, and YARN.

Concurrency and parallelism

Concurrency usually refers to the ability of a single node to handle multiple requests at the same time; it measures the processing capacity of one node. Parallelism usually refers to multiple nodes working together on a distributed job, which we call parallel computing.

Spark

Spark is a lightning-fast cluster computing engine: a fast, general-purpose engine for large-scale data processing that computes in memory.

1. Speed

According to the Spark project, in-memory computation runs up to 100 times faster than Hadoop MapReduce, and disk-based computation up to 10 times faster. Spark uses an advanced DAG (directed acyclic graph) execution engine.

2. Easy to use

Spark provides more than 80 high-level operators that make it easy to build parallel applications. It can also be used interactively from the Scala, Python, and R shells.

3. Versatility

SQL, stream processing, and complex analytics can be combined in one application. Spark provides a stack of libraries, including Spark SQL, MLlib, GraphX, and Spark Streaming.

4. Architecture

Spark consists of Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX.

5. Run everywhere

Spark can run on Hadoop, on Mesos, standalone, or in the cloud, and can access a variety of data sources, such as HDFS, HBase, Hive, Cassandra, and S3.

Spark cluster deployment modes

1. local

No Spark processes need to be started; a single JVM runs all of Spark's components. This mode is mainly used for debugging and testing.

2. standalone

In standalone mode, you install a Spark cluster and start the master node and the worker nodes separately. The master is the management node, and the workers are the nodes that execute tasks.

3. yarn

There is no need to deploy a separate Spark cluster; in this mode there is effectively no Spark cluster of its own.

This mode relies entirely on Hadoop's job execution flow. Only at the end, when tasks are launched, does Spark's task execution take over, so Spark effectively runs as a Hadoop job. All of Spark's JAR files are added to the job's dependencies, and the run follows Hadoop's execution flow.
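From the command line, the three modes differ mainly in the value passed to --master. The following is a minimal sketch, not a definitive setup: the helper name master_url and the host name master-host are hypothetical, and 7077 is assumed to be the standalone master's default port.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a deployment mode name to a --master value.
# "master-host" is a placeholder; 7077 is the default standalone master port.
master_url() {
  case "$1" in
    local)      echo "local[*]" ;;                     # one JVM, one thread per core
    standalone) echo "spark://${2:-master-host}:7077" ;;
    yarn)       echo "yarn" ;;                         # cluster is located via HADOOP_CONF_DIR
    *)          echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Print the command that would start an interactive shell in each mode.
for mode in local standalone yarn; do
  echo "spark-shell --master $(master_url "$mode")"
done
```

The script only prints the commands, so it can be run before Spark is installed.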

Install Spark

1. Download spark-2.3.0-bin-hadoop2.7.tgz

The following is the official download address of Spark:

https://www.apache.org/dyn/closer.lua/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz

2. Extract the file to the /soft directory

$> tar -xzvf spark-2.3.0-bin-hadoop2.7.tgz -C /soft

3. Create a symbolic link

A symbolic link gives configuration files a stable path to refer to and makes later upgrades or version swaps easy.

$> cd /soft

$> ln -s spark-2.3.0-bin-hadoop2.7 spark
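To illustrate why the link helps with upgrades, here is a minimal sketch that runs in a temporary directory rather than in /soft. The two directories are empty stand-ins, and the 2.4.0 directory name is a hypothetical later release that this article does not install.

```shell
#!/usr/bin/env bash
set -e
# Work in a throwaway directory so nothing under /soft is touched.
demo=$(mktemp -d)
cd "$demo"

# Empty stand-ins for two unpacked releases (the second name is hypothetical).
mkdir spark-2.3.0-bin-hadoop2.7 spark-2.4.0-bin-hadoop2.7

# Initial install: "spark" points at the current release.
ln -s spark-2.3.0-bin-hadoop2.7 spark
readlink spark

# Upgrade: repoint the link. -f replaces the existing link,
# -n keeps ln from following it into the old target directory.
ln -sfn spark-2.4.0-bin-hadoop2.7 spark
readlink spark
```

Anything that refers to the link path, such as SPARK_HOME, keeps working after the switch without being edited.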

4. Configure environment variables

Edit the /etc/profile environment variable file:

$> sudo nano /etc/profile

Add the following at the end of the file:

...

export SPARK_HOME=/soft/spark

PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Note: add both Spark's bin directory and its sbin directory to the PATH environment variable. Linux uses ":" as the separator between PATH entries.
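One way to confirm that both directories ended up on PATH is sketched below. It assumes Spark lives at /soft/spark as in this article, and the helper name path_contains is hypothetical.

```shell
#!/usr/bin/env bash
# Same assignments as in /etc/profile; export makes them visible to child processes.
export SPARK_HOME=/soft/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

# Hypothetical helper: succeed if directory $1 is an entry of the ":"-separated PATH.
path_contains() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Report each Spark directory that should now be searchable.
for dir in "$SPARK_HOME/bin" "$SPARK_HOME/sbin"; do
  if path_contains "$dir"; then
    echo "on PATH: $dir"
  else
    echo "missing: $dir"
  fi
done
```

Wrapping PATH in extra ":" characters lets the match treat the first and last entries the same way as those in the middle.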

5. Apply the environment variables

$> source /etc/profile

6. Start the spark-shell command line

$> /soft/spark/bin/spark-shell

# enter the scala command prompt

scala>

7. Try out spark-shell

Because spark-shell is a Scala REPL, it is used exactly like the Scala shell.

scala> 1 + 1

# output

res0: Int = 2
