
Spark Getting Started Guide


I. A first look at Spark and Hadoop

Apache Spark is a general-purpose engine for big data processing that provides a distributed in-memory abstraction. As its "Lightning-fast" tagline suggests, Spark's biggest selling point is speed: for in-memory workloads it can run up to 100 times faster than Hadoop MapReduce.

Hadoop is essentially a piece of distributed data infrastructure: it spreads huge data sets across the nodes of a cluster of ordinary machines for storage, which means you do not need to buy and maintain expensive server hardware.

At the same time, Hadoop keeps track of and manages that data, making large-scale processing and analysis more efficient than before. Spark, by contrast, is a tool for processing data that is already stored in a distributed fashion; it does not provide distributed storage itself.

Hadoop provides not only the familiar HDFS distributed storage layer but also a processing framework called MapReduce, so you could set Spark aside entirely and do the data processing with Hadoop's own MapReduce.

Of course, Spark does not have to sit on top of Hadoop to survive. It has no file management system of its own, however, so it must be paired with some distributed file system: either Hadoop's HDFS or another, possibly cloud-based, data platform. In practice Spark is still most commonly run on Hadoop, since the two are widely regarded as a natural combination.

II. Install Spark

The following steps are carried out on an Ubuntu 16.04 system.

1. Install Java JDK and configure the environment variables (I won't go into detail in this section)

2. Install Hadoop

2.1 Create a Hadoop user:

Open the terminal and enter the command:

sudo useradd -m hadoop -s /bin/bash

This adds a user named hadoop and sets /bin/bash as its login shell.

2.2 Set a login password for the hadoop user:

sudo passwd hadoop

Enter the new password twice when prompted, then grant the hadoop user administrator privileges:

sudo adduser hadoop sudo

2.3 Switch the current user to the hadoop user you just created (click the gear icon in the upper-right corner of the screen and switch accounts from there).

2.4 Update the system's apt package index. Open the terminal and enter the command:

sudo apt-get update

2.5 Install SSH and configure password-less SSH login

Both cluster mode and single-node mode need SSH login. Ubuntu has the SSH client installed by default, but you need to install the SSH server yourself:

sudo apt-get install openssh-server

After installation, log in to the local machine directly:

ssh localhost

The first SSH login asks for confirmation: type yes at the prompt, then enter the password you just set for the hadoop user to log in. (To make the login truly password-less you would normally generate a key pair with ssh-keygen and append the public key to ~/.ssh/authorized_keys, a step this article does not show.)

2.6 Download Hadoop

Download address: http://mirror.bit.edu.cn/apache/hadoop/common/

Open the stable folder and download the hadoop-2.x.y.tar.gz file. By default it is saved to the download directory.

Open a terminal in that folder, extract the archive into /usr/local, and execute the commands:

sudo tar -zxf ~/hadoop-2.9.0.tar.gz -C /usr/local    # extract into /usr/local
cd /usr/local/
sudo mv ./hadoop-2.9.0/ ./hadoop    # rename the folder to hadoop
sudo chown -R hadoop ./hadoop       # change the file ownership to the hadoop user

Hadoop can be used directly once it is unpacked. Check that it works properly; if it does, the following command prints Hadoop's version information:

cd /usr/local/hadoop
./bin/hadoop version

This completes a basic Hadoop installation. There is plenty of further configuration, such as the pseudo-distributed setup, to write about when it is actually needed.

3. Install Spark

3.1 Download Spark: http://spark.apache.org/downloads.html

For the first item I choose the latest version, 2.3.1; for the second, "Pre-built with user-provided Apache Hadoop"; then click the "spark-2.3.1-bin-without-hadoop.tgz" download link after the third item.

3.2 Decompress the file

This step is the same as the Hadoop decompression; again we extract it to /usr/local:

$ sudo tar -zxf ~/download/spark-2.3.1-bin-without-hadoop.tgz -C /usr/local/
$ cd /usr/local
$ sudo mv ./spark-2.3.1-bin-without-hadoop/ ./spark
$ sudo chown -R hadoop:hadoop ./spark

3.3 Set environment variables

Execute the following command to copy a configuration file:

$ cd /usr/local/spark
$ cp ./conf/spark-env.sh.template ./conf/spark-env.sh

Then edit spark-env.sh:

$ vim ./conf/spark-env.sh

After opening it, add the following to the last line of the file:

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)

Then save and exit Vim, and you can use Spark.

III. Example of getting started with Spark

Under the path /usr/local/spark/examples/src/main we can find the examples that ship with Spark; from the subdirectories there you can see that Spark supports Scala, Python, Java, R and so on.

1. The easiest way to use Spark is through its interactive command-line shell. Open a terminal and type pyspark to start it:

~$ pyspark

2. PySpark automatically creates a SparkContext from the local Spark configuration. We can access it through the sc variable and use it to create our first RDD:

>>> text = sc.textFile("file:///usr/local/spark/exp/test1.txt")
>>> print text

3. Convert this RDD to the "hello world" of distributed computing: "word count"

First we import the add operator, a named function that can be passed around as a closure for addition; we will use it later. Next we need to split the text into words, so we define a tokenize function that takes a piece of text and returns a list of words split on whitespace. We then create a words RDD by passing the tokenize closure to the flatMap operator on the text RDD. You will notice that words is a PythonRDD, yet the command returned almost instantly: clearly the whole dataset has not actually been split into word lists yet, because flatMap is a lazy transformation. A sketch of this step follows.
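A minimal sketch of this step, assuming the text RDD created above (the variable names follow the earlier example; this is an illustration, not the article's original listing):

from operator import add   # named addition function, used later with reduceByKey

def tokenize(line):
    # split one line of text into a list of words on whitespace
    return line.split()

words = text.flatMap(tokenize)   # lazy: builds the RDD lineage, computes nothing yet
print(words)                     # shows a PythonRDD object, not the data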

4. Map each word to a key-value pair whose key is the word and whose value is 1, then use a reducer to sum the 1s for each key

>>> wc = words.map(lambda x: (x, 1))
>>> print wc.toDebugString()

Here I used an anonymous function (Python's lambda keyword) instead of a named one. The line maps the lambda over each word, so each x is a word, and the anonymous closure turns each word into the tuple (word, 1). To see the chain of transformations, we call the toDebugString method and inspect how the PipelinedRDD was built up.

5. Use the reduceByKey operation to compute the word counts, then write the results to disk

>>> counts = wc.reduceByKey(add)
>>> counts.saveAsTextFile("wc")

Only when we finally call the saveAsTextFile action does the distributed job start to execute; you should see a stream of INFO log lines as the job runs "across the cluster" (or, locally, across many of your machine's processes). If you exit the interpreter afterwards, you will find a "wc" directory under the current working directory. Each part file in it is a partition of the final RDD that a process on your local machine computed and saved to disk.
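As a quick check (a sketch that simply follows on from the example above), the saved part files can be read back with textFile, since Spark treats the output directory as a single dataset:

output = sc.textFile("wc")   # reads every part-* file in the wc directory
print(output.take(5))        # each element is the string form of a (word, count) tuple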

IV. Spark data forms

4.1 Resilient Distributed Dataset (RDD)

Spark's main abstraction is a distributed collection of items called an RDD (Resilient Distributed Dataset), which can be spread across the nodes of a cluster and operated on in parallel. RDDs can be created from Hadoop InputFormats (such as files on HDFS) or by transforming other RDDs.

There are three ways to obtain an RDD:

parallelize: turn an existing in-memory collection into an RDD; useful for learning Spark and running small tests.

>>> sc.parallelize(['cat', 'apple', 'bat'])

makeRDD: available only in the Scala API; similar to parallelize.

textFile: create an RDD by reading data from external storage.

>>> sc.textFile("file:///usr/local/spark/README.md")

RDDs have two defining properties: they are immutable and distributed.

RDDs support two kinds of operations. Transformations (whose return value is another RDD), such as map() and filter(), are lazy: the conversion from one RDD to another is not performed immediately but merely recorded, and no computation starts until an action is invoked; the newly produced RDD is then materialized in memory or on HDFS, and the original RDD is never modified. Actions (whose return value is not an RDD), such as count() and first(), actually trigger the Spark computation, producing a result from the RDD and returning it to the driver or writing it out to storage such as HDFS.
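A small sketch of this laziness in the PySpark shell (the numbers and variable names are made up for illustration):

nums = sc.parallelize([1, 2, 3, 4, 5])
squares = nums.map(lambda x: x * x)            # transformation: recorded, nothing runs
evens = squares.filter(lambda x: x % 2 == 0)   # still nothing has been computed
print(evens.count())   # action: triggers the whole chain and prints 2
print(evens.first())   # action: prints 4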

4.2 RDD caching strategy

One of Spark's most powerful features is the ability to cache data in the memory of the cluster. This is done by calling the RDD's cache function, for example: rddFromTextFile.cache()

Calling an RDD's cache function tells Spark to keep that RDD in memory. The first time an action is invoked on the RDD, the corresponding computation runs immediately: the data is read from its source and saved into memory. The time taken by that first action therefore depends partly on how long Spark needs to read the input data. From then on, however, the data can be read directly from memory, avoiding inefficient disk I/O and speeding up the computation. In most cases this gives a speed-up of several times.
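A minimal PySpark sketch of the effect, reusing the bundled README.md from the earlier examples (actual timings will vary):

lines = sc.textFile("file:///usr/local/spark/README.md")
lines.cache()          # mark the RDD to be kept in memory once it is first computed
print(lines.count())   # first action: reads the file from disk and fills the cache
print(lines.count())   # second action: served from memory, noticeably faster on larger data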

Another core feature of Spark is the ability to create two special types of shared variable: broadcast variables and accumulators. A broadcast variable is a read-only variable that is created by the driver program running the SparkContext and shipped to the nodes that will take part in the computation. This is useful in applications where the worker nodes need efficient access to the same data, such as machine learning. To create a broadcast variable in Spark, you only need to call a method on the SparkContext:

>>> broadcastAList = sc.broadcast(list(["a", "b", "c", "d", "e"]))
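Workers read a broadcast variable through its value attribute. The paragraph above also mentions accumulators, which the article does not demonstrate, so here is a minimal sketch of both shared-variable types (the variable names are illustrative and the README path follows the earlier examples):

# Broadcast: read-only data shipped once to each worker
rdd = sc.parallelize([0, 2, 4])
print(rdd.map(lambda i: broadcastAList.value[i]).collect())   # ['a', 'c', 'e']

# Accumulator: workers can only add to it; the driver reads the total afterwards
blank_lines = sc.accumulator(0)

def count_blank(line):
    global blank_lines
    if line.strip() == "":
        blank_lines += 1
    return line

sc.textFile("file:///usr/local/spark/README.md").map(count_blank).count()
print(blank_lines.value)   # number of blank lines seen during the count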
