Beginner's introduction to Spark deployment
Introduction to Spark
Overall understanding
Apache Spark is a big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 by AMPLab at the University of California, Berkeley, was open sourced in 2010, and later became an Apache top-level project. Spark sits in the middle and upper layers of the overall big data stack and complements Hadoop rather than replacing it.
Basic concept
The Fork/Join framework, introduced in Java 7, executes tasks in parallel: it splits a large task into several small tasks and finally merges the results of the small tasks into the result of the large task.
The first step is to split the task. A fork operation divides a large task into subtasks; the subtasks may still be large, so splitting continues until each subtask is small enough.
The second step is to execute the tasks and merge the results. The split subtasks are placed in double-ended queues, and several worker threads take tasks from these deques and execute them. The results of the subtasks are put into a queue, and a separate thread takes results from that queue and merges them.
For more information, please refer to Fork/Join
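Since the Spark examples later in this article are written in Scala, here is a minimal Fork/Join sketch in Scala on top of the JDK's ForkJoinPool and RecursiveTask; the array-summing task and the threshold value are illustrative assumptions, not part of the original article.

import java.util.concurrent.{ForkJoinPool, RecursiveTask}

// Sums a slice of an array by recursively splitting it until the slice is small enough.
class SumTask(nums: Array[Long], lo: Int, hi: Int) extends RecursiveTask[Long] {
  private val Threshold = 1000 // illustrative cut-off for "small enough"

  override def compute(): Long = {
    if (hi - lo <= Threshold) {
      // Small enough: compute directly.
      var sum = 0L
      var i = lo
      while (i < hi) { sum += nums(i); i += 1 }
      sum
    } else {
      // Still large: fork into two subtasks, then merge (join) their results.
      val mid = (lo + hi) / 2
      val left = new SumTask(nums, lo, mid)
      val right = new SumTask(nums, mid, hi)
      left.fork()                   // run the left half asynchronously
      right.compute() + left.join() // compute the right half here, then merge
    }
  }
}

object ForkJoinDemo {
  def main(args: Array[String]): Unit = {
    val nums = Array.tabulate(1000000)(i => (i + 1).toLong) // 1, 2, ..., 1000000
    val pool = new ForkJoinPool()
    println(pool.invoke(new SumTask(nums, 0, nums.length))) // expected 500000500000
    pool.shutdown()
  }
}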
Core concept
RDD (Resilient Distributed Dataset): the resilient distributed dataset, introduced in Matei Zaharia's research paper, is the core concept of the Spark framework.
You can think of an RDD as a database table that can hold any type of data. Spark stores the data of an RDD across different partitions. RDDs help rearrange computations and optimize data processing.
In addition, it is fault tolerant because RDD knows how to recreate and recalculate datasets.
RDD is immutable. You can modify the RDD with Transformation, but this transformation returns a brand new RDD, while the original RDD remains the same.
RDD supports two types of operations: transformations (Transformation) and actions (Action).
Transformation: a transformation returns a new RDD, not a single value. Nothing is evaluated when a transformation method is called; it simply takes an RDD as input and returns a new RDD. Transformation functions include map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe and coalesce.
Action: an action computes and returns a value. When an action is called on an RDD, the whole data processing query is evaluated at that point and the result value is returned. Action operations include reduce, collect, count, first, take, countByKey and foreach. (A short runnable sketch of this distinction follows below.)
Shared variables (Shared variables) come in two kinds: broadcast variables (Broadcast variables) and accumulators (Accumulators).
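As a minimal sketch of the transformation/action distinction (runnable locally; the numbers and the local[2] master are illustrative assumptions), transformations such as map and filter only describe new RDDs, and nothing is computed until an action such as count or collect is called:

import org.apache.spark.{SparkConf, SparkContext}

object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LazyEvalSketch").setMaster("local[2]"))

    val numbers = sc.parallelize(1 to 10)    // source RDD
    val doubled = numbers.map(_ * 2)         // transformation: nothing is computed yet
    val evens   = doubled.filter(_ % 4 == 0) // another transformation, still lazy

    println(evens.count())                   // action: the whole chain is evaluated now
    evens.collect().foreach(println)         // action: results are returned to the driver
    sc.stop()
  }
}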
Master / Worker / Driver / Executor
Master: (1) accepts registration requests from Workers, records each Worker's CPU, memory and other resources, and tracks the liveness of Worker nodes; (2) accepts the application registration request from the Driver (issued by the client on the driver side), allocates CPU and memory for the application on the Workers, launches the Executor processes, and then tracks the liveness of the Executors and of the application.
Worker: receives instructions from the Master and creates Executor processes for the application. The Worker acts as a bridge between the Master and the Executors and does not itself take part in the computation.
Driver: handles the user-side application logic.
Executor: performs the actual computation; it receives and executes the tasks the application is divided into and caches the results in local memory or on disk.
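To illustrate how these roles connect, the sketch below shows a driver-side configuration that registers an application with a standalone Master; the master URL spark://master-host:7077 and the executor memory value are placeholder assumptions, not settings from the original article.

import org.apache.spark.{SparkConf, SparkContext}

object ClusterRolesSketch {
  def main(args: Array[String]): Unit = {
    // The Driver registers the application with the Master (placeholder URL below);
    // the Master then asks Workers to launch Executors, which run the tasks.
    val conf = new SparkConf()
      .setAppName("ClusterRolesSketch")
      .setMaster("spark://master-host:7077") // use "local[2]" to run everything in one JVM while testing
      .set("spark.executor.memory", "1g")    // memory requested for each Executor on a Worker
    val sc = new SparkContext(conf)
    val result = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
    println(s"result = $result")
    sc.stop()
  }
}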
Spark deployment
There is a lot of information about deploying Spark on the Internet; the environment used here is summarized below:
Ubuntu 14.04 LTS
Hadoop 2.7.0
Java JDK 1.8
Spark 1.6.1
Scala 2.11.8
Hadoop installation
Since Spark can take advantage of HDFS and YARN, Hadoop needs to be configured in advance. For configuration tutorials, refer to: Setting up an Apache Hadoop 2.7 single node on Ubuntu 14.04, and the Hadoop installation tutorial for stand-alone / pseudo-distributed configuration (Hadoop 2.6.0 / Ubuntu 14.04).
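As a quick illustration of why Hadoop matters here: once HDFS is running, an RDD can be created directly from an HDFS path. The host, port and file path below are placeholders that assume the pseudo-distributed setup from the tutorials above; replace them with your own values.

import org.apache.spark.{SparkConf, SparkContext}

object HdfsReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HdfsReadSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // hdfs://localhost:9000 assumes the default fs.defaultFS of a pseudo-distributed setup
    val lines = sc.textFile("hdfs://localhost:9000/user/hadoop/input/sample.txt")
    println(s"line count: ${lines.count()}")
    sc.stop()
  }
}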
Spark installation
With Hadoop installed, set up Spark by following the configuration tutorial:
Spark Quick start Guide-Spark installation and basic use
Scala installation
Scala is the language Spark itself is written in, so it naturally gets the fastest updates and the best support. Beyond that, Scala's combination of object-oriented and functional programming gives the language plenty of pleasant syntactic sugar, so I use Scala for development when working with Spark.
Scala compiles to bytecode that runs on the JVM, so it depends on the JDK, which must be installed first. Eclipse, a powerful IDE for Java development, can also be used for Scala in one of two ways: (1) in Eclipse, go to Help -> Install New Software and install the Scala plugin; or (2) download the integrated Scala IDE provided on the official Scala website. With either of these in place, Scala development can begin. Those who want to build with Scala's own SBT can install it from the Scala official website download page; I have been using Maven for package management, so I continue with Maven here.
Simple example: WordCount (Spark + Scala)
Development IDE: Eclipse Scala
Package management: Maven
Development language: Scala
Create a Maven project
Skip the selection of archetype project templates
Download template pom.xml
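For reference, the pom.xml needs at least the Spark core dependency; the snippet below is a minimal sketch assuming Spark 1.6.1 built against Scala 2.10, so adjust the versions to your own environment.

<dependencies>
  <!-- Spark core; the _2.10 suffix must match the Scala compiler version configured in Eclipse -->
  <!-- scala-library arrives as a transitive dependency of spark-core, which is why the Eclipse
       Scala Library container is removed in a later step -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.1</version>
  </dependency>
</dependencies>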
Add the Scala nature to the Maven project: right click on the project -> Configure -> Add Scala Nature.
Adjust the Scala compiler version to match the Spark version: right click on the project -> Properties -> Scala Compiler -> set the Scala installation version to 2.10.5.
Remove the Scala Library from the Build Path (the Spark Core dependency has already been added to Maven, and since Spark depends on Scala, the Scala jar is already present among the Maven dependencies): right click on the project -> Build Path -> Configure Build Path and remove the Scala Library Container.
Add the package com.spark.sample.
Create the objects WordCount and SimpleCount as two simple Spark examples.
SimpleCount.scala:
package com.spark.sample

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object SimpleCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("TrySparkStreaming").setMaster("local[2]")
    // Create the Spark context
    val sc = new SparkContext(conf)
    // val ssc = new StreamingContext(conf, Seconds(1)) // create streaming context

    val txtFile = "test"
    val txtData = sc.textFile(txtFile)
    txtData.cache()
    txtData.count()
    val wcData = txtData
      .flatMap { line => line.split(",") }
      .map { word => (word, 1) }
      .reduceByKey(_ + _)
    wcData.collect().foreach(println)
    sc.stop
  }
}
WordCount.scala
package com.spark.sample

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions

object WordCount {
  def main(args: Array[String]) = {
    // Start the Spark context
    val conf = new SparkConf().setAppName("WordCount").setMaster("local")
    val sc = new SparkContext(conf)

    // Read some example file into a test RDD
    val test = sc.textFile("input.txt")

    test.flatMap { line =>           // for each line
        line.split(" ")              // split the line word by word
      }
      .map { word =>                 // for each word
        (word, 1)                    // return a key/value tuple, with the word as key and 1 as value
      }
      .reduceByKey(_ + _)            // sum all the values with the same key
      .saveAsTextFile("output.txt")  // save to a text file

    // Stop the Spark context
    sc.stop
  }
}
The principle: each line of the input is split into words, each word is mapped to a (word, 1) pair, and reduceByKey sums the counts for each word.
References:
http://km.oa.com/group/2430/articles/show/181711?kmref=search&from_page=1&no=1&is_from_iso=1
http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds
http://www.infoq.com/cn/articles/apache-spark-introduction?utm_source=infoq_en&utm_medium=link_on_en_item&utm_campaign=item_in_other_langs
http://www.infoq.com/cn/articles/apache-spark-sql
http://www.infoq.com/cn/articles/apache-spark-streaming
http://www.devinline.com/2016/01/apache-spark-setup-in-eclipse-scala-ide.html
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/
http://wuchong.me/blog/2015/04/06/spark-on-hbase-new-api/
http://colobu.com/2015/01/05/kafka-spark-streaming-integration-summary/
Author: Zhang Jinglong, CTO of Changyan (Shanghai) Information Technology Co., Ltd., member of CCF YOCSEF Shanghai, founder and first CTO of JD.com's "Tonight Hotel Special" app, and one of China's first generation of smartphone developers.