This article introduces the core concepts of Spark: what Spark is, how it compares with Hadoop MapReduce, and how RDDs, lineage, fault tolerance, resource scheduling, and the programming interfaces fit together.
Overview
What is Spark?
◆ Spark is a general-purpose parallel computing framework in the same family as Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab. Spark's distributed computation is based on the map/reduce model and keeps the advantages of Hadoop MapReduce, but unlike MapReduce, the intermediate output and results of a job can be kept in memory, so there is no need to read and write HDFS between stages. Spark is therefore better suited to algorithms that iterate over the same data, such as data mining and machine learning.
Comparison between Spark and Hadoop
◆ Spark keeps intermediate data in memory, which makes iterative computation more efficient.
Spark is therefore better suited to machine learning and data mining workloads, which involve many iterations, because Spark provides the RDD abstraction.
◆ Spark is more general than Hadoop.
Spark provides many types of dataset operations, whereas Hadoop provides only Map and Reduce. Spark's transformations include map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, partitionBy, and so on; Spark calls these operations Transformations. It also provides actions such as count, collect, reduce, lookup, and save.
These operation types make it convenient to develop upper-layer applications. Communication between processing nodes is no longer restricted to Hadoop's single data-shuffle pattern: users can name, materialize, and control the storage and partitioning of intermediate results. The programming model is therefore more flexible than Hadoop's, as in the short sketch below.
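The following is a rough sketch of a few of these operations, assuming the lines are entered in spark-shell (where sc already exists) and the later org.apache.spark API rather than the 0.7-era packages discussed later in this article; the data values are made up for illustration.
val a = sc.parallelize(Seq(("x", 1), ("y", 2), ("x", 3)))
val b = sc.parallelize(Seq(("x", 10), ("z", 7)))
val summed    = a.reduceByKey(_ + _)   // Transformation: combine values per key
val merged    = a.union(b)             // Transformation: concatenate two data sets
val coGrouped = a.cogroup(b)           // Transformation: group both data sets by key
println(summed.collect().toSeq)        // Action: collect brings results to the driver
println(merged.count())                // Action: count triggers the computation
println(coGrouped.lookup("x"))         // Action: look up the values for one key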
However, because of the nature of RDDs, Spark is not suitable for applications that perform asynchronous, fine-grained state updates, such as the storage layer of a web service or an incremental web crawler and indexer; in general it does not fit applications built around incremental modification.
◆ Fault tolerance.
In distributed dataset computing, fault tolerance is achieved through checkpointing. Checkpointing comes in two flavors: checkpointing the data itself, or logging the updates that produced it. Users can choose which approach to use.
◆ Usability.
Spark improves usability by providing rich Scala, Java, and Python APIs as well as an interactive shell.
The combination of Spark and Hadoop
◆ Spark can read and write data directly on HDFS and also supports Spark on YARN. Spark and MapReduce can run in the same cluster and share storage and compute resources. Shark, the data warehouse built on Spark, borrows the Hive implementation and is almost completely compatible with Hive.
Applicable scenarios for Spark
◆ Spark is a memory-based iterative computing framework, suitable for applications that operate on the same data set repeatedly. The more often the data is reused and the larger it is, the bigger the benefit; when the data set is small but computationally intensive, the benefit is relatively small.
◆ Because of the nature of RDDs, Spark is not suitable for applications with asynchronous, fine-grained state updates, such as web service storage or incremental web crawlers and indexers; it does not fit the incremental-modification application model.
◆ Generally speaking, Spark is widely applicable and fairly general-purpose.
Run modes
◆ Local mode
◆ Standalone mode
◆ Mesos mode
◆ YARN mode
Spark ecosystem
◆ Shark (Hive on Spark): Shark provides essentially the same HiveQL command interface as Hive, but built on Spark. To maximize compatibility with Hive, Shark reuses Hive's APIs for query parsing and logical plan generation, while the final physical plan is executed by Spark instead of Hadoop MapReduce. By setting Shark parameters, Shark can automatically cache specific RDDs in memory for data reuse, speeding up retrieval of particular data sets. Shark also lets users implement specific data analysis and learning algorithms through UDFs (user-defined functions), so SQL querying and analytical computation can be combined while maximizing RDD reuse.
◆ Spark Streaming: a framework for processing stream data on top of Spark. The basic idea is to split the stream into small time slices (on the order of seconds) and process each slice in a batch-like fashion. Spark Streaming builds on Spark because, on the one hand, Spark's low-latency execution engine (latencies around 100 ms) can be used for near-real-time computation, and on the other hand, compared with record-at-a-time frameworks such as Storm, RDD data sets make efficient fault tolerance easier. In addition, the micro-batch approach means batch and real-time processing can share logic and algorithms, which helps applications that need joint analysis of historical and real-time data. A minimal sketch follows this list.
◆ Bagel (Pregel on Spark): lets you do graph computation on Spark; a small but useful project. Bagel ships with an example implementing Google's PageRank algorithm.
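Below is a minimal word-count sketch of the micro-batch model described above for Spark Streaming. It uses the later org.apache.spark.streaming API rather than the 0.7-era packages, and the host and port of the text source are hypothetical.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // One-second batches: the stream is cut into small time slices and each slice
    // is processed like a small batch job.
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines  = ssc.socketTextStream("localhost", 9999)   // hypothetical text source
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()                                         // emit each batch's counts

    ssc.start()
    ssc.awaitTermination()
  }
}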
Use in the industry
◆ The Spark project was started in 2009 and open-sourced in 2010. It is used by Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research, Taobao, and others; Douban also uses DPark, a Python clone of Spark.
Core concepts of Spark
Resilient Distributed Dataset (RDD)
◆ RDD is Spark's most basic abstraction: an abstraction over distributed memory that lets you manipulate a distributed data set as if it were a local collection. RDD is the core of Spark. It represents a data set that is partitioned, immutable, and can be operated on in parallel; different data formats correspond to different RDD implementations, and an RDD must be serializable. An RDD can be cached in memory, and the result of each operation on it can be kept in memory so that the next operation reads directly from memory, avoiding the heavy disk I/O of MapReduce. For common iterative workloads such as machine learning algorithms and interactive data mining, this greatly improves efficiency.
◆ Features of RDDs (see the short sketch after this list):
It is an immutable, partitioned collection of objects spread across cluster nodes.
It is created through parallel transformations (map, filter, join, etc.).
It is automatically rebuilt on failure.
Its storage level (memory, disk, etc.) can be controlled for reuse.
It must be serializable.
It is statically typed.
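A minimal sketch of these properties, assuming spark-shell with the later org.apache.spark API and a hypothetical input file: the RDD is created from stable storage, transformed into new immutable RDDs, and given an explicit storage level for reuse.
import org.apache.spark.storage.StorageLevel

val logs   = sc.textFile("/tmp/access.log")             // hypothetical input path
val errors = logs.filter(_.contains("ERROR"))           // transformation: a new immutable, partitioned RDD
errors.persist(StorageLevel.MEMORY_AND_DISK)            // user-controlled storage level for reuse
println(errors.count())                                 // first action materializes (and caches) the data
println(errors.filter(_.contains("timeout")).count())   // reuses the cached partitions; lost ones are rebuilt from lineage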
◆ Benefits of RDDs
RDDs can only be generated from persistent storage or through Transformations on other RDDs, which makes fault tolerance more efficient than with distributed shared memory (DSM): a lost data partition can be recomputed from its lineage, without requiring a specific checkpoint.
The immutability of RDDs makes Hadoop MapReduce-style speculative execution possible.
The partitioning of RDD data allows performance to be improved through data locality, just as in Hadoop MapReduce.
RDDs are serializable, so when memory runs low they can be automatically degraded to disk storage; running an RDD from disk degrades performance considerably, but no worse than MapReduce today.
◆ Storage and partitioning of RDDs
Users can choose different storage levels when storing an RDD for reuse.
By default an RDD is stored in memory; when memory is insufficient, it spills to disk.
When an RDD needs to partition and distribute its data across the cluster, it partitions by each record's key (e.g. hash partitioning), which allows two data sets to be joined efficiently, as sketched below.
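As a rough illustration of key-based partitioning (again assuming spark-shell and the later API, with made-up data), pre-partitioning both pair RDDs with the same HashPartitioner lets the join reuse that partitioning instead of reshuffling everything:
import org.apache.spark.HashPartitioner

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob"))).partitionBy(new HashPartitioner(4)).cache()
val orders = sc.parallelize(Seq((1, 9.99), (2, 4.50), (1, 3.25))).partitionBy(new HashPartitioner(4))
val joined = users.join(orders)   // both sides are hash-partitioned by key, so the join is efficient
println(joined.collect().toSeq)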
◆ Internal representation of an RDD
Internally, every RDD is characterized by five pieces of information (a simplified sketch follows the list):
A list of partitions (block list)
A function for computing each split (computing this RDD from its parent RDDs)
A list of dependencies on parent RDDs
A Partitioner for key-value RDDs [optional]
A list of preferred locations for each split (e.g. the addresses of blocks on HDFS) [optional]
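A simplified, non-authoritative sketch of those five pieces of information, loosely modeled on the abstract RDD class in later Spark releases; names and signatures are abbreviated for illustration.
import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Sketch only: the real class carries more machinery (SparkContext reference, id, and so on).
abstract class SketchRDD[T] {
  def getPartitions: Array[Partition]                                  // 1. list of partitions (blocks)
  def compute(split: Partition, context: TaskContext): Iterator[T]     // 2. compute each split from its parents
  def getDependencies: Seq[Dependency[_]]                              // 3. dependencies on parent RDDs
  val partitioner: Option[Partitioner] = None                          // 4. partitioner for key-value RDDs [optional]
  def getPreferredLocations(split: Partition): Seq[String] = Nil       // 5. preferred locations per split [optional]
}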
◆ Storage levels of RDDs
RDD provides 11 storage levels based on the combination of useDisk, useMemory, deserialized, and replication parameters:
val NONE = new StorageLevel(false, false, false)
val DISK_ONLY = new StorageLevel(true, false, false)
val DISK_ONLY_2 = new StorageLevel(true, false, false, 2)
val MEMORY_ONLY = new StorageLevel(false, true, true)
val MEMORY_ONLY_2 = new StorageLevel(false, true, true, 2)
val MEMORY_ONLY_SER = new StorageLevel(false, true, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, 2)
val MEMORY_AND_DISK = new StorageLevel(true, true, true)
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, true, 2)
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, 2)
◆ RDD defines a variety of operations; different kinds of data are abstracted by different RDD subclasses, and the operations are implemented by those RDDs.
Generation of RDD
◆ There are two ways to create an RDD:
1. Create it from input in a Hadoop file system (or another Hadoop-compatible storage system), such as HDFS.
2. Transform a parent RDD to obtain a new RDD.
◆ Let's look at how an RDD is generated from a Hadoop file system, for example: val file = spark.textFile("hdfs://..."). The file variable is an RDD (actually a HadoopRDD instance). The core code that creates it is shown below:
// SparkContext creates an RDD from a file or directory plus an optional number of splits.
// Spark is very similar to Hadoop MapReduce here: it needs InputFormat, Key, and Value types.
// In fact, Spark reuses Hadoop's InputFormat and Writable types.
def textFile(path: String, minSplits: Int = defaultMinSplits): RDD[String] = {
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minSplits)
    .map(pair => pair._2.toString)
}

// Create a HadoopRDD from the Hadoop configuration, InputFormat, and so on.
new HadoopRDD(this, conf, inputFormatClass, keyClass, valueClass, minSplits)
◆ When the RDD is computed, it reads data from HDFS in almost the same way as Hadoop MapReduce.
Transformations and Actions on RDDs
◆ RDDs support two kinds of computation: transformations (whose return value is an RDD) and actions (whose return value is not an RDD).
◆ Transformations (such as map, filter, groupBy, join, etc.) are lazy: a transformation from one RDD to another is not executed immediately. Spark only records the requested operation and does not perform it; the computation starts only when an action is invoked.
◆ Actions (such as count, collect, save, etc.) either return a result or write RDD data to a storage system. Actions are what trigger Spark to start computing.
◆ The example below illustrates the use of Transformations and Actions in Spark.
val sc = new SparkContext(master, "Example", System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))
val rdd_A = sc.textFile("hdfs://...")
val rdd_B = rdd_A.flatMap(line => line.split("\\s+")).map(word => (word, 1))
val rdd_C = sc.textFile("hdfs://...")
val rdd_D = rdd_C.map(line => (line.substring(10), 1))
val rdd_E = rdd_D.reduceByKey((a, b) => a + b)
val rdd_F = rdd_B.join(rdd_E)
rdd_F.saveAsSequenceFile("hdfs://....")
Lineage (pedigree)
◆ Using memory to speed up data loading is also done by many other in-memory databases and cache systems. Spark's main distinction is the scheme it adopts for handling data fault tolerance (node failure / data loss) in a distributed computing environment. To keep the data in an RDD robust, an RDD remembers how it evolved from other RDDs through its Lineage. Compared with the fine-grained, per-update backup or logging mechanisms of other systems, an RDD's Lineage records coarse-grained transformation operations (filter, map, join, etc.). When some partitions of an RDD are lost, the Lineage provides enough information to recompute and recover just those partitions. This coarse-grained data model limits the kinds of applications Spark fits, but in return it improves performance relative to fine-grained models.
◆ With respect to Lineage, dependencies are divided into two kinds, Narrow Dependencies and Wide Dependencies, which matter for efficient fault recovery. A narrow dependency means each partition of the parent RDD is used by at most one partition of the child RDD: one parent partition maps to one child partition, or several parent partitions map to a single child partition; a single parent partition never feeds multiple child partitions. A wide dependency means a child partition depends on several or all partitions of the parent, i.e. one parent partition can feed multiple child partitions. For wide dependencies, the inputs and output of a computation sit on different nodes; if the input node is intact and only the output node fails, recomputation works, but if the input data is also lost, recomputation must trace further up the ancestry to see whether it can be retried (this is what the lineage, or pedigree, records). The recomputation cost for narrow dependencies is therefore much lower than for wide dependencies.
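To see narrow versus wide dependencies in a concrete pipeline, the later API offers toDebugString, which prints an RDD's lineage. A small sketch (spark-shell, hypothetical input file):
val words  = sc.textFile("/tmp/input.txt").flatMap(_.split("\\s+"))   // narrow dependencies: map-like steps
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)          // wide dependency: a shuffle across partitions
println(counts.toDebugString)                                         // prints the lineage, stage by stage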
Fault tolerance
◆ RDD computations use checkpointing for fault tolerance. There are two ways to checkpoint: checkpoint the data itself, or log the updates that produced it; users can choose which to use. The default is logging the updates, i.e. recording all the transformations that generated the RDD (in other words, each RDD's lineage), so that lost partition data can be recomputed.
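A minimal sketch of the explicit checkpoint-data approach (the alternative to logging the updates), again assuming spark-shell and the later API, with hypothetical paths:
sc.setCheckpointDir("hdfs://namenode:8020/tmp/spark-checkpoints")   // hypothetical reliable directory
val processed = sc.textFile("/tmp/input.txt").map(_.length)
processed.checkpoint()   // cut the lineage by persisting this RDD to reliable storage
processed.count()        // the first action actually writes the checkpoint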
Resource Management and Job scheduling
◆ Spark can use Standalone (stand-alone mode), Apache Mesos, or Hadoop YARN for resource management and job scheduling. Spark on YARN was introduced in Spark 0.6 but only became really usable in the branch-0.8 line. Spark on YARN follows YARN's official interfaces, and because Spark was designed from the start to support multiple schedulers and executors, adding YARN support was straightforward.
◆ Running Spark on YARN lets it share cluster resources with Hadoop and improves resource utilization.
Programming interface
◆ Spark exposes RDD operations through language integration, similar to DryadLINQ and FlumeJava: each data set is represented as an RDD object, and operations on the data set are expressed as operations on the RDD object. Spark's main programming language is Scala, chosen for its conciseness (Scala is easy to use interactively) and performance (it is a statically, strongly typed language on the JVM).
◆ Spark is similar to Hadoop MapReduce in consisting of a Master (analogous to MapReduce's JobTracker) and Workers (Spark's slave worker nodes). The user-written Spark program is called the Driver program. The Driver connects to the Master and defines the transformations and actions on each RDD; these are expressed as Scala closures (function literals), which Scala represents as serializable Java objects, so the closures can be shipped to the Worker nodes. Workers store data blocks and host the cluster memory; they are daemons running on the worker nodes. When a Worker receives an operation on an RDD, it applies the operation locally according to the data partition information, producing new data partitions, returning results, or writing the RDD to a storage system.
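As a small illustration of how the Driver's closures reach the Workers (spark-shell, later API, made-up values): the function literal passed to map captures the local variable factor; Spark serializes that closure and each Worker applies it to its own partitions.
val factor = 3
val scaled = sc.parallelize(1 to 10).map(_ * factor)   // the closure (including `factor`) is serialized and shipped to Workers
println(scaled.collect().mkString(","))                // the Action pulls the per-partition results back to the Driver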
Scala
◆ Spark is developed in Scala, and Scala is its default programming language. Writing a Spark program is much simpler than writing a Hadoop MapReduce program. Spark also provides spark-shell, in which code can be tested interactively. The general steps in a Spark program are: create (or reuse) a SparkContext instance, use it to create RDDs, and then operate on those RDDs. For example:
val sc = new SparkContext(master, appName, [sparkHome], [jars])
val textFile = sc.textFile("hdfs://...")
textFile.map(....).filter(.....)
Java
◆ Spark supports programming in Java, although there is no tool as convenient as spark-shell for it. Otherwise programming proceeds as in Scala: since both are JVM languages, Scala and Java interoperate, and the Java API is in fact a wrapper around the Scala one. For example:
JavaSparkContext ctx = new JavaSparkContext(...);
JavaRDD<String> lines = ctx.textFile("hdfs://...");
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
});
Python
◆ Spark also provides a Python programming interface. It uses Py4J to interoperate between Python and Java, so Spark programs can be written in Python. Spark also provides pyspark, a Python shell for Spark, which allows you to write Spark programs interactively in Python. For example:
from pyspark import SparkContext
sc = SparkContext("local", "Job Name", pyFiles=['MyFile.py', 'lib.zip', 'app.egg'])
words = sc.textFile("/usr/share/dict/words")
words.filter(lambda w: w.startswith("spar")).take(5)
Usage examples
Standalone mode
◆ To make Spark easier to promote and try out, Spark provides a Standalone mode. Spark was originally designed to run on the Apache Mesos resource-management framework, which is a good design but complicates deployment and testing. To simplify deployment, Spark therefore offers the Standalone mode, consisting of one Spark Master and multiple Spark Workers. It is very similar to Hadoop MapReduce v1, and the cluster is started in almost the same way.
◆ Running a Spark cluster in Standalone mode:
Download Scala 2.9.3 and configure SCALA_HOME.
Download the Spark code (you can build from source or download a prebuilt version); here we use the prebuilt version (http://spark-project.org/download/spark-0.7.3-prebuilt-cdh5.tgz).
Extract the spark-0.7.3-prebuilt-cdh5.tgz package.
Modify the configuration under conf/*:
slaves: configure the hostnames of the worker nodes.
spark-env.sh: configure the environment variables.
SCALA_HOME=/home/spark/scala-2.9.3
JAVA_HOME=/home/spark/jdk1.6.0_45
SPARK_MASTER_IP=spark1
SPARK_MASTER_PORT=30111
SPARK_MASTER_WEBUI_PORT=30118
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=4g
SPARK_WORKER_PORT=30333
SPARK_WORKER_WEBUI_PORT=30119
SPARK_WORKER_INSTANCES=1
◆ Copy the Hadoop configuration into the conf directory.
◆ Set up password-less SSH login from the master host to the other machines.
◆ Use scp to copy the configured Spark distribution to the other machines.
◆ Start the cluster on the master:
$SPARK_HOME/start-all.sh
YARN mode
◆ spark-shell does not yet support YARN mode. To run in YARN mode you need to package the Spark program into a jar and submit it to YARN. Only the branch-0.8 line really supports YARN.
◆ Running Spark in YARN mode:
Download the Spark code:
git clone git://github.com/mesos/spark
◆ Switch to branch-0.8:
cd spark
git checkout -b yarn --track origin/yarn
◆ Use sbt to compile and assemble Spark:
$SPARK_HOME/sbt/sbt package assembly
◆ Copy the Hadoop YARN configuration into the conf directory.
◆ Run a test:
SPARK_JAR=./core/target/scala-2.9.3/spark-core-assembly-0.8.0-SNAPSHOT.jar \
  ./run spark.deploy.yarn.Client \
  --jar examples/target/scala-2.9.3/ \
  --class spark.examples.SparkPi \
  --args yarn-standalone
Use Spark-shell
◆ spark-shell is easy to use. When Spark runs in Standalone mode, run $SPARK_HOME/spark-shell to enter the shell. Inside spark-shell a SparkContext has already been created, available as the instance named sc. Note also that in Standalone mode Spark defaults to a FIFO scheduler rather than fair scheduling, and spark-shell is itself a Spark program running on the cluster, so other Spark programs have to queue behind it; in effect only one spark-shell can usefully run at a time.
◆ Writing programs in spark-shell is just like writing them in the Scala shell.
scala> val textFile = sc.textFile("hdfs://hadoop1:2323/user/data")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> textFile.count() // Number of items in this RDD
res0: Long = 21374
scala> textFile.first() // First item in this RDD
res1: String = # Spark
Write Driver programs
◆ In Spark, a Spark program is called a Driver program. Writing a Driver program is almost the same as working in spark-shell, except that you have to create the SparkContext yourself. For example, a WordCount program looks like this:
import spark.SparkContext
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length == 0) {
      println("usage is org.test.WordCount")
    }
    println("the args: ")
    args.foreach(println)

    val hdfsPath = "hdfs://hadoop1:8020"

    // Create the SparkContext; args(0) is the appMaster address passed in from YARN.
    val sc = new SparkContext(args(0), "WordCount",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))

    val textFile = sc.textFile(hdfsPath + args(1))
    val result = textFile.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1)).reduceByKey(_ + _)

    result.saveAsTextFile(hdfsPath + args(2))
  }
}