
What are the performance characteristics of Spark


This article focuses on the performance characteristics of Spark. The material is simple, practical, and quick to read, so let's walk through it together.

Spark:

Apache Spark is a fast and general computing engine specially designed for large-scale data processing.

Spark is a general parallel framework in the spirit of Hadoop MapReduce, open-sourced by UC Berkeley's AMP Lab (the AMP Lab of the University of California, Berkeley). Spark has the advantages of Hadoop MapReduce; but unlike MapReduce, the intermediate output of a Job can be kept in memory, so it is no longer necessary to read and write HDFS between stages. This makes Spark better suited to algorithms that iterate over MapReduce-style computations, such as data mining and machine learning.

Spark is an open-source cluster computing environment similar to Hadoop, but the two differ in ways that make Spark superior for certain workloads: Spark keeps distributed datasets in memory, which optimizes iterative workloads in addition to interactive queries.

Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated, so Scala can manipulate distributed datasets as easily as local collection objects.

Although Spark was created to support iterative jobs on distributed datasets, it is actually a complement to Hadoop and can run in parallel on the Hadoop file system, with a third-party cluster framework called Mesos providing this support. Developed by the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, Spark can be used to build large-scale, low-latency data analysis applications.

Performance characteristics of Spark:

1. Speed: for in-memory computation, Spark can be up to 100 times faster than Hadoop MapReduce.

In-memory computing engine: provides a Cache mechanism to support iterative computation and cross-job data sharing, reducing the cost of reading data (a minimal iterative-caching sketch follows after this list).

DAG execution engine: reduces the overhead of writing intermediate results to HDFS between successive computations.

Multi-thread pool model: reduces task startup overhead, and avoiding unnecessary sort operations in the shuffle stage reduces disk I/O.

2. Ease of use:

Spark provides more than 80 high-level operators.

It offers rich APIs in four languages: Java, Scala, Python, and R.

Application code is typically 2 to 5 times shorter than the equivalent MapReduce code.

3. Versatility: Spark provides a large number of libraries, including SQL, DataFrames, MLlib, GraphX, Spark Streaming.

4. Support for multiple resource managers: Spark supports Hadoop YARN, Apache Mesos, and its own standalone cluster manager.
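As referenced in point 1 above, here is a minimal, hedged sketch of how the Cache mechanism helps iterative computation (the file path and the toy iteration are assumptions, not from the original article): the parsed input is cached once and then reused across iterations instead of being re-read from storage each time.

import org.apache.spark.{SparkConf, SparkContext}

object IterativeCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeCacheSketch"))
    // Hypothetical input path; cache() keeps the parsed values in memory after the first pass
    val values = sc.textFile("hdfs:///data/values.txt").map(_.toDouble).cache()
    var threshold = 0.0
    for (_ <- 1 to 10) {
      // Each iteration reuses the cached RDD instead of re-reading and re-parsing the file
      threshold = values.filter(_ > threshold).mean()
    }
    println(s"final threshold = $threshold")
    sc.stop()
  }
}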

Basic principles of Spark:

Spark Streaming: a framework for processing stream data, built on Spark. The basic idea is to divide the stream into small time slices (on the order of seconds) and process each slice in a batch-like manner. Spark Streaming is built on Spark because, on the one hand, Spark's low-latency execution engine (around 100 ms) can also be used for near-real-time computation compared with specialized stream processing software; on the other hand, compared with record-at-a-time frameworks such as Storm, RDDs with narrow dependencies can be recomputed from the source data to achieve fault tolerance. In addition, the micro-batch approach makes it compatible with both batch and real-time processing logic and algorithms, which helps applications that need to analyze historical and real-time data together.
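To make the micro-batch idea concrete, here is a minimal Spark Streaming sketch (the host, port, and 1-second batch interval are assumptions): the stream is cut into 1-second batches and each batch is word-counted like a small batch job.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCountSketch")
    // Divide the stream into 1-second micro-batches
    val ssc = new StreamingContext(conf, Seconds(1))
    // Hypothetical source: a text socket on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}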

Spark background:

1. Limitations of MapReduce:

1. Only Map and Reduce operations are supported.

2. Inefficient processing: not suitable for iterative computation (such as machine learning and graph computation), interactive processing (data mining), or stream processing (log analysis).

3. Intermediate Map results must be written to local disk, Reduce results are written to HDFS, and data is exchanged between multiple MR jobs through HDFS.

4. High cost of task scheduling and startup

5. Unable to make full use of memory (a product of its time: when MR appeared, memory was expensive and disk storage was cheap).

6. Sorting is required on both the Map side and the Reduce side.

7. MapReduce programming is not flexible enough (compared with functional programming in Scala).

8. Framework proliferation: each workload required its own framework, whereas a single framework such as Spark can implement batch processing, streaming computation, and interactive computation at the same time:

Batch processing: MapReduce, Hive, Pig

Streaming Computing: Storm

Interactive Computing: Impala

Spark core concepts:

RDD: Resilient Distributed Dataset

Can be stored on disk or in memory (multiple storage levels)

Constructed by parallel "transformation" operations

Automatic reconstruction after failure
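A minimal sketch of these properties (sc is assumed to be an existing SparkContext and the numbers are made up): an RDD is built through transformations, and the lineage recorded along the way, which can be inspected with toDebugString, is what lets Spark rebuild lost partitions automatically.

val nums = sc.parallelize(1 to 1000, 4)      // an RDD with 4 partitions
val squares = nums.map(n => n * n)           // built by a "transformation"
val evens = squares.filter(_ % 2 == 0)
println(evens.toDebugString)                 // shows the recorded lineage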

Basic RDD operations (operators)

Transformations in detail

map(func): returns a new distributed dataset formed by passing each element of the source through the function func.

filter(func): returns a new dataset consisting of the source elements for which func returns true.

flatMap(func): similar to map, but each input element can be mapped to 0 or more output elements (so func should return a Seq rather than a single element).

sample(withReplacement, frac, seed): randomly samples a fraction frac of the data, using the given random seed.

union(otherDataset): returns a new dataset formed by the union of the source dataset and the argument dataset.

groupByKey([numTasks]): called on a dataset of (K, V) pairs and returns a dataset of (K, Seq[V]) pairs. Note: by default, 8 parallel tasks are used for grouping; you can pass the optional numTasks parameter to set a different number of tasks according to the data volume.

reduceByKey(func, [numTasks]): called on a dataset of (K, V) pairs and returns a dataset of (K, V) pairs in which the values for each key are aggregated together using the given reduce function. As with groupByKey, the number of tasks can be configured through an optional second parameter.

join(otherDataset, [numTasks]): called on datasets of type (K, V) and (K, W), and returns a dataset of (K, (V, W)) pairs containing all pairs of elements for each key.

groupWith(otherDataset, [numTasks]): called on datasets of type (K, V) and (K, W), and returns a dataset of (K, Seq[V], Seq[W]) tuples. In other frameworks this operation is called CoGroup.

cartesian(otherDataset): Cartesian product. When called on datasets of type T and U, it returns a dataset of (T, U) pairs covering all combinations of elements.

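A hedged sketch putting a few of these transformations together (sc and the sample data are assumptions); note that nothing is computed yet, because transformations only record the lineage:

val lines = sc.parallelize(Seq("spark is fast", "spark is general"))
val pairs = lines.flatMap(_.split(" ")).map(w => (w, 1))       // flatMap + map
val counts = pairs.reduceByKey(_ + _)                          // reduceByKey: (K, V)
val grouped = pairs.groupByKey()                               // groupByKey: (K, Seq[V])
val sparkOnly = counts.filter { case (w, _) => w == "spark" }  // filter
val joined = counts.join(sparkOnly)                            // join: (K, (V, W))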

Actions in detail

reduce(func): aggregates all elements of the dataset using the function func. func takes two arguments and returns one value, and it must be associative so that it can be executed correctly in parallel.

collect(): returns all elements of the dataset as an array to the Driver program. This is usually done after filter or other operations have reduced the data to a sufficiently small subset; calling collect on an entire RDD can easily cause the Driver program to OOM.

count(): returns the number of elements in the dataset.

take(n): returns an array of the first n elements of the dataset. Note that this operation is currently not executed in parallel across nodes; the machine running the Driver program computes all of the elements itself (memory pressure on the Driver will increase, so use with caution).

first(): returns the first element of the dataset (similar to take(1)).

saveAsTextFile(path): saves the elements of the dataset as text files to the local file system, HDFS, or any other file system supported by Hadoop. Spark calls each element's toString method and writes it as a line of text in the file.

saveAsSequenceFile(path): saves the elements of the dataset in SequenceFile format to the specified directory on the local file system, HDFS, or any other file system supported by Hadoop. The elements of the RDD must be key-value pairs that either implement Hadoop's Writable interface or can be implicitly converted to Writable (Spark includes conversions for basic types such as Int, Double, and String).

foreach(func): runs the function func on each element of the dataset. This is usually used to update an accumulator variable or to interact with an external storage system.
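And a matching hedged sketch for the actions above (again assuming sc and small made-up data); these are the calls that actually trigger execution:

val nums = sc.parallelize(1 to 100)
val total = nums.reduce(_ + _)                      // reduce: associative aggregation
val howMany = nums.count()                          // count
val firstFive = nums.take(5)                        // take(n): computed on the Driver side
val small = nums.filter(_ < 10).collect()           // collect: keep the result small to avoid Driver OOM
nums.saveAsTextFile("file:///tmp/nums-output")      // hypothetical output path
nums.foreach(n => println(n))                       // foreach: runs on the executors, e.g. to update external storage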

Operator classification

Operators can be roughly divided into three categories:

Transformation operators over value-type data: these do not trigger job submission; the data items processed are plain values.

Transformation operators over key-value data: these do not trigger job submission; the data items processed are key-value pairs.

Action operators: these trigger the SparkContext to submit a Job.

Spark RDD cache/persist

Spark RDD cache

1. Allows an RDD to be cached in memory or on disk for reuse.

2. A variety of cache levels are provided, so users can adjust according to their actual needs.

3. Cache usage (a minimal sketch follows below).
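A minimal sketch of cache usage under assumed data (the log path is hypothetical); cache() is shorthand for persist with the MEMORY_ONLY level, and other storage levels can be selected explicitly:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///logs/access.log")          // hypothetical input
val errors = logs.filter(_.contains("ERROR"))
errors.cache()                                             // same as persist(StorageLevel.MEMORY_ONLY)
// errors.persist(StorageLevel.MEMORY_AND_DISK)            // or spill to disk when memory is tight
println(errors.count())                                    // the first action materializes and caches the RDD
errors.take(3).foreach(println)                            // later actions reuse the cached data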

We implemented WordCount with MapReduce before; now let's implement WordCount with Spark in Scala. Isn't it concise?

import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]) {
    if (args.length == 0) {
      System.err.println("Usage: SparkWordCount")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("SparkWordCount")
    val sc = new SparkContext(conf)
    val file = sc.textFile("file:///hadoopLearning/spark-1.5.1-bin-hadoop2.4/README.md")
    val counts = file.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("file:///hadoopLearning/spark-1.5.1-bin-hadoop2.4/countReslut.txt")
  }
}

Next, let's look at the characteristics of RDD Transformations and Actions.

1. Interfaces are defined in different ways:

Transformation: RDD[X] -> RDD[Y]

Action: RDD[X] -> Z (Z is not an RDD; it may be a primitive type, an array, etc.)

2. Lazy execution:

Transformation: only records the RDD conversion relationship; it does not trigger computation.

Action: the operator that triggers (distributed) execution of the program.
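A small hedged sketch of the lazy-execution point (the data is made up): the map below does no work until collect, an action, is called.

val data = sc.parallelize(1 to 5)
val mapped = data.map { n =>
  println(s"processing $n")   // nothing is printed yet: map only records the transformation
  n * 10
}
val result = mapped.collect() // only now is a job actually run, because collect is an action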

The execution process of the program:

Spark operation mode:

1. local (local mode): runs on a single machine, usually used for testing.

local: run with a single worker thread

local[k]: run with k worker threads

local[*]: run with as many worker threads as there are CPU cores

2. standalone (standalone mode): runs independently on a Spark cluster.

3. YARN/Mesos: runs on a resource management system such as YARN or Mesos.

Spark on YARN has two modes:

yarn-client

yarn-cluster

The difference between the two: in yarn-client mode the Driver runs on the client that submits the application, while in yarn-cluster mode the Driver runs inside the ApplicationMaster in the cluster.
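As a hedged illustration of how the run mode is selected (the application and host names are made up), the master URL decides where the job runs; in practice the master is usually supplied through spark-submit's --master flag rather than hard-coded:

import org.apache.spark.{SparkConf, SparkContext}

// Local mode with as many worker threads as there are CPU cores (handy for testing)
val localConf = new SparkConf().setAppName("ModeSketch").setMaster("local[*]")
// Standalone mode: point at the cluster manager's master URL (hypothetical host and port)
val standaloneConf = new SparkConf().setAppName("ModeSketch").setMaster("spark://master-host:7077")
// On YARN or Mesos the master is normally passed on the command line
// (e.g. spark-submit --master yarn ...), so the application code stays unchanged
val sc = new SparkContext(localConf)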

Application scenarios of Spark in the enterprise

Fast query service based on log data

Spark SQL, built on top of Spark, takes advantage of Spark's speed and in-memory tables to serve ad-hoc queries over log data (a minimal sketch follows below).
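A minimal ad-hoc log query sketch in the Spark 1.x SQLContext style used elsewhere in this article (the log path and schema are assumptions); cacheTable is what keeps the "memory table" behind fast repeated queries:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Hypothetical JSON log lines, e.g. {"time": "...", "level": "ERROR", "msg": "..."}
val logs = sqlContext.read.json("hdfs:///logs/app/*.json")
logs.registerTempTable("logs")
sqlContext.cacheTable("logs")   // keep the table in memory for repeated ad-hoc queries
sqlContext.sql("SELECT level, COUNT(*) AS cnt FROM logs GROUP BY level").show()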

Spark implementation of typical algorithms

Predicting the click-through probability of users' ads

Computing the number of common friends between two users (a minimal sketch follows after this list)

Spark SQL and DAG tasks for ETL
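As a hedged sketch of the common-friends computation (the input format "person: friend1,friend2,..." is made up, and each friendship is assumed to be listed on both sides):

val lines = sc.parallelize(Seq("A: B,C,D", "B: A,C,D", "C: A,B", "D: A,B"))
val friendSets = lines.map { line =>
  val Array(person, friends) = line.split(":")
  (person.trim, friends.split(",").map(_.trim).toSet)
}
// Key every friendship by the ordered pair of names, carrying each side's friend set
val pairs = friendSets.flatMap { case (person, friends) =>
  friends.map(f => (if (person < f) (person, f) else (f, person), friends))
}
// The intersection of the two sets is the set of common friends for that pair
val commonFriendCounts = pairs.reduceByKey(_ intersect _).mapValues(_.size)
commonFriendCounts.collect().foreach(println)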

At this point, I believe you have a deeper understanding of the performance characteristics of Spark. You might as well try it out in practice.
