Today I will talk with you about the basic concepts of Spark, which many people may not know well. To help you understand them better, I have summarized the following content, and I hope you take something away from this article.
Introduction to Spark
Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. It has developed rapidly and now has a broad ecosystem of applications.
Apache Spark features:
1. Fast
Most operations are performed iteratively in memory; only a small number of operations need to spill to disk.
2. Ease of use
Supports Scala, Java, Python, R, and other languages; provides more than 80 operators, and the API is easy to use.
3. Versatility
Spark provides a one-stop solution for data processing: the Spark Core operators (instead of Hadoop MapReduce), Spark SQL for batch processing (instead of Hive SQL), Spark Streaming for real-time computing (instead of Storm), Spark MLlib for machine learning (instead of Mahout), and Spark GraphX for graph computation.
4. Cross-platform
Spark can run in Local mode, Standalone mode, and Cluster mode.
Local mode: runs and debugs locally, supports breakpoints, and lets you specify the number of parallel threads.
Standalone mode: Spark manages resources itself, with a Master and Workers, roughly equivalent to the ResourceManager and NodeManagers.
Cluster mode: the distributed mode used in production environments; resources are managed by YARN or Mesos. A minimal sketch of how the master URL selects the mode follows this list.
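The application name, the master URLs mentioned in the comments, and the toy computation below are illustrative only, not taken from the article:

import org.apache.spark.{SparkConf, SparkContext}

object ModeDemo {
  def main(args: Array[String]): Unit = {
    // "local[4]" selects Local mode with 4 parallel threads, convenient for debugging.
    // For Standalone mode the master URL looks like "spark://master-host:7077";
    // for a YARN cluster the application is usually submitted via spark-submit --master yarn.
    val conf = new SparkConf().setAppName("mode-demo").setMaster("local[4]")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())   // runs a tiny job to verify the context works
    sc.stop()
  }
}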
Applicable scenarios for Spark
At present, big data workloads fall into several categories:
Complex batch processing (Batch Data Processing), which focuses on the ability to handle large volumes of data; tolerable processing times typically range from tens of minutes to hours.
Interactive queries over historical data (Interactive Query), typically taking between tens of seconds and tens of minutes.
Processing of real-time data streams (Streaming Data Processing), typically taking between hundreds of milliseconds and a few seconds.
Spark success stories
At present, Internet companies mainly apply big data to businesses such as finance, advertising, reporting, and recommendation systems. In advertising, big data is used for application analysis, effectiveness analysis, and targeting optimization; in recommendation systems, it is used to optimize rankings, personalized recommendations, and hot-click analysis. What these scenarios have in common is a large amount of computation and a demand for high efficiency. Typical users include Tencent, Xiaomi, Taobao, and Youku Tudou.
Hadoop Map/Reduce and Spark
The Development History of Spark
Version Development of Spark
Scala and functional programming
Scala introduction
Scala is a multi-paradigm programming language, which is designed to integrate the features of object-oriented programming and functional programming. Scala runs on the Java virtual machine and is compatible with existing Java programs.
Characteristics
1. Object-oriented
Scala is a pure object-oriented language in which every value is an object. The data types and behaviors of objects are described by classes and traits.
2. Functional programming
Scala is also a functional language, and its functions can be used as values. Scala provides lightweight syntax for defining anonymous functions, supports higher-order functions, allows functions to be nested, and supports currying. Scala's case classes and built-in pattern matching are the equivalent of the algebraic data types commonly used in functional programming languages.
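A small, self-contained sketch of these features (the Shape, Circle, and Rect names are illustrative, not from the article):

// Anonymous function passed to the higher-order function map.
val doubled = List(1, 2, 3).map(x => x * 2)        // List(2, 4, 6)

// Currying: a function with multiple parameter lists can be partially applied.
def add(a: Int)(b: Int): Int = a + b
val addTen = add(10) _                             // a function Int => Int
println(addTen(5))                                 // 15

// Case classes plus pattern matching, Scala's counterpart of algebraic data types.
sealed trait Shape
case class Circle(r: Double) extends Shape
case class Rect(w: Double, h: Double) extends Shape

def area(s: Shape): Double = s match {
  case Circle(r)  => math.Pi * r * r
  case Rect(w, h) => w * h
}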
3. Static typing
Scala has a type system that ensures code safety and consistency through compile-time checking.
4. Concurrency
Scala uses the Actor as its concurrency model. An Actor is a thread-like entity that sends and receives messages through a mailbox. Actors can reuse threads, so a program can use millions of Actors, whereas only a few thousand threads can be created. Since version 2.10, Akka has been the default Actor implementation.
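A minimal sketch with the classic Akka actor API, assuming an Akka dependency is on the classpath; the Greeter actor and its message are illustrative only:

import akka.actor.{Actor, ActorSystem, Props}

// An actor is a lightweight entity that processes messages arriving in its mailbox.
class Greeter extends Actor {
  def receive: Receive = {
    case name: String => println(s"Hello, $name")
  }
}

object ActorDemo extends App {
  val system  = ActorSystem("demo")
  val greeter = system.actorOf(Props[Greeter], "greeter")
  greeter ! "Spark"          // asynchronous, fire-and-forget message send
  system.terminate()
}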
Functional programming
Functional programming is a programming paradigm that treats computation as the evaluation of functions. Its most important foundation is the lambda calculus; functions in the lambda calculus can accept functions as inputs (arguments) and produce functions as outputs (return values).
Characteristics
1. Functions are first-class citizens
Functional programming treats functions as first-class objects; such function values are sometimes called closures or functor objects.
2. Lazy evaluation
Under lazy evaluation, an expression is not evaluated immediately when it is bound to a variable, but only when the evaluator needs to produce its value. Lazy evaluation makes it possible to write functions that can produce potentially infinite output.
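A brief sketch in Scala; the Stream-based counter is illustrative (in newer Scala versions LazyList plays the same role):

// lazy val: the right-hand side is evaluated only on first access.
lazy val config = { println("loading config"); Map("mode" -> "local") }

// A potentially infinite sequence is possible because elements are computed on demand.
def from(n: Int): Stream[Int] = n #:: from(n + 1)
val firstEvens = from(0).filter(_ % 2 == 0).take(5).toList   // List(0, 2, 4, 6, 8)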
3. Expressions rather than statements
An "expression" is a single computation that always has a return value; a "statement" performs some operation but has no return value. Functional programming requires the use of expressions only, not statements; in other words, every step is a single computation that returns a value. For example, the following Spark call chain is one expression that reads, filters, and persists an RDD:
sc.thriftParquetFile(fileStr, classOf[SomeClass], force = true)
  .filter(infor => infor.classId == CLASSID_REPAY)
  .persist(StorageLevel.DISK_ONLY)
4. No "side effects"
A "side effect" is an interaction between the inside of a function and the outside world (most typically, changing the value of a global variable) that produces an effect other than computing the result.
Functional programming emphasizes that functions have no side effects: a function remains independent, its only job is to return a new value, and it has no other behavior; in particular, it must not modify the value of external variables.
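A small sketch contrasting an impure function with a pure one (the names are illustrative):

// Impure: reads and mutates external state, so the same call can give different results.
var total = 0
def addToTotal(x: Int): Int = { total += x; total }

// Pure: the result depends only on the arguments, and nothing outside is modified.
def add(acc: Int, x: Int): Int = acc + x
val sum = List(1, 2, 3).foldLeft(0)(add)   // 6, with no external state touched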
Spark RDD characteristics
An RDD (Resilient Distributed Dataset) is an in-memory abstraction over a distributed dataset; it provides fault tolerance through a restricted form of shared memory, and this memory model makes computation more efficient than traditional data-flow models. An RDD has five important features, listed below, with a short sketch after the list:
1. A set of partitions, the basic units of the dataset.
2. A function for computing each partition.
3. Dependencies on parent RDDs, which describe the lineage between RDDs.
4. Optionally, for key-value RDDs, a Partitioner (usually a HashPartitioner or RangePartitioner).
5. Optionally, a list of preferred locations (for example, the locations of the blocks of an HDFS file).
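An illustrative sketch of the first, third, and fourth features, assuming sc is an existing SparkContext and using made-up sample data:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)
println(pairs.getNumPartitions)          // 4: a set of partitions (feature 1)
val hashed = pairs.partitionBy(new HashPartitioner(2))
println(hashed.partitioner)              // Some(HashPartitioner): feature 4
println(hashed.toDebugString)            // feature 3: the lineage back to pairs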
Spark operator
Spark data processing consists of input, transformation, output, and other stages, which are defined in Spark in the form of operators. An operator is a function defined on an RDD that transforms and manipulates the data in the RDD.
Broadly speaking, Spark operators fall into the following two categories:
1) Transformation operators: these transformations do not trigger job submission; they make up the intermediate stages of a job.
Transformation operations are lazy: the transformation from one RDD to another RDD is not executed immediately,
but only when an Action operation actually triggers it.
According to the type of data being processed, Transformation operators can be further divided into two categories:
A) Transformation operators over value-type data: these do not trigger job submission, and the items processed are plain values.
B) Transformation operators over key-value data: these do not trigger job submission, and the items processed are key-value pairs.
2) Action operators: these operators trigger the SparkContext to submit a job (Job) and output data from the Spark system. A short sketch of the difference between the two categories follows.
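The sketch below assumes sc is an existing SparkContext; the input path is illustrative only:

// Nothing is executed by these two lines; they only record the lineage.
val lines  = sc.textFile("hdfs:///tmp/input.txt")       // Transformation (lazy)
val errors = lines.filter(_.contains("ERROR"))          // Transformation (lazy)

// Only this Action triggers SparkContext to submit a job and run the whole chain.
println(errors.count())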
1) Transformation operators
A) One-to-one mapping between input and output partitions: a) map b) flatMap c) mapPartitions d) glom
B) Many-to-one mapping between input and output partitions: a) union b) cartesian
C) Many-to-many mapping between input and output partitions: a) groupBy
D) Output partitions that are a subset of the input partitions: a) filter b) distinct c) subtract d) sample e) takeSample
E) Cache type: a) cache b) persist
2) Transformation operators for key-value data
A) One-to-one mapping between input and output partitions: a) mapValues
B) Aggregation over a single RDD: a) combineByKey b) reduceByKey c) partitionBy
C) Aggregation over two RDDs: a) cogroup
D) Joins: a) join b) leftOutJoin c) rightOutJoin
3) Action operators
A) No output: a) foreach
B) HDFS output: a) saveAsTextFile b) saveAsObjectFile
C) Scala collections and data types: a) collect b) collectAsMap c) reduceByKeyLocally d) lookup e) count f) top g) reduce h) fold i) aggregate
4) Data-loading Transformation operators
A) File reading: a) textFile
B) Generation from memory: a) makeRDD b) parallelize
Spark operators - value type
1) map: applies a specified function (the mapping) to each element of the RDD to produce a new RDD. Each element of the original RDD corresponds to exactly one element in the new RDD. map can also turn elements into key-value pairs.
2) flatMap: transforms each element of the original RDD into new elements through the function f, and merges the elements of every generated collection into one collection.
3) mapPartitions: map operates on every element of the RDD, while mapPartitions (like foreachPartition) operates on the iterator of each partition, which makes it much more efficient than map. The mapPartitions function receives the iterator of each partition and processes the elements of the whole partition through that iterator.
4) glom: converts the elements of each partition into an array.
5) union: merges two RDDs of the same element type; duplicates are not removed.
6) cartesian: performs a Cartesian product over all elements of the RDDs.
7) groupBy: generates a key for each element through a function, converting the data into key-value form, and then groups elements that share the same key.
8) filter: filters the elements of the RDD.
9) distinct: removes duplicate elements from the RDD.
10) subtract: removes from one RDD the elements it has in common with another RDD.
11) sample: samples the elements of the RDD to obtain a subset of all elements (random sampling by proportion).
12) takeSample: works on the same principle as sample above, except that instead of sampling by proportion it samples according to a specified number of elements.
A short sketch using several of these operators follows.
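The sketch below assumes sc is an existing SparkContext; the sample data is illustrative only:

val docs = sc.parallelize(Seq("spark streaming", "spark sql", "spark sql"))

val words   = docs.flatMap(_.split(" "))      // flatMap: one line becomes several words
val upper   = words.map(_.toUpperCase)        // map: one element in, one element out
val noSql   = upper.filter(_ != "SQL")        // filter: keep only the matching elements
val unique  = noSql.distinct()                // distinct: remove duplicates
val grouped = unique.groupBy(_.head)          // groupBy: key derived from each element
println(grouped.collect().mkString(", "))     // collect is an Action, used here to print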
Spark operators - key-value type
1) mapValues: transforms the values; the keys of the original RDD stay the same and, together with the new values, form the elements of the new RDD. This function therefore applies only to RDDs whose elements are key-value pairs: the map operation is applied to the value of each (Key, Value) pair while the key is left untouched.
2) combineByKey: aggregates by key; for example, it can turn an RDD whose elements are (Int, Int) pairs into an RDD of type (Int, Seq[Int]).
3) reduceByKey: a simpler special case of combineByKey in which two values are merged into one value, i.e. values with the same key are merged.
4) partitionBy: repartitions the RDD according to the key.
5) cogroup: aggregates two RDDs by key.
6) join: joins two RDDs by key. A short sketch using these operators follows.
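The sketch below assumes sc is an existing SparkContext; the sample data is illustrative only:

val scores = sc.parallelize(Seq(("math", 90), ("math", 70), ("eng", 80)))
val names  = sc.parallelize(Seq(("math", "Mathematics"), ("eng", "English")))

val bumped = scores.mapValues(_ + 5)          // mapValues: keys stay untouched
val totals = scores.reduceByKey(_ + _)        // reduceByKey: merge values per key
val joined = totals.join(names)               // join: ("math", (160, "Mathematics")), ...
println(joined.collect().mkString(", "))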
Spark operators - Action type
1) foreach: loops over each element.
2) saveAsTextFile: saves the final result as text files in the specified HDFS directory.
3) saveAsObjectFile: saves the final result as binary files in the specified HDFS directory.
4) collect: collects the elements of the RDD to the driver.
5) collectAsMap: collects the elements of a key-value RDD into a map.
6) reduceByKeyLocally: performs a reduce followed by a collectAsMap; it first reduces the whole RDD and then collects all results into a HashMap that is returned to the driver.
7) lookup: looks up elements; for a (Key, Value) RDD it returns the elements corresponding to the specified key.
8) count: counts the number of elements.
9) top: top(n) returns the n largest elements.
10) reduce: first aggregates the dataset of each partition with the function func, and then aggregates the results across partitions. func takes two arguments and returns a new value, which is passed back into func together with the next element until the last element has been processed.
11) fold: merges the elements, like reduce but starting from a given zero value.
12) aggregateByKey: aggregates the data by merging; the aggregation is parallelized and grouped by key.
A short sketch using several of these operators follows.
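The sketch below assumes sc is an existing SparkContext; the output path is illustrative and therefore commented out:

val nums = sc.parallelize(1 to 10)

println(nums.count())                 // count: number of elements
println(nums.reduce(_ + _))           // reduce: aggregate within and then across partitions
println(nums.fold(0)(_ + _))          // fold: like reduce, starting from a zero value
println(nums.top(3).toList)           // top: the 3 largest elements
nums.foreach(n => println(n))         // foreach: runs on the executors, not the driver
// nums.saveAsTextFile("hdfs:///tmp/output")   // writes one part file per partition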
After reading the above, do you have a better understanding of the basic concepts of Spark? If you want to learn more, please follow the industry information channel. Thank you for your support.