
Spark Notes (2): Core Concepts of RDD and Spark


[TOC]

Spark RDD

A very basic explanation first; the points below give an initial picture of what an RDD is.

Basic description of Spark RDD

1. The core concept of Spark is the RDD (Resilient Distributed Dataset): a read-only, partitioned, distributed dataset. All or part of an RDD can be cached in memory and reused across multiple computations.

2. Abstractly, an RDD is a collection of elements containing data. It is partitioned into multiple partitions, each distributed on a different Worker node in the cluster, so the data in the RDD can be processed in parallel. (distributed dataset)

3. An RDD is usually created from files on Hadoop, that is, HDFS files or Hive tables; it can also be created from a local collection or derived by transforming another RDD. (See the sketch after this list.)

4. Although traditional MapReduce has the advantages of automatic fault tolerance, load balancing, and scalability, its biggest disadvantage is its acyclic data-flow model, which forces a large amount of disk I/O in iterative computation (each Job runs, writes its intermediate result to disk, and then the next Job runs; think of using MR for data cleaning). RDD is the abstraction that addresses this shortcoming. The most important feature of an RDD is that it provides fault tolerance and recovers automatically from node failures: if an RDD partition on a node is lost because of a failure, the RDD recomputes that partition from its own data source, and all of this is transparent to the user. This is the RDD's lineage feature (similar to a genealogy: if a partition's data is lost, Spark finds its parent partition and recomputes it, which I call traceability; the lineage can be inspected as in the sketch after this list).

5. RDD data is stored in memory by default, but when memory is insufficient, Spark automatically spills RDD data to disk. (resilient)
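Points 3-5 above (creation, lineage, resilience) can be seen in a minimal Scala sketch. This is only an illustration under assumptions: the app name, master URL, file paths, and HDFS URI are hypothetical, and it presumes a standard Spark installation.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddBasics {
  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a quick test run.
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // Point 3: create RDDs from an HDFS/local file, or from a local collection.
    val fromFile  = sc.textFile("hdfs://namenode:8020/data/hello.txt") // illustrative URI
    val fromLocal = sc.parallelize(Seq("spark", "rdd", "lineage"))

    // Point 4: each transformation records how the RDD derives from its parent,
    // which is what allows a lost partition to be recomputed.
    val pairs = fromLocal.map(w => (w, 1))
    println(pairs.toDebugString) // prints the lineage (RDD dependency chain)

    // Point 5: keep data in memory, spilling to disk if memory is insufficient.
    pairs.persist(StorageLevel.MEMORY_AND_DISK)
    println(pairs.count())

    sc.stop()
  }
}
```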

The position and function of RDD in Spark

(1) Why does Spark exist? Because traditional parallel computing models cannot efficiently handle iterative computation and interactive computation. Spark's mission is to solve these two problems, which is the value of, and reason for, its existence.

(2) How does Spark handle iterative computation? The main idea is the RDD, which keeps all computed data in distributed memory. Iterative computation usually means repeatedly iterating over the same dataset, and keeping that data in memory greatly reduces the I/O cost. This is also the core of Spark: in-memory computing. (wc: sc.textFile("./hello").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).foreach(println) is the typical example used here; a runnable sketch follows.)
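A hedged expansion of that word-count line, assuming an existing SparkContext `sc` (e.g. inside spark-shell) and a hypothetical local file `./hello`; the cache() call shows the same dataset being reused across several actions, which is where in-memory computing pays off.

```scala
// Assumes a SparkContext `sc` is already available.
val lines = sc.textFile("./hello")            // hypothetical input file
val counts = lines
  .flatMap(_.split(" "))                      // split each line into words
  .map((_, 1))                                // pair each word with a count of 1
  .reduceByKey(_ + _)                         // sum counts per word
  .cache()                                    // keep the result in memory for reuse

counts.foreach(println)                       // first pass over the cached data
println(counts.filter(_._2 > 1).count())      // second pass reuses the in-memory RDD
```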

(3) How does Spark support interactive computation? Because Spark is implemented in Scala, the two integrate tightly, and Spark can use Scala's interpreter directly; inside it, Scala code can manipulate distributed datasets as easily as local collection objects.
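A small illustration of that interactive style, assuming the spark-shell REPL (which pre-defines `sc`); the numbers are arbitrary.

```scala
// Typed line by line into spark-shell; each expression is evaluated immediately.
val nums  = sc.parallelize(1 to 100)          // distribute a local range across the cluster
val evens = nums.filter(_ % 2 == 0)           // transformation defined interactively
println(evens.count())                        // action runs on the cluster and prints 50
```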

(4) What is the relationship between Spark and RDD? RDD can be understood as a fault-tolerant abstraction for memory-based cluster computing, and Spark as the implementation of that abstraction.

Common Spark core modules

1. Spark Core: core module development, offline batch processing
2. Spark Streaming: real-time computing, also based on RDD underneath
3. Spark SQL/Hive: interactive analysis
4. Spark GraphX: graph computing
5. Spark MLlib: data mining and machine learning

Core concept terms

Most of these terms are easier to understand after you have written Spark programs and submitted jobs to a Spark cluster.

- ClusterManager: in Standalone mode, this is the Master node, which controls the whole cluster and monitors the Workers; in YARN mode it is the ResourceManager.
- Worker: slave node, responsible for controlling a compute node and starting Executors; in YARN mode this is the NodeManager, which controls the compute node.
- Driver: runs the Application's main() function and creates the SparkContext.
- Executor: the executor, a component on a Worker node that executes tasks; it starts a thread pool to run Tasks. Each Application has its own set of Executors.
- SparkContext: the context of the whole application; it controls the application's life cycle.
- RDD: Spark's basic computing unit; a group of RDDs forms an executable directed acyclic graph, the RDD Graph.
- DAGScheduler: splits a Spark job into one or more Stages. Each Stage determines its number of Tasks from the number of partitions of the RDD, then generates the corresponding Task set and hands it to the TaskScheduler.
- TaskScheduler: distributes Tasks to Executors for execution (so what an Executor runs is our code).
- Stage: a Spark job usually contains one or more Stages.
- Task: a Stage contains one or more Tasks, which run in parallel.
- Transformations: transformation operations (e.g. map, filter, groupBy, join). Transformations are lazy: converting one RDD into another is not executed immediately. When Spark encounters a Transformation, it only records the operation; the computation starts only when an Action is invoked (illustrated in the sketch below).
- Actions: action operations (e.g. count, collect, save). An Action returns a result or writes RDD data to a storage system; Actions are what trigger Spark to start computing.
- SparkEnv: thread-level context that stores references to important runtime components. SparkEnv creates and holds references to the following components, among others:
  - MapOutputTracker: responsible for storing Shuffle meta-information.
  - BroadcastManager: responsible for controlling broadcast variables and storing their meta-information.
  - BlockManager: responsible for storage management: creating and finding blocks.
  - MetricsSystem: monitors runtime performance metrics.
  - SparkConf: responsible for storing configuration information.
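A small sketch of lazy Transformations versus Actions, assuming an existing SparkContext `sc`; nothing is computed until the final Action runs.

```scala
// Transformations only build up the RDD lineage; no job is launched here.
val nums    = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)        // recorded, not executed
val bigOnes = doubled.filter(_ > 10) // recorded, not executed

// The Action is what triggers the actual computation on the cluster.
println(bigOnes.collect().mkString(", ")) // 12, 14, 16, 18, 20
```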
