[Figure: overview of the Spark framework]

Some important Spark concepts, briefly introduced first:
- Cluster manager: manages the cluster's resources, e.g. standalone or YARN.
- Application: the program written by the user.
- Driver: runs the application's main function and creates the SparkContext, which is responsible for communicating with the cluster manager, requesting resources, assigning tasks, and monitoring execution. The SparkContext is generally regarded as the Driver.
- Worker: a node in the cluster that can run tasks.
- Executor: a process on a worker that is responsible for running tasks.
- Task: the smallest unit executed by an executor; a stage is composed of multiple tasks.
- Stage: a job is divided into multiple stages; in general, a new stage is created whenever a shuffle occurs.
- Job: an application has at least one job; in Spark, every action produces a job.

The logical execution graph can be summarized in four steps:

1. Create the initial RDD from a data source.
2. Apply a series of transformations to the RDD, each producing a new RDD[T]. The type T can be a basic Scala type or a (K, V) pair, but K cannot be a complex data structure such as an array.
3. Perform an action on the final RDD; each partition produces a result.
4. Send the results back to the driver for the final computation.
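As a rough illustration, here is a minimal sketch of these four steps; it is not from the original article, and it assumes a local master and made-up input data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogicalPlanDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("logical-plan-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Step 1: create the initial RDD from a data source.
    val lines = sc.parallelize(Seq("a b", "b c", "c d"))

    // Step 2: a chain of transformations, each producing a new RDD.
    val counts = lines
      .flatMap(_.split(" "))   // RDD[String]
      .map(word => (word, 1))  // RDD[(String, Int)]
      .reduceByKey(_ + _)      // shuffle boundary: a new stage begins here

    // Steps 3 and 4: the action runs one task per partition and
    // sends the per-partition results back to the driver.
    val result = counts.collect()
    result.foreach(println)

    sc.stop()
  }
}
```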
How is the logical execution graph generated, i.e. which RDDs should be produced? In general, each transformation method returns one RDD, but some transformations invoke other transformations internally, so a single transformation may produce more than one RDD.
Which parent RDDs an RDD depends on is usually easy to see directly from the code.
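A minimal sketch of inspecting this, assuming an existing SparkContext named sc: toDebugString prints the lineage, showing which parent RDDs an RDD depends on, and also how a single transformation can create several RDDs internally.

```scala
val words = sc.parallelize(Seq("a", "b", "a"))
val uniq = words.distinct()
// distinct() alone builds several RDDs internally (map -> reduceByKey -> map),
// all of which show up in the printed lineage.
println(uniq.toDebugString)
```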
How many partitions does an RDD have? This is generally specified by the user; if it is not specified, the RDD takes the largest partition count among its parent RDDs.
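A minimal sketch of how partition counts behave, assuming sc exists; the counts in the comments are what a default configuration typically yields:

```scala
val a = sc.parallelize(1 to 100, 4).map(x => (x, x))  // 4 partitions
val b = sc.parallelize(1 to 100, 8).map(x => (x, x))  // 8 partitions

println(a.getNumPartitions)             // 4
println(a.union(b).getNumPartitions)    // union concatenates partitions: 4 + 8 = 12
println(a.join(b).getNumPartitions)     // no count given: typically follows the largest parent, 8
println(a.join(b, 16).getNumPartitions) // explicit count overrides the default: 16
```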
How do the partitions of an RDD depend on the partitions of its parent RDD? [Figure: four patterns of partition-level dependency]
The first three are narrow dependencies and the last one is a wide dependency. A narrow dependency is also called a complete dependency: all the data of a partition in the parent RDD is depended on by a single partition of the child RDD. A wide dependency is also called a partial dependency: part of the data of a partition in the parent RDD is depended on by partition1 of the child RDD, while the other part is depended on by partition2 of the child RDD; this is the situation in which a shuffle occurs.
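A minimal sketch of telling the two apart in code, assuming sc exists: rdd.dependencies shows whether a dependency is narrow or wide.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))

val mapped = pairs.mapValues(_ + 1)  // narrow: each child partition uses one whole parent partition
println(mapped.dependencies)         // contains a OneToOneDependency

val grouped = pairs.groupByKey()     // wide: parent partitions are split across child partitions
println(grouped.dependencies)        // contains a ShuffleDependency
```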
In general, as long as each partition of the parent RDD is not partially depended on by multiple partitions of the child RDD, the dependency is narrow and no shuffle occurs. There is one special case, the third pattern above: a partition of the parent RDD is depended on by multiple partitions of the child RDD, yet still no shuffle is needed (this is the case for the Cartesian product).

Commonly used transformations:

- union: merges two RDDs without changing the data inside the partitions.
- groupByKey: gathers the records with the same key together; after aggregation, the value for each key is a collection of all the original values for that key (map-side combine is not enabled by default).
- reduceByKey: works like traditional MapReduce, applying a function to the values of the same key to obtain a single final value; for example, with reduceByKey(_ + _) the values of the same key keep being added up (see the comparison with groupByKey sketched below).
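A minimal sketch contrasting groupByKey and reduceByKey on the same input, assuming sc exists:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey: no map-side combine; every value crosses the shuffle.
pairs.groupByKey().collect()        // roughly: Array((a, [1, 3]), (b, [2]))

// reduceByKey: combines on the map side first, then reduces after the shuffle.
pairs.reduceByKey(_ + _).collect()  // Array((a, 4), (b, 2))
```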
reduceByKey() enables map-side combine() by default: before the shuffle, a combine is performed through a mapPartitions operation, giving a MapPartitionsRDD; the shuffle then produces a ShuffledRDD; and finally the reduce (implemented through aggregate plus mapPartitions) produces another MapPartitionsRDD.

- distinct: deduplication. The transformation first turns each value into a (value, null) pair, then applies reduceByKey, and finally restores the original form.
- cogroup(otherRdd, numPartitions): similar to groupByKey, but it aggregates two or more RDDs, and the result is a little different. Within each RDD the values for the same key are first gathered into a collection, and then the collections for the same key across the RDDs are combined, giving something like ([a, c], [f]).
- intersection(otherRdd): extracts the data common to two RDDs. Internally, like distinct, it first turns each value into a (value, null) pair, then calls cogroup, and finally keeps only the keys that appear in both RDDs and restores the original form.
- join(otherRdd): aggregates two RDD[(K, V)]s in the manner of a SQL join. Similar to intersection(), it first does a cogroup(), producing a MappedValuesRDD, then takes the Cartesian product of Iterable[V1] and Iterable[V2] for each key and flattens the result.
- sortByKey: sorts the records in an RDD[(K, V)] by key; ascending = true for ascending order, false for descending (cogroup, join, and sortByKey are illustrated in the sketch below).
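A minimal sketch of cogroup, join, and sortByKey on two small pair RDDs, assuming sc exists:

```scala
val x = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val y = sc.parallelize(Seq(("a", 9), ("c", 4)))

// cogroup: per key, one collection of values from each RDD.
x.cogroup(y).collect()
// roughly: (a, ([1, 2], [9])), (b, ([3], [])), (c, ([], [4]))

// join: cogroup first, then the Cartesian product of the two value collections.
x.join(y).collect()
// (a, (1, 9)), (a, (2, 9))

x.sortByKey(ascending = false).collect()
// keys in descending order: (b, _) before (a, _)
```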
- cartesian: the Cartesian product is the case mentioned above in which a partition of the parent RDD is depended on by multiple partitions of the child RDD, yet still no shuffle is needed.
- coalesce: when shuffle = false, it cannot increase the number of partitions.
- filterByRange(lower: K, upper: K): filters by the range of the element keys in the RDD, including the lower and upper boundaries.

Common Spark actions:

- reduce(func): aggregates the elements in the dataset with the function func (merging them pairwise).
- collect(): returns the elements of the dataset as an array on the driver program. This is useful after a filter or another operation that returns a sufficiently small subset.
- count(): returns the number of elements in the dataset.
- first(): returns the first element in the dataset (similar to take(1)).
- take(n): returns the first n elements in the dataset.
- takeOrdered(n, [ordering]): returns the first n elements of the RDD in their natural order or using a custom comparator.
- saveAsTextFile(path): writes the elements of the dataset to one or more text files in the given directory, which can be on the local file system, HDFS, or any other file system supported by Hadoop.
- countByKey(): applies only to RDDs of type (K, V); returns a hashmap of (K, Int) pairs with the count of each key.
- foreach(func): executes the function func on each element of the dataset.
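A minimal sketch of these actions, assuming sc exists; the output path is a placeholder:

```scala
val nums = sc.parallelize(Seq(5, 1, 4, 2, 3))

nums.reduce(_ + _)   // 15: pairwise aggregation of all elements
nums.count()         // 5
nums.first()         // first element (like take(1))
nums.take(2)         // first 2 elements
nums.takeOrdered(3)  // Array(1, 2, 3): natural ordering

val kv = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
kv.countByKey()      // Map(a -> 2, b -> 1), returned to the driver

nums.saveAsTextFile("/tmp/nums-out")  // the directory must not already exist
nums.foreach(println)                 // runs on the executors, not on the driver
```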