What are the ways in which Spark creates RDDs


This article mainly introduces the ways in which Spark creates RDDs. In daily work, many people have doubts about how Spark creates RDDs, so the editor has consulted various materials and put together a simple, easy-to-follow set of notes, in the hope of answering those doubts. Please follow along with the study below!

## The environment that gave rise to the technology

Goal: avoid having to deploy a separate framework and cluster for every computing scenario.

Cluster computing over data follows the classic MapReduce idea, with Hadoop later becoming the most important implementation: a distributed cluster that simplifies programming by handling data-locality awareness, fault tolerance and load balancing while operating on large data sets. This model is data-flow based: HDFS -> compute -> HDFS. Tez builds a DAG on top of this data flow to implement task scheduling and fault recovery, but every operation still reads from and writes to disk, so if the same data is operated on a second time it has to be read and computed in full again. Iterative workloads such as graph computation, machine learning and interactive queries suffer badly under this model.

## How to solve the problem

RDD was introduced to solve the problems above.

It contrasts with the shared-memory model, which needs a checkpoint or rollback mechanism for fault tolerance.

# Three ways to create an RDD

From an existing Scala collection (e.g. via parallelize)

From external storage such as HDFS or HBase

By transforming another RDD, as sketched below
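
A minimal sketch of the three approaches, assuming a SparkContext named sc is already available (the file path is a placeholder):

```scala
// 1. From an existing Scala collection
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. From external storage (a text file on HDFS here; the path is a placeholder)
val fromFile = sc.textFile("hdfs:///data/input.txt")

// 3. By transforming another RDD
val fromTransformation = fromCollection.map(_ * 2)
```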

# The lazy nature of transformations in Spark RDD

textFile produces a HadoopRDD and then a MapPartitionsRDD.
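
A small sketch of this lazy behaviour, again assuming a SparkContext named sc and a placeholder path: the transformations only build the lineage, and nothing is read until an action runs.

```scala
// textFile and map only build the lineage (HadoopRDD -> MapPartitionsRDD); no data is read yet
val lines = sc.textFile("hdfs:///data/input.txt")
val lengths = lines.map(_.length)

// Only this action triggers a job and actually reads the file
val total = lengths.count()
```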

# Analysis of the runtime process of Spark RDD

An RDD is a logical structure; the data itself is stored and managed by the BlockManager.

# Detailed explanation of transformation operators in Spark RDD

map: transforms each element in every partition (v1 -> v1'); the number of partitions does not change. Within a stage, consecutive maps are composed and executed together.

flatMap: transforms each element in a partition and flattens the resulting collections into a single sequence of elements.

mapPartitions: operates on an entire partition at a time, e.g. iter => iter.filter(_ > 3); the number of partitions remains unchanged.

glom: turns each partition into an array; the number of partitions stays the same.

filter: keeps an element when the passed function returns true and discards it otherwise; the number of partitions remains the same.

distinct: removes duplicate elements; the number of partitions stays the same.

cartesian: Cartesian product of two RDDs, computed across their partitions.

union: merges two RDDs without deduplication; the number of partitions changes (it becomes the sum of the inputs).

mapValues: for a key-value RDD, operates only on the values in each partition; the keys are untouched and the number of partitions is unchanged.

subtract: removes from one RDD the elements that also appear in another.

sample: samples an RDD, e.g. with fraction = 0.5 and seed = 9; the result is still an RDD.

takeSample: e.g. num = 1, seed = 9; the result is not an RDD but a local collection.

groupBy: groups by key; each key maps to an array of its values.

partitionBy: repartitions an RDD with a partitioner.

cogroup: groups the keys of several key-value RDDs; each key maps to a tuple of value collections.

combineByKey: the general operator underlying groupByKey; values are grouped within partitions and the number of partitions is unchanged.

reduceByKey: combines the values of the same key within each partition.

join: inner join of two key-value RDDs on the key.

leftOuterJoin: keeps every key of the left RDD, with None where the right RDD has no match.

rightOuterJoin: keeps every key of the right RDD, with None where the left RDD has no match.

A short sketch using a few of these operators follows.
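
A short sketch exercising a few of these transformations, assuming a SparkContext named sc (the data is made up for illustration):

```scala
val sales = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
val prices = sc.parallelize(Seq(("apples", 1.2), ("pears", 0.8)))

// mapValues: only the values are touched, keys and partitioning are preserved
val doubled = sales.mapValues(_ * 2)

// reduceByKey: combine the values of the same key
val totals = sales.reduceByKey(_ + _)      // ("apples", 8), ("pears", 2)

// join: inner join on the key
val joined = totals.join(prices)           // ("apples", (8, 1.2)), ("pears", (2, 0.8))

// Nothing above has executed yet; collect is the action that triggers the job
joined.collect().foreach(println)
```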

# Detailed explanation of cache and persist in Spark RDD

cache is a particular case of persist (persist with the default MEMORY_ONLY storage level). Both are lazy operations, whereas unpersist takes effect immediately.
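
A minimal sketch of cache versus persist, assuming an RDD named data already exists:

```scala
import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val cached = data.cache()

// persist lets another storage level be chosen; a separate RDD is used here because
// the storage level of an RDD cannot be changed once it has been set
val onDiskToo = data.map(identity).persist(StorageLevel.MEMORY_AND_DISK)

// Both are lazy: nothing is materialised until an action runs
cached.count()
onDiskToo.count()

// unpersist, by contrast, takes effect immediately
cached.unpersist()
```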

# Detailed explanation of action operators in RDD

foreach: applies a function to each element on the executors; nothing is returned to the driver.

collect: returns all elements to the driver as an array (essentially toArray).

collectAsMap: for a key-value RDD, returns a HashMap to the driver; when a key is repeated, later values overwrite earlier ones.

reduceByKeyLocally: equivalent to reduce followed by collectAsMap on a key-value RDD.

lookup: finds the sequence of values for a given key; if the RDD has a partitioner the relevant partition is located first, otherwise the whole RDD is scanned by brute force.

Count: count the number of elements in all partitions

top: returns the largest n elements, in descending order, to the driver.

reduce: applies reduceLeft within each partition, then reduceLeft over the per-partition results.

fold: like reduce, but with an explicit zero (initial) value.

aggregate: like fold, but the result type may differ from the element type; takes a zero value plus a seqOp and a combOp function.

saveAsTextFile: writes the RDD's elements to a path as text files.

saveAsObjectFile: writes the elements as serialized objects into a SequenceFile.

A short sketch of several of these actions follows.
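
A short sketch of several of these actions, assuming a SparkContext named sc:

```scala
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)

nums.count()          // 5
nums.reduce(_ + _)    // 15
nums.fold(0)(_ + _)   // 15: same as reduce, but with an explicit zero value
nums.top(2)           // Array(5, 4)

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
pairs.collectAsMap()  // duplicate key "a" keeps only one of its values
pairs.lookup("a")     // Seq(1, 2)
```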

# Usage scenarios and working mechanisms of caching and checkpointing in RDD

## Caching (persist)

Cache what will be reused. For example, in a chain of steps 1 -> 2 -> 3 -> 4 -> [5], where the result of step [5] is consumed by steps 6.1, 6.2 and 6.3, step [5] should be cached.

Even so, the cached data of step [5] (say 10,000 records) may still be lost, for example through eviction or a node failure, and would then have to be recomputed from the lineage.
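
A sketch of that reuse pattern, with step5 standing in for the result of step [5] (the path is a placeholder):

```scala
// step5 is the expensive intermediate result that three branches will reuse
val step5 = sc.textFile("hdfs:///data/input.txt")
  .map(_.toLowerCase)
  .filter(_.nonEmpty)
  .cache()                 // keep it in memory for reuse

// 6.1, 6.2 and 6.3 all read the cached data instead of recomputing the whole chain
val branch1 = step5.count()
val branch2 = step5.filter(_.startsWith("a")).count()
val branch3 = step5.map(_.length).sum()
```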

## Checkpointing (checkpoint)

Unlike the cache, a checkpoint is written to reliable storage (the checkpoint directory, typically on HDFS). Checkpoint when a large amount of data crosses stages, after a long computation chain, or after a time-consuming calculation.

checkpoint changes the lineage of the RDD and is triggered after an action. It is introduced so that when cached data is lost, the performance cost of recomputation is avoided. Checkpointing generates a new job after the action triggers, so an RDD that is going to be checkpointed should be cached first; that way the checkpoint job does not recompute it and runs faster. Stream computing and graph computing use checkpoints heavily.
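
A minimal checkpoint sketch, assuming a SparkContext named sc (the checkpoint directory is a placeholder):

```scala
// The checkpoint directory should live on reliable storage such as HDFS
sc.setCheckpointDir("hdfs:///checkpoints")

val longChain = sc.parallelize(1 to 1000000)
  .map(_ * 2)
  .filter(_ % 3 == 0)

// Cache before checkpointing so the extra checkpoint job does not recompute the chain
longChain.cache()
longChain.checkpoint()

// The first action triggers the normal job; a separate job then writes the checkpoint
// and the lineage of longChain is truncated
longChain.count()
```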

# Narrow dependencies (NarrowDependency) and wide dependencies (ShuffleDependency) in RDD

Narrow dependency: one partition feeds one computing task, and the tasks are independent of each other.

In the source code this is modelled by the Dependency class, with NarrowDependency and ShuffleDependency as its subclasses.

Narrow dependencies let the scheduler optimize, for example by pipelining operations within a stage.

Wide dependencies are the basis for dividing stages, and stages are the coarse-grained units that make up the DAG.
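
One way to observe the stage boundary that a wide dependency introduces is toDebugString, which prints the lineage with a new indentation level at each shuffle; a small sketch:

```scala
val words = sc.parallelize(Seq("a b", "b c", "c a"))
  .flatMap(_.split(" "))        // narrow: pipelined within the same stage
  .map(word => (word, 1))       // narrow: still the same stage

val counts = words.reduceByKey(_ + _)   // wide (ShuffleDependency): starts a new stage

// The indented blocks in the output correspond to stages separated by the shuffle
println(counts.toDebugString)
```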

# Analysis of the two types of Spark RDD task and of the iterator

The tasks of the final stage are ResultTasks, while the tasks of the upstream stages it depends on are ShuffleMapTasks. Both share a common path: their runTask method calls the RDD's iterator to start the computation.

ShuffleMapTask -> bucket: a ShuffleMapTask writes its output into buckets for the downstream stage.

Determine whether there is a cache

Determine if there is a checkpoint

SparkEnv

# Source-level explanation of how the RDD iterator handles the cache

# Source-level explanation of checkpoint processing in Spark RDD

# The fault-tolerance principle of Spark RDD and an analysis of its four core points

# Detailed explanation of core concepts and common terms in Spark applications

An application can contain several jobs (one job per action).
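
A small sketch of one application containing several jobs: each action below triggers its own job within the same application.

```scala
val data = sc.parallelize(1 to 100)

val total = data.sum()                        // action 1 -> job 1
val evens = data.filter(_ % 2 == 0).count()   // action 2 -> job 2
val firstFive = data.take(5)                  // action 3 -> job 3
```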

# Overview of the Spark application job-scheduling process and its underlying running mechanism

# Detailed description of the two run modes of Spark applications: cluster and client

In cluster mode the driver runs on a worker node of the cluster; in client mode the driver runs locally on the submitting machine.

All scheduling is managed through the SchedulerBackend in the driver.

Executors run tasks in parallel using multiple threads.

# Analysis of DAGScheduler, TaskScheduler and SchedulerBackend

At this point, the study of the ways in which Spark creates RDDs is over. I hope it has helped resolve your doubts. Pairing theory with practice is the best way to learn, so go and try it out! If you want to keep learning more related knowledge, please continue to follow the site; the editor will keep working to bring you more practical articles!
