This article mainly introduces the ways in which Spark creates RDDs. Many people have doubts about this in daily work, so the editor has consulted various materials and put together a simple, practical set of notes. I hope it helps answer the question "what are the ways Spark creates RDDs?" Please follow along and study with the editor!
## The environment that gave rise to the technology
The goal was to avoid deploying a separate cluster for every computing scenario.
Cluster computing over data follows the classic MapReduce idea, with Hadoop later becoming the most important implementation: a distributed cluster that simplifies programming by handling location awareness, fault tolerance and load balancing while operating on large data sets. This model is a data-flow style: HDFS -> compute -> HDFS. Tez adds DAG-based task scheduling and fault recovery on top of the data-flow model, but every operation still reads from and writes to disk, so running the same operation a second time means recomputing everything from scratch. That is painful for iterative workloads such as graph computation and machine learning, and for interactive queries.
## How the problem is solved
RDD was introduced to solve the problems above.
For fault tolerance it relies on lineage with a checkpoint/rollback mechanism rather than a distributed shared-memory model.
# Three ways to create an RDD
From an existing Scala collection
From external storage such as HDFS or HBase
By transforming another RDD
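A minimal Scala sketch of all three creation paths, assuming a local SparkContext and a placeholder HDFS path:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-creation").setMaster("local[*]"))

// 1. From an existing Scala collection
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)

// 2. From external storage such as HDFS (the path is a placeholder)
val fromFile = sc.textFile("hdfs:///tmp/input.txt")

// 3. By transforming another RDD
val fromTransformation = fromCollection.map(_ * 2)
```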
# Lazy evaluation of transformations in Spark RDD
textFile produces a HadoopRDD wrapped in a MapPartitionsRDD.
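A small sketch of the lazy behaviour, assuming the same local setup and a placeholder input path; nothing is read or computed until the action at the end:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local[*]"))

// Transformations only build the lineage; no data is read here.
val lines   = sc.textFile("hdfs:///tmp/input.txt")  // HadoopRDD wrapped in a MapPartitionsRDD
val lengths = lines.map(_.length)                   // another MapPartitionsRDD, still lazy

// Only the action triggers the actual job.
println(lengths.reduce(_ + _))
```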
# How Spark RDDs behave at runtime
An RDD is a logical structure; the actual data blocks are managed by the BlockManager.
# Detailed explanation of transformation operators in Spark RDD
map: transforms each element in each partition (v1 -> v'1); the number of partitions does not change. Consecutive maps in the same stage are combined and pipelined.
flatMap: maps each element to a sequence of elements and flattens the results within the partition.
mapPartitions: operates on a whole partition at a time, e.g. iter => iter.filter(_ > 3); the number of partitions stays the same.
glom: turns each partition into an array; the number of partitions stays the same.
filter: keeps an element when the supplied function returns true and drops it otherwise; the number of partitions stays the same.
distinct: removes duplicate elements; the number of partitions stays the same.
cartesian: Cartesian product of the partitions of two RDDs.
union: merges two RDDs without deduplication; the number of partitions changes (they are concatenated).
mapValues: operates on the values of a key-value RDD without touching the keys; the number of partitions stays the same.
subtract: removes from one RDD the elements that also appear in another.
sample: samples an RDD, e.g. fraction = 0.5, seed = 9; the result is still an RDD.
takeSample: e.g. num = 1, seed = 9; the result is returned to the driver and is not an RDD.
groupBy: groups elements by key, so each key maps to an array of its values.
partitionBy: repartitions an RDD with a given partitioner.
cogroup: groups two key-value RDDs by key; each key maps to a tuple of value collections.
combineByKey: the general mechanism behind groupByKey; values are combined per key within partitions, and the number of partitions stays the same.
reduceByKey: combines the values of the same key within each partition and then across partitions.
join: inner join of two key-value RDDs on the key.
leftOuterJoin: keeps every key from the left RDD even when the right RDD has no match.
rightOuterJoin: keeps every key from the right RDD even when the left RDD has no match.
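A hedged Scala sketch exercising several of the operators above (local mode, small in-memory data, names chosen for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("transformations").setMaster("local[*]"))

val nums  = sc.parallelize(1 to 10, 2)
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

val doubled = nums.map(_ * 2)                                          // one output element per input element
val words   = sc.parallelize(Seq("hello world")).flatMap(_.split(" ")) // one input -> many outputs
val big     = nums.mapPartitions(iter => iter.filter(_ > 3))           // whole-partition processing
val chunks  = nums.glom()                                              // one Array per partition
val evens   = nums.filter(_ % 2 == 0)
val merged  = nums.union(evens)                                        // partitions concatenated, no dedup
val scaled  = pairs.mapValues(_ * 10)                                  // keys untouched
val summed  = pairs.reduceByKey(_ + _)                                 // shuffle: combine values per key
val grouped = pairs.groupByKey()
val sampled = nums.sample(withReplacement = false, fraction = 0.5, seed = 9)
val joined  = pairs.join(summed)                                       // inner join on key
```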
# Detailed explanation of cache and persist in Spark RDD
cache is implemented in terms of persist; both are lazy operations, while unpersist takes effect immediately.
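A short sketch of the caching behaviour, assuming a placeholder input path:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("cache-demo").setMaster("local[*]"))

val data = sc.textFile("hdfs:///tmp/input.txt").map(_.length)

data.cache()                                   // shorthand for persist(StorageLevel.MEMORY_ONLY); lazy
// data.persist(StorageLevel.MEMORY_AND_DISK)  // explicit storage level, also lazy

val count = data.count()                       // first action materialises and caches the partitions
val max   = data.max()                         // second action reuses the cached partitions

data.unpersist()                               // takes effect immediately, unlike the lazy cache/persist
```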
# Detailed explanation of action operators in Spark RDD
foreach: applies a function to every element on the executors; nothing is returned to the driver.
collect: returns all elements to the driver as an array (toArray).
collectAsMap: returns a key-value RDD as a HashMap on the driver; when a key repeats, later values overwrite earlier ones.
reduceByKeyLocally: reduce followed by collectAsMap, returning a map of key-value results to the driver.
lookup: returns the sequence of values for a given key; if the RDD has a partitioner it goes straight to the right partition, otherwise it falls back to a brute-force scan.
count: counts the elements across all partitions.
top: returns the largest n elements.
reduce: runs reduceLeft within each partition, then reduceLeft over the per-partition results.
fold: like reduce but with an explicit zero value.
aggregate: like fold but with separate functions for combining within a partition and across partitions.
saveAsTextFile: writes the RDD as text files, one per partition.
saveAsObjectFile: writes the RDD as serialized objects in a SequenceFile.
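The common actions in one hedged sketch (local mode; output paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("actions").setMaster("local[*]"))

val nums  = sc.parallelize(1 to 10, 2)
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

nums.foreach(x => println(x))                 // runs on the executors, returns nothing to the driver
val all    = nums.collect()                   // Array[Int] on the driver
val asMap  = pairs.collectAsMap()             // duplicate keys keep only the last value
val vals   = pairs.lookup("a")                // Seq of values for the key "a"
val n      = nums.count()
val top3   = nums.top(3)                      // largest 3 elements
val sum    = nums.reduce(_ + _)               // reduce within each partition, then across partitions
val folded = nums.fold(0)(_ + _)              // like reduce, with an explicit zero value
val agg    = nums.aggregate(0)(_ + _, _ + _)  // separate per-partition and cross-partition functions
nums.saveAsTextFile("hdfs:///tmp/out-text")   // placeholder output path
nums.saveAsObjectFile("hdfs:///tmp/out-obj")  // placeholder output path
```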
# Caching vs. checkpointing in RDDs: usage scenarios and working mechanisms
## Caching (persist)
A cached RDD is meant to be reused. For example, in a chain 1 -> 2 -> 3 -> 4 -> [5] that then branches into 6.1, 6.2 and 6.3, caching [5] avoids recomputing steps 1-4 for every branch.
However, the cached data at [5] (say 10,000 records) lives in memory and may be lost, in which case it must be recomputed from the lineage.
## Checkpointing (checkpoint): when to use it
Checkpoint when the data volume is large, when the computation crosses stage boundaries, or after a long, time-consuming calculation chain.
checkpoint changes the lineage of the RDD and is triggered after an action: a new job is launched after the action to write the checkpoint. Checkpointing is introduced to avoid the performance cost of recomputation when cached data is lost. Be sure to cache/persist an RDD before checkpointing it, so the checkpoint job does not have to recompute it, which makes it faster. Stream computing and graph computing make heavy use of checkpointing.
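A minimal sketch of the pattern described above, with placeholder checkpoint and input paths:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]"))
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // placeholder path on reliable storage

val expensive = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

expensive.cache()        // cache first so the checkpoint job can reuse the computed partitions
expensive.checkpoint()   // marks the RDD; the write happens in an extra job after the first action

expensive.count()        // triggers the job, then the checkpoint; the lineage is truncated afterwards
```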
# Narrow dependencies (NarrowDependency) and wide dependencies (ShuffleDependency) in RDDs
Narrow dependency: each partition is handled by one computing task, and the tasks are independent of each other.
In the source code both kinds are subclasses of Dependency.
Narrow dependencies let the scheduler pipeline operations within a stage.
Wide dependencies are the basis for dividing stages, and stages are the coarse-grained units that make up the DAG.
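A small sketch that builds one narrow and one wide dependency and prints the lineage; reduceByKey is where the stage boundary appears:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("deps-demo").setMaster("local[*]"))

val words  = sc.parallelize(Seq("a b", "b c", "c a")).flatMap(_.split(" "))
val narrow = words.map((_, 1))          // narrow dependency: each child partition reads one parent partition
val wide   = narrow.reduceByKey(_ + _)  // wide (shuffle) dependency: marks a stage boundary

// The lineage shows the ShuffledRDD introduced by reduceByKey.
println(wide.toDebugString)
```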
# The two types of Spark RDD tasks and how the iterator works
The tasks of the final stage are ResultTasks; the tasks of the upstream stages it depends on are ShuffleMapTasks. Both call the shared iterator method from inside runTask to start the computation.
A ShuffleMapTask writes its output into buckets for the next stage.
When computing a partition, the iterator first checks whether a cached copy exists, then whether a checkpoint exists, before recomputing.
These lookups go through SparkEnv, which holds runtime services such as the BlockManager.
# A source-level look at cache handling in the RDD iterator
# A source-level look at checkpoint handling in Spark RDD
# Spark RDD fault-tolerance principles and their four core points
# Core concepts and common terms in Spark applications
An application can contain several jobs.
# Overview of the Spark application job-scheduling process and the underlying running mechanism
# The two deploy modes of Spark applications: cluster and client
In cluster mode the driver runs on a worker node inside the cluster; in client mode the driver runs on the local submitting machine.
All scheduling is managed by the SchedulerBackend inside the driver.
An executor runs its tasks in parallel using multiple threads.
# DAGScheduler, TaskScheduler and SchedulerBackend explained
This concludes the study of "what are the ways Spark creates RDDs". I hope it has cleared up your doubts. Combining theory with practice is the best way to learn, so go and try it out! If you want to keep learning more, please continue to follow this site; the editor will keep working to bring you more practical articles.