Spark's core abstraction for data is the Resilient Distributed Dataset (RDD). An RDD is a distributed collection of elements whose data can be operated on in parallel across the machine nodes of a cluster.
In Spark, all work on data comes down to three things: creating RDDs, transforming existing RDDs, and calling actions on RDDs to compute results. Spark automatically distributes the data in an RDD across the cluster and executes the operations in parallel.
Five characteristics
1. A list of partitions
An RDD is composed of multiple partitions (each a contiguous piece of data on one node). When data is loaded into an RDD, data locality is generally followed: typically one HDFS block is loaded as one partition.
2. A function for computing each split
Every partition of an RDD has a compute function applied to it, which implements the transformation from the partitions of one RDD to the partitions of the next.
3. A list of dependencies on other RDDs
An RDD records its dependencies for fault tolerance (alongside cache and checkpoint): when an in-memory RDD is lost or an operation fails, the affected data can be recomputed from these dependencies.
4. Optionally, a Partitioner for key-value RDDs
If the data stored in the RDD is in key-value form, a custom Partitioner can be passed in to repartition it. For example, with a Partitioner that partitions by key, records with the same key in different RDDs end up in the same partition.
5. Optionally, a list of preferred locations to compute each split on
The preferred locations for computing each split, i.e., data locality.
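As a quick illustration of characteristics 1 and 4, here is a minimal Scala sketch (assuming a running SparkContext named sc, as in the examples that follow; the data itself is arbitrary) that inspects an RDD's partitions and applies a hash Partitioner:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 4)
println(pairs.partitions.length)   // 4: the list of partitions (characteristic 1)
println(pairs.partitioner)         // None: no partitioner yet

// Repartition by key: records with the same key land in the same partition
val byKey = pairs.partitionBy(new HashPartitioner(2))
println(byKey.partitioner)         // Some(HashPartitioner): characteristic 4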
Create RDD
Spark provides two ways to create an RDD: reading an external data source, or parallelizing a collection in the driver program.
Parallelized collections
A collection is parallelized with the parallelize() method of SparkContext.
The second parameter of parallelize() specifies the number of partitions. Spark creates one task per partition; a common rule of thumb is 2-4 partitions per CPU in the cluster. Spark sets the number of partitions automatically based on the cluster, but it can also be assigned manually through this second parameter.
Scala:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

Java:
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);

Python:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
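To make the partition count explicit, pass the second parameter described above (a sketch in Scala; 10 is an arbitrary choice):

val distData = sc.parallelize(data, 10)   // request 10 partitions
println(distData.partitions.length)       // 10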
Note: outside development and testing this approach is rarely used, because it requires the entire dataset to first fit in the memory of a single machine.
Reading external data sources
Spark can create distributed datasets from a variety of Hadoop-supported data sources, including the local file system, HDFS, Cassandra, HBase, Amazon S3, and so on.
Spark supports multiple storage formats, including text files, SequenceFiles, and other Hadoop storage formats.
Scala:
scala> val distFile = sc.textFile("data.txt")
distFile: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at <console>:26

Java:
JavaRDD<String> distFile = sc.textFile("data.txt");

Python:
>>> distFile = sc.textFile("data.txt")
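textFile() also accepts directories, wildcards, and an optional minimum number of partitions (a sketch; the paths are placeholders):

val withMin = sc.textFile("data.txt", 8)      // at least 8 partitions
val matched = sc.textFile("/data/logs/*.gz")  // directories and wildcards work too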
RDD operations
An RDD supports two kinds of operations: transformations and actions.
Transformations
A transformation returns a new RDD. Transformations are lazily evaluated: the computation takes place only when an action uses an RDD produced by the transformation.
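A minimal sketch of this laziness (reusing the data.txt of the earlier examples; the "ERROR" filter is an arbitrary illustration):

val lines  = sc.textFile("data.txt")            // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))  // still no computation
val n      = errors.count()                     // action: only now is the file read and filtered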
Spark uses lineage to record the dependencies between the RDDs produced by transformations. Dependencies are divided into narrow dependencies and wide dependencies.
Narrow dependency
Each partition of the child RDD depends on a constant number of parent partitions:
- input and output are one to one and the partition structure is unchanged: mainly map, flatMap
- input and output are one to one but the partition structure changes: such as union, coalesce
- operators that select some elements from the input: such as filter, distinct, subtract, sample
Wide dependency
Each partition of the child RDD depends on all partitions of the parent RDD (see the sketch below):
- reorganizing and reducing a single RDD by key, such as groupByKey, reduceByKey
- merging and reorganizing two RDDs by key, such as join
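A small Scala sketch contrasting the two kinds of dependency (the word-count data is arbitrary); toDebugString prints the lineage, and the shuffle introduced by the wide dependency shows up as a stage boundary:

val words  = sc.parallelize(Seq("a", "b", "a"))
val pairs  = words.map(w => (w, 1))    // narrow dependency: one-to-one map
val counts = pairs.reduceByKey(_ + _)  // wide dependency: shuffle by key
println(counts.toDebugString)          // the ShuffledRDD marks the stage boundary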
Actions
An action returns a result to the driver program or writes it to an external system, and triggers the actual computation.
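A few common actions (a sketch; the output path is a placeholder):

val nums = sc.parallelize(1 to 5)
println(nums.count())                  // 5
println(nums.reduce(_ + _))            // 15
println(nums.collect().mkString(","))  // pulls all elements back to the driver
nums.saveAsTextFile("output-dir")      // placeholder path: writes to an external system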
Caching
An RDD can cache earlier computation results through the persist() or cache() method. The cache is not populated immediately when these methods are called; when a subsequent action is triggered, the RDD is cached in the memory of the compute nodes and reused from then on.
cache() ultimately calls persist(), and the default storage level stores a single copy in memory only.
There are many other storage levels for Spark, which are defined in object StorageLevel.
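A sketch of the caching API (the RDD itself is arbitrary; StorageLevel.MEMORY_AND_DISK is one of the levels defined in object StorageLevel):

import org.apache.spark.storage.StorageLevel

val cached = sc.textFile("data.txt").map(_.toUpperCase)
cached.persist(StorageLevel.MEMORY_AND_DISK)  // keep in memory, spill to disk if it does not fit
cached.count()                                // the first action materializes the cache
cached.count()                                // now served from the cache
cached.unpersist()                            // release the cached copies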
The cache may be lost, and RDD's cache fault-tolerance mechanism guarantees that the computation still runs correctly when that happens: the lost data is recomputed through the chain of transformations that produced the RDD. Because each partition of an RDD is independent, only the missing partitions need to be recomputed, not all of them.
Fault-tolerant mechanism
Lineage mechanism
An RDD's lineage records the coarse-grained transformations applied to the data. When some partitions of an RDD are lost, lineage is used to recompute and recover just those partitions. This coarse-grained data model limits the scenarios where Spark applies, so Spark is not suitable for every workload with high performance demands, but compared with a fine-grained data model it also brings a performance gain.
The Spark lineage mechanism is implemented through the dependencies between RDDs.
With a narrow dependency, a data block of the child RDD can be computed directly on one compute node from the corresponding data block of the parent RDD.
With a wide dependency, by contrast, the child RDD can only be computed after all the data of the parent RDD has been computed, and the parent's results must be hashed and transmitted to the corresponding nodes. Recovering a wide dependency means recomputing all data blocks of the ancestor RDDs, so when the lineage chain grows long, and especially when it contains wide dependencies, data checkpoints should be set at appropriate points.
Checkpoint mechanism
Brief introduction
Checkpointing is performed after the job triggered by an RDD action finishes; when a task later fails, the data is read back from the checkpoint for the recomputation.
There are two checkpoint implementations: if no checkpoint path is set in the code, the local checkpoint mode is used; if a path is set, the reliable checkpoint mode is used.
LocalRDDCheckpointData: stored temporarily on the disk and memory of the local executor. This implementation is fast and suits scenarios where lineage information needs to be truncated frequently (such as GraphX), but it cannot survive the loss of an executor.
ReliableRDDCheckpointData: stored in external reliable storage (such as HDFS), so it can tolerate even the driver going down. Although not as fast as local checkpointing, it offers the best level of fault tolerance.
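A sketch of both APIs (the HDFS path is a placeholder):

// Reliable checkpoint: requires a checkpoint directory on reliable storage
sc.setCheckpointDir("hdfs://namenode:8020/checkpoints")  // placeholder path
val nums = sc.parallelize(1 to 100).map(_ * 2)
nums.checkpoint()  // marked; actually written after the job of the next action finishes
nums.count()       // triggers the job and then the checkpoint

// Local checkpoint: fast and truncates lineage, but lost if an executor dies
val local = sc.parallelize(1 to 100).localCheckpoint()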