
What are the ways to create an RDD?


This article explains the ways to create an RDD. The content is simple and clear, and easy to learn and understand. Follow along with the editor to study the ways to create an RDD.

1. Create an RDD from a collection

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Test").setMaster("local")

val sc = new SparkContext(conf)

// both methods take an optional second parameter, the number of partitions (numSlices), which defaults to spark.default.parallelism

// makeRDD creates an RDD from a Scala collection; its underlying implementation calls parallelize
val rdd1 = sc.makeRDD(Array(1, 2, 3, 4, 5, 6))

// parallelize also creates an RDD from a Scala collection
val rdd2 = sc.parallelize(Array(1, 2, 3, 4, 5, 6))
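To see the partition parameter in action, here is a minimal sketch, assuming the sc defined above:

// explicitly request 4 partitions instead of the default
val rdd4 = sc.makeRDD(Array(1, 2, 3, 4, 5, 6), 4)
println(rdd4.getNumPartitions) // prints 4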

2. Create an RDD from external storage

// create an RDD from a file in external storage
val rdd3 = sc.textFile("hdfs://hadoop01:8020/word.txt")
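To illustrate what can be done with such an RDD, here is a small word-count sketch, assuming word.txt exists at the HDFS path above:

val wordCounts = rdd3
  .flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))             // pair each word with a count of 1
  .reduceByKey(_ + _)                 // sum the counts for each word
wordCounts.collect().foreach(println)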

RDD programming API

RDD supports two kinds of operations: transformations and actions. A transformation returns a new RDD, such as map() and filter(); an action returns a result to the driver program or writes it to an external system, such as count() and first().

Spark uses a lazy evaluation model: an RDD is computed only the first time it is used in an action, which allows Spark to optimize the whole computation. By default, an RDD is recomputed every time you run an action on it. To reuse the same RDD across multiple actions, call RDD.persist() and Spark will cache it.
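A small sketch of lazy evaluation and caching, assuming the sc defined earlier (the data is arbitrary):

val data = sc.parallelize(1 to 100)
val squares = data.map(x => x * x)  // a transformation: nothing is computed yet
squares.persist()                   // mark the RDD for caching; takes effect on first computation
println(squares.count())            // first action: triggers computation and caches the result
println(squares.first())            // second action: reads from the cache instead of recomputing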

Transformation operators

All transformations on an RDD are lazy: they do not compute their results immediately. Instead, they just remember the transformations applied to the base dataset (for example, a file). The transformations are actually run only when an action requires a result to be returned to the driver. This design lets Spark run more efficiently.

map(func): Returns a new RDD formed by passing each input element through the function func.

filter(func): Returns a new RDD formed by selecting the input elements on which func returns true.

flatMap(func): Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element).

mapPartitions(func): Similar to map, but runs independently on each partition of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U].

mapPartitionsWithIndex(func): Similar to mapPartitions, but func also takes an integer parameter representing the index of the partition, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U].

sample(withReplacement, fraction, seed): Samples a fraction of the data, with or without replacement according to withReplacement; seed specifies the random number generator seed.

union(otherDataset): Returns a new RDD containing the union of the source RDD and the argument RDD.

intersection(otherDataset): Returns a new RDD containing the intersection of the source RDD and the argument RDD.

distinct([numTasks]): Removes duplicates from the source RDD and returns a new RDD.

groupByKey([numTasks]): Called on an RDD of (K, V) pairs, returns an RDD of (K, Iterable[V]) pairs.

reduceByKey(func, [numTasks]): Called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs in which the values of each key are aggregated with the given reduce function. As with groupByKey, the number of reduce tasks can be set through the optional second parameter.

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): Aggregates the values of each key using a neutral initial value zeroValue, which also determines the result type and participates in the computation; seqOp merges values within a partition, and combOp merges results across partitions.

sortByKey([ascending], [numTasks]): Called on an RDD of (K, V) pairs where K implements the Ordered interface, returns an RDD of (K, V) pairs sorted by key.

sortBy(func, [ascending], [numTasks]): Similar to sortByKey, but more flexible (sorts by the value computed by func).

join(otherDataset, [numTasks]): Called on RDDs of types (K, V) and (K, W), returns an RDD of (K, (V, W)) pairs containing all pairs of elements for each key.

cogroup(otherDataset, [numTasks]): Called on RDDs of types (K, V) and (K, W), returns an RDD of type (K, (Iterable[V], Iterable[W])).

cartesian(otherDataset): Returns the Cartesian product of the two RDDs.

pipe(command, [envVars]): Pipes each partition of the RDD through a shell command and returns the output as a new RDD.

coalesce(numPartitions): Reduces the number of partitions in the RDD to numPartitions; by default it avoids a full shuffle.

repartition(numPartitions): Reshuffles the data in the RDD to create exactly numPartitions partitions; this always involves a full shuffle.

repartitionAndSortWithinPartitions(partitioner): Repartitions the RDD according to the given partitioner and sorts records by key within each partition.
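A few of these operators in a minimal sketch, assuming the sc defined earlier (the sample data is made up for illustration):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))

val grouped = pairs.groupByKey()       // ("a", Iterable(1, 3)), ("b", Iterable(2))
val summed = pairs.reduceByKey(_ + _)  // ("a", 4), ("b", 2)
val agg = pairs.aggregateByKey(0)(_ + _, _ + _) // same result as reduceByKey here
val joined = pairs.join(other)         // ("a", (1, "x")), ("a", (3, "x")), ("b", (2, "y"))
val sorted = pairs.sortByKey()         // keys in ascending order

// a partition-aware transformation: tag each element with its partition index
val indexed = sc.parallelize(1 to 6, 2)
  .mapPartitionsWithIndex((idx, it) => it.map(x => (idx, x)))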

Action operators

Actions run a computation on the RDD and return a result to the driver program or write it to a file system.

reduce(func): Aggregates all elements of the RDD using the function func, which must be commutative and associative so that it can be computed correctly in parallel.

collect(): Returns all elements of the dataset as an array in the driver program.

count(): Returns the number of elements in the RDD.

first(): Returns the first element of the RDD (similar to take(1)).

take(n): Returns an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed]): Returns an array of num elements randomly sampled from the dataset, with or without replacement; seed specifies the random number generator seed.

takeOrdered(n, [ordering]): Similar to top, except that the elements are returned in the opposite order (ascending rather than descending).

saveAsTextFile(path): Saves the elements of the dataset as text files to HDFS or another supported file system; Spark calls toString on each element to convert it to a line of text in the file.

saveAsSequenceFile(path): Saves the elements of the dataset as a Hadoop SequenceFile in the given directory, on HDFS or any other Hadoop-supported file system.

saveAsObjectFile(path): Writes the elements of the dataset using Java serialization; they can later be loaded with SparkContext.objectFile().

countByKey(): For an RDD of type (K, V), returns a (K, Int) map giving the number of elements for each key.

foreach(func): Runs the function func on each element of the dataset.
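A short sketch of the common actions, again assuming the sc defined earlier (the output path is hypothetical):

val nums = sc.parallelize(Seq(5, 3, 1, 4, 2))
println(nums.reduce(_ + _))                 // 15
println(nums.count())                       // 5
println(nums.first())                       // 5
println(nums.take(3).mkString(", "))        // 5, 3, 1
println(nums.takeOrdered(3).mkString(", ")) // 1, 2, 3 (ascending)
println(nums.top(3).mkString(", "))         // 5, 4, 3 (descending)

val kv = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
println(kv.countByKey())                    // Map(a -> 2, b -> 1)

// the output path below is hypothetical and must not already exist:
// nums.saveAsTextFile("hdfs://hadoop01:8020/output/nums")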


Thank you for reading. That covers "What are the ways to create an RDD?". After studying this article, I believe you have a deeper understanding of the ways to create an RDD; the specifics still need to be verified in practice.
