

A brief introduction to Spark's transformation and action operators

2025-02-25 Update From: SLTechnology News&Howtos shulou


Shulou(Shulou.com)06/03 Report--

Transformation operators

map(func)

Returns a new distributed dataset formed by passing each element of the source through the function func

filter(func)

Returns a new dataset formed by selecting those elements of the source on which func returns true

flatMap(func)

Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a Seq rather than a single element)
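Since the signatures above are Spark's, a plain-Python sketch (a list standing in for an RDD's elements, no Spark required) can illustrate how the three operators differ:

```python
nums = [1, 2, 3, 4]

# map(func): exactly one output element per input element
squares = [x * x for x in nums]             # [1, 4, 9, 16]

# filter(func): keep only the elements where func returns true
evens = [x for x in nums if x % 2 == 0]     # [2, 4]

# flatMap(func): func returns a sequence; the sequences are concatenated,
# so one input can produce 0 or more outputs
signed = [y for x in nums for y in (x, -x)] # [1, -1, 2, -2, 3, -3, 4, -4]
```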

mapPartitions(func)

Similar to map, but runs on each partition of the RDD; when running on an RDD of type T, func must have type Iterator[T] => Iterator[U]
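The Iterator[T] => Iterator[U] shape can be mimicked in plain Python by modeling an RDD as a list of partitions, with func receiving a whole partition iterator rather than one element:

```python
# Two partitions of an imagined RDD
partitions = [[1, 2, 3], [4, 5]]

def sum_partition(it):
    # func sees the entire partition iterator and yields any number
    # of outputs; here, a single per-partition sum
    yield sum(it)

per_partition = [list(sum_partition(iter(p))) for p in partitions]
# per_partition == [[6], [9]]
```

This is why mapPartitions is handy for amortizing per-partition setup cost (e.g. opening one database connection per partition instead of per element).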

sample(withReplacement, fraction, seed)

Randomly samples a fraction `fraction` of the data, with or without replacement, using the given random seed
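A hedged plain-Python approximation of the semantics (the function name and exact sampling strategy are illustrative, not Spark's implementation):

```python
import random

def sample(data, with_replacement, fraction, seed):
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    if with_replacement:
        # roughly fraction * len(data) draws, duplicates allowed
        return [rng.choice(data) for _ in range(round(fraction * len(data)))]
    # each element kept independently with probability `fraction`,
    # so the result size is only *expected* to be fraction * len(data)
    return [x for x in data if rng.random() < fraction]

data = list(range(100))
s = sample(data, with_replacement=False, fraction=0.1, seed=42)
```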

pipe(command, [envVars])

Pipes each partition of the RDD through a shell command and returns the command's output as the resulting RDD.

union(otherDataset)

Returns a new dataset containing the union of the elements in the source dataset and the argument dataset

intersection(otherDataset)

Returns the intersection of the elements of the two RDDs

distinct([numTasks])

Returns a new dataset containing the distinct elements of the source dataset
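A plain-Python sketch of the three set-like operators (note that union, as described above, does not deduplicate; that is what distinct is for):

```python
a = [1, 2, 2, 3]
b = [3, 4]

union_ab = a + b                           # duplicates kept: [1, 2, 2, 3, 3, 4]
intersection_ab = sorted(set(a) & set(b))  # [3]
distinct_a = sorted(set(a))                # [1, 2, 3]
```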

groupByKey([numTasks])

Called on a dataset of (K, V) pairs and returns a dataset of (K, Seq[V]) pairs. By default, the parallelism of the output depends on the number of partitions of the parent RDD; if the goal is to aggregate over each key, reduceByKey or combineByKey will give better performance

reduceByKey(func, [numTasks])

Called on a dataset of (K, V) pairs and returns a dataset of (K, V) pairs in which the values for each key are aggregated using the given reduce function. The number of reduce tasks is configurable through the second optional parameter.
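The difference between the two can be sketched in plain Python: groupByKey materializes every value per key, while reduceByKey folds them down as it goes (which is why it shuffles less data in Spark):

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3)]

# groupByKey: (K, V) pairs -> (K, Seq[V]) pairs
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
# dict(groups) == {"a": [1, 3], "b": [2]}

# reduceByKey: values with the same key folded with the reduce function
sums = {}
for k, v in pairs:
    sums[k] = sums[k] + v if k in sums else v
# sums == {"a": 4, "b": 2}
```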

sortByKey([ascending], [numTasks])

Called on a dataset of (K, V) pairs and returns a dataset of (K, V) pairs sorted by key, in ascending or descending order as determined by the Boolean ascending parameter
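The same semantics in plain Python, with the Boolean parameter controlling the direction:

```python
pairs = [("b", 2), ("a", 1), ("c", 3)]

ascending = True
by_key = sorted(pairs, key=lambda kv: kv[0], reverse=not ascending)
# by_key == [("a", 1), ("b", 2), ("c", 3)]
```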

join(otherDataset, [numTasks])

Called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

cogroup(otherDataset, [numTasks])

Called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable[V], Iterable[W])) tuples
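A plain-Python sketch contrasting join and cogroup: join emits one (K, (V, W)) pair per matching combination, while cogroup emits one entry per key holding both value collections (possibly empty on one side):

```python
left  = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", "x"), ("c", "y")]

# join: inner join on key -> (K, (V, W)) pairs
joined = [(k, (v, w)) for k, v in left for k2, w in right if k == k2]
# joined == [("a", (1, "x")), ("a", (3, "x"))]

# cogroup: every key from either side -> (K, (values_from_left, values_from_right))
keys = {k for k, _ in left} | {k for k, _ in right}
cogrouped = {k: ([v for k2, v in left if k2 == k],
                 [w for k2, w in right if k2 == k]) for k in keys}
# e.g. cogrouped["c"] == ([], ["y"])
```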

cartesian(otherDataset)

Cartesian product: when called on datasets of types T and U, returns a dataset of (T, U) pairs, i.e. every combination of an element from each dataset.
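In plain Python the Cartesian product is a nested comprehension, which also makes the size blow-up obvious (|T| × |U| output pairs):

```python
t = [1, 2]
u = ["x", "y"]

cart = [(a, b) for a in t for b in u]
# cart == [(1, "x"), (1, "y"), (2, "x"), (2, "y")]
```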

coalesce(numPartitions)

Reduces the number of partitions in the RDD to the specified number; typically used after filtering down a large dataset

repartition(numPartitions)

Redistributes all records in the RDD evenly across numPartitions partitions
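A minimal sketch of the "divide records evenly" idea, assuming a simple round-robin placement (Spark actually shuffles; the placement strategy here is purely illustrative):

```python
def repartition(records, num_partitions):
    # round-robin the records into num_partitions roughly equal chunks
    parts = [[] for _ in range(num_partitions)]
    for i, r in enumerate(records):
        parts[i % num_partitions].append(r)
    return parts

parts = repartition(list(range(10)), 3)
# partition sizes differ by at most one: [4, 3, 3]
```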

Action operators

reduce(func)

Aggregates all elements in the dataset with the function func; func must be associative (and commutative) to ensure it can be computed correctly in parallel.
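A plain-Python sketch using a sequential fold; associativity is what lets Spark compute partial results per partition and combine them in any grouping:

```python
from functools import reduce as fold

nums = [1, 2, 3, 4]

add = lambda a, b: a + b
total = fold(add, nums)   # 10

# associativity: grouping does not change the result,
# so per-partition partial sums can be combined safely
assert add(add(1, 2), 3) == add(1, add(2, 3))
```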

collect()

Returns all elements of the dataset as an array to the driver program; usually used after a filter or other operation that leaves a sufficiently small subset of the data

count()

Returns the number of elements in the dataset

first()

Returns the first element of the dataset (similar to take(1))

take(n)

Returns an array of the first n elements of the dataset. Note that this operation is not performed in parallel; it runs on the machine where the driver program is located
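These four retrieval actions are straightforward; a plain-Python analogy (a list standing in for the RDD, the driver simply holding the result):

```python
data = [5, 1, 4, 2, 3]

collected = list(data)   # collect(): all elements back to the driver
n = len(data)            # count()
head = data[0]           # first(), i.e. take(1)[0]
first_three = data[:3]   # take(3): first n elements, not sorted
```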

takeSample(withReplacement, num, seed)

Returns an array of num elements sampled at random from the dataset, with or without replacement, using the given random number generator seed.
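A hedged plain-Python sketch (function name and strategy illustrative only); unlike sample, the result size here is an exact count rather than a fraction:

```python
import random

def take_sample(data, with_replacement, num, seed):
    rng = random.Random(seed)
    if with_replacement:
        return [rng.choice(data) for _ in range(num)]
    # without replacement, at most len(data) elements can be returned
    return rng.sample(data, min(num, len(data)))

s = take_sample(list(range(10)), with_replacement=False, num=3, seed=7)
```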

saveAsTextFile(path)

Saves the elements of the dataset as a text file to the local file system, HDFS, or any other file system supported by Hadoop; Spark calls the toString method of each element to convert it to a line of text in the file

takeOrdered(n, [ordering])

Returns the first n elements of the RDD using either their natural order or a custom ordering (a sorted limit(n))
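The "sorted limit(n)" idea maps directly onto a bounded selection in plain Python; `heapq.nsmallest` even mirrors the optional custom ordering via its `key` parameter:

```python
import heapq

data = [7, 3, 9, 1, 5]

smallest_three = heapq.nsmallest(3, data)                    # natural order
largest_three = heapq.nsmallest(3, data, key=lambda x: -x)   # custom ordering
# smallest_three == [1, 3, 5]; largest_three == [9, 7, 5]
```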

saveAsSequenceFile(path)

Saves the elements of the dataset in SequenceFile format to the specified directory on the local file system, HDFS, or any other file system supported by Hadoop. The RDD's elements must consist of key-value pairs that implement Hadoop's Writable interface or are implicitly convertible to Writable.

saveAsObjectFile(path)

Saves the elements of the dataset using Java serialization to a simple file, which can then be loaded with SparkContext.objectFile()

countByKey()

Only available on RDDs of type (K, V). Returns a map of (K, Int) pairs with the count of elements for each key.
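In plain Python the (K, Int) result is exactly what `collections.Counter` over the keys produces:

```python
from collections import Counter

pairs = [("a", 1), ("b", 2), ("a", 3)]

counts = Counter(k for k, _ in pairs)
# counts == {"a": 2, "b": 1}
```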

foreach(func)

Runs the function func on each element of the dataset; usually used for side effects such as updating an accumulator variable or interacting with an external storage system
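A plain-Python sketch of the side-effect pattern, with an ordinary variable standing in for a Spark accumulator (in real Spark code a closure mutating a driver-side variable would not work across executors; that is precisely what accumulators exist for):

```python
data = [1, 2, 3]
acc = 0  # stands in for a Spark accumulator

def bump(x):
    global acc
    acc += x

for x in data:   # foreach: run func purely for its side effects; no return value
    bump(x)
# acc == 6
```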


© 2024 shulou.com SLNews company. All rights reserved.
