2025-04-02 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/03 Report --
1. RDD fundamentals:
An RDD in Spark is an immutable, distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. An RDD can contain objects of any type, including user-defined classes.
As mentioned earlier, Spark provides transformation operations and action operations. Spark evaluates RDDs lazily: they are actually computed only the first time they are used in an action. By default, Spark recomputes an RDD each time you run an action on it. If you want to reuse the same RDD across multiple actions, call RDD.persist() to ask Spark to cache it (in memory or on disk).
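Java Streams follow a similar lazy model, which makes this behavior easy to demonstrate without a cluster. Below is a minimal plain-Java sketch (no Spark dependency; the class name LazyDemo and its structure are just for illustration) showing that building a pipeline runs nothing until a terminal operation, much as a transformation runs nothing until an action:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    public static int[] run() {
        AtomicInteger calls = new AtomicInteger();
        // Building the pipeline triggers no computation,
        // just as rdd.map(...) launches no Spark job.
        Stream<Integer> pipeline = List.of(1, 2, 3, 3).stream()
                .map(x -> { calls.incrementAndGet(); return x + 1; });
        int beforeAction = calls.get();   // still 0: the mapping has not run yet
        List<Integer> result =            // the terminal op plays the role of an action
                pipeline.collect(Collectors.toList());
        int afterAction = calls.get();    // now 4: once per element
        return new int[] { beforeAction, afterAction, result.size() };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints: 0 4 4
    }
}
```

One difference worth noting: a Java Stream can be consumed only once, whereas Spark will recompute an RDD for every action unless you persist() it.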
2. Create an RDD:
Spark provides two ways to create an RDD:
(1) Load an external dataset: the earlier sc.textFile() call is of this type. This is the more common approach.
(2) Parallelize a collection (List, Set, etc.) in the driver program, using the SparkContext.parallelize() method.
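Conceptually, parallelize() slices the driver-side collection into a number of partitions. The plain-Java sketch below illustrates the idea only; the slice() helper is hypothetical, not a Spark API, and real Spark additionally ships the slices to executors:

```java
import java.util.ArrayList;
import java.util.List;

public class ParallelizeSketch {
    // Hypothetical helper mimicking how a driver-side collection is cut
    // into roughly equal partitions; partition i covers the index range
    // [i*n/numSlices, (i+1)*n/numSlices).
    public static <T> List<List<T>> slice(List<T> data, int numSlices) {
        List<List<T>> partitions = new ArrayList<>();
        int n = data.size();
        for (int i = 0; i < numSlices; i++) {
            partitions.add(new ArrayList<>(
                    data.subList(i * n / numSlices, (i + 1) * n / numSlices)));
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Five elements over two slices: sizes 2 and 3.
        System.out.println(slice(List.of(1, 2, 3, 3, 4), 2)); // [[1, 2], [3, 3, 4]]
    }
}
```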
3. RDD operations:
RDDs come in several flavors: generic RDDs, numeric RDDs, and key-value pair RDDs. Some operations apply to RDDs of any type and are available directly on a JavaRDD object, e.g. map() and filter(). Some apply only to numeric RDDs and are exposed through a JavaDoubleRDD object, and some apply only to key-value pair RDDs and are exposed through a JavaPairRDD object.
3.1 Transformation operations:
3.1.1 Lineage graph:
A transformation derives a new RDD from an existing one, and Spark uses a lineage graph to record the dependencies between these RDDs, as shown in the following figure:
3.1.2 Common transformations:
Basic transformations (map, flatMap, filter, distinct, sample), assuming an RDD containing {1, 2, 3, 3}:
(1) map(): apply a function to each element and return an RDD of the results. Example: rdd.map(x => x + 1) → {2, 3, 4, 4}
(2) flatMap(): apply a function to each element and return an RDD of the contents of the returned iterators; often used to split input into words. Example: rdd.flatMap(x => x.to(3)) → {1, 2, 3, 2, 3, 3, 3}
(3) filter(): return an RDD consisting only of the elements that pass the predicate given to filter(). Example: rdd.filter(x => x != 1) → {2, 3, 3}
(4) distinct(): remove duplicates. Example: rdd.distinct() → {1, 2, 3}
(5) sample(withReplacement, fraction, [seed]): sample the RDD, with or without replacement. Example: rdd.sample(false, 0.5) → nondeterministic
Set operations on two RDDs containing {1, 2, 3} and {3, 4, 5}:
(1) union(): an RDD containing the elements of both RDDs, duplicates kept. Example: rdd.union(other) → {1, 2, 3, 3, 4, 5}
(2) intersection(): an RDD containing only the elements found in both RDDs. Example: rdd.intersection(other) → {3}
(3) subtract(): remove the contents of one RDD (e.g., remove training data). Example: rdd.subtract(other) → {1, 2}
(4) cartesian(): the Cartesian product with the other RDD. Example: rdd.cartesian(other) → {(1, 3), (1, 4), ... (3, 5)}
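The listed results can be reproduced locally. The following is a plain-Java sketch using java.util.stream (no Spark dependency) that mirrors the semantics of these transformations on the same data; note that intersection() deduplicates its result while union() does not:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class TransformDemo {
    // Data from the examples above: {1, 2, 3, 3} for the basic
    // transformations, {1, 2, 3} and {3, 4, 5} for the set operations.
    static final List<Integer> rdd = List.of(1, 2, 3, 3);
    static final List<Integer> a = List.of(1, 2, 3);
    static final List<Integer> b = List.of(3, 4, 5);

    public static List<Integer> map() {          // rdd.map(x => x + 1)
        return rdd.stream().map(x -> x + 1).collect(Collectors.toList());
    }
    public static List<Integer> flatMap() {      // rdd.flatMap(x => x.to(3))
        return rdd.stream()
                  .flatMap(x -> IntStream.rangeClosed(x, 3).boxed())
                  .collect(Collectors.toList());
    }
    public static List<Integer> filter() {       // rdd.filter(x => x != 1)
        return rdd.stream().filter(x -> x != 1).collect(Collectors.toList());
    }
    public static List<Integer> distinct() {     // rdd.distinct()
        return rdd.stream().distinct().collect(Collectors.toList());
    }
    public static List<Integer> union() {        // a.union(b): duplicates kept
        List<Integer> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }
    public static List<Integer> intersection() { // a.intersection(b): deduplicated
        return a.stream().distinct().filter(b::contains).collect(Collectors.toList());
    }
    public static List<Integer> subtract() {     // a.subtract(b)
        return a.stream().filter(x -> !b.contains(x)).collect(Collectors.toList());
    }
}
```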
4. Passing functions to Spark:
Most transformations, and some actions, require passing a function to a Spark method. In Java, such a function is an instance of a class that implements one of the interfaces in the org.apache.spark.api.java.function package. The package contains many interfaces; here are some of the basic ones:
For each interface, the method it requires and its typical usage:
(1) Function<T, R>, method R call(T): takes one input and returns one output; used with operations like map() and filter().
(2) Function2<T1, T2, R>, method R call(T1, T2): takes two inputs and returns one output; used with operations like aggregate() or fold().
(3) FlatMapFunction<T, R>, method Iterable<R> call(T): takes one input and returns zero or more outputs; used with operations like flatMap().
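A compilable sketch of this style follows. Since the real interfaces live in the spark-core dependency (and also declare throws Exception, omitted here for brevity), the code below uses local stand-in interfaces with the same call() shape; only the implementation style is the point:

```java
import java.util.List;

public class FunctionDemo {
    // Local stand-ins mirroring the shape of the interfaces in
    // org.apache.spark.api.java.function (assumed equivalents, not the real ones).
    interface Function<T, R> { R call(T t); }
    interface Function2<T1, T2, R> { R call(T1 t1, T2 t2); }
    interface FlatMapFunction<T, R> { Iterable<R> call(T t); }

    // The style used with rdd.map(...): an anonymous class (pre-Java-8 style).
    static final Function<Integer, Integer> addOne = new Function<Integer, Integer>() {
        public Integer call(Integer x) { return x + 1; }
    };

    // The style used with fold()/aggregate(): two inputs, one output;
    // since Java 8 a lambda works as well.
    static final Function2<Integer, Integer, Integer> sum = (x, y) -> x + y;

    // The style used with flatMap(): one input, zero or more outputs.
    static final FlatMapFunction<String, String> splitWords =
            line -> List.of(line.split(" "));

    public static void main(String[] args) {
        System.out.println(addOne.call(1));            // 2
        System.out.println(sum.call(1, 2));            // 3
        System.out.println(splitWords.call("a b c"));  // [a, b, c]
    }
}
```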