Basic RDD Transformation Operations
1.map()
Syntax: RDD.map(<function>, preservesPartitioning=False)
The transformation map() is the most basic of all transformations. It evaluates a named or anonymous function against every element in the dataset. map() can execute asynchronously and makes no attempt to communicate or synchronize with other map() operations; in other words, it is a shared-nothing operation.
The optional parameter preservesPartitioning is a Boolean that applies to partitioned RDDs, that is, RDDs with defined keys grouped by the hash value or range of the key. If this parameter is set to True, those partitions are kept intact, which the Spark scheduler can use to optimize subsequent operations such as joins based on the partition key (a hedged sketch of this follows the map() example below).
The transformation map() applies the same function to every input record and produces one transformed output record per input record.
# map()
map_rdd = sc.textFile('file:///usr/local/spark/README.md')
print(map_rdd.take(5))
map_rdd_new = map_rdd.map(lambda x: x.split(' '))
print(map_rdd_new.take(5))
# output
['# Apache Spark', '', 'Spark is a fast and general cluster computing system for Big Data. It provides', 'high-level APIs in Scala, Java, Python, and R, and an optimized engine that', 'supports general computation graphs for data analysis. It also supports a']
[['#', 'Apache', 'Spark'], [''], ['Spark', 'is', 'a', 'fast', 'and', 'general', 'cluster', 'computing', 'system', 'for', 'Big', 'Data.', 'It', 'provides'], ['high-level', 'APIs', 'in', 'Scala,', 'Java,', 'Python,', 'and', 'R,', 'and', 'an', 'optimized', 'engine', 'that'], ['supports', 'general', 'computation', 'graphs', 'for', 'data', 'analysis.', 'It', 'also', 'supports', 'a']]
In this example, the split function takes a string and produces a list, so each string element in the input data is mapped to a list element in the output. The result is a list of lists.
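To make preservesPartitioning concrete, here is a minimal, hypothetical sketch that is not from the original example: it assumes a small pair RDD partitioned by key with partitionBy(), and a map() that changes only the values, so declaring the partitioning preserved is safe for later key-based operations such as joins.
# hypothetical sketch: keep key-based partitioning across a value-only map()
pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3)]).partitionBy(2)
scaled = pairs.map(lambda kv: (kv[0], kv[1] * 10), preservesPartitioning=True)
print(scaled.collect())
# e.g. [('b', 20), ('a', 10), ('a', 30)]; element order depends on the partitioning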
2.flatMap()
Syntax: RDD.flatMap(<function>, preservesPartitioning=False)
The transformation flatMap() is similar to map() in that it applies a function to every record of the input dataset. However, flatMap() also "flattens" the output, which means it removes one level of nesting. For example, given a list that contains lists of strings, flattening produces a single list of strings; that is, all the nested lists are merged into one.
# flatMap()
flat_map_rdd = sc.textFile('file:///usr/local/spark/README.md')
print(flat_map_rdd.take(5))
map_rdd_new = flat_map_rdd.flatMap(lambda x: x.split(' '))
print(map_rdd_new.take(5))
# output
['# Apache Spark', '', 'Spark is a fast and general cluster computing system for Big Data. It provides', 'high-level APIs in Scala, Java, Python, and R, and an optimized engine that', 'supports general computation graphs for data analysis. It also supports a']
['#', 'Apache', 'Spark', '', 'Spark']
In this example, flatMap() uses the same anonymous function as the map() example. Note that each input string does not produce its own list object; instead, all the elements are flattened into a single list. In other words, flatMap() produces one combined list as output rather than the list of lists that map() produced.
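To make the difference explicit, here is a small illustrative sketch with made-up input (not from the article): map() keeps one output element per input element, producing nested lists, while flatMap() removes that level of nesting.
# illustrative comparison of map() and flatMap() on the same data
lines = sc.parallelize(['a b', 'c d e'])
print(lines.map(lambda x: x.split(' ')).collect())      # [['a', 'b'], ['c', 'd', 'e']]
print(lines.flatMap(lambda x: x.split(' ')).collect())  # ['a', 'b', 'c', 'd', 'e']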
3.filter()
Syntax: RDD.filter(<function>)
The transformation filter() takes an expression that returns a Boolean value, usually expressed as an anonymous function, and evaluates it against each element in the dataset. The returned Boolean determines whether the record is included in the resulting output RDD. This is a common transformation used to remove unwanted records from an intermediate RDD, or records that should not appear in the final output.
# filter()
licenses = sc.textFile('file:///usr/local/spark/README.md')
words = licenses.flatMap(lambda x: x.split(' '))
print(words.take(5))
lowercase = words.map(lambda x: x.lower())
print(lowercase.take(5))
longwords = lowercase.filter(lambda x: len(x) > 12)
print(longwords.take(5))
# output
['#', 'Apache', 'Spark', '', 'Spark']
['#', 'apache', 'spark', '', 'spark']
['<http://spark.apache.org/>', 'documentation', 'documentation,', 'page](http://spark.apache.org/documentation.html).', 'instructions.']
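As another hedged illustration with assumed data (not part of the original example), filter() is often used to drop records you do not want to carry forward, such as empty strings or markup tokens:
# assumed data: keep only non-empty words that are not Markdown heading markers
words = sc.parallelize(['#', 'apache', 'spark', '', 'spark'])
cleaned = words.filter(lambda x: len(x) > 0 and not x.startswith('#'))
print(cleaned.collect())  # ['apache', 'spark', 'spark']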
4.distinct()
Syntax: RDD.distinct(numPartitions=None)
The transformation distinct() returns a new RDD containing only the deduplicated elements of the input RDD; it can be used to remove duplicate values. The parameter numPartitions can repartition the data into the given number of partitions. If this parameter is not provided, or the default value is used, the number of partitions returned by distinct() matches the number of partitions of the input RDD.
# distinct()
licenses = sc.textFile('file:///usr/local/spark/README.md')
words = licenses.flatMap(lambda x: x.split(' '))
lowercase = words.map(lambda x: x.lower())
allwords = lowercase.count()
diswords = lowercase.distinct().count()
print("Total words: {}, Distinct words: {}".format(allwords, diswords))
# output
Total words: 579, Distinct words: 276
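The numPartitions behaviour can be checked with a small sketch; the data and partition counts below are illustrative assumptions, not from the article.
# assumed data: distinct() keeps the input partitioning unless numPartitions is given
nums = sc.parallelize([1, 1, 2, 3, 3, 3], 4)
print(nums.distinct().getNumPartitions())   # typically 4, matching the input RDD
print(nums.distinct(2).getNumPartitions())  # 2, as requested
print(sorted(nums.distinct().collect()))    # [1, 2, 3]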
5.groupBy()
Syntax: RDD.groupBy(<function>, numPartitions=None)
The transformation groupBy() returns an RDD of elements grouped by a specified function. The function can be named or anonymous; it determines the key by which all elements are grouped, or it can be an expression evaluated against each element to decide which group the element belongs to. The parameter numPartitions creates the specified number of partitions by hashing the key space produced by the grouping function. Note that the grouped values returned by groupBy() are iterable objects.
# groupBy()
licenses = sc.textFile('file:///usr/local/spark/README.md')
words = licenses.flatMap(lambda x: x.split(' ')).filter(lambda x: len(x) > 0)
groupbyfirstletter = words.groupBy(lambda x: x[0].lower())
print(groupbyfirstletter.take(1))
# output (a list of one (key, ResultIterable) pair; the key and object address vary by run)
[('...', <pyspark.resultiterable.ResultIterable object at 0x...>)]
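Because the grouped values are iterable objects rather than plain lists, a common way to inspect them is to convert each group to a list first. The following is a hedged sketch with made-up data, not from the original example; mapValues() is a standard pair-RDD operation.
# assumed data: materialize each group as a list so it can be printed
words = sc.parallelize(['spark', 'is', 'fast', 'and', 'simple'])
grouped = words.groupBy(lambda x: x[0])
print(grouped.mapValues(list).collect())
# e.g. [('s', ['spark', 'simple']), ('i', ['is']), ('f', ['fast']), ('a', ['and'])]; order may vary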
6.sortBy()
Syntax: RDD.sortBy(<keyfunc>, ascending=True, numPartitions=None)
The transformation sortBy() sorts the RDD by the key that the <keyfunc> parameter selects from the dataset, sorting according to the type of the key object. The parameter ascending is a Boolean, defaulting to True, that specifies the sort order; to sort in descending order, set ascending=False.
# sortBy()
readme = sc.textFile('file:///usr/local/spark/README.md')
words = readme.flatMap(lambda x: x.split(' ')).filter(lambda x: len(x) > 0)
sortbyfirstletter = words.sortBy(lambda x: x[0].lower(), ascending=False)
print(sortbyfirstletter.take(5))
# output
['You', 'you', 'You', 'you', 'you']
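As a further hedged illustration with assumed data (not from the article), the key function passed to sortBy() can select any part of a record, such as one field of a tuple, and ascending controls the direction:
# assumed data: sort pair records by value, then by key in descending order
pairs = sc.parallelize([('apple', 2), ('banana', 1), ('cherry', 3)])
print(pairs.sortBy(lambda kv: kv[1]).collect())                   # [('banana', 1), ('apple', 2), ('cherry', 3)]
print(pairs.sortBy(lambda kv: kv[0], ascending=False).collect())  # [('cherry', 3), ('banana', 1), ('apple', 2)]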
Basic RDD Action Operations
Actions in Spark either return a value (such as count()), return data (such as collect()), or save data externally (such as saveAsTextFile()). In all cases, an action forces computation of the RDD and all of its parent RDDs. Some actions return a count, an aggregation of the data, or all or part of the data in the RDD. Unlike these, the action foreach() executes a function against each element in the RDD.
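Since foreach() is mentioned above but not demonstrated below, here is a minimal sketch under assumed data and an assumed accumulator name: foreach() runs a function on every element on the executors and returns nothing to the driver, so side effects are usually gathered through an accumulator.
# assumed example: count non-empty words via an accumulator, using foreach()
acc = sc.accumulator(0)
def count_nonempty(word):
    # executed once per element on the executors; nothing is returned to the driver
    if len(word) > 0:
        acc.add(1)
words = sc.textFile('file:///usr/local/spark/README.md').flatMap(lambda x: x.split(' '))
words.foreach(count_nonempty)
print(acc.value)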
1.count()
Syntax: RDD.count()
The action count() takes no arguments and returns a long integer value representing the number of elements in the RDD.
# count()
licenses = sc.textFile('file:///usr/local/spark/licenses')
words = licenses.flatMap(lambda x: x.split(' '))
print(words.count())
# output
22997
Note that for actions that take no arguments, you still need the empty parentheses () after the action name.
2.collect()
Syntax: RDD.collect()
The action collect() returns a list of all the elements in the RDD to the Spark driver process. collect() does not limit the size of its output, which can result in a very large amount of data being returned; it is generally only used with small RDDs or during development.
# collect()
licenses = sc.textFile('file:///usr/local/spark/licenses')
words = licenses.flatMap(lambda x: x.split(' '))
print(words.collect())
# output (a very long list of every word; omitted here)
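As a closing, hedged sketch with assumed data (not from the article), take(n) is usually a safer choice than collect() while developing, because only a small sample is returned to the driver:
# assumed data: prefer take(n) over collect() for large RDDs
nums = sc.parallelize(range(1000000))
print(nums.take(5))        # [0, 1, 2, 3, 4]; only five elements come back to the driver
# nums.collect() would return all 1,000,000 elements to the driver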