How to use Actions operator in spark RDD operator

2025-01-15 Update From: SLTechnology News&Howtos

This article shows how to use the Actions operators among the Spark RDD operators. The content is kept short and easy to follow, and works through each operator in turn.

Actions operator

In essence, every Actions operator submits a job by calling runJob through SparkContext, which triggers execution of the RDD DAG.
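
The relationship between recorded transformations and an action that triggers them can be sketched in plain Python. This is only a model under stated assumptions: LazyDataset, run_job and the list-of-lists partition layout are hypothetical names for illustration, not Spark classes or Spark's actual runJob.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD: records steps, runs them only on an action."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions   # list of lists, one inner list per partition
        self.steps = tuple(steps)      # recorded transformations (a linear "DAG")

    def map(self, f):
        # Transformation: only records f and does no work yet.
        return LazyDataset(self.partitions, self.steps + (f,))

    def run_job(self, per_partition):
        # Stand-in for job submission: evaluate each partition, then
        # apply per_partition to the evaluated items.
        results = []
        for part in self.partitions:
            items = part
            for f in self.steps:
                items = [f(x) for x in items]
            results.append(per_partition(items))
        return results

    def collect(self):
        # Action: calls run_job, which is what actually executes the steps.
        return [x for part in self.run_job(list) for x in part]


calls = []

def double(x):
    calls.append(x)        # record that the function really ran
    return 2 * x

ds = LazyDataset([[1, 2], [3]]).map(double)
calls_before_action = list(calls)   # nothing has run yet
result = ds.collect()               # the action triggers the work
print(calls_before_action, result)
```

In this model, map returns immediately and double is only called once collect runs.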

1. No output

(1) foreach (f)

The function f is applied to each element of the RDD. Instead of returning an RDD or an Array, foreach returns Unit.

Figure 3-25 shows the foreach operator applying a user-defined function to each data item. In this example, the user-defined function is println (), so all the data items are printed to the console.
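
A minimal plain-Python sketch of this behavior follows. It assumes the RDD is modeled as a list of partitions; the foreach helper here is a hypothetical stand-in, not the Spark method. Note that on a real cluster f runs on the executors, so anything f prints appears in executor output rather than necessarily on the driver console.

```python
def foreach(partitions, f):
    """Apply f to every element for its side effect; return nothing."""
    for part in partitions:
        for x in part:
            f(x)
    # No return statement: the result is None, the Python analogue of Unit.


partitions = [["a", "b"], ["c"]]   # hypothetical two-partition dataset
seen = []
result = foreach(partitions, seen.append)

foreach(partitions, print)         # prints a, b, c on separate lines
print(result)                      # None
```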

2. HDFS

(1) saveAsTextFile (path, compressionCodecClass=None)

The function writes the data out and stores it in the specified directory of HDFS.

Each element x in the RDD is mapped to (Null, x.toString) and then written to HDFS.

In figure 3-26, the box on the left represents the RDD partitions and the box on the right represents the Blocks of HDFS. Through the function, each partition of the RDD is stored as a Block in HDFS.
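
The per-partition output layout can be sketched with local files. This is a plain-Python model, not Spark: save_as_text_file is a hypothetical helper, it writes to a temporary local directory rather than HDFS, and it ignores compression. The part-00000 naming mirrors the one-file-per-partition layout that Spark produces.

```python
import os
import tempfile


def save_as_text_file(partitions, path):
    """Write each partition to its own part file, one str(x) per line."""
    os.makedirs(path)   # raises if the directory already exists
    for i, part in enumerate(partitions):
        with open(os.path.join(path, f"part-{i:05d}"), "w") as out:
            for x in part:
                out.write(str(x) + "\n")   # analogue of x.toString


out_dir = os.path.join(tempfile.mkdtemp(), "output")
save_as_text_file([[1, 2], [3], [4, 5]], out_dir)

written = sorted(os.listdir(out_dir))
with open(os.path.join(out_dir, "part-00000")) as f:
    first_part_text = f.read()
print(written)
```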

3. Scala collections and data types

(1) collect ()

collect returns the distributed RDD as a single-machine Scala Array. You can then apply Scala's functional operations to this array.

In figure 3-28, the left box represents the RDD partitions and the right box represents the array in single-machine memory. Through the function, the result is returned to the node where the Driver program runs and is stored as an array.
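
As a rough plain-Python model (the list-of-partitions layout and the collect helper are assumptions for illustration), collecting amounts to concatenating the partitions in order into one local list, after which ordinary local operations apply:

```python
def collect(partitions):
    """Concatenate all partitions, in partition order, into one local list."""
    return [x for part in partitions for x in part]


partitions = [[1, 2], [3, 4], [5]]   # hypothetical three-partition dataset
local = collect(partitions)

# Once the data is local, ordinary list operations can be used on it.
squares_of_odds = [x * x for x in local if x % 2 == 1]
print(local, squares_of_odds)
```

Because the whole dataset ends up in the driver's memory, this pattern only suits results that are known to be small.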

(2) collectAsMap ()

collectAsMap returns the data of a (K, V) RDD as a single-machine HashMap. If several elements share the same K, later elements overwrite earlier ones.

In figure 3-29, the left box represents the RDD partitions and the right box represents the single-machine array. The collectAsMap function returns the data to the Driver program, where the result is stored as a HashMap.
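
The overwrite rule for repeated keys can be illustrated with a plain dict. This is a sketch under the assumption that the RDD is modeled as a list of partitions of (key, value) tuples; collect_as_map is a hypothetical helper.

```python
def collect_as_map(partitions):
    """Gather (key, value) pairs into one dict; later pairs overwrite earlier ones."""
    result = {}
    for part in partitions:
        for k, v in part:
            result[k] = v
    return result


partitions = [[("a", 1), ("b", 2)], [("a", 3)]]   # key "a" appears twice
as_map = collect_as_map(partitions)
print(as_map)   # the value for "a" comes from the later pair
```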

(3) reduceByKeyLocally (func)

reduceByKeyLocally performs a reduce followed by a collectAsMap: it first reduces the values of the whole RDD by key, then collects all the results and returns them as a HashMap.
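
A plain-Python sketch of the reduce-then-collect idea, assuming the RDD is modeled as a list of partitions of (key, value) tuples. The helper name and the two-stage structure (reduce inside each partition, then merge the partial maps) are an illustration, not Spark's source code.

```python
from operator import add


def reduce_by_key_locally(partitions, func):
    """Reduce values per key inside each partition, then merge the partial maps."""

    def reduce_partition(pairs):
        acc = {}
        for k, v in pairs:
            acc[k] = func(acc[k], v) if k in acc else v
        return acc

    merged = {}
    for partial in map(reduce_partition, partitions):
        for k, v in partial.items():
            merged[k] = func(merged[k], v) if k in merged else v
    return merged


partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4), ("b", 5)]]
totals = reduce_by_key_locally(partitions, add)
print(totals)
```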

(4) lookup (key)

For an RDD of type (Key, Value), the lookup function returns a Seq of the values corresponding to the specified Key. The function is optimized as follows: if the RDD has a partitioner, only the partition that the key K maps to is processed, and the Seq of matching values is returned. If the RDD has no partitioner, a brute-force scan over all elements of the RDD is required to find the elements corresponding to the specified K.

In figure 3-30, the left box represents the RDD partitions and the right box represents the Seq. The final result is returned to the application on the node where the Driver runs.
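
The difference between the two paths can be sketched in plain Python. The helpers below are hypothetical: partition_by imitates a hash partitioner, and lookup additionally returns how many partitions it scanned, which the real method does not do; it is returned here only to make the optimization visible.

```python
def partition_by(pairs, num_partitions):
    """Place each (key, value) pair in the partition chosen by the key's hash."""
    parts = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        parts[hash(k) % num_partitions].append((k, v))
    return parts


def lookup(partitions, key, has_partitioner):
    """Return (values for key, number of partitions scanned)."""
    if has_partitioner:
        # Only the partition the key hashes to can contain it.
        candidates = [partitions[hash(key) % len(partitions)]]
    else:
        # No partitioner: every partition has to be scanned.
        candidates = partitions
    values = [v for part in candidates for k, v in part if k == key]
    return values, len(candidates)


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = partition_by(pairs, 3)

fast_values, fast_scanned = lookup(parts, "a", has_partitioner=True)
slow_values, slow_scanned = lookup(parts, "a", has_partitioner=False)
print(fast_values, fast_scanned, slow_values, slow_scanned)
```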

(5) count ()

count returns the number of elements in the entire RDD. Internally, the elements of each partition are counted and the per-partition counts are summed.

In figure 3-31, the number of data items returned is 5. Each square represents an RDD partition.
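
As a plain-Python model (the partition layout and the count helper are assumptions for illustration), counting can be expressed as one size per partition followed by a sum:

```python
def count(partitions):
    """Count the elements of each partition, then add the counts together."""
    per_partition = [sum(1 for _ in part) for part in partitions]
    return sum(per_partition)


partitions = [[1, 2], [3], [4, 5]]   # five elements spread over three partitions
total = count(partitions)
print(total)
```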

(6) top (num, key=None)

top returns the k largest elements, sorted in descending order. You can define how the elements are ordered: with an Ordering[T] in the Scala API, or with the key function in the signature shown above.

Similar functions are described below.

take returns the first k elements of the RDD, without sorting them.

takeOrdered returns the k smallest elements, kept in sorted order in the returned array.

first returns the first element of the RDD and corresponds to take(1).
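
The four functions can be compared side by side in a plain-Python sketch. It assumes the RDD is modeled as a list of partitions and uses hypothetical helpers built on heapq. In the Spark API, take returns the first k elements encountered without sorting, and first returns a single element.

```python
import heapq
from itertools import chain, islice


def top(partitions, k, key=None):
    """k largest elements, descending: pick k per partition, then merge."""
    per_part = [heapq.nlargest(k, p, key=key) for p in partitions]
    return heapq.nlargest(k, chain.from_iterable(per_part), key=key)


def take_ordered(partitions, k, key=None):
    """k smallest elements, ascending."""
    per_part = [heapq.nsmallest(k, p, key=key) for p in partitions]
    return heapq.nsmallest(k, chain.from_iterable(per_part), key=key)


def take(partitions, k):
    """First k elements in partition order, with no sorting."""
    return list(islice(chain.from_iterable(partitions), k))


def first(partitions):
    """First element only."""
    return take(partitions, 1)[0]


partitions = [[10, 4], [2, 12], [3]]
largest = top(partitions, 2)
smallest = take_ordered(partitions, 2)
leading = take(partitions, 2)
head = first(partitions)
by_negation = top(partitions, 2, key=lambda x: -x)   # custom ordering via key
print(largest, smallest, leading, head, by_negation)
```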

(7) reduce (f)

All the elements in the dataset are aggregated through the function f, which accepts two arguments and returns one. The function must be commutative and associative so that it can be computed correctly in parallel.

Example:

>>> from operator import add
>>> sc.parallelize([1, 2, 3, 4, 5]).reduce(add)
15
>>> sc.parallelize((2 for _ in range(10))).map(lambda x: 1).cache().reduce(add)
10
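
Why the function needs to be commutative and associative can be seen in a plain-Python model. The sketch assumes the RDD is a list of partitions and that each partition is reduced first, with the partial results combined afterwards; rdd_reduce is a hypothetical helper, not the Spark method.

```python
from functools import reduce as py_reduce
from operator import add, sub


def rdd_reduce(partitions, f):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [py_reduce(f, part) for part in partitions if part]
    return py_reduce(f, partials)


# The same five numbers, split into partitions in two different ways.
layout_a = [[1, 2, 3], [4, 5]]
layout_b = [[1, 2], [3, 4, 5]]

sum_a = rdd_reduce(layout_a, add)    # addition gives the same answer...
sum_b = rdd_reduce(layout_b, add)    # ...however the data is partitioned

diff_a = rdd_reduce(layout_a, sub)   # subtraction is neither commutative
diff_b = rdd_reduce(layout_b, sub)   # nor associative, so results differ
print(sum_a, sum_b, diff_a, diff_b)
```

With subtraction the result depends on how the data happens to be partitioned, which is exactly what the requirement on the function rules out.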

(8) fold (zeroValue, op)

fold works on the same principle as reduce. The difference is that each reduction starts from zeroValue: it is as if the first element taken from the iterator of each partition were zeroValue.
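
The role of zeroValue can be sketched in plain Python. The model assumes the RDD is a list of partitions; rdd_fold is a hypothetical helper that seeds every partition with the zero value and seeds the final merge with it once more, so a zero value that is not neutral for the operation is counted several times.

```python
from functools import reduce as py_reduce
from operator import add


def rdd_fold(partitions, zero_value, op):
    """Fold each partition starting from zero_value, then fold the partials."""
    partials = [py_reduce(op, part, zero_value) for part in partitions]
    return py_reduce(op, partials, zero_value)


partitions = [[1, 2, 3], [4, 5]]

neutral = rdd_fold(partitions, 0, add)       # 0 is neutral for addition
non_neutral = rdd_fold(partitions, 1, add)   # 1 is added once per partition
                                             # and once more in the merge
with_empty = rdd_fold([[1, 2], []], 0, add)  # empty partitions are fine
print(neutral, non_neutral, with_empty)
```

For this reason the zero value is normally chosen as the neutral element of the operation, such as 0 for addition.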

>>> from operator import add
>>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
15

That is all the content of this article on how to use the Actions operators among the Spark RDD operators. Thank you for reading! I hope it has given you a working understanding of these operators.
