What is the closure in Spark?


This article mainly explains what the closure in Spark is. Interested readers may wish to take a look; the approach introduced here is simple, fast, and practical. Now let the editor take you through what the closure in Spark is.

[Figure: the concept of a closure in Spark]

In Spark applications, the scope and lifecycle of variables and functions can be hard to understand in cluster mode, especially for beginners. RDD operations that modify variables outside their scope are a common source of confusion. The following example uses foreach() to modify a counter.

Examples

This example sums the elements of an RDD, and its output differs depending on whether the code executes in the same JVM as the driver. In local mode everything runs in the same JVM and the output is 15; in cluster mode, where executors run in different JVMs, the output is 0.

val data = Array(1, 2, 3, 4, 5)
var counter = 0
var rdd = sc.parallelize(data)

// Wrong: don't do this!!
rdd.foreach(x => counter += x)

println("Counter value: " + counter)

Local or cluster mode

The behavior of the code above is undefined and differs between modes. To execute a job, Spark breaks the processing of the RDD operations into tasks, each executed by an executor. Before execution, Spark computes the task's closure. The closure is the set of variables and methods that must be visible for the executor to perform its computation on the RDD (in this case, foreach()). The closure is serialized and sent to each executor.

The variables in the closure sent to each executor are copies, so when counter is referenced inside the foreach function, it is no longer the counter on the driver node. A counter still exists in the driver node's memory, but it is not visible to the executors! The executors see only the copies from the serialized closure. Therefore, the final value of counter is still zero, because every operation on counter refers to the value inside the serialized closure.

In local mode, in some cases, the foreach function actually executes in the same JVM as the driver, references the same original counter, and may actually update it.

To ensure well-defined behavior in these scenarios, use an Accumulator. Accumulators in Spark are designed precisely to provide a mechanism for safely updating a variable when execution is split across worker nodes in a cluster.
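The same counter computed with an accumulator, as a minimal sketch (assuming sc is the active SparkContext and the Spark 2.0+ LongAccumulator API):

val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)
// The accumulator is registered on the driver; tasks may only add to it
val counterAcc = sc.longAccumulator("counter")
rdd.foreach(x => counterAcc.add(x))
// The merged value is read back on the driver: prints 15 in both local and cluster mode
println("Counter value: " + counterAcc.value)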

In general, closures (constructs such as loops or locally defined methods) should not be used to mutate global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside the closure. Code that does this may work in local mode, but only by accident, and it will not behave as expected in distributed mode. If you need a global aggregation, use an accumulator instead.
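For a one-off aggregation like this sum, an action that returns its result to the driver is another option; a sketch under the same assumptions as above:

// reduce runs on the executors and returns the combined result to the driver
val sum = rdd.reduce(_ + _)
println("Sum: " + sum) // prints 15 in any mode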

Print elements of RDD

Another common idiom is attempting to print RDD elements with rdd.foreach(println) or rdd.map(println). On a single machine this produces the expected output and prints all of the RDD's elements. In cluster mode, however, the output is written to each executor's stdout, not to the driver's stdout, so nothing appears on the driver! To print all elements on the driver, you can first bring the RDD's data to the driver node with collect(): rdd.collect().foreach(println). This may cause the driver to run out of memory, though, because collect() pulls the entire RDD onto the driver; if you only need to print a few elements, it is safer to use take(): rdd.take(100).foreach(println).
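Put together as a short sketch (again assuming sc is the active SparkContext and rdd is any RDD):

rdd.foreach(println) // cluster mode: prints to each executor's stdout, not the driver's
rdd.collect().foreach(println) // pulls the whole RDD to the driver first; only safe for small data
rdd.take(100).foreach(println) // prints at most 100 elements; safer for large RDDs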

At this point, I believe you have a deeper understanding of what the closure in Spark is. You might as well try it out in practice, and follow us to keep learning more related content!
