
What does the closure in Spark mean?



This article introduces what closures mean in Spark. Many people run into this problem in real-world cases, so let the editor walk you through how to handle these situations. I hope you read it carefully and get something out of it!

In Spark code, the scope and life cycle of variables and functions can be hard to understand in cluster mode, especially for beginners. The closure problem discussed here arises when an RDD operator manipulates variables declared outside its own scope.

Closure variables in Spark generally refer to variables that are declared outside an operator's scope but are read and modified inside the operator.

Here is a code example to help you better understand the closure problem. Suppose you want to compute the sum of the five numbers (1, 2, 3, 4, 5) in Spark, starting from an initial value of 0:

package com.hadoop.ljs.spark220.study.closePackage;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

import java.util.Arrays;
import java.util.List;

/**
 * @author: Created By lujisen
 * @company: ChinaUnicom Software JiNan
 * @date: 2020-02-18 20:08
 * @version: v1.0
 * @description: com.hadoop.ljs.spark220.study.closePackage
 */
public class SparkClosePackage {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("SparkClosePackage").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        List<Integer> numList2 = Arrays.asList(1, 2, 3, 4, 5);
        // Closure variable: declared on the driver, mutated inside the operator below.
        final int[] sum = {0};

        JavaRDD<Integer> soureData = sc.parallelize(numList2);
        soureData.foreach(new VoidFunction<Integer>() {
            @Override
            public void call(Integer value) throws Exception {
                sum[0] += value;
            }
        });

        System.out.println("sum result" + sum[0]);
        sc.close();
    }
}

The output of the program is not what you might expect: sum is printed as 0, not 15. Why?

This involves the scope of an RDD operator. For each RDD operator, the scope is only the code inside the operator itself, yet the code above modifies the variable sum declared outside that scope. The syntax of most programming languages allows this, and the phenomenon is called a closure. Put simply, a closure is code that operates on variables that do not belong to its own scope.

In production, we usually submit Spark jobs to a cluster for execution. Whether in standalone/yarn client mode or standalone/yarn cluster mode, the job is split into tasks that are sent in batches to the Executors on the Worker nodes. Each batch of tasks executes the same code but processes different data. Closure variables must be processed on the driver side before the tasks execute: they are serialized into multiple copies, and one copy is sent to each Executor process for its tasks to use.
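As a side note (this sketch is my own addition, not from the original article), the serialization step is also why everything an operator captures has to be serializable: if the closure references a non-serializable object, the job aborts with a "Task not serializable" error before any task runs.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.Arrays;

public class ClosureSerializationDemo {
    // A plain helper class that does NOT implement java.io.Serializable.
    static class NotSerializable {
        int offset = 10;
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ClosureSerializationDemo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        NotSerializable helper = new NotSerializable();

        // The lambda captures 'helper', so Spark must serialize it along with the task;
        // since it is not serializable, the job fails with "Task not serializable"
        // instead of running.
        sc.parallelize(Arrays.asList(1, 2, 3)).map(x -> x + helper.offset).collect();

        sc.close();
    }
}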

This may be hard to follow in words alone, so let me walk through it with the help of a diagram:

Here the input data is (1, 2, 3, 4, 5) and the closure variable is sum = 0. We want to accumulate the total into sum through the foreach operator, so we package the project and submit it to the cluster. A driver process is launched to run our main function; it serializes the sum variable and sends a copy to each of the two Executors. When execution reaches the foreach operator, tasks are dispatched in batches to the assigned Executors, and each Executor holds its own copy of sum. After the computation, each Executor has only its own partial result: one gets 6 and the other gets 9. When you finally print sum on the driver, the driver knows nothing about the operations the Executors performed on their copies, so it still prints 0.

So, to sum up: when you run a job in cluster mode, do not modify a closure variable declared outside the operator from inside the operator. It makes no sense, because the operator only changes the copy of the variable in the Executor process; it has no effect on the variable on the driver side, and we cannot retrieve the values of the copies from the Executors.
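If all you need is the aggregated value back on the driver, a more natural approach (a minimal sketch of my own, reusing the soureData RDD built in the example above; it is not shown in the original article) is to let an action return the result instead of mutating a captured variable:

// Let the action itself bring the result back to the driver.
// 'soureData' is the JavaRDD<Integer> built earlier with sc.parallelize(numList2).
Integer total = soureData.reduce(Integer::sum);  // partial sums from each task are merged on the driver
System.out.println("sum result" + total);        // prints "sum result15"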

If, in cluster mode, you want to make distributed, parallel, and global changes to a variable on the driver, you can use the global accumulator (Accumulator) provided by Spark. Later we will explain an advanced usage of Accumulator: customizing an Accumulator to implement a global counter for your own mechanisms and algorithms.
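Before getting to custom accumulators, here is a minimal sketch (my own, assuming Spark 2.x, where SparkContext.longAccumulator is available) of how the built-in LongAccumulator computes the sum from the earlier example correctly:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.LongAccumulator;

import java.util.Arrays;

public class SparkAccumulatorSum {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkAccumulatorSum").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Register a LongAccumulator with the driver; tasks can only add to it,
        // and Spark merges the per-task copies back into the driver-side value.
        LongAccumulator sum = sc.sc().longAccumulator("sum");

        sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
          .foreach(x -> sum.add(x));

        // After the action completes, the merged value is visible on the driver.
        System.out.println("sum result" + sum.value());  // prints "sum result15"
        sc.close();
    }
}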

This is the end of "what is the meaning of closures in Spark". Thank you for reading. If you want to learn more about the industry, you can follow the site; the editor will keep publishing more high-quality practical articles for you!
