Spark Accumulators and Broadcast Variables

1. Introduction
Spark provides two types of shared variables: accumulators and broadcast variables.
Accumulators: used to aggregate information across tasks, mainly in scenarios such as counting and summing.
Broadcast variables: used to efficiently distribute large read-only objects to the worker nodes.
2. Accumulators
Let's first look at a concrete scenario: an ordinary cumulative sum. If you run the following code in cluster mode, you will find that the result is not what you expect:
var counter = 0
val data = Array(1, 2, 3, 4, 5)
sc.parallelize(data).foreach(x => counter += x)
println(counter)
The final value of counter is 0, and the root cause of this problem is closures.
2.1 Understanding Closures
1. The concept of closure in Scala
Let's first introduce the concept of closures in Scala:
var more = 10
val addMore = (x: Int) => x + more
As shown above, the function addMore involves two variables, x and more:
x is a bound variable, because it is the function's parameter and is fully defined within the function itself.
more is a free variable, because the function literal does not itself give more any meaning.
By definition, if a function captures a free variable when it is created, the resulting function, which holds a reference to the captured variable, is called a closure. A short example of this capture follows.
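To make the capture concrete, here is a minimal sketch (runnable in the Scala REPL or spark-shell) showing that the closure reads the current value of the free variable more on every call:

var more = 10
val addMore = (x: Int) => x + more
println(addMore(5))   // 15: more is captured by the closure
more = 100
println(addMore(5))   // 105: the closure sees the updated free variable, not a snapshot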
2. Closures in Spark
You can also refer to https://blog.csdn.net/hu_lichao/article/details/112451982
During actual computation, Spark decomposes RDD operations into Tasks, which run on Worker nodes. Before execution, Spark computes each task's closure; if a free variable is referenced in the closure, the program makes a copy of it and places the copy in the closure, which is then serialized and sent to each executor. Therefore, the counter referenced inside the foreach function is no longer the counter on the Driver node but the copy inside the closure, and by default the updated value of that copy is never sent back to the Driver, so the final value of counter is still 0.
Note that in local mode, the Worker executing foreach may run in the same JVM as the Driver and reference the same original counter, so the update may appear correct; in cluster mode it never is. For problems like this, accumulators should be the first choice.
The principle behind accumulators is simple: the final value of each copy is sent back to the Driver, the Driver aggregates these values to obtain the final result, and the original variable is updated.
2.2 Using the Accumulator
All methods for creating accumulators are defined on SparkContext. Note that the untyped accumulator methods (shown with a strikethrough in the API documentation) have been deprecated since Spark 2.0.0; use longAccumulator, doubleAccumulator, or collectionAccumulator instead.
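For reference, a quick sketch of the API change (assuming Spark 2.x or later; the deprecated call is shown commented out for comparison):

// Deprecated since Spark 2.0.0, shown only for comparison:
// val oldAcc = sc.accumulator(0L, "old-style sum")
// Preferred typed replacements:
val longAcc = sc.longAccumulator("long sum")
val doubleAcc = sc.doubleAccumulator("double sum")
val listAcc = sc.collectionAccumulator[Int]("collected values")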
A usage example and its result:
val data = Array(1, 2, 3, 4, 5)
// define the accumulator
val accum = sc.longAccumulator("My Accumulator")
sc.parallelize(data).foreach(x => accum.add(x))
// get the accumulator's value
accum.value   // 15
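For aggregations beyond simple numeric sums, Spark 2.x also allows custom accumulators built by extending AccumulatorV2. The following is a minimal sketch (the class name SetAccumulator and the accumulator name are illustrative, not from the original article) that collects distinct values into a set:

import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// A custom accumulator that collects distinct Ints into a Set.
class SetAccumulator extends AccumulatorV2[Int, Set[Int]] {
  private val set = mutable.Set.empty[Int]
  override def isZero: Boolean = set.isEmpty
  override def copy(): AccumulatorV2[Int, Set[Int]] = {
    val acc = new SetAccumulator
    acc.set ++= set
    acc
  }
  override def reset(): Unit = set.clear()
  override def add(v: Int): Unit = set += v
  override def merge(other: AccumulatorV2[Int, Set[Int]]): Unit = set ++= other.value
  override def value: Set[Int] = set.toSet
}

val distinctAcc = new SetAccumulator
sc.register(distinctAcc, "distinct values")   // registering with a name also shows it on the Web UI
sc.parallelize(Seq(1, 2, 2, 3)).foreach(x => distinctAcc.add(x))
distinctAcc.value   // Set(1, 2, 3)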
3. Broadcast Variables

As described in the closure discussion above, each Task's closure holds its own copy of any free variable it references. If the variable is very large and there are many tasks, this inevitably puts pressure on network IO. To address this, Spark provides broadcast variables.
The idea behind broadcast variables is simple: instead of distributing a copy of the variable to every Task, Spark distributes one copy to each Executor, and all Tasks running in that Executor share it.
// define an array as a broadcast variable
val broadcastVar = sc.broadcast(Array(1, 2, 3, 4, 5))
// use the broadcast variable's value instead of the original array
sc.parallelize(broadcastVar.value).map(_ * 10).collect()
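A more typical use case, sketched below under the assumption of a small lookup table (the table and names are illustrative, not from the original article), is shipping reference data to each executor once instead of once per task:

// broadcast a small lookup table once per executor
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b", 3 -> "c"))
sc.parallelize(Seq(1, 2, 3, 4))
  .map(id => lookup.value.getOrElse(id, "unknown"))
  .collect()   // Array(a, b, c, unknown)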
4. Observing Variables

The value of a named accumulator can be observed on the Spark Web UI, on the detail pages of the stages that modify it, so it is worth giving accumulators a name when you create them.