How to pass a function to Spark

This article is about how to pass functions to Spark. The editor finds it quite practical and shares it here in the hope that you will get something out of it. Without further ado, read on.

Many people run into the Task not serializable problem when they start using Spark, mostly caused by referencing non-serializable objects inside RDD operators. Why must objects passed into operators be serializable? The answer starts with Spark itself. Spark is a distributed computing framework, and RDD (Resilient Distributed Dataset) is its abstraction over distributed data: the data actually lives on the nodes of the cluster, and the RDD abstraction makes users feel as if they were working with a local collection. At execution time, however, the operations inside an operator are shipped to the compute nodes (Executors) to run, which requires that the objects referenced in the operator be serializable.
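As a minimal sketch (the class, variable, and method names here are made up for illustration), referencing a non-serializable object inside an operator closure is exactly what triggers the error:

import org.apache.spark.SparkContext

// NOT serializable: the class does not extend Serializable
class Multiplier(val factor: Int)

def example(sc: SparkContext): Unit = {
  val m = new Multiplier(2)
  val rdd = sc.parallelize(1 to 10)
  // m is captured by the closure and must be shipped to the executors,
  // so this fails with org.apache.spark.SparkException: Task not serializable
  rdd.map(x => x * m.factor).collect()
}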

To a large extent, Spark's API relies on passing functions in the driver program to run on the cluster, and the key to writing Spark applications is using operators (transformations) to pass functions to Spark. There are two recommended ways to do this (from the official Spark programming guide):

The first: an anonymous function. When the processing logic is short, you can write an anonymous function directly inside the operator:

myRdd.map(x => x + 1)

The second: a static method in a global singleton object. First define a singleton object MyFunctions with a static method funcOne, then pass MyFunctions.funcOne to the RDD operator:

object MyFunctions {
  def funcOne(s: String): String = { ... }
}

myRdd.map(MyFunctions.funcOne)

In business development, you sometimes need to pass an RDD to a method of a class instance, where the function passed to the RDD operator is an instance method of that class:

class MyClass {
  def funcOne(s: String): String = { ... }
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(funcOne) }
}

In this example, we define a class MyClass whose instance method doStuff calls another instance method, funcOne, inside an RDD operator. When we create an instance of MyClass and call doStuff, the entire instance object has to be sent to the cluster, so MyClass must be serializable: it needs to extend Serializable.
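A minimal sketch of that fix (the method body here is a placeholder of my own): marking the class Serializable lets Spark ship the instance to the executors:

import org.apache.spark.rdd.RDD

class MyClass extends Serializable {
  def funcOne(s: String): String = s.trim   // placeholder implementation
  def doStuff(rdd: RDD[String]): RDD[String] = rdd.map(funcOne)
}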

Similarly, accessing a field of the enclosing object inside a method references the entire object, which then has to be sent to the cluster:

class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}

To avoid sending the whole object to the cluster, you can copy the external field into a local variable and reference that instead. Especially with large objects, this prevents shipping the entire object and improves efficiency:

def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}

Spark applications ultimately run on a cluster, and many problems never surface in a single-machine local environment. We often find that local results differ from cluster results, which calls for a functional programming style: make your functions pure wherever possible. Pure functions are stateless and thread-safe, need no thread synchronization, and the application or runtime environment (Runtime) can cache their results to speed up execution.

So what is a pure function?

A pure function (Pure Function) is one whose input and output data flows are all explicit. Explicit means the function exchanges data with the outside world through exactly one channel: its parameters and return value. All input the function receives from outside arrives through its parameters, and all output it produces leaves through its return value. If a function obtains data from, or emits data to, the outside world implicitly, it is not a pure function but an impure function (Impure Function). Implicit means the function exchanges data with the outside world through channels other than parameters and return values: reading or modifying global variables, for example, or using the I/O API (the input/output system library) to read configuration files, write to files, or print to the screen.
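A small Scala sketch of the distinction:

var counter = 0   // global mutable state

// Impure: reads and modifies state outside its parameters and return value
def impureNext(x: Int): Int = { counter += 1; x + counter }

// Pure: the output depends only on the inputs, with no side effects
def pureAdd(x: Int, y: Int): Int = x + y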

When object interaction is involved in a computation, prefer stateless objects: for a bean, for example, make all member variables val, and construct a new instance wherever new data is needed, as sketched below.
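A sketch of such a stateless bean (the names are illustrative):

// All members are immutable vals; the object carries no mutable state
case class UserStat(name: String, score: Int)

val stat = UserStat("alice", 1)
// Where updated data is needed, build a new instance instead of mutating
val updated = stat.copy(score = stat.score + 1)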

A note on the commutative and associative laws. The merge functions passed to reduce, reduceByKey, and other merging or aggregating operations must satisfy both the commutative law and the associative law, familiar from mathematics:

a + b = b + a
(a + b) + c = a + (b + c)

That is, for a user-defined merge function f, f(a, b) and f(b, a) should produce the same result, and f(f(a, b), c) and f(a, f(b, c)) should produce the same result.
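As a sketch, assuming sc is the SparkContext: addition satisfies both laws, while subtraction satisfies neither, so the second reduce below can return different results depending on partitioning and merge order:

val nums = sc.parallelize(Seq(1, 2, 3, 4))

nums.reduce(_ + _)   // deterministic: always 10
nums.reduce(_ - _)   // nondeterministic: depends on how partitions are merged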

Finally, a word on broadcast variables and accumulators. Do not define global variables in your program: if a piece of data needs to be shared across nodes, use a broadcast variable; if you need a global aggregate computation, use an accumulator.
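A minimal sketch, assuming sc is the SparkContext and words is an RDD[String]:

// Broadcast variable: ship a read-only lookup table to each executor once
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val codes = words.map(w => lookup.value.getOrElse(w, 0))

// Accumulator: tasks add to it on the executors; the driver reads the total
val empties = sc.longAccumulator("empty lines")
words.foreach(w => if (w.isEmpty) empties.add(1))
println(empties.value)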

The above is how to pass functions to Spark. The editor believes some of these points may come up in your daily work; hopefully you have learned something from this article.
