What is the difference between foreachRDD, foreachPartition and foreach in Spark

2025-03-13 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article explains the difference between foreachRDD, foreachPartition and foreach in Spark. Many people are unsure how these three operators differ, so the following walks through each of them with simple, practical examples. I hope it helps resolve your doubts — please follow along and study!

Difference

Recently, many students have asked me about the differences between foreachRDD, foreachPartition and foreach in Spark; at work they are often misused, or people simply do not know which one to use. Today we will briefly go over the differences between them:

In fact, distinguishing them is simple: their scopes are different. foreachRDD acts on each RDD in the DStream (one RDD per batch interval); foreachPartition acts on each partition of each of those RDDs; and foreach acts on each element of each of those RDDs.

The description of foreachRDD in Spark Streaming can be found in the official design-patterns guide: http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd

Both foreach and foreachPartition operate on the iterator of each partition. The difference is that foreach applies the passed-in function to every element drawn from the iterator, so the function only ever sees one element at a time, while foreachPartition hands the entire iterator to the passed-in function and lets the function consume it itself (which makes it possible to set up expensive resources once per partition instead of once per element, and helps avoid unnecessary memory pressure).
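The difference in how often the user function is invoked can be seen without a Spark cluster. The sketch below is a plain-Scala analogy, assuming an "RDD" modeled as a `Seq` of partitions; it is not Spark's actual implementation, only an illustration of the per-element vs per-partition calling convention:

```scala
// Plain-Scala analogy (no Spark required): model an RDD's partitions as a
// Seq of Seq and count how many times the user-supplied function is invoked.
object ForeachDemo {
  // foreach-style: the supplied function is applied to every element
  def perElementCalls(partitions: Seq[Seq[Int]]): Int = {
    var calls = 0
    partitions.foreach(_.foreach(_ => calls += 1))
    calls
  }

  // foreachPartition-style: the supplied function receives the whole
  // iterator and is invoked only once per partition
  def perPartitionCalls(partitions: Seq[Seq[Int]]): Int = {
    var calls = 0
    partitions.foreach { part =>
      calls += 1
      part.iterator.foreach(_ => ()) // the function drains the iterator itself
    }
    calls
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1, 2, 3), Seq(4, 5)) // 2 partitions, 5 elements
    println(perElementCalls(partitions))   // 5 — once per element
    println(perPartitionCalls(partitions)) // 2 — once per partition
  }
}
```

This is exactly why per-partition setup (connections, buffers) belongs in foreachPartition rather than foreach.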

A simple example.

On the Spark official website, foreachRDD is listed under Output Operations on DStreams, so the first thing to make clear is that it is an output operator. The official docs explain its meaning as follows: "The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs."

It is the most commonly used output operation. It takes a function as a parameter, and that function is applied to every RDD in the DStream; the function pushes the data in each RDD to an external system, such as files or a database, and the function itself executes on the driver.

The function usually needs to contain an RDD action: although foreachRDD itself is an output operation, the function it takes runs on the driver, and the transformations on the RDDs inside it are still lazy, so without an action (such as foreach, count or saveAsTextFile) nothing is actually computed.
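Spark's laziness can be mimicked in plain Scala with a collection view. The sketch below is only an analogy, assuming nothing about Spark internals: the `map` on a view, like an RDD transformation inside foreachRDD, records the work but does not execute it until something (the "action") forces the result:

```scala
// Plain-Scala analogy of lazy evaluation (no Spark required): map on a view
// does not run until forced, just as rdd.map(...) inside foreachRDD computes
// nothing until an action such as count() or foreach() is called.
object LazinessDemo {
  var mapCalls = 0

  def main(args: Array[String]): Unit = {
    mapCalls = 0
    val doubled = Seq(1, 2, 3).view.map { x => mapCalls += 1; x * 2 }
    println(mapCalls)        // 0 — the "transformation" has not run yet
    val forced = doubled.toList // the "action" forces the computation
    println(mapCalls)        // 3 — now the function ran once per element
    println(forced)          // List(2, 4, 6)
  }
}
```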

The official website also gives common mistakes made by developers:

"Often, writing data to external systems requires creating a connection object (e.g. TCP connection to a remote server) and using it to send data to a remote system. For this purpose, a developer may inadvertently try creating a connection object at the Spark driver, and then try to use it in a Spark worker to save records in the RDDs." For example:

```scala
// ① Wrong ❌ — the connection is created on the driver but used on the workers
dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record)               // executed at the worker
  }
}
```

As noted above, when we use foreachRDD to write data to an external system we usually create a connection object, and creating it on the driver as in the code above is wrong: the connection is created in the driver process, but rdd.foreach runs on the worker nodes, so the connection object would have to be serialized and shipped from the driver to the workers — connection objects are rarely serializable, and a connection opened on one machine is not usable on another. Remember that there is only one driver node but multiple worker nodes.
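The serialization failure can be demonstrated without Spark. The sketch below uses plain Java serialization; `Connection` here is a hypothetical stand-in for a real connection object, which, like most real ones, does not implement `Serializable`:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a driver-side connection object.
// Note that it does NOT extend Serializable — real connections usually don't.
class Connection {
  def send(record: String): Unit = ()
}

object SerializationDemo {
  // Attempt plain Java serialization, as Spark would when shipping a closure
  def trySerialize(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(trySerialize("a plain string")) // true — strings serialize fine
    println(trySerialize(new Connection))   // false — the connection does not
  }
}
```

This is the same NotSerializableException Spark reports when a closure captures a driver-side connection.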

So, let's change it to this:

```scala
// ② Create the connection inside foreach — one connection per element in the RDD
dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()  // executed at the worker
    connection.send(record)                 // executed at the worker
    connection.close()
  }
}
```

Now every compute node has its own connection object. However, written this way a new connection is created for every single element in the RDD; creating and closing connections this frequently imposes unnecessary overhead on the system.
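The cost difference is easy to quantify. The sketch below (no Spark required) counts connection creations for an "RDD" with 2 partitions of 1000 records each; `createNewConnection` is a hypothetical stand-in that only increments a counter:

```scala
// Count how many connections pattern ② (per element) vs pattern ③
// (per partition) would create. createNewConnection is a hypothetical stub.
object ConnectionCountDemo {
  var created = 0
  def createNewConnection(): Unit = created += 1

  // pattern ②: a connection for every record
  def countPerElement(partitions: Seq[Seq[String]]): Int = {
    created = 0
    partitions.foreach(_.foreach(_ => createNewConnection()))
    created
  }

  // pattern ③: a connection for every partition
  def countPerPartition(partitions: Seq[Seq[String]]): Int = {
    created = 0
    partitions.foreach(_ => createNewConnection())
    created
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq.fill(2)(Seq.fill(1000)("record"))
    println(countPerElement(partitions))   // 2000 — one connection per record
    println(countPerPartition(partitions)) // 2 — one connection per partition
  }
}
```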

You can solve this problem by using foreachPartition:

```scala
// ③ Use foreachPartition — only one connection per partition of the RDD
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}
```

The approach above can be optimized further: although far fewer connections are created, a new connection is still opened for every partition and cannot be reused across partitions or batches. We can therefore introduce a static connection pool. As the official docs note, the connection pool must be static and lazily initialized.

```scala
// ④ Use a static connection pool to increase reuse and reduce the
// creation and closing of connections
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
```

Note that the connections in the pool should be created on demand, and should time out if unused for a while; this achieves the most efficient way of sending data to external systems.
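For reference, here is a minimal sketch of the kind of `ConnectionPool` object the snippet above assumes. It is hypothetical: a real pool (e.g. HikariCP for JDBC) would add sizing limits, connection validation and idle timeouts, which this sketch omits:

```scala
import scala.collection.mutable

// Hedged sketch of a static, lazily initialized connection pool.
// Connection is a hypothetical stand-in for a real connection type.
object ConnectionPool {
  final class Connection {
    def send(record: String): Unit = ()
  }

  // Lazily initialized: created on first use, once per JVM (i.e. per executor)
  private lazy val pool: mutable.Queue[Connection] = mutable.Queue.empty[Connection]

  def getConnection(): Connection = synchronized {
    if (pool.isEmpty) new Connection // create on demand
    else pool.dequeue()              // reuse an idle connection
  }

  def returnConnection(c: Connection): Unit = synchronized {
    pool.enqueue(c)                  // keep it for future reuse
  }
}
```

Because the pool lives in a static object, it is initialized independently on each executor, which is exactly why the official docs require it to be static and lazy.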

At this point, the study of "what is the difference between foreachRDD, foreachPartition and foreach in Spark" is complete. I hope it has resolved your doubts — pairing theory with practice is the best way to learn, so go and try it out!
