2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article looks at a common Spark performance question: should you use foreachPartition or foreach? Many people are unsure which one to choose in day-to-day work, so the editor has gone through the available material and put together a simple, practical explanation. Hopefully it clears up the question.
First, let's compare how foreachPartition and foreach are implemented and see where they differ:
def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
Both methods take a function literal as their parameter. In foreach, the function literal takes a T, the element type of the RDD; in foreachPartition, it takes an Iterator[T], which represents the contents of one partition.
Internally the two are nearly identical. foreachPartition runs the supplied function directly on each partition's iterator; foreach hands the supplied function to the foreach method of each partition's iterator, so it runs once per element.
Spark performance tuning guides often say that replacing foreach with foreachPartition helps performance. How should we understand that advice? Take a look at the following code:
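To make the difference concrete, here is a minimal sketch in plain Scala that mimics the two implementations without Spark. MiniRdd, its private runJob, and CallCounts are hypothetical stand-ins invented for this illustration; they are not Spark's classes, and everything runs in a single thread.

```scala
// Plain-Scala stand-in for an RDD: just a list of partitions. Not Spark's API.
final class MiniRdd[T](partitions: Seq[Seq[T]]) {
  // Stand-in for sc.runJob: apply func to the iterator of every partition.
  private def runJob(func: Iterator[T] => Unit): Unit =
    partitions.foreach(p => func(p.iterator))

  // Same shape as the two Spark methods quoted above.
  def foreach(f: T => Unit): Unit = runJob(iter => iter.foreach(f))
  def foreachPartition(f: Iterator[T] => Unit): Unit = runJob(iter => f(iter))
}

object CallCounts {
  // Returns (calls made by foreach, calls made by foreachPartition).
  def counts(): (Int, Int) = {
    val rdd = new MiniRdd(Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6)))
    var perElement = 0
    var perPartition = 0
    rdd.foreach(_ => perElement += 1)
    rdd.foreachPartition { part =>
      perPartition += 1
      part.foreach(_ => ()) // consume the partition's elements
    }
    (perElement, perPartition)
  }
}
```

With six elements spread over three partitions, the function given to foreach is called six times and the one given to foreachPartition three times, which is why per-call setup cost matters so much more in foreach.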
rdd.foreach { x =>
  val dbClient = new DBClient
  dbClient.ins(x)
}
In the code above, a new DB client is created for every single record in the RDD, which is obviously extremely inefficient. The correct way to write it looks like this:
rdd.foreachPartition { part =>
  val dbClient = new DBClient
  part.foreach { x =>
    dbClient.ins(x)
  }
}
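The per-partition pattern can be pulled out into a function of type Iterator[String] => Unit, which is the shape foreachPartition expects. The sketch below is one possible way to do that, assuming a hypothetical Client trait with ins and close methods (the article's DBClient is not shown, so its real interface is unknown). It also closes the client in a finally block and skips empty partitions, which the original snippet does not do.

```scala
// Hypothetical client interface; it stands in for the article's DBClient.
trait Client {
  def ins(row: String): Unit
  def close(): Unit
}

object PartitionWriter {
  // One client per partition, closed even if an insert throws.
  def writePartition(part: Iterator[String], newClient: () => Client): Unit = {
    if (part.hasNext) { // empty partition: do not open a connection at all
      val client = newClient()
      try part.foreach(row => client.ins(row))
      finally client.close()
    }
  }
}

object PartitionWriterDemo {
  // Returns (clients opened, clients closed, rows written) for three fake partitions.
  def run(): (Int, Int, Int) = {
    var opened = 0
    var closed = 0
    var written = 0
    // In-memory fake client that only counts calls.
    val newClient: () => Client = () => {
      opened += 1
      new Client {
        def ins(row: String): Unit = written += 1
        def close(): Unit = closed += 1
      }
    }
    val partitions = Seq(Seq("a", "b", "c"), Seq.empty[String], Seq("d"))
    partitions.foreach(p => PartitionWriter.writePartition(p.iterator, newClient))
    (opened, closed, written)
  }
}
```

In a real job the call would look roughly like rdd.foreachPartition(part => PartitionWriter.writePartition(part, () => ...)), with the factory creating whatever client the job actually uses.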
So what is the advantage of writing it this way? We should start with Spark's core concepts. Spark is a distributed computing engine, the RDD is its basic distributed abstraction, and the partition is the key unit of parallelism. For example, if we build a Spark cluster of 3 nodes with 4 cores each, then for a large job we usually want 12 threads working on it at once. We can get there by building the RDD like this:
val rdd = sc.textFile("hdfs://master:9000/woozoom/mavlink1.log", 12)
Note that the second argument, 12, is the number of partitions requested for the RDD (for textFile this is a minimum, so Spark may create more). When rdd.foreachPartition is then called, the Spark cluster hands the 12 partitions to 12 threads for processing. Combined with the code above, a dbClient is built separately in each thread, so 12 DB clients are created in total.
Is there another possibility: build just one DB client and have all 12 threads use it for their database operations, like this:
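The "12 partitions, 12 threads, 12 clients" arithmetic can be checked with a small simulation. This is only a sketch using a plain JVM thread pool; it is not Spark's scheduler, and TwelvePartitions and simulate are names made up for the example. Each simulated partition task creates one client (counted, not actually connected) and then writes its rows.

```scala
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicInteger

object TwelvePartitions {
  // Returns (clients created, rows written) after processing every partition.
  def simulate(numPartitions: Int, rowsPerPartition: Int): (Int, Int) = {
    val clientsCreated = new AtomicInteger(0)
    val rowsWritten = new AtomicInteger(0)
    // One thread per partition, like 12 cores serving 12 partitions.
    val pool = Executors.newFixedThreadPool(numPartitions)
    try {
      val tasks = (0 until numPartitions).map { _ =>
        pool.submit(new Runnable {
          def run(): Unit = {
            clientsCreated.incrementAndGet() // "new DBClient" once per partition
            (0 until rowsPerPartition).foreach(_ => rowsWritten.incrementAndGet())
          }
        })
      }
      tasks.foreach(_.get()) // wait for all partitions to finish
    } finally pool.shutdown()
    (clientsCreated.get(), rowsWritten.get())
  }
}
```

Calling TwelvePartitions.simulate(12, 100) creates 12 clients for 1200 rows, whereas the per-record pattern shown earlier would create 1200.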
val dbClient = new DBClient
rdd.foreach { x =>
  dbClient.ins(x)
}
For this to work, two prerequisites must hold: 1. DBClient is thread-safe, and 2. DBClient implements Java's serialization interface (java.io.Serializable), because the closure that references dbClient is serialized and shipped to the executors. In many cases, for example when accessing HBase, these two conditions are not met.
That concludes this look at whether to use foreachPartition or foreach for Spark performance optimization. Hopefully it has answered your questions. Theory works best combined with practice, so go and try it out!
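The serialization prerequisite can be tested outside Spark with plain Java serialization. The helper below is a sketch; SerializationCheck, PlainClient, and SerializableClient are hypothetical names for this illustration, and Spark's own closure check may differ in detail. The idea is simply that an object captured by a closure has to survive ObjectOutputStream.writeObject.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializationCheck {
  // Hypothetical clients: one plain class, one marked as serializable.
  final class PlainClient { def ins(row: String): Unit = () }
  final class SerializableClient extends java.io.Serializable {
    def ins(row: String): Unit = ()
  }

  // True if obj can be written with standard Java serialization.
  def isJavaSerializable(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }
}
```

A client that holds sockets or connection pools usually fails this check, and in Spark that kind of failure typically surfaces as a "Task not serializable" error. Creating the client inside foreachPartition avoids the problem because the client never leaves the executor.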