Basic Introduction and Operation Tuning of Spark
This article covers the basics of Spark along with practical job tuning. These situations come up often in real-world workloads; I hope the walkthrough below helps you handle them.
Basics of Spark
Before we discuss Spark tuning, let's review a few core concepts.
Action
An action is an RDD operation that returns a non-RDD result. Common actions in Spark include reduce, collect, count, first, take, takeSample, countByKey, and saveAsTextFile.
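As a minimal spark-shell sketch (assuming a live SparkContext named sc, as in the snippet later in this article): transformations are lazy, and no job runs until an action is invoked.

val nums = sc.parallelize(1 to 100)    // RDD[Int]
val squares = nums.map(n => n * n)     // transformation: nothing runs yet
val total = squares.count()            // action: triggers a job, returns 100
val firstFive = squares.take(5)        // action: Array(1, 4, 9, 16, 25)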
Job
Each Spark action is turned into one job.
Stage
A job is divided into groups of tasks, and each group is called a stage. A stage boundary is marked by one of two task types:
ShuffleMapTask: everything up to a wide transformation; you can simply think of it as the work before a shuffle.
ResultTask: you can simply think of it as an operation, such as take(), that produces the final result.
Partition
An RDD consists of a fixed number of partitions, and each partition contains some number of records.
For an RDD returned by a narrow transformation (such as map or filter), each record in a partition is computed solely from records in the corresponding partition of the parent RDD; consequently, a narrow transformation does not change the number of partitions, as sketched below.
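For instance (a minimal spark-shell sketch):

val rdd = sc.parallelize(1 to 1000, 8)   // 8 partitions
val narrowed = rdd.filter(_ % 2 == 0).map(_ * 10)
println(rdd.getNumPartitions)            // 8
println(narrowed.getNumPartitions)       // still 8: filter and map are narrow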
Task
A task is the unit of work sent to an executor for execution; within a stage, each task processes the data of exactly one partition.
Operation tuning
Adjusting the number of partitions in a stage often has a large effect on execution efficiency; the sketch below shows the two basic knobs.
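A minimal sketch (the input path is hypothetical):

val data = sc.textFile("hdfs:///path/to/input")  // hypothetical path
val more = data.repartition(200)   // full shuffle; can raise or lower the partition count
val fewer = more.coalesce(50)      // avoids a shuffle when only lowering the count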
For associative, reductive operations, use reduceByKey rather than groupByKey: groupByKey shuffles all of the data, while reduceByKey shuffles only the per-partition reduced results, as sketched below.
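A word-count sketch of the difference:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// Preferred: combines within each partition first, so only partial sums are shuffled.
val counts = pairs.reduceByKey(_ + _)

// Avoid for associative reductions: ships every record across the network, then sums.
val countsSlow = pairs.groupByKey().mapValues(_.sum)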
Use aggregateByKey instead of reduceByKey when the input and output types differ.
From the aggregateByKey API documentation: "Aggregate the values of each key, using given combine functions and a neutral 'zero value'. This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U's, as in scala.TraversableOnce. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U."
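A minimal sketch: a per-key average, where the values are of type Int (V) but the aggregate is a (sum, count) pair (U):

val scores = sc.parallelize(Seq(("a", 3), ("a", 5), ("b", 4)))

val sumCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),    // merge a V into a U, within a partition
  (a, b) => (a._1 + b._1, a._2 + b._2)     // merge two U's, across partitions
)
val avg = sumCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }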
Avoid the flatMap-join-groupBy pattern; use cogroup instead, as sketched below.
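A sketch of cogroup, which groups both sides by key in a single shuffle and yields the grouped values directly, rather than materializing every joined pair first:

val left = sc.parallelize(Seq(("k1", 1), ("k1", 2), ("k2", 3)))
val right = sc.parallelize(Seq(("k1", "x"), ("k2", "y")))

// RDD[(String, (Iterable[Int], Iterable[String]))]
val grouped = left.cogroup(right)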
When joining the results of two reduceByKey operations, if both sides have the same partitioning, Spark performs the join without a shuffle, as sketched below.
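A sketch, where both sides are reduced with the same HashPartitioner so that matching keys are already co-located:

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(100)
val a = sc.parallelize(Seq(("k", 1))).reduceByKey(part, _ + _)
val b = sc.parallelize(Seq(("k", 2))).reduceByKey(part, _ + _)
val joined = a.join(b)   // reuses part; no additional shuffle of a or b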
When one of the datasets in a join is small enough to fit in memory, consider broadcasting it instead of performing a full join. The basic broadcast-variable API looks like this:
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
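Building on that, a sketch of a map-side (broadcast) join; the small table here is a hypothetical in-memory Map:

val small = Map("k1" -> "x", "k2" -> "y")   // small enough to fit in memory
val smallBc = sc.broadcast(small)

val large = sc.parallelize(Seq(("k1", 1), ("k2", 2), ("k3", 3)))
val joined = large.flatMap { case (k, v) =>
  smallBc.value.get(k).map(x => (k, (v, x)))   // local lookup, no shuffle; keys missing from small are dropped
}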
Resource tuning

Spark resources boil down to CPU and memory, and the following observations bear on how both are used:
The more executors, the higher the parallelism; but the more executors per instance, the less memory each executor gets.
The more cores per executor, the higher the parallelism.
The HDFS client has performance problems under high thread concurrency; a rough estimate is that about five concurrent tasks per executor are enough to saturate HDFS write bandwidth.
Having fewer partitions than executors * cores wastes resources. More partitions mean less memory consumed per partition, but beyond a certain point additional partitions no longer improve performance.
My naive rule of thumb is to tune these as follows (see the worked example after the list):
1. cores per executor = min(5, CPU cores per instance)
2. executors = instances * CPU cores per instance / cores per executor
3. Memory per instance divided by executors per instance gives executor.memory, which in turn determines shuffle.memory and storage.memory.
4. Estimate the total data volume, i.e. the data size at the largest shuffle (shuffle sizes appear in the Spark driver's run records).
5. Divide the result of step 4 by the result of step 3 to get the partition count. If it comes out very small, set the partition count to several times executors * cores instead.
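A worked example under assumed (hypothetical) numbers: 10 instances, each with 16 CPU cores and 64 GB of usable memory, and roughly 600 GB of data at the largest shuffle:

val coresPerExecutor = math.min(5, 16)             // step 1: 5
val executors = 10 * 16 / coresPerExecutor         // step 2: 32
val executorMemoryGb = 10 * 64 / executors         // step 3: 20 GB each
val shuffleGb = 600                                // step 4
val rawPartitions = shuffleGb / executorMemoryGb   // step 5: 30, which is small
val partitions = 3 * executors * coresPerExecutor  // so use ~3x executors * cores = 480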
That concludes this basic introduction to Spark and its operation tuning. Thank you for reading!