Optimization of some Spark parameters

Several key points deserve attention when optimizing Spark programs; the most important are data serialization and memory tuning.

Relevant Spark parameter settings

Problem 1: the number of reduce tasks is inappropriate

Solution: adjust the default configuration to the actual workload by modifying the parameter spark.default.parallelism. Typically, the number of reduce tasks is set to 2 to 3 times the number of cores. If the number is too large, many small tasks are created and task-launch overhead grows; if it is too small, tasks run slowly.
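
A minimal sketch, assuming a cluster whose executors expose 48 cores in total (an illustrative figure), showing how the parameter can be set on the SparkConf when the application starts:

import org.apache.spark.{SparkConf, SparkContext}

// Assumed cluster size: 48 cores in total, so roughly 2x that for default parallelism
val conf = new SparkConf()
  .setAppName("parallelism-tuning-sketch")
  .set("spark.default.parallelism", "96")
val sc = new SparkContext(conf)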

Problem 2: shuffle disk I/O time is long

Solution: set spark.local.dir to a list of multiple disks, preferably disks with fast I/O, so that shuffle I/O is spread across them and shuffle performance improves.
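
A minimal sketch, assuming three hypothetical fast local mount points; note that on YARN the cluster manager's own local directories take precedence over this setting:

val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/ssd1/spark,/mnt/ssd2/spark,/mnt/ssd3/spark")  // hypothetical SSD mount points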

Problem 3: large numbers of map and reduce tasks produce a large number of small shuffle files

Solution: by default the number of shuffle files is map tasks × reduce tasks. Setting spark.shuffle.consolidateFiles to true merges the intermediate shuffle files, so the number of files becomes the number of reduce tasks.
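
A minimal sketch; note that spark.shuffle.consolidateFiles only applies to the older hash-based shuffle manager and was removed in later Spark releases, so treat this as illustrative for the Spark versions this article targets:

val conf = new SparkConf()
  .set("spark.shuffle.consolidateFiles", "true")  // merge intermediate shuffle files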

Problem 4: serialization takes a long time and the result is large

Solution: by default Spark uses the JDK's built-in ObjectOutputStream, which produces large output and long CPU processing time. Switch to Kryo by setting spark.serializer to org.apache.spark.serializer.KryoSerializer. In addition, if the result is already large, consider using broadcast variables.
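
A minimal sketch of both suggestions, assuming a made-up record type MyRecord and a small lookup map standing in for a large read-only table:

import org.apache.spark.{SparkConf, SparkContext}

case class MyRecord(id: Long, payload: String)          // hypothetical record type

val conf = new SparkConf()
  .setAppName("kryo-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))        // registering classes makes Kryo output smaller
val sc = new SparkContext(conf)

// Ship a large read-only structure once per executor instead of once per task
val lookup = sc.broadcast(Map(1L -> "a", 2L -> "b"))
val tagged = sc.parallelize(Seq(MyRecord(1L, "x")))
  .map(r => (r.id, lookup.value.getOrElse(r.id, "unknown")))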

Problem 5: processing a single record is expensive

Solution: replace map with mapPartitions. mapPartitions is invoked once per partition, while map is invoked once for each record in the partition.
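
A minimal sketch, assuming a hypothetical ExpensiveParser whose construction dominates the per-record cost:

import org.apache.spark.{SparkConf, SparkContext}

class ExpensiveParser extends Serializable {             // hypothetical costly-to-build object
  def parse(line: String): Int = line.length
}

val sc = new SparkContext(new SparkConf().setAppName("mapPartitions-sketch").setMaster("local[*]"))
val lines = sc.parallelize(Seq("a", "bb", "ccc"))

// map would build the parser for every record; mapPartitions builds it once per partition
val parsed = lines.mapPartitions { iter =>
  val parser = new ExpensiveParser()
  iter.map(parser.parse)
}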

Problem 6: collect is slow when outputting a large number of results

Solution: the collect source code gathers all the results into an Array in driver memory. Instead, write the results directly to a distributed file system and then inspect the files there.
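
A minimal sketch, assuming an existing SparkContext sc and a hypothetical HDFS output path:

val results = sc.parallelize(1 to 1000000).map(_ * 2)    // stand-in for a large result RDD
// results.collect() would materialize everything as an Array on the driver;
// writing to a distributed file system avoids that bottleneck
results.saveAsTextFile("hdfs:///tmp/spark-results")       // hypothetical output path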

Problem 7: task execution speed is skewed

Solution: if the data is skewed, it is usually because the partition key is poorly chosen; consider a different way of parallelizing the work and add an intermediate aggregation step, as in the sketch below. If the skew is at the worker level, for example executors on some workers run slowly, setting spark.speculation=true lets Spark re-launch tasks from persistently slow nodes elsewhere.
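
One common way to add an intermediate aggregation is key salting. The sketch below assumes an existing SparkContext sc, a made-up pairs RDD, and an arbitrary choice of 10 salt buckets; it pre-aggregates on salted keys before the final aggregation per real key:

import scala.util.Random

val pairs = sc.parallelize(Seq(("hot", 1L), ("hot", 1L), ("cold", 1L)))    // stand-in skewed data
val result = pairs
  .map { case (k, v) => (s"${Random.nextInt(10)}#$k", v) }                 // salt the key into 10 buckets
  .reduceByKey(_ + _)                                                      // partial aggregation, skew spread out
  .map { case (saltedKey, v) => (saltedKey.split("#", 2)(1), v) }          // strip the salt
  .reduceByKey(_ + _)                                                      // final aggregation per real key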

Problem 8: many empty or tiny tasks are generated after multi-step RDD operations

Solution: use coalesce or repartition to reduce the number of partitions in the RDD.
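
A minimal sketch, assuming an existing SparkContext sc: after a selective filter over many partitions, most partitions are nearly empty, so the partition count is reduced before further work (the numbers are illustrative):

val filtered   = sc.parallelize(1 to 1000000, 1000).filter(_ % 1000 == 0)
val compacted  = filtered.coalesce(10)      // shrink the partition count without a shuffle
val rebalanced = filtered.repartition(10)   // or shuffle for an even rebalance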

Problem 9: Spark Streaming throughput is low

Solution: you can raise spark.streaming.concurrentJobs.
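
A minimal sketch; the value 4 is illustrative, and raising it means jobs from different batches may run concurrently, so use it with care if strict batch ordering matters:

val conf = new SparkConf()
  .set("spark.streaming.concurrentJobs", "4")   // default is 1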

Parameters related to scheduling

spark.cores.max

The number of CPU resources: the parameter spark.cores.max determines how many CPU cores a Spark application can request in Standalone and Mesos modes.

Note that this parameter has no effect in YARN mode, where resources are scheduled and managed by YARN.

There, the number of CPU cores is determined instead by the two parameters that directly configure the number of executors and the number of cores in each executor.

spark.scheduler.mode

Determines whether a single Spark application schedules its internal jobs in FIFO mode or FAIR mode.
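
A minimal sketch of switching from the default FIFO scheduling to FAIR:

val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")   // default is FIFO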

spark.speculation

spark.speculation (the switch for the speculative-execution mechanism), together with spark.speculation.interval, spark.speculation.quantile, spark.speculation.multiplier and other parameters, tunes the details of the speculation behavior.
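
A minimal sketch with values close to the documented defaults, shown only to illustrate which knob does what:

val conf = new SparkConf()
  .set("spark.speculation", "true")               // enable speculative execution
  .set("spark.speculation.interval", "100ms")     // how often to check for stragglers
  .set("spark.speculation.quantile", "0.75")      // fraction of tasks that must finish first
  .set("spark.speculation.multiplier", "1.5")     // how much slower than the median counts as a straggler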

spark.executor.memory xxG sets the memory per executor

spark.executor.cores x sets the number of cores per executor

spark.cores.max xx sets the maximum total number of cores used
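
A minimal sketch combining the three settings (all values illustrative):

val conf = new SparkConf()
  .set("spark.executor.memory", "8g")   // memory per executor
  .set("spark.executor.cores", "4")     // cores per executor
  .set("spark.cores.max", "32")         // total cores for the application (Standalone/Mesos)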

If you see all kinds of timeout, executor lost, or task lost errors:

Increase spark.network.timeout to 300 (5 min) or higher as appropriate. The default is 120 (120 s), and it configures the timeout for all network transfers; the more fine-grained timeout parameters fall back to this value when they are not set explicitly.
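
A minimal sketch of raising the global network timeout:

val conf = new SparkConf()
  .set("spark.network.timeout", "300s")   // default is 120s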
