
How to solve common errors when running Spark programs and how to optimize them


Many newcomers are not clear about how to solve the common errors that occur when running Spark programs, or how to optimize them. To help with this, the following explains the problems in detail; I hope you find it useful.

I. org.apache.spark.shuffle.FetchFailedException

1. Problem description

This problem usually occurs when there are a large number of shuffle operations: tasks keep failing, get re-executed, and fail again in a cycle, which is very time-consuming.

2. Error messages

(1) missing output location

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

(2) shuffle fetch failed

org.apache.spark.shuffle.FetchFailedException: Failed to connect to spark047215/192.168.47.215:50268

The current configuration uses 1 core and 5 GB RAM per executor, with 20 executors started.

3. Solution

Generally, when you encounter this problem, you can increase executor memory while also increasing the number of CPU cores per executor, so that task parallelism is not reduced.

spark.executor.memory 15G

spark.executor.cores 3

spark.cores.max 21

The number of executors started is 7:

executorNum = spark.cores.max / spark.executor.cores = 21 / 3 = 7

Configuration of each executor:

3 cores, 15 GB RAM

The memory resources consumed are 105 GB RAM:

15 GB × 7 = 105 GB

As you can see, the resources used have barely increased, yet the same task that was stuck for several hours under the original configuration finishes in a few minutes after the configuration change.
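As a minimal sketch (the application name is a hypothetical placeholder, not from the original), the configuration above can be applied when building the SparkContext, or passed with --conf flags to spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")        // hypothetical application name
  .set("spark.executor.memory", "15g")    // 15 GB per executor
  .set("spark.executor.cores", "3")       // 3 cores per executor
  .set("spark.cores.max", "21")           // 21 / 3 = 7 executors in total
val sc = new SparkContext(conf)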

II. Executor & Task Lost

1. Problem description

Because of network issues or GC pauses, heartbeat feedback from the executor or task is not received, and the executor or task is marked as lost.

2. Error messages

(1) executor lost

WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, aa.local): ExecutorLostFailure (executor lost)

(2) task lost

WARN TaskSetManager: Lost task 69.2 in stage 7.0 (TID 1145, 192.168.47.217): java.io.IOException: Connection from /192.168.47.217:55483 closed

(3) Various timeouts

java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]

ERROR TransportChannelHandler: Connection to /192.168.47.212:35409 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong

3. Solution

Increase the value of spark.network.timeout to 300 (5 min) or higher, as appropriate.

The default is 120 (120 s), and it controls the timeout for all network transfers. If the parameters below are not set explicitly, they default to this value (see the configuration sketch after the list).

spark.core.connection.ack.wait.timeout

spark.akka.timeout

spark.storage.blockManagerSlaveTimeoutMs

spark.shuffle.io.connectionTimeout

spark.rpc.askTimeout or spark.rpc.lookupTimeout
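A minimal sketch of raising the timeout, assuming it is set programmatically before the context is created; the same value can be passed as --conf spark.network.timeout=300s to spark-submit:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.network.timeout", "300s")   // also raises every timeout above that is not set explicitly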

III. Skew

1. Problem description

Most tasks complete, but one or two tasks never finish or run very slowly.

It can be divided into two types: data skew and task skew.

2. Symptoms

(1) Data skew

(2) Task skew

The amount of data handled by individual tasks does not differ much, yet some of them run very slowly.

3. Solution

(1) Data skew

Data skew is mostly caused by a large number of null values or empty strings (""), which can be filtered out before the calculation.

For example:

sqlContext.sql("... where col is not null and col != ''")
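As a minimal sketch, assuming a hypothetical table named logs whose group key is the column col, the null and empty keys are dropped before the aggregation so they do not all land in a single partition:

val cleaned = sqlContext.sql(
  "select col, count(*) as cnt " +
  "from logs " +
  "where col is not null and col != '' " +
  "group by col")
cleaned.show()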

(2) Task skew

There are many possible causes of task skew: network I/O, CPU, or memory pressure may make tasks on a particular node run slowly, so check that node's performance monitoring to analyze the cause. I once ran into a case where a colleague was running an R task on one of the Spark workers, which caused the Spark tasks on that node to run slowly.

Alternatively, you can enable Spark's speculative execution. If a few tasks on a certain machine are particularly slow, speculation launches copies of those tasks on other machines, and Spark takes the result of whichever copy finishes first. The relevant settings are listed below, followed by a configuration sketch.

spark.speculation true

spark.speculation.interval 100 - detection period, in milliseconds

spark.speculation.quantile 0.75 - fraction of tasks that must be complete before speculation starts

spark.speculation.multiplier 1.5 - how many times slower than the median a task must be before it is speculated
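A minimal sketch of enabling speculative execution with the values above (all four keys are standard Spark properties); the same lines can equally go into spark-defaults.conf:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "100ms")  // check for slow tasks every 100 ms
  .set("spark.speculation.quantile", "0.75")   // wait until 75% of tasks have finished
  .set("spark.speculation.multiplier", "1.5")  // re-launch tasks 1.5x slower than the median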

IV. OOM (out of memory)

1. Problem description

When there is not enough memory and too much data, an OOM exception is thrown.

Because the error message is self-explanatory, no example message is shown here.

2. Solution

There are two main kinds: driver OOM and executor OOM.

(1) driver OOM

This is generally caused by using collect to pull all the executor data back into the driver. Avoid collect where possible.
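A minimal sketch of two common alternatives, assuming a hypothetical RDD named results and a hypothetical HDFS output path: take a bounded sample into the driver, or write the data out directly from the executors.

val preview = results.take(100)                       // only a bounded number of rows reach the driver
results.saveAsTextFile("hdfs:///tmp/results-output")  // hypothetical path; data is written from the executors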

(2) executor OOM

1. Increase the memory available to your code by following the memory optimization methods below

2. Increase the total executor memory, that is, increase the value of spark.executor.memory

3. Increase the parallelism of tasks (split large tasks into smaller ones); refer to the parallelism optimization methods below

V. Optimization

1. Memory

If your job involves a large amount of shuffle and relatively little RDD caching, you can change the following parameters to further improve task speed.

spark.storage.memoryFraction - the fraction of memory allocated to the RDD cache; defaults to 0.6 (60%). It can be reduced if little data is cached.

spark.shuffle.memoryFraction - the fraction of memory allocated to shuffle data; defaults to 0.2 (20%).

The remaining 20% of memory is reserved for objects created by the code itself, and so on.

If tasks are slow, the JVM is GC-ing frequently, or there is not enough memory, you can also lower the two values above.

spark.rdd.compress true - defaults to false; compresses serialized RDD partitions, consuming some CPU but reducing space usage.

If the data is used only once, do not cache it: caching will not improve running speed and only wastes memory.
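A minimal sketch of the memory settings discussed above for a shuffle-heavy job with little caching; the exact fractions are assumptions, and note that the two memoryFraction properties apply to the legacy static memory manager of older Spark versions:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.4")  // shrink the RDD cache share (default 0.6); assumed value
  .set("spark.shuffle.memoryFraction", "0.4")  // grow the shuffle share (default 0.2); assumed value
  .set("spark.rdd.compress", "true")           // compress serialized RDD partitions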

2. Parallelism

spark.default.parallelism

This is the parallelism used when a shuffle occurs; in standalone mode it defaults to the total number of cores, and it can be adjusted manually. If it is set too high, it creates many small tasks and increases the overhead of launching them; if it is set too low, tasks that handle large amounts of data run slowly.

spark.sql.shuffle.partitions

This is the parallelism for SQL aggregation operations (which trigger a shuffle); it defaults to 200 and should be increased if tasks run slowly.

The same task, run with two different settings:

spark.sql.shuffle.partitions=300:

spark.sql.shuffle.partitions=500:

The main reason the second run is faster is that GC time is greatly reduced.

The main way to modify the parallelism of the map stage is to call rdd.repartition(partitionNum) in the code.
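A minimal sketch pulling these parallelism knobs together; the partition counts and the RDD name rawRdd are assumptions for illustration, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.default.parallelism", "200")     // shuffle parallelism for RDD operations; assumed value
  .set("spark.sql.shuffle.partitions", "300")  // shuffle parallelism for SQL aggregations; assumed value
// parallelism of the map stage is changed in code:
val widened = rawRdd.repartition(300)          // split the stage into more, smaller tasks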

