Many newcomers are not clear about how to diagnose and fix the errors that commonly occur when running Spark programs, or how to tune such programs. To help, this article walks through the most frequent problems, their error messages, the fixes, and some optimization tips.
I. org.apache.spark.shuffle.FetchFailedException
1. Problem description
This problem usually occurs when there is a large amount of shuffle: tasks keep failing, get re-executed, fail again, and so on, which wastes a great deal of time.
2. Error messages
(1) missing output location
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
(2) shuffle fetch failed
org.apache.spark.shuffle.FetchFailedException: Failed to connect to spark047215/192.168.47.215:50268
The configuration at the time was 1 core and 5 GB RAM per executor, with 20 executors started.
3. Solution
Generally, when you hit this problem, increase the executor memory and, at the same time, increase the number of CPU cores per executor so that task parallelism is not reduced.
spark.executor.memory 15G
spark.executor.cores 3
spark.cores.max 21
The number of executors started is 7:
executorNum = spark.cores.max / spark.executor.cores = 21 / 3 = 7
Configuration of each executor: 3 cores, 15 GB RAM
Total memory consumed: 15 GB * 7 = 105 GB RAM
Notice that the total resources used have barely changed, yet a job that used to be stuck for several hours with the original configuration now finishes in a few minutes.
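As an illustration, here is a minimal Scala sketch of how these settings could be supplied when the session is created. The application name is made up and the values simply mirror the example above; in practice the same settings can be passed via spark-submit or spark-defaults.conf instead.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("shuffle-fetch-example")        // hypothetical application name
  .config("spark.executor.memory", "15g")  // more memory per executor
  .config("spark.executor.cores", "3")     // more cores per executor keeps parallelism up
  .config("spark.cores.max", "21")         // 21 / 3 = 7 executors in standalone mode
  .getOrCreate()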
II. Executor & Task Lost
1. Problem description
Because of network problems or GC pauses, the worker or executor did not receive heartbeat feedback from the executor or task.
2. Error messages
(1) executor lost
WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, aa.local): ExecutorLostFailure (executor lost)
(2) task lost
WARN TaskSetManager: Lost task 69.2 in stage 7.0 (TID 1145, 192.168.47.217): java.io.IOException: Connection from /192.168.47.217:55483 closed
(3) various timeouts
java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
ERROR TransportChannelHandler: Connection to /192.168.47.212:35409 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong
3. Solution
Increase spark.network.timeout to 300 (5 minutes) or higher as appropriate.
The default is 120 (120 s). It controls the timeout for all network transfers; the following parameters fall back to it unless they are set explicitly:
spark.core.connection.ack.wait.timeout
spark.akka.timeout
spark.storage.blockManagerSlaveTimeoutMs
spark.shuffle.io.connectionTimeout
spark.rpc.askTimeout or spark.rpc.lookupTimeout
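A minimal sketch of raising the global timeout when building the session, following the same pattern as before. The 300s value matches the suggestion above; the application name is made up.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("network-timeout-example")       // hypothetical application name
  .config("spark.network.timeout", "300s")  // covers the individual timeouts listed above unless they are set separately
  .getOrCreate()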
III. Skew
1. Problem description
Most tasks complete, but one or two tasks never finish or run very slowly.
This comes in two kinds: data skew and task skew.
2. Symptoms
(1) Data skew
A small number of tasks receive far more data than the others and therefore run much longer.
(2) Task skew
Tasks that handle roughly similar amounts of data nevertheless differ greatly in speed; a few run very slowly.
3. Solution
(1) Data skew
Data skew is mostly caused by a large number of null values or empty strings (""), which can be filtered out before the computation.
For example:
sqlContext.sql("... where col is not null and col != ''")
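If you are working with the DataFrame API rather than raw SQL, the same filter can be expressed directly. This is only a sketch: the paths, the DataFrame name, and the column name col are placeholders.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("skew-filter-sketch").getOrCreate()
val df = spark.read.parquet("/path/to/input")            // hypothetical input path
// Drop null / empty keys before the shuffle-heavy aggregation that was skewing.
val cleaned = df.filter("col is not null and col != ''")
val counts = cleaned.groupBy("col").count()
counts.write.parquet("/path/to/output")                  // hypothetical output path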
(2) Task skew
Task skew has many possible causes: network I/O, CPU, or memory pressure can all slow down the tasks on a particular node, so look at that node's performance monitoring to find the reason. I once ran into a case where a colleague was running an R job on one of the Spark workers, which slowed down the Spark tasks on that node.
Alternatively, you can enable Spark's speculative execution. If a few tasks on some machine are particularly slow, speculation re-launches them on other machines, and Spark takes whichever copy finishes first as the final result.
spark.speculation true
spark.speculation.interval 100 - how often to check for tasks to speculate (milliseconds)
spark.speculation.quantile 0.75 - fraction of tasks that must be complete before speculation is triggered
spark.speculation.multiplier 1.5 - how many times slower than the median a task must be before it is speculated
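These can be supplied in the same way as the earlier configuration sketches; for example (values copied from the list above, application name made up):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("speculation-sketch")                  // hypothetical application name
  .config("spark.speculation", "true")            // enable speculative execution
  .config("spark.speculation.interval", "100ms")
  .config("spark.speculation.quantile", "0.75")
  .config("spark.speculation.multiplier", "1.5")
  .getOrCreate()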
IV. OOM (out of memory)
1. Problem description
When there is not enough memory for the amount of data being processed, an OOM exception is thrown.
Because the error message is self-explanatory, no sample output is shown here.
2. Solution
There are mainly two kinds: driver OOM and executor OOM.
(1) driver OOM
It is usually caused by using collect() to pull all of the executors' data back onto the driver. Try to avoid collect().
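A minimal sketch of the alternatives, under the assumption that the full result is not actually needed on the driver (paths and the column name are made up): write the result out from the executors, or bring back only a small sample.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("driver-oom-sketch").getOrCreate()
val result = spark.read.parquet("/path/to/big/input")        // hypothetical input path
  .groupBy("key").count()                                    // "key" is a placeholder column
// Avoid result.collect(), which materializes everything on the driver.
result.write.mode("overwrite").parquet("/path/to/output")    // executors write the full result
val sample = result.take(20)                                 // only a small sample comes back to the driver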
(2) executor OOM
1. Increase the memory available to your code by following the memory tuning advice in the optimization section below
2. Increase the total executor memory, i.e. raise spark.executor.memory
3. Increase task parallelism (split large tasks into smaller ones); see the parallelism tuning advice below
V. Optimization
1. Memory
If your job shuffles a lot of data and caches few RDDs, you can adjust the following parameters to further speed it up.
spark.storage.memoryFraction - the fraction of memory allocated to the RDD cache, 0.6 (60%) by default; it can be reduced if little data is cached
spark.shuffle.memoryFraction - the fraction of memory allocated to shuffle data, 0.2 (20%) by default
The remaining 20% of memory is left for objects created by the code, and so on.
If tasks are slow, the JVM GCs frequently, or there is not enough memory, you can also lower the two values above.
"spark.rdd.compress", "true"-defaults to false, compresses serialized RDD partitions, consumes some cpu and reduces space usage
If a dataset is used only once, do not cache it: caching will not make the job faster and only wastes memory.
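A small sketch pulling these ideas together, assuming a Spark version that still uses the legacy memoryFraction settings; the fraction values, paths, and filters are purely illustrative.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
val spark = SparkSession.builder()
  .appName("memory-tuning-sketch")                 // hypothetical application name
  .config("spark.storage.memoryFraction", "0.3")   // shrink the RDD-cache share when little is cached
  .config("spark.shuffle.memoryFraction", "0.5")   // give a shuffle-heavy job more room
  .config("spark.rdd.compress", "true")            // compress serialized cached partitions
  .getOrCreate()
val reused = spark.sparkContext.textFile("/path/to/reused/input")   // hypothetical path, read twice below
reused.persist(StorageLevel.MEMORY_ONLY_SER)       // cache only data that is used more than once
val errors = reused.filter(_.contains("ERROR")).count()
val warns = reused.filter(_.contains("WARN")).count()
val onceOnly = spark.sparkContext.textFile("/path/to/other/input")  // used a single time: no cache
val total = onceOnly.count()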
2. Parallelism
spark.default.parallelism
The parallelism used when a shuffle occurs. In standalone mode it defaults to the total number of cores, and it can be adjusted manually. Setting it too high creates many tiny tasks and increases task-launch overhead; setting it too low makes tasks over large amounts of data run slowly.
spark.sql.shuffle.partitions
The parallelism of SQL aggregations (which trigger a shuffle); the default is 200. Increase it if such jobs run slowly.
The same job was run with spark.sql.shuffle.partitions=300 and with spark.sql.shuffle.partitions=500 (the original timing screenshots are not reproduced here); the faster run owed its speed mainly to a large reduction in GC time.
The main way to change the parallelism of the map stage is to call rdd.repartition(partitionNum) in the code, as in the sketch below.
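A minimal sketch of both knobs, with made-up paths and partition counts (500 matches the comparison above, 300 is arbitrary):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("parallelism-sketch").getOrCreate()
// Shuffle parallelism for SQL aggregations can be changed at runtime.
spark.conf.set("spark.sql.shuffle.partitions", "500")
// Map-stage parallelism: repartition the RDD before the heavy per-record work.
val raw = spark.sparkContext.textFile("/path/to/input")   // hypothetical path
val repartitioned = raw.repartition(300)                  // partitionNum chosen for illustration
val lineLengths = repartitioned.map(_.length)
println(lineLengths.sum())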