Some common problems and solutions when using Spark


This article explains some common problems and their solutions when using Spark. The explanations are kept simple and clear and are easy to learn and understand; follow along to work through them.

1. First, the most commonly used step for troubleshooting after a Spark task has run is to pull down the task's run log.

When the program fails, download the log and look for the specific cause. Command to fetch the log: yarn logs -applicationId <app_id>

2. Nine Spark performance optimization problems and their solutions.

Key points to pay attention to when optimizing Spark programs; the most important are data serialization and memory tuning.

Problem 1: the number of reduce tasks is not appropriate

Solution: adjust the default configuration to the actual workload by modifying the parameter spark.default.parallelism. Typically, set the number of reduce tasks to 2 to 3 times the number of cores. If the number is too large, many small tasks are created and task startup overhead increases; if it is too small, tasks run slowly.
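
For example, a minimal sketch of setting this in a SparkConf, assuming a cluster with roughly 100 total cores (a hypothetical figure):

    import org.apache.spark.SparkConf

    // With ~100 total cores (hypothetical), set parallelism to about 2-3x the core count.
    val conf = new SparkConf()
      .setAppName("parallelism-tuning")
      .set("spark.default.parallelism", "300")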

Problem 2: shuffle disk I/O time is long

Solution: set spark.local.dir to a list of directories on multiple disks, choosing disks with fast I/O, so that shuffle I/O is spread across them and shuffle performance improves.
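
A minimal sketch, assuming the worker machines have two fast local disks mounted at hypothetical paths:

    import org.apache.spark.SparkConf

    // Spread shuffle and spill files across several fast local disks (paths are hypothetical).
    val conf = new SparkConf()
      .set("spark.local.dir", "/mnt/ssd1/spark-tmp,/mnt/ssd2/spark-tmp")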

Problem 3: a large number of map and reduce tasks produces a large number of small shuffle files

Solution: by default the number of shuffle files is map tasks * reduce tasks. Set spark.shuffle.consolidateFiles to true to merge the intermediate shuffle files, so the number of files becomes the number of reduce tasks.
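
A sketch of turning this on (note that this setting applies to the old hash-based shuffle in early Spark versions and was removed in later releases):

    import org.apache.spark.SparkConf

    // Consolidate intermediate shuffle files so the file count scales with reduce tasks
    // rather than map tasks * reduce tasks (old hash shuffle manager only).
    val conf = new SparkConf()
      .set("spark.shuffle.consolidateFiles", "true")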

Problem 4: serialization takes a long time and the serialized result is large

Solution: by default Spark uses the JDK's built-in ObjectOutputStream, which produces a large serialized result and long CPU processing time. Set spark.serializer to org.apache.spark.serializer.KryoSerializer instead. In addition, if the result itself is large, consider using a broadcast variable.
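
A sketch of switching to Kryo and registering application classes (MyRecord is a hypothetical class standing in for whatever the job actually serializes):

    import org.apache.spark.SparkConf

    case class MyRecord(id: Long, name: String)   // hypothetical class used in the job

    // Use Kryo instead of Java serialization; registering classes keeps the output compact.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord]))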

Problem 5: processing a single record is expensive

Solution: replace map with mapPartitions. mapPartitions runs the computation once per partition, while map runs it once per record in the partition.
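
A minimal sketch contrasting the two; the "expensive setup" below is a stand-in for real per-task initialization such as loading a model or opening a client, and sc is an existing SparkContext:

    // sc: an existing SparkContext (assumed)
    val rdd = sc.parallelize(1 to 1000000, 8)

    // mapPartitions: the setup runs once per partition and is reused for every record in it.
    val withMapPartitions = rdd.mapPartitions { iter =>
      val prefix = "expensive-setup"          // runs once per partition
      iter.map(x => s"$prefix-$x")
    }

    // map: the same setup would run once for every single record.
    val withMap = rdd.map { x =>
      val prefix = "expensive-setup"          // runs once per record
      s"$prefix-$x"
    }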

Problem 6: collect is slow when outputting a large number of results

Solution: internally, collect puts all results into memory as an Array. Instead, write the results directly to a distributed file system and then inspect the contents there.
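
A sketch of writing large results out instead of collecting them; bigResult is an RDD computed earlier and the HDFS path is hypothetical:

    // Write the full result to the distributed file system instead of pulling it
    // into the driver's memory with collect(); inspect the files afterwards.
    bigResult.saveAsTextFile("hdfs:///tmp/spark-output/run-001")
    val preview = bigResult.take(20)   // pull only a small sample back to the driver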

Problem 7: task execution speed is skewed

Solution: if the data is skewed, it is usually because the partition key is poorly chosen; consider other ways of parallelizing the work and add an aggregation step in the middle. If it is worker skew, for example executors on certain workers run slowly, set spark.speculation=true so that persistently slow nodes are worked around by re-launching their tasks elsewhere.
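
A sketch of enabling speculative execution:

    import org.apache.spark.SparkConf

    // Re-launch tasks that run much slower than their peers on other executors.
    val conf = new SparkConf()
      .set("spark.speculation", "true")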

Problem 8: many empty or small tasks are generated after multi-step RDD operations

Solution: use coalesce or repartition to reduce the number of partitions in the RDD.
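
A minimal sketch; sc is an existing SparkContext and the numbers are illustrative:

    // After a selective filter, most of the 200 partitions are nearly empty;
    // coalesce shrinks the partition count without a full shuffle.
    val rdd = sc.parallelize(1 to 1000000, 200)
    val filtered = rdd.filter(_ % 1000 == 0)
    val compacted = filtered.coalesce(8)       // or .repartition(8) to force a full shuffle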

Problem 9: Spark Streaming throughput is low

Solution: increase spark.streaming.concurrentJobs so that more than one streaming job can run at a time.
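
A sketch of this setting (the value 4 is illustrative; the default is 1):

    import org.apache.spark.SparkConf

    // Allow several streaming output jobs to run concurrently.
    val conf = new SparkConf()
      .set("spark.streaming.concurrentJobs", "4")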

3. Compiling the Spark source code directly in IntelliJ IDEA and resolving the resulting problems:

http://blog.csdn.net/tanglizhe1105/article/details/50530104

http://stackoverflow.com/questions/18920334/output-path-is-shared-between-the-same-module-error

Spark compilation: mvn clean package -Dmaven.test.skip=true

JVM parameters: -Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m

4. Importing the Spark source code into IntelliJ, build error:

Not found: type SparkFlumeProtocol and EventBatch

http://stackoverflow.com/questions/33311794/import-spark-source-code-into-intellj-build-error-not-found-type-sparkflumepr


5. org.apache.spark.SparkException: Exception thrown in awaitResult

Set "spark.sql.broadcastTimeout" to increase the timeout

6. Apache Zeppelin compilation and installation:

Apache Zeppelin installation grunt build error:

Solution: go into the web module and run npm install

http://stackoverflow.com/questions/33352309/apache-zeppelin-installation-grunt-build-error?rq=1

7. Solving problems encountered when compiling the Spark source code: http://www.tuicool.com/articles/NBVvai

Not enough memory: this error is caused by insufficient memory during compilation; increase the memory available when compiling.

[ERROR] PermGen space -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

8. Exception in thread "main" java.lang.UnsatisfiedLinkError: no jnind4j in java.library.path

Solution: even with 64-bit Java on Windows, the no jnind4j in java.library.path error can still appear. It may be that you have incompatible DLLs on your PATH. To tell DL4J to ignore those, add the following as a VM parameter (Run -> Edit Configurations -> VM Options in IntelliJ): -Djava.library.path=""

9. Errors when running the Spark 2.0 source code locally, and their solutions:

In the corresponding pom, change the scope of the dependent jars from provided to compile.

Remove the make option before running the class; add -Dspark.master=local to the run configuration's VM options.

Running the Spark example code under Win7 reports an error:

java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:/SourceCode/spark-2.0.0/spark-warehouse

Solution: modify the WAREHOUSE_PATH variable in the SQLConf class, changing the file: prefix to file:/ or file:///, for example:

createWithDefault("file:/${system:user.dir}/spark-warehouse")

Local mode: run with -Dspark.master=local

10. Resolving Task not serializable exceptions

Method 1: suppose all of the data in an RDD is written to a database over a JDBC connection. With the map function, a connection may have to be created for every element, which is very expensive. With mapPartitions, only one connection needs to be established per partition; the function receives each partition's iterator and returns an iterator, and because the connection is created inside the partition-level function it never has to be serialized.
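
A minimal sketch of the partition-level approach, shown here with foreachPartition (the action counterpart of mapPartitions, since the result is not needed); rdd is an existing RDD, and the JDBC URL, credentials, table, and SQL are hypothetical:

    import java.sql.DriverManager

    // One JDBC connection per partition instead of one per record. The connection is
    // created inside the executor-side closure, so it never needs to be serialized.
    rdd.foreachPartition { records =>
      val conn = DriverManager.getConnection("jdbc:mysql://db-host:3306/mydb", "user", "secret")
      val stmt = conn.prepareStatement("INSERT INTO my_table (value) VALUES (?)")
      try {
        records.foreach { r =>
          stmt.setString(1, r.toString)
          stmt.executeUpdate()
        }
      } finally {
        stmt.close()
        conn.close()
      }
    }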

Method 2: mark the non-serializable object with a @transient annotation so that the field is not serialized when the enclosing object is shipped over the network.
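
A minimal sketch of the @transient approach; the class and field are illustrative:

    // The non-serializable member is excluded from serialization and re-created
    // lazily on the executor side when it is first used.
    class DateHelper extends Serializable {
      @transient lazy val formatter = new java.text.SimpleDateFormat("yyyy-MM-dd")
      def format(d: java.util.Date): String = formatter.format(d)
    }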

11. A function works fine when called as func("11"), but calling func(11) or func(1.1) reports error: type mismatch. This problem is easy to solve.

One option is to overload func for each specific parameter type, as in traditional Java; that is not difficult, but it means defining multiple functions.

Another option is to use supertypes such as AnyVal or Any, but then the function has to convert the type internally before applying the specific logic. Both approaches follow traditional Java thinking and do solve the problem, but neither is concise. Scala, with its rich syntactic sugar, instead offers implicit conversions (implicit) for exactly this kind of type conversion.
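
A minimal sketch of the implicit-conversion approach; the function body is just an illustration:

    import scala.language.implicitConversions

    // func is written once for String; implicit conversions let Int and Double callers reuse it.
    def func(s: String): Int = s.length

    implicit def intToString(i: Int): String = i.toString
    implicit def doubleToString(d: Double): String = d.toString

    func("11")   // compiles as-is
    func(11)     // the compiler inserts intToString(11)
    func(1.1)    // the compiler inserts doubleToString(1.1)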

12. org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle

Solution: this problem usually occurs when there are many shuffle operations: tasks keep failing and being re-executed over and over until the application itself fails. In general, increase the executor memory and, at the same time, the number of CPU cores per executor, so that overall task parallelism is not reduced.
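
A sketch of the two settings together; the values are illustrative:

    import org.apache.spark.SparkConf

    // More memory per executor to survive heavy shuffles, and more cores per executor
    // so that total parallelism is not reduced when fewer executors fit on the cluster.
    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")
      .set("spark.executor.cores", "4")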

13. Spark ML Pipeline GBT/RF prediction reports an error: java.util.NoSuchElementException: key not found: 8.0

Cause: the input column names passed to setFeaturesCol and setLabelCol in the GBT/RF model are inconsistent.

Solution: save only the trained algorithm model, not the PipelineModel.

14. Deleting files with garbled names on Linux: step 1, run ls -la (add -i to show inode numbers) to find the file's inode; step 2, run find . -inum <inode_num> -exec rm -rf {} \;

15. Caused by: java.lang.RuntimeException: Failed to commit task
Caused by: org.apache.spark.executor.CommitDeniedException: attempt_201603251514_0218_m_000245_0: Not committed because the driver did not authorize commit

Solution: if you understand how Spark divides stages, this problem is relatively simple. The tasks contained in a stage are too large, usually because the chain of transformations is too long, so the tasks the driver distributes to the executors become very large. The fix is to split the stage: during execution, call cache followed by count to materialize some intermediate data and cut off the overly long stage.
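
A minimal sketch of the cache-plus-count pattern described above; rdd is an existing RDD[String] and the transformations are placeholders for a long chain:

    // cache() marks the intermediate result for storage; count() is an action that forces
    // it to be computed now, so later jobs read from the cache instead of replaying the
    // whole long lineage in one oversized set of tasks.
    val intermediate = rdd
      .map(_.trim)
      .filter(_.nonEmpty)
      .cache()
    intermediate.count()

    val finalResult = intermediate.map(_.length)   // continues the pipeline from the cached data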

Thank you for reading. That covers some common problems and solutions when using Spark. After working through this article you should have a deeper understanding of them, and specific usage still needs to be verified in practice.
