This article explains some common problems encountered when using Spark and their solutions. The explanations are kept simple and clear so that they are easy to learn and understand. Please follow the editor's train of thought to study "some common problems and solutions when using Spark".
1. First, one of the most commonly used commands for troubleshooting after a Spark task has run is the one that downloads the task's run log.
When the program hits an error, download the log to see the specific cause. Command to download the log: yarn logs -applicationId <app_id>
2. Nine Spark performance-optimization problems and their solutions.
Several key points deserve attention when optimizing a Spark program; the most important are data serialization and memory optimization.
Problem 1: the number of reduce tasks is inappropriate
Solution: adjust the default configuration to fit the actual workload by modifying the parameter spark.default.parallelism. Typically, the number of reduce tasks is set to 2 to 3 times the number of cores. If the number is too large, many small tasks are produced and the cost of launching tasks increases; if it is too small, tasks run slowly.
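As a rough sketch (the partition counts and input path here are illustrative, not from the article), the parallelism can be set globally on the SparkConf or passed per shuffle operation:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values: a cluster with about 16-24 cores might use 2-3x that many partitions.
val conf = new SparkConf()
  .setAppName("parallelism-example")
  .set("spark.default.parallelism", "48")
val sc = new SparkContext(conf)

// The partition count can also be given per operation:
val counts = sc.textFile("/tmp/input.txt")   // hypothetical input path
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _, 48)                    // 48 reduce tasks for this shuffle
```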
Problem 2: shuffle disk IO time is too long
Solution: set spark.local.dir to multiple disks, preferably disks with fast IO, so that shuffle performance improves through the added IO bandwidth.
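A hedged example of pointing shuffle spill space at several fast local disks (the mount points are placeholders for your own disks):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/ssd1/spark,/mnt/ssd2/spark")  // placeholder mount points, comma-separated
```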
Problem 3: large numbers of map and reduce tasks produce a large number of small shuffle files
Solution: by default the number of shuffle files is map tasks * reduce tasks. Setting spark.shuffle.consolidateFiles to true merges the intermediate shuffle files, so the number of files becomes the number of reduce tasks.
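A sketch of the setting described above; note that this flag applies to the older hash-based shuffle and was removed in later Spark releases, so treat it as version-dependent:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.consolidateFiles", "true")  // merge intermediate shuffle files (older hash shuffle only)
```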
Problem 4: serialization takes a long time and the results are large
Solution: Spark uses the JDK's built-in ObjectOutputStream by default, which produces large results and long CPU processing times. You can switch to org.apache.spark.serializer.KryoSerializer by setting spark.serializer. In addition, if the result is inherently large, consider using a broadcast variable.
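A minimal sketch of switching to Kryo; the Record class is a hypothetical stand-in for your own types:

```scala
import org.apache.spark.SparkConf

case class Record(id: Long, payload: String)   // hypothetical user type

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Record])) // registering classes keeps the serialized form compact
```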
Problem 5: processing a single record is expensive
Solution: replace map with mapPartitions. mapPartitions computes once per partition, while map computes once for every record in a partition.
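A sketch of the map-to-mapPartitions replacement, assuming some expensive per-record setup; the date parser and input path are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("mapPartitions-example").setMaster("local[*]"))
val lines = sc.textFile("/tmp/input.txt")   // hypothetical input path

// With map, the costly setup below would run once per record;
// with mapPartitions, it runs once per partition and the records are streamed through it.
val parsed = lines.mapPartitions { iter =>
  val parser = new java.text.SimpleDateFormat("yyyy-MM-dd")  // costly object, built once per partition
  iter.map(line => parser.parse(line.take(10)))
}
```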
Problem 6: collect is slow when outputting a large number of results
Solution: the collect implementation puts all results into driver memory as an Array. Instead, write the output directly to a distributed file system and then inspect the contents of the files there.
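A sketch of writing large results straight to a distributed file system instead of collecting them to the driver; the HDFS paths are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("save-instead-of-collect"))
val result = sc.textFile("hdfs:///data/input")     // hypothetical input path
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)

// Instead of result.collect(), which pulls everything into driver memory:
result.saveAsTextFile("hdfs:///data/word-counts")  // then inspect the output files on HDFS
```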
Problem 7: task execution speed is skewed
Solution: if the data is skewed, it is usually because the partition key was chosen poorly; consider other ways of parallelizing the processing and add an aggregation step in the middle. If the skew is on the Worker side, for example executors on certain workers run slowly, set spark.speculation=true so that tasks on persistently slow nodes are speculatively re-executed elsewhere.
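The speculative-execution switch can be turned on in the configuration, for example:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")  // re-launch persistently slow tasks on other nodes
```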
Problem 8: many empty or tiny tasks are generated after multi-step RDD operations
Solution: use coalesce or repartition to reduce the number of partitions in the RDD.
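A sketch of shrinking the partition count after a filter has left many nearly empty partitions; the data and counts are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("coalesce-example").setMaster("local[*]"))
val filtered = sc.parallelize(1 to 1000000, 200)
  .filter(_ % 1000 == 0)   // leaves most of the 200 partitions nearly empty
  .coalesce(10)            // merge into 10 partitions without a full shuffle
println(filtered.getNumPartitions)
```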
Problem 9: Spark Streaming throughput is low
Solution: set spark.streaming.concurrentJobs so that more than one job can run at a time.
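A hedged example; the value 4 is illustrative and should be tuned to the workload:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.concurrentJobs", "4")  // allow several streaming jobs to run concurrently
```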
3. Compiling the Spark source code directly in IntelliJ IDEA and solving related problems:
http://blog.csdn.net/tanglizhe1105/article/details/50530104
http://stackoverflow.com/questions/18920334/output-path-is-shared-between-the-same-module-error
Spark compilation: clean package -Dmaven.test.skip=true
Parameters: -Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m
4. Importing the Spark source code into IntelliJ, build error:
Not found: type SparkFlumeProtocol and EventBatch
http://stackoverflow.com/questions/33311794/import-spark-source-code-into-intellj-build-error-not-found-type-sparkflumepr
(Screenshot: Spark_complie_config.png)
5. org.apache.spark.SparkException: Exception thrown in awaitResult
Solution: increase the timeout by setting "spark.sql.broadcastTimeout".
6. Apache Zeppelin compilation and installation:
Apache Zeppelin installation grunt build error:
Solution: enter the web module and run npm install
http://stackoverflow.com/questions/33352309/apache-zeppelin-installation-grunt-build-error?rq=1
7. Solving problems encountered when compiling the Spark source code: http://www.tuicool.com/articles/NBVvai
Insufficient memory: this error is caused by there not being enough memory during compilation; increase the memory available when compiling.
[ERROR] PermGen space -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
8. Exception in thread "main" java.lang.UnsatisfiedLinkError: no jnind4j in java.library.path
Solution: if you are using 64-bit Java on Windows and still get the "no jnind4j in java.library.path" error, it may be that you have incompatible DLLs on your PATH. To tell DL4J to ignore them, add the following as a VM parameter (Run -> Edit Configurations -> VM Options in IntelliJ): -Djava.library.path=""
9. Solutions for errors when running the Spark 2.0 source code locally:
Modify the dependent jar packages in the corresponding pom, changing the scope from provided to compile.
Remove the make option before running the class; add -Dspark.master=local to the run configuration's VM options.
Running the Spark example code under Win7 reports an error:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:/SourceCode/spark-2.0.0/spark-warehouse. Modify the WAREHOUSE_PATH variable in the SQLConf class, changing the file: prefix to file:/ or file:///
createWithDefault("file:/${system:user.dir}/spark-warehouse")
Local mode operation: -Dspark.master=local
10. Resolve Task not serializable Exception errors
Method 1: when writing all the data in an RDD to a database over JDBC, using the map function may create a connection for every element, which is very expensive. With mapPartitions you only need to establish one connection per partition; mapPartitions operates on an Iterator and returns an Iterator.
Method 2: mark the non-serializable object with a @transient annotation so that its fields are not serialized during network communication.
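A minimal sketch of the per-partition connection idea, here using foreachPartition (a close relative of mapPartitions for side-effecting writes); the JDBC URL, credentials, and table are hypothetical:

```scala
import java.sql.DriverManager
import org.apache.spark.{SparkConf, SparkContext}

object JdbcWriteExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-write").setMaster("local[*]"))
    val jdbcUrl = "jdbc:mysql://dbhost:3306/test"   // hypothetical database URL

    sc.parallelize(1 to 100).foreachPartition { records =>
      // One connection per partition instead of one per record.
      val conn = DriverManager.getConnection(jdbcUrl, "user", "password")
      val stmt = conn.prepareStatement("INSERT INTO t(col) VALUES (?)")
      try {
        records.foreach { r => stmt.setInt(1, r); stmt.executeUpdate() }
      } finally {
        stmt.close(); conn.close()
      }
    }
    sc.stop()
  }
}
```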
11. A function that works fine when func("11") is called will report error: type mismatch when func(11) or func(1.1) is executed. This problem is easy to solve.
Overloading func for each specific parameter type, as in traditional Java, is not difficult, but it means defining multiple functions.
Using supertypes such as AnyVal or Any is also cumbersome, because the function then has to perform type conversion internally for the specific logic. Both of these approaches follow traditional Java thinking; they can solve the problem, but neither is concise. Scala, which is full of syntactic sugar, provides implicit conversion (implicit) precisely for this kind of type conversion.
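A small sketch of the implicit-conversion approach; func and the conversion functions are illustrative, not taken from the original article:

```scala
import scala.language.implicitConversions

object ImplicitDemo {
  def func(s: String): Unit = println(s"got string: $s")

  // Implicit conversions let func accept Int and Double without defining overloads.
  implicit def intToString(i: Int): String = i.toString
  implicit def doubleToString(d: Double): String = d.toString

  def main(args: Array[String]): Unit = {
    func("11")  // works directly
    func(11)    // converted by intToString
    func(1.1)   // converted by doubleToString
  }
}
```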
12. org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
Solution: this problem usually occurs when there are a large number of shuffle operations; tasks keep failing and being re-executed over and over until the application itself fails. When you run into it, you can generally increase the executor memory and at the same time increase the number of CPU cores per executor, so that task parallelism is not reduced.
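One way to raise executor resources, sketched on the configuration; the sizes are illustrative and depend on your cluster:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  .set("spark.executor.memory", "8g")  // more memory per executor
  .set("spark.executor.cores", "4")    // more cores per executor so parallelism is not reduced
```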
13. Spark ML Pipeline GBT/RF reports an error when predicting: java.util.NoSuchElementException: key not found: 8.0
Cause of the error: the column names passed to the setFeaturesCol and setLabelCol parameters of the GBT/RF model are inconsistent.
Solution: save only the trained algorithm model, not the PipelineModel.
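A sketch of saving only the trained GBT stage rather than the whole PipelineModel, assuming a Spark 2.x build where the tree-ensemble models support ML persistence; the toy data, stage layout, and output path are all illustrative:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{GBTClassifier, GBTClassificationModel}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object SaveModelOnly {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gbt-save").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((0.0, 1.0, 2.0), (1.0, 3.0, 4.0), (0.0, 5.0, 6.0), (1.0, 0.5, 1.5))
      .toDF("label", "f1", "f2")

    val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
    val gbt = new GBTClassifier().setLabelCol("label").setFeaturesCol("features").setMaxIter(5)

    val pipelineModel = new Pipeline().setStages(Array(assembler, gbt)).fit(df)

    // Save only the trained GBT model stage, not the whole PipelineModel.
    val gbtModel = pipelineModel.stages.last.asInstanceOf[GBTClassificationModel]
    gbtModel.write.overwrite().save("/tmp/gbt-model")  // hypothetical output path

    spark.stop()
  }
}
```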
14. Deleting files with garbled names on Linux: step 1. ls -la; step 2. find . -inum <inode num> -exec rm -rf {} \;
15. Caused by: java.lang.RuntimeException: Failed to commit task; Caused by: org.apache.spark.executor.CommitDeniedException: attempt_201603251514_0218_m_000245_0: Not committed because the driver did not authorize commit
If you have a good understanding of how stages are divided in Spark, this problem is relatively simple. The tasks contained in a single stage are too large, usually because your chain of transformations is too long, so the tasks the driver distributes to the executors become very large. We can therefore solve the problem by splitting the stage: during execution, call cache() followed by count() to cache some intermediate data and cut off the overly long stage.
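A minimal sketch of the cache() + count() trick: materializing an intermediate RDD forces evaluation at that point, so later jobs reuse the cached data instead of recomputing the overly long chain. The input path and transformations are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CutLongStage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cut-stage").setMaster("local[*]"))

    val halfway = sc.textFile("/tmp/input.txt")  // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .cache()                                   // keep the intermediate result in memory
    halfway.count()                              // force evaluation here, cutting off the long chain

    val result = halfway.reduceByKey(_ + _).collect()
    result.take(10).foreach(println)
    sc.stop()
  }
}
```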
Thank you for reading. The above is the content of "some common problems and solutions when using Spark". After studying this article, I believe you have a deeper understanding of these common problems and solutions; specific usage still needs to be verified in practice. The editor will continue to push more articles on related knowledge points for you, so welcome to follow!