
What to do when Apache Spark 2.0 jobs have finished but the application takes a long time to end?


This article is about what to do when Apache Spark 2.0 jobs have finished but the application takes a long time to end. The editor thinks it is very practical, so it is shared here for your reference; follow along and have a look.

Phenomenon

You may encounter this phenomenon when using Apache Spark 2.x: all of our Spark jobs have completed, yet the program is still running. For example, we use Spark SQL to execute a query that generates a large number of files at the end. We can then see that all the Spark jobs of that SQL statement have actually finished running, but the query is still executing. From the logs, we can see that the driver node is moving the files generated by the tasks, one by one, to the final table's directory. This is likely to happen when our job generates a large number of files.

Why does this happen?

Spark 2.x uses Hadoop 2.x. When it saves the generated files to HDFS, it ultimately calls saveAsHadoopFile, and this function uses FileOutputCommitter:
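Here is a minimal sketch of that write path (illustrative, not the Spark source; it assumes a local Spark 2.x installation, and the output path is made up). saveAsHadoopFile goes through the old mapred FileOutputFormat, whose default committer is Hadoop's FileOutputCommitter, so the commit behaviour discussed below applies:

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapred.TextOutputFormat
    import org.apache.spark.sql.SparkSession

    object SaveAsHadoopFileDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[2]").appName("commit-demo").getOrCreate()
        // Each task writes its part-* files into a temporary attempt directory;
        // FileOutputCommitter decides when that data reaches the final directory.
        val rdd = spark.sparkContext
          .parallelize(Seq(("a", 1), ("b", 2)))
          .map { case (k, v) => (new Text(k), new IntWritable(v)) }
        rdd.saveAsHadoopFile("/tmp/commit-demo-out",
          classOf[Text], classOf[IntWritable],
          classOf[TextOutputFormat[Text, IntWritable]])
        spark.stop()
      }
    }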

The problem is that Hadoop 2.x's FileOutputCommitter implementation has two noteworthy methods: commitTask and commitJob. In Hadoop 2.x's FileOutputCommitter, the mapreduce.fileoutputcommitter.algorithm.version parameter controls how commitTask and commitJob work (for ease of illustration, the irrelevant statements are omitted below; the complete code can be found in FileOutputCommitter.java).
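The following is a condensed sketch of that logic, not the verbatim Hadoop source: the file system is simulated with in-memory collections so the two commit phases are easy to compare, but the branching mirrors FileOutputCommitter's commitTask and commitJob:

    object CommitterSketch {
      // mapreduce.fileoutputcommitter.algorithm.version; Hadoop 2.x defaults to 1
      val algorithmVersion: Int = 1

      // taskId -> files in its temporary attempt directory
      val taskOutputs = collection.mutable.Map[String, Seq[String]]()
      // taskId -> files in its per-task committed directory (used by v1 only)
      val committedTaskDirs = collection.mutable.Map[String, Seq[String]]()
      // files in the job's final output directory
      val finalOutputDir = collection.mutable.Buffer[String]()

      // Executors run this once per finished task, in parallel.
      def commitTask(taskId: String): Unit =
        if (algorithmVersion == 1)
          // v1: only rename the attempt directory to a per-task committed directory
          committedTaskDirs(taskId) = taskOutputs(taskId)
        else
          // v2: mergePaths moves the task's data straight into the final directory
          finalOutputDir ++= taskOutputs(taskId)

      // The driver runs this once, single-threaded, after all tasks succeed.
      def commitJob(): Unit = {
        if (algorithmVersion == 1)
          // v1: move every committed task's files here, one by one --
          // the slow part when the job produced a large number of files
          for ((_, files) <- committedTaskDirs) finalOutputDir ++= files
        // v2: the data is already in place; only the _SUCCESS marker is written
      }

      def main(args: Array[String]): Unit = {
        taskOutputs("task_0") = Seq("part-00000")
        taskOutputs("task_1") = Seq("part-00001")
        commitTask("task_0"); commitTask("task_1")
        commitJob()
        println(finalOutputDir) // ArrayBuffer(part-00000, part-00001)
      }
    }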

You can see that the commitTask method contains a conditional judgment, algorithmVersion == 1, where algorithmVersion is the value of the mapreduce.fileoutputcommitter.algorithm.version parameter and defaults to 1. If this parameter is 1, then when a task completes, the temporary data it generated is first moved to the task's own committed directory, and only moved to the final job output directory during commitJob. Since the default value in Hadoop 2.x is 1, this is why we see that the jobs are complete but the program is still moving data, so the job as a whole has not yet finished; moreover, the commitJob function is executed by Spark's driver at the very end, which is why it finishes so slowly.

We can see that if we set mapreduce.fileoutputcommitter.algorithm.version to 2, then when commitTask executes, the mergePaths method is called to move the data generated by a task directly from the task's temporary directory to the program's final output directory. commitJob then has no data left to move, which is naturally much faster than with the default value.

Note that before Hadoop 2.7.0 we could set mapreduce.fileoutputcommitter.algorithm.version to any value other than 1 to achieve this effect, because the code did not require the value to be 2. From Hadoop 2.7.0 onward, however, the value of mapreduce.fileoutputcommitter.algorithm.version must be 1 or 2, as specified in MAPREDUCE-4815.

How to set this parameter in Spark

Now that the problem has been found, we can solve it in our programs. There are several ways to do this (a combined sketch follows the list):

Set spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 directly in conf/spark-defaults.conf, which has a global impact.

Call spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2") directly in the Spark program, which takes effect at the job level.

If you are using the Dataset API to write data to HDFS, you can set dataset.write.option("mapreduce.fileoutputcommitter.algorithm.version", "2").
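The following minimal sketch combines all three approaches (it assumes Spark 2.x; the master setting, paths and object names here are illustrative, not prescribed by the article):

    import org.apache.spark.sql.SparkSession

    object CommitterConfigDemo {
      def main(args: Array[String]): Unit = {
        // 1) Globally, for every application, add this line to conf/spark-defaults.conf:
        //      spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
        val spark = SparkSession.builder().master("local[2]").appName("committer-config").getOrCreate()

        // 2) At the job level, set it inside the program:
        spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

        val dataset = spark.range(0, 100)

        // 3) For a single write, pass it through the Dataset API:
        dataset.write
          .option("mapreduce.fileoutputcommitter.algorithm.version", "2")
          .parquet("/tmp/committer-config-demo")

        spark.stop()
      }
    }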

However, if your Hadoop version is 3.x, the default value of the mapreduce.fileoutputcommitter.algorithm.version parameter has already been set to 2; see MAPREDUCE-6336 and MAPREDUCE-6406 for details.

Because this parameter has some impact on performance, as of Spark 2.2.0 it is documented in Spark's configuration page (configuration.html); see SPARK-20107 for details.

Thank you for reading! This is the end of the article on what to do when Apache Spark 2.0 jobs have finished but the application takes a long time to end. I hope the above content is of some help to you and lets you learn more. If you think the article is good, please share it for more people to see!
