What is the principle behind Hadoop jobs referencing third-party jar files?


Today I'd like to talk about how Hadoop jobs resolve the third-party jar files they reference. Many people may not know much about this, so to help you understand it better, the editor has summarized the following; I hope you get something out of this article.

When you write mapreduce programs in Eclipse and reference third-party jar files, you can submit them directly with "Run on Hadoop" through the Eclipse hadoop plug-in, which is very convenient. However, the plug-in version must match the Eclipse version; otherwise the job is always executed locally, and no job appears on the 50070 web page.

If instead you publish the program as a jar file and run it from the command line on the namenode, without Eclipse automatically configuring the jar files for you, you will encounter java.lang.ClassNotFoundException. This problem can be divided into two cases.

1. How is the hadoop command executed?

Actually, $HADOOP_HOME/bin/hadoop is a shell script. Take the following wordcount command as an example:

bin/hadoop jar wordcount.jar myorg.WordCount /usr/wordcount/input /usr/wordcount/output

The script file parses the parameters, configures the classpath, and finally executes the following command:

exec java -classpath $CLASSPATH org.apache.hadoop.util.RunJar $@

where $CLASSPATH contains ${HADOOP_CONF_DIR}, the *.jar files under $HADOOP_HOME, and $HADOOP_CLASSPATH;

$@ is all of the script's arguments, here everything after "jar";

RunJar is a fairly simple class: it extracts the jar file into the "hadoop.tmp.dir" directory and then executes the class we specified, in this case myorg.WordCount.
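To make that concrete, here is a simplified, hypothetical sketch of what RunJar does when "hadoop jar" is invoked. It is not the real org.apache.hadoop.util.RunJar source, which additionally reads the jar manifest, installs cleanup hooks, and unpacks under hadoop.tmp.dir rather than java.io.tmpdir:

import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.util.RunJar;

// Hypothetical sketch, NOT the real RunJar source: unpack the job jar,
// build a classloader over it, and reflectively call the user's main class.
public class RunJarSketch {
    public static void main(String[] args) throws Exception {
        File jarFile = new File(args[0]);        // e.g. wordcount.jar
        String mainClassName = args[1];          // e.g. myorg.WordCount

        // 1. Unpack the job jar into a working directory.
        File workDir = new File(System.getProperty("java.io.tmpdir"),
                "unjar-" + System.nanoTime());
        workDir.mkdirs();
        RunJar.unJar(jarFile, workDir);

        // 2. Put the jar, its classes/ directory and any bundled lib/*.jar on a classloader.
        List<URL> cp = new ArrayList<URL>();
        cp.add(jarFile.toURI().toURL());
        cp.add(new File(workDir, "classes/").toURI().toURL());
        File[] libs = new File(workDir, "lib").listFiles();
        if (libs != null) {
            for (File lib : libs) {
                cp.add(lib.toURI().toURL());
            }
        }
        ClassLoader loader = new URLClassLoader(cp.toArray(new URL[0]),
                Thread.currentThread().getContextClassLoader());
        Thread.currentThread().setContextClassLoader(loader);

        // 3. Invoke the specified class's main() with the remaining arguments
        //    (here, the input and output paths).
        String[] rest = new String[args.length - 2];
        System.arraycopy(args, 2, rest, 0, rest.length);
        Class<?> mainClass = Class.forName(mainClassName, true, loader);
        Method main = mainClass.getMethod("main", String[].class);
        main.invoke(null, new Object[] { rest });
    }
}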


After RunJar runs WordCount, we are inside our own program. There we configure the mapper, the reducer, the output path, and so on, and finally submit the job to the JobTracker by calling job.waitForCompletion(true).
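For reference, a minimal driver of the kind just described might look like the following. This is a sketch assembled from the standard Hadoop WordCount example, using the JobTracker-era API the article describes; class and path names are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);        // lets hadoop find and ship our jar
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. /usr/wordcount/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g. /usr/wordcount/output
        // Submits the job (to the JobTracker, or LocalJobRunner) and waits for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}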

So far we know how the local execution part is completed. If a ClassNotFoundException occurs during this phase, you can set $HADOOP_CLASSPATH in your own script file to include the third-party jar files you need, and then run the hadoop command.
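For example, a wrapper script might look like the following, where /opt/extlib/mylib.jar is purely a placeholder for whichever third-party jar the driver needs on the local classpath:

export HADOOP_CLASSPATH=/opt/extlib/mylib.jar:$HADOOP_CLASSPATH
bin/hadoop jar wordcount.jar myorg.WordCount /usr/wordcount/input /usr/wordcount/output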

2. How do the JobTracker and TaskTrackers get the third-party jar files?

Sometimes, after a job is submitted, a ClassNotFoundException is thrown inside the map or reduce function instead. This is because map and reduce tasks may run on other machines that do not have the required jar files; since the mapreduce job is executed by the JobTracker and the TaskTrackers, how do those two obtain the third-party jar files? That is case two.

Let's first analyze the mapreduce submission process.

Steps 1 and 2: submit the job through the Job class, obtain a job ID, and decide based on conf whether to submit the job to the LocalJobRunner or to the JobTracker.

Step 3: copy job resources.

The client uploads the resources the job needs to hdfs, such as the job splits and jar files. JobClient handles the jar files in its configureCommandLineOptions function, where the relevant entries are read from the job configuration:

files = job.get("tmpfiles");        // corresponds to the -files option
libjars = job.get("tmpjars");       // corresponds to -libjars
archives = job.get("tmparchives");  // corresponds to -archives

If jar files are configured, they are added to the distributed cache (DistributedCache). Taking -libjars as an example:

if (libjars != null) {
    FileSystem.mkdirs(fs, libjarsDir, mapredSysPerms);
    String[] libjarsArr = libjars.split(",");
    for (String tmpjars : libjarsArr) {
        Path tmp = new Path(tmpjars);
        Path newPath = copyRemoteFiles(fs, libjarsDir, tmp, job, replication);
        DistributedCache.addArchiveToClassPath(newPath, job);
    }
}

In addition, mapreduce programs always call job.setJarByClass to specify the class to run; this lets hadoop locate the jar file containing that class, which is the jar we packaged, and upload it to hdfs. At this point JobClient has finished copying the resources that the JobTracker and TaskTrackers can use.

Steps 4-10: JobClient submits the job and the job executes (the internal work of the JobTracker and TaskTrackers is not covered here).

3. Summary

To let a mapreduce program reference third-party jar files, you can use any of the following approaches:

Pass the jar files via command-line arguments, such as -libjars.

Set them directly in conf, e.g. conf.set("tmpjars", ...), with the jar paths separated by commas.

Use the distributed cache, e.g. DistributedCache.addArchiveToClassPath(path, job). The path here must be an hdfs path, i.e. you upload the jar to hdfs yourself and then add that path to the distributed cache (see the sketch after this list).

Package the third-party jars together with your own program into one jar file; the program then picks up the whole file via job.getJar() and transfers it to hdfs. (Very bulky.)

Add the jar file to the $HADOOP_HOME/lib directory of every machine. (Not recommended.)
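Here is a rough sketch of method 3 above. The local and hdfs paths are placeholders, and the API shown is the classic org.apache.hadoop.filecache.DistributedCache used elsewhere in this article:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: upload a third-party jar to hdfs yourself, then register it on the
// task classpath through the distributed cache before building and submitting the job.
public class AddLibJarToCache {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path localJar = new Path("/home/hadoop/lib/mylib.jar");  // placeholder local path
        Path hdfsJar = new Path("/user/hadoop/lib/mylib.jar");   // placeholder hdfs path
        fs.copyFromLocalFile(localJar, hdfsJar);

        // The path passed here must already be on hdfs; the TaskTrackers pull it down
        // and add it to each task's classpath before the map/reduce tasks start.
        DistributedCache.addArchiveToClassPath(hdfsJar, conf);

        // ... then create the Job from this conf, configure it, and submit as usual.
    }
}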

P.S. If you use method 1 or 2 above, pay attention to the Configuration object: you must obtain it through the getConf() function rather than constructing a new one yourself.
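A sketch of what that means in practice: the driver implements Tool, so ToolRunner's generic option parsing can fold -libjars (and -files, -archives) into the Configuration that getConf() returns. Mapper/reducer setup is elided and all names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch of a Tool-based driver. ToolRunner parses the generic options (-libjars,
// -files, ...) before run() is called, so the Configuration from getConf() already
// carries tmpjars/tmpfiles; a freshly constructed Configuration would not.
public class WordCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();            // do NOT "new Configuration()" here
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        // ... setMapperClass / setReducerClass / output key and value classes as usual ...
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCountDriver(), args));
    }
}

It would then be launched with something like bin/hadoop jar wordcount.jar myorg.WordCountDriver -libjars /path/to/mylib.jar /usr/wordcount/input /usr/wordcount/output, where the -libjars path is again a placeholder.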

After reading the above, you should have a clearer understanding of how Hadoop jobs reference and resolve third-party jar files.
