Several ways for Hadoop applications to reference third-party jars (2)

2025-03-29 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

This continues the previous article, "Several ways for Hadoop applications to reference third-party jars (1)".

To put it simply, there are several ways for a Hadoop application to bring in third-party jar packages:

First, pack all referenced third-party jars into the application jar itself, producing one very large "fat" jar, like the second packaging method introduced in the previous article.

Second, put all referenced third-party jars into Hadoop's lib directory, which requires placing them on every node in the Hadoop cluster.

Third, keep the jar packages on one fixed machine in the cluster and load the third-party jars with the -libjars option.

Fourth, put the jar packages on HDFS and load them dynamically from the program.

Next, my personal take on the advantages and disadvantages of each approach:

First: pack all referenced third-party jars into one jar

Advantages: it runs directly on the Hadoop cluster, and the command is simple to run.

Disadvantages: bundling every jar together makes the file very large. Considering version upgrades, the referenced third-party jars generally do not change, yet this method re-packages and re-uploads them every single time.

This packaging method was demonstrated in the previous article, "Several ways for Hadoop applications to reference third-party jars (1)"; personally I do not recommend it.
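For reference, if the project were built with Maven rather than by exporting a jar from an IDE (an assumption; the original article does not use Maven), a fat jar of this kind is typically produced with the maven-shade-plugin:

```xml
<!-- In pom.xml: bundle all compile-scope dependencies into one runnable "fat" jar -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.5.1</version>
  <executions>
    <execution>
      <!-- runs during `mvn package`; the shaded jar replaces the plain one -->
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
    </execution>
  </executions>
</plugin>
```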

Second: put all referenced third-party jars into Hadoop's lib directory

Advantages: it runs directly on the Hadoop cluster, and the command is simple to run.

Disadvantages: the third-party jars must be placed under every node in the cluster, without exception, which is inflexible. When the version is upgraded, the jars on every machine have to be updated, so maintenance is hard.

I did not test this approach; it is feasible in theory, but personally I do not recommend it.

Third: keep the jar packages on one fixed machine in the cluster and load the third-party jars with the -libjars option

Advantages: the lib directory only needs to be maintained on one machine in the cluster, so maintenance is easy.

Disadvantages: the hadoop jar command can only be executed on the machine where the jars are stored, and the command itself is more complex.

Personally I find this acceptable, but it is not my favorite approach.

For my test, the WordCount code stays unchanged and is packaged as WordCount_libjarscmd.jar. Be careful not to select the jar files under lib during packaging. Then place OperateHDFS.jar on one machine in the cluster and run the following command:

hadoop jar WordCount_libjarscmd.jar com.hadoop.examples.WordCount -libjars OperateHDFS.jar input libjarscmdoutput

The format of the command is as follows:

hadoop jar <jar-to-run> <main-class> -libjars <third-party-jars> <input> <output>, where the trailing arguments are the input and output paths the program requires.

The program executes without any problems.

Fourth: put the jar packages on HDFS and load the third-party jars dynamically

Advantages: the program can run on any node in the cluster; there is no restriction on which machine the command is executed.

Disadvantages: the program must contain code that adds the third-party jars at run time; if the directory holding the libs ever changes, the code itself has to be changed.

Personally, this is the approach I prefer. After all, the directory holding the libs usually does not change. At least, that's how I see it. (*^_^*)

For my test of this method, OperateHDFS.jar is stored on HDFS, WordCount is modified slightly by adding a code block that dynamically adds the third-party jar, and the result is packaged as WordCount_dynamicload.jar. Again, be careful not to select the jar files under lib during packaging. The command to run is as follows:

hadoop jar WordCount_dynamicload.jar com.hadoop.examples.WordCount input dynamicload

The program executes without any problems.

Finally, here is the WordCount source code:

package com.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import com.hadoop.hdfs.OperateHDFS;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // This does nothing; it exists only to test that the third-party jar is on
            // the classpath. If it cannot be found, a ClassNotFound exception is thrown.
            OperateHDFS s = new OperateHDFS();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        /*
         * In fact I tested this: the path can also be written as a local directory.
         * That is essentially the third way, except that instead of the -libjars
         * option the jar is loaded dynamically by the program; but in that case the
         * command can only be run on the fixed machine that holds the jar, the same
         * as the third way.
         */
        // Enable this line only for the fourth way; comment it out for the first three.
        // job.addFileToClassPath(new Path("hdfs://192.168.3.57:8020/user/lxy/lib/OperateHDFS.jar"));

        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
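Since the mapper and reducer above are mostly Hadoop boilerplate around a tokenize-and-sum core, that core logic can be sanity-checked locally with plain JDK code, no cluster required (a sketch; the class and method names here are illustrative and not part of the original program):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class LocalWordCount {

    // Mirrors the mapper (tokenize on whitespace) and reducer (sum per word) logic.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            // merge() adds 1 for a new word, or sums with the existing count
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("hello hadoop hello jar"));
        // prints {hello=2, hadoop=1, jar=1}
    }
}
```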

Attachment: http://down.51cto.com/data/2365541
