The Hadoop framework ships with many third-party JAR libraries. When Hadoop starts, or when it runs a user application such as a MapReduce job, it looks in its own preset JAR directories first. As a result, when a third-party library used by the application also exists among Hadoop's preset JARs but at a different version, Hadoop loads its own preset JAR for the application, which often prevents the application from running properly.
Starting from a problem we encountered in practice, this article analyzes how JAR packages are located when a MapReduce program runs in a Hadoop on YARN environment, and presents an approach for resolving JAR package conflicts.
1. An example of a JAR package conflict
One of my MR programs needs a new interface from version 1.9.13 of the jackson library:
Figure 1: pom.xml of the MR program, declaring a dependency on jackson 1.9.13
But the jackson version preset in my Hadoop cluster (hadoop-2.3.0-cdh6.1.0, a CDH release) is 1.8.8, located under share/hadoop/mapreduce2/lib/ in the Hadoop installation directory.
When running my MR program with the following command:
hadoop jar mypackage-0.0.1-jar-with-dependencies.jar com.umeng.dp.MainClass --input=../input.pb.lzo --output=/tmp/cuiyang/output/
Since the JsonNode.asText() method used in the MR program is new in version 1.9.13 and does not exist in 1.8.8, the job fails with the following error:
...
15/11/13 18:14:33 INFO mapreduce.Job:  map 0% reduce 0%
15/11/13 18:14:40 INFO mapreduce.Job: Task Id : attempt_1444449356029_0022_m_000000_0, Status : FAILED
Error: org.codehaus.jackson.JsonNode.asText()Ljava/lang/String;
...
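For reference, the failing call looks like the sketch below. This is a minimal, hypothetical example, not the original program (the JSON input and the field name are made up): compiled against jackson 1.9.13 it works, but if 1.8.8 is the version actually loaded at runtime, the asText() call fails with the NoSuchMethodError shown above.

// Minimal sketch, assuming jackson 1.9.13 at compile time; the input JSON
// and field name are hypothetical.
import org.codehaus.jackson.JsonNode;
import org.codehaus.jackson.map.ObjectMapper;

public class AsTextDemo {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree("{\"name\":\"value\"}");
        // asText() was added in the 1.9 line; if the 1.8.8 JAR preset by
        // Hadoop is resolved first, this line throws NoSuchMethodError.
        System.out.println(root.get("name").asText());
    }
}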
2. How an application is executed in the YARN framework
Before we continue to analyze how to resolve the JAR package conflict, we need to understand one very important question: how does a user's MR program actually run on a NodeManager? This is the key to finding a solution.
This article is not an introduction to the YARN framework; some basic knowledge of YARN is assumed, such as the functions and responsibilities of its five core components, ResourceManager (hereinafter RM), NodeManager (hereinafter NM), AppMaster (hereinafter AM), Client, and Container, and the relationships among them.
Figure 2: YARN architecture diagram
If you don't know much about YARN internals, that's fine; it won't affect your understanding of what follows. Here is a brief summary of the key points used below:
From a logical point of view, a Container can be understood simply as a process that runs a Map Task or a Reduce Task (the AM itself is also a Container, started by an NM at the RM's command). YARN designed the general Container concept in order to abstract over different application frameworks.
A Container is started by the AM sending a command to an NM.
A Container is, in practice, a process launched by a shell script that executes the Java program running the Map Task or Reduce Task.
All right, let's look at how an MR program runs on an NM.
As mentioned above, a Map Task or Reduce Task is dispatched by the AM to a specified NM along with the command to run it. After receiving the command, the NM creates a local directory for each Container, downloads the program and resource files into that directory, and then prepares to run the Task, which really means preparing to start a Container. The NM dynamically generates a script file named launch_container.sh for the Container and then executes it. This file is the key to seeing how a Container works!
The two lines in the script related to this question are as follows:
export CLASSPATH="$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:(...omitted...):$PWD/*"
exec /bin/bash -c "$JAVA_HOME/bin/java -D(various Java options) org.apache.hadoop.mapred.YarnChild 127.0.0.1 58888 (other application parameters)"
Look at the second line first. It turns out that when YARN runs MapReduce, each Container is an ordinary Java process, and the main entry class is org.apache.hadoop.mapred.YarnChild.
We know that when the JVM loads a class, it searches the entries of CLASSPATH in the order in which they are declared and returns as soon as the first matching class is found, instead of continuing the search. That is, if two JAR packages contain the same class, whichever is declared earlier on the CLASSPATH is the one that gets loaded. This is the key to resolving JAR package conflicts!
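To make this first-match-wins behavior concrete, here is a minimal, self-contained sketch using URLClassLoader; the jar names a.jar and b.jar and the class com.example.Foo are hypothetical, assuming both jars contain a class of that name:

import java.net.URL;
import java.net.URLClassLoader;

public class ClasspathOrderDemo {
    public static void main(String[] args) throws Exception {
        // a.jar is searched first; b.jar is only consulted if a.jar misses.
        URL[] order = { new URL("file:a.jar"), new URL("file:b.jar") };
        try (URLClassLoader loader = new URLClassLoader(order, null)) {
            Class<?> foo = loader.loadClass("com.example.Foo");
            // Prints the location of a.jar, even though b.jar also has Foo.
            System.out.println(foo.getProtectionDomain().getCodeSource().getLocation());
        }
    }
}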
Looking back at the first line, it is exactly the definition of the CLASSPATH variable the JVM uses at runtime. As you can see, YARN writes the directories of the Hadoop preset JAR packages at the front of CLASSPATH. This way, any class contained in a Hadoop preset JAR takes precedence over a class with the same name in the application's JAR!
So how does the JVM load classes that are unique to the application (that is, classes not preset by Hadoop)? Look at the end of the CLASSPATH definition: "$PWD/*". That is, if a Java class cannot be found anywhere else, the JVM ends up looking in the current directory.
So what is the current directory? As mentioned above, before running a Container the NM creates a separate directory for it, puts the required resources into that directory, and then runs the program. This directory stores all Container-related resources and program files, and it is the working directory in which the launch_container.sh script runs. If you pass the -libjars parameter when you launch the program, the JAR files it specifies are also copied into this directory. In this way, the JVM can find all the JAR packages in the current directory through the CLASSPATH variable, and can therefore load the JAR packages the user references.
When I ran the application on my machine, this directory was located at /Users/umeng/worktools/hadoop-2.3.0-cdh6.1.0/ops/tmp/hadoop-umeng/nm-local-dir/usercache/umeng/appcache/application_1444449356029_0023 (the location is configurable), with the following contents:
Figure 3: Job runtime directory in the NM
Well, now that we know why YARN always loads Hadoop's preset classes and JAR packages first, how do we solve the problem? The way: read the source code! Find the place where launch_container.sh is dynamically generated and see whether we can adjust the order in which the CLASSPATH variable is built, moving the Job's runtime directory to the front of the CLASSPATH.
3. Reading the source code to solve the problem
Let's trace the source code and get to the bottom of it.
First of all, although the launch_container.sh script is generated by the NM, the NM is just a carrier for running Tasks; the brain that really controls exactly how a Container runs is the AppMaster. Reading the source code confirms this idea: the Container's CLASSPATH is passed to the NodeManager by MRApps (MapReduce's AM), and the NodeManager writes it into the sh script.
The TaskAttemptImpl::createCommonContainerLaunchContext() method in MRApps creates the Container launch context, which is then serialized and passed to the NM. The call chain is: createContainerLaunchContext() -> getInitialClasspath() -> MRApps.setClasspath(env, conf). First, let's take a look at setClasspath():
First, the userClassesTakesPrecedence flag is checked; if it is set, the method MRApps.setMRFrameworkClasspath(environment, conf) is not called at all. That is, if the flag is set, the user is responsible for setting the CLASSPATH of all JAR packages.
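As an aside, in the Hadoop 2.x line this flag is driven by the job property mapreduce.job.user.classpath.first (see MRJobConfig in your version). A minimal sketch of setting it from client code, under that assumption:

import org.apache.hadoop.conf.Configuration;

public class UserClasspathFirst {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // With this flag set, MRApps skips setMRFrameworkClasspath(), and
        // the user becomes responsible for the CLASSPATH of all JAR packages.
        conf.setBoolean("mapreduce.job.user.classpath.first", true);
        System.out.println(conf.getBoolean("mapreduce.job.user.classpath.first", false));
    }
}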
Let's look at the setMRFrameworkClasspath() method:
Here, the directories of all Hadoop preset JAR packages are kept in DEFAULT_YARN_APPLICATION_CLASSPATH. As you can see, the framework first uses the CLASSPATH set in YarnConfiguration.YARN_APPLICATION_CLASSPATH, and falls back to DEFAULT_YARN_APPLICATION_CLASSPATH if it is not set.
Then conf.getStrings() splits the comma-separated configuration string into an array of strings; Hadoop iterates through the array, calling MRApps.addToEnvironment(environment, Environment.CLASSPATH.name(), c.trim(), conf) for each entry to build the CLASSPATH.
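A minimal, runnable sketch of that lookup-with-fallback (the printing loop is ours, for illustration; the property name and default come straight from YarnConfiguration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ShowAppClasspath {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Falls back to DEFAULT_YARN_APPLICATION_CLASSPATH when
        // yarn.application.classpath is not set in the configuration.
        String[] cp = conf.getStrings(
            YarnConfiguration.YARN_APPLICATION_CLASSPATH,
            YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH);
        for (String c : cp) {
            System.out.println(c.trim()); // entries are trimmed, as MRApps does
        }
    }
}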
Here we see a glimmer of hope: by default, MRApps uses DEFAULT_YARN_APPLICATION_CLASSPATH as the Task's default CLASSPATH. If we want to change the CLASSPATH, it looks like we need to set YARN_APPLICATION_CLASSPATH ourselves so that the variable is not empty.
Therefore, we added the following statements to the application:
String[] classpathArray = config.getStrings(YarnConfiguration.YARN_APPLICATION_CLASSPATH, YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH);
String cp = "$PWD/*:" + StringUtils.join(":", classpathArray);
config.set(YarnConfiguration.YARN_APPLICATION_CLASSPATH, cp);
These statements do the following: first obtain YARN's default setting, DEFAULT_YARN_APPLICATION_CLASSPATH; then prepend the current directory in which the Task program runs; and finally set the combined result into the YARN_APPLICATION_CLASSPATH variable. This way, when MRApps creates the Container, our modified CLASSPATH, with the program's current directory taking precedence, becomes the Container's runtime CLASSPATH.
Finally, we need to put the JAR packages our application depends on into the directory where the Task runs, so that class loading picks up the classes we really need. How? Yes, with the -libjars parameter, which was explained earlier. The command to run the program therefore becomes:
hadoop jar ./target/mypackage-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.umeng.dp.MainClass -libjars jackson-mapper-asl-1.9.13.jar,jackson-core-asl-1.9.13.jar --input=../input.pb.lzo --output=/tmp/cuiyang/output/
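One caveat, standard Hadoop behavior rather than something from the original article: -libjars is a generic option handled by GenericOptionsParser, so the driver must run through it, typically via ToolRunner. A minimal sketch of such a driver (the class and method bodies are ours):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MainClass extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // args no longer contains -libjars here; build and submit the job
        // using getConf(), which already records the extra JARs.
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MainClass(), args));
    }
}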
4. Conclusion
In this article, we solved a JAR package conflict we encountered by analyzing Hadoop's source code.
No matter how mature and complete a documentation manual is, it cannot cover every detail of a product or answer every user's question, let alone for a non-profit open-source framework like Hadoop. The advantage of open source is that when you are confused, you can turn to the source code for answers. As Hou Jie said: "Before the source code, there are no secrets."