Hadoop programs can be developed in either a Linux or a Windows environment. Here we take Windows as the example, using Eclipse as the main tool (IDEA works as well). There are many development articles on the Internet; this post draws on them for a brief introduction and a summary of the main points.
Hadoop is a powerful parallel framework that lets tasks run in parallel on a distributed cluster, but writing and debugging Hadoop programs by hand is difficult. For this reason the Hadoop developers created the Hadoop Eclipse plug-in, which embeds Hadoop into the Eclipse development environment, giving the developer a graphical workflow and lowering the programming barrier. Once the plug-in is installed and the relevant Hadoop information is configured, creating a Hadoop project automatically imports the JAR files of the Hadoop programming interface, so the user can write, debug and run Hadoop programs (both stand-alone and distributed) from the Eclipse GUI, view the real-time status, error messages and results of their own programs, and browse and manage HDFS files. In short, the Hadoop Eclipse plug-in is easy to install, easy to use and powerful, and it is an indispensable tool for getting started with Hadoop and Hadoop programming.
The Hadoop working directory
For convenience in later development, all the software used for development is installed under the directory below, with the exception of the JDK. I install the JDK directly in a Java installation path on the D: drive (installing it under Program Files causes path-related errors in some places). Here is the working directory:
System disk (D:)
 |- HadoopWork
     |- eclipse
     |- hadoop-2.7.3
     |- workplace
     |- ...
Following the layout above, extract Eclipse and Hadoop under "D:\HadoopWork" and create "workplace" as the Eclipse workspace.
Eclipse plug-in development configuration
Step 1: put "hadoop2x-eclipse-plugin-master" into the "plugins" folder of the Eclipse directory, then restart Eclipse for it to take effect.
System disk (D:)
 |- HadoopWork
     |- eclipse
         |- plugins
             |- hadoop2x-eclipse-plugin-master.jar
This is where my "hadoop-eclipse-plugin" plug-in is placed. Restart Eclipse, as shown below:
If "DFS Locations" appears under "Project Explorer" on the left side of the image above, Eclipse has recognized the Hadoop Eclipse plug-in you just installed.
Step 2: select "Preferences" under the "Window" menu; a dialog pops up with a list of options on the left, to which a "Hadoop Map/Reduce" entry has been added. Click this option and set the Hadoop installation directory (in my case: D:\HadoopWork\hadoop-2.7.3). The result is as follows:
Step 3: switch to the "Map/Reduce" perspective. There are two ways:
1) choose "Open Perspective" under the "Window" menu; in the dialog that pops up, select the "Map/Reduce" option to switch.
2) in the upper right corner of the Eclipse window, click the perspective-switch icon, click the "Other" option to open the same dialog, select "Map/Reduce", and then click "OK".
The interface under the "Map/Reduce" perspective is shown in the following figure.
Step 4: establish a connection to the Hadoop cluster. Right-click "Map/Reduce Locations" at the bottom of the Eclipse window, select "New Hadoop Location" from the menu that pops up, and a dialog appears.
Note that the red mark in the picture above is what we need to pay attention to.
Location Name: any name you like; it only identifies this "Map/Reduce Location".
Map/Reduce Master
  Host: 192.168.80.32 (the IP address of Master.Hadoop)
  Port: 9001
DFS Master
  Use M/R Master host: checked (because our NameNode and JobTracker are on the same machine)
  Port: 9000
User name: hadoop (the same user that runs Hadoop on the cluster)
Note: the Host and Port here are the addresses and ports you configured in mapred-site.xml and core-site.xml respectively. If this is not clear, refer to the earlier article "Hadoop big data processing from scratch - cluster installation".
Then click "Advanced parameters", find "hadoop.tmp.dir" and change it to the value set for our Hadoop cluster, which is "/usr/local/hadoop273/hadoop_tmp". This parameter is configured in "core-site.xml".
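For reference, here is a minimal sketch (not from the original article) of how the same addresses look to client code through Hadoop's Configuration API; the host, ports and temp directory below are simply the assumed values used above:

import org.apache.hadoop.conf.Configuration;

public class ClusterConf {
    public static Configuration create() {
        Configuration conf = new Configuration();
        // Same values as entered in the "New Hadoop Location" dialog above (assumed cluster addresses)
        conf.set("fs.default.name", "hdfs://192.168.80.32:9000");      // DFS Master, from core-site.xml
        conf.set("mapred.job.tracker", "192.168.80.32:9001");          // Map/Reduce Master, from mapred-site.xml
        conf.set("hadoop.tmp.dir", "/usr/local/hadoop273/hadoop_tmp"); // temp dir set on the cluster
        return conf;
    }
}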
After clicking "Finish", you will see that an entry appears under "Map/Reduce Locations" in Eclipse; this is the "Map/Reduce Location" we just created.
Step 5: view the HDFS file system and try to create folders and upload files. Expand "DFS Locations" on the left side of Eclipse, and the file structure on HDFS will be shown.
Right-click "user > hadoop" and try to create a folder named "index_in", then right-click and choose Refresh to view the folder we just created.
Log in to the "Master.Hadoop" server remotely and use the following command to check whether the "index_in" folder has been created:
hadoop fs -ls
At this point our Hadoop Eclipse development environment is configured. Interested readers can also upload some local files to HDFS and check on both sides whether the files were uploaded successfully.
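The same check can also be done programmatically with the HDFS Java API. Below is a minimal sketch (an illustration added here, not part of the original article) that lists the contents of /user/hadoop, assuming the NameNode address configured above:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address, matching the DFS Master setting above
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.80.32:9000"), conf);
        // Equivalent of running "hadoop fs -ls" for the hadoop user's home directory
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}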
Eclipse runs the WordCount program
Configure the JDK for Eclipse
If you have more than one JDK installed on your computer, make sure that Eclipse's default JDK is 8.0. Choose "Preferences" from the "Window" menu; in the dialog, find "Java" on the left, select "Installed JREs", and add JDK 8.0 there. Below is my default selection, a JRE.
If it is not listed, click Add to add it.
After adding it, select the 1.8 version as shown in the following figure.
Set the encoding of Eclipse to UTF-8
Create a MapReduce project
From the "File" menu, choose "Other", find "Map/Reduce Project", and select it.
Next, fill in the MapReduce project name as "WordCountProject" and click "Finish" to complete it.
We have now successfully created the MapReduce project, and the newly created project appears on the left side of Eclipse.
Create a WordCount class
Select the "WordCountProject" project, right-click to open the pop-up menu, choose "New" and then "Class", and fill in the following information:
Because we use the WordCount program that ships with Hadoop 2.7.3 directly, the package name must match the "org.apache.hadoop.examples" in the code, and the class name must be "WordCount". The source file sits in the following structure:
hadoop-2.7.3
 |- src
     |- examples
         |- org
             |- apache
                 |- hadoop
                     |- examples
Find the "WordCount.java" file in the directory above, open it with Notepad, and copy the code into the Java file you just created.
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  // Mapper: splits each line into tokens and emits (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapred.job.tracker", "192.168.80.32:9001");
    String[] ars = new String[] { "input", "newout" };
    String[] otherArgs = new GenericOptionsParser(conf, ars).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Note: if you do not add conf.set("mapred.job.tracker", "192.168.80.32:9001"), the run will report that you do not have sufficient permissions. The reason is that the configuration in the "Map/Reduce Location" we just set up does not fully take effect: the job would be created against the local disk and run locally, which is obviously not what we want. We want Eclipse to submit the job to the Hadoop cluster, so we add the job run address manually here.
Run the WordCount program
Select the "Wordcount.java" program and right-click once to run according to "Run AS Run on Hadoop". Then the following figure pops up and operates according to the following figure.
You can see the output log in Console.
View the results of WordCount operation
Look at the left side of Eclipse: right-click "DFS Locations > Hadoop273 > user > hadoop", click the refresh button "Refresh", and the folder "newoutput" that just appeared will show up. Remember that the "newoutput" folder is created automatically when the program runs; if a folder with the same name already exists, either change the program to use a new output folder or delete the folder with the same name on HDFS, otherwise the run will fail.
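If you prefer not to delete the old output folder by hand before every run, a small helper like the one below (an illustrative sketch, not part of the original article) removes the output path on HDFS before the job is submitted; the NameNode address and user name are the same assumptions used earlier:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputCleaner {
    // Deletes the job output directory on HDFS if it already exists,
    // so the next run does not fail because the output folder is already there.
    public static void deleteIfExists(String dir) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.80.32:9000"), conf, "hadoop");
        Path out = new Path(dir);
        if (fs.exists(out)) {
            fs.delete(out, true); // recursive delete
        }
        fs.close();
    }
}

Calling OutputCleaner.deleteIfExists("newout") at the start of main() has the same effect as deleting the folder in the DFS Locations view.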
Open the "newoutput" folder, open the "part-r-00000" file, and you can see the result after execution.
You can also export the project as a jar package and send it to the Hadoop server to run it there, just like the example run above.
At this point the Eclipse development environment has been set up and the WordCount program has run successfully; the next step is to really start the Hadoop journey.
Extensions
Below is a summary of problems I ran into myself, together with those reported by other bloggers:
INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.net.NoRouteToHostException: No route to host
Run jps on each server to check whether the Hadoop processes have started. If they are all running, stop the firewall on the master and on the Slave machines and try again; if the problem disappears, the relevant ports were not open, so add those ports to the firewall rules instead.
"error: failure to login" problem
Below, the "hadoop-0.20.203.0" case found on the Internet is taken as an example; the same thing happened to me with "V1.0". The reason is that "hadoop-eclipse-plugin-1.0.0_V1.0.jar" is compiled directly from the source code and is therefore missing the corresponding JAR packages. The details are as follows.
Detailed address: http://blog.csdn.net/chengfei112233/article/details/7252404
In my own attempt I found that if the hadoop-0.20.203.0 version of the plug-in package is copied directly into Eclipse's plugin directory, an error occurs when connecting to DFS, with the message "error: failure to login".
The error dialog that pops up reads "An internal error occurred during: "Connecting to DFS hadoop". org/apache/commons/configuration/Configuration". Looking at Eclipse's log shows that the error is caused by missing JAR packages: hadoop-eclipse-plugin-0.20.203.0.jar had been copied directly, and JAR packages were missing from the lib directory inside the plug-in package.
After collecting information on the Internet, the correct installation method is given here:
The first step is to modify hadoop-eclipse-plugin-0.20.203.0.jar. Open the package with an archive manager and you will find that its lib directory contains only two packages, commons-cli-1.2.jar and hadoop-core.jar. Copy the following five packages from the hadoop/lib directory:
commons-configuration-1.6.jar
commons-httpclient-3.0.1.jar
commons-lang-2.4.jar
jackson-core-asl-1.0.1.jar
jackson-mapper-asl-1.0.1.jar
into the lib directory of hadoop-eclipse-plugin-0.20.203.0.jar, as shown below:
Then, modify MANIFEST.MF under the META-INF directory of the package, and change the classpath to the following:
Bundle-ClassPath: classes/,lib/hadoop-core.jar,lib/commons-cli-1.2.jar,lib/commons-httpclient-3.0.1.jar,lib/jackson-core-asl-1.0.1.jar,lib/jackson-mapper-asl-1.0.1.jar,lib/commons-configuration-1.6.jar,lib/commons-lang-2.4.jar
This completes the modification of the hadoop-eclipse-plugin-0.20.203.0.jar.
Finally, copy hadoop-eclipse-plugin-0.20.203.0.jar into Eclipse's plugins directory. (The plug-in JAR and the library versions inside it differ for each Hadoop version; use the ones matching your version.)
"Permission denied" problem
There are many suggested fixes on the Internet: some mention "hadoop fs -chmod 777 /user/local/hadoop273", some mention setting the "dfs.permissions" configuration item to false, and some mention "hadoop.job.ugi", but none of them worked for me.
References:
Address 1: http://www.cnblogs.com/acmy/archive/2011/10/28/2227901.html
Address 2: http://sunjun041640.blog.163.com/blog/static/25626832201061751825292/
Error type: org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=*, access=WRITE, inode="hadoop":hadoop:supergroup:rwxr-xr-x
Solution:
My solution is to change the local system user name directly to the user that runs Hadoop in your Hadoop cluster.
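Another workaround that is often used instead of renaming the local user (shown here only as a hedged sketch; it is not the method from the article above) is to tell the HDFS client explicitly which remote user it should act as when it talks to the cluster:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteAsHadoopUser {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Open the FileSystem as the "hadoop" user rather than the local Windows account,
        // so writes under /user/hadoop are no longer rejected with "Permission denied".
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.80.32:9000"), conf, "hadoop");
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/permission_test.txt"));
        out.write("hello hdfs".getBytes("UTF-8"));
        out.close();
        fs.close();
    }
}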
"Failed to set permissions of path" problem
Reference: https://issues.apache.org/jira/browse/HADOOP-8089
The error message is as follows:
ERROR security.UserGroupInformation: PriviledgedActionException as:hadoop cause:java.io.IOException: Failed to set permissions of path: \usr\hadoop\tmp\mapred\staging\hadoop753422487\.staging to 0700
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \usr\hadoop\tmp\mapred\staging\hadoop753422487\.staging to 0700
Solution:
Configuration conf = new Configuration();
conf.set("mapred.job.tracker", "[server]:9001");
"[server]" in "[server]:9001" is the IP address of the Hadoop cluster's Master.
This sidesteps the restriction on the permissions of the local directory files that Hadoop MapReduce uses during execution.