How to Implement HelloWorld with Hadoop


This article explains how to implement the Hadoop equivalent of HelloWorld: the classic WordCount example. It is quite practical, so it is shared here as a reference; follow along to have a look.

We start with the full source code and then analyze it step by step.

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "172.16.10.15:9001"); // additional code
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You can see that the entire source code is divided into three parts:

1. Map

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

A) Define your own Map process; the class name, here TokenizerMapper, is up to you. The class must extend the Mapper class in the org.apache.hadoop.mapreduce package, and its four type parameters are, in order, the types of the input key, the input value, the output key, and the output value. It is worth noting that Hadoop provides its own set of basic types, optimized for network serialization, rather than reusing the built-in Java types. They all live in the org.apache.hadoop.io package: LongWritable corresponds to Long, Text corresponds to String, and IntWritable corresponds to Integer.
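As a minimal sketch of how these wrapper types behave (not part of the original example; the class name and values are made up for illustration), converting between Hadoop's Writable types and plain Java values looks like this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        IntWritable one = new IntWritable(1);        // wraps an int
        LongWritable offset = new LongWritable(42L); // wraps a long
        Text word = new Text("hello");               // wraps a String (stored as UTF-8)

        int i = one.get();           // back to a plain int
        long l = offset.get();       // back to a plain long
        String s = word.toString();  // back to a plain String

        System.out.println(s + " " + i + " " + l); // prints: hello 1 42
    }
}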

B) The parameter value in the map method is one line of the text file, and the parameter key is the offset of that line's first character from the beginning of the file.

C) The StringTokenizer class is a utility class used to split a String, similar to split.

// It has three constructors:
public StringTokenizer(String str)
public StringTokenizer(String str, String delim)
public StringTokenizer(String str, String delim, boolean returnDelims)
// The first parameter is the String to be split, the second is the set of delimiter characters,
// and the third controls whether the delimiters themselves are returned as tokens.
// If no delimiter is specified, the default is " \t\n\r\f" (space, tab, newline, carriage return, form feed).
// It has three main methods:
public boolean hasMoreTokens()  // returns whether there are more tokens
public String nextToken()       // returns the substring from the current position up to the next delimiter
public int countTokens()        // returns the number of remaining tokens, i.e. how many more times nextToken can be called
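A small, self-contained sketch of this behavior (the input string and class name are hypothetical, not from the article):

import java.util.StringTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) {
        StringTokenizer itr = new StringTokenizer("hello hadoop hello world");
        System.out.println(itr.countTokens());    // prints 4
        while (itr.hasMoreTokens()) {
            System.out.println(itr.nextToken());  // prints hello, hadoop, hello, world (one per line)
        }
    }
}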

D) After StringTokenizer processing, you get one <word,1> key-value pair after another, and each pair is written out through context. Context collects the map output; for example, the line "hello hadoop hello" yields <hello,1>, <hadoop,1>, <hello,1>. The indirection is a little roundabout to read at first, so take a moment to work through it yourself.

2. Reduce

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

A) Like the Map process, the Reduce process must extend the Reducer class in the org.apache.hadoop.mapreduce package and override its reduce method.

B) The input parameter key in the reduce method is a single word, and values is the list of counts collected for that word.

C) The purpose of the reduce method is to sum the values in that list.

D) The output is a <key,value> pair, where key is a single word and value is the sum of that word's list of counts.
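To make the arithmetic concrete, here is a standalone sketch of what the reduce loop does for one key, using made-up values (the word "hello" appearing three times; the class name is just for illustration):

import java.util.Arrays;
import java.util.List;

public class ReduceSketch {
    public static void main(String[] args) {
        // Hypothetical shuffled input for one key: ("hello", [1, 1, 1])
        String key = "hello";
        List<Integer> values = Arrays.asList(1, 1, 1);
        int sum = 0;
        for (int val : values) {
            sum += val;   // same accumulation as in IntSumReducer
        }
        System.out.println("<" + key + "," + sum + ">"); // prints <hello,3>
    }
}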

3. Main

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapred.job.tracker", "172.16.10.15:9001"); // additional code
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

A) Configuration conf = new Configuration(); by default, when a Configuration is instantiated it reads its parameters from the Hadoop configuration files.

B) conf.set("mapred.job.tracker", "172.16.10.15:9001"); this line is added because we want to submit the Job to the Hadoop cluster from Eclipse, so we set the JobTracker address by hand. If you run the jar directly on the Hadoop cluster, this line is not needed. Note that only these first few statements are specific to this setup; the code that follows them appears in essentially every Hadoop example.
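For context, a common Hadoop 1.x-era pattern when submitting from a development machine is to point both the default file system and the JobTracker at the cluster. The following is only a sketch with placeholder host names and an invented class name, not the article's exact setup:

import org.apache.hadoop.conf.Configuration;

public class RemoteClusterConf {
    // Builds a Configuration for submitting jobs from outside the cluster (e.g. from Eclipse).
    public static Configuration create() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode-host:9000"); // classic HDFS address property (Hadoop 1.x)
        conf.set("mapred.job.tracker", "jobtracker-host:9001");   // classic MR1 JobTracker property
        // When running with "hadoop jar" on the cluster itself, these values come from
        // core-site.xml and mapred-site.xml, so no manual settings are needed.
        return conf;
    }
}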

C) The next statement also reads parameters, this time from the command-line arguments.

D) Job job = new Job(conf, "word count"); during MapReduce processing, the Job object is responsible for managing and running a single computing task, and the various Job methods are then used to set the task's parameters. "word count" is the name of the Job. (Of course, in line with ordinary Java conventions, you can also declare it as Job job = new Job(); job.setJobName("Name");)
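As a side note not found in the original article: in newer Hadoop releases the Job constructors used above are deprecated, and the usual replacement is the static factory method, for example:

// In Hadoop 2.x and later, inside main you would write this instead;
// it is equivalent in effect to new Job(conf, "word count").
Job job = Job.getInstance(conf, "word count");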

E) job.setJarByClass(WordCount.class); sets the Jar file based on the location of the WordCount class.

Why do that? Because when we run this job on a Hadoop cluster, we package the code into a JAR file and distribute it to the cluster, and Hadoop uses the class passed here to locate the JAR file that contains it.

F) job.setMapperClass(TokenizerMapper.class); sets the Mapper.

G) job.setCombinerClass(IntSumReducer.class); sets the Combiner. Here we reuse the Reducer class to merge each Mapper's intermediate results locally, which reduces the amount of data sent over the network; for example, a map task that emits <hello,1> three times forwards a single <hello,3> after combining.

H) job.setReducerClass(IntSumReducer.class); sets the Reducer.

I) job.setOutputKeyClass(Text.class); and job.setOutputValueClass(IntWritable.class); set the types of the output key and the output value, respectively.
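One detail worth knowing, although it is not needed in this example because the map and reduce outputs share the same types: when the map output types differ from the job's final output types, they must be declared separately. A hypothetical sketch (the DoubleWritable final value type is chosen arbitrarily for illustration):

// Hypothetical job whose reducer emits doubles while the mapper still emits ints.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class); // org.apache.hadoop.io.DoubleWritable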

J) FileInputFormat.addInputPath(job, new Path(otherArgs[0])); sets the input path, taken from the first element of otherArgs.

K) FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); sets the output path, taken from the second element of otherArgs; the results are written there.

Note: this output directory must not exist before the job runs, otherwise Hadoop will report an error and refuse to run the job. The purpose of this precaution is to prevent data loss (it would be very annoying if the results of a long-running job were accidentally overwritten).
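If you prefer to delete a stale output directory programmatically before submitting (a common convenience, not part of the original code; the class and method names here are made up), something along these lines works:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputCleaner {
    // Removes the output directory if it already exists, so the job can be rerun.
    public static void deleteIfExists(Configuration conf, String dir) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(dir);
        if (fs.exists(out)) {
            fs.delete(out, true); // true = delete recursively
        }
    }
}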

L) System.exit(job.waitForCompletion(true) ? 0 : 1); submits the job, waits for it to finish, and exits with 0 on success or 1 on failure (the true argument makes the job print its progress as it runs).

4. The function of each package

So far, the three major parts have been analyzed; now let's take a look at which classes have been imported:

A) package org.apache.hadoop.examples; Java manages code with the package mechanism, using the package keyword. The package name is up to you, but it must not clash with others; to keep packages unique, the usual convention is to use the company domain name in reverse order, which is why the example above uses a package name like org.apache.hadoop.

B) import java.io.IOException; for classes whose names start with java, you can find the documentation in the JDK 1.7 API. Here IOException, an input/output exception class, is imported from java.io.

C) import java.util.StringTokenizer; imports the StringTokenizer class from the java.util package, a class for tokenizing text; its usage was covered above.

D) import org.apache.hadoop.conf.Configuration; for classes whose names start with org.apache.hadoop, you can find the documentation in the Hadoop 1.2.1 API. Here the Configuration class is imported from Hadoop's conf package; it reads, writes, and stores configuration information.

E) import org.apache.hadoop.fs.Path; the Path class holds the path string of a file or directory.

F) import org.apache.hadoop.io.IntWritable; IntWritable is a class that represents a serializable integer. In Java an integer can be represented with either the primitive int type or the Integer wrapper class; Integer wraps int and is serializable. Hadoop, however, considered Integer's serialization unsuitable for its needs, so it implemented IntWritable instead.

G) import org.apache.hadoop.io.Text; imports the Text class from the io package, a serializable and comparable class that stores a string.

H) import org.apache.hadoop.mapreduce.Job; imports the Job class. Every task in Hadoop is a Job, which is responsible for configuring parameters, setting MapReduce details, submitting to the Hadoop cluster, controlling execution, and so on.

I) import org.apache.hadoop.mapreduce.Mapper; imports the Mapper class, which is responsible for the Map phase in MapReduce.

J) import org.apache.hadoop.mapreduce.Reducer; imports the Reducer class, which is responsible for the Reduce phase in MapReduce.

K) import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; imports the FileInputFormat class, whose main job is to split the input files.

L) import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; the FileOutputFormat class writes the output to files.

M) import org.apache.hadoop.util.GenericOptionsParser; this class is responsible for parsing command-line arguments.

From the code itself we now have a clear picture of MapReduce, so how exactly does the WordCount program execute?

Upload the files file1.txt and file2.txt to the hdfsinput1 folder in HDFS (either through the Eclipse client or through the Hadoop command line), then write the WordCount.java file in Eclipse (that is, the source code analyzed in the first part) and run it.
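If you would rather do the upload from Java instead of the command line, the HDFS FileSystem API can copy the files up. This is only a sketch; the /hdfsinput1 path and local file names follow this walkthrough, and the class name is made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadInput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml etc. from the classpath
        FileSystem fs = FileSystem.get(conf);
        fs.mkdirs(new Path("/hdfsinput1"));        // create the input folder if it does not exist
        fs.copyFromLocalFile(new Path("file1.txt"), new Path("/hdfsinput1/file1.txt"));
        fs.copyFromLocalFile(new Path("file2.txt"), new Path("/hdfsinput1/file2.txt"));
        fs.close();
    }
}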

Because the files used in this test are small, each file becomes a single split, and each split is broken into lines to form <key,value> pairs. This step is done automatically by the MapReduce framework, and the key is the offset of the line's first character from the beginning of the file.

After the map method produces its output <key,value> pairs, the Combine operation runs on them.

Similarly, in the Reduce phase the input data is sorted first and then processed by the custom reduce method, producing new <key,value> pairs that form the output of WordCount. The output is stored in part-r-00000 under the output folder (lxnoutputssss in this run).
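To make the whole flow concrete, here is a hypothetical end-to-end trace; the file contents are made up for illustration and are not the article's actual data:

file1.txt: hello world
file2.txt: hello hadoop

Map output:                   <hello,1> <world,1>  and  <hello,1> <hadoop,1>
After combine/shuffle/sort:   <hadoop,[1]>  <hello,[1,1]>  <world,[1]>
Reduce output (part-r-00000):
hadoop 1
hello 2
world 1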

Thank you for reading! This concludes the article on how to implement HelloWorld with Hadoop. I hope the content above has been helpful and that you have learned something new from it. If you found the article worthwhile, feel free to share it so more people can see it!
