// MapReduce WordCount program
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    /**
     * TokenizerMapper inherits from Mapper and splits the contents of the input file
     * on whitespace (" \t\n\r\f", the default StringTokenizer delimiters).
     *
     * [One input file gives one map task; two files give two map tasks.]
     * map [reads the content of the input file, then emits a word ==> 1 key/value pair]
     *
     * Type parameters:
     *   LongWritable  input key type (byte offset of the line)
     *   Text          input value type (the line itself)
     *   Text          output key type (a word)
     *   IntWritable   output value type (the count 1)
     *
     * Writable tells the Hadoop framework how to serialize and deserialize an object;
     * WritableComparable adds the compareTo method so that Hadoop knows how to sort
     * objects of that type.
     *
     * @author liuqingjie
     */
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    /**
     * IntSumReducer inherits from Reducer.
     *
     * [No matter how many map tasks there are, there is only one summary step here.]
     * reduce [loops over all values emitted for a key and sums the word ==> 1 pairs]
     *
     * The key here is the word set by the Mapper; reduce is called once per key,
     * and when the loop ends context holds the final result.
     *
     * @author liuqingjie
     */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Please configure the input and output paths");
            System.exit(2);
        }
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);            // main class
        job.setMapperClass(TokenizerMapper.class);     // mapper
        job.setReducerClass(IntSumReducer.class);      // reducer
        job.setMapOutputKeyClass(Text.class);          // key class of the map output
        job.setMapOutputValueClass(IntWritable.class); // value class of the map output
        job.setOutputKeyClass(Text.class);             // key class of the job output
        job.setOutputValueClass(IntWritable.class);    // value class of the job output
        FileInputFormat.addInputPath(job, new Path(args[0]));   // file input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // file output
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // wait for completion and exit
    }
}
Analysis of how the program is written:
(1) Data types
Integer type: IntWritable, Hadoop's encapsulation of Java's int
String type: Text, Hadoop's encapsulation of Java's String
Context object: Context, used to communicate with the MapReduce framework, for example to pass the results of map on to reduce for processing
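As a small illustration of how these wrapper types relate to plain Java values, here is a standalone sketch (the class name WritableTypesDemo is made up for this example and is not part of the original program):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    public class WritableTypesDemo {
        public static void main(String[] args) {
            IntWritable count = new IntWritable(3); // Hadoop's serializable wrapper around int
            Text word = new Text("hadoop");         // Hadoop's serializable wrapper around String
            System.out.println(word + " = " + count.get()); // prints: hadoop = 3
        }
    }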
(2) Execution process
Execution is divided into two stages, a map stage and a reduce stage, both of which take <key, value> pairs as input and output; the types of the keys and values can be defined by the programmer.
How the map is written:
Define a class that inherits from the base class Mapper. The base class is generic, with four type parameters that specify the input key, input value, output key and output value of the map function, in the form public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
Override the map function as needed; its parameter types are fixed by the Mapper type parameters. The map function is called once for every <key, value> pair.
In the wordcount program, the value passed to the map method holds one line of the text file, and the key is the offset of the first character of that line relative to the first character of the file (the key is not used in this program). The StringTokenizer class is used to split each line into individual words.
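To make the four type parameters concrete, here is the mapper from the program above restated as a minimal sketch, with each generic parameter labelled (the labels and comments are added here for illustration):

    // Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text,   // input:  byte offset of the line, the line itself
                           Text, IntWritable> {  // output: a word, the count 1

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // split the line on whitespace and emit <word, 1> for every token
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                context.write(new Text(itr.nextToken()), new IntWritable(1));
            }
        }
    }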
How the reduce is written:
Define a class that inherits from the base class Reducer, which is also generic, with four type parameters specifying the input key, input value, output key and output value of the reduce function, in the form public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>; the input types of reduce must match the output types of map.
Override the reduce method as needed; its parameter types are fixed by the Reducer type parameters. The reduce method is called once for each key.
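Likewise, here is the reducer from the program above as a minimal sketch, with the generic parameters labelled (comments added here for illustration); note that its input types match the mapper's output types:

    // Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    public static class IntSumReducer
            extends Reducer<Text, IntWritable,   // input:  a word, the 1s emitted for it by the mapper
                            Text, IntWritable> { // output: the word, its total count

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();   // add up all the 1s for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }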
How the main function is written:
The job is configured in the main function; the main configuration is as follows:
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);            // main class
    job.setMapperClass(TokenizerMapper.class);     // mapper
    job.setReducerClass(IntSumReducer.class);      // reducer
    job.setMapOutputKeyClass(Text.class);          // key class of the map output
    job.setMapOutputValueClass(IntWritable.class); // value class of the map output
    job.setOutputKeyClass(Text.class);             // key class of the job output
    job.setOutputValueClass(IntWritable.class);    // value class of the job output
    FileInputFormat.addInputPath(job, new Path(args[0]));   // file input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // file output
    System.exit(job.waitForCompletion(true) ? 0 : 1);       // wait for completion and exit
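One small note on the first line: the Job(Configuration, String) constructor used here still works but is marked deprecated in Hadoop 2.x and later; on a newer release the equivalent call is the factory method:

    Job job = Job.getInstance(conf, "wordcount"); // replaces the deprecated new Job(conf, "wordcount")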
(3) Data processing flow
1) The input file is split into splits. This is done automatically by the MapReduce framework, which then turns each split into <key, value> pairs.
2) The map function is called once for each <key, value> pair; after processing it produces new <key, value> pairs, which Context passes on towards the reduce stage.
3) The pairs produced by the Mapper are sorted by key, and the values that share the same key are merged; this yields the final output of the Mapper stage.
4) The reduce function processes each key and outputs new <key, value> pairs after processing.
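As an illustration (the input below is made up for this example, not taken from the original article): given an input file containing the two lines "hello world" and "hello hadoop", the map stage emits <hello,1>, <world,1>, <hello,1>, <hadoop,1>; after the framework sorts and groups by key, the reduce stage receives <hello,[1,1]>, <world,[1]> and <hadoop,[1]> and writes out <hello,2>, <world,1>, <hadoop,1>.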