
Example Analysis of hadoop-Mapper


This article walks through the Hadoop Mapper class in detail, using its source code and the built-in Mapper subclasses as examples. It is shared here as a practical reference; I hope you take something away from it.

/* Licensed to the Apache Software Foundation (ASF) under one ... */
package org.apache.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;

/**
 * Maps input key/value pairs to a set of intermediate key/value pairs.
 *
 * Maps are the individual tasks which transform input records into
 * intermediate records. The transformed intermediate records need not be of
 * the same type as the input records. A given input pair may map to zero or
 * many output pairs.
 *
 * The Hadoop Map-Reduce framework spawns one map task for each
 * {@link InputSplit} generated by the {@link InputFormat} for the job.
 * Mapper implementations can access the {@link Configuration} for
 * the job via {@link JobContext#getConfiguration()}.
 *
 * The framework first calls
 * {@link #setup(org.apache.hadoop.mapreduce.Mapper.Context)}, followed by
 * {@link #map(Object, Object, Context)}
 * for each key/value pair in the InputSplit. Finally
 * {@link #cleanup(Context)} is called.
 *
 * All intermediate values associated with a given output key are
 * subsequently grouped by the framework, and passed to a {@link Reducer} to
 * determine the final output. Users can control the sorting and grouping by
 * specifying two key {@link RawComparator} classes.
 *
 * The Mapper outputs are partitioned per
 * Reducer. Users can control which keys (and hence records) go to
 * which Reducer by implementing a custom {@link Partitioner}.
 *
 * Users can optionally specify a combiner, via
 * {@link Job#setCombinerClass(Class)}, to perform local aggregation of the
 * intermediate outputs, which helps to cut down the amount of data transferred
 * from the Mapper to the Reducer.
 *
 * Applications can specify if and how the intermediate
 * outputs are to be compressed, and which {@link CompressionCodec}s are to be
 * used, via the Configuration.
 *
 * If the job has zero
 * reduces then the output of the Mapper is directly written
 * to the {@link OutputFormat} without sorting by keys.
 *
 * Example:
 *
 * public class TokenCounterMapper
 *     extends Mapper<Object, Text, Text, IntWritable> {
 *
 *   private final static IntWritable one = new IntWritable(1);
 *   private Text word = new Text();
 *
 *   public void map(Object key, Text value, Context context) throws IOException {
 *     StringTokenizer itr = new StringTokenizer(value.toString());
 *     while (itr.hasMoreTokens()) {
 *       word.set(itr.nextToken());
 *       context.write(word, one);
 *     }
 *   }
 * }
 *
 * Applications may override the {@link #run(Context)} method to exert
 * greater control on map processing, e.g. multi-threaded Mappers etc.
 *
 * @see InputFormat
 * @see JobContext
 * @see Partitioner
 * @see Reducer
 */
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  public class Context
      extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    public Context(Configuration conf, TaskAttemptID taskid,
                   RecordReader<KEYIN, VALUEIN> reader,
                   RecordWriter<KEYOUT, VALUEOUT> writer,
                   OutputCommitter committer,
                   StatusReporter reporter,
                   InputSplit split) throws IOException, InterruptedException {
      super(conf, taskid, reader, writer, committer, reporter, split);
    }
  }

  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }
}
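For context, here is a minimal driver sketch showing where the pieces named in the javadoc (the Mapper, an optional combiner, the reducer, the number of reduces) plug into a job. It is illustrative only: the input/output paths come from the command line, and the choice of IntSumReducer as combiner and reducer is an assumption of this sketch, not part of the Hadoop source above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenCounterMapper.class);   // the example Mapper from the javadoc
    job.setCombinerClass(IntSumReducer.class);      // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // With job.setNumReduceTasks(0) the Mapper output would instead go
    // straight to the OutputFormat, unsorted, as noted in the javadoc.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}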

Mapper has four methods: setup, map, cleanup, and run. setup and cleanup manage resources over the Mapper's lifecycle: setup is called once after the Mapper is constructed, just before the map actions begin, and cleanup is called once after all map actions have completed. map performs the map action on a single input key/value pair. run drives the whole procedure described above: it calls setup, iterates over all the key/value pairs calling map on each one, and finally calls cleanup.
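To make this lifecycle concrete, here is a hypothetical Mapper that initializes per-task state in setup, uses it in every map call, and reports a counter in cleanup. The filter.prefix configuration key, the counter names, and the class name are all invented for this sketch:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PrefixFilterMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private String prefix;   // per-task state, initialized once in setup
  private long matched;    // records emitted by this task

  @Override
  protected void setup(Context context) {
    // Called once per map task, after construction and before the first map() call.
    prefix = context.getConfiguration().get("filter.prefix", "");
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Called once per input key/value pair.
    if (value.toString().startsWith(prefix)) {
      context.write(value, key);
      matched++;
    }
  }

  @Override
  protected void cleanup(Context context) {
    // Called once per map task, after the last map() call; release resources
    // or report final state here.
    context.getCounter("PrefixFilterMapper", "matched").increment(matched);
  }
}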

Three subclasses of Mapper are provided in org.apache.hadoop.mapreduce.lib.map: InverseMapper (emits the input with key and value swapped), MultithreadedMapper (executes the map method in multiple threads), and TokenCounterMapper (splits the input value into tokens and counts each one). The most complex of these is MultithreadedMapper, which we will use as an example to analyze how a Mapper can be implemented.

InverseMapper source code:

/* Licensed to the Apache Software Foundation (ASF) under one ... */
package org.apache.hadoop.mapreduce.lib.map;

import java.io.IOException;

import org.apache.hadoop.mapreduce.Mapper;

/** A {@link Mapper} that swaps keys and values. */
public class InverseMapper<K, V> extends Mapper<K, V, V, K> {

  /** The inverse function. Input keys and values are swapped. */
  @Override
  public void map(K key, V value, Context context)
      throws IOException, InterruptedException {
    context.write(value, key);
  }
}

TokenCounterMapper source code:

/* Licensed to the Apache Software Foundation (ASF) under one ... */
package org.apache.hadoop.mapreduce.lib.map;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Tokenize the input values and emit each word with a count of 1.
 */
public class TokenCounterMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

MultithreadedMapper starts multiple threads, each executing the map method of another Mapper: it spawns mapred.map.multithreadedrunner.threads (a configuration item) threads to run the Mapper named by mapred.map.multithreadedrunner.class (another configuration item; we call this the target Mapper). MultithreadedMapper overrides the run method of the base class Mapper and starts N threads (implemented by the inner class MapRunner), each of which executes the target Mapper's run method; this means the target Mapper's setup and cleanup are executed once per thread. All the target Mapper instances share the same InputSplit, so reads from the InputSplit must be thread-safe. For this purpose MultithreadedMapper introduces the inner classes SubMapRecordReader, SubMapRecordWriter, and SubMapStatusReporter, which inherit from RecordReader, RecordWriter, and StatusReporter respectively. By accessing the Mapper.Context of the enclosing MultithreadedMapper under mutual exclusion, they provide thread-safe access to the shared InputSplit and supply each target Mapper with the Context it needs. The implementation of these classes is straightforward.
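As a minimal sketch of how a job would enable this, using the static helper methods on MultithreadedMapper (which set the two configuration items named above); the choice of TokenCounterMapper as the target and the thread count of 4 are arbitrary for the example:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;

public class MultithreadedSetup {
  /** Configure the job to run TokenCounterMapper in 4 threads per map task. */
  public static void configure(Job job) {
    // The framework runs MultithreadedMapper as the job's Mapper...
    job.setMapperClass(MultithreadedMapper.class);
    // ...and MultithreadedMapper delegates to the target Mapper in N threads,
    // all reading the same InputSplit through a shared, synchronized Context.
    MultithreadedMapper.setMapperClass(job, TokenCounterMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 4);
  }
}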

This concludes the "Example Analysis of hadoop-Mapper". I hope the content above has been helpful and teaches you something new; if you found the article useful, please share it so more people can see it.
