I. Background
I have been working with Hadoop for about half a year, from building Hadoop clusters to installing related components such as Hive, HBase and Sqoop, and even peripheral projects such as Spark on Hive, Phoenix and Kylin. When it comes to deployment, I believe I can manage without problems, but I would not dare claim to have mastered the system, because I am still not really familiar with MapReduce and have only a limited understanding of its working mechanism. I roughly understand how a MapReduce job runs, but for an actual implementation I could only rely on code found elsewhere, which is frankly embarrassing.
So I cannot put it off any longer; I need something of my own. At the very least, when I write MapReduce code I will not have to dig through other people's blogs. I can just look up my own.
II. Experiment
1. Experimental process
The experiment began with the simplest deduplication MapReduce job. Running it against a local file caused no problems, but once the file was placed on HDFS it could not be found. The reason is that a job reading from HDFS has to be packaged as a jar and executed through Hadoop:
1) javac outputs the class to the specified directory dir
javac *.java -d dir
2) jar packages class files
1. Package the specified class file to target.jar
jar cvf target.jar x1.class x2.class ... xn.class
2. Package all class files under the specified path dir to target.jar
jar cvf target.jar -C dir .
3. Package the class files into an executable jar, specifying Main as the program's entry point
jar cvfe target.jar Main -C dir .
Hadoop only needs an ordinary jar; it does not need to be packaged as an executable jar.
3) Execute the jar, with MapDuplicate as the main class
hadoop jar target.jar MapDuplicate (params)
2. Code analysis
1) Imported classes
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
Configuration class: used to set parameters for Hadoop, such as IP, port, etc.
Path: used to set input and output paths
IntWritable: the int type used by MapReduce
Text: the string type used by MapReduce
Job: the main class that creates the MapReduce job; job parameters are also set through this class
Mapper: the class that the Map class inherits from
Reducer: the class that the Reduce class inherits from
FileInputFormat: input file format
FileOutputFormat: output file format (can be changed to other IO classes, such as databases)
GenericOptionsParser: a class that parses command-line arguments
2) Code structure
public class MapDuplicate {
    public static class Map extends Mapper<Object, Text, Text, Text> { ... }
    public static class Reduce extends Reducer<Text, Text, Text, Text> { ... }
    public static void main(String[] args) throws Exception { ... }
}
3) Map class
public static class Map extends Mapper<Object, Text, Text, Text> {
    private static Text line = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        line = value;
        // Emit the whole line as the key with an empty value; duplicate lines collapse in Reduce
        context.write(line, new Text(""));
    }
}
The main job of the Map class is to normalize the data: it emits key-value pairs according to a rule, providing standardized input for the Combine and Reduce operations that follow. In code terms, it inherits the Mapper class and implements the map function.
The Mapper class takes four generic type parameters. The first two are the key and value types of the input data, usually Object and Text; the last two are the key and value types of the output data, and they must match the input key and value types of the Reduce class.
All plain Java value types are converted to the corresponding Writable types before being handed to the MapReduce job, for example: String -> Text, int -> IntWritable, long -> LongWritable.
Context is the class through which user code interacts with the MapReduce framework. It passes the key-value pairs produced by Map to the Combiner or Reducer, and it also writes the Reducer's results to HDFS.
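To make these conversions concrete, here is a small self-contained sketch (the class name WritableConversion is hypothetical and not part of the original code) showing how plain Java values are wrapped into and unwrapped from the Writable types:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableConversion {
    public static void main(String[] args) {
        Text word = new Text("hello");               // String -> Text
        IntWritable count = new IntWritable(1);      // int    -> IntWritable
        LongWritable offset = new LongWritable(42L); // long   -> LongWritable

        // Unwrapping back to plain Java values
        String s = word.toString();
        int i = count.get();
        long l = offset.get();
        System.out.println(s + " " + i + " " + l);
    }
}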
4) Reduce class
public static class Reduce extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Each distinct key is written out exactly once, so duplicates are removed
        context.write(key, new Text(""));
    }
}
Reduce covers two operations, Combine and Reduce, and both are implemented by inheriting the Reducer class. The former preprocesses the data and hands the result to Reduce; it can be regarded as a local Reduce. When no preprocessing is needed, the Combine step can simply reuse the Reduce class. The latter formally processes the data, merging records that share the same key; each invocation of the reduce function handles only the records for one key.
Of the four type parameters of the Reducer class, the first two are the key and value types of the input data and must match the output types of the Mapper class (the Combiner must match them as well; since the Combiner's output must in turn match the Reducer's input, the Combiner's input and output types have to be identical). The last two are the key and value types of the output data, i.e. the final result. An illustrative sketch of a standalone Combiner follows.
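As an illustration only (word-count style, not part of the original example; the class name SumCombiner is hypothetical), here is a sketch of a standalone Combiner whose input and output types are both Text/IntWritable, satisfying the constraint described above:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable partialSum = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();   // pre-aggregate locally before the shuffle
        }
        partialSum.set(sum);
        context.write(key, partialSum);
    }
}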
5) Main function
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapred.job.tracker", "XHadoop1:9010");
    String[] ioArgs = new String[] { "duplicate_in", "duplicate_out" };
    String[] otherArgs = new GenericOptionsParser(conf, ioArgs).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: MapDuplicate <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "MapDuplicate");
    job.setJarByClass(MapDuplicate.class);
    job.setMapperClass(Map.class);
    job.setCombinerClass(Reduce.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
First, there must be a Configuration object, through which the machine to run against is specified.
Next come the statements that receive the command-line parameters, which need no further explanation.
Then there must be a Job object, on which the classes used for the MapReduce processing are specified: the Mapper class, the Combiner class, the Reducer class, and the key and value types of the output data.
Then the input and output data paths are specified.
Then, wait for the task to finish and exit.
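As a side note, on Hadoop 2.x and later the new Job(conf, name) constructor used above is deprecated in favor of the Job.getInstance() factory. A minimal sketch of the same driver under that assumption, reusing the class names from the example (imports as listed earlier):
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "MapDuplicate");   // replaces new Job(conf, "MapDuplicate")
    job.setJarByClass(MapDuplicate.class);
    job.setMapperClass(Map.class);
    job.setCombinerClass(Reduce.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}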
III. Summary
This experiment is just about the simplest MapReduce job possible, but, small as the sparrow is, it has all the vital organs.
In principle, MapReduce has the following steps:
HDFS (Block) -> Split -> Mapper -> Partition -> Spill -> Sort -> Combine -> Merge -> Reducer -> HDFS
1. The input data on HDFS is divided into splits and read by the Mapper class.
2. After the Mapper reads the data, the records are partitioned (assigned to reducers).
3. If the Map operation overflows its memory buffer, the data must be spilled to disk.
4. The Mapper performs a sort operation.
5. After sorting, the Combine operation merges values with the same key; it can be understood as a local-mode Reduce.
6. While combining, the spilled files are merged.
7. After all map tasks are complete, the data is handed to the Reducer for processing, and the results are written to HDFS.
The data transfer that takes place between the Map task and the start of the Reduce task is called the Shuffle.
Programmatically, MapReduce has the following steps:
1. Write the Mapper class
2. Write the Combiner class (optional)
3. Write the Reducer class
4. Call procedure: configure parameters with Configuration
Specify the task classes
Specify the input and output formats
Specify the data locations
Start the job
The above is only a superficial understanding, offered purely for learning and reference.