I. Background
I have been working with Hadoop for about half a year, from building Hadoop clusters to installing related components such as Hive, HBase and Sqoop, and even peripheral projects such as Spark on Hive, Phoenix and Kylin. When it comes to deployment, I believe I can manage without problems, but I would not dare claim to have mastered the system, because I am still not really familiar with MapReduce and have only a limited understanding of its working mechanism. I roughly understand how a MapReduce job runs, but for an actual implementation I could only rely on code found elsewhere, which is frankly embarrassing.
So I cannot put it off any longer; I need something of my own. At the very least, when I write MapReduce code I will not have to dig through other people's blogs. I can just look up my own.
II. Experiment
1. Experimental process
The experiment began with the simplest deduplication MapReduce job. Running it against a local file caused no problems, but once the file was placed on HDFS it could not be found. The reason is that a job reading from HDFS has to be packaged as a jar and executed through Hadoop:
1) javac outputs the class to the specified directory dir
javac *.java -d dir
2) jar packages class files
1. Package the specified class file to target.jar
jar cvf target.jar x1.class x2.class ... xn.class
2. Package all class files under the specified path dir to target.jar
jar cvf target.jar -C dir .
3. Package the class files into an executable jar, specifying Main as the program's entry point
jar cvfe target.jar Main -C dir .
Hadoop only needs an ordinary jar; it does not need to be packaged as an executable jar.
3) Execute the jar, with MapDuplicate as the main class
hadoop jar target.jar MapDuplicate (params)
2. Code analysis
1) Imported classes
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
Configuration class: used to set parameters for Hadoop, such as IP, port, etc.
Path: used to set input and output paths
IntWritable: the int type used by MapReduce
Text: the string type used by MapReduce
Job: the main class that creates the MapReduce job; job parameters are also set through this class
Mapper: the class that the Map class inherits from
Reducer: the class that the Reduce class inherits from
FileInputFormat: input file format
FileOutputFormat: output file format (can be changed to other IO classes, such as databases)
GenericOptionsParser: a class that parses command-line arguments
2) Code structure
public class MapDuplicate {
    public static class Map extends Mapper<Object, Text, Text, Text> { ... }
    public static class Reduce extends Reducer<Text, Text, Text, Text> { ... }
    public static void main(String[] args) throws Exception { ... }
}
3) Map class
public static class Map extends Mapper<Object, Text, Text, Text> {
    private static Text line = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        line = value;
        // Emit the whole line as the key with an empty value; duplicate lines collapse in Reduce
        context.write(line, new Text(""));
    }
}
The main job of the Map class is to normalize the data: it emits key-value pairs according to a rule, providing standardized input for the Combine and Reduce operations that follow. In code terms, it inherits the Mapper class and implements the map function.
The Mapper class takes four generic type parameters. The first two are the key and value types of the input data, usually Object and Text; the last two are the key and value types of the output data, and they must match the input key and value types of the Reduce class.
All plain Java value types are converted to the corresponding Writable types before being handed to the MapReduce job, for example: String -> Text, int -> IntWritable, long -> LongWritable.
Context is the class through which user code interacts with the MapReduce framework. It passes the key-value pairs produced by Map to the Combiner or Reducer, and it also writes the Reducer's results to HDFS.
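To make these conversions concrete, here is a small self-contained sketch (the class name WritableConversion is hypothetical and not part of the original code) showing how plain Java values are wrapped into and unwrapped from the Writable types:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableConversion {
    public static void main(String[] args) {
        Text word = new Text("hello");               // String -> Text
        IntWritable count = new IntWritable(1);      // int    -> IntWritable
        LongWritable offset = new LongWritable(42L); // long   -> LongWritable

        // Unwrapping back to plain Java values
        String s = word.toString();
        int i = count.get();
        long l = offset.get();
        System.out.println(s + " " + i + " " + l);
    }
}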
4) Reduce class
public static class Reduce extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Each distinct key is written out exactly once, so duplicates are removed
        context.write(key, new Text(""));
    }
}
Reduce covers two operations, Combine and Reduce, and both are implemented by inheriting the Reducer class. The former preprocesses the data and hands the result to Reduce; it can be regarded as a local Reduce. When no preprocessing is needed, the Combine step can simply reuse the Reduce class. The latter formally processes the data, merging records that share the same key; each invocation of the reduce function handles only the records for one key.
Of the four type parameters of the Reducer class, the first two are the key and value types of the input data and must match the output types of the Mapper class (the Combiner must match them as well; since the Combiner's output must in turn match the Reducer's input, the Combiner's input and output types have to be identical). The last two are the key and value types of the output data, i.e. the final result. An illustrative sketch of a standalone Combiner follows.
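As an illustration only (word-count style, not part of the original example; the class name SumCombiner is hypothetical), here is a sketch of a standalone Combiner whose input and output types are both Text/IntWritable, satisfying the constraint described above:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable partialSum = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();   // pre-aggregate locally before the shuffle
        }
        partialSum.set(sum);
        context.write(key, partialSum);
    }
}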
5) Main function
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapred.job.tracker", "XHadoop1:9010");
    String[] ioArgs = new String[] { "duplicate_in", "duplicate_out" };
    String[] otherArgs = new GenericOptionsParser(conf, ioArgs).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: MapDuplicate <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "MapDuplicate");
    job.setJarByClass(MapDuplicate.class);
    job.setMapperClass(Map.class);
    job.setCombinerClass(Reduce.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
First, there must be a Configuration object, through which the machine to run against is specified.
Next come the statements that receive the command-line parameters, which need no further explanation.
Then there must be a Job object, on which the classes used for the MapReduce processing are specified: the Mapper class, the Combiner class, the Reducer class, and the key and value types of the output data.
Then the input and output data paths are specified.
Then, wait for the task to finish and exit.
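As a side note, on Hadoop 2.x and later the new Job(conf, name) constructor used above is deprecated in favor of the Job.getInstance() factory. A minimal sketch of the same driver under that assumption, reusing the class names from the example (imports as listed earlier):
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "MapDuplicate");   // replaces new Job(conf, "MapDuplicate")
    job.setJarByClass(MapDuplicate.class);
    job.setMapperClass(Map.class);
    job.setCombinerClass(Reduce.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}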
III. Summary
This experiment is just about the simplest MapReduce job possible, but, small as the sparrow is, it has all the vital organs.
In principle, MapReduce has the following steps:
HDFS (Block) -> Split -> Mapper -> Partition -> Spill -> Sort -> Combine -> Merge -> Reducer -> HDFS
1. The input data on HDFS is divided into splits and read by the Mapper class.
2. After the Mapper reads the data, the records are partitioned (assigned to reducers).
3. If the Map operation overflows its memory buffer, the data must be spilled to disk.
4. The Mapper performs a sort operation.
5. After sorting, the Combine operation merges values with the same key; it can be understood as a local-mode Reduce.
6. While combining, the spilled files are merged.
7. After all map tasks are complete, the data is handed to the Reducer for processing, and the results are written to HDFS.
The data transfer that takes place between the Map task and the start of the Reduce task is called the Shuffle.
Programmatically, MapReduce has the following steps:
1. Write the Mapper class
2. Write the Combiner class (optional)
3. Write the Reducer class
4. Call procedure: configure parameters with Configuration
Specify the task classes
Specify the input and output formats
Specify the data locations
Start the job
The above is only a superficial understanding, offered purely for learning and reference.