Lu Chunli's work notes: who says a programmer can't have a bit of literary flair?
Hadoop is a platform for storing and processing big data: HDFS provides the data storage, and MapReduce provides the data computation.
The distributed-computing machinery is already encapsulated in MapReduce. To develop business functionality, users only need to extend Mapper and Reducer and implement the map() and reduce() methods respectively.
1. Map stage
Reads data from HDFS and normalizes the raw records into a form that is convenient for subsequent processing.
2. Reduce stage
Receives the output of the map phase, aggregates it according to the business logic, and writes the results back to HDFS.
Both map() and reduce() receive and emit their data as <key, value> pairs.
In Hadoop 1, jobs are scheduled by the JobTracker and executed by TaskTrackers; in Hadoop 2, YARN provides the ResourceManager and NodeManagers instead.
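Before looking at the Hadoop base classes below, here is a minimal sketch of what such a pair of subclasses looks like. It is only an illustration: the class names and the package are made up for this sketch, and the <LongWritable, Text, Text, LongWritable> type parameters anticipate the word-count example developed later in this note.

package demo.mapreduce;  // illustrative package name

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Skeleton mapper: <KEYIN, VALUEIN, KEYOUT, VALUEOUT> = <LongWritable, Text, Text, LongWritable>
class SkeletonMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // business logic: turn one input <offset, line> pair into zero or more output pairs
    }
}

// Skeleton reducer: its KEYIN/VALUEIN must match the mapper's KEYOUT/VALUEOUT
class SkeletonReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // business logic: aggregate all values that share the same key
    }
}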
Mapper side
The Mapper class provided by Hadoop; a custom Mapper extends this class:

package org.apache.hadoop.mapreduce;

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

    /**
     * Called once at the beginning of the task.
     */
    protected void setup(Context context) throws IOException, InterruptedException {
        // NOTHING
    }

    /**
     * Called once for each key/value pair in the input split.
     * Most applications should override this, but the default is the identity function.
     */
    @SuppressWarnings("unchecked")
    protected void map(KEYIN key, VALUEIN value, Context context)
            throws IOException, InterruptedException {
        context.write((KEYOUT) key, (VALUEOUT) value);
    }

    /**
     * Called once at the end of the task.
     */
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // NOTHING
    }

    /**
     * Expert users can override this method for more complete control over the
     * execution of the Mapper.
     *
     * @param context
     */
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}
Reducer
The Reducer class provided by Hadoop; a custom Reducer extends this class:

package org.apache.hadoop.mapreduce;

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

    /**
     * Called once at the start of the task.
     */
    protected void setup(Context context) throws IOException, InterruptedException {
        // NOTHING
    }

    /**
     * This method is called once for each key.
     * Most applications will define their reduce class by overriding this method.
     * The default implementation is an identity function.
     */
    @SuppressWarnings("unchecked")
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
            throws IOException, InterruptedException {
        for (VALUEIN value : values) {
            context.write((KEYOUT) key, (VALUEOUT) value);
        }
    }

    /**
     * Called once at the end of the task.
     */
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // NOTHING
    }

    /**
     * Advanced application writers can use the
     * {@link #run(org.apache.hadoop.mapreduce.Reducer.Context)} method to
     * control how the reduce task works.
     */
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            while (context.nextKey()) {
                reduce(context.getCurrentKey(), context.getValues(), context);
                // If a back up store is used, reset it
                Iterator<VALUEIN> iter = context.getValues().iterator();
                if (iter instanceof ReduceContext.ValueIterator) {
                    ((ReduceContext.ValueIterator<VALUEIN>) iter).resetBackupStore();
                }
            }
        } finally {
            cleanup(context);
        }
    }
}
Map process
The custom Mapper class extends this Mapper class. Of the four generic type parameters of Mapper, the first two (KEYIN, VALUEIN) are the input types of the map() function and the last two (KEYOUT, VALUEOUT) are its output types.
1. Read the contents of the input file, parse it into <key, value> pairs, and call the map() function once for each pair.
2. Implement your own business logic inside map(), process the input, and emit the results as new <key, value> pairs.
3. Partition the map output.
4. Within each partition, sort and group the data by key, collecting all values with the same key into one set, i.e. <key, {value, ...}>.
5. Optionally merge (combine) the grouped data before it is handed to reduce.
Description:
When the user specifies the input path, the framework reads the file contents from HDFS (usually a text file, though other formats are possible). The map() function is called once per line, with the byte offset of the line passed in as the key and the line content passed in as the value.
MapReduce is a distributed computing framework, so both the map and the reduce side may run as multiple tasks. The purpose of partitioning is to decide which reduce task receives and processes each piece of map output; a minimal partitioner sketch follows these notes.
The map-side shuffle process will be covered in a later note.
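To make the partitioning step concrete, here is a minimal sketch of a custom partitioner for the <Text, LongWritable> map output used later in this note; it mirrors the behaviour of Hadoop's default HashPartitioner. The class and package names are illustrative, and a job would only register such a class with job.setPartitionerClass(WordPartitioner.class) if the default were not sufficient.

package demo.mapreduce;  // illustrative package name

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reduce task receives each map output record.
public class WordPartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numReduceTasks) {
        // Same idea as the default HashPartitioner: hash the key and map it onto
        // [0, numReduceTasks), so identical keys always land in the same partition
        // (and therefore go to the same reducer).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}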
Example of word count:
[hadoop@nnode hadoop2.6.0]$ hdfs dfs -cat /data/file1.txt
hello world
hello markhuang
hello hadoop
[hadoop@nnode hadoop2.6.0]$
The file is read line by line; each call to the map() function receives one <line offset, line content> pair.
Inside each map() call the key is of type LongWritable (the line offset) and needs no processing for word counting; only the value has to be handled. To count words, the value is split into individual words, and for every word the mapper emits <word, 1> (one occurrence).
KEYIN, VALUEIN, KEYOUT, VALUEOUT => LongWritable, Text, Text, LongWritable
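As a plain-Java illustration of this transformation (outside Hadoop, no framework types), the sketch below tokenizes the three sample lines the same way the mapper later in this note does and prints the emitted <word, 1> pairs. The class name and the byte offsets are hypothetical values used only for display.

import java.util.StringTokenizer;

public class MapSideTrace {
    public static void main(String[] args) {
        // the three lines of /data/file1.txt, keyed by (hypothetical) byte offset
        long[] offsets = {0L, 12L, 28L};
        String[] lines = {"hello world", "hello markhuang", "hello hadoop"};

        for (int i = 0; i < lines.length; i++) {
            System.out.println("map(<" + offsets[i] + ", \"" + lines[i] + "\">) emits:");
            StringTokenizer token = new StringTokenizer(lines[i]);
            while (token.hasMoreTokens()) {
                // each word is emitted with a count of 1, i.e. <word, 1>
                System.out.println("  <" + token.nextToken() + ", 1>");
            }
        }
    }
}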
Reduce process
The custom Reducer class extends the Reducer class; as with the Mapper, it overrides the reduce() method to implement its own business logic.
1. The outputs of the multiple map tasks are copied over the network to the appropriate reduce nodes according to their partitions.
2. The outputs of the multiple map tasks are merged and sorted, and then processed by the custom reduce business logic.
3. The output of reduce is written to the specified location in HDFS.
Description:
The input received by reduce has its values grouped by key, and the groups are sorted by key, forming a <key, {value, value, ...}> structure.
Example of word count:
For the sample file there are four groups of data: <hadoop, {1}>, <hello, {1, 1, 1}>, <markhuang, {1}>, <world, {1}>.
The reduce() method is called once for each group, receiving the key and the iterable of values; the business logic inside reduce() then processes them (here, summing the counts).
KEYIN, VALUEIN, KEYOUT, VALUEOUT => Text, LongWritable, Text, LongWritable
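To make the <key, {value, ...}> structure concrete, here is a small plain-Java sketch (again outside Hadoop, class name illustrative) that groups the six <word, 1> pairs produced for file1.txt by key and sums each group, mirroring what the shuffle and the reduce() method do together.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideTrace {
    public static void main(String[] args) {
        // the six <word, 1> pairs emitted by the map side for file1.txt
        String[] words = {"hello", "world", "hello", "markhuang", "hello", "hadoop"};

        // shuffle/group: collect all values that share a key, sorted by key
        Map<String, List<Long>> groups = new TreeMap<>();
        for (String word : words) {
            groups.computeIfAbsent(word, k -> new ArrayList<>()).add(1L);
        }

        // "reduce": sum the values of each group, producing <word, count>
        for (Map.Entry<String, List<Long>> entry : groups.entrySet()) {
            long sum = 0;
            for (long one : entry.getValue()) {
                sum += one;
            }
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}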
Word counting program code:
Map side
package com.lucl.hadoop.mapreduce;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// map side
public class CustomizeMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        LongWritable one = new LongWritable(1);
        Text word = new Text();
        StringTokenizer token = new StringTokenizer(value.toString());
        while (token.hasMoreTokens()) {
            String v = token.nextToken();
            word.set(v);
            context.write(word, one);
        }
    }
}
Reduce side
package com.lucl.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// reduce side
public class CustomizeReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (LongWritable intWritable : values) {
            sum += intWritable.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
Driver class
package com.lucl.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

/**
 * @author lucl
 */
public class MyWordCountApp extends Configured implements Tool {
    private static final Logger logger = Logger.getLogger(MyWordCountApp.class);

    public static void main(String[] args) {
        try {
            ToolRunner.run(new MyWordCountApp(), args);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            logger.info("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }

        // each run of this program is submitted as one MR job
        Job job = Job.getInstance(conf, this.getClass().getSimpleName());
        job.setJarByClass(MyWordCountApp.class);

        // specify the input file or directory (the remaining non-generic argument)
        FileInputFormat.addInputPaths(job, otherArgs[0]);

        // map-related settings
        job.setMapperClass(CustomizeMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // reduce-related settings; the reducer doubles as the combiner because
        // summing counts is associative and its input/output types match
        job.setReducerClass(CustomizeReducer.class);
        job.setCombinerClass(CustomizeReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // specify the output directory
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }
}
Running the word counting program:
[hadoop@nnode code]$ hadoop jar WCApp.jar /data /wc-201511290101
15/11/29 00:20:37 INFO client.RMProxy: Connecting to ResourceManager at nnode/192.168.137.117:8032
15/11/29 00:20:38 INFO input.FileInputFormat: Total input paths to process : 2
15/11/29 00:20:39 INFO mapreduce.JobSubmitter: number of splits:2
15/11/29 00:20:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1448694510754_0004
15/11/29 00:20:39 INFO impl.YarnClientImpl: Submitted application application_1448694510754_0004
15/11/29 00:20:39 INFO mapreduce.Job: The url to track the job: http://nnode:8088/proxy/application_1448694510754_0004/
15/11/29 00:20:39 INFO mapreduce.Job: Running job: job_1448694510754_0004
15/11/29 00:21:10 INFO mapreduce.Job: Job job_1448694510754_0004 running in uber mode : false
15/11/29 00:21:10 INFO mapreduce.Job:  map 0% reduce 0%
15/11/29 00:21:41 INFO mapreduce.Job:  map 100% reduce 0%
15/11/29 00:22:01 INFO mapreduce.Job:  map 100% reduce 100%
15/11/29 00:22:02 INFO mapreduce.Job: Job job_1448694510754_0004 completed successfully
15/11/29 00:22:02 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=134
                FILE: Number of bytes written=323865
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=271
                HDFS: Number of bytes written=55
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=55944
                Total time spent by all reduces in occupied slots (ms)=17867
                Total time spent by all map tasks (ms)=55944
                Total time spent by all reduce tasks (ms)=17867
                Total vcore-seconds taken by all map tasks=55944
                Total vcore-seconds taken by all reduce tasks=17867
                Total megabyte-seconds taken by all map tasks=57286656
                Total megabyte-seconds taken by all reduce tasks=18295808
        Map-Reduce Framework
                Map input records=6
                Map output records=12
                Map output bytes=170
                Map output materialized bytes=140
                Input split bytes=188
                Combine input records=12
                Combine output records=8
                Reduce input groups=7
                Reduce shuffle bytes=140
                Reduce input records=8
                Reduce output records=7
                Spilled Records=16
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=315
                CPU time spent (ms)=2490
                Physical memory (bytes) snapshot=510038016
                Virtual memory (bytes) snapshot=2541662208
                Total committed heap usage (bytes)=257171456
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=83
        File Output Format Counters
                Bytes Written=55
[hadoop@nnode code]$
The output of the word counting program:
[hadoop@nnode ~]$ hdfs dfs -ls /wc-201511290101
Found 2 items
-rw-r--r--   2 hadoop hadoop          0 2015-11-29 00:22 /wc-201511290101/_SUCCESS
-rw-r--r--   2 hadoop hadoop         55 2015-11-29 00:21 /wc-201511290101/part-r-00000
[hadoop@nnode ~]$ hdfs dfs -text /wc-201511290101/part-r-00000
2.3        1
fail       1
hadoop     4
hello      3
markhuang  1
ok         1
world      1
[hadoop@nnode ~]$