What is the output format of MapReduce


This article mainly discusses the output formats of MapReduce. The material introduced here is simple, fast, and practical, so if you are interested, read on and learn what the output formats of MapReduce are.

Output format of MapReduce

Corresponding to its input formats, Hadoop provides a set of output formats. By default a job runs a single Reduce task, so the output consists of a single file, named part-r-00000 by default; the number of output files always matches the number of Reduce tasks. With two Reduce tasks there are two output files, the first named part-r-00000 and the second part-r-00001, and so on.
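As a minimal sketch, the number of reduce tasks, and with it the number of output files, is set on the Job in the driver:

Job job = Job.getInstance(new Configuration());
job.setNumReduceTasks(2);   // the job now produces part-r-00000 and part-r-00001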

OutputFormat interface

OutputFormat describes the format of the output data: it writes the key/value pairs produced by the user to files in a specific format. You can implement a custom output format through the OutputFormat interface, but the process is somewhat involved and usually unnecessary, because Hadoop comes with many OutputFormat implementations, which correspond to the InputFormat implementations and are sufficient for most business needs. The main implementations in the OutputFormat class hierarchy are described below.

OutputFormat is the base class for MapReduce output, and every MapReduce output implementation implements the OutputFormat interface. These implementations can be divided into the following types, introduced one by one.

Text output

The default output format is TextOutputFormat, which writes each record as a line of text. Its keys and values may be of any type, because TextOutputFormat calls toString() to convert them to strings. Each key and value is separated by a tab character, although you can change the default delimiter by setting the mapreduce.output.textoutputformat.separator property (mapred.textoutputformat.separator in the old API). The input format corresponding to TextOutputFormat is KeyValueTextInputFormat, which splits lines into key/value pairs on a configurable separator.
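For example, a minimal driver sketch that switches the separator from a tab to a comma:

Configuration conf = new Configuration();
conf.set("mapreduce.output.textoutputformat.separator", ",");   // use a comma instead of a tab
Job job = Job.getInstance(conf);
job.setOutputFormatClass(TextOutputFormat.class);               // TextOutputFormat is the default anyway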

You can use NullWritable to omit the key or the value from the output, or both, which is equivalent to the NullOutputFormat output format and produces no output at all. Omitting the key or value also suppresses the separator, which makes the output suitable for reading back in with TextInputFormat.
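As a sketch, a job whose reducer only needs to emit keys declares NullWritable as its output value type:

// in the driver:
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);

// in the reducer, write only the key; the line contains no separator and no value
context.write(key, NullWritable.get());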

Binary output

1. About SequenceFileOutputFormat

As the name implies, SequenceFileOutputFormat writes its output as sequence files. If the output will be used as the input of a subsequent MapReduce job, this is a good output format, because it is compact and easily compressed.
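A minimal driver sketch, assuming the job's output keys and values are Writable types, that enables block compression for the sequence files:

job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);                                       // turn on output compression
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);  // compress blocks of records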

2. About SequenceFileAsBinaryOutputFormat

SequenceFileAsBinaryOutputFormat writes key/value pairs into a SequenceFile container in raw binary format.
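A sketch of the driver settings, assuming the reducer emits BytesWritable keys and values whose serialized contents are actually Text and IntWritable (those type choices are made up for illustration):

job.setOutputFormatClass(SequenceFileAsBinaryOutputFormat.class);
// declare the key/value classes recorded in the SequenceFile header
SequenceFileAsBinaryOutputFormat.setSequenceFileOutputKeyClass(job, Text.class);
SequenceFileAsBinaryOutputFormat.setSequenceFileOutputValueClass(job, IntWritable.class);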

3. About MapFileOutputFormat

MapFileOutputFormat writes MapFiles as its output. The keys in a MapFile must be added in order, so you must make sure that your reducer emits keys in sorted order.
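A minimal sketch of selecting this format in the driver; since keys arrive at the reducer already sorted, emitting them in the order they are received satisfies the MapFile requirement:

job.setOutputFormatClass(MapFileOutputFormat.class);
// each reducer now produces a MapFile (an index file plus a data file) instead of a flat part file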

Multiple output

As mentioned above, there is only one Reduce by default and the output has only one file. Sometimes you may need to control the file names of the output, or have each reducer write more than one file. There are two ways to make a reducer produce multiple output files.

1. Partitioner

Consider the need to write data to different file paths according to students' ages. Here we divide the records into three age groups: 20 or younger, over 20 and up to 50, and over 50.

The approach is to have each age group handled by its own reducer, which takes the following two steps.

Step 1: set the number of reducers to the number of age groups, that is, 3.

job.setPartitionerClass(PCPartitioner.class);   // set the Partitioner class
job.setNumReduceTasks(3);                       // set the number of reduce tasks to 3

Step 2: write a Partitioner that puts records from the same age group into the same partition.

public static class PCPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        String[] nameAgeScore = value.toString().split("\t");
        String age = nameAgeScore[1];          // student age
        int ageInt = Integer.parseInt(age);

        // default to partition 0 when there are no reduce tasks
        if (numReduceTasks == 0)
            return 0;

        // partition by age group
        if (ageInt <= 20) {                          // age 20 or under
            return 0;
        } else if (ageInt > 20 && ageInt <= 50) {    // over 20 and up to 50
            return 1 % numReduceTasks;
        } else {                                     // over 50
            return 2 % numReduceTasks;
        }
    }
}

2. MultipleOutputs

The second way is the MultipleOutputs class, which lets a reducer write data to more than one file. The example below counts email addresses by mail provider (qq, 163, gmail and so on) and writes the count for each provider to its own output file.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Email extends Configured implements Tool {

    public static class MailMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, one);          // each line of input is an email address
        }
    }

    public static class MailReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        private MultipleOutputs<Text, IntWritable> multipleOutputs;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            multipleOutputs = new MultipleOutputs<Text, IntWritable>(context);
        }

        @Override
        protected void reduce(Text Key, Iterable<IntWritable> Values, Context context)
                throws IOException, InterruptedException {
            int begin = Key.toString().indexOf("@");
            int end = Key.toString().indexOf(".");
            if (begin >= end) {
                return;
            }
            // get the mail provider, such as qq
            String name = Key.toString().substring(begin + 1, end);
            int sum = 0;
            for (IntWritable value : Values) {
                sum += value.get();
            }
            result.set(sum);
            multipleOutputs.write(Key, result, name);   // use the provider as the base output name
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            multipleOutputs.close();
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();               // read the configuration file
        Path mypath = new Path(args[1]);
        FileSystem hdfs = mypath.getFileSystem(conf);
        if (hdfs.isDirectory(mypath)) {                         // delete the output path if it already exists
            hdfs.delete(mypath, true);
        }

        Job job = Job.getInstance();                            // create a new job
        job.setJarByClass(Email.class);                         // main class
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path

        job.setMapperClass(MailMapper.class);                   // Mapper
        job.setReducerClass(MailReducer.class);                 // Reducer
        job.setOutputKeyClass(Text.class);                      // key output type
        job.setOutputValueClass(IntWritable.class);             // value output type

        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        String[] args0 = {
            "hdfs://single.hadoop.dajiangtai.com:9000/junior/mail.txt",
            "hdfs://single.hadoop.dajiangtai.com:9000/junior/mail-out/"
        };
        int ec = ToolRunner.run(new Configuration(), new Email(), args0);
        System.exit(ec);
    }
}

In the reducer, construct an instance of MultipleOutputs in the setup() method and assign it to an instance field. In the reduce() method, write the output through the MultipleOutputs instance instead of through the context; its write() method takes the key, the value, and a base output name.

After the program runs, the output files are named as follows.

/mail-out/163-r-00000
/mail-out/126-r-00000
/mail-out/21cn-r-00000
/mail-out/gmail-r-00000
/mail-out/qq-r-00000
/mail-out/sina-r-00000
/mail-out/sohu-r-00000
/mail-out/yahoo-r-00000
/mail-out/part-r-00000

The base path given to the write() method of MultipleOutputs is interpreted relative to the output directory. Because it may contain the file path separator (/), it can create subdirectories of any depth.
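For example (a sketch built on the reducer above), appending a directory separator to the base name groups each provider's files in its own subdirectory:

// writes to mail-out/qq/part-r-00000, mail-out/163/part-r-00000, and so on
multipleOutputs.write(Key, result, name + "/part");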

Database output

DBOutputFormat is suitable for writing job output data (moderate volumes) to relational databases such as MySQL or Oracle.
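A minimal driver sketch, assuming a MySQL table named mail_count with columns domain and cnt (the table, column names, URL, and credentials are all made up for illustration); note that the job's output key class must implement DBWritable:

Configuration conf = new Configuration();
DBConfiguration.configureDB(conf,
        "com.mysql.jdbc.Driver",              // JDBC driver class
        "jdbc:mysql://localhost:3306/test",   // connection URL (assumed)
        "user", "password");                  // credentials (assumed)

Job job = Job.getInstance(conf);
job.setOutputFormatClass(DBOutputFormat.class);
DBOutputFormat.setOutput(job, "mail_count", "domain", "cnt");   // table and column names (hypothetical)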

At this point, you should have a deeper understanding of what the output formats of MapReduce are. Go ahead and try them out in practice, and keep learning!
