
Multi-file output in Hadoop with the old and new APIs

2025-01-20 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article introduces multi-file output in Hadoop MapReduce, showing how to do it with both the old and the new API. Many people run into this problem in practice, so I hope you read it carefully and get something out of it!

Generally speaking, a Map/Reduce job writes a single set of output files, but in some cases we need to produce several sets, such as the requirement I mentioned above. Below I use the old and new APIs in turn to show how to achieve multi-file output.

Old API:

MultipleTextOutputFormat is the key class here. All we have to do is write a class that extends MultipleTextOutputFormat and overrides its generateFileNameForKeyValue(Object key, Object value, String name) method. MultipleTextOutputFormat has a write method that writes each record to HDFS, and inside that method generateFileNameForKeyValue is called to decide the file name. Without further ado, here is the code:

```java
public class MultiFileOutputFormat extends MultipleTextOutputFormat<Object, Object> {
    @Override
    protected String generateFileNameForKeyValue(Object key, Object value, String name) {
        if (key instanceof OutputFileName) {
            // Keys that carry their own directory are routed there
            return ((OutputFileName) key).getPath() + "/" + name;
        } else {
            return super.generateFileNameForKeyValue(key, value, name);
        }
    }
}
```

OutputFileName is an enum I defined to keep the output paths easy to manage; you could also return a path string directly here. Here is the code for OutputFileName:

```java
public enum OutputFileName {
    ERRORLOG("errorlog", "logtype=errorlog"),
    APIREQUEST("apiRequest", "logtype=apiRequest"),
    FIRSTINTOTIME("firstIntoTime", "logtype=firstIntoTime"),
    TABFLUSHTIME("tabFlushTime", "logtype=tabFlushTime"),
    PERFORMANCE("performance", "logtype=performance"),
    FILEREQUEST("fileRequest", "logtype=fileRequest");

    private String name;
    private String path;
    private String tempPath;

    private OutputFileName(String name, String path) {
        this.name = name;
        this.path = path;
    }

    public String getName() {
        return this.name;
    }

    public String getPath() {
        // If a temporary override path has been set, use it once and clear it
        if (!StringUtil.isEmpty(tempPath)) {
            String temp = this.tempPath;
            this.tempPath = null;
            return temp;
        } else {
            return this.path;
        }
    }
}
```

How do we use our MultiFileOutputFormat class? Just like this:

```java
// In the main method of the job class:
JobConf conf = new JobConf(config, XXX.class);
conf.setOutputFormat(MultiFileOutputFormat.class);

// In the map function:
collector.collect(OutputFileName.ERRORLOG, new Text(log));
```

After doing the above, this example writes its data into the logtype=errorlog directory. Of course, you can set a different output directory for each kind of log.
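To make the routing rule above concrete, here is a minimal, self-contained sketch (hypothetical class and method names, independent of Hadoop) of what generateFileNameForKeyValue effectively computes: a key that carries its own directory yields "directory/leafName", and any other key keeps the default leaf name that Hadoop passes in.

```java
public class OutputPathDemo {
    // Hypothetical standalone mirror of the generateFileNameForKeyValue logic:
    // keyDir plays the role of OutputFileName.getPath(); leafName is the
    // default file name Hadoop supplies (e.g. "part-00000").
    static String fileNameFor(String keyDir, String leafName) {
        if (keyDir != null) {
            return keyDir + "/" + leafName;   // route into the key's directory
        }
        return leafName;                      // fall back to the default name
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor("logtype=errorlog", "part-00000")); // logtype=errorlog/part-00000
        System.out.println(fileNameFor(null, "part-00000"));               // part-00000
    }
}
```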

New API:

With the new API, I could not find a MultipleTextOutputFormat class, which was a headache. I even read the source code and ported the old API's MultipleTextOutputFormat myself, which took a lot of work: I had to write a class extending RecordWriter and override its methods. It could write data to different paths, but it had a bug. With a lot of data, only part of the records under each path were kept. I ran a test, and all records really were written, but only the last portion survived under the configured path; only about 600,000 lines of records were kept. I never found the cause, so I am not showing that code here.

Of course, I still found a way. After much searching I finally turned up the relevant information online: use the MultipleOutputs class. Checking the API, it really is there, under the org.apache.hadoop.mapreduce.lib.output package. This class essentially repackages what the old API offered, so we no longer have to write our own class extending MultipleTextOutputFormat. How to use it? Look at the code:

```java
public class TestJob {

    public static class MapperClass extends Mapper<Object, Text, Text, NullWritable> {
        private Text outkey = new Text("");
        private MultipleOutputs<Text, NullWritable> mos;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            mos = new MultipleOutputs<Text, NullWritable>(context);
            super.setup(context);
        }

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String log = value.toString();
            outkey.set(log);
            int begin = log.indexOf("@[#(");
            if (begin != -1) {
                String logForSplit = log.substring(begin + "@[#(".length());
                String[] split = logForSplit.split("#");
                if (split != null && split.length > 0) {
                    String cType = split[0];
                    if (!StringUtil.isEmpty(cType)) {
                        if ("apiRequest".equals(cType)) {
                            mos.write("apiRequest", outkey, NullWritable.get());
                        } else if ("errlog".equals(cType)) {
                            mos.write("errorlog", outkey, NullWritable.get());
                        }
                    }
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
            super.cleanup(context);
        }
    }

    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "ss");
        job.setInputFormatClass(TrackInputFormat.class); // the author's custom input format
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setJarByClass(TestJob.class);
        job.setMapperClass(TestJob.MapperClass.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // inputPaths is the author's list of input directories, configured elsewhere
        if (inputPaths.length > 0) {
            Path[] paths = new Path[inputPaths.length];
            for (int i = 0; i < inputPaths.length; i++) {
                paths[i] = new Path(inputPaths[i]);
            }
            FileInputFormat.setInputPaths(job, paths);
        } else {
            FileInputFormat.setInputPaths(job, new Path(args[0]));
        }
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        MultipleOutputs.addNamedOutput(job, "errorlog",
                TextOutputFormat.class, Text.class, NullWritable.class);
        MultipleOutputs.addNamedOutput(job, "apiRequest",
                TextOutputFormat.class, Text.class, NullWritable.class);
        job.waitForCompletion(true); // the original snippet never submitted the job
    }
}
```

OK, that is all there is to it. First, define a MultipleOutputs field in the mapper class, and override setup and cleanup to create and close it, respectively. Most importantly, register each file name (errorlog, apiRequest, and so on) as a named output in the job class.

The two examples above differ slightly. The first writes the data into different directories, while the second writes into the same directory but splits the records into different types of files, as in the listing I captured:

-rw-r--r-- 2 hadoop supergroup 10569073 2014-06-06 11:50 /test/aa/fileRequest-m-00063.lzo
-rw-r--r-- 2 hadoop supergroup 10512656 2014-06-06 11:50 /test/aa/fileRequest-m-00064.lzo
-rw-r--r-- 2 hadoop supergroup    68780 2014-06-06 11:51 /test/aa/firstIntoTime-m-00000.lzo
-rw-r--r-- 2 hadoop supergroup    67901 2014-06-06 11:51 /test/aa/firstIntoTime-m-00001.lzo

As for writing to different directories with the new API, that still needed further study at the time. This approach also has a drawback: it produces a lot of empty part files, such as:

-rw-r--r-- 2 hadoop supergroup       42 2014-06-06 11:50 /test/aa/part-m-00035.lzo (an empty file)
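On both of these points the new API does offer help that I did not try at the time; treat the following as a hedged sketch under the same job and mapper setup as above, not tested code. MultipleOutputs.write has a four-argument overload that takes a baseOutputPath, which lets each named output land in its own subdirectory of the job output, and LazyOutputFormat only creates an output file when the first record is actually written, which suppresses the empty part files:

```java
// Sketch only, assuming the same job and mapper as above (not tested here).

// In the job setup: wrap the real output format so that empty part files
// are not created for map tasks that never emit a record.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

// In the map function: the overload taking a baseOutputPath routes this
// named output into a subdirectory under the job's output directory.
mos.write("errorlog", outkey, NullWritable.get(), "logtype=errorlog/part");
```

Both classes live alongside MultipleOutputs in org.apache.hadoop.mapreduce.lib.output.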

This concludes the discussion of multi-file output in Hadoop with the old and new APIs. Thank you for reading. If you want to learn more about the industry, you can follow this site, where the editor will keep publishing practical articles for you!
