2025-01-22 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
I. Basic principles
MapReduce needs to slice the input data before the map phase runs, and each slice (an input split) corresponds to one map task. A map task does not process the sliced data directly; it processes key-value (KV) pairs. So there are two questions: how the data is sliced, and how each slice is converted to KV pairs for the map to process.
This involves two abstract classes, InputFormat and RecordReader. For the reasons behind these two abstract classes, see the previous source code analysis of input.
1. InputFormat
    public abstract class InputFormat<K, V> {
        public InputFormat() {}
        public abstract List<InputSplit> getSplits(JobContext var1) throws IOException, InterruptedException;
        public abstract RecordReader<K, V> createRecordReader(InputSplit var1, TaskAttemptContext var2) throws IOException, InterruptedException;
    }
As we can see, this abstract class has only two methods.
getSplits: as the name suggests, it plans how the input data is divided into slices.
createRecordReader: creates the RecordReader object for a given slice.
These are the basic functions of InputFormat.
2. RecordReader
    public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {
        public RecordReader() {}
        // Initialization; generally opens the slice data for reading
        public abstract void initialize(InputSplit var1, TaskAttemptContext var2) throws IOException, InterruptedException;
        // Checks whether there is a next KV pair; if so, reads it and assigns it to this.key and this.value
        public abstract boolean nextKeyValue() throws IOException, InterruptedException;
        // Returns the current key
        public abstract KEYIN getCurrentKey() throws IOException, InterruptedException;
        // Returns the current value
        public abstract VALUEIN getCurrentValue() throws IOException, InterruptedException;
        // Returns the reader's progress through its data
        public abstract float getProgress() throws IOException, InterruptedException;
        // Closes the reader
        public abstract void close() throws IOException;
    }
This abstract class reads the sliced data and turns it into KV pairs. As noted in the input source code analysis, the Mapper.run method obtains each key through a call such as context.getCurrentKey(), which in turn calls the corresponding get method of this RecordReader.
3. The relationship between InputFormat and RecordReader
The source code above shows the division of labor:
InputFormat: responsible for planning the slices and creating RecordReader objects.
RecordReader: responsible for reading the data of the slice assigned to the current mapper according to the slice plan, turning it into KV pairs, and passing them to the mapper through the context.
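The pull loop between the mapper and the RecordReader can be sketched in plain Java. This is a minimal illustration, not Hadoop code: SketchRecordReader, ListRecordReader and MapperLoopSketch are hypothetical names, the slice is just an in-memory list, and only the loop shape (call nextKeyValue(), then getCurrentKey() and getCurrentValue()) mirrors what Mapper.run does through the context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for Hadoop's RecordReader (not the real class).
interface SketchRecordReader<K, V> {
    boolean nextKeyValue();
    K getCurrentKey();
    V getCurrentValue();
}

// Reads one in-memory "slice"; key = record index, value = the record itself.
class ListRecordReader implements SketchRecordReader<Integer, String> {
    private final List<String> slice;
    private int index = -1;

    ListRecordReader(List<String> slice) {
        this.slice = slice;
    }

    // Advances to the next record and reports whether one exists.
    public boolean nextKeyValue() {
        if (index + 1 >= slice.size()) {
            return false;
        }
        index++;
        return true;
    }

    public Integer getCurrentKey() {
        return index;
    }

    public String getCurrentValue() {
        return slice.get(index);
    }
}

class MapperLoopSketch {
    // Same shape as the loop in Mapper.run: pull KV pairs until the reader is exhausted.
    static List<String> run(SketchRecordReader<Integer, String> reader) {
        List<String> mapped = new ArrayList<>();
        while (reader.nextKeyValue()) {
            // A real mapper would call map(key, value, context) here.
            mapped.add(reader.getCurrentKey() + "=" + reader.getCurrentValue());
        }
        return mapped;
    }

    public static void main(String[] args) {
        System.out.println(run(new ListRecordReader(Arrays.asList("a", "b", "c"))));
    }
}
```

The mapper never touches the slice itself; it only sees whatever KV pairs the reader chooses to produce, which is why swapping the InputFormat changes the mapper's input types without changing the loop.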
II. Common implementation classes of InputFormat and RecordReader
The commonly used ones are TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, and custom InputFormat implementations (customization is covered in other articles).
1. TextInputFormat
This is the default InputFormat. It slices by data block, and by default the slice size equals the block size. Every file yields at least one slice, no matter how small it is. Because this class inherits from FileInputFormat, it is sliced using the getSplits() method defined by its parent class.
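The per-file slice planning can be sketched as follows. The article only says that the default slice size is the block size; the max(minSize, min(maxSize, blockSize)) formula and the 1.1 slack factor below reflect my understanding of FileInputFormat and should be read as assumptions here. SplitPlanSketch is a hypothetical name, and the sketch ignores block locations and non-splittable (for example compressed) files.

```java
import java.util.ArrayList;
import java.util.List;

class SplitPlanSketch {
    // Assumed slack factor: a tail of up to 1.1x the slice size is kept as one slice.
    static final double SPLIT_SLOP = 1.1;

    // With default min/max settings this evaluates to the block size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns the byte lengths of the slices planned for a single file.
    static List<Long> planSlices(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            remaining -= splitSize;
        }
        // The tail becomes the last slice; an empty file still gets one empty slice.
        if (remaining != 0 || fileLength == 0) {
            lengths.add(remaining);
        }
        return lengths;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        System.out.println(planSlices(300 * mb, splitSize));
        System.out.println(planSlices(129 * mb, splitSize));
    }
}
```

Under these assumptions a 300 MB file with 128 MB blocks is planned as three slices (128 MB, 128 MB, 44 MB), while a 129 MB file stays in a single slice because the 1 MB tail is within the slack.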
The RecordReader it uses is LineRecordReader. When a slice is turned into KV pairs, each record is one line of input. The key is of type LongWritable and stores the byte offset of the start of the line within the whole file. The value is the content of the line, excluding any line terminator (newline and carriage return).
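To make the key and value concrete, here is a small plain-Java sketch that produces the same kind of records from an in-memory string. LineOffsetSketch is a hypothetical name and this is not the LineRecordReader implementation; it only handles "\n" and "\r\n" terminators and ignores slice boundaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineOffsetSketch {
    // Returns one "byteOffset:lineContent" string per line of the input.
    static List<String> records(String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < bytes.length) {
            // Scan forward to the newline that ends this line (or to end of input).
            int end = start;
            while (end < bytes.length && bytes[end] != '\n') {
                end++;
            }
            // Drop a trailing carriage return so the value has no terminator.
            int textEnd = end;
            if (textEnd > start && bytes[textEnd - 1] == '\r') {
                textEnd--;
            }
            String line = new String(bytes, start, textEnd - start, StandardCharsets.UTF_8);
            out.add(start + ":" + line);
            start = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("ab\ncde\r\nf"));
    }
}
```

Note that the key is a byte offset, not a line number, so it depends on the encoded length of every preceding line.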
2. KeyValueTextInputFormat
This class also slices using the getSplits() method of its parent class FileInputFormat, so it slices in the same way as above.
The RecordReader it uses is KeyValueLineRecordReader. Each line is one record and is divided into a key and a value by a delimiter. You can set the delimiter in the driver class with conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, separator), where separator is the delimiter string you want. The default delimiter is tab (\t).
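The way a line is divided can be sketched in plain Java. KeyValueSplitSketch is a hypothetical name; the behavior shown (split at the first occurrence of the delimiter only, and treat a line without a delimiter as a key with an empty value) is my understanding of KeyValueLineRecordReader rather than something the article states.

```java
class KeyValueSplitSketch {
    // Returns {key, value}; only the first occurrence of the separator splits the line.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            // No separator: the whole line is the key and the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("user1\tclick\thome", '\t');
        System.out.println("key=" + kv[0] + ", value=" + kv[1]);
    }
}
```

Because later delimiters remain part of the value, the mapper can apply its own parsing to the rest of the line.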
3. NLineInputFormat
Although this class inherits from FileInputFormat, it overrides the getSplits method and slices in a different way: by a specified number of lines. With a setting of 5, for example, every 5 lines form one slice, regardless of the size of the data. Set the number of lines per slice through the parameter mapreduce.input.lineinputformat.linespermap.
The RecordReader it uses is LineRecordReader, so records are produced the same way as for TextInputFormat.
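Line-count slicing can be illustrated on an in-memory list of lines. NLineSketch is a hypothetical name, and the real class computes byte ranges in the file rather than copying lines; the sketch only shows how the line count, not the data size, decides the slice boundaries.

```java
import java.util.ArrayList;
import java.util.List;

class NLineSketch {
    // Groups lines into slices of at most linesPerSlice lines each.
    static List<List<String>> slices(List<String> lines, int linesPerSlice) {
        if (linesPerSlice <= 0) {
            throw new IllegalArgumentException("linesPerSlice must be positive");
        }
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerSlice) {
            int end = Math.min(i + linesPerSlice, lines.size());
            out.add(new ArrayList<>(lines.subList(i, end)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 12; i++) {
            lines.add("line" + i);
        }
        // 12 lines with 5 lines per slice gives slices of 5, 5 and 2 lines.
        for (List<String> slice : slices(lines, 5)) {
            System.out.println(slice.size() + " lines: " + slice);
        }
    }
}
```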
4. CombineTextInputFormat
This class inherits from CombineFileInputFormat, which in turn inherits from FileInputFormat. CombineFileInputFormat overrides the getSplits method. With FileInputFormat, every file yields at least one slice no matter how small it is, so a large number of small files results in a large number of slices. Here, slicing is driven by size: small files are grouped together until they reach the specified size, and only then is the group treated as one slice.
The RecordReader it uses is CombineFileRecordReader. It produces records in much the same way as LineRecordReader, except that one slice may span several files, which makes reading slightly more involved.
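The grouping idea can be illustrated with a deliberately simplified greedy packing. This is not the algorithm CombineFileInputFormat actually uses (that one also considers node and rack locality and can break up large files); CombineSketch and its parameters are hypothetical, and the sketch only shows small files being accumulated until a target size is reached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CombineSketch {
    // Packs file sizes into slices; a slice is closed once its total reaches maxSize.
    static List<List<Long>> pack(List<Long> fileSizes, long maxSize) {
        List<List<Long>> slices = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long total = 0;
        for (long size : fileSizes) {
            current.add(size);
            total += size;
            if (total >= maxSize) {
                slices.add(current);
                current = new ArrayList<>();
                total = 0;
            }
        }
        // Whatever is left over forms the final, smaller slice.
        if (!current.isEmpty()) {
            slices.add(current);
        }
        return slices;
    }

    public static void main(String[] args) {
        // Five small files (sizes in MB) with a 4 MB target.
        System.out.println(pack(Arrays.asList(1L, 2L, 3L, 4L, 5L), 4L));
    }
}
```

With FileInputFormat the same five files would give five slices and five map tasks; in this sketch they collapse into three.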
III. Setting the InputFormat to use
    job.setInputFormatClass(xxxInputFormat.class);