This article explains how MapReduce data input is implemented in Hadoop. The explanation is simple and clear and easy to follow; work through it step by step to study how Hadoop realizes the data input side of MapReduce.
Next, we walk through org.apache.hadoop.mapreduce.lib.* in the order data flows through MapReduce and introduce the functions of the corresponding base classes. First comes the input part, which implements the data input side of MapReduce. The class diagram is as follows:
[Class diagram of the input classes; image not reproduced here.]
In the upper right corner of the class diagram sits InputFormat, which describes the input of a MapReduce job. With an InputFormat, Hadoop can (see the sketch after this list):
- Check the correctness of the job's input data;
- Split the input data into logical InputSplits, which are then assigned to Mappers;
- Provide a RecordReader implementation that the Mapper uses to read input key/value pairs from its InputSplit.
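Concretely, this is the contract expressed by the new-API base class org.apache.hadoop.mapreduce.InputFormat. Below is a sketch of it from memory; exact modifiers and throws clauses may differ slightly between Hadoop versions:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Sketch of the InputFormat contract (new MapReduce API).
public abstract class InputFormat<K, V> {

    // Split the job's input into logical InputSplits, one per Mapper.
    public abstract List<InputSplit> getSplits(JobContext context)
        throws IOException, InterruptedException;

    // Create the RecordReader a Mapper uses to pull key/value pairs
    // out of its assigned split.
    public abstract RecordReader<K, V> createRecordReader(
        InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException;
}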
In org.apache.hadoop.mapreduce.lib.input, Hadoop provides an abstract base class, FileInputFormat, for all file-based InputFormats. The following parameters can be used to configure a FileInputFormat (a driver sketch follows the list):
- mapred.input.pathFilter.class: the input file filter; only files that pass the filter are added to the input;
- mapred.min.split.size: the minimum split size;
- mapred.max.split.size: the maximum split size;
- mapred.input.dir: the input paths, separated by commas.
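For illustration, here is a minimal driver sketch that sets these knobs. The raw string keys are the legacy names listed above (newer Hadoop releases expose the same settings under mapreduce.input.fileinputformat.*, and FileInputFormat also has typed setters such as setMinInputSplitSize); the paths and sizes are made up, and Job.getInstance assumes a Hadoop 2.x+ client:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("mapred.min.split.size", 1L);                 // smallest split
        conf.setLong("mapred.max.split.size", 64L * 1024 * 1024);  // largest split

        Job job = Job.getInstance(conf, "input-demo");
        // Equivalent to mapred.input.dir (comma-separated paths):
        FileInputFormat.addInputPath(job, new Path("/data/in1"));
        FileInputFormat.addInputPath(job, new Path("/data/in2"));
        // Equivalent to mapred.input.pathFilter.class (MyFilter would be
        // a hypothetical PathFilter implementation):
        // FileInputFormat.setInputPathFilter(job, MyFilter.class);
    }
}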
The most important methods in the class are:
protected List<FileStatus> listStatus(Configuration job)
Recursively collects all files (with their metadata, as FileStatus objects) under the input directories. The job argument is the Configuration of the running job and carries the parameters listed above.
public List<InputSplit> getSplits(JobContext context)
Divides the input into InputSplits. It consists of two loops: the outer loop iterates over all input files, and for each file the inner loop carves out that file's splits according to the configured maximum/minimum split sizes. Note that a split never crosses a file boundary, as the sketch below shows.
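A simplified sketch of that inner, per-file loop (illustrative, not the actual Hadoop source; the real code also lets the last split run slightly over via a slop factor and records which hosts hold each block):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitSketch {

    // Clamp the file's block size between the configured min/max split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Inner loop: carve one file (from listStatus) into FileSplits.
    // The outer loop in getSplits calls this once per file, which is
    // why splits never cross file boundaries.
    static List<FileSplit> splitsForFile(FileStatus file, long minSize, long maxSize) {
        List<FileSplit> splits = new ArrayList<>();
        long length = file.getLen();
        long splitSize = computeSplitSize(file.getBlockSize(), minSize, maxSize);
        long remaining = length;
        while (remaining > 0) {
            long start = length - remaining;
            long size = Math.min(splitSize, remaining);
            // Real code also passes the hosts holding this range; omitted here.
            splits.add(new FileSplit(file.getPath(), start, size, new String[0]));
            remaining -= size;
        }
        return splits;
    }
}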
FileInputFormat does not implement InputFormat's createRecordReader method.
FileInputFormat has two subclasses. SequenceFileInputFormat reads Hadoop's binary key/value file format, SequenceFile (see hadoop.apache.org/core/do... o/SequenceFile.html), which has its own file layout. Because of its special handling of file extensions, SequenceFileInputFormat overrides listStatus, and it implements createRecordReader, returning a SequenceFileRecordReader object. TextInputFormat processes text files; its createRecordReader returns instances of LineRecordReader. Neither class overrides FileInputFormat's getSplits, so their RecordReaders must take into account how FileInputFormat divides the input. A short usage sketch follows.
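Choosing between the two subclasses on a job looks like this (Job.getInstance assumes a Hadoop 2.x+ client):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class FormatChoice {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format-demo");
        job.setInputFormatClass(TextInputFormat.class);            // plain text files
        // job.setInputFormatClass(SequenceFileInputFormat.class); // binary key/value files
    }
}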
FileInputFormat's getSplits returns FileSplit objects. FileSplit is a very simple class; the attributes it contains (file name, starting offset, split length, and the candidate hosts holding the data) are sufficient to explain what the class does.
RecordReader is used to read key/value pairs from a split. RecordReader has five virtual methods (a skeleton follows the list):
- initialize: initializes the reader; its inputs are the InputSplit the reader works on and the job context;
- nextKey: gets the next input key, returning null when the split has no more records;
- nextValue: gets the value corresponding to the key; it must be called after nextKey;
- getProgress: gets the current progress;
- close: from java.io's Closeable interface, used to clean up the RecordReader.
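For reference, here is that contract as a skeleton interface of our own, mirroring the article's method list; note that the RecordReader which eventually shipped in org.apache.hadoop.mapreduce merges nextKey/nextValue into nextKeyValue() plus getCurrentKey()/getCurrentValue():

import java.io.Closeable;
import java.io.IOException;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Skeleton of the five-method contract described above (our own
// interface, not the shipped Hadoop class).
public interface SplitRecordReader<K, V> extends Closeable {

    // Bind the reader to its data split and the job context.
    void initialize(InputSplit split, TaskAttemptContext context)
        throws IOException, InterruptedException;

    // Advance to the next key; null when the split has no more records.
    K nextKey() throws IOException, InterruptedException;

    // Value for the key returned by the preceding nextKey() call.
    V nextValue() throws IOException, InterruptedException;

    // Fraction of the split consumed so far, in [0, 1].
    float getProgress() throws IOException, InterruptedException;

    // close() is inherited from java.io.Closeable, for cleanup.
}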
We take LineRecordReader as an example to analyze how a RecordReader is put together. As analyzed above, each split produced by FileInputFormat records the file name, the starting offset, and the length of the split. Since the file is a text file, LineRecordReader's initialize method creates a line-oriented reader, LineReader (defined in org.apache.hadoop.util, so we won't dissect it), and skips the first part of the input, but only when the split starts at a non-zero offset, because those bytes may belong to the last line of the previous split. nextKey is easy to handle: it uses the current offset as the key. nextValue is, of course, the line starting at that offset (very long lines may be truncated). getProgress and close are simple. A standalone sketch of the skip-and-read logic is shown below.
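This sketch reproduces the skip-and-read behavior over a plain local file, with made-up file name and offsets (the real LineRecordReader additionally backs the start position up by one byte before skipping, so that a split beginning exactly on a line boundary is handled correctly; that refinement is omitted here):

import java.io.IOException;
import java.io.RandomAccessFile;

public class LineSkipSketch {
    public static void main(String[] args) throws IOException {
        long splitStart = 100;  // hypothetical split offset
        long splitEnd = 200;    // hypothetical split end (exclusive)
        try (RandomAccessFile in = new RandomAccessFile("input.txt", "r")) {
            long pos = splitStart;
            in.seek(pos);
            if (pos != 0) {
                // The bytes up to the first newline may belong to the last
                // line of the previous split, so discard them.
                in.readLine();
                pos = in.getFilePointer();
            }
            // Emit (offset, line) pairs. A line that starts before splitEnd
            // is read in full even if it runs past the boundary; the next
            // split's skip step drops it there.
            while (pos < splitEnd) {
                String line = in.readLine();
                if (line == null) break;               // end of file
                System.out.println(pos + "\t" + line); // key = offset, value = line
                pos = in.getFilePointer();
            }
        }
    }
}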
Thank you for reading; that concludes "how to implement MapReduce data input in Hadoop". After studying this article, you should have a deeper understanding of how MapReduce data input is implemented in Hadoop, though concrete usage still needs to be verified in practice.