By Chen Xin, Cloud Wisdom (Beijing) Technology Co., Ltd.
NullWritable
When you do not want to output anything, use it as the key. NullWritable is a special Writable whose serialized length is zero: its implementation methods are empty, it reads no data from the stream and writes none to it, and it serves purely as a placeholder. In MapReduce, if you do not need a key or a value, you can declare that key or value as NullWritable. NullWritable is an immutable singleton.
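For instance, a mapper that only needs to emit values can use NullWritable as its output key. A minimal sketch against the old org.apache.hadoop.mapred API (the class name is illustrative, not from the article):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits each input line under a NullWritable key: only the values matter.
public class ValuesOnlyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<NullWritable, Text> output, Reporter reporter)
      throws IOException {
    // NullWritable.get() returns the immutable singleton instance;
    // it serializes to zero bytes and acts purely as a placeholder.
    output.collect(NullWritable.get(), line);
  }
}
```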
FileInputFormat inherits from InputFormat
The role of InputFormat:
Validate the input specification of the job
Split the input files into logical InputSplits
Provide the RecordReader that extracts input records from each InputSplit and passes them to the Mapper (see the interface sketch below)
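In the old org.apache.hadoop.mapred API, these responsibilities map onto the two methods of the InputFormat interface, shown here in simplified form:

```java
import java.io.IOException;

import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public interface InputFormat<K, V> {
  // Splits the job's input files into logical InputSplits,
  // each of which is assigned to an individual map task.
  InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

  // Provides the RecordReader that extracts key-value records
  // from the given InputSplit for the Mapper to consume.
  RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                     Reporter reporter) throws IOException;
}
```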
RecordReader
Converts the byte-oriented view of an InputSplit into a record-oriented view for the Mapper to consume. The RecordReader thus assumes the responsibility of respecting record boundaries and presents the task with keys and values.
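Conceptually, the framework drives the RecordReader in a loop like the following sketch (simplified; the real map-task runner does more bookkeeping):

```java
import java.io.IOException;

import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Simplified view of how a map task drives a RecordReader (old API):
// the reader turns the byte-oriented split into typed records.
public class MapRunnerSketch {
  public static <K1, V1, K2, V2> void run(
      InputFormat<K1, V1> inputFormat, InputSplit split, JobConf job,
      Mapper<K1, V1, K2, V2> mapper, OutputCollector<K2, V2> output,
      Reporter reporter) throws IOException {
    RecordReader<K1, V1> reader =
        inputFormat.getRecordReader(split, job, reporter);
    K1 key = reader.createKey();     // reusable key instance
    V1 value = reader.createValue(); // reusable value instance
    while (reader.next(key, value)) {    // returns false at end of split
      mapper.map(key, value, output, reporter);
    }
    reader.close();
  }
}
```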
SequenceFile:
SequenceFile is a flat file of serialized binary key-value pairs. It provides Writer, Reader, and Sorter classes for writing, reading, and sorting. Depending on the CompressionType, SequenceFile compresses its key-value pairs in one of three ways:
Writer: records are not compressed
RecordCompressWriter: only values are compressed
BlockCompressWriter: both keys and values are compressed, block by block; the block size is configurable
The compression method is specified by the appropriate CompressionCodec. It is recommended to use this class's static createWriter method to select the format. Reader serves as the bridge and can read any of these compression formats.
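For example, a block-compressed SequenceFile can be created with the static factory method; the path and key/value types here are arbitrary choices for illustration:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // createWriter selects the on-disk format; BLOCK compresses keys and
    // values together in configurable-size blocks.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/demo.seq"),
        Text.class, IntWritable.class,
        SequenceFile.CompressionType.BLOCK);
    writer.append(new Text("key"), new IntWritable(1));
    writer.close();

    // Reader detects the compression format automatically.
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path("/tmp/demo.seq"), conf);
    Text key = new Text();
    IntWritable value = new IntWritable();
    while (reader.next(key, value)) {
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}
```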
CompressionCodec:
Encapsulates the related methods for streaming compression and decompression.
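As a sketch, a codec instantiated the Hadoop way can wrap any output stream with streaming compression; GzipCodec and the output path are illustrative choices:

```java
import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Instantiate the codec via ReflectionUtils so it picks up the configuration.
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    FileSystem fs = FileSystem.get(conf);
    OutputStream raw = fs.create(new Path("/tmp/demo.gz"));
    // createOutputStream wraps the raw stream with streaming compression.
    CompressionOutputStream out = codec.createOutputStream(raw);
    out.write("hello, compressed world\n".getBytes("UTF-8"));
    out.close();
  }
}
```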
Mapper
A Mapper maps input key-value pairs to a set of intermediate key-value pairs. Maps transform input records into intermediate records; the transformed record need not be of the same type as the input record. A given input pair may map to zero or more output pairs.
During MR job execution, the MapReduce framework generates InputSplits (input shards) according to the InputFormat (input format object) specified in advance, and each InputSplit is processed by one map task.
Overall, a Mapper implementation is initialized via the JobConfigurable.configure(JobConf) method, which passes in the job's JobConf object; the framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) once for each key-value pair in the task's InputSplit. MR applications can override the Closeable.close() method to perform any necessary cleanup.
The output pair is not necessarily of the same type as the input pair, and a given input pair may map to zero or many output pairs. Output pairs are emitted by calling OutputCollector.collect(WritableComparable, Writable).
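Putting those pieces together, a minimal old-API Mapper might look like this word-splitting sketch (the tokenizing logic and class name are illustrative, not from the article):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TokenMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void configure(JobConf job) {
    // Called once per task with the job's JobConf; read any
    // parameters the mapper needs here.
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      // One input pair may yield zero or more output pairs.
      output.collect(word, ONE);
    }
    reporter.progress(); // tell the framework the task is alive
  }

  @Override
  public void close() throws IOException {
    // Release any resources acquired in configure().
  }
}
```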
MR applications can use Reporter to report progress, set application-level status information, update counters, or just show that the application is running.
All intermediate data associated with a given output key is subsequently grouped by the framework and passed to the Reducer to produce the final output. Users can specify a Comparator to control the grouping through JobConf.setOutputKeyComparatorClass(Class).
The Mapper output is sorted and then partitioned; the number of partitions equals the number of reduce tasks. Users can control which keys (and hence which records) go to which Reducer by implementing a custom Partitioner.
In addition, the user can specify a Combiner via JobConf.setCombinerClass(Class). A combiner performs local aggregation of the map output, which helps reduce the amount of data transferred from the mapper to the reducer.
The sorted intermediate output is stored in a simple SequenceFile format (key-len, key, value-len, value). The application can decide whether to compress it, and with which CompressionCodec, through the JobConf (see the configuration sketch below).
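Both the combiner and intermediate-output compression are wired up on the JobConf. A sketch, assuming a reducer class named TokenReducer (shown later in the Reducer section) and GzipCodec as an arbitrary codec choice:

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class IntermediateOutputConfig {
  public static JobConf configure() {
    JobConf conf = new JobConf(IntermediateOutputConfig.class);
    // A combiner (often the reducer class itself) aggregates map output
    // locally before it is shuffled to the reducers.
    conf.setCombinerClass(TokenReducer.class); // TokenReducer: see later sketch
    // Compress the intermediate map output and pick the codec.
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);
    return conf;
  }
}
```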
If the job has no reducer, the output of the mapper is written directly to the FileSystem, without grouping or sorting.
Number of Maps
The number of maps is usually driven by the total size of the input data, that is, the total number of blocks of the input files.
The right level of parallelism is normally 10 to 100 maps per node. Since initializing a map task itself takes some time, it is best if each map runs for at least a minute.
Thus, 10 TB of input data with a block size of 128 MB yields about 82,000 maps (10 × 1024 × 1024 MB ÷ 128 MB = 81,920), unless setNumMapTasks(int) (which only provides a hint to the framework) is used to set the number of maps even higher.
Reducer
Reducer reduces the set of intermediate data that share a key to a smaller set of result data.
Users can set the number of reducers for a job through JobConf.setNumReduceTasks(int).
Overall, a Reducer implementation is initialized via the JobConfigurable.configure(JobConf) method, which passes in the job's JobConf object. The MR framework then calls reduce(WritableComparable, Iterator, OutputCollector, Reporter) once for each group of input data sharing a key. The application can override Closeable.close() to handle any necessary cleanup, as in the sketch below.
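A matching old-API Reducer sketch; summing counts is an illustrative choice that pairs with the TokenMapper example above:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TokenReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // Called once per key with an iterator over all values grouped under it.
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
```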
Reducer consists of three main phases: shuffle, sort, and reduce.
Shuffle
The input to the Reducer is the sorted output of the Mappers. In the shuffle phase, the framework locates the relevant mapper addresses via the partition algorithm and pulls the corresponding partition of each mapper's output to the reducer's machine over HTTP.
Sort
At this stage, the framework groups the reducer input by key (since different mappers may have emitted the same key).
The shuffle and sort phases happen simultaneously: the outputs of the maps are merged while they are still being pulled.
Secondary Sort
If the rules for grouping keys of the intermediate data need to differ from the rules for grouping keys before the reduce phase, a Comparator can be set through JobConf.setOutputValueGroupingComparator(Class). Since the grouping policy for intermediate data is set through JobConf.setOutputKeyComparatorClass(Class), which controls by which key the intermediate data is grouped, the two can be used together to simulate a secondary sort on values, for example when joining data sets.
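A sketch of the wiring; FullKeyComparator and NaturalKeyGroupingComparator are hypothetical placeholders for the application's own RawComparator implementations:

```java
import org.apache.hadoop.mapred.JobConf;

public class SecondarySortConfig {
  public static void apply(JobConf conf) {
    // Hypothetical comparator: orders the full composite key
    // (natural key + secondary field), so values reach the reducer
    // in the desired order.
    conf.setOutputKeyComparatorClass(FullKeyComparator.class);
    // Hypothetical comparator: groups reducer input by the natural key
    // only, so one reduce() call sees all values for that key,
    // already secondary-sorted.
    conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);
  }
}
```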
Reduce
In this phase, the framework calls the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method once for each grouped key.
Reduce tasks generally write their output to the FileSystem via OutputCollector.collect(WritableComparable, Writable). Applications can use the Reporter to report job progress, set application-level status messages, update Counters, or simply indicate that they are alive.
Note that the output of the Reducer is not sorted.
Number of Reducers
The appropriate number of reducers can be estimated as (number of nodes × mapred.tasktracker.reduce.tasks.maximum) × 0.95, or × 1.75. With a factor of 0.95, all reducers can launch immediately when the map tasks complete and begin pulling data from the map machines. With a factor of 1.75, the fastest nodes finish a first round of reduces and the framework then launches a second wave of reduce tasks, which achieves better job load balancing. Increasing the number of reducers raises the framework's overhead, but it improves load balancing and lowers the cost of failures. The factors above are best applied on the premise that the framework still has spare reduce slots while the job runs; after all, the framework also needs to perform speculative execution and handle failed tasks. A worked example follows.
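As a worked example of the formula, assume a cluster of 10 nodes, each configured with mapred.tasktracker.reduce.tasks.maximum = 4 (both numbers are illustrative):

```java
import org.apache.hadoop.mapred.JobConf;

public class ReducerCount {
  public static void apply(JobConf conf) {
    int nodes = 10;                // illustrative cluster size
    int maxReduceSlotsPerNode = 4; // mapred.tasktracker.reduce.tasks.maximum
    // Factor 0.95: all reducers can launch as soon as the maps finish.
    // Factor 1.75 would instead give 70, enabling a second, balancing wave.
    int reduces = (int) (nodes * maxReduceSlotsPerNode * 0.95); // = 38
    conf.setNumReduceTasks(reduces);
  }
}
```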
Do not use Reducer
If no reduction is needed, the number of reduce tasks can be set to 0. In this case, the output of the maps is written directly to the file system, at the output path specified by setOutputPath(Path). The framework does not sort the map results before writing the data to the file system.
Partitioner
Partitioner partitions the data by key, thereby controlling which reducer each map output record is transferred to. The default partitioning algorithm hashes the key. The number of partitions is determined by the job's number of reducers. HashPartitioner is the default Partitioner.
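A custom Partitioner in the old API only has to implement getPartition. This sketch partitions by the first character of the key, an arbitrary rule chosen for illustration:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // No configuration needed for this sketch.
  }

  // Maps each key to one of numPartitions reducers; for comparison, the
  // default HashPartitioner uses the hash of the whole key.
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    // Mask off the sign bit so the modulo result is non-negative.
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}
```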
Reporter
Reporter provides MR applications with facilities to report progress, set application-level status messages, and update Counters.
Mapper and Reducer implementations can use the Reporter to report progress or to indicate that they are running normally. In some scenarios, an application may spend a long time processing certain special key-value pairs; without progress reports, the framework assumes the task has timed out and kills it. To avoid this, you can set mapred.task.timeout to a higher value, or to 0 to disable timeouts.
Applications can also use the Reporter to update Counters, as in the sketch below.
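Inside a map or reduce method, the Reporter calls look like this sketch (the counter enum and the malformed-record rule are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ReportingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  // Application-defined counter group (illustrative).
  enum Stats { MALFORMED_RECORDS }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    if (line.getLength() == 0) {
      reporter.incrCounter(Stats.MALFORMED_RECORDS, 1); // update a counter
      return;
    }
    reporter.setStatus("processing offset " + offset); // status message
    reporter.progress(); // heartbeat: prevents the mapred.task.timeout kill
    output.collect(line, new LongWritable(1));
  }
}
```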
OutputCollector
OutputCollector is a general facility provided by the MR framework to collect the data output by the Mapper or the Reducer (whether intermediate data or final results).
Hadoop MapReduce provides some frequently used implementation classes of mapper, reducer, and partitioner for us to learn from.
The configuration and job-related APIs described above have changed in the new API. In short, Mapper and Reducer now use MapContext and ReduceContext respectively (exposed as the Context object), which encapsulate the configuration, the output collector, and the reporter.
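For comparison, here is the earlier word-splitting mapper rewritten against the new org.apache.hadoop.mapreduce API, where the single Context parameter replaces OutputCollector and Reporter (the class name and splitting rule are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewApiTokenMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);   // replaces OutputCollector.collect
      }
    }
    context.setStatus("alive");     // replaces Reporter.setStatus
    // context.getConfiguration() replaces JobConfigurable.configure(JobConf)
  }
}
```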