What are the main interfaces of MapReduce

2025-01-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article introduces the main interfaces of MapReduce and what each one is used for.

(1) InputFormat interface

The user implements this interface to specify the content format of the input files. The interface has two methods:

public interface InputFormat<K, V> {
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
    RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException;
}

The getSplits function divides all input data into splits (numSplits is a hint for the desired number), and each split is handed to one map task for processing. The getRecordReader function returns an iterator-like object that parses a split, turning each record in the split into a key/value pair.
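
The splitting step can be sketched in plain Java without any Hadoop classes. This is only an illustration under simplifying assumptions: ByteRangeSplit and SplitSketch are hypothetical names, a single file is divided into equal byte ranges, and block sizes and record boundaries are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an input split: one byte range of a single file.
final class ByteRangeSplit {
    final long start;
    final long length;

    ByteRangeSplit(long start, long length) {
        this.start = start;
        this.length = length;
    }
}

final class SplitSketch {
    // Divide fileLength bytes into at most numSplits contiguous ranges.
    static List<ByteRangeSplit> getSplits(long fileLength, int numSplits) {
        List<ByteRangeSplit> splits = new ArrayList<>();
        if (fileLength <= 0 || numSplits <= 0) {
            return splits;
        }
        // Ceiling division so that the ranges cover the whole file.
        long splitSize = (fileLength + numSplits - 1) / numSplits;
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new ByteRangeSplit(start, length));
        }
        return splits;
    }
}
```

For example, SplitSketch.getSplits(100, 3) yields three ranges starting at bytes 0, 34 and 68; each range would then be processed by its own map task.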

Hadoop itself provides several InputFormat implementations:

TextInputFormat

The default input format, used to read plain text files. The file is divided into a series of lines ending with LF or CR. The key is the position offset of each line, of type LongWritable, and the value is the content of each line, of type Text.
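
A minimal sketch of the key/value pairs this produces, written in plain Java rather than against the Hadoop API. LineRecordSketch is a hypothetical name, only LF line endings are handled, and the text is assumed to be ASCII so that character positions equal byte offsets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LineRecordSketch {
    // Map each line's starting offset (the key) to the line's content (the value).
    static Map<Long, String> readRecords(String text) {
        Map<Long, String> records = new LinkedHashMap<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = text.indexOf('\n', pos);
            if (end < 0) {
                end = text.length(); // last line without a trailing LF
            }
            records.put((long) pos, text.substring(pos, end));
            pos = end + 1; // skip past the line terminator
        }
        return records;
    }
}
```

For the input "a\nbb\nccc" the keys are 0, 2 and 5, and the values are "a", "bb" and "ccc".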

KeyValueTextInputFormat

Also used to read text files. Each line is split into two parts at a delimiter (tab by default): the first part is the key and the rest is the value. If the line contains no delimiter, the whole line is the key and the value is empty.
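
The line-splitting rule can be sketched as follows. This is plain Java for illustration, not the Hadoop implementation; KeyValueLineSketch is a hypothetical name.

```java
final class KeyValueLineSketch {
    // Split a line at the first delimiter; returns {key, value}.
    static String[] split(String line, char delimiter) {
        int idx = line.indexOf(delimiter);
        if (idx < 0) {
            // No delimiter: the whole line is the key and the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, idx), line.substring(idx + 1) };
    }
}
```

Only the first delimiter matters, so a line such as "k<TAB>v1<TAB>v2" gives the key "k" and the value "v1<TAB>v2".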

SequenceFileInputFormat

Used to read sequence files. A sequence file is a binary file that Hadoop uses to store data in a custom format. It has two subclasses: SequenceFileAsBinaryInputFormat, which reads keys and values as BytesWritable, and SequenceFileAsTextInputFormat, which reads keys and values as Text.

SequenceFileInputFilter

Reads from a sequence file only the records that satisfy a filter. The filter is specified through setFilterClass. There are three built-in filters: RegexFilter keeps the records whose key matches the specified regular expression; PercentFilter takes a parameter f and keeps the records whose record number % f == 0; MD5Filter takes a parameter f and keeps the records for which MD5(key) % f == 0.
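
The first two filter conditions can be sketched as simple predicates. This is plain Java under assumptions: FilterSketch and its method names are hypothetical, and the regular expression is assumed to have to match the whole key. An MD5-based filter would follow the same pattern, with a hash of the key in place of the record number.

```java
import java.util.regex.Pattern;

final class FilterSketch {
    // RegexFilter-like check: keep the record if its key matches the expression.
    static boolean regexAccept(String regex, String key) {
        return Pattern.compile(regex).matcher(key).matches();
    }

    // PercentFilter-like check: keep every f-th record.
    static boolean percentAccept(long recordNumber, int f) {
        return recordNumber % f == 0;
    }
}
```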

NLineInputFormat

Splits a file by lines, for example one map task for each line of the file. The resulting key is the position offset of each line (LongWritable type) and the value is the content of each line (Text type).
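
Grouping lines into splits of N lines each might look like the sketch below. It is plain Java for illustration; NLineSketch is a hypothetical name and the whole input is assumed to fit in memory.

```java
import java.util.ArrayList;
import java.util.List;

final class NLineSketch {
    // Group lines into consecutive chunks of n lines; each chunk goes to one map task.
    static List<List<String>> group(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            int end = Math.min(i + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(i, end)));
        }
        return splits;
    }
}
```

With n = 1 every line becomes its own split, which corresponds to the "one map per line" case.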

MultipleInputs

Used to read multiple data sources in one job, for example to join them.

(2) Mapper interface

Users implement their own Mapper by inheriting from Mapper.

Mapper has four methods: setup(), map(), cleanup() and run(). setup() is generally used for preparatory work before map() is called; map() does the main processing; cleanup() does the finishing work after all map() calls, such as closing files or emitting the remaining key/value pairs. The run() method provides the execution template setup() -> map() -> cleanup().
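
The execution template can be illustrated with a simplified stand-in class. MapperSketch and UpperCaseMapper are hypothetical names, and the real Hadoop Mapper passes a Context object instead of the plain list used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified stand-in for a Mapper; no Hadoop classes are used.
abstract class MapperSketch {
    public final List<String> output = new ArrayList<>();

    protected void setup() { }

    protected abstract void map(long key, String value);

    protected void cleanup() { }

    // Template method: setup once, map once per record, cleanup once.
    public final void run(List<String> records) {
        setup();
        try {
            for (int i = 0; i < records.size(); i++) {
                map(i, records.get(i));
            }
        } finally {
            cleanup();
        }
    }
}

final class UpperCaseMapper extends MapperSketch {
    @Override
    protected void setup() {
        output.add("setup");
    }

    @Override
    protected void map(long key, String value) {
        output.add(value.toUpperCase(Locale.ROOT));
    }

    @Override
    protected void cleanup() {
        output.add("cleanup");
    }
}
```

Running new UpperCaseMapper().run(...) on the records "a" and "b" leaves setup, A, B, cleanup in output, which shows the order in which the hooks are called.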

(3) Partitioner interface

Users inherit this API to implement their own Partitioner, which decides which reduce task handles each key/value pair generated by a map task. A good Partitioner gives every reduce task a similar amount of data, thus achieving load balancing. The function to be implemented in Partitioner is

int getPartition(K2 key, V2 value, int numPartitions)

This function returns the ID of the corresponding reduce task.

If the user does not provide a Partitioner, Hadoop uses the default one (in effect a hash function on the key).
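
A hash-based default of this kind can be sketched in plain Java as follows. HashPartitionSketch is a hypothetical name; the key's hashCode() is reduced to a partition number in the range 0 to numPartitions - 1.

```java
final class HashPartitionSketch {
    // Map a key to a partition number in [0, numPartitions).
    static int getPartition(Object key, int numPartitions) {
        // Clear the sign bit so that a negative hash code cannot yield a negative partition.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Equal keys always land in the same partition, which is what guarantees that all values for a key reach the same reduce task.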

How to use Partitioner

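Extend Partitioner and override the getPartition() method

Partitioner example

public static class MyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Must return a value in [0, numPartitions); hashing the key is one common choice.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}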

Sample Partitioner requirements

Requirement description

The data file contains a province field

Records for the same province need to be sent to the same reduce task

As a result, each province is written to a different output file

Steps

Implement Partitioner and override getPartition

Partition based on the province field
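
These steps can be sketched in plain Java as follows. The record layout is an assumption (tab-separated, with the province in the first field), and ProvincePartitionSketch and the province names in the usage note are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ProvincePartitionSketch {
    private final Map<String, Integer> partitionByProvince = new LinkedHashMap<>();

    // Known provinces get the slots 0, 1, 2, ... in the order given.
    ProvincePartitionSketch(String... provinces) {
        for (String province : provinces) {
            partitionByProvince.putIfAbsent(province, partitionByProvince.size());
        }
    }

    // Assumed record layout: province<TAB>rest-of-record.
    int getPartition(String record, int numPartitions) {
        String province = record.split("\t", 2)[0];
        Integer slot = partitionByProvince.get(province);
        // Unknown provinces fall back to a hash of the province name.
        int index = (slot != null) ? slot : (province.hashCode() & Integer.MAX_VALUE);
        return index % numPartitions;
    }
}
```

With new ProvincePartitionSketch("Hebei", "Hunan", "Yunnan") and three reduce tasks, every "Hunan" record goes to partition 1. For each province to end up in its own file, the job needs at least as many reduce tasks as there are provinces.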

(4) Combiner

The combine function merges the multiple <key, value> pairs generated by one map function into new <key, value> pairs. The new pairs are then passed as input to the reduce function, in the same format the reduce function already expects. A Combiner greatly reduces the amount of data transferred between map tasks and reduce tasks, which can significantly improve performance. In most cases, the Combiner is the same as the Reducer.

Under what circumstances can I use Combiner

Scenarios in which records can be aggregated step by step, such as summation.

Averaging cannot use a Combiner directly, because the average of partial averages is generally not the overall average.
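
The difference between the two cases can be checked with a small plain-Java sketch. CombinerSketch and its method names are hypothetical, and the numbers in the usage note are made up for illustration.

```java
final class CombinerSketch {
    static long sum(long... values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    static double average(double... values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.length;
    }

    // Merge partial {sum, count} pairs and return the overall average.
    static double averageFromPartials(long[][] partials) {
        long total = 0;
        long count = 0;
        for (long[] p : partials) {
            total += p[0];
            count += p[1];
        }
        return (double) total / count;
    }
}
```

Suppose one map task emits 1, 2, 3 and another emits 10 for the same key. Summing the partial sums 6 and 10 gives 16, the same as summing all four values. Averaging the partial averages 2 and 10 gives 6, while the true average is 4. One common workaround is to have the combiner emit (sum, count) pairs such as (6, 3) and (10, 1) and let the reducer divide at the end, which gives 4 again.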

Timing of Combiner execution

The combiner function may run before or after the merge is completed. The timing is controlled by one parameter, min.num.spills.for.combine (default 3).

When a combiner is set in the job and there are at least 3 spills, the combiner function runs before the merge produces the result file.

This reduces the amount of data written to the disk file when there are many spills to merge and a lot of data to combine. It also reduces the frequency of disk reads and writes, which can help optimize the job.

The Combiner is not guaranteed to be executed: Hadoop takes the load of the cluster at that time into account, so a job must produce correct results whether or not the Combiner runs.
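
The threshold rule can be written down as a one-line check. This is a sketch of the condition as described in the text, not Hadoop's own code; CombineTimingSketch and its members are hypothetical names.

```java
final class CombineTimingSketch {
    // Default threshold named in the text.
    static final int DEFAULT_MIN_SPILLS_FOR_COMBINE = 3;

    // True if the combiner would run while the spill files are being merged.
    static boolean runsCombinerDuringMerge(boolean combinerSet, int numSpills, int minSpillsForCombine) {
        return combinerSet && numSpills >= minSpillsForCombine;
    }
}
```

With the default threshold, two spills are merged without running the combiner again, while three or more spills trigger it.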

How to use Combiner

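Inherit the Reducer class

public static class Combiner extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Merge the values for this key and emit the result with context.write(...).
    }
}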

(5) Reducer interface

To implement your own Reducer, you must implement the reduce function

(6) OutputFormat

The user specifies the content format of the output files through OutputFormat. Unlike InputFormat, it has no splits. Each reduce task writes its data to its own file, named part-nnnnn, where nnnnn is the ID of the reduce task.

public abstract class OutputFormat<K, V> {
    /**
     * Create a record writer.
     */
    public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException;

    /**
     * Check whether the output location for the results is valid.
     */
    public abstract void checkOutputSpecs(JobContext context) throws IOException, InterruptedException;

    /**
     * Create the output committer for the task.
     */
    public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context) throws IOException, InterruptedException;
}
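
The part-nnnnn naming described above can be sketched as a zero-padded format string. PartFileNameSketch is a hypothetical name, and the exact file name may differ between Hadoop versions (some add an extra marker between "part" and the number).

```java
final class PartFileNameSketch {
    // Build the output file name for a reduce task: "part-" plus a 5-digit, zero-padded ID.
    static String partFileName(int reduceTaskId) {
        return String.format("part-%05d", reduceTaskId);
    }
}
```

Reduce task 0 would write to part-00000 and reduce task 12 to part-00012.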

TextOutputFormat, writes plain text files in the format key + "\t" + value.

NullOutputFormat, Hadoop's equivalent of /dev/null, which sends the output to a black hole.

SequenceFileOutputFormat, writes the output as a sequence file.

MultipleSequenceFileOutputFormat, MultipleTextOutputFormat, write records to different files according to the key.

DBInputFormat and DBOutputFormat, read from a database and write to a database.
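
A one-line sketch of such an output record, assuming a tab as the separator between key and value; TextLineSketch is a hypothetical name.

```java
final class TextLineSketch {
    // Format one output record as key, separator, value (a tab is assumed as the separator).
    static String formatLine(Object key, Object value) {
        return key + "\t" + value;
    }
}
```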
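
Routing records to files by key can be pictured with the plain-Java sketch below. MultiFileOutputSketch is a hypothetical name, the "<key>.txt" naming scheme is an assumption made for illustration, and the "files" are just in-memory lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MultiFileOutputSketch {
    // Group {key, value} records by an output file name derived from the key.
    static Map<String, List<String>> route(List<String[]> records) {
        Map<String, List<String>> files = new TreeMap<>();
        for (String[] kv : records) {
            String fileName = kv[0] + ".txt"; // hypothetical naming scheme
            files.computeIfAbsent(fileName, name -> new ArrayList<>()).add(kv[0] + "\t" + kv[1]);
        }
        return files;
    }
}
```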
