What are the processing classes for Hadoop input and output

2025-01-17 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31

This article introduces the classes that Hadoop uses to process MapReduce input and output. It is intended as a reference for readers who want an overview of what each class does.

Processing classes for Hadoop input

InputFormat

InputFormat is responsible for handling the input side of a MapReduce (MR) job.

Function:

1. Validate the input specification of the job.

2. Split the input files into InputSplits.

3. Provide the RecordReader implementation that reads records from an InputSplit and passes them to the Mapper for processing.

FileInputFormat

FileInputFormat is the base class of all InputFormat implementations that use files as their data source. It keeps track of all the files that make up the input of a Job and implements the calculation of splits for those input files. How records are read from a split is left to its subclasses, such as TextInputFormat.

TextInputFormat

This is the default InputFormat; it handles plain text files.

Each line of the file is one record. The key is the byte offset at which the line starts in the file, and the value is the content of the line. By default, a line is terminated by \n (line feed) or a carriage return.

Note: TextInputFormat inherits from FileInputFormat.
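
The offset-as-key behaviour described above can be illustrated with a small sketch. It is a plain-Java model rather than Hadoop code: the class name LineOffsetDemo and the method readRecords are hypothetical, and the sketch only recognises \n and \r\n line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of TextInputFormat-style records (hypothetical, no Hadoop classes).
class LineOffsetDemo {
    // Returns records in file order: key = byte offset of the line start, value = line text.
    static Map<Long, String> readRecords(byte[] data) {
        Map<Long, String> records = new LinkedHashMap<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                int end = i;
                if (end > start && data[end - 1] == '\r') {
                    end--; // drop the \r of a \r\n line ending
                }
                records.put((long) start,
                        new String(data, start, end - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        if (start < data.length) { // last line without a trailing newline
            records.put((long) start,
                    new String(data, start, data.length - start, StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8);
        readRecords(data).forEach((offset, line) -> System.out.println(offset + " -> " + line));
    }
}
```

The keys are byte offsets, not line numbers, so they are unique within a file but say nothing about how many lines came before.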

InputSplit

Before a MapReduce job runs, the input data is divided into several splits, and each split becomes the input of one Map task. While the Map task runs, the split is broken down into records (key-value pairs), and the Map function processes each record in turn.

Hadoop divides the input data of a MapReduce job into fixed-size pieces called input splits (InputSplit), or shards for short.

Hadoop builds a Map task for each shard, and that task runs the user-defined Map function to process each record in the shard.

Hadoop gets the best performance when a Map task runs on a node that stores its input data (the data in HDFS). This is called data locality optimization.

The optimal shard size is usually the same as the HDFS block size:

The block size is the largest amount of input that is guaranteed to be stored on a single node. If a shard spanned two blocks, it is unlikely that any single HDFS node would store both of them, so part of the shard would have to be transferred over the network to the node running the Map task. This is clearly less efficient than running the whole Map task on local data.
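
As a rough illustration of how the block size drives the shard size, the sketch below models the split calculation in plain Java. It assumes the rule max(minSize, min(maxSize, blockSize)) and a 10% tolerance on the last split, which is how FileInputFormat is generally understood to behave; the class SplitSizeDemo and its methods are hypothetical and are not Hadoop's own code.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of file splitting (hypothetical names, no Hadoop classes).
class SplitSizeDemo {
    // Assumed tolerance: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // With default min/max settings the result equals the block size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns one {offset, length} pair per split.
    static List<long[]> splits(long fileLength, long splitSize) {
        List<long[]> result = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            result.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining > 0) {
            result.add(new long[] {fileLength - remaining, remaining});
        }
        return result;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1, Long.MAX_VALUE);
        for (long[] s : splits(300 * mb, splitSize)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

In this model a 300 MB file with 128 MB blocks yields three shards, the first two aligned with whole blocks, so each Map task can read its shard from a single node.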

Other input classes

CombineFileInputFormat

Hadoop handles a small number of large files better than a large number of small files.

CombineFileInputFormat alleviates this problem. It is designed for small files and packs many files into each split, so that each Map task has more data to process.
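
The idea of packing small files into larger splits can be sketched as a simple greedy grouping. This is only a simplified model under the assumption that files are packed in order up to a maximum split size; the real CombineFileInputFormat also takes node and rack locality into account. The names CombineDemo and pack are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Greedy model of combining small files into splits (hypothetical, no Hadoop classes).
class CombineDemo {
    // Groups file sizes, in order, so that each group stays within maxSplitSize
    // (a single file larger than the limit still gets a group of its own).
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(pack(Arrays.asList(10L, 20L, 30L, 40L, 50L), 60L));
    }
}
```

Without combining, five small files would mean five Map tasks; in this model they are handled by three.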

KeyValueTextInputFormat

KeyValueTextInputFormat is a good fit when each line of input consists of two columns separated by a Tab. The text before the first Tab becomes the key and the rest of the line becomes the value.
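
A minimal sketch of that line-splitting rule, written in plain Java. The class KeyValueLineDemo and the method split are hypothetical; the sketch assumes that a line without a separator becomes a key with an empty value.

```java
// Plain-Java model of splitting a line into key and value (hypothetical, no Hadoop classes).
class KeyValueLineDemo {
    // Splits at the FIRST separator only; later separators stay in the value.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] {line, ""}; // no separator: whole line is the key
        }
        return new String[] {line.substring(0, pos), line.substring(pos + 1)};
    }

    public static void main(String[] args) {
        String[] kv = split("user42\tlogin\t2025-01-17", '\t');
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```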

NLineInputFormat

It lets you control the number of lines of input in each Split.
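
The effect of fixing the number of lines per split can be sketched as follows. This is a plain-Java model with hypothetical names (NLineDemo, group); it only shows the grouping, not how Hadoop locates line boundaries in a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of N-lines-per-split grouping (hypothetical, no Hadoop classes).
class NLineDemo {
    // Each inner list stands for one split holding at most n lines.
    static List<List<String>> group(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            int end = Math.min(i + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(i, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(group(Arrays.asList("a", "b", "c", "d", "e"), 2));
    }
}
```

Since each split feeds one Map task, a smaller n means more Map tasks with less work each.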

SequenceFileInputFormat

When the input files are in SequenceFile format, use SequenceFileInputFormat as the input format.

Custom input format

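1. Extend (inherit from) the FileInputFormat base class.

2. Override the getSplits(JobContext context) method if the default file splitting is not suitable.

3. Override the createRecordReader(InputSplit split, TaskAttemptContext context) method to return the RecordReader that produces the records.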
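
Most of the work in a custom input format sits in the reader returned in step 3. The sketch below models such a reader in plain Java for a made-up fixed-width record layout. The method names nextKeyValue, getCurrentKey and getCurrentValue mirror those of Hadoop's RecordReader, but the class FixedWidthReaderDemo is hypothetical and does not extend any Hadoop class.

```java
// Plain-Java model of a custom record reader (hypothetical, no Hadoop classes).
class FixedWidthReaderDemo {
    private final String data; // stands in for the content of one split
    private final int width;   // assumed fixed record width in characters
    private int pos = 0;
    private long key = -1;
    private String value = null;

    FixedWidthReaderDemo(String data, int width) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        this.data = data;
        this.width = width;
    }

    // Advances to the next record; returns false once the split is exhausted.
    boolean nextKeyValue() {
        if (pos >= data.length()) {
            return false;
        }
        int end = Math.min(pos + width, data.length());
        key = pos;                        // key: offset of the record
        value = data.substring(pos, end); // value: the record itself
        pos = end;
        return true;
    }

    long getCurrentKey() { return key; }

    String getCurrentValue() { return value; }

    public static void main(String[] args) {
        FixedWidthReaderDemo reader = new FixedWidthReaderDemo("abcdefg", 3);
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + " -> " + reader.getCurrentValue());
        }
    }
}
```

The loop in main is the same pattern a Map task follows: it keeps calling nextKeyValue and hands each key and value to the Map function.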

Processing classes for Hadoop output

TextOutputFormat

This is the default output format. Each record is written as one line of text, with the key and the value separated by a Tab.
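
The line layout can be sketched in plain Java as below. The class TextOutputLineDemo and the method formatLine are hypothetical; the handling of a missing key or value (write only the part that is present) is an assumption based on how TextOutputFormat is commonly described.

```java
// Plain-Java model of a key<TAB>value output line (hypothetical, no Hadoop classes).
class TextOutputLineDemo {
    // Returns the text written for one record; an empty string means nothing is written.
    static String formatLine(Object key, Object value, String separator) {
        boolean hasKey = key != null;
        boolean hasValue = value != null;
        if (!hasKey && !hasValue) {
            return "";
        }
        StringBuilder line = new StringBuilder();
        if (hasKey) {
            line.append(key);
        }
        if (hasKey && hasValue) {
            line.append(separator); // separator only between key and value
        }
        if (hasValue) {
            line.append(value);
        }
        return line.append('\n').toString();
    }

    public static void main(String[] args) {
        System.out.print(formatLine("hadoop", 3, "\t"));
        System.out.print(formatLine(null, "value only", "\t"));
    }
}
```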

SequenceFileOutputFormat

Writes keys and values to a SequenceFile.

SequenceFileAsBinaryOutputFormat

Writes keys and values to a SequenceFile in raw binary form.

MapFileOutputFormat

Writes keys and values to a MapFile. Because the keys in a MapFile are ordered, you must make sure that records are written in key order.
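
The ordering requirement can be pictured with a small writer that refuses keys arriving out of order. This is a plain-Java model with hypothetical names (SortedWriterDemo, append); it assumes that a key equal to the previous one is accepted and only a smaller key is rejected.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of an ordered key-value writer (hypothetical, no Hadoop classes).
class SortedWriterDemo {
    private final List<String> entries = new ArrayList<>();
    private String lastKey = null;

    // Appends a record, rejecting any key that sorts before the previous key.
    void append(String key, String value) {
        if (lastKey != null && key.compareTo(lastKey) < 0) {
            throw new IllegalStateException(
                    "key out of order: " + key + " after " + lastKey);
        }
        entries.add(key + "=" + value);
        lastKey = key;
    }

    int size() { return entries.size(); }

    public static void main(String[] args) {
        SortedWriterDemo writer = new SortedWriterDemo();
        writer.append("apple", "1");
        writer.append("banana", "2");
        System.out.println("records written: " + writer.size());
    }
}
```

Reducer output for a single Reduce task already arrives in key order, which is why this format fits naturally at the end of a MapReduce job.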

MultipleOutputFormat

By default, each Reduce task produces one output file, but sometimes we want a Reduce task to produce several outputs. MultipleOutputFormat and MultipleOutputs can do this.
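
Conceptually, writing to multiple outputs means routing each record to a named destination. The sketch below models that routing in plain Java with in-memory lists instead of files. The class MultiOutputDemo is hypothetical; its write(namedOutput, key, value) method is only loosely modelled on the call of the same shape in MultipleOutputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of routing records to named outputs (hypothetical, no Hadoop classes).
class MultiOutputDemo {
    // One list of lines per named output; a real job would write one file per name.
    private final Map<String, List<String>> outputs = new TreeMap<>();

    void write(String namedOutput, String key, String value) {
        outputs.computeIfAbsent(namedOutput, name -> new ArrayList<>())
               .add(key + "\t" + value);
    }

    Map<String, List<String>> outputs() { return outputs; }

    public static void main(String[] args) {
        MultiOutputDemo out = new MultiOutputDemo();
        out.write("errors", "job-7", "timeout");
        out.write("ok", "job-8", "done");
        out.write("ok", "job-9", "done");
        System.out.println(out.outputs());
    }
}
```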
