This article explains how to control the number of map tasks through the InputSplit size. The method is simple, fast, and practical; let's walk through it.
When running a Hadoop program, we should set the number of map tasks according to the situation. Besides the fixed, cluster-level limit on how many map tasks can run concurrently on each node, we also need to control the number of map tasks that the job actually launches.
1. How to control the number of map tasks actually running
We know that when a file is uploaded to HDFS, it is divided into blocks (64MB each by default). However, the chunk that a map task processes is not always a physical block: the actual size of each input chunk is determined by the InputSplit. So how is the InputSplit size computed?
splitSize = Math.max(minSize, Math.min(maxSize, blockSize))

where:
  minSize = mapred.min.split.size
  maxSize = mapred.max.split.size
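To make the formula concrete, here is a small standalone sketch (plain Java, no Hadoop dependency; the class and variable names are ours) that plugs the default values into it for a 64MB block:

public class SplitSizeDemo {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64MB HDFS block (the old default)
        long minSize = 1L;                     // default mapred.min.split.size
        long maxSize = Long.MAX_VALUE;         // default mapred.max.split.size

        // With the defaults, the split size equals the block size.
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        System.out.println(splitSize);         // prints 67108864 (64MB)

        // Lowering maxSize below the block size shrinks each split,
        // which raises the number of map tasks for the same input.
        maxSize = 10L * 1024 * 1024;           // 10MB
        splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        System.out.println(splitSize);         // prints 10485760 (10MB)
    }
}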
So we control the number of map tasks actually used by changing the number of splits the InputFormat produces, and we control the number of splits by controlling the size of each InputSplit.
2. How to control the size of each split
Hadoop's default input format is TextInputFormat, which defines how files are read and split. Opening its source file (in the org.apache.hadoop.mapreduce.lib.input package):
package org.apache.hadoop.mapreduce.lib.input;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new LineRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // A file can be split unless it is compressed with a codec
        // that does not support splitting.
        CompressionCodec codec =
            new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        if (null == codec) {
            return true;
        }
        return codec instanceof SplittableCompressionCodec;
    }
}
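Incidentally, isSplitable is a second lever over the number of map tasks: an input format that returns false here produces exactly one split, and hence one map task, per input file. A minimal sketch under that assumption (NonSplittableTextInputFormat is our own illustrative class, not part of Hadoop):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical subclass: forces one split (one map task) per file,
// regardless of block size or split-size settings.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}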
Reading this source, we find that TextInputFormat extends FileInputFormat, but nowhere in TextInputFormat are files actually split; TextInputFormat must therefore inherit FileInputFormat's default splitting behavior. So we open the source of FileInputFormat and find:
public static void setMinInputSplitSize(Job job, long size) {
    job.getConfiguration().setLong("mapred.min.split.size", size);
}

public static long getMinSplitSize(JobContext job) {
    return job.getConfiguration().getLong("mapred.min.split.size", 1L);
}

public static void setMaxInputSplitSize(Job job, long size) {
    job.getConfiguration().setLong("mapred.max.split.size", size);
}

public static long getMaxSplitSize(JobContext context) {
    return context.getConfiguration().getLong("mapred.max.split.size", Long.MAX_VALUE);
}
As we can see above, FileInputFormat is where Hadoop defines mapred.min.split.size and mapred.max.split.size, with defaults of 1 and Long.MAX_VALUE respectively. So we only need to assign new values to these two properties in our program to control the InputSplit size.
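Equivalently, a program can write these configuration keys directly, since the helper methods above do nothing more than that. A small sketch (the class name is ours; also note that newer Hadoop releases renamed the keys to mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize, so check the version you run against):

import org.apache.hadoop.conf.Configuration;

public class SplitConfigDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // These are exactly the keys the helper methods above write.
        conf.setLong("mapred.min.split.size", 1L);                 // default: 1
        conf.setLong("mapred.max.split.size", 10L * 1024 * 1024);  // cap splits at 10MB
        System.out.println(conf.getLong("mapred.min.split.size", 1L));
        System.out.println(conf.getLong("mapred.max.split.size", Long.MAX_VALUE));
    }
}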
3. Suppose we want to set the split size to 10MB
Then we can add the following code to the driver part of the MapReduce program:
TextInputFormat.setMinInputSplitSize(job, 1024L);             // set the minimum split size: 1KB
TextInputFormat.setMaxInputSplitSize(job, 1024L * 1024 * 10); // set the maximum split size: 10MB

With a 64MB block size, the formula then gives max(1KB, min(10MB, 64MB)) = 10MB, so each block is now processed by several map tasks instead of one.
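For context, here is one way those two lines might sit inside a complete driver. This is a hedged sketch, not code from the article: SplitSizeDriver, the job name, and the argument handling are all illustrative, and the mapper/reducer setup is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "split-size-demo");
        job.setJarByClass(SplitSizeDriver.class);
        job.setInputFormatClass(TextInputFormat.class);

        // Bound each InputSplit between 1KB and 10MB, so a 64MB block
        // is carved into roughly seven map tasks.
        TextInputFormat.setMinInputSplitSize(job, 1024L);
        TextInputFormat.setMaxInputSplitSize(job, 1024L * 1024 * 10);

        TextInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Mapper and reducer setup is omitted; the identity classes run by default.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}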