What is the difference between HDFS blocks and Input Splits?

This article explains in detail the difference between HDFS blocks and Input Splits. The topic is quite practical, so it is shared here for reference; I hope you gain something from reading it.
HDFS block
Now I have a file named iteblog.txt, which looks like this:
[iteblog@iteblog.com /home/iteblog]$ ll iteblog.txt
-rw-r--r-- 1 iteblog iteblog 454669963 May 15 12:07 iteblog.txt
Obviously, this file is larger than a single HDFS block (128 MB on this cluster), so storing it on HDFS produces 4 HDFS blocks, as shown below (note that the output has been trimmed):
[iteblog@iteblog.com /home/iteblog]$ hadoop fs -put iteblog.txt /tmp
[iteblog@iteblog.com /home/iteblog]$ hdfs fsck /tmp/iteblog.txt -files -blocks
/tmp/iteblog.txt 454669963 bytes, 4 block(s):  OK
0. BP-1398136447-192.168.246.60-1386067202761:blk_8133964845_1106679622318 len=134217728 repl=3
1. BP-1398136447-192.168.246.60-1386067202761:blk_8133967228_1106679624701 len=134217728 repl=3
2. BP-1398136447-192.168.246.60-1386067202761:blk_8133969503_1106679626977 len=134217728 repl=3
3. BP-1398136447-192.168.246.60-1386067202761:blk_8133970122_1106679627596 len=52016779 repl=3
You can see that iteblog.txt is divided into 4 blocks: the first three are exactly 128 MB (134217728 bytes) each, and the remaining 454669963 - 3 x 134217728 = 52016779 bytes are stored in the fourth HDFS block.
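To double-check this arithmetic, here is a small Scala calculation (a plain-math sketch, assuming the 128 MB default block size; it uses no Hadoop API):

// Verifying the block layout of iteblog.txt (plain arithmetic, no Hadoop API).
object BlockCount {
  def main(args: Array[String]): Unit = {
    val fileSize  = 454669963L  // size of iteblog.txt in bytes
    val blockSize = 134217728L  // 128 MB HDFS block size
    val blocks    = (fileSize + blockSize - 1) / blockSize   // round up
    val lastLen   = fileSize - (blocks - 1) * blockSize      // tail block size
    println(s"blocks=$blocks lastBlockLen=$lastLen")
    // prints: blocks=4 lastBlockLen=52016779
  }
}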
Now suppose the file contains a line that starts at offset 134217710 and is 100 bytes long. What does HDFS do with it?
The answer is that the record is split into two parts: the 18 bytes starting at offset 134217710 are stored in block 0, and the remaining 82 bytes, starting at offset 134217728, are stored in block 1 (block 0 covers bytes 0 through 134217727). This logic is summarized in the figure described below, and in the sketch that follows the description.
Description:
The red rectangle in the figure represents the file.
The blue rectangles in the middle represent HDFS blocks; the number inside each rectangle is the HDFS block number. Reading the whole file starts with block 0, then blocks 1, 2, 3...
The bottom row of rectangles represents the file's content; each small rectangle is one line of data, and the number inside is the line number. The red vertical lines mark HDFS block boundaries.
From the figure we can see clearly that when a file is written to HDFS, it is cut into 128 MB blocks without any inspection of the file's content. As a result, data that logically belongs to one line can be cut in two, with the parts physically stored in two different HDFS blocks, just as lines 5, 10, and 14 are in the figure.
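As a concrete illustration of the cut above, here is a small Scala sketch (our own illustration; splitAcrossBlocks is a hypothetical helper, not part of any Hadoop API) that maps a record's byte range onto fixed-size blocks:

// A minimal sketch (not Hadoop code) that maps a record's byte range onto
// fixed-size HDFS blocks; splitAcrossBlocks is a hypothetical helper name.
object BlockMath {
  val BlockSize: Long = 134217728L // 128 MB

  // Returns (blockIndex, offsetInFile, length) for each piece of the record.
  def splitAcrossBlocks(offset: Long, length: Long): Seq[(Long, Long, Long)] = {
    var pieces = Vector.empty[(Long, Long, Long)]
    var pos = offset
    var remaining = length
    while (remaining > 0) {
      val block = pos / BlockSize
      val room  = (block + 1) * BlockSize - pos // bytes left in this block
      val take  = math.min(room, remaining)
      pieces = pieces :+ ((block, pos, take))
      pos += take
      remaining -= take
    }
    pieces
  }

  def main(args: Array[String]): Unit = {
    // The record from the article: offset 134217710, length 100.
    splitAcrossBlocks(134217710L, 100L).foreach { case (b, off, len) =>
      println(s"block $b: offset=$off length=$len")
    }
    // block 0: offset=134217710 length=18
    // block 1: offset=134217728 length=82
  }
}

The output matches the answer above: 18 bytes of the record land in block 0 and the remaining 82 bytes land in block 1.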
File Split
Now we want to read this file with MapReduce. Since it is a plain text file, we can read it directly with TextInputFormat. Here is the FileSplit information obtained using TextInputFormat in the Scala REPL:
scala> import org.apache.hadoop.conf.Configuration
scala> import org.apache.hadoop.fs.Path
scala> import org.apache.hadoop.mapreduce.Job
scala> import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
scala> val job = Job.getInstance(new Configuration())
scala> FileInputFormat.addInputPath(job, new Path("/tmp/iteblog.txt"))
scala> val format = new TextInputFormat
scala> val splits = format.getSplits(job)
scala> splits.foreach(println)
hdfs://iteblogcluster/tmp/iteblog.txt:0+134217728
hdfs://iteblogcluster/tmp/iteblog.txt:134217728+134217728
hdfs://iteblogcluster/tmp/iteblog.txt:268435456+134217728
hdfs://iteblogcluster/tmp/iteblog.txt:402653184+52016779
As you can see, each FileSplit starts at the same offset as the corresponding HDFS block above. But how does MapReduce process the data? We now know that when a file is stored in HDFS it is cut into blocks, which causes some data that logically belongs to one line to be cut into two parts. How does TextInputFormat handle such data?
In this case, TextInputFormat (through its LineRecordReader) does two things, sketched in code after this list:
When LineRecordReader is initialized, if the FileSplit's start offset is not 0, this split is not the first one of the file, and the reader discards everything up to and including the first newline, i.e. the first (possibly partial) line of the split.
When reading a split, the reader reads one extra line past the split's end, so a line whose tail was cut into the next block is still read completely by the current task.
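The following self-contained Scala sketch mimics these two rules. It is our simplification, not Hadoop's actual LineRecordReader; the file is simulated as an in-memory byte array and a split as a byte range:

// A self-contained simplification (ours, not Hadoop's LineRecordReader source)
// of the two rules above. The "file" is a byte array; a split is the byte
// range [start, start + length).
object SplitReaderSketch {
  def readSplit(data: Array[Byte], start: Int, length: Int): Seq[String] = {
    val end = start + length
    var pos = start

    // Read the next '\n'-terminated line starting at pos, or None at EOF.
    def nextLine(): Option[String] =
      if (pos >= data.length) None
      else {
        val nl = data.indexOf('\n'.toByte, pos)
        val stop = if (nl < 0) data.length else nl
        val line = new String(data, pos, stop - pos, "UTF-8")
        pos = stop + 1
        Some(line)
      }

    // Rule 1: a non-first split discards its first (possibly partial) line;
    // the task reading the previous split has already consumed it.
    if (start != 0) nextLine()

    // Rule 2: keep reading while a line starts at or before the split's end,
    // so the last line may extend past `end` into the next block.
    val out = scala.collection.mutable.Buffer.empty[String]
    var done = false
    while (!done && pos <= end) nextLine() match {
      case Some(line) => out += line
      case None       => done = true
    }
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    val file = "line1\nline2\nline3\n".getBytes("UTF-8")
    // Cut the 18-byte file in the middle of "line2" (split boundary at byte 8).
    println(readSplit(file, 0, 8))  // line1 and line2: line2 is read in full
    println(readSplit(file, 8, 10)) // only line3: the partial line2 is skipped
  }
}

Running main shows each line emitted exactly once: line2, which straddles the byte-8 boundary, is read in full by the first split and skipped by the second.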
The graphical representation can be summarized as follows:
Description:
The red dotted lines in the figure represent HDFS block boundaries;
The blue dashed lines represent the boundaries of what each Split actually reads.
It can be clearly seen from the figure:
When the program reads Block 0, although line 5 is split between Block 0 and Block 1, the task reading Block 0 still reads line 5 in full.
When the program reads Block 1, the FileSplit's start offset is not 0, so the first line is dropped: the leading part of Block 1, which belongs to line 5, is discarded, and reading starts directly at line 6. This is safe because line 5 has already been read in full by the task that processed the previous block.
The remaining blocks are read with the same logic.
From the above analysis, we can draw the following conclusions:
A Split and HDFS blocks form a one-to-many relationship;
An HDFS block is the physical representation of the data, while a Split is the logical representation of the data in the block;
Even when data locality is satisfied, the program may still read a small amount of data from remote nodes, because lines that are cut across block boundaries must be fetched from the neighboring block.
About "What are the differences between HDFS blocks and Input Splits" This article is shared here. I hope the above content can be of some help to everyone so that you can learn more knowledge. If you think the article is good, please share it for more people to see.