What is the relationship between recordreader, split and block in hadoop

2026-03-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how RecordReader, split and block relate to one another in Hadoop. The content is short and easy to follow.

The role of a RecordReader is to turn the raw bytes of the input into the key/value pairs that a map task consumes.

Generally speaking, InputFormat creates one RecordReader per split for the MapTask to use, so that the MapTask can read the part of the input covered by its own split.

Here we use LineRecordReader as the example:

Several core methods

LineRecordReader defines three basic operations: check whether there is a next key/value pair (nextKeyValue), get the current key (getCurrentKey), and get the current value (getCurrentValue).

How these three methods are called is set aside for the time being.
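
Even so, a rough picture of the calling pattern may help. The sketch below does not use Hadoop at all: SimpleRecordReader and ListLineReader are hypothetical stand-ins invented for this illustration, and only the three method names mirror Hadoop's RecordReader. It shows a caller advancing the reader and then fetching the current key and value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RecordReaderLoopDemo {
    // Hypothetical stand-in for Hadoop's RecordReader; only the three method names match.
    interface SimpleRecordReader<K, V> {
        boolean nextKeyValue();   // advance to the next pair; false when none is left
        K getCurrentKey();        // key of the pair last advanced to
        V getCurrentValue();      // value of the pair last advanced to
    }

    // Toy reader over in-memory lines: key = byte offset of the line, value = its text.
    static class ListLineReader implements SimpleRecordReader<Long, String> {
        private final List<String> lines;
        private int index = -1;
        private long nextOffset = 0;
        private long key;
        private String value;

        ListLineReader(List<String> lines) { this.lines = lines; }

        @Override public boolean nextKeyValue() {
            if (index + 1 >= lines.size()) return false;
            index++;
            key = nextOffset;
            value = lines.get(index);
            nextOffset += value.length() + 1; // +1 for the '\n'; assumes ASCII text
            return true;
        }
        @Override public Long getCurrentKey() { return key; }
        @Override public String getCurrentValue() { return value; }
    }

    // The shape of the loop a caller runs: advance, then fetch key and value.
    static List<String> drain(SimpleRecordReader<Long, String> reader) {
        List<String> out = new ArrayList<>();
        while (reader.nextKeyValue()) {
            out.add(reader.getCurrentKey() + ":" + reader.getCurrentValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [0:ab, 3:cde, 7:f]
        System.out.println(drain(new ListLineReader(Arrays.asList("ab", "cde", "f"))));
    }
}
```

As in LineRecordReader, the key here is the byte offset of the line and the value is the line text.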

Files on HDFS are stored in blocks, so why do we need both split and block? Why not simply process the file block by block? The reason is that a block is a physical unit that cuts the file at fixed byte offsets, so a logical record (for example, one line) may end up divided between two blocks. We therefore use a logical concept, the split, as the unit of processing.

A split does not physically divide the file. It is only a logical marker described by the file path, a start offset, a length and so on.
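
To make the "logical marker" idea concrete, here is a minimal sketch that describes a file as a list of (path, start, length) triples. The Split class, the computeSplits helper, the path and the sizes are all hypothetical, and the fixed-size cutting is a deliberate simplification rather than the algorithm FileInputFormat actually uses.

```java
import java.util.ArrayList;
import java.util.List;

class LogicalSplitDemo {
    // Hypothetical stand-in for a file split: pure metadata, no file data is copied or moved.
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        @Override public String toString() {
            return path + "[" + start + "+" + length + "]";
        }
    }

    // Simplified assumption: cut the byte range [0, fileLength) into fixed-size pieces.
    static List<Split> computeSplits(String path, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new Split(path, start, length));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 300-byte file with 128-byte splits yields three splits; the last has 44 bytes.
        System.out.println(computeSplits("/data/input.txt", 300, 128));
    }
}
```

Nothing in this computation looks at the file content, which is why a split boundary can land in the middle of a line.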

From the split's Path you can obtain the FileSystem:

    final Path file = split.getPath();
    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);

Each MapTask uses one LineRecordReader to process its split, and the reader maintains a stream in this field:

    private FSDataInputStream fileIn;

Note that this stream is not restricted to the current split. As said earlier, a split is only a marker; the file itself is not divided.

Therefore, the fileIn stream actually points to the entire file.

The stream implements the standard JDK stream methods such as read, which reads bytes into a buffer. When a read crosses into a different block, the stream finds the corresponding block automatically. The details are complicated and are skipped here. Just remember that fileIn hides the switching between blocks from the layers above, so to us it looks like reading one large file.

Because the stream is seekable, each MapTask can use fileIn to jump directly to the start offset recorded in its own split and begin processing the file from there.

    fileIn.seek(start);
    in = new SplitLineReader(fileIn, job, this.recordDelimiterBytes);
    filePosition = fileIn;
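
The same "open the whole file, then seek to your own start" pattern can be tried locally without HDFS. In the sketch below, java.io.RandomAccessFile merely plays the role of FSDataInputStream; the readRange and demo helpers, the temporary file and the offsets are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SeekToSplitStartDemo {
    // Opens the WHOLE file, seeks to 'start', then reads up to 'length' bytes.
    static String readRange(Path file, long start, int length) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            in.seek(start);
            byte[] buf = new byte[length];
            int total = 0;
            while (total < length) {
                int n = in.read(buf, total, length - total);
                if (n < 0) break; // end of file
                total += n;
            }
            return new String(buf, 0, total, StandardCharsets.US_ASCII);
        }
    }

    // Two "tasks" read different ranges of the same temporary file.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("split-demo", ".txt");
            try {
                Files.write(tmp, "0123456789ABCDEF".getBytes(StandardCharsets.US_ASCII));
                return readRange(tmp, 0, 8) + "|" + readRange(tmp, 8, 8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 01234567|89ABCDEF
    }
}
```

Each caller sees the same complete file and differs only in where it seeks to, which matches the way several map tasks share one input file.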

In the LineRecordReader code, the in object is constructed from fileIn, so in must rely on the fileIn stream to implement some of its functionality.

The typical example is readLine.

The in object is responsible for the logic of reading one line, while fileIn is responsible for reading bytes from the file into the byte buffer.

Inside the readLine function you will eventually find a statement like this:

    bufferLength = fillBuffer(in, buffer, prevCharCR);

It calls the fillBuffer function, which uses in.read() to read bytes into buffer.

    private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume)
            throws IOException {
        /* We're reading data from in, but the head of the stream may be
         * already buffered in buffer, so we have several cases:
         * 1. No newline characters are in the buffer, so we need to copy
         *    everything and read another buffer from the stream.
         * 2. An unambiguously terminated line is in buffer, so we just
         *    copy to str.
         * 3. Ambiguously terminated line is in buffer, i.e. buffer ends
         *    in CR. In this case we copy everything up to CR to str, but
         *    we also need to see what follows CR: if it's LF, then we
         *    need consume LF as well, so next call to readLine will read
         *    from after that.
         * We use a flag prevCharCR to signal if previous character was CR
         * and, if it happens to be at the end of the buffer, delay
         * consuming it until we have a chance to look at the char that
         * follows.
         */
        str.clear();
        int txtLength = 0;            // tracks str.getLength(), as an optimization
        int newlineLength = 0;        // length of terminating newline
        boolean prevCharCR = false;   // true if prev char was CR
        long bytesConsumed = 0;
        do {
            int startPosn = bufferPosn; // starting from where we left off the last time
            if (bufferPosn >= bufferLength) {
                startPosn = bufferPosn = 0;
                if (prevCharCR) {
                    ++bytesConsumed; // account for CR from previous read
                }
                bufferLength = fillBuffer(in, buffer, prevCharCR);
                if (bufferLength <= 0) {
                    break; // EOF
                }
            }
            for (; bufferPosn < bufferLength; ++bufferPosn) { // search for newline
                if (buffer[bufferPosn] == LF) {
                    newlineLength = (prevCharCR) ? 2 : 1;
                    ++bufferPosn; // at next invocation proceed from following byte
                    break;
                }
                if (prevCharCR) { // CR + notLF, we are at notLF
                    newlineLength = 1;
                    break;
                }
                prevCharCR = (buffer[bufferPosn] == CR);
            }
            int readLength = bufferPosn - startPosn;
            if (prevCharCR && newlineLength == 0) {
                --readLength; // CR at the end of the buffer
            }
            bytesConsumed += readLength;
            int appendLength = readLength - newlineLength;
            if (appendLength > maxLineLength - txtLength) {
                appendLength = maxLineLength - txtLength;
            }
            if (appendLength > 0) {
                str.append(buffer, startPosn, appendLength);
                txtLength += appendLength;
            }
        } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);

        if (bytesConsumed > Integer.MAX_VALUE) {
            throw new IOException("Too many bytes before newline: " + bytesConsumed);
        }
        return (int) bytesConsumed;
    }
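
The division of labour between the line logic and the buffer refill is easier to see in a stripped-down version. The sketch below is not the Hadoop implementation: it is a simplified reader written for this illustration that handles only '\n' terminators (no CR handling, no length limits), and the class name and the readAll helper are hypothetical. A deliberately tiny buffer forces a line to span several refills.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class BufferedLineReaderDemo {
    private final InputStream in;   // raw byte source (the role fileIn plays)
    private final byte[] buffer;
    private int bufferLength = 0;   // number of valid bytes currently in buffer
    private int bufferPosn = 0;     // next unread position in buffer

    BufferedLineReaderDemo(InputStream in, int bufferSize) {
        if (bufferSize <= 0) throw new IllegalArgumentException("bufferSize must be positive");
        this.in = in;
        this.buffer = new byte[bufferSize];
    }

    // Reads one '\n'-terminated line into 'line'; returns bytes consumed (0 at end of stream).
    int readLine(StringBuilder line) throws IOException {
        line.setLength(0);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int consumed = 0;
        while (true) {
            if (bufferPosn >= bufferLength) {   // buffer exhausted: refill from the stream
                bufferLength = in.read(buffer);
                bufferPosn = 0;
                if (bufferLength <= 0) break;   // end of stream
            }
            byte b = buffer[bufferPosn++];
            consumed++;
            if (b == '\n') break;               // terminator is consumed but not stored
            bytes.write(b);
        }
        line.append(new String(bytes.toByteArray(), StandardCharsets.UTF_8));
        return consumed;
    }

    // Convenience helper: read every line of 'text' using the given buffer size.
    static List<String> readAll(String text, int bufferSize) {
        try {
            BufferedLineReaderDemo reader = new BufferedLineReaderDemo(
                    new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)), bufferSize);
            List<String> lines = new ArrayList<>();
            StringBuilder sb = new StringBuilder();
            while (reader.readLine(sb) > 0) {
                lines.add(sb.toString());
            }
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A 4-byte buffer means "hello" needs two refills; prints [hello, world]
        System.out.println(readAll("hello\nworld\n", 4));
    }
}
```

The return value counts the terminator as well as the text, which is what lets a caller keep track of its position in the file.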

With that, the three main methods of LineRecordReader and the way a line is read have been covered, at least briefly.

However, one question has not been addressed yet: what happens if a line is cut in two by a block boundary?

To put the problem another way: we know that the getSplits method of InputFormat computes the splits directly from the file length and similar attributes, without looking at the content (see the getSplits method of FileInputFormat).

A line of data may therefore be spread across different splits or across different blocks.

When a line is spread across different blocks, the fileIn object handles it for us while reading into the buffer. That is a physical-layer matter and does not need to be considered here.

When a line is spread across different splits, the situation is more problematic, because different splits are processed by different map tasks.

So how does the RecordReader solve this problem?

The solution is to let the reader go beyond the start and end boundaries of its split.

LineRecordReader's solution:

Unless start is the beginning of the file, the first line is skipped by default, because start may fall in the middle of a line. The map task that handles the preceding split reads that line instead.

The initialize() method:

    if (start != 0) {
        start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
    this.pos = start;

In the nextKeyValue method, the reader reads beyond the end of its split when necessary, so that the last line is completed.

    while (getFilePosition()
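
The two rules (skip the first line unless the split starts at offset 0, and keep reading past the end of the split to finish a line) can be checked with a small in-memory simulation. This is a sketch under stated assumptions, not Hadoop code: the SplitBoundaryDemo class and its helpers are hypothetical, only '\n' terminators are handled, and the loop is assumed to continue while the current position is at or before the end of the split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitBoundaryDemo {
    // Returns the index just past the '\n'-terminated line that starts at 'pos'.
    static int lineEnd(byte[] data, int pos) {
        int i = pos;
        while (i < data.length && data[i] != '\n') i++;
        return Math.min(i + 1, data.length);
    }

    // Reads the lines that belong to the split [start, start + length).
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            pos = lineEnd(data, pos); // rule 1: discard the first (possibly partial) line
        }
        List<String> lines = new ArrayList<>();
        // rule 2 (assumed condition): a line that STARTS at or before 'end' is read in
        // full, even if that means reading bytes beyond the end of the split.
        while (pos <= end && pos < data.length) {
            int next = lineEnd(data, pos);
            int textEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, textEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    // Runs every fixed-size split over the data and concatenates the results.
    static List<String> readAllSplits(byte[] data, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length; start += splitSize) {
            all.addAll(readSplit(data, start, Math.min(splitSize, data.length - start)));
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(data, 0, 10));   // prints [alpha, bravo]
        System.out.println(readSplit(data, 10, 10));  // prints [charlie, delta]
        System.out.println(readSplit(data, 20, 6));   // prints []
    }
}
```

With 10-byte splits over this 26-byte input, the first split ends in the middle of "bravo" and still returns it whole, the second split discards the tail of "bravo" and starts at "charlie", and the third split finds that its only line was already taken by the second. Every line is read exactly once.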
