2025-02-28 Update From: SLTechnology News&Howtos > Servers (Shulou.com, 06/01 Report)
The editor would like to share a debugging story about abnormal output doubling after modifying TeraSort in Hadoop. I hope you get something out of this article; let's work through it together.
In short, after I modified TeraInputFormat.java, the output produced by running TeraSort was doubled to varying degrees. At first there was no clue: the sampling thread was tangled up with the map-side reading of <key, value> pairs, the logic was unclear, and I wasted a lot of time repeatedly debugging in insignificant places.
In hindsight, I should have thought of setting a breakpoint in MapTask to observe what was happening, but whether out of laziness or a certain fear of the hidden MapTask internals, I did not look carefully at first. Later I added a counter variable in the nextKeyValue() method of the RecordReader used by MapTask and printed it, to observe how many records each split actually read. It turned out that every split read the entire input file (note: the whole file, not a split-sized batch of records), which is why the output was multiplied.
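The diagnostic above can be reproduced with a small stdlib-only simulation, no Hadoop dependency needed. SplitCountDemo and countRecords are hypothetical names invented for illustration; the point is that a reader which, like the buggy nextKeyValue(), calls readLine() until end-of-file will count every remaining record in the file rather than only its split's share.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Minimal simulation of the diagnostic counter: how many records does a
// "split" actually read when it never checks the split boundary?
// This is illustrative stdlib Java, not Hadoop or TeraSort code.
public class SplitCountDemo {

    static int countRecords(String file, int startLine) {
        try {
            BufferedReader in = new BufferedReader(new StringReader(file));
            // Skip the lines that belong to earlier splits.
            for (int i = 0; i < startLine; i++) in.readLine();
            int count = 0;
            while (in.readLine() != null) {
                count++; // bug reproduced: no check against the split's length
            }
            return count;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String file = "r1\nr2\nr3\nr4\nr5\nr6\n"; // 6 records in total
        // Pretend the file is divided into two 3-record splits.
        System.out.println(countRecords(file, 0)); // prints 6, not 3
        System.out.println(countRecords(file, 3)); // prints 3
    }
}
```

With two splits the job emits 6 + 3 = 9 records instead of 6; with more splits the inflation grows, which matches the "doubling in varying degrees" observed above.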
So the key symptom had been found; where exactly was the problem? The MapTask part is provided by Hadoop and TeraSort does not override it, so there could be no error there. The first half of TeraInputFormat is the sampling code, so the problem could not come from there either. The initialize() part of the RecordReader in the second half was essentially unchanged by my modification, so the faulty code had to be in nextKeyValue(). Line-by-line analysis finally locked onto this statement:
newSize = in.readLine(record);
Reading a record line by line is perfectly ordinary, so could it be that readLine() imposes no limit on how far it reads? Although nextKeyValue() is called on behalf of a single split, would readLine() ignore the size of that split and keep reading all the way to the end of the file?
To verify this possibility, I added a member variable to track the total length of the records read so far:

long recordLen; // and inside nextKeyValue(): recordLen += newSize;

and made nextKeyValue() stop once that total reached the split's length:

if (recordLen >= split.getLength()) { return false; }
After making this modification, I rebuilt the jar, deployed it to the nodes, and ran the job again; the result was correct!
After reading this article, I believe you have some understanding of why the output doubled after the TeraSort modification in Hadoop. Thank you for reading!
© 2024 shulou.com SLNews company. All rights reserved.