2025-02-23 Update From: SLTechnology News&Howtos
This article explains how secondary sorting works in Hadoop. The explanation is kept simple and clear so that it is easy to follow; let's work through it step by step.
1. Each stage of the process
Input --> split --> RecordReader --> form composite key-value pair (TextPair) --> partition (partitioner set by setPartitionerClass, keyed on the first field of the TextPair) --> sort within each partition (comparator set by setSortComparatorClass; because this comparator orders the TextPair by its first field and then its second, this single pass already performs both levels of the sort) --> shuffle --> merge-sort on the reduce side (again using the setSortComparatorClass comparator) --> grouping (comparator set by setGroupingComparatorClass) --> execute reduce --> output
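The stages above can be sketched as a set of plain Python functions, one per pluggable Hadoop component. This is an illustrative simulation, not Hadoop code: in a real job these three roles are wired in via Job.setPartitionerClass, Job.setSortComparatorClass, and Job.setGroupingComparatorClass, and the record is a (year, temperature) pair as in the example later in this article.

```python
def to_composite(record):
    """RecordReader/map output: build the TextPair-like composite key.
    key1 = (original key, value), value1 = value."""
    year, temp = record
    return ((year, temp), temp)

def partition(key1, num_reducers):
    """Stands in for setPartitionerClass: partition on the FIRST field
    of the composite key only, so one year never spans two reducers."""
    return hash(key1[0]) % num_reducers

def sort_key(kv):
    """Stands in for setSortComparatorClass: order by the FULL composite
    key, i.e. first field first, then second field."""
    return kv[0]

def group_key(kv):
    """Stands in for setGroupingComparatorClass: group on the first
    field of the composite key only."""
    return kv[0][0]
```

Keeping the three comparators separate is the whole trick: the sort comparator sees both fields, while the partitioner and the grouping comparator deliberately see only the first.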
2. Detailed explanation of each process
Map phase:
(1) the InputFormat reads the input data and produces the corresponding key-value pairs.
(2) in the map function, each (key, value) pair is transformed into a new TextPair pair: key1 = key + value, value1 = value. TextPair's ordering compares the first field (key) first, then the second field (value).
(3) in the spill phase, the target reducer is chosen by our custom Partitioner, which partitions on the first field of the TextPair key (the original key) only.
(4) the chunks of map output are sorted internally using the comparator we defined, which already applies both levels of the sort (first by the first field of key1, then by the second).
(5) the multiple spill files of each partition are merged.
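A minimal Python simulation of the map-side steps above (composite-key creation, partitioning on the natural key, and in-partition sorting). The function name and the choice of 2 reducers are assumptions of this sketch, not Hadoop APIs:

```python
def map_phase(records, num_reducers=2):
    """records: list of (year, temperature) pairs."""
    # (2) Build composite keys: key1 = (year, temp), value1 = temp.
    pairs = [((year, temp), temp) for year, temp in records]

    # (3) Partition on the FIRST field of the composite key only,
    # so all records for one year go to the same reducer.
    partitions = [[] for _ in range(num_reducers)]
    for key1, value1 in pairs:
        partitions[hash(key1[0]) % num_reducers].append((key1, value1))

    # (4) Sort each partition by the full composite key:
    # first by year, then by temperature.
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
    return partitions

parts = map_phase([(1990, 31), (1991, 20), (1991, 18),
                   (1991, 33), (1990, 22), (1990, 17)])
```

Each partition comes out fully ordered on both fields, which is why the reduce side only needs a merge rather than a fresh sort.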
Reduce phase:
(1) during the shuffle, each Reducer learns from the JobTracker which map output files it must fetch, transfers them, and merges them.
(2) because the fetched map outputs come from different nodes, they are merge-sorted again here, still according to the rule we defined (the TextPair ordering), so the two-level sort is re-established over the combined data.
(3) in the reduce phase, records with equal keys are grouped together; by default the whole key is compared. Since key1 is a composite key, we instead group on its first field, producing one group per original key.
(4) reduce processes each group.
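The reduce-side steps above can be sketched in plain Python; the merge and the grouping-by-natural-key are simulated here with sorted() and itertools.groupby, which is an assumption for illustration, not the Hadoop implementation:

```python
import itertools

def reduce_phase(fetched_runs):
    """fetched_runs: sorted runs fetched from different map outputs."""
    # (2) Merge the runs and re-sort by the full composite key.
    merged = sorted(kv for run in fetched_runs for kv in run)

    # (3) Group on the FIRST field of the composite key (the year),
    # not on the whole composite key.
    results = []
    for year, group in itertools.groupby(merged, key=lambda kv: kv[0][0]):
        # (4) reduce sees one group per year; the values arrive already
        # ordered by temperature thanks to the composite-key sort.
        results.append((year, [value for _key1, value in group]))
    return results

out = reduce_phase([[((1990, 22), 22), ((1990, 31), 31)],
                    [((1990, 17), 17)]])
```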
We now work through the weather example from Hadoop: The Definitive Guide.
In this example the input file has two columns: the first is the year and the second is the temperature.
1990 31
1991 20
1991 18
1991 33
1990 22
1990 17
The result we want is as follows (sorted first by year, then by temperature):
1990 17
1990 22
1990 31
1991 18
1991 20
1991 33
The process is as follows:
(1) in the map phase, each input record is turned into a composite key-value pair.
(2) the Partitioner uses the first column of the composite key (the year) as the partitioning key, and each partition is sorted internally. The partitions are thus assigned to different Reducers, and each Reducer learns from the JobTracker which files it must fetch.
(3) each reducer fetches its partition's contents from the different map nodes via the shuffle and merge-sorts them again (the fetched pieces come from different nodes, so the combined data must be re-sorted).
(4) grouping
After the re-sort on the reduce side, the records still need to be grouped. By default, grouping compares the whole key; since our key is the composite key, records would not be grouped by year. We therefore re-implement the grouping function so that it uses only the first column of the composite key (the year) as the grouping key.
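To see concretely why the default grouping on the whole composite key is wrong, compare the two grouping keys directly (a small illustrative check, not Hadoop code):

```python
import itertools

sorted_keys = [(1990, 17), (1990, 22), (1990, 31)]

# Default behaviour: group on the WHOLE composite key -> every record
# is its own group, so reduce never sees the temperatures together.
default_groups = [list(g) for _, g in itertools.groupby(sorted_keys)]

# Custom grouping comparator: compare only the first field (the year)
# -> all 1990 records fall into a single group.
year_groups = [list(g) for _, g in
               itertools.groupby(sorted_keys, key=lambda k: k[0])]
```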
Reducer1: key 1990, values [17, 22, 31]
Reducer2: key 1991, values [18, 20, 33]
(5) reduce processes each group: the output key is the first column of the composite key (the year), and the output value is taken from the value list in turn.
Reducer1 output:
1990 17
1990 22
1990 31
Reducer2 output:
1991 18
1991 20
1991 33
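Putting the phases together, a self-contained simulation of the whole pipeline reproduces the two reducer outputs above. The partition count of 2 and the helper name are assumptions of this sketch:

```python
import itertools

def secondary_sort(records, num_reducers=2):
    # Map side: composite key (year, temp); partition on the year only.
    partitions = [[] for _ in range(num_reducers)]
    for year, temp in records:
        partitions[hash(year) % num_reducers].append(((year, temp), temp))

    outputs = []
    for part in partitions:
        # Sort/shuffle/merge: order the reducer's input by composite key.
        part.sort(key=lambda kv: kv[0])
        # Group on the year, then emit one (year, temp) row per value.
        rows = [(year, value)
                for year, grp in itertools.groupby(part, key=lambda kv: kv[0][0])
                for _key1, value in grp]
        outputs.append(rows)
    return outputs

r1, r2 = secondary_sort([(1990, 31), (1991, 20), (1991, 18),
                         (1991, 33), (1990, 22), (1990, 17)])
```

Within each group the values emerge already sorted by temperature, so the reducer itself never has to buffer or sort anything.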
Thank you for reading. That concludes this walkthrough of Hadoop secondary sorting; the details are best confirmed by trying it out in practice.