

How Hadoop writes map output

2025-01-18 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/01 Report --

This article explains how Hadoop writes its map output. The material is straightforward and clearly organized; I hope it resolves your doubts as you work through it.

The official documentation describes the Mapper's output as follows:

The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.

So the mapper output is sorted and then partitioned, one partition per reducer. How the Hadoop code actually performs this is what the following code analysis traces.
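As an aside, the documentation above says a custom Partitioner controls which keys go to which reducer. The rule used by Hadoop's default HashPartitioner can be sketched in a few lines; the class name HashPartitionSketch and the standalone setup here are ours for illustration, but the formula mirrors the default hash-and-mod rule:

```java
// Standalone sketch of Hadoop's default HashPartitioner rule:
// partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
// The mask clears the sign bit so the result is never negative.
public class HashPartitionSketch {
    public static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With 3 reducers, every key deterministically maps to 0, 1, or 2.
        for (String word : new String[]{"a", "b", "c", "j"}) {
            System.out.println(word + " -> partition " + getPartition(word, 3));
        }
    }
}
```

The same key always hashes to the same partition, which is exactly what guarantees that all records for one key reach one reducer.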

We again follow the official WordCount example.

To simplify the map output for analysis, we parse only one document containing just 10 'words': the single letters 'j', .. 'c', 'b', 'a' (preferably listed out of order, so that their sorting becomes visible later), and we comment out the code that sets the combiner class.

1. Single-step trace context.write in map (producing kvbuffer and kvmeta)

It can be traced into org.apache.hadoop.mapred.MapTask.MapOutputBuffer.collect(K, V, int).

Because our output is only 10 records, each of them small, the spill and combine handling is skipped here; the main code is as follows:

public synchronized void collect(K key, V value, final int partition)
    throws IOException {
  ...
  keySerializer.serialize(key);
  ...
  valSerializer.serialize(value);
  ...
  kvmeta.put(kvindex + PARTITION, partition);
  kvmeta.put(kvindex + KEYSTART, keystart);
  kvmeta.put(kvindex + VALSTART, valstart);
  kvmeta.put(kvindex + VALLEN, distanceTo(valstart, valend));
  ...
}

Here (K, V) is serialized into the byte array org.apache.hadoop.mapred.MapTask.MapOutputBuffer.kvbuffer, and the record's location in memory, together with its partition (records of the same partition are processed by the same reducer), is stored in kvmeta.

At this point the map output lives entirely in memory.
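To make the kvbuffer/kvmeta layout concrete, here is a toy standalone model. The growable lists and helper names are invented for illustration; the real MapOutputBuffer uses a single circular byte array with far more bookkeeping. Each record's bytes go into one buffer, and a 4-field metadata entry records (partition, keystart, valstart, vallen), matching the fields written in collect() above:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model of kvbuffer/kvmeta: serialized bytes accumulate in one
// buffer, and each record gets a 4-int metadata entry
// (partition, keystart, valstart, vallen), as in collect() above.
public class KvBufferSketch {
    public static final int PARTITION = 0, KEYSTART = 1, VALSTART = 2, VALLEN = 3;

    public final List<Byte> kvbuffer = new ArrayList<>();
    public final List<int[]> kvmeta = new ArrayList<>();

    public void collect(String key, String value, int partition) {
        int keystart = kvbuffer.size();
        append(key.getBytes(StandardCharsets.UTF_8));
        int valstart = kvbuffer.size();
        append(value.getBytes(StandardCharsets.UTF_8));
        int vallen = kvbuffer.size() - valstart;
        kvmeta.add(new int[]{partition, keystart, valstart, vallen});
    }

    private void append(byte[] bytes) {
        for (byte b : bytes) kvbuffer.add(b);
    }

    // Recover the key of record rec from the raw buffer via its metadata.
    public String keyAt(int rec) {
        int[] m = kvmeta.get(rec);
        byte[] out = new byte[m[VALSTART] - m[KEYSTART]];
        for (int i = 0; i < out.length; i++) out[i] = kvbuffer.get(m[KEYSTART] + i);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        KvBufferSketch buf = new KvBufferSketch();
        buf.collect("b", "1", 0);
        buf.collect("a", "1", 1);
        System.out.println(buf.keyAt(0));                  // b
        System.out.println(buf.kvmeta.get(1)[PARTITION]);  // 1
    }
}
```

The point of the split layout is that sorting can later permute only the small metadata entries while the serialized bytes stay where they are.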

2. Following the kvmeta references, find the code that consumes kvbuffer and kvmeta and produces a spillRec into indexCacheList

It can be found in org.apache.hadoop.mapred.MapTask.MapOutputBuffer.sortAndSpill(), where you can set a breakpoint; see below.

private void sortAndSpill() throws IOException, ClassNotFoundException,
                                   InterruptedException {
  ...
  sorter.sort(MapOutputBuffer.this, mstart, mend, reporter);
  ...
  for (int i = 0; i < partitions; ++i) {
    ...
    if (combinerRunner == null) {
      // spill directly
      DataInputBuffer key = new DataInputBuffer();
      while (spindex < mend &&
             kvmeta.get(offsetFor(spindex % maxRec) + PARTITION) == i) {
        ...
        writer.append(key, value);
        ++spindex;
      }
    }
    ...
    spillRec.putIndex(rec, i);
  }
  ...
  indexCacheList.add(spillRec);
  ...
}

Three operations happen here:

1. sorter.sort: sorts by partition and then by key, so that records of the same partition are grouped together and ordered by key.

2. writer.append: writes each serialized record to the output stream, here the file spill0.out.

3. indexCacheList.add: each spillRec records the partition layout of one spill file.
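The effect of sorter.sort can be reproduced with an ordinary comparator over (partition, key). This sketch uses a plain array sort rather than Hadoop's in-place QuickSort over kvmeta, and the Rec type is invented for illustration:

```java
import java.util.Arrays;
import java.util.Comparator;

// Records compare first by partition, then by key, so each partition's
// records end up contiguous and key-ordered, mirroring what sorter.sort
// achieves over kvmeta before the spill loop runs.
public class SpillSortSketch {
    public record Rec(int partition, String key) {}

    public static Rec[] sortForSpill(Rec[] recs) {
        Rec[] out = recs.clone();
        Arrays.sort(out, Comparator.comparingInt(Rec::partition)
                                   .thenComparing(Rec::key));
        return out;
    }

    public static void main(String[] args) {
        Rec[] recs = {
            new Rec(1, "c"), new Rec(0, "b"), new Rec(1, "a"), new Rec(0, "d")
        };
        for (Rec r : sortForSpill(recs))
            System.out.println(r.partition() + " " + r.key());
        // partition 0: b, d   then partition 1: a, c
    }
}
```

After this ordering, the per-partition while loop in sortAndSpill can simply walk forward until the partition number changes.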

3. Find the code that consumes indexCacheList: org.apache.hadoop.mapred.MapTask.MapOutputBuffer.mergeParts()

Set a breakpoint here. In our run there is only one spill file, so no merge is required: the single spillRec is written to the file file.out.index, and spill0.out is renamed to file.out. You can open file.out in vim and see that the records appear in sorted order.

private void mergeParts() throws IOException, InterruptedException,
                                 ClassNotFoundException {
  ...
  sameVolRename(filename[0],
      mapOutputFile.getOutputFileForWriteInVolume(filename[0]));
  ...
  indexCacheList.get(0).writeToFile(
      mapOutputFile.getOutputIndexFileForWriteInVolume(filename[0]), job);
  ...
}

In summary:

1. The map output is first serialized into memory: kvbuffer and kvmeta.

2. sortAndSpill sorts the in-memory records and writes them to spill files.

3. mergeParts merges the spill files into a single file, file.out, and writes each partition's location information into file.out.index.

Not yet analyzed: the more complex case where map outputs a large amount of data and multiple spill files appear (1. asynchronous spilling, 2. merging multiple files).

That concludes this walkthrough of how Hadoop writes map output. Thank you for reading; I hope the content shared here has helped you understand the process.



