2025-02-24 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 05/31 Report --
In this article, the editor shares how MapTask is implemented in Hadoop. I hope you get something out of it. Let's walk through it together.
Overall implementation process
As shown in the figure above, the entire process of MapTask is divided into five stages:
● read phase: parses the data in the InputSplit into key/value pairs through the RecordReader.
● map phase: the key/value pairs parsed by the RecordReader are handed to the map() method, which processes them one by one and generates new key/value pairs.
● collect phase: the new key/value pairs generated in map() are written into the in-memory ring data buffer via OutputCollector.collect().
● spill phase: when the ring buffer reaches a certain threshold, the data is written to the local disk, producing a spill file. Before the file is written, the data is sorted locally and, if required by the configuration, compressed.
● combine phase: when all the data has been processed, all the temporary spill files are merged at once to produce a single data file.
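As a rough illustration of the first two stages, here is a self-contained Java sketch (this is not Hadoop's actual API; the class, method names, and the word-count map function are made up for illustration). A RecordReader-like step parses raw lines into (byte offset, line) pairs, and a map() function emits new key/value pairs to a collector:

```java
import java.util.*;
import java.util.function.BiConsumer;

public class ReadMapSketch {
    // "read phase": parse an InputSplit (here: raw lines) into (offset, line) pairs.
    static List<Map.Entry<Long, String>> readPhase(List<String> lines) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            records.add(Map.entry(offset, line));
            offset += line.length() + 1; // +1 for the newline
        }
        return records;
    }

    // "map phase": a word-count style map() that emits (word, 1) per word.
    static void map(Long key, String value, BiConsumer<String, Integer> collect) {
        for (String word : value.split("\\s+")) {
            if (!word.isEmpty()) collect.accept(word, 1);
        }
    }

    // Drive read + map, summing the emitted counts in place of a real collector.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, Integer> out = new HashMap<>();
        for (Map.Entry<Long, String> rec : readPhase(lines)) {
            map(rec.getKey(), rec.getValue(), (k, v) -> out.merge(k, v, Integer::sum));
        }
        return out;
    }
}
```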
Next we will take a more in-depth look at the three most important stages of the process: collect, spill, and combine.
Collect process
After the map() method generates a new key/value pair, OutputCollector.collect(key, value) is called. Inside this method, Partitioner.getPartition() is called to obtain the partition number of the record, which is then passed to MapOutputBuffer.collect() for further processing.
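Hadoop's default partitioner, HashPartitioner, derives the partition number from the key's hash code, masking off the sign bit so the modulo result is always non-negative. A minimal standalone version of that computation:

```java
public class HashPartitionSketch {
    // Same arithmetic as Hadoop's HashPartitioner: mask the sign bit,
    // then take the remainder modulo the number of reduce tasks.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

With a single reduce task every record lands in partition 0; with N tasks, keys are spread across partitions 0..N-1 by their hash.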
MapOutputBuffer uses an internal ring buffer to temporarily hold the user's output data. When the buffer utilization reaches a certain threshold, the SpillThread thread spills the data in the buffer to the local disk. When all the data has been processed, all the spill files are merged so that only one file is produced. The design of this data buffer directly affects the write efficiency of MapTask.
Ring buffers allow collect and spill phases to be processed in parallel.
MapOutputBuffer adopts a two-level index structure involving three ring memory buffers, namely kvoffsets, kvindices and kvbuffer. The size of this ring buffer can be set via io.sort.mb; the default size is 100MB, as shown below:
kvoffsets is the offset index array, used to hold the offset of each key/value record's entry in kvindices. A key/value pair occupies the size of one int in the kvoffsets array and three ints in the kvindices array (the partition number, the starting position of the key, and the starting position of the value, as shown in the figure above).
When the utilization of kvoffsets exceeds io.sort.spill.percent (80% by default), the SpillThread thread is triggered to spill the data to disk.
kvindices is the position index array, used to hold the starting positions of the actual key/value data in the data buffer kvbuffer.
kvbuffer is the data buffer, used to actually store the key/value bytes. By default it can use 95% of io.sort.mb. When the utilization of this buffer exceeds io.sort.spill.percent, the SpillThread thread is likewise triggered to spill the data to disk.
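The interaction of the three arrays can be sketched as follows. This simplified, illustrative class is not Hadoop's actual MapOutputBuffer (which is a ring buffer with wrap-around); the array sizes here are arbitrary assumptions. It records one int per key/value pair in kvoffsets and three ints per pair in kvindices, appending the raw bytes to kvbuffer:

```java
public class TwoLevelIndexSketch {
    final int[] kvoffsets = new int[1024];      // offset of each record's triple in kvindices
    final int[] kvindices = new int[3 * 1024];  // (partition, keyStart, valStart) per record
    final byte[] kvbuffer = new byte[1 << 20];  // raw serialized key/value bytes
    int kvindex = 0;                            // next record slot
    int bufindex = 0;                           // next free byte in kvbuffer

    void collect(int partition, byte[] key, byte[] value) {
        kvoffsets[kvindex] = kvindex * 3;              // one int in kvoffsets per record
        kvindices[kvindex * 3] = partition;            // partition number
        kvindices[kvindex * 3 + 1] = bufindex;         // key starts here in kvbuffer
        System.arraycopy(key, 0, kvbuffer, bufindex, key.length);
        bufindex += key.length;
        kvindices[kvindex * 3 + 2] = bufindex;         // value starts here in kvbuffer
        System.arraycopy(value, 0, kvbuffer, bufindex, value.length);
        bufindex += value.length;
        kvindex++;
    }
}
```

Because sorting only needs to permute the small index entries rather than the record bytes themselves, the spill sort never moves the data in kvbuffer.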
Spill process
During the collect phase, when the data in the in-memory ring buffer reaches a certain threshold, a spill operation is triggered to write part of the data to the local disk. The SpillThread thread is effectively the consumer of the kvbuffer buffer; its main code is as follows:
Java code

spillLock.lock();
while (true) {
    spillDone.signal();
    while (kvstart == kvend) {
        spillReady.await();
    }
    spillLock.unlock();
    // sort the data in the buffer kvbuffer and spill it to the local disk
    sortAndSpill();
    spillLock.lock();
    // reset the pointers to prepare for the next spill
    if (bufend < bufindex && bufindex < bufstart) {
        bufvoid = kvbuffer.length;
    }
    kvstart = kvend;
    bufstart = bufend;
}
spillLock.unlock();
The internal flow in the sortAndSpill () method looks like this:
The first step uses the quicksort algorithm to sort the data in kvbuffer[bufstart, bufend): records are ordered first by partition number, then by key. After these two sort keys are applied, the data is clustered together by partition, and the data within the same partition is sorted by key.
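The (partition, key) sort order can be illustrated with plain java.util sorting; this sketch stands in for Hadoop's in-place quicksort over the index arrays (the Rec record and method name are made up for illustration):

```java
import java.util.*;

public class SpillSortSketch {
    // A stand-in for one indexed record: its partition number and key.
    record Rec(int partition, String key) {}

    // Order by partition first, then by key within each partition.
    static List<Rec> sortForSpill(List<Rec> recs) {
        List<Rec> sorted = new ArrayList<>(recs);
        sorted.sort(Comparator.comparingInt(Rec::partition)
                              .thenComparing(Rec::key));
        return sorted;
    }
}
```

After this sort, a single sequential scan over the records visits each partition as one contiguous, key-sorted run.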
In the second step, the data is written partition by partition to a temporary file under the task's working directory. If the user has set a Combiner, the data in each partition is aggregated once before being written to the file (for example, values sharing the same key are merged).
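A Combiner's per-partition aggregation can be sketched as a single pass over an already key-sorted run. The sum-combiner below is a hypothetical example (not Hadoop's actual Combiner API) that merges adjacent entries with the same key:

```java
import java.util.*;

public class CombinerSketch {
    // Given a key-sorted run of (key, count) pairs, sum counts of equal keys.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> sorted) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sorted) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).getKey().equals(e.getKey())) {
                // Same key as the previous output entry: merge the counts.
                out.set(last, Map.entry(e.getKey(), out.get(last).getValue() + e.getValue()));
            } else {
                out.add(Map.entry(e.getKey(), e.getValue()));
            }
        }
        return out;
    }
}
```

Because the spill sort has already grouped equal keys next to each other, one linear pass is enough, and the spill file (and later network transfer) shrinks accordingly.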
The third step writes the meta-information of the partitioned data to the in-memory index data structure SpillRecord. The metadata of each partition includes its offset in the temporary file, the size of the data before compression, and the size of the data after compression.
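That per-partition metadata corresponds to the three fields of Hadoop's IndexRecord; a simplified standalone version looks like this:

```java
public class IndexRecordSketch {
    final long startOffset; // where this partition's data begins in the spill file
    final long rawLength;   // data size before compression
    final long partLength;  // data size on disk (== rawLength if uncompressed)

    IndexRecordSketch(long startOffset, long rawLength, long partLength) {
        this.startOffset = startOffset;
        this.rawLength = rawLength;
        this.partLength = partLength;
    }
}
```

With one such record per partition per spill, a reducer fetching partition p can seek directly to its byte range without scanning the whole file.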
Combine process
When all of the task's data has been processed, MapTask merges all of the task's temporary spill files into one large file and generates the corresponding index file. The merging is carried out in units of partitions.
By having each Task ultimately produce a single file, the overhead of opening a large number of files at the same time, and of randomly reading many small files, is avoided.
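The final merge can be pictured as a k-way merge of the sorted runs belonging to one partition across all spill files. The priority-queue sketch below is illustrative, not MapTask's actual Merger; the runs are modeled as in-memory sorted lists of keys:

```java
import java.util.*;

public class SpillMergeSketch {
    // Merge k sorted runs into one sorted list using a min-heap of
    // (run index, position) cursors, as in an external k-way merge.
    static List<String> mergeSortedRuns(List<List<String>> runs) {
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) pq.add(new int[]{i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] top = pq.poll();              // smallest head element across runs
            merged.add(runs.get(top[0]).get(top[1]));
            if (top[1] + 1 < runs.get(top[0]).size()) {
                pq.add(new int[]{top[0], top[1] + 1}); // advance that run's cursor
            }
        }
        return merged;
    }
}
```

Because every spill run is already sorted, the merge reads each run sequentially exactly once, which is what keeps the final pass cheap even with many spills.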
After reading this article, I believe you have a certain understanding of how MapTask is implemented in Hadoop. If you want to learn more, you are welcome to follow the industry information channel. Thank you for reading!