Tuesday, 2019-2-19
Concurrency of Programs in the MapReduce Computing Framework (Advanced Features)
The so-called concurrency is the number of map task processes and reduce task processes that run during the execution of a MapReduce job to complete the program's processing.
MapReduce turns business processing logic into distributed processing.
The mechanism for determining the number of reduce tasks (reduce performs the global aggregation, so the right count is determined by the business scenario):
1. What the business logic requires
2. The amount of data
Setting method:
job.setNumReduceTasks(5);
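As a concrete illustration, here is a minimal driver sketch showing where this call fits. The WordCountDriver class name, the argument handling, and the omitted mapper/reducer wiring are illustrative assumptions, not details from the original article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);
        // setMapperClass/setReducerClass calls omitted for brevity.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // The concurrency knob discussed above: run 5 reduce tasks,
        // producing 5 output files (part-r-00000 ... part-r-00004).
        job.setNumReduceTasks(5);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}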
The number of reduce tasks cannot be specified arbitrarily. For example, suppose we must count the total number of words in a large pile of English files; in that case the whole program run can have only one reduce task. Why? Because every map task has to send all of its results to that single reduce task for it to total them, so for this kind of global count more than one reduce task cannot be used.
By contrast, in wordcount we count the total number of occurrences of each individual word, and in that case there can be as many reduce tasks as we like. After the map task results pass through the shuffle phase, the same word only ever appears in the same reduce task. We may end up with five output files, but each word's count appears in exactly one of them and is globally unique.
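The guarantee that the same word always lands in the same reduce task comes from the shuffle's partitioning step: Hadoop's default partitioner hashes each key modulo the number of reduce tasks, essentially as follows.

import org.apache.hadoop.mapreduce.Partitioner;

// Hadoop's default partitioner: records with equal keys get equal partition
// numbers, so every occurrence of a word is routed to the same reduce task.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative before the modulo.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

With job.setNumReduceTasks(5), each word is deterministically assigned to one of the five reduce tasks, which is why the five output files never count the same word twice.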
The mechanism for determining the number of map tasks:
Because there is no cooperative relationship between map tasks, each map task works independently, and no "global" aggregation can be performed during map-side processing; the number of map tasks therefore depends entirely on the amount of data to be processed.
Decision mechanism:
"Slice" the data to be processed.
Each slice is assigned to one map task for processing.
The default slicing mechanism in the MapReduce framework:
TextInputFormat.getSplits() is inherited from FileInputFormat.getSplits()
Thoughts on data slicing:
1. Define a slice size: it can be adjusted by parameters and by default equals the block size set in HDFS, usually 128 MB. Aligning slices with blocks reduces network transfer of the data to some extent, though not absolutely.
2. Get the list of all pending files under the input data directory.
3. Traverse the file list and slice the files one by one (a sketch of this loop follows the example below):
for (file : fileList)
    The file is cut from offset 0 into a slice every 128 MB. For example, a.txt (200 MB) is cut into two slices, a.txt: 0-128 MB and a.txt: 128-200 MB, while b.txt (80 MB) becomes a single slice, b.txt: 0-80 MB.
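The following is a simplified sketch of that loop. The Split class is an illustrative stand-in for Hadoop's FileSplit, file lengths are passed in directly instead of being read from HDFS, and refinements of the real FileInputFormat.getSplits() (such as the roughly 1.1x slop factor that avoids tiny trailing slices) are left out.

import java.util.ArrayList;
import java.util.List;

class Split {
    final String file;
    final long offset;
    final long length;
    Split(String file, long offset, long length) {
        this.file = file;
        this.offset = offset;
        this.length = length;
    }
}

class Slicer {
    // splitSize defaults to the HDFS block size (typically 128 MB); in real
    // Hadoop it is max(minSize, min(maxSize, blockSize)), tunable through the
    // mapreduce.input.fileinputformat.split.{minsize,maxsize} properties.
    static List<Split> getSplits(List<String> files, List<Long> lengths, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < files.size(); i++) {
            long offset = 0;
            long remaining = lengths.get(i);
            while (remaining > splitSize) {   // emit full-size slices
                splits.add(new Split(files.get(i), offset, splitSize));
                offset += splitSize;
                remaining -= splitSize;
            }
            if (remaining > 0) {              // emit the trailing partial slice
                splits.add(new Split(files.get(i), offset, remaining));
            }
        }
        return splits;
    }
}

Run with splitSize = 128 MB, this reproduces the example above: a 200 MB a.txt yields the two splits 0-128 MB and 128-200 MB, an 80 MB b.txt yields a single 0-80 MB split, and one map task is then launched per split.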
If the data to be processed is a large number of small files, the default slicing mechanism above produces a large number of slices and therefore a large number of map task processes, yet each slice is very small and each map task handles very little data, so overall efficiency is very low. The general solution is to pack multiple small files into a single slice; the implementation is to override the getSplits() method in a custom InputFormat subclass.
The MapReduce framework already includes an InputFormat implementation class for this scenario: CombineFileInputFormat.
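A minimal sketch of wiring it into a job is shown below. CombineTextInputFormat is the text-oriented concrete subclass of CombineFileInputFormat; the helper method name and the 4 MB split ceiling are illustrative assumptions, not values from the article.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

class SmallFileJobConfig {
    // Pack many small files into a few combined splits instead of one tiny
    // split per file, so the job launches far fewer map tasks.
    static void useCombinedSplits(Job job) {
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 4 MB (illustrative value).
        CombineTextInputFormat.setMaxInputSplitSize(job, 4L * 1024 * 1024);
    }
}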