
How to analyze MapReduce


This article is about how to analyze MapReduce. The editor finds it very practical and shares it here, hoping you can get something out of it after reading. Let's take a look.

Today we will elaborate on MapReduce. Given that Hadoop 1.X is out of date, Hadoop 3.X is not yet widely used, and Hadoop 2.X is still widely used in enterprises, the discussion below is based on the Hadoop 2.X version of MapReduce (HDFS and Yarn are discussed later).

MapReduce is one of the three core components of Hadoop. Its design idea comes from the distributed computing model described in one of Google's three famous papers. As a programming framework for distributed computing, it requires the user to implement the business-logic code, which is combined with the framework's built-in default components into a complete distributed computing program that runs concurrently on a Hadoop cluster. When a complete MapReduce program runs in distributed mode, there are three types of instance processes:

1. MRAppMaster: responsible for process scheduling and state coordination of the whole program

2. MapTask: responsible for the whole data processing process in map phase

3. ReduceTask: responsible for the whole data processing process in reduce phase

The author wants to emphasize one point here: MapTask and ReduceTask are process-level instances, which is very important!

The author has drawn a flow chart of MapReduce processing, using word count (word statistics) as the running example:

MapReduce processes data in two main stages, map and reduce, which correspond to the MapTask and ReduceTask instances in the figure above. Data first goes into memory and is then spilled to disk; although there is a spill-ratio threshold, the author stresses that the data hits the disk at least once. With the figure above and the explanation below, you can grasp the whole MapReduce processing flow, including the details of what happens in the shuffle phase. The core mechanisms and components involved are covered below.
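For reference, here is a minimal word-count sketch on the Hadoop 2.x mapreduce API; the class and variable names are illustrative and not taken from the original figure:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // MapTask side: split each line into words and emit (word, 1)
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // ReduceTask side: after the shuffle, all counts for one word arrive together and are summed
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}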

1. Slicing mechanism

Slicing cuts a file into splits, but unlike HDFS blocks these splits are logical rather than physical. The slicing logic can be found in the getSplits method of the InputFormat API; its implementation class FileInputFormat provides the default slicing mechanism, so let's look straight at the core of that logic:
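Below is a condensed sketch of that default split computation; it is simplified for illustration and is not a verbatim copy of the Hadoop 2.x FileInputFormat source, but the SPLIT_SLOP constant and the loop structure follow it:

public class SplitSketch {

    static final double SPLIT_SLOP = 1.1; // a ratio, not a size

    static int countSplits(long fileLength, long minSize, long maxSize, long blockSize) {
        // the split size defaults to the block size when minSize <= blockSize <= maxSize
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        int splits = 0;
        long bytesRemaining = fileLength;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits++;                  // carve off one full-sized split
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits++;                  // the tail (up to 1.1 x splitSize) becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // 130 MB file with a 128 MB block size: 130 / 128 is about 1.02 < 1.1, so one split
        System.out.println(countSplits(130 * mb, 1, Long.MAX_VALUE, 128 * mb)); // prints 1
    }
}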

Here we focus mainly on the splittable case. From this logic, the default slicing mechanism in FileInputFormat is as follows:

1. Simply slice according to the content length of the file

2. The slice size equals the block size by default.

3. Slicing does not take into account the data set as a whole, but slices each file individually.

So by default the slice size equals blocksize. Note that no matter how the parameters are tuned, multiple small files will never be merged into one split (each file is sliced on its own), which hurts performance when there are many small files; we will come back to the small-file problem when discussing HDFS.

After understanding the slicing mechanism, beginners easily fall into a misunderstanding: if blocksize is configured as 128 MB, then a file is simply cut into 128 MB pieces, with the remainder of less than 128 MB at the end becoming a separate slice. The author stresses that it depends on the case. Careful readers will have noticed the key constant SPLIT_SLOP in the sketch above: its value is 1.1, a ratio rather than a size. Again taking a 128 MB blocksize, how many splits will a splittable 130 MB file produce? Since 130 / 128 is roughly 1.02, which is less than 1.1, the whole 130 MB becomes a single split rather than two, so it really is important to look at the source code.

2. MapReduce parallelism determination mechanism

1) MapTask parallelism determination mechanism

Once the slicing mechanism is understood, the MapTask parallelism mechanism is easy, because MapTask parallelism is mainly decided by slicing. The map-phase parallelism of a job is determined by the client when the job is submitted: the client's planning logic is to divide the data to be processed into multiple logical splits according to a specific split size, and then assign each split to one MapTask instance to be processed in parallel.

2) ReduceTask parallelism determination mechanism

Setting the number of ReduceTasks is very simple and can be done manually: job.setNumReduceTasks(4); the default value is 1, and here it is manually set to 4. ReduceTask parallelism also affects the execution efficiency of the whole job, and if the data is unevenly distributed, data skew may occur.

Note: although the setting itself is simple, the number cannot be chosen arbitrarily; the requirements of the business logic must also be considered. In some cases, such as computing a global summary result, there can be only one ReduceTask. Try not to run too many ReduceTasks: for most jobs it is best to have at most as many reducers as there are reduce slots in the cluster, or fewer. This is especially important for small clusters.

The choice of parallelism is affected by many factors, such as the hardware configuration of the compute nodes, the type of the job (CPU-intensive or IO-intensive), and the amount of data to be processed; ultimately it depends on the actual situation.

3) map | reduce core components

A) Partitioner component

The Partitioner is called before map output data spills to disk. By default, HashPartitioner assigns each record to a partition by key.hashCode() % numReduceTasks; the partitioning component can also be customized.
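As a hedged illustration, here is a minimal custom Partitioner sketch; the Text key type and the route-by-first-letter rule are assumptions for the example, not from the article:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes records to reducers by the first letter of the key; real rules depend on the business logic.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String k = key.toString();
        int first = k.isEmpty() ? '#' : Character.toLowerCase(k.charAt(0));
        // like the default HashPartitioner, the result must fall in [0, numReduceTasks)
        return (first & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class), usually together with job.setNumReduceTasks(...).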

B) Combiner component (which extends Reducer): it is called when the map-side ring buffer spills to disk, when the reduce-side input buffer spills to disk, and when multiple spill files are merged. Its goal is to reduce the amount of data written to disk (disk IO) and the amount of data transferred to the reducers (network bandwidth). Use it with caution: the number of times it is invoked is not guaranteed, so it must not affect the core business logic.
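For an operation like word count, where summing partial counts is safe, the combiner can simply reuse the reducer. A hedged driver sketch, reusing the hypothetical WordCount classes from the earlier example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.WordCountMapper.class);
        job.setCombinerClass(WordCount.WordCountReducer.class); // runs on the map side at spill/merge time
        job.setReducerClass(WordCount.WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}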

A classic counter-example is computing an average. With a per-MapTask averaging combiner: MapTask 1 sees 2, 5, 6 and emits (2 + 5 + 6) / 3 = 13/3; MapTask 2 sees 4, 3 and emits (4 + 3) / 2 = 7/2; the reducer then averages the two combiner outputs, (13/3 + 7/2) / 2 = 47/12, roughly 3.9.

Without the combiner, the result is (2 + 5 + 6 + 4 + 3) / 5 = 4. The two answers differ, so an averaging combiner breaks the business logic.

C) Grouping component: it defines the rules for grouping data in the reduce stage; by default, records with the same key go into one group. A custom GroupingComparator on the reduce side can make a group of different beans be treated as the same key:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class ItemidGroupingComparator extends WritableComparator {

    // pass in the class type of the bean used as the key, and tell the framework
    // to create instances of it via reflection
    protected ItemidGroupingComparator() {
        super(Order.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // cast down to the concrete bean type
        Order aBean = (Order) a;
        Order bBean = (Order) b;
        // when comparing two beans, compare only their itemId
        return aBean.getItemId().compareTo(bBean.getItemId());
    }
}
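For the comparator above to take effect it has to be registered on the job; a hedged sketch of the relevant driver lines (the Order bean is the one from the code above, everything else about the job is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class GroupingDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "grouping-demo");
        job.setMapOutputKeyClass(Order.class);                          // the bean used as the map output key
        job.setGroupingComparatorClass(ItemidGroupingComparator.class); // group reduce input by itemId only
        // mapper, reducer, output types and input/output paths are omitted here
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}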

Finally, let's talk about MapReduce's distributed cache. Through DistributedCache, MapReduce can distribute files specified by the job to the machines where the tasks run before the job executes, and it provides a mechanism for managing these cache files. Note that the files to be distributed must be placed on HDFS in advance, they should be treated as read-only during the task, and distributing large files is not recommended because it hurts performance. Typical uses are distributing third-party libraries and the small-table side of a multi-table join. The cached data is usually loaded in a custom Mapper class by overriding its setup method.

The above is how to analyze MapReduce. The editor believes some of these knowledge points may come up in daily work and hopes you can learn more from this article.
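As a concrete illustration of the setup-method pattern mentioned above, here is a minimal map-side join sketch; the HDFS path, the lookup file layout, and the field handling are hypothetical assumptions, not from the article:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // the "#lookup" fragment used in the driver creates a symlink named "lookup"
        // in the task's working directory
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                smallTable.put(fields[0], fields[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        // enrich each big-table record with the cached small-table value
        String joined = smallTable.getOrDefault(fields[0], "NULL");
        context.write(new Text(fields[0]), new Text(value.toString() + "\t" + joined));
    }
}

// In the driver, before submitting the job (the path is hypothetical and must already exist on HDFS):
// job.addCacheFile(new java.net.URI("hdfs:///data/dim/lookup.txt#lookup"));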
