Hadoop consists of two parts: the distributed file system HDFS and the distributed computing framework MapReduce. HDFS provides distributed storage for large-scale data, while MapReduce is built on top of the distributed file system to perform distributed computation on the data stored in it.
1 MapReduce design goals
Hadoop MapReduce was born in the search field, mainly to solve the poor scalability of massive-data processing faced by search engines. Its implementation draws heavily on the design ideas of Google MapReduce, including a simplified programming interface and improved system fault tolerance. The main design goals of Hadoop MapReduce are as follows:
1. Easy to program: traditional distributed programming is very complex, and there are many details the user has to handle, such as data partitioning, data transfer, and communication between nodes, so the barrier to writing distributed programs is high. One of Hadoop's important design goals is to simplify distributed programming by abstracting the details that all parallel programs must deal with into common modules implemented by the system, so that users only need to focus on their own application logic. This simplifies distributed programming and improves development efficiency.
2. Good scalability: as a company's business grows, the amount of accumulated data keeps increasing. When the data volume grows to a certain point, the existing cluster may no longer satisfy its computing and storage requirements, and administrators expect to be able to expand cluster capacity linearly simply by adding machines.
3. High fault tolerance: in a distributed environment, as the cluster grows, the failure rate within the cluster (including hardware failures such as disk damage, machine downtime, and inter-node communication failures, as well as bad data or software failures caused by bugs in user programs) rises significantly, increasing the likelihood of task failure and data loss. Hadoop therefore uses strategies such as computation migration and data migration to improve the availability and fault tolerance of the cluster.
2 MapReduce principles
2.1 The MapReduce programming model
MapReduce adopts a "divide and conquer" approach: a master node distributes operations on a large data set to the sub-nodes it manages, and the final result is obtained by merging the intermediate results from each node. Put simply, MapReduce is "decomposition of tasks and summarization of results".
In Hadoop, two machine roles execute MapReduce tasks: the JobTracker and the TaskTracker. The JobTracker schedules work, and the TaskTracker carries it out. There is only one JobTracker in a Hadoop cluster.
In distributed computing, the MapReduce framework takes care of the complex issues of parallel programming, such as distributed storage, job scheduling, load balancing, fault-tolerant processing, and network communication. The processing is highly abstracted into two functions, map and reduce: map decomposes a task into multiple subtasks, and reduce summarizes the results of those subtasks.
It should be noted that a dataset (or task) processed with MapReduce must have the property that it can be decomposed into many small datasets, each of which can be processed completely in parallel.
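As an illustration of "decompose and summarize", here is a minimal WordCount sketch written against the Hadoop MapReduce Java API; the class names and file layout are illustrative, not taken from the original article:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: decompose each input line into (word, 1) pairs.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);  // emit (word, 1)
        }
    }
}

// reduce: summarize the partial counts for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // emit (word, total count)
    }
}
```

The framework handles partitioning the input, shuffling the (word, 1) pairs to the reducers, and collecting the output; the programmer only writes these two functions.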
2.2 Hadoop MapReduce architecture
Like HDFS, Hadoop MapReduce uses a Master/Slave (M/S) architecture. It is mainly composed of the following components: Client, JobTracker, TaskTracker, and Task, as shown in the following figure.
These components are described below:
A. Client
The MapReduce program written by the user is submitted to the JobTracker through the Client; meanwhile, the user can view the running status of the job through interfaces provided by the Client. Inside Hadoop, a MapReduce program is represented as a "Job". A MapReduce program can correspond to several jobs, and each job is broken down into several Map/Reduce tasks (Task).
B. JobTracker
The JobTracker is mainly responsible for resource monitoring and job scheduling. It monitors the health of all TaskTrackers and jobs; once it detects a failure, it transfers the affected tasks to other nodes. At the same time, the JobTracker tracks task progress, resource usage, and other information and passes it to the task scheduler, which selects suitable tasks to use resources as they become idle. In Hadoop, the task scheduler is a pluggable module, and users can implement their own scheduler according to their needs.
C. TaskTracker
The TaskTracker periodically reports the resource usage and task progress of its node to the JobTracker through heartbeats, receives commands from the JobTracker, and performs the corresponding operations (such as starting new tasks or killing tasks).
The TaskTracker divides the resources of its node into equal portions called "slots"; a slot represents a unit of computing resources (CPU, memory, etc.).
A Task only gets a chance to run once it has been assigned a slot, and the role of the Hadoop scheduler is to allocate the free slots on each TaskTracker to Tasks.
Slots are divided into Map slots and Reduce slots, used by Map Tasks and Reduce Tasks respectively; the TaskTracker limits the concurrency of Tasks by the number of slots.
D. Task
Tasks come in two types, Map Task and Reduce Task, both started by the TaskTracker. As we know, HDFS stores data in fixed-size blocks, whereas for MapReduce the basic unit of processing is the split.
A split is a logical concept that contains only metadata, such as the starting position of the data, its length, and the nodes where it resides. How the data is split is entirely up to the user, but it is recommended that the split size equal the HDFS block size.
It is important to note that the number of splits determines the number of Map Tasks, because each split is handled by exactly one Map Task.
The correspondence between split and block:
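The number of splits, and therefore the number of Map Tasks, can be steered from the job driver. Here is a minimal sketch using the new-API FileInputFormat; the 64 MB block size matches the example used later in the text, and the input path is an illustrative assumption:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        FileInputFormat.addInputPath(job, new Path("/data/input"));  // hypothetical input path

        // Align splits with the HDFS block size (assumed 64 MB here), so each
        // Map Task reads a whole block that is local to its node.
        long blockSize = 64L * 1024 * 1024;
        FileInputFormat.setMinInputSplitSize(job, blockSize);
        FileInputFormat.setMaxInputSplitSize(job, blockSize);

        // The number of Map Tasks equals the number of splits computed at submission time.
    }
}
```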
A Map Task first iterates over its split and parses it into key/value pairs, then calls the user-defined map function to process them, and finally stores the temporary results on the local disk. The temporary data is divided into several partitions, and each partition will be processed by one Reduce Task.
A Reduce Task is divided into three phases: first, it reads the intermediate results of the Map Tasks from remote nodes (the Shuffle phase); second, it sorts the key/value pairs by key (the Sort phase); third, it reads the sorted data in order, calls the user-defined reduce function to process it, and stores the final result on HDFS (the Reduce phase).
2.3 Lifecycle of a Hadoop MapReduce job
This section explains the life cycle of a MapReduce job in terms of the physical entities involved, that is, the whole process from job submission to the end of its run, as shown in the following figure:
Step 1: job submission and initialization. After the user submits a job, a JobClient instance first uploads the job-related information, such as the program jar package, the job configuration file, and the split meta-information file, to the distributed file system (usually HDFS); the split meta-information file records the logical location of each input split. The JobClient then notifies the JobTracker through RPC. When the JobTracker receives the new job submission request, its job scheduling module initializes the job: a JobInProgress object is created to track the job's running status, and JobInProgress in turn creates a TaskInProgress object for each Task to track that task's status; a TaskInProgress may need to manage multiple "task run attempts" (Task Attempts). (A minimal driver sketch illustrating job submission follows Step 5 below.)
Step 2: task scheduling and monitoring. As mentioned earlier, task scheduling and monitoring are performed by the JobTracker. The TaskTracker periodically reports its node's resource usage to the JobTracker through heartbeats; once free resources appear, the JobTracker selects a suitable task to use them according to a certain policy, which is the job of the task scheduler. The task scheduler is a pluggable, independent module with a two-tier design: a job is selected first, and then a task is selected from that job, taking data locality into account. In addition, the JobTracker tracks the entire running process of the job to help it complete successfully: first, when a TaskTracker or a Task fails, the computation is transferred elsewhere; second, when the progress of a Task falls far behind that of other Tasks in the same job, an identical Task is started for it (speculative execution), and the result of whichever copy finishes first is taken as the final result.
Step 3: preparing the task runtime environment. Runtime environment preparation includes JVM startup and resource isolation, both implemented by the TaskTracker. The TaskTracker starts a separate JVM for each Task to prevent different Tasks from interfering with each other while running; at the same time, it uses operating-system processes to achieve resource isolation and prevent a Task from abusing resources.
Step 4: task execution. After the TaskTracker has prepared the running environment for a Task, it starts the Task. While the Task is running, it reports its latest progress to the TaskTracker through RPC, and the TaskTracker in turn reports it to the JobTracker.
Step 5: job completion. After all Tasks have finished executing, the whole job has run successfully.
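To make the submission path in Step 1 concrete, here is a minimal driver sketch using the Hadoop MapReduce Java API; it reuses the WordCountMapper and WordCountReducer classes from the earlier sketch, and the input/output paths are illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);   // the jar that gets shipped to the cluster
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/input"));     // hypothetical input path
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));  // hypothetical output path

        // waitForCompletion() submits the job (uploading the jar, configuration, and
        // split meta-information, then notifying the scheduler via RPC) and polls progress.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```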
2.4 The running mechanism of Hadoop MapReduce jobs
In chronological order, a job goes through the following phases: input split, map phase, combiner phase, shuffle phase, and reduce phase.
1) Input split: before the map computation, MapReduce computes input splits from the input files, and each input split is processed by one map task. An input split does not store the data itself; it stores the split's length and an array of locations recording where the data resides. Input splits are closely related to HDFS blocks. Suppose the HDFS block size is 64 MB and we input three files of 3 MB, 65 MB, and 127 MB: MapReduce will turn the 3 MB file into one input split, the 65 MB file into two input splits, and the 127 MB file into two input splits as well. In other words, if we do not adjust the input splits before the map computation (for example, by merging small files), five map tasks will run and the amount of data each map processes will be uneven, which is itself a key point of MapReduce optimization.
2) Map phase: the map function is written by the programmer, so its efficiency is relatively easy to control. The map operation is generally localized, that is, it runs on the node where its data is stored.
3) Combiner phase: the combiner phase is optional. A combiner is essentially a local reduce operation, which is why the WordCount example registers its reduce class as the combiner. It is a localized reduce that follows the map operation: before map writes out its intermediate files, it performs a simple merge on records with the same key. For example, when counting word frequencies, map records a 1 every time it encounters the word "hadoop"; since "hadoop" may appear many times in an article, the map output file would otherwise be highly redundant. Merging identical keys before the reduce computation makes the intermediate files smaller and improves transmission efficiency, and network bandwidth is often the bottleneck and the most precious resource of a Hadoop computation. The combiner is risky, however: the rule for using it is that it must not change the final result of the reduce computation. A combiner works for totals, maximums, and minimums, but using one to compute an average will make the final reduce result wrong. (A minimal combiner-registration sketch appears after this list.)
4) Shuffle phase: the process of taking the output of map and feeding it to reduce as input is the shuffle, and it is the focus of MapReduce optimization. I am not going to discuss how to optimize the shuffle phase here, only how it works, because most books do not explain it clearly.

The shuffle begins on the output side of the map phase. MapReduce generally processes huge amounts of data, so map cannot keep all of its output in memory, and writing map output to disk is a fairly involved process, all the more so because the output must also be sorted, which is memory-intensive. When producing output, map opens a ring (circular) memory buffer dedicated to output. Its default size is 100 MB, and a spill threshold is set for it in the configuration file (both the size and the threshold are configurable). Map also starts a daemon thread for the output operation: when the buffer reaches the threshold (80% by default), the daemon thread starts writing its contents to disk, a process called a spill. The remaining 20% of the buffer can continue to receive data while the spill proceeds, so writing to disk and writing to memory do not interfere with each other; if the buffer fills up completely, map blocks writes to memory until the spill finishes. As mentioned earlier, there is a sort before the data reaches disk; it happens during the spill, not while writing to memory, and if a combiner function is defined, it is applied to the sorted output of each spill.

Each spill produces a separate overflow file, so a map that spills several times produces several overflow files, and when the map output is complete, map merges these files into one. During this process there is also a Partitioner step, which confuses many people. The Partitioner is in fact analogous to the input split on the map side: one partition corresponds to one reduce task. If the MapReduce job has only one reduce task, there is only one partition; if there are several reduce tasks, there are correspondingly several partitions. The Partitioner is therefore effectively the input split of reduce, and the programmer can control it in code, usually based on the actual key and value contents, the business logic, or the need for better reduce-side load balancing; it is a key lever for improving reduce efficiency. (A minimal Partitioner sketch also appears after this list.)

On the reduce side, once the map output files have been merged, each reduce locates its corresponding partitions of map output and copies them over. During the copy, reduce starts several copier threads (five by default; the number can be changed in the configuration file). This copy process resembles the map-side write to disk: it also has a buffer with a threshold, both configurable, and the buffer resides in the reduce task's memory. While copying, reduce also sorts and merges the files, and after these operations the reduce computation proceeds.
5) Reduce phase: like the map function, the reduce function is written by the programmer, and the final result is stored on HDFS.
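As a hedged sketch of how a combiner is registered, the snippet below reuses the WordCountReducer from the earlier sketch as the combiner; this is safe only because summing partial counts does not change the final reduce result:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    // Registers a combiner on a WordCount-style job. Reusing the reducer as the
    // combiner works here because summation is associative and commutative;
    // it would NOT be correct for an averaging reducer, since averaging
    // partial averages without their counts changes the final result.
    public static void configure(Job job) {
        job.setCombinerClass(WordCountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
    }
}
```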
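To illustrate the Partitioner's role described in the shuffle phase (one partition per reduce task), here is a minimal custom Partitioner sketch; the hash-based routing rule is an illustrative assumption and mirrors what Hadoop's default HashPartitioner does:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each map output record to one of numPartitions reduce tasks;
// numPartitions equals the number of reduce tasks configured on the job.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Hash the key, mask off the sign bit to stay non-negative,
        // then take it modulo the number of reduce tasks.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
// In the driver: job.setPartitionerClass(WordPartitioner.class);
//                job.setNumReduceTasks(4);  // illustrative reducer count
```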