2025-04-10 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article explains in detail what MapReduce programming is and how it works. I hope you come away with a solid understanding of the topic after reading it.
There are already many articles about Hadoop MapReduce; what follows is a brief introduction.
Hadoop's MapReduce comes from Google's MapReduce paper (one of Google's three seminal papers, alongside GFS and Bigtable), and its core idea is "divide and conquer." Map is responsible for "splitting": breaking a complex task down into several "simple tasks" that are processed in parallel. The premise for splitting is that these small tasks can be computed in parallel and have few dependencies on each other. Reduce is responsible for "combining": globally summarizing the results of the Map phase.
At this stage, MapReduce generally runs on YARN, the resource platform introduced in Hadoop 2.x; the specific execution flow is described in detail below.
MapReduce Programming Specification
MapReduce development follows eight steps: the Map phase has 2 steps, the Shuffle phase 4 steps, and the Reduce phase 2 steps.
Map phase: 2 steps
1) Set the InputFormat class, which divides the data into Key-Value pairs (K1 and V1) and feeds them to the second step;
2) Write custom Map logic that converts the result of the first step into another Key-Value pair (K2 and V2) and outputs it;
Shuffle phase: 4 steps
3) Partition the output Key-Value pairs;
4) Sort the data within each partition by Key;
5) (Optional) Apply a combiner to pre-aggregate the data locally, reducing the amount of data copied over the network;
6) Group the data, putting the Values of the same Key into one collection;
Reduce phase: 2 steps
7) Sort and merge the results of multiple Map tasks; write a Reduce function implementing your own logic, which processes the input Key-Value pairs and converts them into new Key-Value pairs (K3 and V3) for output;
8) Set the OutputFormat class to process and save the Key-Value output of Reduce.
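The eight steps above can be sketched as a single-process simulation. This is an illustrative toy, not Hadoop code: all the function names (simple_input_format, word_count_map, and so on) are invented for this example, and a real Hadoop job would implement Mapper and Reducer classes in Java. Step 5 (the optional combiner) and step 8 (OutputFormat) are omitted for brevity.

```python
from collections import defaultdict

def simple_input_format(lines):
    # Step 1: turn raw input into (K1, V1) pairs -- here (line offset, line text).
    return list(enumerate(lines))

def word_count_map(k1, v1):
    # Step 2: custom map logic, emitting (K2, V2) pairs -- here (word, 1).
    for word in v1.split():
        yield (word, 1)

def partition(pairs, num_partitions):
    # Step 3: assign each (K2, V2) pair to a partition by hashing the key.
    parts = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        parts[hash(k) % num_partitions].append((k, v))
    return parts

def shuffle_sort_group(partition_pairs):
    # Steps 4 and 6: sort by key, then collect values of the same key.
    grouped = defaultdict(list)
    for k, v in sorted(partition_pairs):
        grouped[k].append(v)
    return grouped

def word_count_reduce(k2, values):
    # Step 7: reduce the grouped values to one (K3, V3) pair.
    return (k2, sum(values))

def run_job(lines, num_partitions=2):
    # Run steps 1-2 over the input, then steps 3-7 per partition.
    map_output = []
    for k1, v1 in simple_input_format(lines):
        map_output.extend(word_count_map(k1, v1))
    result = {}
    for part in partition(map_output, num_partitions):
        for k2, values in shuffle_sort_group(part).items():
            k3, v3 = word_count_reduce(k2, values)
            result[k3] = v3
    return result

print(run_job(["hello world", "hello mapreduce"]))
# {'hello': 2, 'world': 1, 'mapreduce': 1} (key order may vary)
```

In a real cluster, each partition would be processed by a separate Reduce task on a different machine; here the loop over partitions just runs sequentially.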
YARN resource scheduling
YARN is the resource management module of a Hadoop cluster, introduced in Hadoop 2.0. It provides resource management and scheduling for various computing frameworks: it manages the cluster's resources (chiefly the servers' hardware resources, including CPU, memory, disk, and network I/O) and schedules the tasks that run on it.
Compared with Hadoop 1.x, YARN's core idea is to separate resource management from job monitoring, which it achieves through a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
In short, YARN schedules resources and manages tasks.
YARN has a Master/Slave structure, composed mainly of the ResourceManager, NodeManager, ApplicationMaster, and Container components.
The ResourceManager (RM) is responsible for handling client requests and for managing and scheduling resources across all NodeManagers. It assigns idle Containers to ApplicationMasters and monitors their health. It consists of two main components, the scheduler and the applications manager:
Scheduler: the scheduler allocates the system's resources to the running applications subject to constraints such as capacity and queues. It allocates resources purely according to each application's resource requirements, with the Container as the unit of allocation; it is not responsible for monitoring or tracking application state. In short, the scheduler hands out resources, encapsulated in Containers, based on the applications' requests and the resource situation of the cluster's machines.
Applications Manager: the applications manager is responsible for managing all applications in the system, including accepting application submissions, negotiating with the scheduler for the resources to start the ApplicationMaster, monitoring the ApplicationMaster's running state and restarting it on failure, and tracking the progress and status of allocated Containers.
NodeManager (NM): the NodeManager is the resource and task manager on each node. It periodically reports the node's resource usage and the running state of each Container to the ResourceManager, and it receives and handles Container start/stop requests from the ApplicationMaster.
ApplicationMaster (AM): each application submitted by a user contains an ApplicationMaster, which is responsible for monitoring the application, tracking its execution status, restarting failed tasks, and so on. The ApplicationMaster negotiates resources from the ResourceManager on the application's behalf and works with the NodeManagers to execute and monitor tasks.
Container: the Container is YARN's resource abstraction. It encapsulates multidimensional resources on a node, such as memory, CPU, disk, and network. When the ApplicationMaster applies to the ResourceManager for resources, the resources the ResourceManager returns are represented as Containers.
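The relationship between these components can be sketched as a toy model: an ApplicationMaster asks the ResourceManager for containers, and the scheduler grants them from whichever node has enough free capacity. All class and method names here are invented for illustration; they are not YARN's actual API.

```python
from dataclasses import dataclass

@dataclass
class Container:
    # YARN's unit of allocation: a slice of one node's resources.
    node: str
    memory_mb: int
    vcores: int

class Node:
    # Stand-in for a NodeManager's reported free capacity.
    def __init__(self, name, memory_mb, vcores):
        self.name = name
        self.free_memory = memory_mb
        self.free_vcores = vcores

class Scheduler:
    """Grants Containers purely from resource requests; like YARN's
    scheduler, it does not track what the application does with them."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb, vcores):
        for node in self.nodes:
            if node.free_memory >= memory_mb and node.free_vcores >= vcores:
                node.free_memory -= memory_mb
                node.free_vcores -= vcores
                return Container(node.name, memory_mb, vcores)
        return None  # no node can satisfy the request right now

nodes = [Node("nm1", memory_mb=4096, vcores=4), Node("nm2", memory_mb=2048, vcores=2)]
rm = Scheduler(nodes)
c1 = rm.allocate(3072, 3)  # fits on nm1
c2 = rm.allocate(2048, 2)  # nm1 is now too small, so this lands on nm2
print(c1.node, c2.node)    # nm1 nm2
```

Real YARN schedulers (Capacity, Fair) add queues, capacity limits, and locality preferences on top of this basic request-and-grant loop.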
YARN architecture and workflow
1. The client submits a job (via a command such as hadoop jar xxx.jar) to the applications manager process in the ResourceManager (RM) on the master node;
2. The RM checks the cluster state, selects a NodeManager (NM), and opens a resource Container there to start the AppMaster process;
3. The AppMaster process takes over the job request received by the RM and splits it into tasks;
4. Based on the tasks, the AppMaster asks the ResourceScheduler in the RM for a resource allocation plan;
5. Following the allocation plan, the AppMaster contacts each NodeManager slave node;
6. Each slave node opens a resource Container and runs its Task;
7. The AppMaster collects the execution progress and results reported by each Task;
8. The AppMaster returns the job's results to the applications manager.
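The eight-step submission flow above can be traced with the RM, AppMaster, and NodeManagers reduced to plain functions. This is a narrative sketch, not real YARN code; the step numbers in the comments match the list, and every name is illustrative.

```python
def submit_job(task_ids, node_names):
    log = []
    # Steps 1-2: the client submits; the RM picks a NodeManager and
    # starts the AppMaster there in a container.
    am_node = node_names[0]
    log.append(f"RM: start AppMaster on {am_node}")
    # Steps 3-4: the AppMaster takes over the task list and gets an
    # allocation plan (here: round-robin over the nodes).
    plan = {t: node_names[i % len(node_names)] for i, t in enumerate(task_ids)}
    # Steps 5-6: the AppMaster contacts each NodeManager, which opens
    # a container and runs the task.
    results = {}
    for task, node in plan.items():
        log.append(f"NM {node}: run {task} in a new container")
        results[task] = "SUCCEEDED"
    # Steps 7-8: the AppMaster collects results and reports to the RM.
    log.append(f"AppMaster -> RM: {len(results)} tasks finished")
    return results, log

results, log = submit_job(["map_0", "map_1", "reduce_0"], ["nm1", "nm2"])
print(results)
# {'map_0': 'SUCCEEDED', 'map_1': 'SUCCEEDED', 'reduce_0': 'SUCCEEDED'}
```

Note that once the AppMaster is running, the RM is out of the loop until the final report; this is exactly the separation of resource management from job monitoring described earlier.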
That concludes this overview of what MapReduce programming is and how it works. I hope the content above is helpful; if you found the article useful, feel free to share it so more people can see it.
© 2024 shulou.com SLNews company. All rights reserved.