How is MapReduce executed?

2025-01-19 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

In this issue, the editor explains how a MapReduce program is executed. The article analyzes the topic from a professional point of view; we hope you get something out of reading it.

1 MR principle

MapReduce is a programming framework for distributed computing programs, and the core framework with which users develop data-analysis applications on Hadoop.

The core function of MapReduce is to integrate the business-logic code written by the user with the framework's own default components into a complete distributed program that runs concurrently on a Hadoop cluster.
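To make the map/shuffle/reduce division of labor concrete, here is a plain-Java sketch of the classic word-count flow. This is a simulation on ordinary collections, not Hadoop API code; the class and method names are illustrative.

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java sketch of the MapReduce word-count flow (no Hadoop
// dependencies). Names here are illustrative, not Hadoop API.
public class WordCountSketch {

    // "map" phase: emit one (word, 1) pair per token in a line
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // "shuffle" phase: group all values by key, with keys sorted
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // "reduce" phase: sum the values collected for each key
    static Map<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b c"))); // {a=2, b=2, c=1}
    }
}
```

In the real framework, the user supplies only the map() and reduce() logic; partitioning, sorting, spilling, and merging are handled by the framework, which is the point of the paragraph above.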

2 Why use MapReduce

Large amounts of data cannot be processed on a single machine because of hardware resource constraints.

Once a stand-alone program is extended to run distributed across a cluster, the complexity and development difficulty of the program increase greatly.

With the introduction of the MapReduce framework, developers can focus most of their effort on the business logic and leave the complexity of distributed computing to the framework.

3 MapReduce structure and core operating mechanism

1 structure

A complete MapReduce program runs as three types of instance processes when distributed:

MRAppMaster: responsible for process scheduling and state coordination of the whole program

MapTask: responsible for the entire data processing flow of the map phase

ReduceTask: responsible for the entire data processing flow of the reduce phase

Idea: divide and conquer; split first, then merge.

2 overall flow chart

The number of maptasks cannot be set directly (it is determined by the input splits), whereas the number of reducetasks can be set explicitly, e.g. job.setNumReduceTasks(5).
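The map-task count follows from the input splits: roughly one map task per split, where the split size usually equals the HDFS block size (128 MB by default in Hadoop 2+). A sketch of that arithmetic, with illustrative names:

```java
public class SplitCount {
    // Rough sketch: one map task per input split; the split size
    // usually equals the HDFS block size (128 MB by default).
    static long numSplits(long fileSizeBytes, long splitSizeBytes) {
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // a 300 MB file with 128 MB splits -> 3 map tasks
        System.out.println(numSplits(300 * mb, 128 * mb)); // 3
    }
}
```

The real InputFormat logic is more involved (it considers file boundaries and a split-slop factor), so treat this as an approximation of the idea, not Hadoop's exact computation.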

3 process analysis

1. When an MR program starts, MRAppMaster is launched first. After MRAppMaster starts, it calculates from the job's description how many maptask instances are needed, then applies to the cluster for machines on which to start that many maptask processes.

2. After a maptask process starts, it processes the data in its assigned input split. The main steps are:

Use the InputFormat specified by the client to obtain a RecordReader, which reads the data into input KV pairs.

Pass the input KV pairs to the client-defined map() method, perform the logical operation, and collect the KV pairs output by map() into a cache.

The KV pairs in the cache are sorted by key within each partition and repeatedly spilled to disk files.
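The partitioning step relies on a partition function. Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; below is a plain-Java sketch of that formula (the class is illustrative, not the Hadoop class itself):

```java
public class HashPartitionSketch {
    // Same formula as Hadoop's default HashPartitioner: mask off the
    // sign bit so the result is non-negative, then take the remainder
    // modulo the reducer count.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // records with the same key always land in the same partition,
        // which is what lets one reducetask see all values for a key
        System.out.println(getPartition("apple", 5) == getPartition("apple", 5)); // true
    }
}
```

The determinism shown here is the essential property: every occurrence of a key, on every maptask, maps to the same reducetask.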

3. After MRAppMaster observes that all maptask processes have completed, it starts the number of reducetask processes specified by the client and tells each reducetask process the range of data (the data partition) it should handle.

4. After a reducetask process starts, it uses the locations provided by MRAppMaster to fetch its partition of the output files from the machines where the maptasks ran, then re-merges and sorts them locally. It groups KV pairs with the same key, calls the client-defined reduce() method on each group for the logical operation, collects the output KV pairs, and finally calls the OutputFormat specified by the client to write the result data to external storage.
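The local "re-merge and sort" on the reduce side is essentially a k-way merge of runs that each maptask already sorted. A plain-Java sketch of that merge using a priority queue (not Hadoop code; names are illustrative):

```java
import java.util.*;

// Sketch of the reduce-side merge: each maptask delivers a key-sorted
// run for this partition, and the reducetask merges the runs with a
// k-way merge so that equal keys become adjacent and can be handed
// to reduce() as one group.
public class ReduceMergeSketch {

    static List<String> mergeSortedRuns(List<List<String>> runs) {
        // queue entries: {run index, position within that run}
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> runs.get(e[0]).get(e[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) pq.add(new int[]{i, 0});
        }
        List<String> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] e = pq.poll();
            merged.add(runs.get(e[0]).get(e[1]));
            if (e[1] + 1 < runs.get(e[0]).size()) pq.add(new int[]{e[0], e[1] + 1});
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = List.of(
            List.of("a", "c", "c"),   // sorted output of maptask 1
            List.of("b", "c", "d"));  // sorted output of maptask 2
        System.out.println(mergeSortedRuns(runs)); // [a, b, c, c, c, d]
    }
}
```

Because the inputs are already sorted, the merge is linear in the total number of records (times log of the number of runs), which is why the framework sorts on the map side first.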

4 shuffle mechanism

1 Overview

In MapReduce, the way the data produced by the map phase is transferred to the reduce phase is the most critical process in the framework; it is called shuffle.

Shuffle: to shuffle and deal, as with cards. Its core mechanisms are data partitioning, sorting, and caching.

Concretely, the result data output by maptask is distributed to reducetask, and in the process of distribution the data is partitioned and sorted by key.

2 main processes

Shuffle is a stage of the MR processing flow. Each of its steps is completed on the map task and reduce task nodes. As a whole, it consists of three operations:

Partition: split the data by partition

Sort: sort by key

Combine: merge values locally with a Combiner


3 detailed process

The maptask collects the KV pairs output by our map() method. Each pair first passes through the partition method, which tags it with its partition number, and is then written to an in-memory buffer (100 MB by default).

When the ring buffer reaches 80% capacity, its contents are spilled from memory to local disk files, and multiple spill files may be produced. Before each spill the data is quicksorted by key index, in dictionary order.

The multiple spill files are then merged into one large file (using merge sort). A Combiner operation can also be applied to the spill files, but only for aggregations such as sums; it is not valid for averages.
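Why a combiner is safe for sums but not for naive averages: applying the function per spill file and then again at the reducer must give the same result as applying it once over all values. A small plain-Java demonstration (illustrative names, not Hadoop code):

```java
import java.util.*;

// Combiner correctness check: sum is associative across spill files,
// a naive average is not.
public class CombinerSketch {

    static int sum(List<Integer> xs) {
        return xs.stream().mapToInt(Integer::intValue).sum();
    }

    static double avg(List<Double> xs) {
        return xs.stream().mapToDouble(Double::doubleValue).average().orElse(0);
    }

    public static void main(String[] args) {
        // two spill files for the same key
        List<Integer> spill1 = List.of(1, 2, 3);
        List<Integer> spill2 = List.of(10);

        // sum: combining per spill, then summing the partial sums,
        // equals summing everything at once (16 either way)
        int combined = sum(List.of(sum(spill1), sum(spill2)));
        int direct = sum(List.of(1, 2, 3, 10));
        System.out.println(combined == direct); // true

        // average: averaging the per-spill averages gives 6.0,
        // but the true average of {1, 2, 3, 10} is 4.0
        double avgOfAvgs = avg(List.of(avg(List.of(1.0, 2.0, 3.0)), avg(List.of(10.0))));
        double trueAvg = avg(List.of(1.0, 2.0, 3.0, 10.0));
        System.out.println(avgOfAvgs + " vs " + trueAvg); // 6.0 vs 4.0
    }
}
```

The standard workaround for averages is to have the combiner emit (sum, count) pairs and let the reducer divide at the end, which restores associativity.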

During the spill and merge process, the partitioner is called to partition the data and sort it by key.

Each reducetask, according to its own partition number, goes to each maptask machine and fetches the corresponding partition of result data. The pulled data is first stored in memory; when memory is insufficient, it spills to disk.

A reducetask fetches the result files for the same partition from the different maptasks, and then merges those files (merge sort).

After merging into one large file, the shuffle process is over. What follows is the reducetask's logical operation: it takes one group of key-value pairs at a time from the file and calls the user-defined reduce() method.

The buffer size in shuffle affects the execution efficiency of MapReduce programs: in principle, the larger the buffer, the fewer the disk I/O operations and the faster the execution.

The size of the buffer can be adjusted with a parameter: io.sort.mb, which defaults to 100 MB.
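For reference, io.sort.mb is the property name from older Hadoop versions; in Hadoop 2+ it was renamed mapreduce.task.io.sort.mb, and the 80% spill threshold is controlled by mapreduce.map.sort.spill.percent. A mapred-site.xml fragment sketching both (values here are examples, not recommendations):

```xml
<configuration>
  <!-- map-side sort buffer size in MB (default 100) -->
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>200</value>
  </property>
  <!-- fraction of the buffer at which spilling begins (default 0.80) -->
  <property>
    <name>mapreduce.map.sort.spill.percent</name>
    <value>0.80</value>
  </property>
</configuration>
```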

That is how MapReduce is executed, as shared by the editor. If you had similar doubts, we hope the analysis above helps resolve them.
