MapReduce computing framework

2025-07-13 Update From: SLTechnology News&Howtos

Monday, February 18, 2019

MapReduce is a programming framework for distributed computing. Its core function is to take the core logic code written by the user and run it as a distributed program on a cluster of many servers.

Why MapReduce?

(1) A single machine cannot process massive data because of hardware resource constraints, so the data has to be processed on a distributed cluster.

(2) Once a stand-alone program is extended to run distributed on a cluster, its complexity and development difficulty increase greatly.

(3) With the MapReduce framework, developers can focus most of their work on the business logic, while the complexity of distributed computing is handled by the framework.

MapReduce Program Run Demo

Hadoop's distribution package ships with hadoop-mapreduce-examples-2.4.1.jar, which contains a variety of example MR programs. They can be run with the following steps:

Start HDFS and YARN.

Then, on any server in the cluster, run an example (here, wordcount):

hadoop jar hadoop-mapreduce-examples-2.4.1.jar wordcount /wordcount/data /wordcount/out

Problems the MapReduce framework has to handle

1. Distribute the program and start the distributed program

2. Buffering and scheduling of intermediate data

3. Task monitoring and failure handling

MapReduce framework operating mechanism

A MapReduce job is divided into three phases:

1. map: (A) read the input file; (B) call the business logic code (the only part the programmer has to care about); (C) collect the results of the calls.

2. shuffle: the caching mechanism that holds the intermediate data.

3. reduce: (A) pull data from the cache; (B) call the business logic code (the only part the programmer has to care about); (C) collect and output the final result. By default, the final result is written to HDFS.
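The three phases can be illustrated with a minimal single-process sketch. This is not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names chosen for illustration, and the "shuffle" here is just an in-memory grouping, assuming a word count as the business logic.

```python
from collections import defaultdict

def map_fn(offset, line):
    """User business logic for the map phase: emit (word, 1) for each word."""
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    """User business logic for the reduce phase: sum the counts of one key."""
    yield key, sum(values)

def run_job(lines):
    # map phase: (A) read input, (B) call map_fn, (C) collect the results
    map_output = []
    offset = 0
    for line in lines:
        map_output.extend(map_fn(offset, line))
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline

    # shuffle phase (simplified): group the intermediate values by key
    groups = defaultdict(list)
    for key, value in map_output:
        groups[key].append(value)

    # reduce phase: (A) pull grouped data, (B) call reduce_fn, (C) collect output
    result = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            result[out_key] = out_value
    return result

print(run_job(["hello world", "hello mapreduce"]))
# prints {'hello': 2, 'mapreduce': 1, 'world': 1}
```

Only `map_fn` and `reduce_fn` correspond to code a programmer would write; everything inside `run_job` stands in for work the framework does.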

Data flow in the MapReduce operating mechanism

1. map: the input key is the starting offset of a line, and the value is the content of that line.

2. shuffle: data is distributed by key, so kv pairs with the same key are always sent to the same reduce task.

3. reduce: the values that share a key are gathered into one group.
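To make the map input concrete, the sketch below shows how a text input might be turned into (offset, line) records. `read_records` is a hypothetical helper written for this illustration, assuming UTF-8 text and byte offsets; it is not part of any Hadoop API.

```python
def read_records(data: bytes):
    """Yield (offset, line) pairs: key = byte offset where the line starts,
    value = the line content without its line terminator."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)  # advance by the full line, terminator included

data = b"hello world\nhello mapreduce\n"
for key, value in read_records(data):
    print(key, repr(value))
# prints:
# 0 'hello world'
# 12 'hello mapreduce'
```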

The Shuffle Mechanism in MapReduce Framework

Shuffle cache flow:

Shuffle is one stage of the MR processing flow. Its individual steps are spread across the map task and reduce task nodes. Overall, shuffle consists of three operations:

1. Partition

2. Sort by key

3. Combiner: merge values locally
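The three operations can be sketched on a small in-memory list of kv pairs. This is an illustrative sketch under stated assumptions: `zlib.crc32` is used only as a deterministic stand-in for a key hash, the combiner is assumed to be a simple sum, and the function names are hypothetical.

```python
import zlib
from itertools import groupby
from operator import itemgetter

def partition(key, num_reduce_tasks):
    """Hash partitioning: the same key always lands in the same partition.
    crc32 is a stand-in hash chosen for this sketch."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

def partition_sort_combine(pairs, num_reduce_tasks):
    # 1. Partition: route every kv pair to the bucket of its reduce task
    buckets = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        buckets[partition(key, num_reduce_tasks)].append((key, value))

    result = []
    for bucket in buckets:
        # 2. Sort by key inside each partition
        bucket.sort(key=itemgetter(0))
        # 3. Combine: merge the values of equal keys locally (here: sum)
        combined = [(key, sum(v for _, v in group))
                    for key, group in groupby(bucket, key=itemgetter(0))]
        result.append(combined)
    return result

pairs = [("b", 1), ("a", 1), ("b", 1), ("c", 1), ("a", 1)]
for index, part in enumerate(partition_sort_combine(pairs, 3)):
    print(index, part)
```

Which partition a given key ends up in depends on the hash, so the sketch prints the layout rather than assuming it.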

Detailed walkthrough of the shuffle stage

1. In the map phase, once the input data has been read, the map method (our custom code) is called first.

2. Inside map, results are emitted with context.write, and this map-side output is handed over to the shuffle stage.

3. On the map side there is a ring buffer (default memory size: 100 MB) whose job is to collect these kv pairs; this is done by the collect thread.

4. As output is continuously produced and collected, the ring buffer keeps being written and would eventually fill up, but the framework does not let it fill completely. When it is 80% full, a spill to disk is triggered. Still on the map side, the spilled data is handled by a separate spill thread, which partitions the data and sorts it, then stores it on disk. Each spill file on disk is therefore divided into partitions, and the data inside each partition is sorted.

5. At the end of the map task, all remaining data is spilled one last time, again partitioned and sorted within each partition. This leaves a series of small partitioned spill files, which are then merged into one large file. The merge is done partition by partition: for example, the partition 1 data from all the spill files is merged into a single partition 1. The merged file is again partitioned and sorted within each partition; this is the final form of the file produced on the map side.

6. Shuffle is not carried out on a single node. It is the data scheduling mechanism between map and reduce, and the process mainly consists of caching, partitioning, and sorting.

7. Each reduce task actively downloads its share of the final files produced on the map side (for example, one reduce task downloads the contents of partition 1 from every map task). Partition 0, partition 1, and partition 2 are thus assigned to different reduce tasks.

8. Next, the partition 1 data fetched from all the map tasks is merged and sorted on the reduce side (a merge sort).

9. The reduce method is called once per aggregation group. Its parameters are key, the key shared by the group, and values, an iterator over all the values in the group.

The framework generates the values iterator for each group and passes it to the reduce method, using the key of the first kv pair in the group as the key parameter (grouping is decided by the GroupingComparator). In the end, the reduce task produces a sorted result file.

Tip: The other reduce tasks do the same thing, except that the data they fetch is the content of a different partition (for example partition 0 or partition 2); the processing is the same as above. Each reduce task produces one final sorted result file.

10. The final file produced by each reduce task is sorted internally, but the files are not necessarily sorted relative to one another. Getting a global order requires our program to intervene: for a global sort, we add partition control so that keys are partitioned by range, which makes the reduce outputs globally ordered. Keys before the first split point form one partition, keys between the split points form another, and keys after the last split point form the last.
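The range-based partition control described in step 10 can be sketched as follows. This is a simplified illustration, not Hadoop's own implementation (Hadoop provides a TotalOrderPartitioner for this purpose, which this sketch does not reproduce); `range_partition`, `global_sort`, and the split points "h" and "p" are assumptions made for the example.

```python
from bisect import bisect_right

def range_partition(key, split_points):
    """Return the partition index for key, given sorted split points.
    Keys below the first split point go to partition 0, and so on."""
    return bisect_right(split_points, key)

def global_sort(keys, split_points):
    # One partition per range; there is one more partition than split points
    partitions = [[] for _ in range(len(split_points) + 1)]
    for key in keys:
        partitions[range_partition(key, split_points)].append(key)
    # Each "reduce task" sorts only its own partition
    return [sorted(part) for part in partitions]

outputs = global_sort(["pear", "apple", "zebra", "mango", "kiwi", "fig"], ["h", "p"])
print(outputs)
# prints [['apple', 'fig'], ['kiwi', 'mango'], ['pear', 'zebra']]
```

Because every key in partition 0 is smaller than every key in partition 1, and so on, reading the output files in partition order yields a globally sorted result without any further merging.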

Summary: The whole shuffle process is as follows:

The map task writes its output to a memory buffer and spills it to disk files

The combiner is called

Partitioning and sorting

The reduce task pulls the data of its partition from the map output files

Merge sort on the reduce side

The framework generates the values iterator for each group and passes it to the reduce method, using the key of the first kv pair in the group as the key parameter (grouping is decided by the GroupingComparator).
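The summary can be sketched end to end in a single process. This is a toy model under several assumptions: the buffer capacity is counted in records rather than bytes, the combiner is omitted, `hash_partition` uses a byte sum as a stand-in hash, grouping is by key equality rather than a configurable comparator, and all function names are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def hash_partition(key, num_partitions):
    # Stand-in hash for this sketch: sum of the key's bytes
    return sum(key.encode("utf-8")) % num_partitions

def run_map_task(pairs, num_partitions, buffer_capacity=10, spill_threshold=0.8):
    """Collect kv pairs in a bounded buffer, spill sorted runs, then merge them.
    Returns the final map output: one key-sorted list per partition."""
    spills, buffer = [], []

    def spill():
        # A spill is partitioned, and sorted by key inside each partition
        run = [[] for _ in range(num_partitions)]
        for key, value in buffer:
            run[hash_partition(key, num_partitions)].append((key, value))
        spills.append([sorted(part, key=itemgetter(0)) for part in run])
        buffer.clear()

    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= buffer_capacity * spill_threshold:
            spill()  # the buffer reached 80% of its capacity
    if buffer:
        spill()  # final spill of whatever is left

    # Merge the spills partition by partition into the final map output
    return [list(heapq.merge(*(s[p] for s in spills), key=itemgetter(0)))
            for p in range(num_partitions)]

def run_reduce_task(partition_id, map_outputs, reduce_fn):
    # Pull this task's partition from every map output and merge-sort the runs
    merged = heapq.merge(*(out[partition_id] for out in map_outputs),
                         key=itemgetter(0))
    # Group adjacent equal keys and call reduce_fn once per group
    return [reduce_fn(key, (value for _, value in group))
            for key, group in groupby(merged, key=itemgetter(0))]

words_1 = "a b a c b a d a b c".split()
words_2 = "c a d".split()
map_outputs = [run_map_task([(w, 1) for w in words_1], 2),
               run_map_task([(w, 1) for w in words_2], 2)]
for p in range(2):
    print(p, run_reduce_task(p, map_outputs, lambda k, vs: (k, sum(vs))))
# prints:
# 0 [('b', 3), ('d', 2)]
# 1 [('a', 5), ('c', 3)]
```

With a capacity of 10 records, the first map task spills once at 8 records and once more at the end, so its final output is the merge of two spill runs.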

In MapReduce, six points in the whole process require I/O operations:

1. Reading the input (data is read from HDFS into map; the first I/O operation)

2. Spilling data to disk (the second I/O operation)

3. Merge (small files are merged into a large file; the third I/O operation)

4. Combiner local merge (the fourth I/O operation)

5. Merge sort (the fifth I/O operation occurs as the combined data is passed on to reduce processing)

6. Storing the result of reduce processing on HDFS (the sixth I/O operation)

This is also the biggest bottleneck of MapReduce compared with Spark. Spark reads data from HDFS once at the start and, after the task has been processed, writes the result to HDFS, which is one more I/O operation. The processing in between is done largely in memory, so there is no large number of I/O operations. This makes Spark fast, and it has become a mainstream computing engine.
