This article introduces the MapReduce process. Many people run into difficulties at one stage or another when working through real cases, so the goal here is to walk through the whole flow step by step. I hope you read carefully and learn something!
MapReduce is a YARN-based distributed, offline, parallel computing framework. Its main responsibility is processing massive data sets, and it is a very important tool in the Hadoop ecosystem, so MapReduce is a key topic in big-data study and needs to be mastered well!
MapReduce contains many components, but the most important threads are Job submission and the end-to-end Map and Reduce flow. Once these two main lines are clear and the details are organized into a coherent picture, learning MapReduce becomes much easier. The Job submission process is already analyzed and illustrated in detail in "Hadoop: The Definitive Guide", so this summary focuses on how massive data is read and processed in MapTask and ReduceTask during the MapReduce process, along with the components involved. Let's take a look.
Taken as a whole, the MapReduce process can be roughly divided into five steps:
1. input (the map side reads split data) ---> 2. map processing ---> 3. shuffle ---> 4. reduce processing ---> 5. output (the reduce side writes the results)
We will now analyze and explain this process step by step. Note: the data structure that flows through MapReduce is always the key-value pair.
1. The Map side reads data:
a. Before reading, the client slices the input data; one split corresponds to one map task. The split size is computed as max(minSize, min(maxSize, blockSize)), so adjusting minSize and maxSize changes the number of maps. The default minSize is 1 and the default maxSize is Long.MAX_VALUE, which makes the split size equal to the HDFS block size (see the sketch after this list).
b. The data is first formatted by TextInputFormat, whose LineRecordReader is then called in a loop: nextKeyValue(), getCurrentKey(), getCurrentValue(), etc., delivering each line to the MapTask as a key-value pair (the byte offset as the key, the line contents as the value).
c. Split reading details: to avoid cutting a record in half at a split boundary, every split except the last reads one extra line past its end, and every split except the first discards its first line (it was already consumed by the previous split).
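Below is a minimal sketch of the split-size rule from point (a). It mirrors the logic of Hadoop's FileInputFormat.computeSplitSize(); the class and method here are illustrative stand-ins, not the framework code itself (in Hadoop 2+, minSize and maxSize come from mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize).

public class SplitSizeDemo {
    // Same rule FileInputFormat applies: max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;  // a 128 MB HDFS block
        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split == block, one map per block.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
        // Raising minSize above the block size -> larger splits, fewer maps.
        System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE));
        // Lowering maxSize below the block size -> smaller splits, more maps.
        System.out.println(computeSplitSize(blockSize, 1L, 64L * 1024 * 1024));
    }
}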
2. Map processing
a. On the Map side, the framework calls the map() method we wrote for our business logic once per input line. setup() is called exactly once before the first map() call, and cleanup() exactly once after the last one (see the Mapper sketch after this list).
At this stage, the data is broken down into formal key-value pairs.
b. At this stage an optional combiner can run to pre-aggregate the map output locally (worthwhile when the data volume is large), shrinking the amount of data handed to the shuffle.
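Here is a minimal word-count-style Mapper sketch showing the setup/map/cleanup lifecycle described above; the class and field names are ours, chosen purely for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void setup(Context context) {
        // Runs once per task, before the first map() call.
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Called once per input line; emits the formal key-value pairs.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, one);
            }
        }
    }

    @Override
    protected void cleanup(Context context) {
        // Runs once per task, after the last map() call.
    }
}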
3. The shuffle process: everything that happens to the data (partitioning, sorting, caching) on its way from the Map side's output to the Reduce side's input.
a. Output from the map side first enters the outputCollector, a data collector, which writes the records into a circular (ring) buffer, 100 MB by default; the spill threshold is 80%, so about 20% of the buffer stays in reserve to keep accepting records while a spill is in progress.
b. When the ring buffer reaches its spill threshold, a spiller writes the overflow to disk. During the spill, getPartition(k, v, numReduceTasks) is called to partition the records, each record landing in a partition according to its key's hash code, and a quick sort by key is performed within each partition (see the Partitioner sketch below); the reduce side later fetches its partition from these spill files.
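For reference, the default partitioning rule is the one in Hadoop's HashPartitioner; the subclass below re-implements the same logic under an illustrative name.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class DemoPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative,
        // then map the key's hash onto one of the reduce tasks.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

A custom Partitioner like this is registered on the job with job.setPartitionerClass(DemoPartitioner.class).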
4. Reduce processing
a. What the shuffle delivers is a set of partitioned, sorted, indexed files. The ReduceTask framework reads one key from such a file, passes it to the reduce() method, and at the same time passes a value iterator.
b. The value iterator's hasNext() method checks whether the next key in the file equals the key that was passed in: if it does, the next value is returned; if not, the iteration stops and reduce() is invoked again with the next key.
c. The apparent effect is that the ReduceTask has grouped the data in advance and calls reduce() once per group; in reality no up-front grouping happens, the grouping is simulated by this iterator trick over the sorted file.
d. After the ReduceTasks finish, the per-partition result files are merged into the final large output (written to HDFS by default). A minimal Reducer sketch follows.
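This Reducer sketch matches the word-count Mapper above; the framework drives the Iterable exactly as described in points (a)-(c).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key "group"; the values iterator walks the sorted
        // file and stops as soon as it meets a different key.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}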
5. Output (the reduce side writes the results)
TextOutputFormat formats the data, and its LineRecordWriter is called in a loop via write(key, value) to write each record to the external file system (HDFS by default). A driver sketch wiring all five steps together follows.
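To tie the five steps together, here is a minimal driver sketch for the word-count example used above; WordCountMapper and WordCountReducer are the illustrative classes from earlier, and the input/output paths are assumed to come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setInputFormatClass(TextInputFormat.class);    // step 1: read splits line by line
        job.setMapperClass(WordCountMapper.class);         // step 2: map processing
        job.setCombinerClass(WordCountReducer.class);      // optional local pre-aggregation
        job.setReducerClass(WordCountReducer.class);       // step 4: reduce processing
        job.setOutputFormatClass(TextOutputFormat.class);  // step 5: write the results

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}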
"MapReduce process what" content is introduced here, thank you for reading. If you want to know more about industry-related knowledge, you can pay attention to the website. Xiaobian will output more high-quality practical articles for everyone!