2025-01-17 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
Big data: a good programmer shares the whole MapReduce processing flow, and the concepts of moving data versus moving computation.
When learning big data, you come across two concepts, moving data and moving computation, which are closely related yet very different; moving computation is also called local computation.
The moving-data approach used in earlier data processing transfers the data to be processed to the nodes that hold the processing logic. This is very inefficient: in big data the data volume is huge, at least gigabytes and often terabytes, petabytes, or more, while disk I/O and network I/O are slow, so processing takes far too long to meet our requirements. Hence moving computation appeared.
Moving computation, also known as local computation, leaves the data where it is stored on each node and instead transfers the processing logic to the data nodes. Because a program is small, it can be shipped quickly to every node where data is stored, and the data is then processed locally, with high efficiency. Today's big data processing technology all works this way.
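The idea can be sketched with a toy simulation (plain Python, with hypothetical names such as `process_locally`; not real Hadoop code): a small function travels to each node's partition, and only small partial results come back.

```python
# Toy illustration of "moving computation": ship a small function to each
# node that already holds a data partition, instead of shipping the large
# partitions to one central node.

def process_locally(nodes, compute):
    """Run `compute` on each node's local partition, then combine the
    small per-node results centrally."""
    partial_results = [compute(partition) for partition in nodes.values()]
    return sum(partial_results)

# Each node stores its own (potentially huge) slice of the data.
nodes = {
    "node-1": [1, 2, 3],
    "node-2": [4, 5],
    "node-3": [6, 7, 8, 9],
}

# Only the tiny `sum` logic travels; only small partial sums come back.
total = process_locally(nodes, sum)
print(total)  # 45
```

In the moving-data approach, all nine elements would cross the network; here only three partial sums do.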
Put succinctly:
Map phase:
1. Read: read the data source and parse the data into key/value pairs one by one.
2. Map: in the map function, process the parsed key/value pairs and generate new key/value pairs.
3. Collect: collect the output and store it in a ring memory buffer.
4. Spill: when the memory buffer is full, the data is written to local disk, producing a temporary file.
5. Combine: merge the temporary files to ensure that a single data file is produced.
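The five map-side steps above can be sketched in miniature (plain Python with hypothetical names such as `run_map_task` and `BUFFER_LIMIT`; a real MapTask uses a byte-level ring buffer and on-disk spill files):

```python
import heapq

NUM_PARTITIONS = 2
BUFFER_LIMIT = 4  # tiny buffer so the sketch actually spills

def partition_of(key):
    # Deterministic stand-in for Hadoop's HashPartitioner.
    return sum(key.encode()) % NUM_PARTITIONS

def map_fn(line):
    # Word-count style map(): one (word, 1) pair per word.
    for word in line.split():
        yield word, 1

def run_map_task(lines):
    buffer, spills = [], []
    for line in lines:                                      # 1. Read
        for key, value in map_fn(line):                     # 2. Map
            buffer.append((partition_of(key), key, value))  # 3. Collect
            if len(buffer) >= BUFFER_LIMIT:                 # 4. Spill
                spills.append(sorted(buffer))
                buffer = []
    if buffer:
        spills.append(sorted(buffer))
    # 5. Combine: merge all sorted spill runs into one sorted "file".
    return list(heapq.merge(*spills))

out = run_map_task(["a b a", "b c"])
print(out)  # [(0, 'b', 1), (0, 'b', 1), (1, 'a', 1), (1, 'a', 1), (1, 'c', 1)]
```

Note how the final output is already clustered by partition number and sorted by key within each partition, which is exactly what the reduce side relies on.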
Reduce phase:
1. Shuffle/Copy: Reduce Task remotely copies a partition of data from each Map Task.
2. If a piece of data exceeds a certain size threshold, it is written to disk; otherwise it is kept in memory.
3. Merge: merge the in-memory and on-disk files to prevent excessive memory use or too many disk files.
4. Sort: Map Task performs a local sort; Reduce Task performs a merge sort.
5. Reduce: hand the data to the reduce function.
6. Write: the reduce function writes its result to HDFS.
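The reduce-side steps can likewise be sketched in plain Python (hypothetical name `run_reduce_task`; the real ReduceTask spills large fetches to disk and writes to HDFS rather than returning a list):

```python
import heapq
from itertools import groupby
from operator import itemgetter

def run_reduce_task(sorted_map_outputs, reduce_fn):
    # 1-2. Copy: one sorted run per Map Task (kept in memory here).
    # 3-4. Merge + Sort: one merge pass suffices because each run is
    #      already sorted by the map side.
    merged = heapq.merge(*sorted_map_outputs)
    # 5. Reduce: group values by key and apply the user reduce function.
    results = []
    for key, group in groupby(merged, key=itemgetter(0)):
        results.append((key, reduce_fn(v for _, v in group)))
    # 6. Write: return instead of writing to HDFS.
    return results

run1 = [("a", 1), ("b", 1)]  # sorted output of Map Task 1
run2 = [("a", 1), ("c", 1)]  # sorted output of Map Task 2
print(run_reduce_task([run1, run2], sum))  # [('a', 2), ('b', 1), ('c', 1)]
```

Because every map-side run arrives sorted, grouping by key needs no further sorting, only the single merge pass.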
A more in-depth analysis:
MapTask stage
(1) Read stage: MapTask parses each key/value pair from the InputSplit through the user-written RecordReader.
(2) Map stage: this stage mainly hands the parsed key/value pairs to the user-written map() function for processing, generating a series of new key/value pairs.
(3) Collect stage: inside the user-written map() function, when data processing is complete, OutputCollector.collect() is usually called to output the result. Inside that function, the generated key/value pairs are partitioned (by calling the Partitioner) and written to a ring memory buffer.
(4) Spill stage: that is, "spill to disk". When the ring buffer is full, MapReduce writes the data to the local disk, generating a temporary file. Note that before the data is written to local disk, it is sorted locally and, if necessary, combined, compressed, and so on.
Details of the spill stage:
Step 1: use the quicksort algorithm to sort the data in the buffer, first by partition number and then by key. After sorting, the data is clustered by partition, and all data within the same partition is sorted by key.
Step 2: according to partition number, write the data of each partition to the temporary file output/spillN.out under the task working directory (N is the current spill count). If the user has set a Combiner, an aggregation is performed on the data of each partition before it is written to the file.
Step 3: write the metadata of the partitioned data to the in-memory index structure SpillRecord, where each partition's metadata includes its offset in the temporary file, its size before compression, and its size after compression. If the in-memory index grows beyond 1MB, it is written to the file output/spillN.out.index.
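Steps 1 and 3 can be illustrated with a small sketch (plain Python; the `index` triples here are a simplified stand-in for SpillRecord's byte offsets):

```python
from itertools import groupby

records = [(1, "b", 1), (0, "d", 1), (1, "a", 1), (0, "c", 1)]
# Step 1: sorting by the (partition, key) tuple clusters records by
# partition and orders them by key inside each partition.
records.sort(key=lambda r: (r[0], r[1]))

# Step 3 (sketch): per-partition metadata, analogous to SpillRecord --
# here just (partition, start index, record count) in the sorted run.
index, pos = [], 0
for part, group in groupby(records, key=lambda r: r[0]):
    n = len(list(group))
    index.append((part, pos, n))
    pos += n
print(records)  # [(0, 'c', 1), (0, 'd', 1), (1, 'a', 1), (1, 'b', 1)]
print(index)    # [(0, 0, 2), (1, 2, 2)]
```

The index lets a later reader jump straight to any partition's records without scanning the whole spill file.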
(5) Combine stage: when all data has been processed, MapTask merges all temporary files once so that only one data file is ultimately generated, saved as output/file.out, with a corresponding index file output/file.out.index. During the merge, MapTask merges partition by partition. For each partition it uses multiple rounds of recursive merging: io.sort.factor files are merged per round, and the resulting file is added back to the list to be merged; this repeats until a single large file remains. Having each MapTask produce only one data file avoids the overhead of opening many files at once and the random reads caused by reading many small files.
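The multi-round recursive merge can be sketched as follows (plain Python, hypothetical name `multi_round_merge`; `factor` plays the role of io.sort.factor):

```python
import heapq

def multi_round_merge(runs, factor=3):
    # Merge at most `factor` sorted runs per round, put the merged run
    # back on the list, and repeat until one run remains.
    runs = list(runs)
    while len(runs) > 1:
        batch, runs = runs[:factor], runs[factor:]
        runs.append(list(heapq.merge(*batch)))
    return runs[0]

runs = [[1, 5], [2, 6], [3, 7], [4, 8]]
print(multi_round_merge(runs, factor=3))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

With four runs and a factor of 3, the first round merges three runs into one, leaving two runs; the second round merges those into the final file.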
Shuffle phase (from the map-side output to the reduce-side input)
1) MapTask collects the kv pairs output by our map() method and puts them in a memory buffer.
2) The memory buffer keeps spilling to local disk files; multiple spill files may be produced.
3) Multiple spill files are merged into a larger spill file.
4) During spilling and merging, the Partitioner is called to partition the data, and the data is sorted by key.
5) ReduceTask fetches the data of its own partition from each MapTask machine according to its partition number.
6) ReduceTask collects the result files for the same partition from the different MapTasks and merges them (merge sort).
7) After merging into one large file, the shuffle process ends, and the logic of ReduceTask begins (taking one group of key/value pairs from the file and calling the user-defined reduce() method).
Note that the buffer size in Shuffle affects the execution efficiency of MapReduce programs: in principle, the larger the buffer, the fewer disk I/O operations and the faster the execution. The buffer size can be adjusted with the parameter io.sort.mb, which defaults to 100MB.
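For reference, this buffer is set in Hadoop's job configuration; a hedged sketch (io.sort.mb is the legacy property name, and on Hadoop 2+ the same setting is exposed as mapreduce.task.io.sort.mb — check your distribution's mapred-default.xml for the exact name):

```xml
<!-- mapred-site.xml: raising the map-side sort buffer reduces spill count. -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>200</value> <!-- default is 100 (MB) -->
</property>
```

As with any memory tuning, the buffer competes with the rest of the map task's heap, so it cannot be raised without limit.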
ReduceTask stage
(1) Copy stage: ReduceTask remotely copies a piece of data from each MapTask; if a piece of data exceeds a certain size threshold, it is written to disk, otherwise it is kept directly in memory.
(2) Merge stage: while copying data remotely, ReduceTask starts two background threads to merge the in-memory and on-disk files, preventing excessive memory use or too many files on disk.
(3) Sort stage: by MapReduce semantics, the input to the user-written reduce() function is a group of data aggregated by key. To bring together data with the same key, Hadoop uses a sort-based strategy. Since each MapTask has already locally sorted its own results, the ReduceTask only needs to perform one merge sort over all the data.
(4) Reduce stage: the user-written reduce() function processes each group and writes the result to HDFS.