2025-04-05 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
[TOC]
I. Basic process
1. Process
Map side:
1) Suppose there are two map tasks running in parallel.
2) As each map task processes its input, its output is written to a ring buffer through the collector.
3) How the ring buffer works:
1> The default size of the ring buffer is 100 MB. It can be changed with mapreduce.task.io.sort.mb in mapred-site.xml.
2> The spill threshold of the ring buffer is 80%. Once usage exceeds it, spilling (overflow writing) to disk begins. The threshold can be changed with mapreduce.map.sort.spill.percent in mapred-site.xml.
3> The ring buffer stores two kinds of data. One is metadata: the partition number of the KV, the starting position of the key, the starting position of the value, and the length of the value. Each metadata record has a fixed length of 4 ints. The other is raw data: the original bytes of the key and value.
4> A dividing line at their common starting point separates the metadata area from the raw data area, and the two areas then grow in opposite directions from it.
5> When usage exceeds 80%, that 80% of the data is locked and spilled to disk as a small file by a new background thread. While the spill runs, the locked space cannot be written to, but the remaining 20% can continue to accept data. The locked space is released when the spill finishes.
4) Spill: when buffer usage exceeds 80%, a background thread starts writing the buffer contents to disk as a small spill file. During this process, the metadata in the buffer is first sorted by partition number (each partition forms its own section of the spill file) and then by key within each partition (the sorting algorithm used here is quicksort). The raw data is then written out in the order given by the sorted metadata. The result is a spill file that is partitioned and sorted by key within each partition.
Optionally, a combine step can be added as the last step of the spill.
Note that the metadata is sorted first and the raw data is then written out according to the sorted metadata. Why do it this way? Sorting moves data around, and the raw data is generally larger than the metadata, so moving it is relatively expensive (in memory space, CPU, and so on). Instead, the metadata is sorted directly by the keys it points to in the raw data, producing an ordered metadata area. Reading the KVs from the raw data area at the positions given by each metadata record in turn then yields the raw data in sorted order.
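The idea of sorting the metadata and leaving the raw bytes in place can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Hadoop's implementation: `collect`, `sorted_spill`, and `toy_partition` are hypothetical names, the raw area is a plain `bytearray` rather than a ring, and each metadata record is a tuple of the four fields described above.

```python
def toy_partition(key, num_partitions):
    # Hypothetical stand-in for a real partitioner: key length modulo partitions.
    return len(key) % num_partitions


def collect(records, num_partitions, partition_of=toy_partition):
    """Append raw key/value bytes to one buffer and record metadata for each KV."""
    raw = bytearray()
    meta = []
    for key, value in records:
        k, v = key.encode(), value.encode()
        key_start = len(raw)
        raw += k
        value_start = len(raw)
        raw += v
        # Metadata: partition, key start, value start, value length.
        meta.append((partition_of(key, num_partitions), key_start, value_start, len(v)))
    return raw, meta


def sorted_spill(raw, meta):
    """Sort only the metadata, then read the raw data in that order."""
    def sort_key(record):
        partition, key_start, value_start, _ = record
        # The key bytes are looked up in the raw area; they are never moved.
        return partition, bytes(raw[key_start:value_start])

    out = []
    for partition, key_start, value_start, value_len in sorted(meta, key=sort_key):
        key = raw[key_start:value_start].decode()
        value = raw[value_start:value_start + value_len].decode()
        out.append((partition, key, value))
    return out
```

Calling `sorted_spill(*collect(records, 2))` returns the records ordered by partition and then by key, while the `raw` buffer itself is left exactly as it was written.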
5) Merge sort: by the time the map task finishes, several spills have usually occurred, producing several spill files that are each partitioned and sorted within each partition. The next step is to merge-sort the data of the same partition across these spill files into one large, sorted output file. A combine step can also be added here (optional). In practice the merge is carried out over several passes rather than all at once.
6) Finally, the merged file is written to disk, compressed if map output compression is enabled. At this point, the shuffle on the map side is complete.
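The merge of several sorted spill files can be illustrated with a k-way merge. The sketch below is a simplified model, assuming each spill is an in-memory list of (key, value) pairs for one partition; `merge_spills` and `merge_in_passes` are hypothetical helper names, and the `factor` argument only imitates the idea of merging a limited number of files per pass.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_spills(spills, combiner=None):
    """Merge sorted runs of one partition into a single sorted run.

    spills: list of lists of (key, value), each already sorted by key.
    combiner: optional function (key, values) -> value, applied once per key.
    """
    merged = heapq.merge(*spills, key=itemgetter(0))
    if combiner is None:
        return list(merged)
    return [
        (key, combiner(key, [value for _, value in group]))
        for key, group in groupby(merged, key=itemgetter(0))
    ]


def merge_in_passes(spills, factor=2):
    """Merge at most `factor` runs at a time until one run remains (factor >= 2)."""
    runs = [list(s) for s in spills]
    while len(runs) > 1:
        runs = [merge_spills(runs[i:i + factor]) for i in range(0, len(runs), factor)]
    return runs[0] if runs else []
```

For example, `merge_spills(spills, combiner=lambda key, values: sum(values))` merges the runs and sums the values of equal keys in the same step, which is the optional combine mentioned in step 5.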
Reduce side: one partition corresponds to one reduce task
7) Threads in the reducer periodically ask the MRAppMaster for the locations of the map output files. Each mapper reports this information to the MRAppMaster when it finishes, so the reducer can learn the status of the maps and the directories of their result files. The reducer then pulls the result data for its own partition from each of the map tasks. While pulling, the data is held temporarily in a buffer. The default is 100 MB, and it is also a ring buffer. When the amount of data exceeds the buffer size, the data is written to disk.
8) Merge sort: after pulling, the result files are merge-sorted into one large ordered file. Where the input and output of this merge live matters: both in memory; input in memory and output on disk; or both on disk. These cases perform quite differently, which makes this a point for MapReduce optimization.
9) Next comes the grouping process, which merges the key-value pairs that share a key into the form (key, array). For example, (king, 1) and (king, 2) are merged into (king, [1, 2]). The grouping method can be customized here.
10) In the group operation, each group calls the reduce method only once. By default, the key of the first KV in the group is the key passed to reduce; the keys of the remaining KVs in the group are not passed in separately, although their values are still reachable through the values iterator. A custom grouping class can be supplied here.
11) After the merge and group steps are complete, the reduce method is called once for each grouped KV, producing the reduce output.
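The grouping step and the one-call-per-group behaviour can be modelled as follows. This is an illustrative sketch only: `run_reduce` and its `group_key` parameter are hypothetical names standing in for the reduce driver and a custom grouping class, and the input is assumed to be already merge-sorted.

```python
from itertools import groupby


def run_reduce(sorted_pairs, reduce_fn, group_key=lambda key: key):
    """Call reduce_fn once per run of consecutive pairs that share group_key(key)."""
    output = []
    for _, group in groupby(sorted_pairs, key=lambda kv: group_key(kv[0])):
        group = list(group)
        first_key = group[0][0]              # key handed to reduce for this group
        values = [value for _, value in group]
        output.append(reduce_fn(first_key, values))
    return output
```

With the default grouping, `run_reduce([("king", 1), ("king", 2)], lambda k, vs: (k, vs))` yields `[("king", [1, 2])]`. Passing a `group_key` that looks at only part of a composite key groups more coarsely, so that reduce sees the first full key of each coarse group.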
2. Key processes in the shuffle process.
Partition: assigning each KV to a partition.
Spill: writing buffer contents out to disk.
Merge: merging multiple sorted files into one.
Sort: sorting. There are three sorts: the quicksort during the spill, the merge sort of multiple spill files, and the merge sort of the result files from multiple maps on the reduce side.
Combine: a preliminary merge on the map side. Its business logic is the same as reduce, but it only works on local data, and it is optional. It can serve as an optimization point because it reduces the amount of data the reducers pull from the maps.
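How a combine step shrinks the data that reducers have to pull can be seen with a small word-count sketch. The function names `map_words` and `combine` are hypothetical, and the example assumes a sum-style reduce, where applying the same logic locally does not change the final result.

```python
from collections import Counter


def map_words(lines):
    """Emit (word, 1) for every word, as a word-count mapper would."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Sum counts per word locally, before anything is sent to a reducer."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())
```

For the input lines `"a b a"` and `"b a c"`, the mapper emits six pairs, while the combined output holds only three, one per distinct word.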
3. The structure of the overflow file after the map merge
(1) Storage structure
The merged output on the map side actually consists of two parts: an index file and the data itself.
[Figure: MapReduce-mapMerge.png]
Index file: mainly records the offset of each partition in the data file.
Data file: records the length of KV and the data of KV.
(2) Characteristics
When the map side stores the results of multiple partitions, it does not store each partition in its own file. All partitions go into the same data file, and the index file records the offset of each partition within it so that each partition's data can be located and read. The advantage is that with many partitions, storing each one separately would create many files and consume correspondingly more file-system index resources. With this layout, only two files need to be read.
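The index-plus-data layout can be sketched with an in-memory model. The record encoding below (two 4-byte lengths followed by the key and value bytes) and the function names `write_map_output` and `read_partition` are assumptions made for illustration; Hadoop's actual on-disk format differs in its details.

```python
import struct


def write_map_output(partitions):
    """Pack all partitions into one data blob plus an index of (offset, length).

    partitions: list indexed by partition number; each entry is a list of
    (key, value) byte-string pairs already sorted by key.
    """
    data = bytearray()
    index = []
    for pairs in partitions:
        start = len(data)
        for key, value in pairs:
            # Each record: key length, value length, then the raw bytes.
            data += struct.pack("<II", len(key), len(value)) + key + value
        index.append((start, len(data) - start))
    return bytes(data), index


def read_partition(data, index, partition):
    """Read one partition's records using only its index entry."""
    offset, length = index[partition]
    end = offset + length
    records = []
    while offset < end:
        key_len, value_len = struct.unpack_from("<II", data, offset)
        offset += 8
        key = data[offset:offset + key_len]
        value = data[offset + key_len:offset + key_len + value_len]
        offset += key_len + value_len
        records.append((key, value))
    return records
```

A reader that wants partition 2 looks up `index[2]` and reads only that byte range, which is why a single data file can serve every reducer.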