MapReduce workflow diagram:
1. Source code analysis of the MapReduce phases
1) Source code analysis of client job submission
Explanation:
- Determine whether to print the log
- Determine whether to use the new or the old API, then check the connection
- When checking the connection, verify the input and output paths, compute the input splits, and copy the jar and configuration files to HDFS
- When computing splits, determine the minimum split size (default 1, customizable) and the maximum split size (default Long.MAX_VALUE, customizable)
- Check whether the given path is a file; if it is a directory, compute the splits for all files in the directory
- Compute the split size from the block size and the minimum and maximum split sizes (splitSize = max(minSize, min(maxSize, blockSize)))
- From the split size, compute the number of map tasks and the nodes they are distributed to
- Submit the job to YARN for the MapReduce computation (a minimal driver sketch follows this list)
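As a rough illustration of the submission steps above, here is a minimal driver sketch, assuming a word-count style job; the class name WordCountDriver and the argument paths are placeholders, and the split-size formula in the comment mirrors the computation described in this list.

```java
// Hypothetical driver sketch: shows where split sizes are configured and where
// the job is handed to YARN. Class name and paths are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Optional: override the default min (1) and max (Long.MAX_VALUE) split sizes.
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, Long.MAX_VALUE);
        // Each file's split size is then roughly:
        //   splitSize = Math.max(minSize, Math.min(maxSize, blockSize))
        // so with the defaults there is one split per 128 MB block.

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion() submits the job to YARN; this is the step that
        // triggers split computation and the copy of the jar + config to HDFS.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```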
2) Source code analysis of the map phase (Map input side)
Explanation:
- First, the MapTask runs and calls its run() method, which goes through the following stages:
- Initialize the task context object
- Initialize the mapper object. A default is provided here: if no custom Mapper class is set, the framework's Mapper is used
- Initialize the input format. A default is provided here: if no custom InputFormat class is set, TextInputFormat is used
- Create the input object and the concrete file-reading class; by default one line is read per iteration through a line reader. Iteration is driven by nextKeyValue(), which initializes the key and value each time it is called
- Input initialization: compute the start offset and open the file; if the split does not start at the beginning of the file, the first (partial) line is discarded
- Call the mapper's run() method, which reads in a loop until the end of the split, reading one line past the split boundary; the next split therefore discards its first line, since it has already been read by the previous split. Note that run() calls the setup(), map(), and cleanup() methods of the Mapper class we wrote (a sketch follows this list)
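To make the run()/setup()/map()/cleanup() flow concrete, here is a minimal word-count style Mapper sketch; the class name WordCountMapper and the tokenizing logic are illustrative, not taken from the analysis above.

```java
// Illustrative Mapper: Mapper.run() calls setup() once, then map() once per
// (key, value) record delivered by nextKeyValue(), then cleanup() once.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void setup(Context context) {
        // Called once per MapTask before any map() call, e.g. to read configuration.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = one line of text (TextInputFormat).
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE); // handed to the output collector / ring buffer
        }
    }

    @Override
    protected void cleanup(Context context) {
        // Called once per MapTask after the last map() call.
    }
}
```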
3) Source code analysis of the map phase (Map output side)
Explanation:
- The output object is created via NewOutputCollector
- Inside NewOutputCollector, the collector and the partitioner are prepared and the number of reduce tasks is determined; the map-side output is written to the collector
- Preparing the collector actually means preparing the MapOutputBuffer, which is a particularly complex process. Roughly: the collected key, value, and partition number (K, V, P) are written to a circular (ring) buffer, and the data is then written to files after sorting and partitioning. (The detailed process is explained in the shuffle section below.)
- Finally, when the map output ends, the close() method is called to close the output. On close, the remaining data in the ring buffer is spilled to disk, and all the small spill files are sort-merged into one large file (a partitioner sketch follows this list)
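To make the partitioning step concrete, here is a minimal custom Partitioner sketch; the class name WordPartitioner is illustrative, and the hash-based rule simply mirrors the behaviour of Hadoop's default HashPartitioner.

```java
// Illustrative Partitioner: decides which reduce task (partition) each
// (key, value) pair written by the map side is routed to.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // The partition number P is stored alongside K and V in the ring buffer
        // and later drives both spill-file sorting and reduce-side fetching.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

In the driver this would be enabled with job.setPartitionerClass(WordPartitioner.class) together with job.setNumReduceTasks(n); a Combiner can likewise be registered with job.setCombinerClass(...) to pre-aggregate data during the spill, as the shuffle section below mentions.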
2. Detailed explanation of the shuffle process
Process introduction:
If you store a 300 MB file in HDFS, the default block size is 128 MB and the default split size is 128 MB, so each MapTask processes one split and three MapTasks run in parallel. As each MapTask processes its split, the output is written through the collector into a ring buffer, and the writing is accompanied by a simple sort. The ring buffer defaults to 100 MB; when it is more than 80% full, a background thread starts writing the buffered data to a disk file while the Map continues writing new data into the buffer.
How the ring buffer works: the buffer size defaults to 100 MB (configurable via mapreduce.task.io.sort.mb in mapred-site.xml) and the spill threshold is 80% (mapreduce.map.sort.spill.percent, default 80%). Two kinds of data are stored in the ring buffer. One is metadata: the partition number, the start position of the map's key, the start position of the map's value, and the length of the value (each metadata record is four ints long, so its length is fixed). The other is the raw data: the map's key and value themselves. When storing them, an "equator" is established between the metadata and the raw data, dividing the two, and data is written continuously towards both ends. When the buffer reaches 80% full, that data is locked and spilled to a small file on disk, while the rest of the ring buffer can still accept writes; once the spill ends, the lock is released and metadata and raw data can again be written across the buffer.
Spilling to small files: when spilling, the metadata in the buffer is sorted by partition number and key, and the corresponding raw data is then written out in that sorted order (the metadata records are fixed-size, which makes them easier to sort than the raw data directly). This produces several small files already sorted by partition and key (a Combiner can be applied here).
Merging the spill files: the spilled small files are then merged into one large file using merge sort, and the merged large file remains sorted by partition and key.
Reduce pulls the corresponding data: a thread in the Reducer periodically asks MRAppMaster for the location of the map output files. When a mapper finishes, it reports to MRAppMaster, so the Reducer knows the mappers' status and obtains the directories of the map result files. The Reducer pulls the small files belonging to its partition to the local node and merge-sorts them into one ordered large file (records with the same key end up together). Then, according to the grouping rules, the reduce method is called once per group of records with the same key, and the processed results are finally written to the corresponding partition output files. (A minimal sketch of the shuffle configuration and a Reducer follows.)
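As a rough illustration of the shuffle knobs and the reduce-side grouping described above, here is a small sketch; the property names are the standard Hadoop ones mentioned in this section, while the value 200 MB and the class names are placeholder examples, not recommendations from the article.

```java
// Illustrative shuffle tuning plus a minimal Reducer.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ShuffleExample {

    // Override the ring-buffer size and spill threshold for one job
    // (defaults: 100 MB and 0.80, as described above). Values here are examples.
    static Configuration tunedConf() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);             // ring buffer size in MB
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);  // spill threshold
        return conf;
    }

    // reduce() is invoked once per key group: all values pulled and merge-sorted
    // for the same key arrive together in the Iterable.
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result); // written to this reducer's partition output file
        }
    }
}
```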