This article explains the working mechanism of MapReduce. The material is straightforward and practical, so let's walk through how a MapReduce job actually runs.
Anatomy of a MapReduce job run
[Figures omitted: static and animated diagrams of a MapReduce job run.]
Progress and status updates
Failures
Failures can occur at several levels:
User code in a map or reduce task throws a runtime exception
The task JVM exits abruptly
The node manager fails
The resource manager fails
For high availability (HA), it is necessary to run a pair of resource managers in an active-standby (dual hot standby) configuration
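To make the active-standby pair concrete, here is a minimal sketch of the YARN HA settings involved, expressed through the Java Configuration API. The rm1/rm2 identifiers, host names, and ZooKeeper addresses are assumptions for illustration, and property names can differ between Hadoop versions, so treat this as a sketch rather than a reference configuration.

```java
// Sketch only: enabling ResourceManager HA with an assumed two-node setup.
import org.apache.hadoop.conf.Configuration;

public class RmHaSketch {
    public static void configure(Configuration conf) {
        conf.setBoolean("yarn.resourcemanager.ha.enabled", true);
        conf.set("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");                // logical IDs (assumed)
        conf.set("yarn.resourcemanager.hostname.rm1", "master1.example.com"); // assumed host
        conf.set("yarn.resourcemanager.hostname.rm2", "master2.example.com"); // assumed host
        // ZooKeeper ensemble used for leader election between the pair (assumed addresses)
        conf.set("yarn.resourcemanager.zk-address", "zk1:2181,zk2:2181,zk3:2181");
    }
}
```

In practice these values live in yarn-site.xml on the cluster; the point is that both managers share the same logical configuration and ZooKeeper decides which one is active.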
Shuffle and sort
MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers the map outputs to the reducers as inputs is known as the shuffle.
The map side
When the map function starts producing output, it is not simply written to disk. The process is more involved: for efficiency, output is buffered in memory and pre-sorted.
Each map task has a circular memory buffer that stores the task's output. By default the buffer is 100 MB, a size that can be tuned through the mapreduce.task.io.sort.mb property. Once the buffer contents reach a spill threshold (mapreduce.map.sort.spill.percent, default 0.80, i.e. 80%), a background thread begins to spill the contents to disk. Map outputs continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map blocks until the spill completes. Spills are written in round-robin fashion to the directories specified by the mapreduce.cluster.local.dir property, in a job-specific subdirectory.
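As a small illustration of the knobs just mentioned, here is a hedged sketch of tuning the buffer size and spill threshold when setting up a job; the values are arbitrary examples, not recommendations.

```java
// Sketch: adjusting the sort buffer and spill threshold for a job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuningSketch {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);            // 200 MB buffer instead of the 100 MB default
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // spill at 90% full instead of 80%
        Job job = Job.getInstance(conf, "spill-tuning-example");  // job name is arbitrary
        // ... set mapper, reducer, input and output paths as usual
        return job;
    }
}
```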
Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that the data will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducers.
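To make the partitioning step concrete, the sketch below mirrors the logic of Hadoop's default hash partitioner, which assigns each record to one of the reduce tasks by hashing its key.

```java
// Sketch mirroring the default hash-based partitioning: records with the same
// key always land in the same partition, i.e. go to the same reducer.
import org.apache.hadoop.mapreduce.Partitioner;

public class HashPartitionSketch<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so the modulo result is always non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```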
Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there may be several spill files. Before the task finishes, the spill files are merged into a single partitioned and sorted output file. The configuration property mapreduce.task.io.sort.factor controls the maximum number of streams to merge at once; the default is 10.
If there are at least three spill files (set by the mapreduce.map.combine.minspills property), the combiner runs again before the output file is written. Recall that combiners may be run repeatedly over the input without affecting the final result. If there are only one or two spill files, the potential reduction in map output size is not worth the overhead of invoking the combiner, so it is not run again for this map output.
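The property the text relies on, that a combiner can run zero, one, or many times without changing the result, holds for associative and commutative operations such as summing. Here is a hedged word-count style example; the Text/IntWritable types are an assumption, since the article does not fix a particular job.

```java
// Sketch: a summing combiner. Partial sums can safely be re-aggregated by the
// reducer, so running this any number of times leaves the final result unchanged.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // accumulate the partial counts for this key
        }
        result.set(sum);
        context.write(key, result);
    }
}
```

A job would opt in with job.setCombinerClass(SumCombiner.class); a function that is not associative and commutative (an average, say) would not be safe to run repeatedly this way.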
It is often a good idea to compress the map output as it is written to disk, because doing so makes the disk writes faster, saves disk space, and reduces the amount of data transferred to the reducers. By default the output is not compressed, but it is easy to enable by setting mapreduce.map.output.compress to true. The compression library to use is specified by mapreduce.map.output.compress.codec.
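Enabling this looks roughly as follows; Snappy is picked purely as an illustration (it requires the native library to be available), and any installed codec could be substituted.

```java
// Sketch: turning on map-output compression with the two properties named above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class MapOutputCompressionSketch {
    public static void configure(Configuration conf) {
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
    }
}
```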
The reducers retrieve the output file's partitions over HTTP. The number of worker threads used to serve the file partitions is controlled by the mapreduce.shuffle.max.threads property; this setting is per node manager, not per map task. The default of 0 sets the maximum number of threads to twice the number of processors on the machine.
Give the shuffle process as much memory as possible (a tuning sketch follows this list)
Write map and reduce functions that use as little memory as possible; they should not consume memory without bound
On the map side, the best performance comes from avoiding multiple spills to disk
On the reduce side, the best performance is obtained when all of the intermediate data can reside in memory
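As a sketch of the reduce-side point, two properties commonly associated with keeping intermediate data in memory are shown below; the values are illustrative assumptions, and whether they help depends on the heap size and on how much memory the reduce function itself needs.

```java
// Sketch: reduce-side memory tuning. By default everything is spilled to disk
// before the reduce begins; raising the input buffer fraction lets map outputs
// stay in memory through the reduce phase.
import org.apache.hadoop.conf.Configuration;

public class ReduceSideTuningSketch {
    public static void configure(Configuration conf) {
        // Fraction of the heap devoted to shuffle buffers during the copy phase
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
        // Default 0.0 forces a full spill to disk; 1.0 keeps data in memory
        conf.setFloat("mapreduce.reduce.input.buffer.percent", 1.0f);
    }
}
```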
The reduce side
Configuration tuning
Task execution
Hadoop does not try to diagnose or fix slow-running tasks. Instead, when a task is running more slowly than expected, Hadoop detects this and launches another, equivalent task as a backup. This is what is meant by "speculative execution" of tasks.
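Speculative execution is switched on or off per job; a minimal sketch of the two boolean properties involved follows (defaults vary between Hadoop versions, so check your distribution before relying on them).

```java
// Sketch: controlling speculative execution for map and reduce tasks separately.
import org.apache.hadoop.conf.Configuration;

public class SpeculationSketch {
    public static void configure(Configuration conf) {
        conf.setBoolean("mapreduce.map.speculative", true);     // allow backup map tasks
        conf.setBoolean("mapreduce.reduce.speculative", false); // no backup reduce tasks
    }
}
```

Turning speculation off for reducers is a common choice when duplicate reduce attempts would contend for the same external resource.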
OutputCommitters
Hadoop MapReduce uses a commit protocol to ensure that jobs and tasks either succeed completely or fail cleanly. This behavior is implemented through the OutputCommitter in use for the job.
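To show the shape of the protocol, here is a minimal no-op sketch of the OutputCommitter hooks; a real committer, such as the default FileOutputCommitter for file-based output, promotes task output to the final output directory in its commit methods.

```java
// Sketch: the hooks the commit protocol drives. Doing nothing is a valid
// (if useless) committer; the framework calls these at well-defined points.
import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class NoOpCommitterSketch extends OutputCommitter {
    @Override public void setupJob(JobContext context) throws IOException { }       // once, before any task runs
    @Override public void setupTask(TaskAttemptContext context) throws IOException { }
    @Override public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
        return false; // nothing to promote, so the commit phase is skipped
    }
    @Override public void commitTask(TaskAttemptContext context) throws IOException { }
    @Override public void abortTask(TaskAttemptContext context) throws IOException { } // clean up after failure
}
```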
At this point you should have a deeper understanding of how MapReduce works. The best way to consolidate it is to try these settings out in practice.