This article introduces the working mechanism of MapReduce. The content is detailed and easy to understand, the steps are simple and quick, and it has some reference value. I hope you will gain something from reading it. Let's take a look.
MapReduce is essentially a programming model, together with an associated implementation, for processing large data sets. The model exists to hide the details of parallel computation, fault tolerance, data distribution and load balancing, and thus to provide an abstraction for big-data computing.
1. Environment description
The deployment node runs CentOS with the firewall and SELinux disabled. A shiyanlou user is created, and an /app directory is created under the system root to store the Hadoop and other component packages. Because this directory is used to install components such as Hadoop, the shiyanlou user must be given rwx permission on it. The common practice is for the root user to create the /app directory under the root directory and change its owner to shiyanlou (chown -R shiyanlou:shiyanlou /app).
**Hadoop build environment**:
- Virtual machine operating system: CentOS 6.6 64-bit, single core, 1 GB memory
- JDK: 1.7.0_55 64-bit
- Hadoop: 1.1.2
2. Brief introduction to the MapReduce principle
2.1 MapReduce
MapReduce is a very popular distributed computing framework designed for parallel computation over massive data sets. Google was the first to propose this technical framework, drawing inspiration from functional programming languages such as LISP, Scheme and ML. The core of the MapReduce framework is divided into two parts: Map and Reduce. When you submit a computing job to the MapReduce framework, it first splits the job into a number of Map tasks and assigns them to different nodes for execution. Each Map task processes part of the input data. When a Map task completes, it produces intermediate files, which serve as the input data for the Reduce tasks. The main goal of a Reduce task is to aggregate the outputs of the preceding Map tasks and write the final result. At a high level of abstraction, the data flows from the input splits, through the Map tasks and their intermediate files, into the Reduce tasks and the final output.
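To make the Map and Reduce steps concrete, here is a minimal sketch of the classic word-count example using the Hadoop MapReduce Java API (the org.apache.hadoop.mapreduce classes exist in Hadoop 1.x; the example itself is not from this article and is only illustrative): each map call emits (word, 1) pairs for one line of its split, and each reduce call sums the counts for one word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: each call handles one line of its input split and emits
// intermediate (word, 1) pairs for the Reduce phase.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate output, later partitioned and sorted
        }
    }
}

// Reduce task: receives all values for one key (already sorted and grouped by the
// framework) and aggregates them into the final output.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```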
2.2 MapReduce process analysis
2.2.1 Map process
1. Each input split is handled by one map task. By default a split is the size of one HDFS block (64 MB by default), although the split size can also be configured. The output of the map is placed temporarily in a circular memory buffer (100 MB by default, controlled by the io.sort.mb property). When the buffer is about to overflow (by default at 80% of the buffer size, controlled by the io.sort.spill.percent property), a spill file is created in the local file system and the data in the buffer is written to it.
2. Before writing to disk, the thread first divides the data into as many partitions as there are reduce tasks, so that one reduce task corresponds to one partition of the data. This avoids the awkward situation in which some reduce tasks are assigned a large amount of data while others receive little or none. Partitioning is, in effect, hashing the data (a sketch of a hash partitioner and of the buffer and compression settings follows this list). The data in each partition is then sorted, and if a Combiner has been set, the sorted results are combined so that as little data as possible is written to disk.
3. When the map task writes its last record, there may be many spill files that need to be merged. Sorting and combining are carried out repeatedly during the merge, for two purposes:
- to minimize the amount of data written to disk each time;
- to minimize the amount of data transferred over the network in the following copy phase. The spill files are finally merged into a single partitioned and sorted file. To further reduce the amount of data sent over the network, the map output can be compressed here by setting mapred.compress.map.output to true.
4. The data in each partition is copied to the corresponding reduce task. You might ask: how does the data in a partition know which reduce task it belongs to? In fact, each map task stays in contact with its parent TaskTracker, and each TaskTracker sends heartbeats to JobTracker, so the global information of the whole cluster is held by JobTracker. The reduce task simply asks JobTracker for the locations of the corresponding map outputs.
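The sketch below brings steps 1 to 3 together: a hash partitioner that mirrors the hash-based partitioning described in step 2, and a configuration that sets the Hadoop 1.x property names quoted above (io.sort.mb, io.sort.spill.percent, mapred.compress.map.output). The class and the values chosen are illustrative assumptions, not tuning advice.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MapSideTuning {

    // Hash partitioner: each (key, value) pair is routed to exactly one of the
    // numPartitions reduce tasks, so one reduce task corresponds to one partition.
    public static class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Mask the sign bit so the result is always a valid partition index.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Map-side buffer and compression settings mentioned in steps 1 and 3.
    public static Configuration mapSideConf() {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", 100);                      // ring buffer size in MB
        conf.set("io.sort.spill.percent", "0.80");           // spill threshold (80% of buffer)
        conf.setBoolean("mapred.compress.map.output", true); // compress intermediate map output
        return conf;
    }
}
```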
2.2.2 Reduce process
1. The reduce task receives data from different map tasks, and the data from each map task is sorted. If the amount of data received on the reduce side is fairly small, it is kept directly in memory (the buffer size is controlled by the mapred.job.shuffle.input.buffer.percent property, which gives the percentage of the heap used for this purpose). If the amount of data exceeds a certain proportion of that buffer (determined by mapred.job.shuffle.merge.percent), the data is merged and spilled to disk (see the configuration sketch after this list).
2. As spill files accumulate, a background thread merges them into larger, sorted files to save time in later merges. In fact, on both the map side and the reduce side, MapReduce performs sorting and merging repeatedly.
3. Many intermediate files are written to disk during the merging, but MapReduce tries to write as little data to disk as possible, and the result of the last merge is not written to disk; it is fed directly into the reduce function.
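Below is a small sketch of where the reduce-side shuffle properties quoted above would be set. The values shown are commonly cited defaults for these Hadoop 1.x properties; treat them as assumptions for illustration rather than recommendations.

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceSideTuning {
    public static Configuration reduceSideConf() {
        Configuration conf = new Configuration();
        // Fraction of the reduce task's heap used to buffer map outputs during the copy phase.
        conf.set("mapred.job.shuffle.input.buffer.percent", "0.70");
        // When the in-memory buffer reaches this fraction, buffered map outputs
        // are merged and spilled to disk.
        conf.set("mapred.job.shuffle.merge.percent", "0.66");
        return conf;
    }
}
```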
2.3 Analysis of the working mechanism of MapReduce
1. A MapReduce program is submitted on any node in the cluster (a minimal driver sketch showing this client side follows the list).
2. After JobClient receives the job, it requests a Job ID from JobTracker.
3. The resource files needed to run the job are copied to HDFS, including the JAR file packaged from the MapReduce program, the configuration file, and the input split information computed by the client. These files are stored in a folder that JobTracker creates specifically for the job, named after the job's Job ID.
4. After the Job ID has been obtained, the job is submitted.
5. When JobTracker receives the job, it places it in a job queue to wait for the job scheduler. When the scheduler picks the job up according to its scheduling algorithm, it creates a map task for each input split and assigns the map tasks to TaskTrackers for execution.
6. Each TaskTracker has a fixed number of map slots and reduce slots, depending on the number of cores and the amount of memory on the host. Note that map tasks are not assigned to TaskTrackers at random; there is a concept called data locality (Data-Local): a map task is assigned to a TaskTracker that holds the data block the map will process, and the program JAR is copied to that TaskTracker to run. This is known as "moving the computation, not the data".
7. TaskTracker sends a heartbeat to JobTracker at regular intervals, telling JobTracker that it is still running; the heartbeat also carries information such as the progress of the current map task. When JobTracker receives the completion message for the last task of the job, it marks the job as successful. When JobClient polls the status, it learns that the job has completed and displays a message to the user.
8. The TaskTracker that runs a task fetches the resources it needs from HDFS, including the JAR file packaged from the MapReduce program, the configuration file, and the input splits computed by the client.
9. After acquiring the resources, TaskTracker starts a new JVM.
10. The JVM runs each task.
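As a concrete view of step 1, here is a minimal driver sketch that builds and submits a job with the Hadoop MapReduce Java API; submitting it is what triggers the Job ID request, the resource upload to HDFS and the scheduling described above. WordCountMapper and WordCountReducer are the earlier word-count sketch, and the job name, class names and input/output paths are placeholder assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");          // client side of job submission
        job.setJarByClass(WordCountDriver.class);       // the JAR that is copied to HDFS
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional map-side combine
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input splits come from here
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not exist yet
        // Submits the job and polls progress until completion; exit code reflects success.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```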
This concludes the article on the working mechanism of MapReduce. Thank you for reading! I hope you now have some understanding of how MapReduce works. If you want to learn more, you are welcome to follow the industry information channel.