
How to Implement Job Submission in Hadoop


Shulou (Shulou.com) 05/31 report --

This article introduces the basics of how job submission works in Hadoop. Many people run into exactly these questions in real work, so let's walk through how to handle them. I hope you read it carefully and come away with something useful!

1. The basic concept of MapReduce

Baidu Encyclopedia: MapReduce is a programming model for parallel computing over large-scale datasets (larger than 1 TB). The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming languages, along with features borrowed from vector programming languages. It makes it much easier for programmers to run their programs on distributed systems without having to do distributed parallel programming themselves. The current software implementation works by specifying a Map function, which maps a set of key-value pairs into a new set of intermediate key-value pairs, and a concurrent Reduce function, which makes sure that all values sharing the same key among the mapped key-value pairs are merged together. As for what exactly a functional programming language or a vector programming language is, I am not entirely sure; see this explanation:

http://www.cnblogs.com/kym/archive/2011/03/07/1976519.html

MapReduce is a distributed computing model proposed by Google, used mainly in the search field to solve the problem of computing over massive data. When you submit a computing job to the MapReduce framework, it first splits the job into several Map tasks and assigns them to different nodes to execute; each Map task processes part of the input data. When a Map task completes, it produces intermediate files, which become the input data for the Reduce tasks. The main goal of a Reduce task is to aggregate the outputs of the earlier Maps and write the result. In other words, HDFS already gives us a high-performance, highly concurrent storage service, but parallel programming is not something every programmer can do; if our application itself cannot run concurrently, then Hadoop's HDFS loses much of its value. The great thing about MapReduce is that programmers who are not familiar with parallel programming, like me, can still take full advantage of a distributed system.

One more note: the Hadoop framework itself is based on Google's three famous papers, GFS, BigTable, and MapReduce (the programming model), and is implemented in Java; Google implemented its version in C++. The MapReduce programming model (which is highly abstract) is largely captured by the figure below. The difference between the Spark parallel computing framework and Hadoop's MapReduce is that Spark keeps the intermediate results, that is, the output of the map function, directly in memory rather than writing them to local disk or HDFS. None of that is the point here, though; what matters is the process shown in the following figure:

The picture above is the flow chart given in the paper. Everything starts with the user program at the top, which links against the MapReduce library and implements the most basic map and reduce functions. The order of execution in the figure is marked with numbers:

1. The MapReduce library first divides the input files of the user program into M pieces (M is user-defined), each usually 16 MB to 64 MB, shown as split0~4 on the left of the figure; it then uses fork to copy the user process onto the other machines in the cluster.
2. One copy of the user program becomes the master, and the rest are workers. The master is responsible for scheduling and assigns jobs (Map jobs or Reduce jobs) to idle workers; the number of workers can also be specified by the user.
3. A worker assigned a Map job reads the input data of its corresponding split. The number of Map jobs is determined by M and corresponds one-to-one with the splits. The Map job extracts key-value pairs from the input data and passes each pair to the map function as arguments; the intermediate key-value pairs produced by the map function are buffered in memory.
4. The buffered intermediate key-value pairs are periodically written to the local disk, partitioned into R regions. R is user-defined, and each region will later correspond to one Reduce job. The locations of these intermediate key-value pairs are reported to the master, which forwards the information to the Reduce workers.
5. The master tells a worker assigned a Reduce job where the partitions it is responsible for are located (there is certainly more than one location: the intermediate key-value pairs produced by every Map job may map to any of the R partitions). When the Reduce worker has read all the intermediate key-value pairs it is responsible for, it first sorts them so that pairs with the same key are grouped together. Because different keys may map to the same partition, that is, to the same Reduce job (there are comparatively few partitions), this sorting is necessary.
6. The Reduce worker traverses the sorted intermediate key-value pairs, and for each unique key it passes the key and the associated values to the reduce function; the output produced by the reduce function is appended to that partition's output file.
7. When all Map and Reduce jobs are finished, the master wakes up the original user program, and the MapReduce call returns to the user program's code.

After all the code has run, the MapReduce output sits in the output files of the R partitions (one per Reduce job). Users usually do not need to merge these R files; instead they hand them as input to another MapReduce program for further processing. Throughout the whole process, the input data comes from the underlying distributed file system (GFS), the intermediate data lives on the local file system, and the final output is written back to the underlying distributed file system (GFS). Also note the difference between a Map/Reduce job and the map/reduce function: a Map job processes one split of input data and may call the map function many times, once per input key-value pair; a Reduce job processes the intermediate key-value pairs of one partition, calling the reduce function once for each distinct key, and each Reduce job ultimately corresponds to one output file.
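To make the two user-supplied functions in the flow above concrete, here is a minimal sketch for the classic word-count example. This is plain Java, not the Hadoop API; the class and method names are only for illustration:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // map: takes one input record (a line of text) and emits intermediate <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // reduce: takes one intermediate key and all of its values, and emits <word, total count>.
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return Map.entry(word, sum);
    }
}

The framework's job is everything in between: splitting the input, running map on every record, grouping the intermediate pairs by key, and running reduce once per distinct key.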

As for the next picture, the Hadoop implementation of the MapReduce model (in color) is shown below (of course, I did not draw it myself; I am just passing it along):

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
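Instantiated for word count (the input line and counts are just illustrative values), that generic flow looks like this:

(input) <0, "hello world hello"> -> map -> <hello, 1>, <world, 1>, <hello, 1> -> combine -> <hello, 2>, <world, 1> -> reduce -> <hello, 2>, <world, 1> (output)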

Reference link: how to explain MapReduce to your wife.

2. MapReduce in Hadoop 1.x

MapReduce in Hadoop went through two distinct periods. The initial MapReduce implementation in Hadoop did a great many things, and the core of the framework, the JobTracker, had to play every role at once. Look at the following picture:

The flow and design ideas of the original MapReduce framework:

1. First of all, the user program (JobClient) submits a job, and the job information is sent to the JobTracker. The JobTracker is the center of the MapReduce framework: it communicates regularly with the machines in the cluster (heartbeats), decides which programs should run on which machines, and manages all job failures, restarts, and similar operations.
2. A TaskTracker runs on every machine in the MapReduce cluster; its main duty is to monitor the resources of its own machine.
3. The TaskTracker also monitors the health of the tasks on the current machine. It sends this information to the JobTracker via heartbeats, and the JobTracker collects it to decide which machines a newly submitted job should run on.
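For reference, a minimal sketch of a Hadoop 1.x mapred-site.xml that expresses this model; the host name and slot counts are placeholder values, not a complete or recommended configuration:

<configuration>
  <!-- Address of the single JobTracker that every JobClient and TaskTracker talks to -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
  <!-- Fixed "slots" per TaskTracker, split by task type (see the criticism below) -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>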

Hadoop 2 has since improved on this design, because it has several problems. The main ones are as follows:

1. The JobTracker is the centralized processing point of MapReduce, and it is a single point of failure.
2. The JobTracker takes on too many responsibilities, which leads to excessive resource consumption. When there are too many MapReduce jobs, it causes a large memory overhead and, potentially, increases the risk of JobTracker failure. This is also why the industry generally concluded that the MapReduce of old Hadoop could only support clusters of about 4,000 hosts.
3. On the TaskTracker side, it is too simplistic to use the number of map/reduce tasks as the representation of resources, without taking CPU or memory consumption into account. If two tasks with large memory consumption are scheduled onto the same node, OOM errors easily occur.
4. On the TaskTracker side, resources are forcibly divided into map task slots and reduce task slots. If at some moment the system has only map tasks or only reduce tasks, resources are wasted; this is the cluster resource utilization problem mentioned earlier. When analyzing the source code, you will also find it very hard to read, often because a single class does too many things and runs to more than 3,000 lines, which makes each class's responsibility unclear and increases the difficulty of bug fixing and version maintenance.
5. From an operational point of view, the old Hadoop MapReduce framework forces a system-level upgrade whenever there is any change, important or not (such as bug fixes, performance improvements, or new features). Worse, it forces every client of the distributed cluster to be updated at the same time, regardless of user preferences. These updates cause users to waste a lot of time verifying that their previous applications still work with the new version of Hadoop.

3. YARN + MapReduce: the new scheme in Hadoop 2.x

First of all, don't be misled by YARN: it is only responsible for resource scheduling and management, while MapReduce is still the component in charge of computation, so YARN != MapReduce2. As one expert put it:

YARN is not the next-generation MapReduce (MRv2). The next-generation MapReduce is exactly the same as the first generation (MRv1) in its programming interface and its data processing engine (MapTask and ReduceTask); you can think of MRv2 as reusing these modules of MRv1. What differs is the resource management and job management system. In MRv1, resource management and job management are both implemented by the JobTracker, which combines the two functions; in MRv2, the two parts are separated. Job management is implemented by an ApplicationMaster, while resource management is handled by the new system, YARN. Because YARN is general-purpose, it can also serve as the resource management system for other computing frameworks, not only MapReduce but also frameworks such as Spark.

Looking at the picture above, we can see that in Hadoop 1 MapReduce did everything, while in Hadoop 2 MapReduce specializes in data processing and YARN exists as the resource manager.

This is what the official website calls Apache Hadoop NextGen MapReduce (YARN). Its architecture is as follows:

In Hadoop 2, the two main functions of the JobTracker are separated into independent components: resource management and task scheduling/monitoring. The new ResourceManager globally manages the allocation of computing resources for all applications, while each application's ApplicationMaster is responsible for the corresponding scheduling and coordination. An application is either a single traditional MapReduce job or a DAG (directed acyclic graph) of jobs. The ResourceManager and the per-machine NodeManager manage the user's processes on each machine and organize the computation.

1. In fact, each application's ApplicationMaster is a framework library that negotiates resources from the ResourceManager and works with the NodeManagers to run and monitor the tasks.
2. In the figure above, the ResourceManager supports hierarchical application queues, each of which is guaranteed a certain share of the cluster's resources. In a sense it is a pure scheduler: it does not monitor or track the status of applications during execution, nor can it restart tasks that fail because of application failures or hardware errors. The ResourceManager schedules resources based on the applications' requirements; each application may need different types of resources and therefore different containers. Resources include memory, CPU, disk, network, and so on. This is clearly different from the fixed-type slot model of the old MapReduce, which had a negative impact on cluster utilization. The ResourceManager provides a plug-in interface for scheduling policies, which allocates cluster resources among queues and applications; the scheduling plug-in can be based on the existing Capacity Scheduler or Fair Scheduler models.
3. In the figure above, the NodeManager is the per-machine agent of the framework. It hosts the containers that execute applications, monitors their resource usage (CPU, memory, disk, network), and reports it to the scheduler.
4. In the figure above, each application's ApplicationMaster is responsible for requesting appropriate resource containers from the scheduler, running the tasks, tracking the application's status, monitoring progress, and handling the causes of task failures.
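For reference, a minimal sketch of the Hadoop 2.x configuration that switches MapReduce onto YARN; the memory value is a placeholder number, and this is illustrative rather than a complete configuration:

mapred-site.xml:
<configuration>
  <!-- Run MapReduce jobs on YARN instead of the old JobTracker -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml:
<configuration>
  <!-- Auxiliary service that serves map outputs to reducers during the shuffle -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Memory this NodeManager offers to containers (placeholder value) -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
</configuration>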

Once again, here is the complete flow of a client submitting a task in a Hadoop 2 cluster:

1. The client's MapReduce program is submitted to the Hadoop 2.0 cluster through the hadoop shell.
2. Through RPC communication, information about the program, packaged into a jar, is transmitted to the RM (ResourceManager) in the Hadoop cluster; this can be called the process of obtaining a job ID.
3. The RM assigns a unique ID to the submitted task and sends the HDFS storage path for run.jar back to the client.
4. After the client gets that storage path, it splices together the final storage directory and stores run.jar in that HDFS directory in multiple copies; by default the number of replicas is 10, and it is configurable.
5. The client then submits some configuration information to the RM, such as the final storage path and the job ID.
6. The RM puts this configuration information into a queue, the so-called scheduler; there is no need to dig into the scheduling algorithm here.
7. The NM (NodeManager) keeps communicating with the RM through the heartbeat mechanism, and the NM regularly picks up tasks from the RM.
8. The RM starts the task-monitoring process, the ApplicationMaster, in one or more of the NMs; it is used to monitor the execution of the YarnChild processes in the other NMs.
9. After getting a task, the NM downloads run.jar from HDFS and then starts a YarnChild process on the local machine to execute the map or reduce function. The intermediate result data produced by the map function is placed on the local file system.
10. After the program finishes, the result data is written to HDFS.

That is what the whole process looks like.
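As a usage sketch for step 1, submission through the hadoop shell typically looks like the line below; the jar name, main class, and HDFS paths are placeholders only:

hadoop jar wordcount.jar WordCount /input /output

The replication factor mentioned in step 4 corresponds to the client-side property mapreduce.client.submit.file.replication, whose default value is 10 and which can be changed in mapred-site.xml.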

4. The significance of YARN (reference)

With the advent of YARN, you are no longer constrained by the simpler MapReduce development pattern and can create more complex distributed applications. In fact, you can think of the MapReduce model as just one of the applications the YARN architecture can run, with the underlying framework exposing more of its functionality for custom development. This capability is very powerful, because the usage model of YARN is almost unlimited and no longer needs to be isolated from other, more complex distributed application frameworks that may exist on the cluster, as MRv1 was. It can even be said that as YARN becomes more robust, it will be able to replace some of those other distributed processing frameworks, completely eliminating the resource overhead dedicated to them while simplifying the entire system.

To demonstrate the efficiency of YARN over MRv1, consider brute-forcing older LAN Manager hashes in parallel, the method old versions of Windows used for password hashing. In this scenario, the MapReduce approach does not make much sense, because the mapping/reducing phases involve too much overhead. Instead, it is more sensible to abstract the job distribution so that each container owns a portion of the password search space, enumerates over it, and reports whether the correct password has been found. The point here is that the passwords are determined dynamically through a function (which is somewhat tricky), rather than having to map all possibilities into a data structure, which makes the MapReduce style unnecessary and impractical.

To sum up, the problems that fit the MRv1 framework were only those that needed an associative array, and those problems naturally tended toward big-data computation. However, problems will not always be limited to this paradigm, because you can now abstract them far more easily, writing custom clients, application masters, and applications that match whatever design you want.

5. Writing a simple MapReduce application for YARN

We will use the WordCount example from the official Apache Hadoop website to illustrate how to write a MapReduce program.

Source Code

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Write your own Mapper: it needs to extend org.apache.hadoop.mapreduce.Mapper
  // and declare the input and output key/value types.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    // member variables reused across calls to map()
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // key: the byte offset of the line, which can almost be ignored
    // value: one line of input data
    // context: the context of the computation, used to emit output
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Write your own Reducer: it needs to extend org.apache.hadoop.mapreduce.Reducer.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // The main function configures and starts the JOB.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // The JOB is submitted here; exit the JVM when it completes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

6. Source code analysis of submitting a Job in Hadoop 2.0

-- At this point, communication with the RM server has been established.

-- The next step is to submit the job itself.
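The screenshots of the source-code walkthrough are not reproduced here, but a rough sketch of the client-side call chain in the Hadoop 2.x MapReduce code looks like this (class and method names taken from org.apache.hadoop.mapreduce; details may differ slightly between versions):

Job.waitForCompletion(true)
  -> Job.submit()
       -> Job.connect()                     // set up the connection to the cluster (the RM when running on YARN)
       -> JobSubmitter.submitJobInternal()  // get a new job ID, copy the jar and configuration
                                            // to the HDFS staging directory, and submit the job to the RM
  -> Job.monitorAndPrintJob()               // poll and print progress until the job finishes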

This is the end of "How to Implement Job Submission in Hadoop". Thank you for reading. If you want to learn more about the industry, you can follow this website, where the editor will keep publishing more high-quality practical articles for you!
