Let's start by analyzing the internal operation mechanism of Hadoop MapReduce. The user submits a Job to Hadoop, and the job executes under the control of a JobTracker object. The Job is broken down into Tasks, which are distributed across the cluster and run under the control of TaskTrackers. Tasks come in two kinds, MapTask and ReduceTask, and these are where MapReduce's Map and Reduce operations are actually performed. This division of labor mirrors that between NameNode and DataNode in HDFS: JobTracker corresponds to NameNode, and TaskTracker to DataNode. The JobTracker, the TaskTrackers, and the MapReduce clients communicate through RPC; for details, please refer to the HDFS analysis.
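To make this flow concrete, here is a minimal sketch of the classic (old-API) submission path, using the identity mapper and reducer that ship with Hadoop. Treat it as illustrative: the class and method names below exist in the pre-0.21 mapred API this article analyzes, but exact signatures vary across Hadoop releases.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitDemo {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitDemo.class);
        conf.setJobName("submit-demo");
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        // The default TextInputFormat produces LongWritable/Text pairs,
        // which the identity mapper and reducer pass through unchanged.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // JobClient talks to the JobTracker over RPC; runJob blocks until
        // the job's tasks have finished on the TaskTrackers.
        RunningJob running = JobClient.runJob(conf);
        System.out.println("Job " + running.getID() + " complete");
    }
}
```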
Let's first look at some auxiliary classes, starting with the ID-related classes. The inheritance tree of ID is as follows:
The figure reflects some of the churn caused by Hadoop's ongoing migration from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce; the classes shown in gray are marked @Deprecated. ID wraps an integer and implements the WritableComparable interface, which means it is comparable and can be serialized/deserialized by Hadoop's I/O mechanism (the compareTo/readFields/write methods must be implemented). JobID is the unique identifier the system assigns to a job; its toString form is job_&lt;jtIdentifier&gt;_&lt;jobNumber&gt;. For example, job_200707121733_0003 is the third job of the JobTracker started at 200707121733 (the JobTracker's start time serves as its identifier).
A job is divided into tasks. The task number, TaskID, contains the ID of the job it belongs to and a task number, and also records whether this is a Map task (member variable isMap). A task number's string form is task_&lt;jtIdentifier&gt;_&lt;jobNumber&gt;_[m|r]_&lt;taskNumber&gt;. For example, task_200707121733_0003_m_000005 is task 000005 of job job_200707121733_0003, and that task is a Map task.
A task may be executed more than once (for error recovery, or speculative execution to eliminate stragglers), so the individual executions of a task must be distinguished. This is done through the class TaskAttemptID, which appends an attempt number to the task number. An example of a task attempt ID is attempt_200707121733_0003_m_000005_0, which is attempt 0 of task task_200707121733_0003_m_000005.
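The three ID formats nest into one another, which the classes make explicit: a TaskID is built from a JobID, and a TaskAttemptID from a TaskID. The sketch below constructs the IDs from the examples above; the constructor signatures follow the older mapred classes and may differ in later releases.

```java
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.TaskAttemptID;
import org.apache.hadoop.mapred.TaskID;

public class IdDemo {
    public static void main(String[] args) {
        // JobID: jtIdentifier (the JobTracker's start time) + job number
        JobID jobId = new JobID("200707121733", 3);
        System.out.println(jobId);        // job_200707121733_0003

        // TaskID: owning job + isMap flag + task number
        TaskID taskId = new TaskID(jobId, true, 5);
        System.out.println(taskId);       // task_200707121733_0003_m_000005

        // TaskAttemptID: task + attempt number
        TaskAttemptID attemptId = new TaskAttemptID(taskId, 0);
        System.out.println(attemptId);    // attempt_200707121733_0003_m_000005_0

        // IDs are WritableComparable, so they order deterministically;
        // forName parses an ID back from its string form.
        System.out.println(jobId.compareTo(JobID.forName("job_200707121733_0004")));
    }
}
```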
JVMId is used to manage the Java virtual machines used during task execution; we will discuss it later.
To make Job and Task work, Hadoop provides a series of context classes that hold information about a Job or Task.
At the top of the inheritance tree is org.apache.hadoop.mapreduce.JobContext, which we introduced earlier. It provides read-only properties of a Job through two member variables: one holds the JobID, and the other, of type JobConf, holds all of the job's information other than the JobID. JobContext defines the following configuration items (a sketch of setting and reading them follows the list):
- mapreduce.inputformat.class: the InputFormat implementation
- mapreduce.map.class: the Mapper implementation
- mapreduce.combine.class: the Reducer implementation used as the combiner
- mapreduce.reduce.class: the Reducer implementation
- mapreduce.outputformat.class: the OutputFormat implementation
- mapreduce.partitioner.class: the Partitioner implementation
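In the new API, these keys are normally written through setters on Job (a subclass of JobContext) rather than set by hand. A minimal sketch, using the identity behavior of the base Mapper and Reducer classes so it stays self-contained; the key names in the comments match the Hadoop version discussed here (later releases renamed them), and the Job(Configuration) constructor was later superseded by Job.getInstance.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class ContextConfigDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setInputFormatClass(TextInputFormat.class);   // mapreduce.inputformat.class
        job.setMapperClass(Mapper.class);                 // mapreduce.map.class
        job.setCombinerClass(Reducer.class);              // mapreduce.combine.class
        job.setReducerClass(Reducer.class);               // mapreduce.reduce.class
        job.setOutputFormatClass(TextOutputFormat.class); // mapreduce.outputformat.class
        job.setPartitionerClass(HashPartitioner.class);   // mapreduce.partitioner.class

        // JobContext's reflective getters read the keys back as Class objects.
        System.out.println(job.getMapperClass());
    }
}
```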
JobContext also provides methods that look up the Class object corresponding to a configured class name, using the Class.forName mechanism of Java reflection. The JobContext of org.apache.hadoop.mapred has one member variable more than org.apache.hadoop.mapreduce.JobContext, progress, which is used to report progress; its JobConf-typed member job points to the corresponding member of mapreduce.JobContext, and no new functionality is added.
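Underneath, these getters delegate to Configuration.getClass, which resolves the configured name through the class loader (effectively Class.forName) and falls back to a default when the key is unset. A small sketch of that mechanism:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Mapper;

public class ReflectiveLookupDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapreduce.map.class", "org.apache.hadoop.mapreduce.Mapper");
        // getClass resolves the stored class name via the class loader,
        // returning the default (here, Mapper.class) if the key is unset.
        Class<?> cls = conf.getClass("mapreduce.map.class", Mapper.class);
        System.out.println(cls.getName());
    }
}
```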
JobConf inherits from Configuration and maintains the configuration information needed for a MapReduce run. It manages 46 configuration parameters, including the older-style counterparts of the mapreduce configuration items above; for example, mapreduce.map.class corresponds to the legacy key mapred.mapper.class. We will introduce these configuration items as we encounter them.
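The old and new APIs therefore store the same kind of setting under different keys. A quick way to see the legacy key in action (the key name mapred.mapper.class is the one used by the old API in this Hadoop generation):

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class OldKeyDemo {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // The old-API setter stores the mapper under the legacy key name.
        conf.setMapperClass(IdentityMapper.class);
        System.out.println(conf.get("mapred.mapper.class"));
        // prints: org.apache.hadoop.mapred.lib.IdentityMapper
    }
}
```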
Job, a subclass of org.apache.hadoop.mapreduce.JobContext, was also introduced earlier; we will return to it when we discuss the dynamic behavior of the system.
TaskAttemptContext is used during task execution. It introduces the TaskAttemptID that identifies the task attempt, as well as the task's status, and provides new accessor methods. The TaskAttemptContext of org.apache.hadoop.mapred inherits from the corresponding mapreduce version, but adds progress for progress reporting.
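User code sees TaskAttemptContext through the Context objects passed to map and reduce: in the new API, Mapper.Context inherits from TaskAttemptContext, so the attempt ID is available inside a task. A minimal sketch (AttemptAwareMapper is a hypothetical name for illustration):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A mapper that logs which task attempt is running it; Context extends
// TaskAttemptContext, so getTaskAttemptID() is available here.
public class AttemptAwareMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void setup(Context context) {
        System.err.println("Running in attempt " + context.getTaskAttemptID());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, new LongWritable(1)); // trivial pass-through
    }
}
```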