This article analyzes the key points involved in Hadoop job submission. The material is presented simply and step by step; follow along to see how a job makes its way from the client code to the JobTracker.
The Configuration class provides access to Hadoop's configuration parameters.
It first loads the default configuration files core-default.xml and core-site.xml in a static initializer block, with the following code:
  static {
    // print deprecation warning if hadoop-site.xml is found in classpath
    ClassLoader cL = Thread.currentThread().getContextClassLoader();
    if (cL == null) {
      cL = Configuration.class.getClassLoader();
    }
    if (cL.getResource("hadoop-site.xml") != null) {
      LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
          "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, " +
          "mapred-site.xml and hdfs-site.xml to override properties of " +
          "core-default.xml, mapred-default.xml and hdfs-default.xml " +
          "respectively");
    }
    addDefaultResource("core-default.xml");
    addDefaultResource("core-site.xml");
  }
defaultResources is an ArrayList that holds the paths of the default configuration files. If a default resource path is not already in defaultResources, it is added; this logic is implemented in the addDefaultResource method.
properties is a java.util.Properties object that holds the key/value pairs parsed from the configuration files; if the same key appears in more than one file, the value from the file loaded later overrides the earlier one.
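To make the override behavior concrete, here is a minimal sketch (not taken from Hadoop's source; the resource name my-site.xml and the property looked up are just examples):

  import org.apache.hadoop.conf.Configuration;

  public class ConfDemo {
    public static void main(String[] args) {
      Configuration conf = new Configuration();   // core-default.xml and core-site.xml are loaded by the static block
      conf.addResource("my-site.xml");            // hypothetical extra resource; loaded after the defaults
      // If both core-site.xml and my-site.xml define fs.default.name,
      // the value from my-site.xml (loaded later) wins.
      System.out.println(conf.get("fs.default.name", "file:///"));
    }
  }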
The JobConf class, which extends Configuration, is used to configure the information of a Map/Reduce job.
JobConf likewise loads the mapred-default.xml and mapred-site.xml configuration files in a static initializer block.
DEFAULT_MAPRED_TASK_JAVA_OPTS = "-Xmx200m": by default, the JVM command-line options passed to task JVMs cap the maximum heap at 200 MB.
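A small illustrative sketch of reading and overriding that default through the mapred.child.java.opts property (the 512 MB value is just an example, not something the article prescribes):

  import org.apache.hadoop.mapred.JobConf;

  public class TaskHeapDemo {
    public static void main(String[] args) {
      JobConf conf = new JobConf();
      // The default comes from DEFAULT_MAPRED_TASK_JAVA_OPTS, i.e. "-Xmx200m".
      System.out.println(conf.get("mapred.child.java.opts", "-Xmx200m"));
      // Hypothetical override: give each task JVM a 512 MB heap instead.
      conf.set("mapred.child.java.opts", "-Xmx512m");
    }
  }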
The JobClient class is the main interface through which users interact with the JobTracker: it is used to submit jobs, track their progress, access the logs of the component tasks, query the status of the cluster, and so on.
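Before looking at the internals, this is roughly how a driver uses JobClient in the old API; a hedged sketch using the stock IdentityMapper/IdentityReducer, with input and output paths taken from the command line (the job name and classes here are illustrative, not from the article):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.IdentityMapper;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  public class IdentityDriver {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(IdentityDriver.class);
      conf.setJobName("identity-demo");             // if omitted, the jar file name is used
      conf.setOutputKeyClass(LongWritable.class);   // matches what IdentityMapper passes through
      conf.setOutputValueClass(Text.class);
      conf.setMapperClass(IdentityMapper.class);
      conf.setReducerClass(IdentityReducer.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);   // submits the job and blocks until it completes
    }
  }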
The job submission is implemented through the runJob method, and the related code is as follows:
  public static RunningJob runJob(JobConf job) throws IOException {
    JobClient jc = new JobClient(job);
    RunningJob rj = jc.submitJob(job);
    try {
      if (!jc.monitorAndPrintJob(job, rj)) {
        LOG.info("Job Failed: " + rj.getFailureInfo());
        throw new IOException("Job failed!");
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
    return rj;
  }
First, a JobClient object is created; its constructor connects to the JobTracker using the settings in the JobConf object.
JobClient communicates with the JobTracker through jobSubmitClient, a dynamic proxy of type JobSubmissionProtocol generated by the following method:
  private static JobSubmissionProtocol createRPCProxy(InetSocketAddress addr,
      Configuration conf) throws IOException {
    return (JobSubmissionProtocol) RPC.getProxy(JobSubmissionProtocol.class,
        JobSubmissionProtocol.versionID, addr,
        UserGroupInformation.getCurrentUser(), conf,
        NetUtils.getSocketFactory(conf, JobSubmissionProtocol.class));
  }
The key to the getProxy method is the Invoker class, which implements the InvocationHandler interface and has two main member variables. remoteId is of type Client.ConnectionId; it stores the server address and the user's ticket and uniquely identifies a client-to-server connection. Several configuration property values are also visible here; the default rpcTimeout is 0.
ipc.client.connection.maxidletime: the maximum idle time of a client connection; the default is 10 s.
ipc.client.connect.max.retries: the maximum number of times the client retries establishing a connection to the server; the default is 10.
ipc.client.tcpnodelay: whether to set TCP_NODELAY, which disables Nagle's algorithm (an optimization that batches small TCP packets); enabling it reduces latency at the cost of sending more small packets. The default is false.
client is a Client object used for the actual IPC communication. Client instances are cached by the ClientCache class: if no matching Client is in the cache a new one is created, otherwise the reference count of the existing one is incremented. The main method of the Invoker class is invoke, which issues the call through the Client and returns the result, so calls made on the dynamic proxy object are ultimately carried out by this Client object.
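Hadoop's Invoker itself is not reproduced here, but the following minimal sketch shows the same JDK dynamic-proxy pattern: every call on the proxy object is routed into an InvocationHandler, which in Hadoop's case forwards it to the RPC Client (DemoProtocol and the returned string are made up for illustration):

  import java.lang.reflect.InvocationHandler;
  import java.lang.reflect.Method;
  import java.lang.reflect.Proxy;

  public class ProxyDemo {
    // A toy protocol interface standing in for JobSubmissionProtocol.
    public interface DemoProtocol {
      String getNewJobId();
    }

    public static void main(String[] args) {
      InvocationHandler invoker = new InvocationHandler() {
        @Override
        public Object invoke(Object proxy, Method method, Object[] params) {
          // In Hadoop's Invoker this is where the call is handed to the RPC
          // Client, sent to the server, and the response is returned.
          return "job_demo_0001 (placeholder for the server's reply)";
        }
      };
      DemoProtocol p = (DemoProtocol) Proxy.newProxyInstance(
          DemoProtocol.class.getClassLoader(),
          new Class<?>[] { DemoProtocol.class },
          invoker);
      System.out.println(p.getNewJobId());   // this call is routed through invoke()
    }
  }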
The submitJobInternal method is what actually submits the job. The steps are as follows:
1. Initialize the staging directory. Its root is configured by mapreduce.jobtracker.staging.root.dir and defaults to /tmp/hadoop/mapred/staging; for a given user, the staging directory is $ROOT/userName/.staging.
2. Obtain a new job id from the JobTracker; job ids are assigned incrementally starting from 1.
3. Compute submitJobDir = the submitting user's staging directory/jobid, and set this directory as the value of mapreduce.job.dir.
4. Copy and configure the job files (copyAndConfigureFiles). First read the replication factor from the configuration property mapred.submit.replication (default 10). If the submitJobDir directory already exists, throw an exception; otherwise create it. Compute the job's distributed-cache paths: files path = submitJobDir/files, archives path = submitJobDir/archives, libjars path = submitJobDir/libjars. If the command line specified tmpfiles, copy those files to the files path and add that path to the distributed cache; if it specified tmpjars, copy them to the libjars path and add that path to the distributed cache; if it specified tmparchives, copy them to the archives path and add that path to the distributed cache. Get the path of the job jar from the mapred.jar property; if no job name was specified, the jar file name is used as the job name. Compute the job-jar storage path submitJobDir/job.jar and copy the user-specified jar there. Finally set the working directory, which defaults to the value of the configuration property mapred.working.dir. (A programmatic sketch of this distributed-cache setup is given after the step list.)
5. Compute the job configuration file path submitJobFile = submitJobDir/job.xml; set mapreduce.job.submithostaddress to the local IP address and mapreduce.job.submithost to the local hostname.
6. Create the input splits for the job; this is done by the writeSplits method. Taking the old API as an example, it first calls the getSplits method of the InputFormat to obtain an array of InputSplit objects. The getSplits method of the FileInputFormat class is implemented as follows:
Obtain the list of input file paths through the listStatus method, filtering out paths whose names begin with _ or . as well as anything excluded by the filter configured via mapred.input.pathFilter.class.
Set mapreduce.input.num.files in the JobConf to the number of input files.
Calculate the total size totalSize of all input files; the target split size goalSize = totalSize / numSplits (numSplits comes from mapred.map.tasks, default 1); and the minimum split size minSize, the larger of the mapred.min.split.size setting and 1. For each non-empty, splittable input file, compute the split size splitSize = Math.max(minSize, Math.min(goalSize, blockSize)), where blockSize is the HDFS block size of the file. For each split, compute the array of hosts that contribute the most data to it (weighted by the bytes contributed per rack and per block) and add the split to the split list. The split list is then sorted by split length in descending order and written to the split file submitJobDir/job.split, whose format is: the "SPL" header bytes, the split version number, and then, repeated for each split, the InputSplit class name followed by the serialized InputSplit. A SplitMetaInfo array records, for each split, its offset within the split file, its host information, and its length; this array is written to the file submitJobDir/job.splitmetainfo. (The split-size formula is worked through in a small sketch after the step list.)
7. Set mapred.map.tasks in the JobConf to the number of splits.
8. Get the name of the queue the job is submitted to from mapred.job.queue.name (the default is "default"), and then obtain the queue's access control list from that name.
9. Write the reconfigured JobConf to the submitJobDir/job.xml file.
10. Send the jobid and submitJobDir to the JobTracker to formally submit the job, and track the job's status through a NetworkedJob object.
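As mentioned in step 4, the tmpfiles/tmpjars/tmparchives handling corresponds to the files, libjars and archives distributed-cache paths. A hedged sketch of setting these up programmatically with the old DistributedCache API (all HDFS paths are placeholders):

  import java.net.URI;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;

  public class CacheDemo {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf();
      // All paths below are placeholders; they must already exist in HDFS.
      DistributedCache.addCacheFile(new URI("/user/demo/lookup.txt"), conf);       // like -files / tmpfiles
      DistributedCache.addCacheArchive(new URI("/user/demo/dict.zip"), conf);      // like -archives / tmparchives
      DistributedCache.addFileToClassPath(new Path("/user/demo/extra.jar"), conf); // like -libjars / tmpjars
    }
  }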
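And as mentioned in step 6, the split size is essentially the block size clamped between the minimum size and the goal size. A minimal arithmetic sketch with made-up numbers:

  public class SplitSizeDemo {
    public static void main(String[] args) {
      long totalSize = 10L * 1024 * 1024 * 1024;   // 10 GB of input (made-up number)
      int numSplits  = 1;                          // mapred.map.tasks, default 1
      long goalSize  = totalSize / numSplits;
      long minSize   = 1L;                         // max(mapred.min.split.size, 1), default 1
      long blockSize = 64L * 1024 * 1024;          // HDFS block size of the file (example: 64 MB)
      long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));
      System.out.println("splitSize = " + splitSize);   // 67108864, i.e. one split per 64 MB block
    }
  }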
The monitorAndPrintJob method monitors the running job and prints its status in real time.
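For comparison, a hedged sketch of submitting without runJob and polling the RunningJob yourself instead of relying on monitorAndPrintJob (it assumes the JobConf has already been fully configured elsewhere):

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RunningJob;

  public class MonitorDemo {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf();        // assumed to be fully configured elsewhere
      JobClient jc = new JobClient(conf);
      RunningJob rj = jc.submitJob(conf);  // submit without blocking
      while (!rj.isComplete()) {
        System.out.printf("map %.0f%% reduce %.0f%%%n",
            rj.mapProgress() * 100, rj.reduceProgress() * 100);
        Thread.sleep(5000);                // poll every 5 seconds
      }
      System.out.println(rj.isSuccessful() ? "Job succeeded" : "Job failed");
    }
  }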
That covers the main points of how a Hadoop job is submitted. The details above are best confirmed by reading the source code and verifying them in practice.