This article explains how to configure, test, and debug Map-Reduce in Hadoop.
Configuration, testing and debugging of Map-Reduce
Environment: CDH 6.1
Configuration file location
With CDH 6.1, the configuration lives in /etc/hadoop/conf. In fact, there are several directories under /etc/hadoop, such as conf.dist, conf.pseudo, and conf.impala.
File list
hadoop-env.sh, which controls global environment variables.
core-site.xml, where the most important parameter is fs.defaultFS:
1. value = file:/// — standalone mode (single-node standalone): Hadoop runs in a single JVM process. Mainly convenient for debugging.
2. value = hdfs://localhost:8020 — pseudo-distributed mode: each daemon runs in its own JVM process, but still on one host. Mainly used for learning, testing, and debugging.
3. value = hdfs://host:8020 — cluster mode.
hdfs-site.xml, where the most important parameter is dfs.replication.
Except in cluster mode, where it is 3, it is generally set to 1.
dfs.namenode.replication.min = 1 — the minimum replication level a block must reach to be considered successfully written.
mapred-site.xml, where the most important parameter is mapred.job.tracker,
i.e. which machine the jobtracker runs on.
yarn-site.xml, mainly used to configure the ResourceManager.
hadoop-metrics.properties: if Ambari is used, this file must be configured so that monitoring metrics are sent to the Ambari server.
log4j.properties
If multiple configuration files are loaded, a value loaded later generally overrides the same value loaded earlier. To prevent unwanted overrides, a property can be marked with the keyword final, which blocks later overrides; see the sketch below.
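A minimal core-site.xml sketch illustrating a final property (the host name and port are placeholders reused from the pseudo-distributed example above, not values from a real cluster):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
    <!-- final prevents later-loaded configuration files from overriding this value -->
    <final>true</final>
  </property>
</configuration>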
As for conf files versus JVM properties: certain settings can also be passed as JVM properties, for example with -D on the command line. Values set this way take the highest priority, higher than the conf files (a fuller example follows):
% hadoop jar -Ddfs.replication=1 ...
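As a sketch, reusing the jar and paths from the cluster example later in this article (wordcount.jar with MyWordCount as its main class), a per-job override would look like this; the -D option is picked up by GenericOptionsParser because the driver runs through ToolRunner:

% hadoop jar wordcount.jar -Ddfs.replication=1 data/in data/out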
Map-Reduce Sample
First, the main program. MyWordCount extends Configured and implements Tool; Configured exists mainly to help Tool implement Configurable.
Interface Tool extends Configurable
Configured extends Configurable
ToolRunner is usually used to launch the program; it invokes GenericOptionsParser internally, so your program automatically accepts the generic command-line options.
One difference from Hadoop 1 is the package: here it is org.apache.hadoop.mapreduce, while in 1.x, as I recall, it was org.apache.hadoop.mapred.
/* write by jinbao */
package com.jinbao.hadoop.mapred.unittest;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * @author cloudera
 */
public class MyWordCount extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        try {
            ToolRunner.run(new MyWordCount(), args);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("usage: %s [generic options] <input> <output>%n",
                    getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word counting");
        job.setJarByClass(MyWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private static IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
MapReduce Web UI
MRv1: http://jobtracker-host:50030
MRv2: http://resourcemgr-host:8088/cluster
Application details can be viewed in the Job History server.
Unit Test-MRUnit
This is a toolkit specifically for map-reduce unit testing
Dependencies to download:
1. JUnit — Eclipse already ships with it, and it is also available under Hadoop's lib directory.
2. Mockito — included in the package below.
3. PowerMock — download it from its project site.
4. MRUnit — available from the Apache MRUnit project.
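If you build with Maven instead of downloading jars by hand, the MRUnit dependency can be declared roughly as follows (the version and classifier are assumptions; pick the ones matching your Hadoop line):

<dependency>
  <groupId>org.apache.mrunit</groupId>
  <artifactId>mrunit</artifactId>
  <version>1.1.0</version>
  <!-- hadoop2 classifier for the MRv2 API used in this article; hadoop1 for the old API -->
  <classifier>hadoop2</classifier>
  <scope>test</scope>
</dependency>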
Here is my program:
package com.jinbao.hadoop.mapred.unittest;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.*;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.powermock.core.classloader.annotations.PrepareForTest;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.apache.hadoop.mrunit.types.Pair;

public class MyWordCountTest {
    private MapDriver<Object, Text, Text, IntWritable> mapDriver;
    private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;
    private MapReduceDriver<Object, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;

    @Before
    public void setUp() {
        MyWordCount.TokenizerMapper mapper = new MyWordCount.TokenizerMapper();
        MyWordCount.SumReducer reducer = new MyWordCount.SumReducer();
        mapDriver = MapDriver.newMapDriver(mapper);
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
        mapReduceDriver = MapReduceDriver.newMapReduceDriver(mapper, reducer);
    }

    @Test
    public void testMapper() throws IOException {
        mapDriver.withInput(new LongWritable(), new Text("test input from unit test"));
        ArrayList<Pair<Text, IntWritable>> outputRecords = new ArrayList<Pair<Text, IntWritable>>();
        outputRecords.add(new Pair<Text, IntWritable>(new Text("test"), new IntWritable(1)));
        outputRecords.add(new Pair<Text, IntWritable>(new Text("input"), new IntWritable(1)));
        outputRecords.add(new Pair<Text, IntWritable>(new Text("from"), new IntWritable(1)));
        outputRecords.add(new Pair<Text, IntWritable>(new Text("unit"), new IntWritable(1)));
        outputRecords.add(new Pair<Text, IntWritable>(new Text("test"), new IntWritable(1)));
        mapDriver.withAllOutput(outputRecords);
        mapDriver.runTest();
    }

    @Test
    public void testReducer() throws IOException {
        reduceDriver.withInput(new Text("input"),
                new ArrayList<IntWritable>(Arrays.asList(new IntWritable(1), new IntWritable(3))));
        reduceDriver.withOutput(new Text("input"), new IntWritable(4));
        reduceDriver.runTest();
    }

    @Test
    public void testMapperReducer() throws IOException {
        mapReduceDriver.withInput(new LongWritable(), new Text("test input input input input input test"));
        ArrayList<Pair<Text, IntWritable>> outputRecords = new ArrayList<Pair<Text, IntWritable>>();
        outputRecords.add(new Pair<Text, IntWritable>(new Text("input"), new IntWritable(5)));
        outputRecords.add(new Pair<Text, IntWritable>(new Text("test"), new IntWritable(2)));
        mapReduceDriver.withAllOutput(outputRecords);
        mapReduceDriver.runTest();
    }
}

Run MRUnit
Running the @Test methods above directly should resolve more than 90% of problems. If your unit-test coverage is too low, problems that only surface on the cluster later will be much more expensive to debug.
Run Locally
Configure Debug Configuration in Eclipse:
/home/cloudera/workspace/in /home/cloudera/workspace/out
Note: the local job runner works against local directories. By default, running through ToolRunner this way starts a single standalone JVM to run Hadoop, and there can be only 0 or 1 reducer. That is not a limitation here, since the whole point is convenient debugging.
For the local runner, mapreduce.framework.name must be set to local; since that is already the default, you normally do not need to change anything.
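As a sketch (jar name and paths reuse the ones appearing elsewhere in this article), the local runner can also be forced from the command line through the generic options that ToolRunner parses; whether this is needed depends on what your client configuration already sets:

% hadoop jar wordcount.jar -Dmapreduce.framework.name=local -Dfs.defaultFS=file:/// /home/cloudera/workspace/in /home/cloudera/workspace/out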
Run in Cluster
Export the jar. I do it with Eclipse; you can use Ant, the command line, etc., depending on your preference.
If your jar has dependencies, put the dependency jars into a lib directory inside it and specify the main class in the manifest. Such a package is structured much like a war package.
% hadoop fs -copyFromLocal /home/cloudera/word.txt data/in
% hadoop jar wordcount.jar data/in data/out
IsolationRunner and Remote Debugger
Premise: keep.failed.task.files. This option defaults to false, meaning that for a failed task the temporary data and working directories are not kept. It is a per-job setting; add it when submitting the job.
How to rerun: once the failed task's environment has been preserved, you can rerun that single task.
1. Go to the tasktracker machine where the task failed.
2. Find the failed task's runtime directory on that tasktracker. Each task has its own execution environment, including its work directory, its intermediate files, and the configuration it needs to run. These directories are determined by the tasktracker setting mapred.local.dir, which may be a comma-separated list of paths; each path is a root under which the tasktracker creates working directories for the tasks it executes. For example, if mapred.local.dir=/disk1/mapred/local,/disk2/mapred/local, the task execution environment is under <mapred.local.dir>/taskTracker/jobcache/<job-ID>/<task-attempt-ID>.
3. Inside that directory you will find the task's runtime environment, which usually includes a work directory, a job.xml file, and the data file the task operates on (split.dta for a map, file.out for a reduce).
4. Rerun the task:
   cd work
   hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
   IsolationRunner reads the configuration from job.xml (which here plays the role of the client-side hadoop-site.xml plus any command-line -D options) and reruns the map or reduce.
At this point the task can be rerun standalone, but single-step breakpoint debugging is still unsolved. For that, use the JVM's remote debug facility:
1. Before rerunning the task, export an environment variable: export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8888"
2. The hadoop command will then expose the debug connection on port 8888. In your local development IDE (for example Eclipse), launch a remote debugging session, set a breakpoint in the code, and you can single-step the individual map or reduce task running on the tasktracker. For details, see http://blog.csdn.net/cwyspy/article/details/10004995
Note: unfortunately, IsolationRunner no longer works in recent versions, so on hadoop2 you need to find the failed node, copy out the problem file, and debug it on a standalone machine.
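For the older MRv1 setup described above, the whole sequence looks roughly like this (the job and attempt IDs are placeholders, and the local dir is the example one from above):

# on the tasktracker that ran the failed attempt
export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8888"
cd /disk1/mapred/local/taskTracker/jobcache/<job-ID>/<task-attempt-ID>/work
hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
# then attach a remote debugger (e.g. Eclipse) to port 8888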
Merge result set
Depending on the number of reducers, there can be multiple part-* result files, which you can merge with the following commands:
% hadoop fs -getmerge max-temp max-temp-local
% sort max-temp-local | tail
Tuning a Job
Number of mappers
Number of reducers
Combiners
Intermediate compression
Custom serialization
Shuffle tweaks
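To make a few of these knobs concrete, here is a hedged driver-side sketch that reuses the MyWordCount classes from above and the standard MRv2 property names; treat it as an illustration rather than recommended settings:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class TuningSketch {
    // Apply a few of the tuning knobs listed above to a word-count style job.
    static void applyTuning(Job job) {
        // Combiner: SumReducer just sums counts, so it is safe to run as a combiner
        // and cut down the data shuffled between map and reduce.
        job.setCombinerClass(MyWordCount.SumReducer.class);

        // Intermediate compression: compress map output with Snappy
        // (requires the native Snappy library to be available on the nodes).
        Configuration conf = job.getConfiguration();
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        // Number of reducers (the number of mappers follows from the input splits).
        job.setNumReduceTasks(2);
    }
}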
MapReduce Workflows
As a rule of thumb, think about adding more jobs rather than adding complexity to a single job.
ChainMapper and ChainReducer
It is a Map*/Reduce model: multiple mappers run as a chain, and the output of the last mapper goes to the reducer. The idea is to reduce network I/O compared with chaining separate jobs.
Although it is called ChainReducer, there is actually only one Reducer working with the ChainMapper, hence the name.
Mapper1 -> Mapper2 -> ... -> MapperN -> Reducer
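A hedged sketch of wiring such a chain with the MRv2 chain classes. TokenizerMapper and SumReducer are the real classes from the word-count example above; StopWordFilterMapper is a hypothetical second mapper (Text/IntWritable in and out) added only to show the chaining:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

public class ChainSketch {
    // Build Mapper1 -> Mapper2 -> Reducer inside a single map/reduce job.
    static void buildChain(Job job) throws IOException {
        Configuration emptyConf = new Configuration(false);

        // Mapper1: the word-count tokenizer (Object/Text in, Text/IntWritable out).
        ChainMapper.addMapper(job, MyWordCount.TokenizerMapper.class,
                Object.class, Text.class, Text.class, IntWritable.class, emptyConf);

        // Mapper2: hypothetical filter that drops stop words (Text/IntWritable in and out).
        ChainMapper.addMapper(job, StopWordFilterMapper.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class, emptyConf);

        // Single reducer at the end of the chain, as in the picture above.
        ChainReducer.setReducer(job, MyWordCount.SumReducer.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class, emptyConf);
    }
}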
JobControl
MR has a JobControl class, but in my testing it is not particularly well maintained.
The simple way to use it:
if (run(job1))
    run(job2)
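For completeness, a hedged sketch of the same two-step dependency expressed with JobControl and ControlledJob; job1 and job2 are assumed to be fully configured Job instances (built like the word-count driver above):

import java.util.Collections;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class WorkflowSketch {
    // Runs job1, then job2 only after job1 succeeds; returns true if both succeeded.
    static boolean runChain(Job job1, Job job2) throws Exception {
        ControlledJob cj1 = new ControlledJob(job1, null);
        ControlledJob cj2 = new ControlledJob(job2, Collections.singletonList(cj1));

        JobControl control = new JobControl("two-step-workflow");
        control.addJob(cj1);
        control.addJob(cj2);

        // JobControl implements Runnable; drive it from a background thread.
        Thread t = new Thread(control);
        t.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);
        }
        control.stop();
        return control.getFailedJobList().isEmpty();
    }
}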
Apache Oozie
Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store the following:
Workflow definition
Currently running workflow instance, including instance status and variables
An Oozie workflow is a set of actions (for example Hadoop Map/Reduce jobs, Pig jobs, and so on) arranged in a control-dependency DAG (Directed Acyclic Graph), which specifies the order in which the actions execute. The graph is described in hPDL, an XML process definition language.
hPDL is a concise language that uses only a small number of flow-control and action nodes. Control nodes define the flow of execution and include the start and end points of a workflow (start, end, and fail nodes) as well as the mechanisms that control its execution path (decision, fork, and join nodes). Action nodes are the mechanism by which a workflow triggers the execution of a computation or processing task. Oozie supports the following action types: Hadoop map-reduce, Hadoop file system, Pig, Java, and Oozie sub-workflows (SSH actions have been removed from versions later than Oozie schema 0.2). A minimal workflow sketch follows.
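A hedged hPDL sketch of a one-step map-reduce workflow; the element names follow the Oozie workflow schema, while ${jobTracker}, ${nameNode}, and the input/output parameters are placeholders supplied at submission time:

<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Word count failed</message>
  </kill>
  <end name="end"/>
</workflow-app>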