How to configure, test and debug Map-Reduce in Hadoop


This article explains in detail how to configure, test and debug Map-Reduce in Hadoop. The editor finds it very practical and shares it here for reference; I hope you get something out of it after reading.

Configuration, testing and debugging of Map-Reduce

Environment:

CDH 6.1

Configuration file location

With CDH 6.1, the configuration lives in /etc/hadoop/conf. In fact there are several directories under /etc/hadoop, such as conf.dist, conf.pseudo and conf.impala.

File list

hadoop-env.sh, which controls global environment variables.

core-site.xml, where the most important parameter is fs.defaultFS:

1. value = file:/// (the local filesystem): standalone (single-node) mode, where everything runs in a single JVM process with no separate daemons. It is mainly convenient for debugging.

2. value = hdfs://localhost:8020: pseudo-distributed mode, where each daemon runs in its own JVM process, but still on one host. Mainly used for learning, testing and debugging.

3. value = hdfs://host:8020: cluster mode.
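As a quick illustration (a sketch added here, not from the original article), the same three values can be checked or forced through the Configuration API; normally they come from core-site.xml, and the class name ShowDefaultFs is made up for the example:

package com.jinbao.hadoop.mapred.unittest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Hypothetical helper class; fs.defaultFS is normally set in core-site.xml,
// and the commented conf.set lines just mirror the three values described above.
public class ShowDefaultFs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // conf.set("fs.defaultFS", "file:///");               // standalone
        // conf.set("fs.defaultFS", "hdfs://localhost:8020");  // pseudo-distributed
        // conf.set("fs.defaultFS", "hdfs://host:8020");       // cluster
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default filesystem: " + fs.getUri());
    }
}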

hdfs-site.xml, where the most important parameter is dfs.replication.

It is generally set to 1, except in cluster mode where it is 3.

dfs.namenode.replication.min = 1 is the lower bound for block replication.

mapred-site.xml, where the most important parameter is mapred.job.tracker, i.e. which machine the jobtracker runs on.

yarn-site.xml, mainly used to configure the resourcemanager.

hadoop-metrics.properties: if Ambari is used, this file needs to be configured so that monitoring metrics are sent to the Ambari server.

log4j.properties

When multiple configuration files are loaded, a property loaded later generally overrides the same property loaded earlier. To prevent unwanted overrides, a property can be marked with the keyword final in the configuration file, which stops later files from overriding it.

As for configuration files versus JVM properties: certain settings can be passed as JVM properties on the command line, and if you do so they take the highest priority, higher than the configuration files, for example:

hadoop jar <your-job.jar> -Ddfs.replication=1
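A minimal sketch (added for illustration, not from the original article) of how such a -D option becomes visible inside a program, assuming the ToolRunner setup used later in this article; the class name PrintReplication is made up for the example:

package com.jinbao.hadoop.mapred.unittest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical class for illustration. Run it with something like:
//   hadoop jar <your-job.jar> com.jinbao.hadoop.mapred.unittest.PrintReplication -D dfs.replication=1
// and the printed value is the -D one, overriding whatever the XML files say.
public class PrintReplication extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // ToolRunner/GenericOptionsParser have already merged -D options into this Configuration.
        Configuration conf = getConf();
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new PrintReplication(), args));
    }
}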

Map-Reduce Sample

First of all, the main program. MyWordCount extends Configured and implements Tool; Configured mainly helps Tool satisfy the Configurable interface:

interface Tool extends Configurable

class Configured implements Configurable

ToolRunner is usually used to run the program, and ToolRunner calls GenericOptionsParser internally, so your program can accept generic command-line options.

The difference from hadoop1 here is the package org.apache.hadoop.mapreduce; as I remember, in 1.0 it was org.apache.hadoop.mapred.

/* write by jinbao */
package com.jinbao.hadoop.mapred.unittest;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * @author cloudera
 */
public class MyWordCount extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        try {
            System.exit(ToolRunner.run(new MyWordCount(), args));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("usage: %s <input> <output> [generic options]\n",
                    getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        // Use the Configuration prepared by ToolRunner so -D options take effect.
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "word counting");
        job.setJarByClass(MyWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    /**
     * @author cloudera
     */
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private static IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}

MapReduce Web UI

MRv1: http://jobtracker-host:50030

MRv2: http://resourcemgr-host:8088/cluster

For application details, go to the job history server.

Unit Test - MRUnit

This is a toolkit specifically for unit-testing map-reduce code.

You need to download the dependencies:

1. JUnit: Eclipse already ships with it, and it is also available under Hadoop's lib directory.

2. Mockito: it is included in the package below.

3. PowerMock: download it from its project site.

4. MRUnit: get it from the Apache site.

Here is my program:

package com.jinbao.hadoop.mapred.unittest;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.*;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.powermock.core.classloader.annotations.PrepareForTest;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.apache.hadoop.mrunit.types.Pair;

public class MyWordCountTest {
    private MapDriver<Object, Text, Text, IntWritable> mapDriver;
    private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;
    private MapReduceDriver<Object, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;

    @Before
    public void setUp() {
        MyWordCount.TokenizerMapper mapper = new MyWordCount.TokenizerMapper();
        MyWordCount.SumReducer reducer = new MyWordCount.SumReducer();
        mapDriver = MapDriver.newMapDriver(mapper);
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
        mapReduceDriver = MapReduceDriver.newMapReduceDriver(mapper, reducer);
    }

    @Test
    public void testMapper() throws IOException {
        mapDriver.withInput(new LongWritable(), new Text("test input from unit test"));
        ArrayList<Pair<Text, IntWritable>> outputRecords = new ArrayList<Pair<Text, IntWritable>>();
        outputRecords.add(new Pair<Text, IntWritable>(new Text("test"), new IntWritable(1)));
        outputRecords.add(new Pair<Text, IntWritable>(new Text("input"), new IntWritable(1)));
        outputRecords.add(new Pair<Text, IntWritable>(new Text("from"), new IntWritable(1)));
        outputRecords.add(new Pair<Text, IntWritable>(new Text("unit"), new IntWritable(1)));
        outputRecords.add(new Pair<Text, IntWritable>(new Text("test"), new IntWritable(1)));
        mapDriver.withAllOutput(outputRecords);
        mapDriver.runTest();
    }

    @Test
    public void testReducer() throws IOException {
        reduceDriver.withInput(new Text("input"),
                new ArrayList<IntWritable>(Arrays.asList(new IntWritable(1), new IntWritable(3))));
        reduceDriver.withOutput(new Text("input"), new IntWritable(4));
        reduceDriver.runTest();
    }

    @Test
    public void testMapperReducer() throws IOException {
        mapReduceDriver.withInput(new LongWritable(), new Text("test input input input input input test"));
        ArrayList<Pair<Text, IntWritable>> outputRecords = new ArrayList<Pair<Text, IntWritable>>();
        outputRecords.add(new Pair<Text, IntWritable>(new Text("input"), new IntWritable(5)));
        outputRecords.add(new Pair<Text, IntWritable>(new Text("test"), new IntWritable(2)));
        mapReduceDriver.withAllOutput(outputRecords);
        mapReduceDriver.runTest();
    }
}

Run MRUnit

More than 90% of problems can be caught by simply running the @Test methods above; if they cannot, your unit test coverage is too low, and problems that only show up later on the cluster will be much more expensive to debug.

Run Locally

Configure a Debug Configuration in Eclipse, with the program arguments:

/home/cloudera/workspace/in /home/cloudera/workspace/out

Note: the local job runner works against local directories. By default, running through ToolRunner starts a standalone JVM to run Hadoop. In addition, there can only be 0 or 1 reducer. That is not a problem, since the whole point is convenient debugging.

For the local runner, mapreduce.framework.name must be set to local, but since that is the default you don't need to worry about it.
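A minimal sketch (added for illustration, not from the original article) of pinning a debug run to the local runner and the local filesystem from code, reusing the MyWordCount driver above; the class name DebugLocally and the explicit conf.set calls are only for the example:

package com.jinbao.hadoop.mapred.unittest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical helper class used only to illustrate a local debug run.
public class DebugLocally {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local"); // local job runner (the default anyway)
        conf.set("fs.defaultFS", "file:///");          // read and write the local filesystem
        int rc = ToolRunner.run(conf, new MyWordCount(),
                new String[] { "/home/cloudera/workspace/in", "/home/cloudera/workspace/out" });
        System.exit(rc);
    }
}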

Run in Cluster

Export the jar. I do it with Eclipse; you can also use ant, the command line, etc., depending on your preference.

If your jar has dependencies, put the dependency jars into a lib directory inside it and specify the main class in the manifest. Such a package is structured just like a war package.

% hadoop fs -copyFromLocal /home/cloudera/word.txt data/in

% hadoop jar wordcount.jar data/in data/out

IsolationRunner and Remote Debugger

Prerequisite: keep.failed.task.files. This option defaults to false, which means that for a failed task, the temporary data and working directory are not kept. It is a per-job setting, so add it when submitting the job.

How to rerun a failed task once its environment has been preserved:

1. Go to the tasktracker machine where the task failed.

2. Find the runtime directory of the failed task on that tasktracker. Every task has its own execution environment there, including its work directory, its intermediate files and the configuration files it needs to run. These directories are determined by the tasktracker configuration option mapred.local.dir, which may be a comma-separated list of paths; each path is a root under which the tasktracker creates working directories for the tasks it executes. For example, if mapred.local.dir=/disk1/mapred/local,/disk2/mapred/local, the task execution environment is ${mapred.local.dir}/taskTracker/jobcache/job-ID/task-attempt-ID.

3. Once you have found the task's working directory, go into it. It usually contains a work directory, a job.xml file, and the data file the task operates on (split.dta for map, file.out for reduce).

4. With the environment found, rerun the task:

cd work

hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml

IsolationRunner reads the configuration from job.xml (here job.xml plays the role of the hadoop-site.xml submitted by the client plus any -D command-line options) and reruns the map or reduce.

At this point the task can be rerun standalone, but single-step breakpoint debugging is still unsolved. What is used for that is the JVM's remote debug facility:

1. Before rerunning the task, export an environment variable: export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8888"

2. The hadoop command will then expose the debug connection on port 8888. Launch a remote debugging session in your local IDE (such as Eclipse), set a breakpoint in the code, and you can single-step the map or reduce task running standalone on the tasktracker. For details see this blog: http://blog.csdn.net/cwyspy/article/details/10004995

Note: unfortunately, in recent versions IsolationRunner can no longer be used, so on hadoop2 you need to find the failed node, copy the problem files, and debug them on a standalone machine.

Merge result set

Depending on the number of reducers, there can be several part result files, so you can merge and inspect them with the following commands:

% hadoop fs -getmerge max-temp max-temp-local

% sort max-temp-local | tail

Tuning a Job

Number of mappers

Number of reducers

Combiners

Intermediate compression

Custom serialization

Shuffle tweaks
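As an illustration of a few items in the list above (a sketch added here, not from the original article): the SumReducer from the word-count sample can double as a combiner because summing is associative, intermediate compression is a pair of configuration properties, and the number of reducers is a single setter. The class name TunedWordCount, the reducer count of 2 and the Snappy codec are all assumptions made for the example:

package com.jinbao.hadoop.mapred.unittest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical variant of the earlier word-count driver showing a few tuning knobs.
public class TunedWordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output to cut shuffle I/O (codec choice is an assumption).
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");

        Job job = Job.getInstance(conf, "word counting (tuned)");
        job.setJarByClass(TunedWordCount.class);
        job.setMapperClass(MyWordCount.TokenizerMapper.class);
        // The reducer doubles as a combiner because summation is associative and commutative.
        job.setCombinerClass(MyWordCount.SumReducer.class);
        job.setReducerClass(MyWordCount.SumReducer.class);
        job.setNumReduceTasks(2); // the "number of reducers" knob
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}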

MapReduce Workflows

As a rule of thumb, think about adding more jobs rather than adding complexity to a single job.

ChainMapper and ChainReducer

It is a Map*/Reduce model, which means multiple mappers run as a chain, and the output of the last mapper goes to the reducer. This reduces I/O compared with running the stages as separate jobs.

Although it is called 'ChainReducer', it is really just the single Reducer that serves the ChainMapper chain, hence the name.

Mapper1 -> Mapper2 -> ... -> MapperN -> Reducer
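A wiring sketch (added for illustration, not code from the original article), reusing TokenizerMapper and SumReducer from the earlier sample; ChainedWordCount and LowerCaseMapper are made-up names, and the calls follow the new-API ChainMapper/ChainReducer classes in org.apache.hadoop.mapreduce.lib.chain:

package com.jinbao.hadoop.mapred.unittest;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedWordCount {

    // Hypothetical second mapper in the chain: lower-cases each word before counting.
    public static class LowerCaseMapper extends Mapper<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void map(Text key, IntWritable value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(key.toString().toLowerCase()), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "chained word count");
        job.setJarByClass(ChainedWordCount.class);

        // Mapper1 -> Mapper2 -> Reducer, all inside one map-reduce job.
        ChainMapper.addMapper(job, MyWordCount.TokenizerMapper.class,
                LongWritable.class, Text.class, Text.class, IntWritable.class,
                new Configuration(false));
        ChainMapper.addMapper(job, LowerCaseMapper.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));
        // Despite the name, this sets the single Reducer that ends the chain.
        ChainReducer.setReducer(job, MyWordCount.SumReducer.class,
                Text.class, IntWritable.class, Text.class, IntWritable.class,
                new Configuration(false));

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}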

JobControl

MR has a class JobControl, but from my testing it is really not well maintained.

Simply put, the usage amounts to:

if (run(job1))

run(job2)
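That pattern, sketched in real code (an illustrative addition, not the article's own; job1 and job2 stand for two already-configured Job instances, and the class name TwoStepDriver is made up):

import org.apache.hadoop.mapreduce.Job;

public class TwoStepDriver {
    // job1 and job2 are assumed to be fully configured Job instances.
    static int runSequentially(Job job1, Job job2) throws Exception {
        // Only start the second job if the first one completed successfully.
        if (!job1.waitForCompletion(true)) {
            return 1;
        }
        return job2.waitForCompletion(true) ? 0 : 1;
    }
}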

Apache Oozie

Oozie is a Java Web application that runs in a Java servlet container (Tomcat) and uses a database to store the following:

Workflow definitions

Currently running workflow instances, including their state and variables

An Oozie workflow is a set of actions (for example, Hadoop Map/Reduce jobs, Pig jobs, and so on) arranged in a control-dependency DAG (Directed Acyclic Graph), which specifies the order in which the actions are executed. We use hPDL, an XML process definition language, to describe this graph.

hPDL is a concise language that uses only a small number of flow-control and action nodes. Control nodes define the flow to be executed, including the start and end points of the workflow (the start, end and fail nodes) and the mechanisms that control its execution path (the decision, fork and join nodes). Action nodes are the mechanism through which a workflow triggers the execution of a computation or processing task. Oozie provides support for the following types of actions: Hadoop map-reduce, Hadoop file system, Pig, Java and Oozie sub-workflows (SSH actions have been removed in versions later than Oozie schema 0.2).

This is the end of the article on "how to configure, test and debug Map-Reduce in Hadoop". I hope the above content has been of some help and that you have learned something from it. If you think the article is good, please share it for more people to see.
