How to build the basic template of MapReduce Program


This article focuses on how to build the basic template of MapReduce programs. The approach described here is simple, fast, and practical, so interested readers may wish to follow along.

What is a development dataset?

A popular development strategy is to take a small, sampled subset of the large dataset in the production environment, called a development dataset. This development dataset may be only a few hundred megabytes. When you write programs against it in standalone or pseudo-distributed mode, the development cycle is very short: it is convenient to run the program on your own machine, and you can debug it in an isolated environment.
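For example, before loading anything into HDFS, a development subset can be carved out of the full file with an ordinary command (assuming the raw citation file is named cite75-99.txt, as in the commands later in this article; the output file name is only illustrative):

head -n 100000 cite75-99.txt > cite75-99-dev.txt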

Why choose patent citation data for testing?

1. It is similar to most data types you are likely to encounter later.

2. The graphs formed by patent citations are structurally similar to web link graphs and social network graphs.

3. Patents are issued in chronological order, so some of their characteristics resemble time series.

4. Each patent is associated with a person (the inventor) and a location (the inventor's country), which you can treat as personal or geographic data.

5. You can view the data as an ordinary database relation with a well-defined schema, in a format simply separated by commas.

The standard adopted by the dataset

The dataset uses the standard comma-separated values (CSV) format.
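As an illustration, each line holds a citing patent and the patent it cites, separated by a comma (the patent numbers below are made up, shown only to convey the layout):

3858241,956203
3858241,1324234
3858242,1515701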

Build the basic template for MapReduce programs

Most MapReduce programs can be written as simple variations on a single template. When writing a new MapReduce program, we usually take an existing one and modify it until it does what we want.

A template for a typical Hadoop program

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

    public static class MapClass extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

        public void map(Text key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            // Invert the input pair: emit (cited, citing).
            output.collect(value, key);
        }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
            // Concatenate all values for a key into a comma-separated string.
            String csv = "";
            while (values.hasNext()) {
                if (csv.length() > 0) csv += ",";
                csv += values.next().toString();
            }
            output.collect(key, new Text(csv));
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, MyJob.class);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MyJob");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.set("key.value.separator.in.input.line", ",");

        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJob(), args);
        System.exit(res);
    }
}

1. We habitually define each MapReduce job completely in a single class, here called MyJob.

2. Hadoop requires the Mapper and Reducer to be their own (static) classes. These classes are quite small, and the template includes them as inner classes of MyJob; the advantage is that everything fits in one file, which simplifies code management.

3. Keep in mind, however, that these inner classes are independent and generally do not interact with the MyJob class.

4. During job execution, the Mapper and Reducer are replicated and run in separate JVMs on the various nodes, while the rest of the job class executes only on the client machine.

Explaining the run() method

1. The core of the skeleton is in the run() method, also known as the driver.

2. It instantiates and configures a JobConf object and passes it to JobClient.runJob() to start the MapReduce job (the JobClient class, in turn, communicates with the JobTracker to start the job on the cluster).

3. The JobConf object holds all the configuration parameters needed for the job to run.

4. The driver needs to customize the basic parameters for each job, including the input path, output path, Mapper class, and Reducer class.

5. Each job can reset default job properties, such as InputFormat and OutputFormat, and can call the set() method on the JobConf object to fill in any configuration parameter.

6. Once the JobConf object is passed to JobClient.runJob(), it is treated as the blueprint that determines how the job runs.

Some notes on configuring the driver

1. The JobConf object has many parameters, but we do not want to set all of them by hand in the driver; the configuration files of the Hadoop installation are a good starting point.

2. Users may want to change the job configuration by passing additional parameters on the command line when launching a job.

3. The driver can support this by defining its own set of commands and processing the user parameters itself.

4. Because such tasks are needed so often, the Hadoop framework provides ToolRunner, Tool, and Configured to simplify the implementation.

5. When they are used together, as in the MyJob skeleton above, these classes let the job understand the user-supplied options that GenericOptionsParser supports.

For example, the following command:

bin/hadoop jar playgroup/MyJob.jar MyJob input/cite75-99.txt output

If we run the job just to see the output of the mapper (for debugging purposes), we can set the number of reducers to 0 with the option -D mapred.reduce.tasks=0:

bin/hadoop jar playgroup/MyJob.jar MyJob -D mapred.reduce.tasks=0 input/cite75-99.txt output
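Because ToolRunner applies GenericOptionsParser before run() is called, a property supplied with -D is already present in the Configuration returned by getConf(). As a minimal sketch (the property name my.sample.threshold is made up and not part of the template), the driver could read such a value inside run():

// Inside MyJob.run(), after Configuration conf = getConf();
// a value passed on the command line as -D my.sample.threshold=10
// can be read back, with a default of 1 if it was not supplied:
int threshold = conf.getInt("my.sample.threshold", 1);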

The following options are supported automatically when ToolRunner is used with MyJob (an example combining several of them follows the table).

Options supported by GenericOptionsParser

Option      Description
-conf       Specify a configuration file
-D          Set a value for a JobConf property
-fs         Specify a NameNode; the value can be "local"
-jt         Specify a JobTracker
-files      Specify a comma-separated list of files to be used with the MapReduce job; these files are automatically distributed to all task nodes so that they can be read locally
-libjars    Specify a comma-separated list of jar files to be included in the classpath of all task JVMs
-archives   Specify a comma-separated list of archives to be unarchived on all task nodes
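For instance, several of these options can be combined on one command line (the myconf.xml and lookup.txt file names below are only illustrative):

bin/hadoop jar playgroup/MyJob.jar MyJob -conf myconf.xml -D mapred.reduce.tasks=2 -files lookup.txt input/cite75-99.txt output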

Template code: Mapper and Reducer

In the template, the Mapper class is conventionally named MapClass and the Reducer class is named Reduce.

Both the Mapper and the Reducer extend MapReduceBase.

MapReduceBase is a small class that provides configure() and close(), which we use for setting up and cleaning up map (or reduce) tasks. Except for fairly advanced jobs, we usually do not need to override them.
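As a minimal sketch (not part of the original template, reusing the imports of the MyJob listing above; the class name MapWithSetup and the property name myjob.separator are made up), a Mapper that does override configure() and close() might look like this:

public static class MapWithSetup extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

    private String separator;

    // configure() runs once per task, before any call to map();
    // read per-job settings from the JobConf here.
    public void configure(JobConf job) {
        separator = job.get("myjob.separator", ",");
    }

    public void map(Text key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        // Join key and value with the configured separator.
        output.collect(key, new Text(key.toString() + separator + value.toString()));
    }

    // close() runs once after the last call to map(); release resources here.
    public void close() throws IOException {
    }
}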

Skeletons of the Mapper class and the Reducer class

public static class MapClass extends MapReduceBase
    implements Mapper<K1, V1, K2, V2> {

    public void map(K1 key, V1 value,
                    OutputCollector<K2, V2> output,
                    Reporter reporter) throws IOException { }
}

public static class Reduce extends MapReduceBase
    implements Reducer<K2, V2, K3, V3> {

    public void reduce(K2 key, Iterator<V2> values,
                       OutputCollector<K3, V3> output,
                       Reporter reporter) throws IOException { }
}

The core operation of the Mapper class is the map() method, and of the Reducer class the reduce() method. Each invocation of map() is given a key/value pair of types K1 and V1, respectively. The key/value pairs generated by the mapper are emitted through the collect() method of the OutputCollector object, which you call at the appropriate place inside map():

output.collect((K2) k, (V2) v);

Each call to the reduce() method in the Reducer is given a key of type K2 and an iterator over values of type V2. Note that these must be the same K2 and V2 types used in the Mapper. The reduce() method typically loops over all the values of type V2:

while (values.hasNext()) {
    V2 v = values.next();
}

The reduce() method also uses an OutputCollector to collect its key/value output, of types K3 and V3. Inside reduce() you call:

output.collect((K3) k, (V3) v);

In addition to keeping K2 and V2 consistent between the Mapper and the Reducer, you also need to ensure that the key and value types used in the Mapper and Reducer agree with the input format, output key class, and output value class set in the driver.

Using KeyValueTextInputFormat means that both K1 and V1 must be of Text type

The driver must call setOutputKeyClass() and setOutputValueClass() to specify the classes of K2 and V2, respectively.

Finally:

1. All key and value types must be subtypes of Writable, which ensures that Hadoop's serialization mechanism can move the data across the distributed cluster.

2. Key types must also implement WritableComparable, a subinterface of Writable that adds the compareTo() method, because the MapReduce framework sorts records by key. (A short sketch of such a key type follows.)
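For illustration only (this class is not part of the template; the CitationKey name and its fields are made up), a custom key type satisfying both requirements could look like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A custom key holding a citing/cited patent pair.
public class CitationKey implements WritableComparable<CitationKey> {
    private int citing;
    private int cited;

    // write() and readFields() let Hadoop serialize the key
    // when it ships data across the cluster.
    public void write(DataOutput out) throws IOException {
        out.writeInt(citing);
        out.writeInt(cited);
    }

    public void readFields(DataInput in) throws IOException {
        citing = in.readInt();
        cited = in.readInt();
    }

    // compareTo() is what the MapReduce framework uses to sort keys.
    public int compareTo(CitationKey other) {
        if (citing != other.citing) {
            return citing < other.citing ? -1 : 1;
        }
        if (cited != other.cited) {
            return cited < other.cited ? -1 : 1;
        }
        return 0;
    }
}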

At this point, I believe you have a deeper understanding of how to build the basic template of MapReduce programs. You might as well try it out in practice.
