
How to use a MapReduce program to complete wordcount


MapReduce Overview:

MapReduce adopts the idea of divide and conquer: under the management of a master node, the processing of a large-scale dataset is distributed across the worker nodes, and the intermediate results from each node are then combined into the final result. Put simply, MapReduce is "task decomposition plus result aggregation".

In Hadoop, two machine roles perform MapReduce tasks: the JobTracker and the TaskTracker. The JobTracker schedules work and the TaskTrackers execute it. There is only one JobTracker in a Hadoop cluster.

In distributed computing, the MapReduce framework takes care of the complex problems of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance and network communication. The processing is highly abstracted into two functions: map decomposes a task into multiple sub-tasks, and reduce aggregates the results of the decomposed sub-tasks.
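To make the two-function model concrete, here is a minimal plain-Java sketch (an illustration only, not the Hadoop API; the class and variable names are made up for this example): map emits a (word, 1) pair per word, a grouping step stands in for the framework's shuffle, and reduce sums each group.

import java.util.*;

public class MapReduceSketch {
    public static void main(String[] args) {
        // two input "lines", in the spirit of the test data below
        List<String> lines = Arrays.asList("Dear River", "Dear Spark");

        // "map": decompose each line into (word, 1) pairs
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // "shuffle": group intermediate values by key and sort the keys
        // (in Hadoop the framework does this between map and reduce)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // "reduce": sum the list of counts for each key
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int count : entry.getValue()) {
                sum += count;
            }
            System.out.println(entry.getKey() + "\t" + sum); // e.g. "Dear 2"
        }
    }
}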

Note that a dataset (or task) processed with MapReduce must be decomposable into many small datasets, each of which can be processed fully in parallel.

The test text data used by the program:

Dear River
Dear River Bear Spark Car
Dear Car Bear Car
Dear Car River Car Spark Spark
Dear Spark

1 Write the main classes

(1) Mapper class

First, the code of the custom Mapper class:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // value represents one line of text, e.g. "Dear Bear River"
        String[] words = value.toString().split("\t");
        for (String word : words) {
            // each occurrence of a word counts once; emit it as an intermediate result
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

The Mapper class is a generic type with four formal type parameters that specify the input key, input value, output key and output value types of the map() function. Here they are LongWritable (input key), Text (input value), Text (output key) and IntWritable (output value).

String[] words = value.toString().split("\t"); for the line "Dear River Bear River", words becomes ["Dear", "River", "Bear", "River"].

The input key is a long integer byte offset, used to locate a line of data and the next one; the input value is one line of text such as "Dear River Bear River"; the output key is a word such as "Bear"; and the output value is the integer 1. For example, the key of the first line is 0, and since "Dear River\n" occupies 11 bytes, the key of the second line is 11.

Hadoop provides its own set of basic types, optimized for network serialization, rather than using Java's built-in types directly; they all live in the org.apache.hadoop.io package. This program uses the LongWritable type (equivalent to Java's Long), the Text type (equivalent to Java's String) and the IntWritable type (equivalent to Java's Integer).
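As a small illustration of these wrapper types, here is a sketch using only standard org.apache.hadoop.io classes; the values are arbitrary:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        Text word = new Text("Dear");               // serializable counterpart of String
        IntWritable one = new IntWritable(1);       // counterpart of Integer
        LongWritable offset = new LongWritable(0L); // counterpart of Long

        // unwrap back to plain Java values
        System.out.println(word.toString() + " -> " + one.get() + " (offset " + offset.get() + ")");
    }
}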

The parameters of the map() method are the input key and the input value. In this program the input key (LongWritable key) is an offset and the input value (Text value) is a line such as "Dear Car Bear Car"; we first convert the Text value holding one line of input to a Java String, and then use split() to extract the words we are interested in. The map() method also provides a Context instance, used to write the output.

(2) Reducer class

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    /*
     * input, e.g.: (River, 1) (Spark, 1) (Spark, 1) ...
     * grouped by the framework into:
     *   key: River  value: List(1, 1, 1)
     *   key: Spark  value: List(1, 1, 1, 1)
     */
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum)); // write the final result
    }
}

The reduce task first fetches its data from the map outputs according to the partition number, for example:

(River, 1)

(River, 1)

(River, 1)

(Spark, 1)

(Spark, 1)

(Spark, 1)

(Spark, 1)

After processing, the results are as follows:

key: River value: List(1, 1, 1)

key: Spark value: List(1, 1, 1, 1)

So the values received by the reduce() function's formal parameter Iterable<IntWritable> values are List(1, 1, 1) and List(1, 1, 1, 1).

(3) Main function

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class WordCountMain {
    // to run the MR program locally from IDEA, set mapreduce.framework.name to "local" in mapred-site.xml
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        if (args == null || args.length != 2) {
            System.out.println("please input Path!");
            System.exit(0);
        }

        // System.setProperty("HADOOP_USER_NAME", "hadoop2.7");
        Configuration configuration = new Configuration();
        // configuration.set("mapreduce.job.jar", "/home/bruce/project/kkbhdp01/target/com.kaikeba.hadoop-1.0-SNAPSHOT.jar");

        // call getInstance to create the Job instance
        Job job = Job.getInstance(configuration, WordCountMain.class.getSimpleName());
        // the jar that contains this class
        job.setJarByClass(WordCountMain.class);

        // set the input/output formats; the default MR input format is TextInputFormat,
        // so the next two lines can stay commented out
        // job.setInputFormatClass(TextInputFormat.class);
        // job.setOutputFormatClass(TextOutputFormat.class);

        // set the input/output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // set the classes that handle the Map/Reduce phases
        job.setMapperClass(WordCountMap.class);
        // a combiner runs between map and reduce to cut map-side network output
        job.setCombinerClass(WordCountReduce.class);
        job.setReducerClass(WordCountReduce.class);

        // if the map and reduce output kv types are the same, setting the reduce output
        // types is enough; if they differ, also set the map output kv types:
        // job.setMapOutputKeyClass(Text.class);
        // job.setMapOutputValueClass(IntWritable.class);

        // set the types of the reduce task's final output key/value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // submit the job
        job.waitForCompletion(true);
    }
}

2 Run locally

First change the mapred-site.xml file configuration

Set the value of mapreduce.framework.name to local
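A minimal sketch of the relevant mapred-site.xml entry, assuming the standard Hadoop configuration file layout:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>local</value> <!-- use "yarn" for cluster mode, as in section 3 -->
  </property>
</configuration>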

Then run the program locally and check the results.

3 Cluster operation

Method 1:

First, package the project into a jar.

Change mapreduce.framework.name in the configuration file to yarn.

Add a local jar package location:

Configuration configuration = new Configuration();
configuration.set("mapreduce.job.jar", "C:\\Users\\tanglei1\\IdeaProjects\\Hadooptang\\target");

Set the property that allows cross-platform remote submission:

configuration.set("mapreduce.app-submission.cross-platform", "true");
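Put together, a Configuration for submitting from IDEA to a remote YARN cluster might look like the sketch below. Only mapreduce.job.jar and mapreduce.app-submission.cross-platform come from this article; fs.defaultFS and yarn.resourcemanager.hostname are standard Hadoop properties whose values here (the hostname node01, the port, the jar file name) are hypothetical placeholders for your own cluster.

Configuration configuration = new Configuration();
// cluster-location properties; "node01" is a placeholder hostname
configuration.set("fs.defaultFS", "hdfs://node01:9000");
configuration.set("mapreduce.framework.name", "yarn");
configuration.set("yarn.resourcemanager.hostname", "node01");
// the two settings used in this article (the jar file name is a placeholder)
configuration.set("mapreduce.job.jar", "C:\\Users\\tanglei1\\IdeaProjects\\Hadooptang\\target\\com.kaikeba.hadoop-1.0-SNAPSHOT.jar");
configuration.set("mapreduce.app-submission.cross-platform", "true");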

Then modify the input parameters and check the running result.

Method 2:

Package the Maven project and run the MR program on the server side with the following command:

hadoop jar com.kaikeba.hadoop-1.0-SNAPSHOT.jar com.kaikeba.hadoop.wordcount.WordCountMain /tttt.txt /wordcount11
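Assuming the job wrote its output to /wordcount11 as in the command above, the counts can then be inspected with a standard HDFS command (part-r-00000 is the default name of the first reducer's output file):

hdfs dfs -cat /wordcount11/part-r-00000
# expected counts for the test data above:
# Bear    2
# Car     5
# Dear    6... (actual numbers depend on your input file); for the five test lines shown
# earlier they would be Bear 2, Car 5, Dear 5, River 3, Spark 4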
