How to implement an inverted index based on MR programs in Hadoop programming

2025-03-29 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article explains, in detail, how to implement an inverted index with MapReduce (MR) programs in Hadoop. It walks through the input and output data, the two-job design, and the full code, and should be a useful reference for interested readers.

I. Data preparation

1. Input file data

Here we prepare three input files with the following contents:

A.txt

Hello tom hello jerry hello tom

B.txt

Hello jerry hello jerry tom jerry

C.txt

Hello jerry hello tom

2. Final output file data

The result of the final output file is:

hello	c.txt-->2 b.txt-->2 a.txt-->3
jerry	c.txt-->1 b.txt-->3 a.txt-->1
tom	c.txt-->1 b.txt-->1 a.txt-->2

II. Analysis of the inverted indexing process

Given the input files and the desired final output, this program needs two MR jobs. The process can be summarized as follows:

- The step-1 Mapper emits one record per word occurrence, keyed by word and file name, for example:
  context.write("hello-->a.txt", "1") (three times, once per occurrence in a.txt)
  context.write("hello-->b.txt", "1") (twice)
  context.write("hello-->c.txt", "1") (twice)
- The step-1 Reducer receives each key with its list of counts, e.g. ("hello-->a.txt", {1, 1, 1}), and outputs the summed count:
  context.write("hello-->a.txt", "3")
  context.write("hello-->b.txt", "2")
  context.write("hello-->c.txt", "2")
- The step-2 Mapper reads the step-1 output lines:
  hello-->a.txt	3
  hello-->b.txt	2
  hello-->c.txt	2
  and re-keys them by word:
  context.write("hello", "a.txt-->3")
  context.write("hello", "b.txt-->2")
  context.write("hello", "c.txt-->2")
- The step-2 Reducer receives each word with all of its postings, e.g. ("hello", {"a.txt-->3", "b.txt-->2", "c.txt-->2"}), and concatenates the values:
  context.write("hello", "a.txt-->3 b.txt-->2 c.txt-->2")

The final result for "hello" is therefore: hello	a.txt-->3 b.txt-->2 c.txt-->2, and likewise for "jerry" and "tom".
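To sanity-check the two-pass logic above before running it on a cluster, here is a minimal plain-Java sketch (not part of the article's Hadoop code; all names are hypothetical). Pass 1 builds the (word-->file, count) pairs that the step-1 job produces, and pass 2 regroups them by word into postings. Note that because this sketch uses a sorted map, postings come out in file-name order (a.txt first), whereas the article's actual Hadoop output happens to arrive in a different shuffle order.

```java
import java.util.Map;
import java.util.TreeMap;

// Local, single-process sketch of the two MR passes described above.
public class InvertedIndexSketch {

    public static Map<String, String> build(Map<String, String> files) {
        // Pass 1: key "word-->file" -> count (what StepOneMapper/StepOneReducer produce)
        Map<String, Long> stepOne = new TreeMap<>();
        for (Map.Entry<String, String> e : files.entrySet()) {
            for (String word : e.getValue().split(" ")) {
                stepOne.merge(word + "-->" + e.getKey(), 1L, Long::sum);
            }
        }
        // Pass 2: regroup by word, concatenating "file-->count" postings
        // (what StepTwoMapper/StepTwoReducer produce)
        Map<String, String> stepTwo = new TreeMap<>();
        for (Map.Entry<String, Long> e : stepOne.entrySet()) {
            String[] wordAndFile = e.getKey().split("-->");
            String posting = wordAndFile[1] + "-->" + e.getValue();
            stepTwo.merge(wordAndFile[0], posting, (a, b) -> a + " " + b);
        }
        return stepTwo;
    }

    public static void main(String[] args) {
        Map<String, String> files = new TreeMap<>();
        files.put("a.txt", "hello tom hello jerry hello tom");
        files.put("b.txt", "hello jerry hello jerry tom jerry");
        files.put("c.txt", "hello jerry hello tom");
        build(files).forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

Running this prints one line per word with its per-file counts, matching the counts in the expected final output above.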

III. Program development

3.1 The first-step MR program and its input/output

package com.lyz.hdfs.mr.ii;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * The first-step Map Reduce program of the inverted index; this program puts
 * all Map/Reduce/Runner code in one class.
 * @author liuyazhuang
 */
public class InverseIndexStepOne {

    /**
     * The mapper program that completes the first step of the inverted index.
     * @author liuyazhuang
     */
    public static class StepOneMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Get one line of data
            String line = value.toString();
            // Cut out each word
            String[] fields = StringUtils.split(line, " ");
            // Get the split (slice) information for this record
            FileSplit fileSplit = (FileSplit) context.getInputSplit();
            // Get the file name
            String fileName = fileSplit.getPath().getName();
            for (String field : fields) {
                context.write(new Text(field + "-->" + fileName), new LongWritable(1));
            }
        }
    }

    /**
     * The Reducer program that completes the first step of the inverted index.
     * The final output is:
     * hello-->a.txt 3  hello-->b.txt 2  hello-->c.txt 2
     * jerry-->a.txt 1  jerry-->b.txt 3  jerry-->c.txt 1
     * tom-->a.txt 2    tom-->b.txt 1    tom-->c.txt 1
     * @author liuyazhuang
     */
    public static class StepOneReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long counter = 0;
            for (LongWritable value : values) {
                counter += value.get();
            }
            context.write(key, new LongWritable(counter));
        }
    }

    // Run the first MR job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InverseIndexStepOne.class);
        job.setMapperClass(StepOneMapper.class);
        job.setReducerClass(StepOneReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("D:/hadoop_data/ii"));
        FileOutputFormat.setOutputPath(job, new Path("D:/hadoop_data/ii/result"));
        job.waitForCompletion(true);
    }
}

3.1.1 input data

A.txt

Hello tom hello jerry hello tom

B.txt

Hello jerry hello jerry tom jerry

C.txt

Hello jerry hello tom

3.1.2 Output result

hello-->a.txt	3
hello-->b.txt	2
hello-->c.txt	2
jerry-->a.txt	1
jerry-->b.txt	3
jerry-->c.txt	1
tom-->a.txt	2
tom-->b.txt	1
tom-->c.txt	1

3.2 The second-step MR program and its input/output

package com.lyz.hdfs.mr.ii;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * The second-step Map Reduce program of the inverted index; this program puts
 * all Map/Reduce/Runner code in one class.
 * @author liuyazhuang
 */
public class InverseIndexStepTwo {

    /**
     * The mapper program that completes the second step of the inverted index.
     * Its input is the output of the first-step MR program:
     * hello-->a.txt 3  hello-->b.txt 2  hello-->c.txt 2
     * jerry-->a.txt 1  jerry-->b.txt 3  jerry-->c.txt 1
     * tom-->a.txt 2    tom-->b.txt 1   tom-->c.txt 1
     * @author liuyazhuang
     */
    public static class StepTwoMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] fields = StringUtils.split(line, "\t");
            String[] wordAndFileName = StringUtils.split(fields[0], "-->");
            String word = wordAndFileName[0];
            String fileName = wordAndFileName[1];
            long counter = Long.parseLong(fields[1]);
            context.write(new Text(word), new Text(fileName + "-->" + counter));
        }
    }

    /**
     * The Reducer program that completes the second step of the inverted index.
     * Its input is (word, {"file-->count", ...}); the final output is:
     * hello  c.txt-->2 b.txt-->2 a.txt-->3
     * jerry  c.txt-->1 b.txt-->3 a.txt-->1
     * tom    c.txt-->1 b.txt-->1 a.txt-->2
     * @author liuyazhuang
     */
    public static class StepTwoReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String result = "";
            for (Text value : values) {
                result += value + " ";
            }
            context.write(key, new Text(result));
        }
    }

    // Run the second MR job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InverseIndexStepTwo.class);
        job.setMapperClass(StepTwoMapper.class);
        job.setReducerClass(StepTwoReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("D:/hadoop_data/ii/result/part-r-00000"));
        FileOutputFormat.setOutputPath(job, new Path("D:/hadoop_data/ii/result/final"));
        job.waitForCompletion(true);
    }
}
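As a small stand-alone check of the line parsing that StepTwoMapper performs, the hypothetical snippet below (not from the article) splits a step-one output line into the word key and the "file-->count" posting the mapper emits. It uses plain String.split here, whereas the article's code uses Commons Lang StringUtils.split; for these inputs both yield the same pieces.

```java
// Demo of the StepTwoMapper parsing logic on one step-one output line.
public class StepTwoParseDemo {

    // Splits a line such as "hello-->a.txt\t3" into
    // { word, "file-->count" }, mirroring StepTwoMapper.
    public static String[] parse(String line) {
        String[] fields = line.split("\t");                 // ["hello-->a.txt", "3"]
        String[] wordAndFileName = fields[0].split("-->");  // ["hello", "a.txt"]
        return new String[] { wordAndFileName[0], wordAndFileName[1] + "-->" + fields[1] };
    }

    public static void main(String[] args) {
        String[] kv = parse("hello-->a.txt\t3");
        System.out.println(kv[0] + " => " + kv[1]); // prints: hello => a.txt-->3
    }
}
```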

3.2.1 input data

hello-->a.txt	3
hello-->b.txt	2
hello-->c.txt	2
jerry-->a.txt	1
jerry-->b.txt	3
jerry-->c.txt	1
tom-->a.txt	2
tom-->b.txt	1
tom-->c.txt	1

3.2.2 Output result

hello	c.txt-->2 b.txt-->2 a.txt-->3
jerry	c.txt-->1 b.txt-->3 a.txt-->1
tom	c.txt-->1 b.txt-->1 a.txt-->2

The above is the full content of "How to implement an inverted index based on MR programs in Hadoop programming". Thank you for reading! We hope the shared content helps you; for more related knowledge, welcome to follow the industry information channel.
