2025-03-29 Update From: SLTechnology News&Howtos
This article introduces how to implement an inverted index in Hadoop using MapReduce (MR) programs. The walkthrough is quite detailed and has some reference value; interested readers are encouraged to read it through.
I. Data preparation
1. Input file data
Here we prepare three input files with the following contents:

a.txt
hello tom hello jerry hello tom

b.txt
hello jerry hello jerry tom jerry

c.txt
hello jerry hello tom
2. Final output file data

The final output file should contain:

hello	c.txt-->2 b.txt-->2 a.txt-->3
jerry	c.txt-->1 b.txt-->3 a.txt-->1
tom	c.txt-->1 b.txt-->1 a.txt-->2
II. Analysis of the inverted indexing process
Given the input files and the desired final output, this program needs two chained MR jobs. The process can be summarized as follows:
- First-step Mapper output format (one record per word occurrence), e.g. for "hello":
  context.write("hello-->a.txt", "1")  (three times, once per occurrence)
  context.write("hello-->b.txt", "1")  (twice)
  context.write("hello-->c.txt", "1")  (twice)
- First-step Reducer input: each "word-->file" key with its list of "1" values, e.g. ("hello-->a.txt", ["1", "1", "1"]).
- First-step Reducer output format:
  context.write("hello-->a.txt", "3")
  context.write("hello-->b.txt", "2")
  context.write("hello-->c.txt", "2")
- Second-step Mapper input: the first step's output lines, e.g. "hello-->a.txt	3".
- Second-step Mapper output format:
  context.write("hello", "a.txt-->3")
  context.write("hello", "b.txt-->2")
  context.write("hello", "c.txt-->2")
- Second-step Reducer input: each word with its list of "file-->count" values, e.g. ("hello", ["a.txt-->3", "b.txt-->2", "c.txt-->2"]).
- Second-step Reducer output format:
  context.write("hello", "a.txt-->3 b.txt-->2 c.txt-->2")

The final result for "hello" is: hello	a.txt-->3 b.txt-->2 c.txt-->2
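The format transition between the two jobs can be sketched in plain Java, with no Hadoop involved. This is just an illustration of the string handling; the class and method names (StepFormatDemo, toStepTwoPair) are hypothetical, not part of the original program.

```java
public class StepFormatDemo {

    // Turn one line of step-one output ("hello-->a.txt<TAB>3")
    // into the step-two mapper's key/value pair ("hello", "a.txt-->3").
    public static String[] toStepTwoPair(String line) {
        String[] fields = line.split("\t");            // "hello-->a.txt", "3"
        String[] wordAndFile = fields[0].split("-->"); // "hello", "a.txt"
        return new String[] { wordAndFile[0], wordAndFile[1] + "-->" + fields[1] };
    }

    public static void main(String[] args) {
        String[] pair = toStepTwoPair("hello-->a.txt\t3");
        System.out.println(pair[0] + " / " + pair[1]); // prints: hello / a.txt-->3
    }
}
```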
III. Program development
3.1 The first-step MR program and its input/output
package com.lyz.hdfs.mr.ii;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * The first-step Map Reduce program of the inverted index.
 * The Mapper, Reducer and Runner are all kept in one class.
 * @author liuyazhuang
 */
public class InverseIndexStepOne {

    /**
     * Mapper that completes the first step of the inverted index.
     */
    public static class StepOneMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Get one line of data
            String line = value.toString();
            // Cut out each word
            String[] fields = StringUtils.split(line, " ");
            // Get the split information of the data
            FileSplit fileSplit = (FileSplit) context.getInputSplit();
            // Get the file name
            String fileName = fileSplit.getPath().getName();
            for (String field : fields) {
                context.write(new Text(field + "-->" + fileName), new LongWritable(1));
            }
        }
    }

    /**
     * Reducer that completes the first step of the inverted index.
     * The final output is:
     * hello-->a.txt 3    hello-->b.txt 2    hello-->c.txt 2
     * jerry-->a.txt 1    jerry-->b.txt 3    jerry-->c.txt 1
     * tom-->a.txt 2      tom-->b.txt 1      tom-->c.txt 1
     * @author liuyazhuang
     */
    public static class StepOneReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long counter = 0;
            for (LongWritable value : values) {
                counter += value.get();
            }
            context.write(key, new LongWritable(counter));
        }
    }

    // Run the first MR job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InverseIndexStepOne.class);
        job.setMapperClass(StepOneMapper.class);
        job.setReducerClass(StepOneReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("D:/hadoop_data/ii"));
        FileOutputFormat.setOutputPath(job, new Path("D:/hadoop_data/ii/result"));
        job.waitForCompletion(true);
    }
}
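As a sanity check, the step-one logic (emit "word-->file" with a count of 1, then sum per key) can be simulated in plain Java on in-memory strings, with no Hadoop cluster. The class and method names below (StepOneSim, invertedCounts) are hypothetical; only the key shape "word-->file" mirrors StepOneMapper.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StepOneSim {

    // Simulates StepOneMapper (emit "word-->file", 1) plus StepOneReducer (sum per key).
    public static Map<String, Long> invertedCounts(Map<String, String> files) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String word : file.getValue().split(" ")) {
                // Same key shape as the mapper emits; merge does the reducer's summing
                counts.merge(word + "-->" + file.getKey(), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "hello tom hello jerry hello tom");
        files.put("b.txt", "hello jerry hello jerry tom jerry");
        files.put("c.txt", "hello jerry hello tom");
        Map<String, Long> counts = invertedCounts(files);
        System.out.println(counts.get("hello-->a.txt")); // prints 3
        System.out.println(counts.get("tom-->a.txt"));   // prints 2
    }
}
```

Running this against the three sample files reproduces the counts listed in the reducer's javadoc above.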
3.1.1 input data
a.txt
hello tom hello jerry hello tom

b.txt
hello jerry hello jerry tom jerry

c.txt
hello jerry hello tom
3.1.2 Output result

hello-->a.txt	3
hello-->b.txt	2
hello-->c.txt	2
jerry-->a.txt	1
jerry-->b.txt	3
jerry-->c.txt	1
tom-->a.txt	2
tom-->b.txt	1
tom-->c.txt	1
3.2 The second-step MR program and its input/output
package com.lyz.hdfs.mr.ii;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * The second-step Map Reduce program of the inverted index.
 * The Mapper, Reducer and Runner are all kept in one class.
 * @author liuyazhuang
 */
public class InverseIndexStepTwo {

    /**
     * Mapper that completes the second step of the inverted index.
     * The input from the first MR program is:
     * hello-->a.txt 3    hello-->b.txt 2    hello-->c.txt 2
     * jerry-->a.txt 1    jerry-->b.txt 3    jerry-->c.txt 1
     * tom-->a.txt 2      tom-->b.txt 1      tom-->c.txt 1
     * @author liuyazhuang
     */
    public static class StepTwoMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] fields = StringUtils.split(line, "\t");
            String[] wordAndFileName = StringUtils.split(fields[0], "-->");
            String word = wordAndFileName[0];
            String fileName = wordAndFileName[1];
            long counter = Long.parseLong(fields[1]);
            context.write(new Text(word), new Text(fileName + "-->" + counter));
        }
    }

    /**
     * Reducer that completes the second step of the inverted index.
     * The final output is:
     * hello  c.txt-->2 b.txt-->2 a.txt-->3
     * jerry  c.txt-->1 b.txt-->3 a.txt-->1
     * tom    c.txt-->1 b.txt-->1 a.txt-->2
     * @author liuyazhuang
     */
    public static class StepTwoReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String result = "";
            for (Text value : values) {
                result += value + " ";
            }
            context.write(key, new Text(result));
        }
    }

    // Run the second MR job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InverseIndexStepTwo.class);
        job.setMapperClass(StepTwoMapper.class);
        job.setReducerClass(StepTwoReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("D:/hadoop_data/ii/result/part-r-00000"));
        FileOutputFormat.setOutputPath(job, new Path("D:/hadoop_data/ii/result/final"));
        job.waitForCompletion(true);
    }
}
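The step-two logic (split each "word-->file<TAB>count" line, regroup by word, concatenate the "file-->count" values) can likewise be checked in plain Java without Hadoop. StepTwoSim and group are hypothetical names; the string handling mirrors StepTwoMapper and StepTwoReducer, including the reducer's trailing space after each value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StepTwoSim {

    // Input lines look like step one's output: "hello-->a.txt\t3".
    // Maps each word to its concatenated "file-->count " list,
    // mirroring what StepTwoReducer writes.
    public static Map<String, String> group(String[] lines) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");            // key part, count part
            String[] wordAndFile = fields[0].split("-->"); // word, file name
            String value = wordAndFile[1] + "-->" + fields[1] + " ";
            result.merge(wordAndFile[0], value, String::concat);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {
            "hello-->a.txt\t3", "hello-->b.txt\t2", "hello-->c.txt\t2",
            "jerry-->a.txt\t1", "jerry-->b.txt\t3", "jerry-->c.txt\t1",
        };
        // Prints "a.txt-->3 b.txt-->2 c.txt-->2 " (trailing space, as in the reducer)
        System.out.println(group(lines).get("hello"));
    }
}
```

Note that in a real run the value order inside a reduce group is not guaranteed, which is why the cluster output lists the files in a different order (c.txt first) than this in-memory sketch.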
3.2.1 input data
hello-->a.txt	3
hello-->b.txt	2
hello-->c.txt	2
jerry-->a.txt	1
jerry-->b.txt	3
jerry-->c.txt	1
tom-->a.txt	2
tom-->b.txt	1
tom-->c.txt	1
3.2.2 Output result

hello	c.txt-->2 b.txt-->2 a.txt-->3
jerry	c.txt-->1 b.txt-->3 a.txt-->1
tom	c.txt-->1 b.txt-->1 a.txt-->2

The above is the full content of "how Hadoop programming implements an inverted index based on MR programs". Thank you for reading! I hope the shared content is helpful; for more related knowledge, welcome to follow the industry information channel.