This article explains how to make Nutch write its output directly to HDFS files. The approach is quite practical, so it is shared here as a reference; follow along and have a look.
Taking Nutch 1.7 as an example: out of the box, Nutch's indexing output can be sent to other storage systems through indexing plugins; the details of that mechanism are not covered here.
The requirement in my project is that the results stay as files on HDFS instead of being indexed into another storage system.
In other words, there is no need to write an indexing plugin of the form:

public class XXX implements IndexWriter
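For contrast, such a plugin would roughly take the shape sketched below. This is a hypothetical skeleton only: the class and field names are made up, and the method names follow my recollection of the 1.x IndexWriter interface, so they should be verified against the Nutch 1.7 sources before use.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.indexer.IndexWriter;
import org.apache.nutch.indexer.NutchDocument;

// Hypothetical sketch of the kind of plugin this article deliberately avoids writing.
public class HdfsFileIndexWriter implements IndexWriter {

  private Configuration conf;

  public void open(JobConf job, String name) throws IOException {
    // open an HDFS output stream here
  }

  public void write(NutchDocument doc) throws IOException {
    // serialize the document's fields and append them to the HDFS file
  }

  public void delete(String key) throws IOException { /* not needed for plain files */ }

  public void update(NutchDocument doc) throws IOException { /* treat as write */ }

  public void commit() throws IOException { /* flush the output stream */ }

  public void close() throws IOException { }

  public String describe() {
    return "HdfsFileIndexWriter (sketch): writes documents to plain HDFS files";
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

Instead of building and registering a plugin like this, the rest of the article modifies the Nutch sources directly.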
So the question is: how do we modify the Nutch source code so that the results can be stored directly in HDFS?
Let's work through the source step by step and solve each problem as we run into it.
First of all, Crawl.java contains the following code for the indexing phase:
if (i > 0) {
  linkDbTool.invert(linkDb, segments, true, true, false); // invert links

  if (solrUrl != null) {
    // index, dedup & merge
    FileStatus[] fstats = fs.listStatus(segments, HadoopFSUtil.getPassDirectoriesFilter(fs));

    IndexingJob indexer = new IndexingJob(getConf());
    indexer.index(crawlDb, linkDb, Arrays.asList(HadoopFSUtil.getPaths(fstats)));

    SolrDeleteDuplicates dedup = new SolrDeleteDuplicates();
    dedup.setConf(getConf());
    dedup.dedup(solrUrl);
  }
}
The most important part is:

IndexingJob indexer = new IndexingJob(getConf());
indexer.index(crawlDb, linkDb, Arrays.asList(HadoopFSUtil.getPaths(fstats)));

In other words, this is the entry point of the indexing step.
We block this code here. My personal approach is to change if (solrUrl != null) { into if (false) {.
This keeps the original code in place, so it can be restored if the new code later turns out to have problems.
What's next? Add our own indexing job code, as follows:
if (true) { // add my index job
  // index, dedup & merge
  FileStatus[] fstats = fs.listStatus(segments, HadoopFSUtil.getPassDirectoriesFilter(fs));

  IndexingJob indexer = new IndexingJob(getConf());
  indexer.index(crawlDb, linkDb, Arrays.asList(HadoopFSUtil.getPaths(fstats)), true, false, null);
}
With this, the changes around the indexing job are complete; only the surface has been touched, nothing structural has been broken.
Now let's start modifying the internals.
First of all, we have to find the MapReduce job itself. Where is its entry point?
IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job);
Stepping into this call, you can see the concrete MR class, as follows:
job.setMapperClass(IndexerMapReduce.class);
job.setReducerClass(IndexerMapReduce.class);
That is to say, both the mapper and the reducer are IndexerMapReduce.
So let's start analyzing the map and reduce functions of this class.
Note: the format of my URL file is url\tsender=xxx\treceiver=xxx\toldname=xxx\tnewname=xxx\n
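For illustration, a line in that file might look like the following (the URL and field values here are made up; the columns are separated by tab characters):

http://music.example.com/song/123	sender=alice	receiver=bob	oldname=old_song	newname=new_song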
The changes are as follows:
1. The declaration of the reduce function

From:

public void reduce(Text key, Iterator<NutchWritable> values,
                   OutputCollector<Text, NutchIndexAction> output, Reporter reporter)

Modify to:

public void reduce(Text key, Iterator<NutchWritable> values,
                   OutputCollector<Text, Text> output, Reporter reporter)

That is, the output value type changes from NutchIndexAction to Text, matching the plain-text records we are about to emit.
This will cause three errors elsewhere in the method, most likely the remaining output.collect(key, action) calls that still emit NutchIndexAction; those can simply be commented out.
2. Look at the last two lines of reduce:
NutchIndexAction action = new NutchIndexAction(doc, NutchIndexAction.ADD);
output.collect(key, action);
A change needs to be made here as follows:
// NutchIndexAction action = new NutchIndexAction(doc,
//     NutchIndexAction.ADD);
// output.collect(key, action);

Object senderObject = doc.getFieldValue("sender");
Object receiverObject = doc.getFieldValue("receiver");
Object singerObject = doc.getFieldValue("singer");

if (null != senderObject && null != receiverObject && null != singerObject) {
  String sender = senderObject.toString();
  String receiver = receiverObject.toString();
  String singer = singerObject.toString();

  // output it
  output.collect(new Text(sender), new Text(singer));
  output.collect(new Text(receiver), new Text(singer));
}
If you run ant to compile at this point, it will of course report an error, as follows:
[javac] /usr/local/music_Name_to_Singer/nutch-1.7/src/java/org/apache/nutch/indexer/IndexerMapReduce.java:53: error: IndexerMapReduce is not abstract and does not override abstract method reduce(Text,Iterator,OutputCollector,Reporter) in Reducer
[javac] public class IndexerMapReduce extends Configured implements
[javac] ^
[javac] 1 error
[javac] 1 warning
That's because we still need to change one more place, in IndexerMapReduce.java. The original code:
job.setOutputFormat(IndexerOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setMapOutputValueClass(NutchWritable.class);
job.setOutputValueClass(NutchWritable.class);
Now you want to change it to:
job.setOutputFormat(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setMapOutputValueClass(NutchWritable.class);
job.setOutputValueClass(Text.class);
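One assumption worth stating: TextOutputFormat here is taken to be the old-API (mapred) Hadoop class, since the job is configured through JobConf. If the import is not already present, IndexerMapReduce.java would also need:

// Old (mapred) API output format: writes each record as "key<TAB>value" on its own line.
import org.apache.hadoop.mapred.TextOutputFormat;

With this output format, every record emitted by reduce ends up on HDFS as the key, a tab, and the value on a single line, which is exactly the plain-text layout we want.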
And:

public class IndexerMapReduce extends Configured implements
    Mapper<Text, Writable, Text, NutchWritable>,
    Reducer<Text, NutchWritable, Text, NutchIndexAction> {

Modify to:

public class IndexerMapReduce extends Configured implements
    Mapper<Text, Writable, Text, NutchWritable>,
    Reducer<Text, NutchWritable, Text, Text> {

Again, only the reducer's output value type changes, from NutchIndexAction to Text.
Then run ant again, and you should see:

BUILD SUCCESSFUL
Total time: 15 seconds

which means the compilation succeeded.
Don't rush to run it yet; there is one more place that needs to be changed.
IndexingJob contains the following code:
final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-"
    + new Random().nextInt());
FileOutputFormat.setOutputPath(job, tmp);
try {
  JobClient.runJob(job);
  // do the commits once and for all the reducers in one go
  if (!noCommit) {
    writers.open(job, "commit");
    writers.commit();
  }
  long end = System.currentTimeMillis();
  LOG.info("Indexer: finished at " + sdf.format(end) + ", elapsed: "
      + TimingUtil.elapsedTime(start, end));
} finally {
  FileSystem.get(job).delete(tmp, true);
}
This shows that Nutch 1.7 by default writes the indexing job's output to a temporary directory and hands the documents to external index writers, rather than keeping the files on HDFS.
The line FileSystem.get(job).delete(tmp, true); then deletes that temporary directory, so this is the spot we need to modify in order to keep the files.
Otherwise, the files we worked so hard to write would be wiped out by that single line.
The modified code is below. Note: my requirement is to write each run into a directory for the current day, so the code is:
// final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-"
//     + new Random().nextInt());
Calendar cal = Calendar.getInstance();
int year = cal.get(Calendar.YEAR);
int month = cal.get(Calendar.MONTH) + 1;
int day = cal.get(Calendar.DAY_OF_MONTH);
final Path tmp = new Path(getConf().get("pathPrefix"),
    "year=" + year + "/month=" + month + "/day=" + day);
FileOutputFormat.setOutputPath(job, tmp);
try {
  JobClient.runJob(job);
  // do the commits once and for all the reducers in one go
  if (!noCommit) {
    writers.open(job, "commit");
    writers.commit();
  }
  long end = System.currentTimeMillis();
  LOG.info("Indexer: finished at " + sdf.format(end) + ", elapsed: "
      + TimingUtil.elapsedTime(start, end));
} finally {
  // FileSystem.get(job).delete(tmp, true);
}
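The code above reads the output root from a configuration key named pathPrefix. Assuming the standard Nutch/Hadoop configuration mechanism, one way to supply it is a property in conf/nutch-site.xml; the path below is only an example:

<property>
  <!-- Root HDFS directory for the plain-text index output used by the modified IndexingJob -->
  <name>pathPrefix</name>
  <value>/data/nutch/index_output</value>
</property>

With a value like that, each run writes into /data/nutch/index_output/year=YYYY/month=M/day=D, so daily outputs are kept apart and nothing gets deleted.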
At this point the build passes again.
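To exercise the modified build, the crawl is launched the same way as before. A rough example invocation (the directory names are placeholders, and the exact options of the 1.7 crawl command should be double-checked against bin/nutch crawl's usage message):

bin/nutch crawl urls -dir crawl -depth 2 -topN 100

Since the solrUrl branch is now disabled, no -solr option is needed; the indexing output lands under the pathPrefix directory instead.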
Okay, that's it for the time being. (Screenshot of the resulting output omitted.)
Thank you for reading! That is all for this article on how Nutch can write its output to HDFS files. I hope the content above is helpful; if you found the article useful, feel free to share it with others.