
How Nutch Implements HDFS File Output


This article is about how to make Nutch write its indexing output to HDFS files. It is meant as a practical reference; follow along and have a look.

We take Nutch 1.7 as the example. Normally, Nutch's output can be customized to go to other storage systems through indexing plugins; the underlying principle is not covered in detail here.

Our project has a requirement that the results stay in HDFS as plain files rather than being indexed into some other storage system.

In other words, we do not want to write a plugin of the form:

public class XXX implements IndexWriter
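For reference, such a plugin would have to implement the IndexWriter extension point. A rough sketch of what that interface looks like in 1.7 (approximate, from memory of the org.apache.nutch.indexer package; check the actual source before relying on it):

// Approximate shape of the Nutch 1.7 IndexWriter plugin interface; the
// authoritative version is org.apache.nutch.indexer.IndexWriter in the source tree.
public interface IndexWriter extends org.apache.hadoop.conf.Configurable {
  void open(org.apache.hadoop.mapred.JobConf job, String name) throws java.io.IOException;
  void write(NutchDocument doc) throws java.io.IOException;
  void delete(String key) throws java.io.IOException;
  void update(NutchDocument doc) throws java.io.IOException;
  void commit() throws java.io.IOException;
  void close() throws java.io.IOException;
  String describe(); // human-readable description of the writer
}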

So, the question is: how do we modify the Nutch source so that the results land directly in HDFS?

Let's modify it step by step from the source, solving each problem as we encounter it.

First of all, in Crawl.java, the indexing phase originally contains this code:

if (i > 0) {
  linkDbTool.invert(linkDb, segments, true, true, false); // invert links

  if (solrUrl != null) {
    // index, dedup & merge
    FileStatus[] fstats = fs.listStatus(segments, HadoopFSUtil.getPassDirectoriesFilter(fs));

    IndexingJob indexer = new IndexingJob(getConf());
    indexer.index(crawlDb, linkDb, Arrays.asList(HadoopFSUtil.getPaths(fstats)));

    SolrDeleteDuplicates dedup = new SolrDeleteDuplicates();
    dedup.setConf(getConf());
    dedup.dedup(solrUrl);
  }
}

The most important part is:

IndexingJob indexer = new IndexingJob(getConf());
indexer.index(crawlDb, linkDb, Arrays.asList(HadoopFSUtil.getPaths(fstats)));

In other words, this is the entry point of the indexing job.

This code needs to be disabled here. My personal approach is to change

if (solrUrl != null) {

to

if (false) {

This keeps the original code in place, so it can be restored if the later changes turn out to have problems.
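Sketched in place, the edit looks like this (only the condition changes; the original Solr path becomes unreachable):

// if (solrUrl != null) {  // original condition, kept for reference
if (false) {               // disabled: skip the Solr indexing and dedup path
  // ... original index, dedup & merge code, now unreachable ...
}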

What's next? Add our own indexing job code, as follows:

if (true) { // add my index job
  // index, dedup & merge
  FileStatus[] fstats = fs.listStatus(segments, HadoopFSUtil.getPassDirectoriesFilter(fs));
  IndexingJob indexer = new IndexingJob(getConf());
  indexer.index(crawlDb, linkDb, Arrays.asList(HadoopFSUtil.getPaths(fstats)), true, false, null);
}
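A note on the extra arguments true, false, null: judging from the IndexingJob overloads in the 1.7 source, they should correspond to a signature along these lines (an assumption worth verifying in your own tree):

void index(Path crawlDb, Path linkDb, List<Path> segments, boolean noCommit, boolean deleteGone, String params)

that is, noCommit = true (skip the index-writer commit), deleteGone = false, and no extra parameter string.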

With this, the transformation around the indexing job is complete; we have changed only the skin without touching the bones.

Now let's start transforming the interior!

First, we have to find the MapReduce job itself. Where is its entry point?

IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job);

Following this call, you can see the concrete MR classes:

job.setMapperClass(IndexerMapReduce.class);
job.setReducerClass(IndexerMapReduce.class);

That is to say, both the map and reduce classes are IndexerMapReduce.

So let's analyze the map and reduce functions of this class.

Note: the format of my URL file is url\tsender=xxx\treceiver=xxx\toldname=xxx\tnewname=xxx\n
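To make that format concrete, here is a small hypothetical helper (not part of Nutch; the class and method names are made up for illustration) that splits one such line into its fields:

import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: parses one seed line of the form
// url\tsender=xxx\treceiver=xxx\toldname=xxx\tnewname=xxx
public class SeedLineParser {
  public static Map<String, String> parse(String line) {
    Map<String, String> fields = new HashMap<String, String>();
    String[] columns = line.trim().split("\t");
    fields.put("url", columns[0]); // the first column is the URL itself
    for (int i = 1; i < columns.length; i++) {
      String[] kv = columns[i].split("=", 2); // remaining columns are key=value pairs
      if (kv.length == 2) {
        fields.put(kv[0], kv[1]);
      }
    }
    return fields;
  }
}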


The changes are as follows:

1. The reduce function declaration

From:

public void reduce(Text key, Iterator<NutchWritable> values, OutputCollector<Text, NutchIndexAction> output, Reporter reporter)

Modify to:

public void reduce(Text key, Iterator<NutchWritable> values, OutputCollector<Text, Text> output, Reporter reporter)

This change will produce three errors in the body of reduce; those lines can simply be commented out.

2. Look at the last two lines of reduce:

NutchIndexAction action = new NutchIndexAction(doc, NutchIndexAction.ADD);
output.collect(key, action);

They need to be changed as follows:

// NutchIndexAction action = new NutchIndexAction(doc, NutchIndexAction.ADD);
// output.collect(key, action);
Object senderObject = doc.getFieldValue("sender");
Object receiverObject = doc.getFieldValue("receiver");
Object singerObject = doc.getFieldValue("singer");
if (null != senderObject && null != receiverObject && null != singerObject) {
  String sender = senderObject.toString();
  String receiver = receiverObject.toString();
  String singer = singerObject.toString();
  // output it
  output.collect(new Text(sender), new Text(singer));
  output.collect(new Text(receiver), new Text(singer));
}

If you run ant now, it will naturally report an error, like this:

[javac] /usr/local/music_Name_to_Singer/nutch-1.7/src/java/org/apache/nutch/indexer/IndexerMapReduce.java:53: error: IndexerMapReduce is not abstract and does not override abstract method reduce(Text,Iterator<NutchWritable>,OutputCollector<Text,NutchIndexAction>,Reporter) in Reducer
[javac] public class IndexerMapReduce extends Configured implements
[javac]              ^
[javac] 1 error
[javac] 1 warning

That's because we need to change one place:

In IndexerMapReduce.java

The original code:

job.setOutputFormat(IndexerOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setMapOutputValueClass(NutchWritable.class);
job.setOutputValueClass(NutchWritable.class);

Now you want to change it to:

job.setOutputFormat(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setMapOutputValueClass(NutchWritable.class);
job.setOutputValueClass(Text.class);
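With TextOutputFormat, every output.collect(key, value) call in our reduce becomes one tab-separated line in the output files, so the results will consist of lines of the form (values here are hypothetical placeholders):

sender_xxx	singer_xxx
receiver_xxx	singer_xxx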

In addition,

public class IndexerMapReduce extends Configured implements
    Mapper<Text, Writable, Text, NutchWritable>,
    Reducer<Text, NutchWritable, Text, NutchIndexAction> {

must be modified to

public class IndexerMapReduce extends Configured implements
    Mapper<Text, Writable, Text, NutchWritable>,
    Reducer<Text, NutchWritable, Text, Text> {
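To recap, the value type changes from NutchIndexAction to Text in three coordinated places, and all three must agree or the compiler complains as shown above:

Reducer<Text, NutchWritable, Text, Text>  // the class declaration
OutputCollector<Text, Text> output        // the reduce signature
job.setOutputValueClass(Text.class);      // the job configuration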

Then run ant again. You should see:

BUILD SUCCESSFUL
Total time: 15 seconds

which means the compilation succeeded!

Don't rush to run it, though; there is one more place that needs to change!

IndexingJob contains the following code:

final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-" + new Random().nextInt());
FileOutputFormat.setOutputPath(job, tmp);
try {
  JobClient.runJob(job);
  // do the commits once and for all the reducers in one go
  if (!noCommit) {
    writers.open(job, "commit");
    writers.commit();
  }
  long end = System.currentTimeMillis();
  LOG.info("Indexer: finished at " + sdf.format(end) + ", elapsed: " + TimingUtil.elapsedTime(start, end));
} finally {
  FileSystem.get(job).delete(tmp, true);
}

This shows that Nutch 1.7 by default hands the output to the configured index writers; the job's file output path is just a temporary directory, not a real HDFS destination.

That is what FileSystem.get(job).delete(tmp, true); is for: it deletes that directory. This is the place we must modify in order to keep our files.

Otherwise, the files we worked so hard to write would be capriciously deleted by that one line.

The modified code is as follows:

Note: my requirement here is to write output into a directory for the current day, so the code is:

// final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-"
//     + new Random().nextInt());
Calendar cal = Calendar.getInstance();
int year = cal.get(Calendar.YEAR);
int month = cal.get(Calendar.MONTH) + 1;
int day = cal.get(Calendar.DAY_OF_MONTH);
final Path tmp = new Path(getConf().get("pathPrefix"), "year=" + year + "/month=" + month + "/day=" + day);
FileOutputFormat.setOutputPath(job, tmp);
try {
  JobClient.runJob(job);
  // do the commits once and for all the reducers in one go
  if (!noCommit) {
    writers.open(job, "commit");
    writers.commit();
  }
  long end = System.currentTimeMillis();
  LOG.info("Indexer: finished at " + sdf.format(end) + ", elapsed: " + TimingUtil.elapsedTime(start, end));
} finally {
  // FileSystem.get(job).delete(tmp, true);
}
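One assumption worth calling out: "pathPrefix" is not a standard Nutch property, so it must be supplied in the configuration (for example in nutch-site.xml), otherwise getConf().get("pathPrefix") returns null and constructing the Path will fail. Assuming it is set to, say, /data/nutch/output (a hypothetical value, as are the dates), the reducers produce a Hive-style date-partitioned layout:

/data/nutch/output/year=2014/month=5/day=31/part-00000
/data/nutch/output/year=2014/month=5/day=31/part-00001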

At this point, the code compiles cleanly.

Okay, that's it for the time being.

Thank you for reading! That's all for "How Nutch Implements HDFS File Output". I hope the content above has been of some help and lets you learn something new. If you think the article is good, share it so more people can see it!
