

How to rewrite the bin/crawl script of nutch2.3 into a java class

2025-01-19 Update From: SLTechnology News&Howtos



This article explains in detail how to rewrite the bin/crawl script of nutch2.3 into a Java class. I found it very practical, so I am sharing it for reference; I hope you get something out of it.

Rewriting the bin/crawl script of nutch3.3 as a Java class

Since nutch2.8, the old main entry class org.apache.nutch.crawl.Crawl is gone; only the corresponding control script bin/crawl remains, which makes debugging in IDEA inconvenient. So I studied shell scripting and, following nutch3.3's bin/crawl and bin/nutch scripts, translated bin/crawl into a Java Crawl class that can be debugged in IDEA.

Code design description

I referred to nutch2.7's Crawl class and nutch3.3's bin/crawl and bin/nutch, and tried to preserve the original organization and logic of the shell scripts in the translation.

The main business logic lives in the public int run(String[] args) method.

The program entry point is main, which calls ToolRunner.run(NutchConfiguration.create(), new Crawl(), args) to execute the run method above.
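The main/ToolRunner pattern above is standard Hadoop Tool plumbing. Here is a stdlib-only sketch of the same shape; MiniTool and MiniRunner are hypothetical stand-ins for Hadoop's Tool and ToolRunner, used only so the snippet runs without a Hadoop dependency:

```java
import java.util.Properties;

// Hypothetical stand-in for org.apache.hadoop.util.Tool.
interface MiniTool {
    int run(String[] args) throws Exception;
}

// Hypothetical stand-in for ToolRunner: hands the tool a fresh
// configuration object (here a plain Properties) plus its arguments.
class MiniRunner {
    static int run(Properties conf, MiniTool tool, String[] args) throws Exception {
        return tool.run(args);
    }
}

public class ToolRunnerSketch {
    // Mirrors: int res = ToolRunner.run(NutchConfiguration.create(), new Crawl(), args);
    public static int launch(String[] args) throws Exception {
        return MiniRunner.run(new Properties(), a -> a.length < 1 ? -1 : 0, args);
    }

    public static void main(String[] args) throws Exception {
        System.exit(launch(args));
    }
}
```

The value the tool returns becomes the process exit code, which is why run returns int rather than void.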

public void binNutch(String jobName, String commandLine, String options) is equivalent to the function __bin_nutch in the bin/crawl script.

public int runJob(String jobName, String commandLine, String options) plays the role of the bin/nutch script. Instead of an if-else or switch-case chain as in the script, it uses reflection to create the corresponding job.

public void preConfig(Configuration conf, String options) sets the configuration items for each job from option strings such as commonOptions that carry -D parameters.
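A minimal sketch of what that -D handling might look like; parseDOptions and its option-string format are illustrative assumptions, not the article's exact code (the real preConfig would call conf.set(key, value) on a Hadoop Configuration):

```java
import java.util.HashMap;
import java.util.Map;

public class PreConfigSketch {
    // Parses options like "-D mapred.reduce.tasks=2 -D solr.server.url=..."
    // into key/value pairs, the way preConfig would feed them into a
    // Hadoop Configuration. (Hypothetical helper, not the article's code.)
    public static Map<String, String> parseDOptions(String options) {
        Map<String, String> conf = new HashMap<>();
        String[] tokens = options.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if ("-D".equals(tokens[i]) && i + 1 < tokens.length) {
                // Split "key=value" on the first '=' only.
                String[] kv = tokens[++i].split("=", 2);
                if (kv.length == 2) {
                    conf.put(kv[0], kv[1]);
                }
            }
        }
        return conf;
    }
}
```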

CLASS_MAP is a static HashMap that records the mapping from each job name to the corresponding class name.
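The reflective dispatch plus CLASS_MAP idea from the two points above can be sketched like this; JobDispatchSketch, FakeInjectorJob, and the use of Runnable are illustrative placeholders (the real CLASS_MAP maps names such as "inject" to Nutch job classes like org.apache.nutch.crawl.InjectorJob):

```java
import java.util.HashMap;
import java.util.Map;

public class JobDispatchSketch {
    // Maps a job name to the fully qualified class implementing it,
    // mirroring the static CLASS_MAP described in the article.
    static final Map<String, String> CLASS_MAP = new HashMap<>();
    static {
        // Illustrative entry: a placeholder job class defined below.
        CLASS_MAP.put("inject", "FakeInjectorJob");
    }

    // In the spirit of runJob: instead of a shell-script case statement,
    // look the class name up and instantiate it via reflection.
    public static int runJob(String jobName) throws Exception {
        String className = CLASS_MAP.get(jobName);
        if (className == null) return -1;
        Class<?> clazz = Class.forName(className);
        Runnable job = (Runnable) clazz.getDeclaredConstructor().newInstance();
        job.run();
        return 0;
    }
}

// Placeholder standing in for a Nutch job such as InjectorJob.
class FakeInjectorJob implements Runnable {
    static boolean ran = false;
    public void run() { ran = true; }
}
```

Adding a new job then only requires a new CLASS_MAP entry, with no change to the dispatch logic itself.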

Gora bug description

Following the script, I originally passed the batchId parameter to each job, and ran into the following problem:

Gora MongoDb Exception, can't serialize Utf8

It appears to be a serialization problem. The gora-0.6 release reportedly fixes this bug, but my nutch code uses gora-0.5 and I did not want to upgrade, so I simply replaced the -batchId parameter with the -all parameter, as you can see in the code.

Upgrading to gora-0.6 can be studied later, when there is time.

Through rewriting this script I learned the basics of shell scripting, got more practice with Java reflection, and came away with a solid grasp of nutch's complete crawl workflow and main control logic. The Gora bug did stall me for a few days, though; I assumed something was wrong with my translation, so my debugging skills clearly still need work.

Java code

This code translates the bin/crawl and bin/nutch scripts of nutch3.3. Add the Crawl class to the org.apache.nutch.crawl package; the source code is as follows:

```java
package org.apache.nutch.crawl;

/**
 * Created by brianway on 2016-1-19.
 * @author brianway
 * @site brianway.github.io
 * org.apache.nutch.crawl.Crawl;
 */
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.fetcher.FetcherJob;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchTool;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Commons Logging imports
//import org.apache.hadoop.fs.*;
//import org.apache.hadoop.mapred.*;
//import org.apache.nutch.util.HadoopFSUtil;
//import org.apache.nutch.util.NutchJob;
//import org.apache.nutch.crawl.InjectorJob;
//import org.apache.nutch.crawl.GeneratorJob;
//import org.apache.nutch.fetcher.FetcherJob;
//import org.apache.nutch.parse.ParserJob;
//import org.apache.nutch.crawl.DbUpdaterJob;
//import org.apache.nutch.indexer.IndexingJob;
//import org.apache.nutch.indexer.solr.SolrDeleteDuplicates;

public class Crawl extends NutchTool implements Tool {
  public static final Logger LOG = LoggerFactory.getLogger(Crawl.class);

  /* Perform complete crawling and indexing (to Solr) given a set of
     root urls and the -solr parameter respectively. More information
     and Usage parameters can be found below. */
  public static void main(String args[]) throws Exception {
    int res = ToolRunner.run(NutchConfiguration.create(), new Crawl(), args);
    System.exit(res);
  }

  // to compile
  @Override
  public Map<String, Object> run(Map<String, Object> args) throws Exception {
    return null;
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length < 1) {
      System.out.println("Usage: Crawl -urls <urlDir> -crawlId <id> -solr <solrURL>"
          + " [-threads n] [-depth i] [-topN N]");
      return -1;
    }

    // --- check args ---
    /*
    // A literal translation of the script takes too few parameters,
    // so it is commented out in favor of the style below:
    String seedDir = args[1];
    String crawlID = args[2];
    String solrUrl = null;
    int limit = 1;
    if (args.length - 1 == 3) {
      limit = Integer.parseInt(args[3]);
    } else if (args.length - 1 == 4) {
      solrUrl = args[3];
      limit = Integer.parseInt(args[4]);
    } else {
      System.out.println("Unknown # of arguments " + (args.length - 1));
      System.out.println("Usage: crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>");
      return -1;
    }
    */
    String seedDir = null;
    String crawlID = null;
    String solrUrl = null;
    int limit = 0;
    long topN = Long.MAX_VALUE;
    int threads = getConf().getInt("fetcher.threads.fetch", 10);

    // The parameter format in this Crawl class follows nutch2.7:
    //   "Usage: Crawl -urls ... -solr ... [-dir d] [-threads n] [-depth i] [-topN N]"
    // not nutch3.3's
    //   "Usage: crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>"
    for (int i = 0; i < args.length; i++) {
      if ("-urls".equals(args[i])) {
        seedDir = args[++i];
      } else if ("-crawlId".equals(args[i])) {
        crawlID = args[++i];
      } else if ("-threads".equals(args[i])) {
        threads = Integer.parseInt(args[++i]);
      } else if ("-depth".equals(args[i])) {
        limit = Integer.parseInt(args[++i]);
      } else if ("-topN".equals(args[i])) {
        topN = Long.parseLong(args[++i]);
      } else if ("-solr".equals(args[i])) {
        solrUrl = args[++i];
      } else {
        System.err.println("Unrecognized arg " + args[i]);
        return -1;
      }
    }

    if (StringUtils.isEmpty(seedDir)) {
      System.out.println("Missing seedDir: crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>");
      return -1;
    }
    if (StringUtils.isEmpty(crawlID)) {
      System.out.println("Missing crawlID: crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>");
      return -1;
    }
    if (StringUtils.isEmpty(solrUrl)) {
      System.out.println("No SOLRURL specified. Skipping indexing.");
    }
    if (limit == 0) {
      System.out.println("Missing numberOfRounds: crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>");
      return -1;
    }

    /* MODIFY THE PARAMETERS BELOW TO YOUR NEEDS */
    // set the number of slave nodes
    int numSlaves = 1;
    // and the total number of available tasks;
    // sets the Hadoop parameter "mapred.reduce.tasks"
    int numTasks = numSlaves
    // ... (the listing is truncated here in the original article)
```
