Introduction and usage of Nutch2.2.1

Updated: 2025-02-24. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/01

This article introduces Nutch 2.2.1 and explains how to compile it, configure it, and start using it.

1. Nutch

Nutch is an open source web crawler project. More specifically, it is crawler software that can be used directly to crawl web content.

Nutch is currently available in two lines, 1.x and 2.x. The latest 1.x release is 1.7 and the latest 2.x release is 2.2.1. The main difference between the two lines is the underlying storage.

Version 1.x is built on the Hadoop architecture and stores its data in HDFS, while 2.x uses Apache Gora so that Nutch can work with storage backends such as HBase, Accumulo, Cassandra, MySQL, DataFileAvroStore, and AvroStore.

2. Compile Nutch

Starting with version 1.7, Nutch 1.x no longer provides complete deployment files, only the source code and the related build.xml file, so users have to compile Nutch themselves. None of the Nutch 2.x releases provide compiled files either, so if you want to try the features of Nutch 2.2.1 you must compile it manually.

2.1 Download and unzip

$ wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
$ tar zxf apache-nutch-2.2.1-src.tar.gz

2.2 Compile

$ cd apache-nutch-2.2.1
$ ant

It is possible that you will get the following error:

Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

Solution:

Download sonar-ant-task-2.1.jar and copy it into the apache-nutch-2.2.1 directory.

Modify build.xml to include the jar package added above:

Nutch is built with Ivy, which downloads its dependencies during the build, so compilation can take a long time. If it takes too long, consider changing the Maven repository address as follows:

In ivysettings.xml under ivy/, replace http://repo1.maven.org/maven2/ with http://mirrors.ibiblio.org/maven2/. The code location is:
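The replacement can also be scripted instead of edited by hand. The following is a minimal sketch, not part of Nutch: switch_maven_repo is a hypothetical helper name, and it assumes the old repository URL appears literally in the settings file. It writes through a temporary file rather than relying on sed's in-place option, which differs between platforms.

```shell
# Sketch: point an Ivy settings file at a different Maven repository.
# switch_maven_repo is a hypothetical helper, not something shipped with Nutch.
# Usage: switch_maven_repo <settings-file> <new-repository-url>
switch_maven_repo() {
  local settings_file="$1"
  local new_url="$2"
  local old_url="http://repo1.maven.org/maven2/"
  local tmp_file

  [ -f "$settings_file" ] || return 1
  [ -n "$new_url" ] || return 1
  tmp_file="$(mktemp)" || return 1

  # '|' is used as the sed delimiter because the URLs contain '/'.
  # The dots in old_url act as regex wildcards, which is harmless here.
  if sed "s|${old_url}|${new_url}|g" "$settings_file" > "$tmp_file"; then
    # Copy the content back so the original file keeps its permissions.
    cat "$tmp_file" > "$settings_file"
  fi
  rm -f "$tmp_file"
}
```

Run from the apache-nutch-2.2.1 directory, the helper would be called with ivy/ivysettings.xml as the first argument and the mirror address given above as the second.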

The compiled directory is as follows:

➜ apache-nutch-2.2.1 tree -L 1
.
├── CHANGES.txt
├── LICENSE.txt
├── NOTICE.txt
├── README.txt
├── build
├── build.xml
├── conf
├── default.properties
├── docs
├── ivy
├── lib
├── runtime
├── sonar-ant-task-2.1.jar
└── src

7 directories, 7 files

You can see that there are two more directories after compilation: build and runtime.

3. Modify the configuration files

Since Nutch 2.x uses Gora to store data in Cassandra, HBase, Accumulo, Avro, and so on, the Gora properties must be specified in the gora.properties file. For example, the default storage is set with gora.datastore.default=org.apache.gora.hbase.store.HBaseStore. The value of this property can be found by searching for the storage.data.store.class property in nutch-default.xml. If the gora.properties file is not modified, the storage class is org.apache.gora.memory.store.MemStore, which keeps data in memory and is intended for testing only.

Here we change the storage backend to HBase. For details, see http://wiki.apache.org/nutch/Nutch2Tutorial.

Modify conf/nutch-site.xml

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

Modify ivy/ivy.xml so that the Gora HBase module (gora-hbase) is included as a dependency.

Modify conf/gora.properties to make sure that HBaseStore is set as the default storage:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
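If you prefer to script this change, the sketch below sets the property in a properties file. It is only an illustration: set_gora_default_store is a hypothetical helper name, and it assumes the file uses plain key=value lines. Any existing uncommented assignment of the key is dropped and the new one is appended, so the key ends up defined exactly once.

```shell
# Sketch: set gora.datastore.default in a properties file.
# set_gora_default_store is a hypothetical helper, not part of Nutch or Gora.
# Usage: set_gora_default_store <properties-file> <store-class>
set_gora_default_store() {
  local props_file="$1"
  local store_class="$2"
  local key="gora.datastore.default"
  local tmp_file

  [ -f "$props_file" ] || return 1
  [ -n "$store_class" ] || return 1
  tmp_file="$(mktemp)" || return 1

  # Keep every line except an existing, uncommented assignment of the key.
  # grep exits non-zero when nothing is left to print, which is not an error here.
  grep -v "^[[:space:]]*${key}[[:space:]]*=" "$props_file" > "$tmp_file" || true

  # Append the new assignment, then copy the result back over the original.
  printf '%s=%s\n' "$key" "$store_class" >> "$tmp_file"
  cat "$tmp_file" > "$props_file"
  rm -f "$tmp_file"
}
```

Called with conf/gora.properties and org.apache.gora.hbase.store.HBaseStore, this leaves commented lines and all other properties untouched.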

Because HBase is used here, you also need an HBase environment. Standalone mode is enough; see the HBase Quick Start for how to set it up. Note that the HBase version currently required is hbase-0.90.4.

4. Integrate Solr

Since we need Solr for indexing, we need to install and start a Solr server.

4.1 Download and extract

$ wget http://mirrors.cnnic.cn/apache/lucene/solr/4.8.0/solr-4.8.0.tgz
$ tar -zxf solr-4.8.0.tgz

4.2 Run Solr

$ cd solr-4.8.0/example
$ java -jar start.jar

Verify successful startup

Open http://localhost:8983/solr/admin/ in your browser. If the page loads, Solr has started successfully.
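The same check can be done from the command line. The sketch below assumes curl is installed; solr_is_up and wait_for_solr are hypothetical helper names, and the default URL is simply the admin address mentioned above.

```shell
# Sketch: check from the shell whether Solr answers on its admin URL.
# solr_is_up and wait_for_solr are hypothetical helpers; curl must be installed.
solr_is_up() {
  local url="${1:-http://localhost:8983/solr/admin/}"
  # -s: no progress output, -f: treat HTTP errors as failure,
  # -o /dev/null: discard the page body, --max-time: give up after 5 seconds.
  curl -sf -o /dev/null --max-time 5 "$url"
}

# Poll once per second until Solr answers or the attempts run out.
# Usage: wait_for_solr [url] [attempts]
wait_for_solr() {
  local url="$1"
  local attempts="${2:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if solr_is_up "$url"; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}
```

Polling is useful because start.jar takes a few seconds to come up, so a single check right after launching it may fail even though the startup is fine.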

4.3 Modify the Solr configuration

Copy apache-nutch-2.2.1/conf/schema-solr4.xml to solr-4.8.0/example/solr/collection1/conf/schema.xml, and in ... add a line at the end:

Restart Solr:

# Ctrl+C to stop Solr
$ java -jar start.jar

5. Crawl data

The compiled scripts are in the runtime/local/bin directory. Run a command without arguments to see how to use it:

crawl command:

$ cd runtime/local/bin
$ ./crawl
Missing seedDir : crawl

nutch command:

$ ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

Next, you can start crawling web pages.
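A crawl starts from a seed directory, which is what the crawl script is asking for in its "Missing seedDir" message. As a minimal sketch, the helper below creates such a directory with one URL per line. make_seed_dir is a hypothetical name and seed.txt is just a conventional file name, neither comes from Nutch itself.

```shell
# Sketch: create a seed directory containing the URLs to start crawling from.
# make_seed_dir is a hypothetical helper; seed.txt is only a conventional name.
# Usage: make_seed_dir <seed-dir> <url> [<url> ...]
make_seed_dir() {
  local seed_dir="$1"

  [ -n "$seed_dir" ] || return 1
  shift
  # At least one URL is required.
  [ "$#" -ge 1 ] || return 1

  mkdir -p "$seed_dir" || return 1
  # Seed URLs are plain text, one URL per line.
  printf '%s\n' "$@" > "$seed_dir/seed.txt"
}
```

The resulting directory is what you would pass as the seed directory to the crawl script or to the inject command listed above. Their remaining arguments are printed by the scripts themselves when run without parameters.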

That concludes this introduction to Nutch 2.2.1 and its usage. How it behaves in a specific setup still needs to be verified in practice.
