How to configure nutch+hadoop

2025-04-28 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains how to configure nutch+hadoop. It is shared here as a practical reference.

Use of nutch+hadoop configuration

Configure nutch+hadoop

1. Download nutch. Unless you specifically need to develop against hadoop, you do not need to download hadoop separately, because the hadoop core package and its related configuration are included in nutch.

2. Set up a directory layout (adjust to your preferences):

/nutch

  /search (nutch installation goes here, that is, unpack nutch here)

  /filesystem (storage location for the hadoop file system)

  /local (/local/crawl later holds the index used for search)

  /home (the nutch user's home directory; basically unused if you run as an existing system user)

  /tomcat (runs the nutch.war app used to search the index)
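
The layout in step 2 could be created in one go with a short script. This is only a sketch: the NUTCH_ROOT variable is an assumption introduced here, and it defaults to a scratch directory so the script can be tried safely (set it to /nutch to get the layout described above).

```shell
#!/bin/sh
# Sketch: create the directory layout from step 2.
# NUTCH_ROOT is a hypothetical variable, not something nutch itself reads.
NUTCH_ROOT="${NUTCH_ROOT:-${TMPDIR:-/tmp}/nutch-layout-demo}"

# search     = nutch installation
# filesystem = hadoop file system storage
# local      = local copy of the index used for search
# home       = home directory of the nutch user
# tomcat     = Tomcat that serves nutch.war
for dir in search filesystem local home tomcat; do
    mkdir -p "$NUTCH_ROOT/$dir"
done

ls -1 "$NUTCH_ROOT"
```

Running it as `NUTCH_ROOT=/nutch sh make-layout.sh` (the script name is arbitrary) would target the real location, provided the user has permission to create /nutch.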

3. JAVA_HOME must be configured; otherwise the system will not work.
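
A quick check like the following can catch a missing JAVA_HOME before anything is started. It is a sketch only: check_java_home is a hypothetical helper written for this example, not part of nutch or hadoop, and the hint about conf/hadoop-env.sh assumes the usual hadoop layout where that file exports JAVA_HOME.

```shell
#!/bin/sh
# Sketch: fail early if JAVA_HOME does not point at a usable Java install.
# check_java_home is a hypothetical helper, not part of nutch or hadoop.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no executable java under $JAVA_HOME/bin" >&2
        return 1
    fi
    echo "JAVA_HOME looks usable: $JAVA_HOME"
}

check_java_home || echo "fix JAVA_HOME (for example in conf/hadoop-env.sh) before starting"
```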

4. Configure passwordless ssh between the master and the slaves; otherwise you will have to enter a password every time.

ssh-keygen -t rsa

Then press Enter at each prompt.

cp id_rsa.pub authorized_keys

Copy the file to each of the other slaves:

scp /nutch/home/.ssh/authorized_keys nutch@devcluster02:/nutch/home/.ssh/authorized_keys
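
If ssh still asks for a password after step 4, file permissions are a common cause, since sshd by default ignores an authorized_keys file that other users can write to. The sketch below installs a key with strict permissions and avoids duplicate entries. SSH_DIR and add_authorized_key are names made up for this example; SSH_DIR defaults to a scratch directory, and on the cluster it would be /nutch/home/.ssh.

```shell
#!/bin/sh
# Sketch: install a public key into authorized_keys with strict permissions.
# SSH_DIR and add_authorized_key are hypothetical names for this example.
SSH_DIR="${SSH_DIR:-${TMPDIR:-/tmp}/nutch-ssh-demo}"

add_authorized_key() {
    pubkey_file=$1
    mkdir -p "$SSH_DIR"
    touch "$SSH_DIR/authorized_keys"
    # Append the key only if that exact line is not already present.
    grep -qxF -f "$pubkey_file" "$SSH_DIR/authorized_keys" ||
        cat "$pubkey_file" >> "$SSH_DIR/authorized_keys"
    # Only the owner may read or write the directory and the key file.
    chmod 700 "$SSH_DIR"
    chmod 600 "$SSH_DIR/authorized_keys"
}

# Use the key generated by ssh-keygen, if there is one.
if [ -f "${HOME:-}/.ssh/id_rsa.pub" ]; then
    add_authorized_key "$HOME/.ssh/id_rsa.pub"
fi
```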

5. Run dos2unix on all the .sh files and on the nutch and hadoop scripts under bin and conf:

dos2unix /nutch/search/bin/*.sh /nutch/search/bin/hadoop

Configure hadoop-site.xml
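
The article does not show the contents of hadoop-site.xml, so the following is only a sketch of what a minimal file might contain. The property names are the ones used by older hadoop releases; everything else is an assumption for illustration: the CONF_DIR variable (a scratch directory by default), the master host name devcluster01, the ports, and the storage paths under /nutch/filesystem. The accepted value format for fs.default.name has also varied between hadoop releases, so check the documentation for the version bundled with your nutch.

```shell
#!/bin/sh
# Sketch: write a minimal hadoop-site.xml.
# CONF_DIR, the host devcluster01, the ports and the storage paths are
# assumptions for illustration; adjust them to your cluster.
CONF_DIR="${CONF_DIR:-${TMPDIR:-/tmp}/nutch-conf-demo}"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/hadoop-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- address of the namenode -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://devcluster01:9000</value>
  </property>
  <!-- address of the job tracker -->
  <property>
    <name>mapred.job.tracker</name>
    <value>devcluster01:9001</value>
  </property>
  <!-- where the namenode keeps its metadata -->
  <property>
    <name>dfs.name.dir</name>
    <value>/nutch/filesystem/name</value>
  </property>
  <!-- where datanodes keep their blocks -->
  <property>
    <name>dfs.data.dir</name>
    <value>/nutch/filesystem/data</value>
  </property>
  <!-- number of copies of each block; 1 suits a single-node trial -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/hadoop-site.xml"
```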

6. Remember to copy the masters file from nutch/hadoop into this nutch installation; its absence is probably a bug. This file is required for startup, and its default content is localhost (for a distributed setup it may need to be configured).

7. The namenode needs to be formatted as part of the nutch+hadoop configuration process:

bin/hadoop namenode -format

8. Start: bin/start-all.sh

9. Configure the crawl (using the URL lucene.apache.org as an example)

cd /nutch/search

mkdir urls

vi urls/urllist.txt (add the line http://lucene.apache.org)

cd /nutch/search

bin/hadoop dfs -put urls urls

cd /nutch/search

vi conf/crawl-urlfilter.txt

Change the line that reads: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

to read: +^http://([a-z0-9]*\.)*apache.org/
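
The two vi edits in step 9 can also be scripted. The sketch below works in a scratch directory so it can be tried without a nutch installation: WORK_DIR is an assumed variable standing in for /nutch/search, and the filter file it creates contains only the one relevant line rather than the full file shipped with nutch.

```shell
#!/bin/sh
# Sketch: create the seed list and point the URL filter at apache.org.
# WORK_DIR is a hypothetical scratch directory standing in for /nutch/search.
WORK_DIR="${WORK_DIR:-${TMPDIR:-/tmp}/nutch-crawl-demo}"
mkdir -p "$WORK_DIR/urls" "$WORK_DIR/conf"

# The seed list holds one URL per line.
printf '%s\n' 'http://lucene.apache.org' > "$WORK_DIR/urls/urllist.txt"

# Stand-in for the shipped filter file (only the line that matters here).
FILTER="$WORK_DIR/conf/crawl-urlfilter.txt"
[ -f "$FILTER" ] || printf '%s\n' '+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/' > "$FILTER"

# Replace the placeholder domain. A temporary file is used because the
# in-place editing flag differs between sed implementations.
sed 's/MY\.DOMAIN\.NAME/apache.org/' "$FILTER" > "$FILTER.tmp" && mv "$FILTER.tmp" "$FILTER"

cat "$FILTER"
```

On the real installation the seed directory still has to be uploaded with bin/hadoop dfs -put urls urls, as shown in step 9.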

10. Start the crawl

bin/nutch crawl urls -dir crawled -depth 4

11. Query

bin/hadoop dfs -copyToLocal crawled /media/do/nutch/local/

Copy the index data (crawled) into the local directory configured above, because search cannot run against dfs (according to the documentation).

12. Start nutch.war and test

vi nutch-site.xml (under classes in nutch.war)

Start tomcat

Note:

1. The masters file is not included with nutch, so you need to copy it into conf.

2. The default log4j configuration causes a problem during the crawl; the following needs to be added:

hadoop.log.dir=.

hadoop.log.file=hadoop.log

3. For querying, Nutch 2.0 requires nutch-site.xml to be configured: set http.agent there again, even though it already exists in default.xml.

Problems with nutch+hadoop configuration:

1. I terminated a running hadoop program halfway through, and afterwards adding or deleting files in hdfs failed with a "Name node is in safe mode" error:

rmr: org.apache.hadoop.dfs.SafeModeException: Cannot delete /user/hadoop/input. Name node is in safe mode

Command to resolve it:

bin/hadoop dfsadmin -safemode leave # turns off safe mode
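
Rather than leaving safe mode unconditionally, the state can be checked first with the get subcommand of dfsadmin -safemode. The sketch below assumes that the status line contains the word ON while safe mode is active. HADOOP is an assumed variable holding the path to the hadoop launcher, and leave_safemode_if_on is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: leave safe mode only if the namenode reports that it is on.
# HADOOP is an assumed variable holding the path to the hadoop launcher;
# leave_safemode_if_on is a hypothetical helper.
HADOOP="${HADOOP:-bin/hadoop}"

leave_safemode_if_on() {
    # Ask the namenode for its current safe mode state.
    state=$("$HADOOP" dfsadmin -safemode get) || return 1
    echo "$state"
    case "$state" in
        *ON*) "$HADOOP" dfsadmin -safemode leave ;;
    esac
}

if [ -x "$HADOOP" ]; then
    leave_safemode_if_on
else
    echo "hadoop launcher not found at $HADOOP; run this from /nutch/search" >&2
fi
```

Note that the namenode normally leaves safe mode by itself once enough blocks have been reported, so forcing it off is mainly useful after an interrupted run like the one described above.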

Index command:

bin/nutch index plainindex/paodingindexes plainindex/crawldb plainindex/linkdb plainindex/segments/20090528132511 plainindex/segments/20090528132525 plainindex/segments/20090528132602

For example:

Index:

bin/nutch index crawled/indexes_new crawled/crawldb crawled/linkdb crawled/segments/20100313132517

Merge:

bin/nutch merge crawled/index_new crawled/indexes_new

Deduplicate (dedup):

bin/nutch dedup crawled/index_new
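
Because the index command takes every segment as a separate argument, it is easy to mistype when a crawl has produced several segments. The sketch below assembles the three commands for a given crawl directory and list of segment names. It only prints them (a dry run) so they can be reviewed before being run from /nutch/search. print_index_commands is a hypothetical helper, and the indexes_new and index_new directory names simply follow the example above.

```shell
#!/bin/sh
# Sketch: print the index, merge and dedup commands for a crawl directory.
# print_index_commands is a hypothetical helper; it echoes the commands
# instead of running them.
print_index_commands() {
    crawl_dir=$1
    shift
    cmd="bin/nutch index $crawl_dir/indexes_new $crawl_dir/crawldb $crawl_dir/linkdb"
    # Every remaining argument is a segment name under <crawl_dir>/segments.
    for segment in "$@"; do
        cmd="$cmd $crawl_dir/segments/$segment"
    done
    echo "$cmd"
    echo "bin/nutch merge $crawl_dir/index_new $crawl_dir/indexes_new"
    echo "bin/nutch dedup $crawl_dir/index_new"
}

print_index_commands crawled 20100313132517
```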
