How Nutch parses HTML documents

2025-04-27 Update From: SLTechnology News&Howtos

This article walks through how Nutch parses HTML documents, following the source code of the parse job step by step. It is meant as a reference for readers who want to understand what happens between fetching a page and indexing it.

Description of the MapReduce task that parses HTML documents

I. The call in the main program

    ParseSegment parseSegment = new ParseSegment(getConf());

    if (!Fetcher.isParsing(job)) {
        parseSegment.parse(segs[0]); // parse it, if needed
    }

(1) Implementation of the isParsing method

    public static boolean isParsing(Configuration conf) {
        // true means the fetcher parses pages itself, so no separate parse step is needed
        return conf.getBoolean("fetcher.parse", true);
    }
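The check above can be tried out without Hadoop. The following is a minimal sketch, assuming java.util.Properties as a stand-in for Hadoop's Configuration; the class ParseFlagSketch and the helper needsSeparateParse are hypothetical names used only for illustration and are not part of Nutch.

```java
import java.util.Properties;

// Hypothetical stand-in: java.util.Properties plays the role of Hadoop's Configuration.
class ParseFlagSketch {

    // Mirrors conf.getBoolean("fetcher.parse", true): a missing key counts as true.
    static boolean isParsing(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty("fetcher.parse", "true"));
    }

    // The separate parse step is only needed when the fetcher did not parse.
    static boolean needsSeparateParse(Properties conf) {
        return !isParsing(conf);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("key unset, separate parse needed: " + needsSeparateParse(conf));
        conf.setProperty("fetcher.parse", "false");
        System.out.println("key false, separate parse needed: " + needsSeparateParse(conf));
    }
}
```

With the key unset the default applies, so the main program would skip the call to parse; only an explicit "false" triggers the separate parse job.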

(2) The parameter segs[0]

    Path[] segs = generator.generate(crawlDb, segments, -1, topN,
        System.currentTimeMillis());

The generatedSegments list is built inside the generate method, as shown below.

Several points here are still unclear to me, so I will set them aside for now.

    // read the subdirectories generated in the temp
    // output and turn them into segments

    // 1. Create a list of Path objects
    List<Path> generatedSegments = new ArrayList<Path>();

    // 2. List the fetchlist directories generated above
    FileStatus[] status = fs.listStatus(tempDir);
    try {
        for (FileStatus stat : status) {
            Path subfetchlist = stat.getPath();
            // skip entries whose names do not begin with "fetchlist-"
            if (!subfetchlist.getName().startsWith("fetchlist-")) continue;
            // start a new partition job for this segment;
            // partitioning produces a new segment directory
            Path newSeg = partitionSegment(fs, segments, subfetchlist, numLists);
            generatedSegments.add(newSeg);
        }
    } catch (Exception e) {
        LOG.warn("Generator: exception while partitioning segments, exiting ...");
        fs.delete(tempDir, true);
        return null;
    }
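The name filter in the loop above can be illustrated in isolation. This is a minimal sketch that uses plain strings instead of Hadoop Path objects; the class FetchlistFilterSketch and the sample directory names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the "fetchlist-" prefix filter; not Nutch code.
class FetchlistFilterSketch {

    // Keeps only the names that start with "fetchlist-", in their original order.
    static List<String> selectFetchlists(List<String> names) {
        List<String> selected = new ArrayList<>();
        for (String name : names) {
            if (!name.startsWith("fetchlist-")) {
                continue; // not a fetchlist directory, skip it
            }
            selected.add(name);
        }
        return selected;
    }

    public static void main(String[] args) {
        // Sample names are made up for the demonstration.
        List<String> names = Arrays.asList("fetchlist-1", "_logs", "fetchlist-2");
        System.out.println(selectFetchlists(names));
    }
}
```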

II. Job task configuration

    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(ParseSegment.class);
    job.setReducerClass(ParseSegment.class);

    FileOutputFormat.setOutputPath(job, segment);
    job.setOutputFormat(ParseOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(ParseImpl.class);

    JobClient.runJob(job);

III. Input and output of the map and reduce tasks

Map task input and output

Input: WritableComparable / Content

Output: Text / ParseImpl

    public void map(WritableComparable<?> key, Content content,
                    OutputCollector<Text, ParseImpl> output, Reporter reporter)

Reduce task input and output

Input: Text / Iterator<Writable>

Output: Text / Writable

    public void reduce(Text key, Iterator<Writable> values,
                       OutputCollector<Text, Writable> output, Reporter reporter)

IV. The job's input class SequenceFileInputFormat

    protected FileStatus[] listStatus(JobConf job) throws IOException {
        FileStatus[] files = super.listStatus(job);
        for (int i = 0; i < files.length; i++) {
            FileStatus file = files[i];
            if (file.isDir()) { // it's a MapFile
                Path dataFile = new Path(file.getPath(), MapFile.DATA_FILE_NAME);
                FileSystem fs = file.getPath().getFileSystem(job);
                // use the data file
                files[i] = fs.getFileStatus(dataFile);
            }
        }
        return files;
    }

    public RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
                                              Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        return new SequenceFileRecordReader<K, V>(job, (FileSplit) split);
    }

V. Implementation of the map() and reduce() methods

(1) The map task

org.apache.nutch.parse.ParseSegment

    public void map(WritableComparable<?> key, Content content,
                    OutputCollector<Text, ParseImpl> output, Reporter reporter)
            throws IOException {
        // convert on the fly from old UTF8 keys
        if (key instanceof Text) {
            newKey.set(key.toString());
            key = newKey;
        }

        // 2. Get the fetch status; the value of Nutch.FETCH_STATUS_KEY is "_fst_"
        int status =
            Integer.parseInt(content.getMetadata().get(Nutch.FETCH_STATUS_KEY));

        // 3. If the fetch did not succeed, skip this record
        if (status != CrawlDatum.STATUS_FETCH_SUCCESS) {
            // content not fetched successfully, skip document
            LOG.debug("Skipping " + key + " as content is not fetched successfully");
            return;
        }

        // 4. If truncated documents are to be skipped and this document
        //    is truncated, skip this record as well
        if (skipTruncated && isTruncated(content)) {
            return;
        }

        ParseResult parseResult = null;
        try {
            // 5. Create a ParseUtil object and call its parse method,
            //    which returns a ParseResult
            parseResult = new ParseUtil(getConf()).parse(content);
        } catch (Exception e) {
            LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e));
            return;
        }

        // The code above does the parsing; the code below processes the results.

        // 6. Iterate over the parse results obtained in the previous step
        for (Entry<Text, Parse> entry : parseResult) {
            // 7. Get the key and the value
            Text url = entry.getKey();
            Parse parse = entry.getValue();
            // 8. Get the parse status
            ParseStatus parseStatus = parse.getData().getStatus();
            long start = System.currentTimeMillis();

            // 9. Increment the counter for this status
            reporter.incrCounter("ParserStatus",
                ParseStatus.majorCodes[parseStatus.getMajorCode()], 1);

            // 10. If parsing did not succeed, replace parse with an empty parse
            if (!parseStatus.isSuccess()) {
                LOG.warn("Error parsing: " + key + ": " + parseStatus);
                parse = parseStatus.getEmptyParse(getConf());
            }

            // 11. Pass the segment name to the parse data
            parse.getData().getContentMeta().set(Nutch.SEGMENT_NAME_KEY,
                getConf().get(Nutch.SEGMENT_NAME_KEY));

            // 12. Compute the new signature
            byte[] signature =
                SignatureFactory.getSignature(getConf()).calculate(content, parse);
            // 13. Store the signature (digest) as a hex string
            parse.getData().getContentMeta().set(Nutch.SIGNATURE_KEY,
                StringUtil.toHexString(signature));

            try {
                scfilters.passScoreAfterParsing(url, content, parse);
            } catch (ScoringFilterException e) {
                if (LOG.isWarnEnabled()) {
                    LOG.warn("Error passing score: " + url + ": " + e.getMessage());
                }
            }

            long end = System.currentTimeMillis();
            LOG.info("Parsed (" + Long.toString(end - start) + " ms): " + url);

            output.collect(url, new ParseImpl(new ParseText(parse.getText()),
                parse.getData(), parse.isCanonical()));
        }
    }
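The control flow of the map method (skip unfetched records, optionally skip truncated ones, parse, then emit) can be condensed into a small stand-alone sketch. Everything here is hypothetical: Page is a made-up record type standing in for Content, the toy parse method only trims text, and the emitted value is a plain string rather than a ParseImpl.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical condensation of the map() flow; none of these types are Nutch classes.
class MapFlowSketch {

    // Made-up stand-in for a fetched record.
    static final class Page {
        final String url;
        final boolean fetched;
        final boolean truncated;
        final String body;

        Page(String url, boolean fetched, boolean truncated, String body) {
            this.url = url;
            this.fetched = fetched;
            this.truncated = truncated;
            this.body = body;
        }
    }

    // Toy parser: trims the body, and throws on null to imitate a parser failure.
    static String parse(String body) {
        if (body == null) {
            throw new IllegalArgumentException("no content");
        }
        return body.trim();
    }

    // Returns the emitted "url -> text" entries; an empty list means the record was skipped.
    static List<String> map(Page page, boolean skipTruncated) {
        List<String> out = new ArrayList<>();
        if (!page.fetched) {
            return out; // like step 3: not fetched successfully
        }
        if (skipTruncated && page.truncated) {
            return out; // like step 4: truncated and configured to skip
        }
        String text;
        try {
            text = parse(page.body); // like step 5
        } catch (RuntimeException e) {
            return out; // parse error: log-and-skip in the real code
        }
        out.add(page.url + " -> " + text); // steps 6 onward, reduced to a single emit
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(new Page("http://example.com/", true, false, " hello "), true));
    }
}
```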

Characteristics of the ParseResult object

A. It implements the Iterable interface, so it can be iterated. Each element is an Entry whose key is a Text and whose value is a Parse, as follows

    public class ParseResult implements Iterable<Map.Entry<Text, Parse>>

The iterator method is as follows

    public Iterator<Entry<Text, Parse>> iterator() {
        return parseMap.entrySet().iterator();
    }

As you can see, it simply reuses the iterator of the underlying Map's entry set.

B. It holds a HashMap that stores the parse results, together with the original URL. The code is as follows

    private Map<Text, Parse> parseMap;

    private String originalUrl;
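The pattern described in A and B, an Iterable that delegates to the entry set of an internal map, can be sketched as follows. This is a simplified illustration: the class ResultSketch is hypothetical, and String stands in for both Text and Parse.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical simplification of the ParseResult idea; String replaces Text and Parse.
class ResultSketch implements Iterable<Map.Entry<String, String>> {
    private final Map<String, String> parseMap = new HashMap<>();
    private final String originalUrl;

    ResultSketch(String originalUrl) {
        this.originalUrl = originalUrl;
    }

    void put(String url, String parse) {
        parseMap.put(url, parse);
    }

    String getOriginalUrl() {
        return originalUrl;
    }

    // Iteration is delegated to the map's entry set, so a for-each loop yields entries.
    @Override
    public Iterator<Map.Entry<String, String>> iterator() {
        return parseMap.entrySet().iterator();
    }

    public static void main(String[] args) {
        ResultSketch result = new ResultSketch("http://example.com/");
        result.put("http://example.com/", "page text");
        for (Map.Entry<String, String> e : result) {
            System.out.println(e.getKey() + " => " + e.getValue());
        }
    }
}
```

Because the class is Iterable, callers can write a for-each loop directly over the result, which is exactly how the map method consumes parseResult.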

Description of Parse and ParseImpl

Parse is an interface with three methods, as follows

    /** The textual content of the page. This is indexed, searched, and used
     * when generating snippets. */
    String getText();

    /** Other data extracted from the page. */
    ParseData getData();

    /** Indicates if the parse is coming from a url or a sub-url */
    boolean isCanonical();

ParseImpl implements both the Parse and Writable interfaces.

ParseImpl has three fields, shown below. isCanonical is passed in through the constructor and defaults to true.

    private ParseText text;

    private ParseData data;

    private boolean isCanonical; // whether this is the canonical parse
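The relationship between the interface and its implementation, including the default of true for isCanonical, can be sketched like this. The names ParseImplSketch, SimpleParse and SimpleParseImpl are hypothetical, getData() is left out, and a plain String replaces ParseText to keep the example self-contained.

```java
// Hypothetical, reduced model of Parse / ParseImpl; not the Nutch classes themselves.
class ParseImplSketch {

    // Simplified stand-in for the Parse interface (getData() is omitted).
    interface SimpleParse {
        String getText();
        boolean isCanonical();
    }

    static final class SimpleParseImpl implements SimpleParse {
        private final String text;
        private final boolean isCanonical;

        // Convenience constructor: a parse is canonical unless stated otherwise.
        SimpleParseImpl(String text) {
            this(text, true);
        }

        SimpleParseImpl(String text, boolean isCanonical) {
            this.text = text;
            this.isCanonical = isCanonical;
        }

        @Override
        public String getText() {
            return text;
        }

        @Override
        public boolean isCanonical() {
            return isCanonical;
        }
    }

    public static void main(String[] args) {
        SimpleParse main = new SimpleParseImpl("body text");
        SimpleParse sub = new SimpleParseImpl("sub-document text", false);
        System.out.println(main.isCanonical() + " " + sub.isCanonical());
    }
}
```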

Computing the digest used for deduplication

    // 12. Compute the new signature
    byte[] signature =
        SignatureFactory.getSignature(getConf()).calculate(content, parse);

    // 13. Store the signature (digest) as a hex string
    parse.getData().getContentMeta().set(Nutch.SIGNATURE_KEY,
        StringUtil.toHexString(signature));

Here "signature" means an identifying mark or distinctive feature: a digest that characterizes the page.

The getSignature method in the SignatureFactory class

This method checks whether ObjectCache already holds a Signature implementation. If not, it creates one via reflection, stores it in the cache, and returns it.

    /** Return the default Signature implementation. */
    public static Signature getSignature(Configuration conf) {
        // class name from the configuration; MD5Signature is the default
        String clazz = conf.get("db.signature.class", MD5Signature.class.getName());
        ObjectCache objectCache = ObjectCache.get(conf);
        Signature impl = (Signature) objectCache.getObject(clazz);
        if (impl == null) {
            try {
                if (LOG.isInfoEnabled()) {
                    LOG.info("Using Signature impl: " + clazz);
                }
                Class implClass = Class.forName(clazz);
                impl = (Signature) implClass.newInstance();
                impl.setConf(conf);
                objectCache.setObject(clazz, impl);
            } catch (Exception e) {
                throw new RuntimeException("Couldn't create " + clazz, e);
            }
        }
        return impl;
    }
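The look-up-then-create-by-reflection pattern used by getSignature can be shown with JDK classes only. This is a minimal sketch under stated assumptions: a ConcurrentHashMap plays the role of Nutch's ObjectCache, the class CachedFactorySketch is hypothetical, and getDeclaredConstructor().newInstance() is used in place of Class.newInstance(), which has been deprecated since Java 9.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of "cache lookup, else create via reflection".
class CachedFactorySketch {

    // Stand-in for ObjectCache: maps a class name to one shared instance.
    private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

    static Object getInstance(String className) {
        Object impl = CACHE.get(className);
        if (impl == null) {
            try {
                // Load the class by name and call its no-argument constructor.
                impl = Class.forName(className).getDeclaredConstructor().newInstance();
                CACHE.put(className, impl);
            } catch (Exception e) {
                throw new RuntimeException("Couldn't create " + className, e);
            }
        }
        return impl;
    }

    public static void main(String[] args) {
        Object first = getInstance("java.util.ArrayList");
        Object second = getInstance("java.util.ArrayList");
        System.out.println("same instance: " + (first == second));
    }
}
```

As in the original method, the check and the insert are two separate steps, so two threads racing on the first call could each create an instance; for a stateless object such as a signature calculator that is usually harmless.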

Important: the page's signature is ultimately computed by calling the calculate method of the Signature implementation. The code of the default implementation class, MD5Signature, is as follows

    /**
     * Default implementation of a page signature. It calculates an MD5 hash
     * of the raw binary content of a page. In case there is no content
     * it calculates a hash from the page's URL.
     *
     * @author Andrzej Bialecki
     */
    public class MD5Signature extends Signature {

        public byte[] calculate(Content content, Parse parse) {
            byte[] data = content.getContent();
            // fall back to the URL when the page has no content
            if (data == null) data = content.getUrl().getBytes();
            return MD5Hash.digest(data).getDigest();
        }
    }
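The same idea can be reproduced with the JDK alone. The sketch below assumes java.security.MessageDigest in place of Hadoop's MD5Hash and a hand-written hex conversion in place of StringUtil.toHexString; the class Md5SignatureSketch and its method signatures are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical JDK-only version of "MD5 of the content, or of the URL if there is none".
class Md5SignatureSketch {

    // Hashes the raw content; falls back to the URL bytes when content is null.
    static byte[] calculate(byte[] content, String url) {
        byte[] data = content;
        if (data == null) {
            data = url.getBytes(StandardCharsets.UTF_8);
        }
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform implementation.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Renders the digest as lowercase hex, two characters per byte.
    static String toHexString(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHexString(calculate(content, "http://example.com/")));
    }
}
```

Two pages with byte-identical content get the same signature regardless of their URLs, which is what makes the value usable for deduplication.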

(2) The reduce task

To be continued.

Multithreaded parsing

    // parsing runs on a separate thread with a time limit
    private ParseResult runParser(Parser p, Content content) {
        ParseCallable pc = new ParseCallable(p, content);
        Future<ParseResult> task = executorService.submit(pc);
        ParseResult res = null;
        try {
            // wait at most maxParseTime seconds for the parser to finish
            res = task.get(maxParseTime, TimeUnit.SECONDS);
        } catch (Exception e) {
            LOG.warn("Error parsing " + content.getUrl() + " with " + p, e);
            task.cancel(true);
        } finally {
            pc = null;
        }
        return res;
    }
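The submit-then-wait-with-timeout pattern of runParser can be tried out with a plain Callable. This is a minimal sketch: the class TimedParseSketch is hypothetical, the "parser" is any Callable returning a String, and the timeout is given in milliseconds so that the demonstration runs quickly.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of running a task with a time limit; not Nutch code.
class TimedParseSketch {

    // Returns the task's result, or null if it fails or does not finish in time.
    static String runWithTimeout(ExecutorService pool, Callable<String> parser,
                                 long timeoutMillis) {
        Future<String> task = pool.submit(parser);
        String res = null;
        try {
            res = task.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            // Timeout, interruption, or an exception thrown by the task itself:
            // ask the worker thread to stop and fall through with res == null.
            task.cancel(true);
        }
        return res;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            System.out.println(runWithTimeout(pool, () -> "parsed", 1000));
            System.out.println(runWithTimeout(pool, () -> {
                Thread.sleep(5000); // simulates a parser that hangs
                return "too late";
            }, 100));
        } finally {
            pool.shutdownNow(); // interrupt any worker that is still running
        }
    }
}
```

The benefit of this design is that a single slow or hanging document cannot stall the whole map task; the cost is that a cancelled task only stops if it responds to interruption.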

That is all for this look at how Nutch parses HTML documents. I hope the material above is of some help.
