This article walks through how Nutch parses HTML documents. The code below is annotated step by step; readers interested in Nutch internals can use it as a reference.
Description of the HTML-parsing MapReduce task
I. The main program calls
ParseSegment parseSegment = new ParseSegment(getConf());
if (!Fetcher.isParsing(job)) {
  parseSegment.parse(segs[0]);    // parse it, if needed
}
(1) The implementation of the isParsing method
public static boolean isParsing(Configuration conf) {
  return conf.getBoolean("fetcher.parse", true);
}
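In other words, whether parsing happens during fetching is controlled by the fetcher.parse property. A minimal sketch of how it could be disabled so that parsing runs as the separate ParseSegment job described here (property name taken from the code above):
Configuration conf = NutchConfiguration.create();
conf.setBoolean("fetcher.parse", false);   // Fetcher.isParsing(conf) now returns false,
                                           // so ParseSegment.parse(...) is invoked afterwards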
(2) The segs[0] parameter
Path[] segs = generator.generate(crawlDb, segments, -1, topN, System.currentTimeMillis());
The generate() method builds generatedSegments as follows. There are still some open questions here; we set them aside for now.
// read the subdirectories generated in tempDir and turn them into segments
// 1. create a list to hold the segment Paths
List<Path> generatedSegments = new ArrayList<Path>();
// 2. list the fetchlist segments generated above
FileStatus[] status = fs.listStatus(tempDir);
try {
  for (FileStatus stat : status) {
    Path subfetchlist = stat.getPath();
    if (!subfetchlist.getName().startsWith("fetchlist-"))
      continue;    // skip files that do not begin with "fetchlist-"
    // start a new partition job for this segment
    Path newSeg = partitionSegment(fs, segments, subfetchlist, numLists);
    // partitioning the segment produces a new segment directory
    generatedSegments.add(newSeg);
  }
} catch (Exception e) {
  LOG.warn("Generator: exception while partitioning segments, exiting ...");
  fs.delete(tempDir, true);
  return null;
}
II. Job task configuration
job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(ParseSegment.class);
job.setReducerClass(ParseSegment.class);
FileOutputFormat.setOutputPath(job, segment);
job.setOutputFormat(ParseOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(ParseImpl.class);
JobClient.runJob(job);
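This job is configured and submitted by ParseSegment.parse(), as shown in the driver code in section I. A minimal sketch of kicking it off for a single segment (the segment path below is illustrative):
Configuration conf = NutchConfiguration.create();
ParseSegment parseSegment = new ParseSegment(conf);
parseSegment.parse(new Path("crawl/segments/20240101123456"));   // configures and runs the job above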
III. Input and output of the map and reduce tasks
Map task input and output
Input:  WritableComparable / Content
Output: Text / ParseImpl
public void map(WritableComparable key, Content content,
                OutputCollector output, Reporter reporter)
Reduce task input and output
Input:  Text / Iterator
Output: Text / Writable
public void reduce(Text key, Iterator values,
                   OutputCollector output, Reporter reporter)
IV. Job input class: SequenceFileInputFormat
protected FileStatus[] listStatus(JobConf job) throws IOException {
  FileStatus[] files = super.listStatus(job);
  for (int i = 0; i < files.length; i++) {
    FileStatus file = files[i];
    if (file.isDir()) {        // it's a MapFile
      Path dataFile = new Path(file.getPath(), MapFile.DATA_FILE_NAME);
      FileSystem fs = file.getPath().getFileSystem(job);
      // use the data file
      files[i] = fs.getFileStatus(dataFile);
    }
  }
  return files;
}
public RecordReader getRecordReader(InputSplit split, JobConf job, Reporter reporter)
    throws IOException {
  reporter.setStatus(split.toString());
  return new SequenceFileRecordReader(job, (FileSplit) split);
}
V. Implementation of the map() and reduce() methods
(1) The map task: org.apache.nutch.parse.ParseSegment
public void map(WritableComparable key, Content content,
                OutputCollector output, Reporter reporter) throws IOException {
  // 1. convert on the fly from old UTF8 keys
  if (key instanceof Text) {
    newKey.set(key.toString());
    key = newKey;
  }
  // 2. get the fetch status (Nutch.FETCH_STATUS_KEY --> "_fst_")
  int status =
      Integer.parseInt(content.getMetadata().get(Nutch.FETCH_STATUS_KEY));
  // 3. if the fetch was not successful, skip this record
  if (status != CrawlDatum.STATUS_FETCH_SUCCESS) {
    // content not fetched successfully, skip document
    LOG.debug("Skipping " + key + " as content is not fetched successfully");
    return;
  }
  // 4. if truncated documents are to be skipped and this one is truncated, skip it as well
  if (skipTruncated && isTruncated(content)) {
    return;
  }
  ParseResult parseResult = null;
  try {
    // 5. create a ParseUtil object, call its parse() method, and get back a ParseResult
    parseResult = new ParseUtil(getConf()).parse(content);
  } catch (Exception e) {
    LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e));
    return;
  }
  // the code above does the parsing; the code below processes the parse result
  // ----------------------------------------------------------------------
  // 6. iterate over the parse results obtained in the previous step
  for (Entry<Text, Parse> entry : parseResult) {
    // 7. get the key and the value
    Text url = entry.getKey();
    Parse parse = entry.getValue();
    // 8. get the parse status
    ParseStatus parseStatus = parse.getData().getStatus();
    long start = System.currentTimeMillis();
    // 9. increment the counter
    reporter.incrCounter("ParserStatus", ParseStatus.majorCodes[parseStatus.getMajorCode()], 1);
    // 10. if parsing failed, replace parse with an empty parse
    if (!parseStatus.isSuccess()) {
      LOG.warn("Error parsing: " + key + ": " + parseStatus);
      parse = parseStatus.getEmptyParse(getConf());
    }
    // 11. pass the segment name to the parse data
    parse.getData().getContentMeta().set(Nutch.SEGMENT_NAME_KEY,
        getConf().get(Nutch.SEGMENT_NAME_KEY));
    // 12. compute the new signature
    byte[] signature =
        SignatureFactory.getSignature(getConf()).calculate(content, parse);
    // 13. store the signature (digest) value
    parse.getData().getContentMeta().set(Nutch.SIGNATURE_KEY,
        StringUtil.toHexString(signature));
    try {
      scfilters.passScoreAfterParsing(url, content, parse);
    } catch (ScoringFilterException e) {
      if (LOG.isWarnEnabled()) {
        LOG.warn("Error passing score: " + url + ": " + e.getMessage());
      }
    }
    long end = System.currentTimeMillis();
    LOG.info("Parsed (" + Long.toString(end - start) + "ms): " + url);
    output.collect(url, new ParseImpl(new ParseText(parse.getText()),
        parse.getData(), parse.isCanonical()));
  }
}
Characteristics of the ParseResult object
A. It implements the Iterable interface, so it can be iterated over; each iterated element is an Entry whose key is a Text and whose value is a Parse, as follows:
public class ParseResult implements Iterable<Entry<Text, Parse>>
The iterator method is as follows:
public Iterator<Entry<Text, Parse>> iterator() {
  return parseMap.entrySet().iterator();
}
As you can see, it simply delegates to the iterator of the underlying Map.
B. It holds a HashMap that stores the parse results, together with the current url. The fields are as follows:
private Map<Text, Parse> parseMap;
private String originalUrl;
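A minimal sketch of using this Iterable behavior outside the mapper, mirroring the loop in map() above (variable names are illustrative); note that one fetched document can yield several entries, one per sub-document:
for (Entry<Text, Parse> entry : parseResult) {
  Text url = entry.getKey();
  Parse parse = entry.getValue();
  if (parse.getData().getStatus().isSuccess()) {
    System.out.println(url + " -> " + parse.getText().length() + " chars of text");
  }
}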
Description of Parse and ParseImpl
Parse is an interface with three methods:
/** The textual (body) content of the page. This is indexed, searched, and used when generating snippets. */
// the body text of the page, which is indexed, searched, and used to generate snippets
String getText();
/** Other data extracted from the page. */
// other data extracted from the page
ParseData getData();
/** Indicates if the parse is coming from a url or a sub-url */
// indicates whether this Parse comes from the url itself or from a sub-url
boolean isCanonical();
ParseImpl implements the Parse and Writable interfaces.
ParseImpl has three fields, shown below; isCanonical is passed in at construction time and defaults to true.
private ParseText text;
private ParseData data;
private boolean isCanonical;
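A minimal sketch of how a ParseImpl is assembled, mirroring the output.collect(...) call in the map() method above:
ParseImpl value = new ParseImpl(new ParseText(parse.getText()),
                                parse.getData(),
                                parse.isCanonical());
output.collect(url, value);   // Text key / ParseImpl value, as declared in the job configuration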
Calculation of the digest used for de-duplication
// 12. compute the new signature
byte[] signature =
    SignatureFactory.getSignature(getConf()).calculate(content, parse);
// 13. store the signature (digest) value
parse.getData().getContentMeta().set(Nutch.SIGNATURE_KEY,
    StringUtil.toHexString(signature));
The getSignature method of the SignatureFactory class
This method checks the ObjectCache for an existing Signature implementation; if there is none, it creates one via reflection, caches it, and returns it.
/** Return the default Signature implementation. */
public static Signature getSignature(Configuration conf) {
  String clazz = conf.get("db.signature.class", MD5Signature.class.getName());
  ObjectCache objectCache = ObjectCache.get(conf);
  Signature impl = (Signature) objectCache.getObject(clazz);
  if (impl == null) {
    try {
      if (LOG.isInfoEnabled()) {
        LOG.info("Using Signature impl: " + clazz);
      }
      Class implClass = Class.forName(clazz);
      impl = (Signature) implClass.newInstance();
      impl.setConf(conf);
      objectCache.setObject(clazz, impl);
    } catch (Exception e) {
      throw new RuntimeException("Couldn't create " + clazz, e);
    }
  }
  return impl;
}
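Because the implementation class is looked up through the db.signature.class property (see the first line of the method above), a different Signature can be plugged in without code changes. A minimal sketch, assuming the TextProfileSignature implementation is available in your Nutch version:
Configuration conf = NutchConfiguration.create();
conf.set("db.signature.class", "org.apache.nutch.crawl.TextProfileSignature");   // assumed class name
Signature sig = SignatureFactory.getSignature(conf);
byte[] digest = sig.calculate(content, parse);   // content and parse as in the map() step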
* Important
The page fingerprint is ultimately computed by the calculate method of a Signature implementation. Below is the code of the default implementation, MD5Signature.
/**
 * Default implementation of a page signature.
 * It calculates an MD5 hash of the raw binary content of a page.
 * In case there is no content, it calculates a hash from the page's URL.
 *
 * @author Andrzej Bialecki
 */
public class MD5Signature extends Signature {
  public byte[] calculate(Content content, Parse parse) {
    byte[] data = content.getContent();
    if (data == null) data = content.getUrl().getBytes();
    return MD5Hash.digest(data).getDigest();
  }
}
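For comparison, here is a purely illustrative sketch (not part of Nutch) of a custom Signature that hashes the parsed text instead of the raw bytes, so pages that differ only in markup collapse to the same digest; it would be enabled via db.signature.class as shown earlier:
import org.apache.hadoop.io.MD5Hash;
import org.apache.nutch.crawl.Signature;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;

public class TextDigestSignature extends Signature {   // hypothetical class
  public byte[] calculate(Content content, Parse parse) {
    String text = (parse != null) ? parse.getText() : null;
    byte[] data = (text != null && text.length() > 0)
        ? text.getBytes()
        : content.getUrl().getBytes();                  // fall back to the URL
    return MD5Hash.digest(data).getDigest();
  }
}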
(2) The reduce task
(to be continued)
Multithreaded parsing
// parsing runs in a separate thread so that a maximum parse time can be enforced
private ParseResult runParser(Parser p, Content content) {
  ParseCallable pc = new ParseCallable(p, content);
  Future<ParseResult> task = executorService.submit(pc);
  ParseResult res = null;
  try {
    res = task.get(maxParseTime, TimeUnit.SECONDS);
  } catch (Exception e) {
    LOG.warn("Error parsing " + content.getUrl() + " with " + p, e);
    task.cancel(true);
  } finally {
    pc = null;
  }
  return res;
}
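The ParseCallable class itself is not shown in this article; below is a hypothetical sketch of what it likely looks like, so that runParser() can enforce maxParseTime through Future.get(timeout). The Parser.getParse(Content) entry point is an assumption.
import java.util.concurrent.Callable;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

class ParseCallable implements Callable<ParseResult> {
  private final Parser p;
  private final Content content;

  ParseCallable(Parser p, Content content) {
    this.p = p;
    this.content = content;
  }

  public ParseResult call() throws Exception {
    return p.getParse(content);   // assumed parser entry point
  }
}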
That concludes this walkthrough of how Nutch parses HTML documents. Hopefully the annotated code above is helpful; if you found it useful, feel free to share it.