2025-02-25 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 05/31 Report
This article walks through how Apache Nutch processes the generate.max.count parameter, which caps the number of URLs per host (or domain) in a fetchlist segment.
The generate.max.count parameter is handled in Selector, an inner class of org.apache.nutch.crawl.Generator.
Declaration of the related variables in org.apache.nutch.crawl.Generator:

    private HashMap<String, int[]> hostCounts = new HashMap<String, int[]>();
    private int maxCount;

In the configure method of the inner class Selector:

    maxCount = job.getInt(GENERATOR_MAX_COUNT, -1);
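The lookup above returns the fallback value -1 when generate.max.count is not set. The following is a minimal stand-alone sketch of that behaviour. It assumes that the constant GENERATOR_MAX_COUNT holds the string "generate.max.count", and it uses java.util.Properties as a stand-in for the Hadoop job configuration; the class MaxCountConfigSketch and its getInt helper are hypothetical and are not part of Nutch or Hadoop.

```java
import java.util.Properties;

/** Hypothetical stand-in for the configuration lookup; not Nutch or Hadoop code. */
class MaxCountConfigSketch {

  // Assumption: the property name behind the GENERATOR_MAX_COUNT constant.
  static final String GENERATOR_MAX_COUNT = "generate.max.count";

  /** Returns the integer value of a property, or defaultValue when it is unset. */
  static int getInt(Properties conf, String name, int defaultValue) {
    String value = conf.getProperty(name);
    return value == null ? defaultValue : Integer.parseInt(value.trim());
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // Property not set: the fallback value is used.
    System.out.println(getInt(conf, GENERATOR_MAX_COUNT, -1));

    // Property set: the configured value is used.
    conf.setProperty(GENERATOR_MAX_COUNT, "100");
    System.out.println(getInt(conf, GENERATOR_MAX_COUNT, -1));
  }
}
```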
Processing in the reduce method:

    // 1. Get the int[] for this host (or domain). If it is null, create the
    //    array and put it into the map. hostCount[0] is the segment the host
    //    is currently assigned to, hostCount[1] the number of its URLs seen
    //    in that segment.
    int[] hostCount = hostCounts.get(hostordomain);
    if (hostCount == null) {
      hostCount = new int[] {1, 0};
      hostCounts.put(hostordomain, hostCount);
    }

    // increment hostCount
    hostCount[1]++;

    // 2. Check whether topN has been reached: while the segment this host
    //    points to is already full (segCounts holds the number of entries
    //    per segment), select the next segment.
    while (segCounts[hostCount[0] - 1] >= limit
        && hostCount[0] < maxNumSegments) {
      hostCount[0]++;
      hostCount[1] = 0;
    }

    // reached the limit of allowed URLs per host / domain
    // see if we can put it in the next segment?
    if (hostCount[1] >= maxCount) {
      if (hostCount[0] < maxNumSegments) {
        hostCount[0]++;
        hostCount[1] = 0;
      } else {
        if (hostCount[1] == maxCount + 1 && LOG.isInfoEnabled()) {
          LOG.info("Host or domain " + hostordomain + " has more than "
              + maxCount + " URLs for all " + maxNumSegments
              + " segments. Additional URLs won't be included in the fetchlist.");
        }
        // skip this entry
        continue;
      }
    }
    entry.segnum = new IntWritable(hostCount[0]);
    segCounts[hostCount[0] - 1]++;

That is how the generate.max.count parameter is processed.
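To see the effect of these rules on a small input, the per-host logic can be lifted into a stand-alone sketch. The class HostLimitSketch, the method assign, and the use of 0 to mark a skipped entry are hypothetical names chosen for illustration. The sketch mirrors only the counting logic of the listing above and assumes maxCount > 0; URL parsing, logging, and the rest of the reduce loop are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-alone sketch of the per-host segment assignment; not Nutch code. */
class HostLimitSketch {

  /**
   * Returns, for each input host in order, the 1-based segment number it is
   * assigned to, or 0 when the entry is skipped.
   */
  static List<Integer> assign(List<String> hosts, int maxCount,
                              int maxNumSegments, long limit) {
    if (maxCount <= 0) {
      throw new IllegalArgumentException("this sketch assumes maxCount > 0");
    }
    // host -> {current segment number, URLs counted in that segment}
    Map<String, int[]> hostCounts = new HashMap<>();
    // number of entries already placed in each segment
    int[] segCounts = new int[maxNumSegments];
    List<Integer> result = new ArrayList<>();

    for (String host : hosts) {
      int[] hostCount = hostCounts.computeIfAbsent(host, h -> new int[] {1, 0});
      hostCount[1]++;

      // the segment is full (topN reached): move this host to the next one
      while (segCounts[hostCount[0] - 1] >= limit
          && hostCount[0] < maxNumSegments) {
        hostCount[0]++;
        hostCount[1] = 0;
      }

      // per-host limit reached: try the next segment, otherwise skip
      if (hostCount[1] >= maxCount) {
        if (hostCount[0] < maxNumSegments) {
          hostCount[0]++;
          hostCount[1] = 0;
        } else {
          result.add(0); // skipped entry
          continue;
        }
      }
      result.add(hostCount[0]);
      segCounts[hostCount[0] - 1]++;
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> hosts =
        Arrays.asList("a.com", "a.com", "a.com", "b.com", "a.com", "a.com");
    // maxCount = 2, maxNumSegments = 2, limit = 100
    System.out.println(assign(hosts, 2, 2, 100)); // prints [1, 2, 2, 1, 0, 0]
  }
}
```

In this run the second a.com URL already moves to segment 2, because the comparison is `>=` and the counter is reset to 0 on the move. Once a.com has used up the last segment, its remaining URLs are skipped.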
© 2024 shulou.com SLNews company. All rights reserved.