This article explains the common crawling strategies used by web crawlers. The explanations are kept simple and clear, so let's walk through them one by one.
1. Breadth-first traversal strategy (Breadth First).
Links found in a newly downloaded page are simply appended to the end of the URL queue to be crawled; this is the core of breadth-first traversal. In other words, this method does not propose or use any measure of page importance: it mechanically extracts the links from each newly downloaded page and appends them to the queue as URLs to download, as sketched below.
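The following is a minimal, self-contained sketch of this idea in Python, using the standard library only. The `bfs_crawl` function and its `max_pages` parameter are illustrative names, not part of any real crawler framework.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def bfs_crawl(seed_urls, max_pages=50):
    queue = deque(seed_urls)      # FIFO queue of URLs to be crawled
    seen = set(seed_urls)         # avoid enqueueing the same URL twice
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()     # oldest URL first: breadth-first order
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue              # skip unreachable pages
        crawled.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        # The core of breadth-first traversal: new links are simply
        # appended to the END of the queue, with no importance measure.
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return crawled
```

Because the queue is strictly first-in, first-out, pages closer to the seed URLs are always downloaded before pages farther away.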
2. OCIP strategy (Online Page Importance Computation).
It can be regarded as an improved PageRank algorithm. Before the algorithm starts, every page is given the same amount of "cash". Whenever a page P is downloaded, P distributes its cash evenly among the pages it links to and empties its own account. The pages in the URL queue to be crawled are then sorted by the amount of cash they hold, and the page with the most cash is downloaded first (see the sketch below).
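Below is a minimal sketch of this "cash" bookkeeping, assuming a hypothetical in-memory link graph (a dict mapping each URL to its outlinks). A real crawler would only discover a page's outlinks after downloading it; the toy graph here just makes the sketch runnable.

```python
def ocip_crawl(pages, start_cash=1.0, max_downloads=10):
    # Every page starts with the same amount of "cash".
    cash = {url: start_cash for url in pages}
    downloaded = []
    frontier = set(pages)

    while frontier and len(downloaded) < max_downloads:
        # Always download the frontier page holding the most cash.
        url = max(frontier, key=lambda u: cash[u])
        frontier.remove(url)
        downloaded.append(url)

        outlinks = pages.get(url, [])
        if outlinks:
            # Distribute this page's cash evenly among its outlinks,
            # then empty its own account -- no iteration is needed.
            share = cash[url] / len(outlinks)
            for link in outlinks:
                cash[link] = cash.get(link, 0.0) + share
                if link not in downloaded:
                    frontier.add(link)
        cash[url] = 0.0
    return downloaded

# Hypothetical toy graph: url -> list of outlinks
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(ocip_crawl(graph))
```

Note that each download triggers only one local cash transfer, which is why no global iterative computation is required.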
OCIP and PageRank share the same broad framework. The differences are: PageRank requires iterative computation each time, while OCIP needs no iteration, so it is much faster to compute and therefore suitable for real-time use; and PageRank's computation includes random jumps to pages that are not linked from the current page, while OCIP has no such factor. Experiments show that OCIP is a good importance measure, performing slightly better than the breadth-first traversal strategy.
3. Larger-sites-first strategy (Larger Sites First).
The idea of the larger-sites-first strategy is very straightforward: measure page importance at the site level. Pages in the URL queue to be crawled are grouped by the website they belong to, and whichever site has the most pages waiting to be downloaded has its links downloaded first. The basic idea is to favor large websites, because large websites usually contain more pages. Considering that large websites are often run by well-known organizations and their pages are generally of high quality, this idea is simple but has a reasonable basis. A sketch of this ordering follows.
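Here is a minimal sketch of the ordering just described, again using only the standard library. The function name `larger_sites_first` and the example URLs are illustrative assumptions.

```python
from collections import defaultdict
from urllib.parse import urlparse

def larger_sites_first(url_queue):
    # Group the pending URL queue by the site (host) each URL belongs to.
    by_site = defaultdict(list)
    for url in url_queue:
        by_site[urlparse(url).netloc].append(url)
    # Sites with the most pages waiting to be downloaded come first.
    for site in sorted(by_site, key=lambda s: len(by_site[s]), reverse=True):
        for url in by_site[site]:
            yield url

queue = [
    "https://big.example.com/1",
    "https://small.example.org/1",
    "https://big.example.com/2",
    "https://big.example.com/3",
]
print(list(larger_sites_first(queue)))
# big.example.com's three pages are scheduled before small.example.org's one
```

In practice the grouping would be recomputed as the queue changes, but the ranking criterion stays the same: pending page count per site.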
Thank you for reading. That covers the common crawling methods of web crawlers. After studying this article, you should have a deeper understanding of them, though how they behave in specific situations still needs to be verified in practice. More articles on related topics will follow; you are welcome to keep reading!