2025-04-04 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article introduces how to improve the collection rate of crawler capture. The situations below come up often in real projects, so let's walk through how to handle them. We hope you read carefully and come away with something useful!
1. Minimize the number of visits to the website.
A crawler spends most of its time waiting for network responses, so minimizing the number of requests reduces your own workload, eases the pressure on the target site, and lowers the risk of being blocked.
The first thing to do is simplify the crawl flow as much as possible and avoid fetching the same pages more than once.
Then deduplicate: typically the URL or ID serves as the unique key, and anything already seen is not crawled again.
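The dedup step above can be sketched with an in-memory fingerprint set. This is a minimal illustration, not code from the article; the class and method names are hypothetical, and a real crawler would persist the set (e.g. in Redis) to survive restarts.

```python
import hashlib

class SeenFilter:
    """Tracks fingerprints of URLs already crawled so repeats are skipped."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url: str) -> bool:
        # Hash the URL so the set stores fixed-size fingerprints
        # instead of arbitrarily long strings.
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True

f = SeenFilter()
print(f.is_new("https://example.com/item/1"))  # first visit: True
print(f.is_new("https://example.com/item/1"))  # repeat: False, skip it
```

Before enqueueing a URL, the crawler calls `is_new` and drops the request when it returns `False`.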
2. Distributed crawler: even after exhausting every optimization, the number of pages a single machine can crawl per unit time is still limited.
Faced with a huge queue of pages, a single machine simply takes too long, so you trade machines for time: that is the distributed crawler.
Distribution is not the essence of crawling, nor is it always necessary. For tasks that are independent of each other and need no coordination, you can partition the work manually and run it on several machines; each machine does less, and the total time drops roughly in proportion to the machine count.
For example, to crawl 2,000,000 web pages, you can split them across five machines into non-overlapping shares of 400,000 each, cutting the time to roughly one fifth of what a single machine would take.
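The manual partitioning above can be sketched as a helper that divides an ID range into contiguous, non-overlapping chunks, one per machine. The function name is illustrative, not from the article.

```python
def split_tasks(total: int, machines: int) -> list[tuple[int, int]]:
    """Divide `total` page IDs into `machines` contiguous half-open
    ranges [start, end) with no gaps and no overlap."""
    size = total // machines
    ranges = []
    start = 0
    for i in range(machines):
        # The last machine absorbs any remainder.
        end = total if i == machines - 1 else start + size
        ranges.append((start, end))
        start = end
    return ranges

# Five contiguous ranges of 400,000 pages each.
print(split_tasks(2_000_000, 5))
```

Each machine then crawls only the IDs in its own range, with no communication needed.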
If the machines do need to coordinate, for example because the queue of pages to crawl changes with every fetch, then pre-partitioned tasks would overlap, and only true distribution works: one Master holds the queue, and multiple Slaves pull from that shared queue, so that even when they compete for tasks, no URL is extracted twice. Scrapy-redis is a widely used distributed crawler framework built on this model.
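In Scrapy-redis, the Master/Slave sharing described above is configured rather than hand-coded: the scheduler keeps the request queue in Redis and the duplicate filter shares fingerprints there too, so every worker pulls from one queue. A minimal `settings.py` sketch (the Redis URL is a placeholder for your own deployment):

```python
# settings.py for a Scrapy project using scrapy-redis

# Store the request queue in Redis so all workers share it.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share request fingerprints in Redis so no URL is fetched twice
# across machines.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprints across restarts instead of clearing them.
SCHEDULER_PERSIST = True

# Location of the shared Redis instance (placeholder value).
REDIS_URL = "redis://localhost:6379"
```

With this in place, starting the same spider on several machines makes them all consume the one Redis-backed queue.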
This is the end of "how to improve the collection rate of crawler capture". Thank you for reading; if you want to learn more about the industry, follow this site for more practical articles!
© 2024 shulou.com SLNews company. All rights reserved.