This article explains how to adapt Scrapy for large-scale crawling across many sites. The methods described below are simple, fast, and practical, so let's walk through them one by one.
Modify the scheduling queue
Scrapy's default scheduling queue is scrapy.pqueues.ScrapyPriorityQueue, which works well for a focused crawler targeting a single site. For a broad crawler that visits many sites, switch to scrapy.pqueues.DownloaderAwarePriorityQueue. Add one line to settings.py:
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
Increase concurrency
Add configuration to settings.py:
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 100
In practice, however, concurrency is limited by memory and CPU, so it is best to benchmark and choose the value that suits your hardware.
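When benchmarking different values, it can be convenient to override these settings for a single spider instead of editing settings.py every time. Below is a minimal sketch using Scrapy's custom_settings attribute; the spider name and URL are placeholders:

import scrapy

class ExampleSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate per-spider overrides
    name = 'example'
    start_urls = ['https://example.com']

    # custom_settings takes precedence over settings.py for this spider only,
    # which makes it easy to test different concurrency levels
    custom_settings = {
        'CONCURRENT_REQUESTS': 50,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 50,
    }

    def parse(self, response):
        pass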
Increase Twisted IO thread pool size
Scrapy performs DNS resolution in blocking threads, so the larger the request volume, the more DNS lookups become a bottleneck. To avoid this, increase the size of the Twisted thread pool. Add a configuration to settings.py:
REACTOR_THREADPOOL_MAXSIZE = 20
Build a dedicated DNS server
If you run many crawler processes with high concurrency, the combined lookups can amount to a DoS attack on the DNS server. It is therefore recommended to set up a dedicated DNS server of your own.
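If running your own resolver is not practical, Scrapy's built-in DNS cache can at least reduce the pressure of repeated lookups. These are standard Scrapy settings and also go in settings.py; the values shown are only illustrative:

# Cache resolved hostnames in memory (enabled by default)
DNSCACHE_ENABLED = True
# How many hostnames to keep in the cache
DNSCACHE_SIZE = 10000
# Seconds to wait for a DNS query before giving up
DNS_TIMEOUT = 60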
Reduce log volume
Scrapy logs at DEBUG level by default, and each crawl generates a large volume of logs. Raising the log level to INFO reduces this significantly. Add a line to settings.py:
LOG_LEVEL = 'INFO'
Disable Cookies and automatic retry
Large-scale crawlers generally do not need cookies, so they can be disabled. Automatically retrying failed requests also slows the crawler down, and because a large-scale crawl covers so many sites, individual failed requests are not worth retrying. Modify settings.py:
COOKIES_ENABLED = False
RETRY_ENABLED = False
Reduce the request timeout and disable redirects
Some websites respond very slowly, either because they are far away (for example, overseas) or because the connection is unstable. Give up on such sites quickly so they do not drag down the crawling of other sites.
Disabling redirects also speeds up page fetching.
DOWNLOAD_TIMEOUT = 10
REDIRECT_ENABLED = False
Use breadth-first search
By default, Scrapy crawls in depth-first order (DFO). For large-scale crawlers, breadth-first order (BFO) is generally preferred:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
If you find that the crawler consumes a lot of memory while running much more slowly than the concurrency you configured, consider whether a memory leak has occurred.
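Scrapy ships a small trackref utility for exactly this kind of diagnosis: it reports how many Request, Response, Item, Spider and Selector objects are still alive. A minimal sketch (the same function is available as prefs() in the Scrapy telnet console):

from scrapy.utils.trackref import print_live_refs, get_oldest

# Print counts of live Request/Response/Item/Spider/Selector objects;
# counts that keep growing during a crawl usually point at a leak
print_live_refs()

# Fetch the oldest live object of a given class to inspect what is
# still holding a reference to it
oldest_response = get_oldest('HtmlResponse')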
At this point, you should have a deeper understanding of how to transform Scrapy for multi-site, large-scale crawling, so why not try it out in practice!