This article explains how to adapt Scrapy for large-scale crawling across many sites. The methods described below are simple, fast, and practical, so let's walk through them one by one.
Modify the scheduling queue
Scrapy's default scheduling queue is scrapy.pqueues.ScrapyPriorityQueue, which works well for a focused crawler targeting a single site. For a broad crawler that visits many sites, switch to scrapy.pqueues.DownloaderAwarePriorityQueue. Add one line to settings.py:
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
Increase concurrency
Add configuration to settings.py:
CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 100
In practice, however, concurrency is limited by memory and CPU, so it is best to benchmark and choose the value that suits your hardware.
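When benchmarking different values, it can be convenient to override these settings for a single spider instead of editing settings.py every time. Below is a minimal sketch using Scrapy's custom_settings attribute; the spider name and URL are placeholders:

import scrapy

class ExampleSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate per-spider overrides
    name = 'example'
    start_urls = ['https://example.com']

    # custom_settings takes precedence over settings.py for this spider only,
    # which makes it easy to test different concurrency levels
    custom_settings = {
        'CONCURRENT_REQUESTS': 50,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 50,
    }

    def parse(self, response):
        pass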
Increase Twisted IO thread pool size
Scrapy performs DNS resolution in blocking threads, so the larger the request volume, the more DNS lookups become a bottleneck. To avoid this, increase the size of the Twisted thread pool. Add a configuration to settings.py:
REACTOR_THREADPOOL_MAXSIZE = 20
Build a dedicated DNS server
If you run many crawler processes with high concurrency, the combined lookups can amount to a DoS attack on the DNS server. It is therefore recommended to set up a dedicated DNS server of your own.
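If running your own resolver is not practical, Scrapy's built-in DNS cache can at least reduce the pressure of repeated lookups. These are standard Scrapy settings and also go in settings.py; the values shown are only illustrative:

# Cache resolved hostnames in memory (enabled by default)
DNSCACHE_ENABLED = True
# How many hostnames to keep in the cache
DNSCACHE_SIZE = 10000
# Seconds to wait for a DNS query before giving up
DNS_TIMEOUT = 60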
Reduce log volume
Scrapy logs at DEBUG level by default, and each crawl generates a large volume of logs. Raising the log level to INFO reduces this significantly. Add a line to settings.py:
LOG_LEVEL = 'INFO'
Disable Cookies and automatic retry
Large-scale crawlers generally do not need cookies, so they can be disabled. Automatically retrying failed requests also slows the crawler down, and because a large-scale crawl covers so many sites, individual failed requests are not worth retrying. Modify settings.py:
COOKIES_ENABLED = False
RETRY_ENABLED = False
Reduce the request timeout and disable redirects
Some websites respond very slowly, either because they are far away (for example, overseas) or because the connection is unstable. Give up on such sites quickly so they do not drag down the crawling of other sites.
Disabling redirects also speeds up page fetching.
DOWNLOAD_TIMEOUT = 10
REDIRECT_ENABLED = False
Use breadth-first search
By default, Scrapy crawls in depth-first order (DFO). For large-scale crawlers, breadth-first order (BFO) is generally preferred:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
If you find that the crawler consumes a lot of memory while running much more slowly than the concurrency you configured, consider whether a memory leak has occurred.
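Scrapy ships a small trackref utility for exactly this kind of diagnosis: it reports how many Request, Response, Item, Spider and Selector objects are still alive. A minimal sketch (the same function is available as prefs() in the Scrapy telnet console):

from scrapy.utils.trackref import print_live_refs, get_oldest

# Print counts of live Request/Response/Item/Spider/Selector objects;
# counts that keep growing during a crawl usually point at a leak
print_live_refs()

# Fetch the oldest live object of a given class to inspect what is
# still holding a reference to it
oldest_response = get_oldest('HtmlResponse')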
At this point, you should have a deeper understanding of how to transform Scrapy for multi-site, large-scale crawling, so why not try it out in practice!