
How to quickly understand Scrapy distributed crawlers, queues and Bloom filters


This article introduces how to quickly understand Scrapy distributed crawlers, queues, and Bloom filters. Many people run into these problems in real-world projects, so let me walk you through how to handle them. I hope you read it carefully and come away with something!

Get started quickly

Step 0:

Install Scrapy-Distributed first:

pip install scrapy-distributed

If you don't have RabbitMQ and RedisBloom available locally, you can start two Docker containers for testing:

# Pull and run a RabbitMQ container.
docker run -d --name rabbitmq -p 0.0.0.0:15672:15672 -p 0.0.0.0:5672:5672 rabbitmq:3
# Pull and run a RedisBloom container.
docker run -d --name redis-redisbloom -p 0.0.0.0:6379:6379 redislabs/rebloom:latest

Step 1 (optional):

If you have a ready-made crawler, you can skip this step and go straight to Step 2.

To create a crawler project, I'll take a sitemap crawler as an example:

scrapy startproject simple_example

Then modify the crawler file under the spiders folder:

from scrapy_distributed.spiders.sitemap import SitemapSpider
from scrapy_distributed.queues.amqp import QueueConfig
from scrapy_distributed.dupefilters.redis_bloom import RedisBloomConfig


class MySpider(SitemapSpider):
    name = "example"
    sitemap_urls = ["http://www.people.com.cn/robots.txt"]
    queue_conf: QueueConfig = QueueConfig(
        name="example",
        durable=True,
        arguments={"x-queue-mode": "lazy", "x-max-priority": 255},
    )
    redis_bloom_conf: RedisBloomConfig = RedisBloomConfig(key="example:dupefilter")

    def parse(self, response):
        self.logger.info(f"parse response, url: {response.url}")

Step 2:

Just modify SCHEDULER and DUPEFILTER_CLASS in the settings.py configuration file and add the relevant RabbitMQ and Redis settings, and you immediately get a distributed crawler. Scrapy-Distributed will initialize a default-configured RabbitMQ queue and a default-configured RedisBloom Bloom filter for you.

# A Scheduler that integrates both RabbitMQ and RedisBloom.
# If you only want RabbitMQ's scheduler, use
# scrapy_distributed.schedulers.amqp.RabbitScheduler instead.
SCHEDULER = "scrapy_distributed.schedulers.DistributedScheduler"
SCHEDULER_QUEUE_CLASS = "scrapy_distributed.queues.amqp.RabbitQueue"
RABBITMQ_CONNECTION_PARAMETERS = "amqp://guest:guest@localhost:5672/example/?heartbeat=0"
DUPEFILTER_CLASS = "scrapy_distributed.dupefilters.redis_bloom.RedisBloomDupeFilter"
BLOOM_DUPEFILTER_REDIS_URL = "redis://:@localhost:6379/0"
BLOOM_DUPEFILTER_REDIS_HOST = "localhost"
BLOOM_DUPEFILTER_REDIS_PORT = 6379
# Client configuration for Redis Bloom.
REDIS_BLOOM_PARAMS = {"redis_cls": "redisbloom.client.Client"}
# Bloom filter error rate; if not set, it defaults to 0.001.
BLOOM_DUPEFILTER_ERROR_RATE = 0.001
# Bloom filter capacity; if not set, it defaults to 100_0000.
BLOOM_DUPEFILTER_CAPACITY = 100_0000
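One detail worth noting in the connection URL above: the ?heartbeat=0 query parameter disables AMQP heartbeats (standard URL-parameter behavior in clients such as pika), which helps keep the broker from dropping the connection while the crawler is tied up in long-running work.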

You can also add two class attributes to your Spider class to initialize your RabbitMQ queue or RedisBloom Bloom filter:

class MySpider(SitemapSpider):
    ...
    # More queue parameters can be set through `arguments`; this example
    # enables lazy mode and sets the maximum priority value.
    queue_conf: QueueConfig = QueueConfig(
        name="example",
        durable=True,
        arguments={"x-queue-mode": "lazy", "x-max-priority": 255},
    )
    # key, error_rate, and capacity set the Bloom filter's Redis key,
    # false-positive rate, and capacity.
    redis_bloom_conf: RedisBloomConfig = RedisBloomConfig(
        key="example:dupefilter", error_rate=0.001, capacity=100_0000
    )
    ...

Step 3:

scrapy crawl example

Check your RabbitMQ queue and RedisBloom filter to see if they are working properly.
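If you'd rather verify from code than from the management UIs, here is a minimal sketch, assuming the settings shown above (queue "example" on vhost "example", Bloom filter key "example:dupefilter"); pika and redisbloom are the client libraries the settings already reference, but the exact items stored in the filter depend on the dupefilter's internals, so treat the lookup as illustrative.

import pika
from redisbloom.client import Client

# Connect with the same URL as RABBITMQ_CONNECTION_PARAMETERS (vhost "example").
conn = pika.BlockingConnection(
    pika.URLParameters("amqp://guest:guest@localhost:5672/example")
)
channel = conn.channel()
# passive=True only checks that the queue exists; the declare-ok reply
# carries the current message count.
declare_ok = channel.queue_declare(queue="example", passive=True)
print(f"queue 'example' holds {declare_ok.method.message_count} messages")
conn.close()

# Query the Bloom filter key created by the dupefilter. Whether it stores
# raw URLs or request fingerprints is an implementation detail.
rb = Client(host="localhost", port=6379)
print(rb.bfExists("example:dupefilter", "http://www.people.com.cn/robots.txt"))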

As you can see, with Scrapy-Distributed we only needed to modify the configuration file to turn an ordinary crawler into a distributed crawler backed by RabbitMQ queues and RedisBloom Bloom filters. With RabbitMQ and RedisBloom environments already in place, the configuration changes take only a minute.

About Scrapy-Distributed

Currently, Scrapy-Distributed draws mainly on two libraries: Scrapy-Redis and scrapy-rabbitmq.

If you have any experience with Scrapy, you probably know Scrapy-Redis, a library that makes building distributed crawlers very quick, and if you have ever tried to use RabbitMQ as a crawler task queue, you may have seen the scrapy-rabbitmq project. Scrapy-Redis is indeed very convenient, and scrapy-rabbitmq does let RabbitMQ serve as a task queue, but both have shortcomings, so let me briefly raise a few problems here.

Scrapy-Redis uses a Redis set for deduplication, so memory consumption grows with the number of links, which makes it unsuitable for distributed crawlers with very large task volumes (see the back-of-the-envelope sketch after this list).

Scrapy-Redis uses a Redis list as its queue. In many scenarios tasks pile up, which consumes excessive memory. For example, when we crawl a site's sitemap, links are enqueued much faster than they are dequeued.

Scrapy components for RabbitMQ, such as scrapy-rabbitmq, do not expose the various parameters RabbitMQ supports when declaring queues, so you cannot control things like queue durability.

The Scheduler in RabbitMQ-based frameworks such as scrapy-rabbitmq does not support a distributed dupefilter, so users must develop or integrate one themselves.

Frameworks such as Scrapy-Redis and scrapy-rabbitmq are intrusive: to build a distributed crawler with them, you must modify your own crawler code and inherit from the framework's Spider class.
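To put the memory argument in concrete terms, here is a back-of-the-envelope sketch (my own illustration, not part of either library). A Bloom filter sized for n items at false-positive rate p needs roughly m = -n * ln(p) / (ln 2)^2 bits, while a Redis set must hold every 40-character sha1 fingerprint that Scrapy generates:

import math

def bloom_bits(n: int, p: float) -> int:
    """Approximate bits a Bloom filter needs for n items at
    false-positive rate p: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))

n = 100_000_000  # 100 million links
p = 0.001        # the default error rate used above
bloom_mb = bloom_bits(n, p) / 8 / 1024 / 1024
# Scrapy-Redis stores each request fingerprint as a 40-char sha1 hex
# string in a set; 40 bytes per link is a lower bound that ignores
# Redis's per-member overhead.
set_mb = n * 40 / 1024 / 1024
print(f"Bloom filter: ~{bloom_mb:,.0f} MB, Redis set: >{set_mb:,.0f} MB")

At the default error rate of 0.001 this works out to roughly 1.8 bytes per link versus at least 40, which is why Bloom filters suit very large crawls.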

Hence Scrapy-Distributed was born. With its non-intrusive design, you only need to modify the configuration in settings.py, and the framework can distribute your crawler using its default configuration.

To address some of the pain points in Scrapy-Redis and scrapy-rabbitmq, Scrapy-Distributed did the following:

It uses RedisBloom's Bloom filter for deduplication, which takes up far less memory.

It supports every parameter of the RabbitMQ queue declaration, so queues can run in lazy mode, which reduces memory footprint.

RabbitMQ's queue declaration is more flexible, and different crawlers can use the same queue configuration or different queue configurations.

The Scheduler is designed so that components can be combined: you can use RabbitMQ's Scheduler module on its own or pair it with RedisBloom's DupeFilter.

It makes distributing Scrapy non-invasive: an ordinary crawler becomes distributed just by changing the configuration.

That's all for "how to quickly understand Scrapy distributed crawlers, queues, and Bloom filters". Thank you for reading. If you want to learn more about the industry, you can follow the site, where the editor will keep publishing practical articles!
