Updated 2025-01-17. From: SLTechnology News&Howtos (shulou), Internet Technology
Shulou(Shulou.com)06/01 Report--
This article explains how to set up simple distributed crawling with scrapy-redis. Many newcomers are unsure where to start, so it summarizes why the library is useful and the steps needed to configure it.
Re-collecting the same content every time the project restarts is wasteful, so incremental crawling is very important.
Both deduplication and incremental crawling can be achieved with scrapy-redis. Because the library keeps the request queue and the deduplication records in Redis, a crawler that is stopped and run again can continue from where the previous run ended.
The disadvantage is that Scrapy-Redis schedules whole Request objects, which carry much more than the URL (the callback function, headers, and other information). This may slow the crawler down and take up a lot of Redis storage, so adequate hardware is needed to keep it efficient.
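To get a feel for the storage cost mentioned above, here is a rough sketch that compares the size of a bare URL with the size of a pickled, Request-like dictionary. The dictionary fields and values are illustrative assumptions, not the exact structure scrapy-redis stores.

```python
import pickle

# Illustrative URL; not taken from the article.
url = "https://example.com/item/1"

# A hypothetical, simplified stand-in for a serialized Request.
# The real structure stored by scrapy-redis may differ.
request_like = {
    "url": url,
    "callback": "parse_item",
    "method": "GET",
    "headers": {"User-Agent": "Mozilla/5.0", "Referer": "https://example.com/"},
    "meta": {"depth": 2},
    "dont_filter": False,
}

# Bytes needed for the URL alone versus the whole pickled structure.
url_size = len(url.encode("utf-8"))
request_size = len(pickle.dumps(request_like, protocol=pickle.HIGHEST_PROTOCOL))

print("URL only:", url_size, "bytes")
print("Request-like dict:", request_size, "bytes")
```

Multiplied by millions of queued requests, the difference between the two numbers is what drives the Redis memory usage.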
To sum up:
1. Scrapy-Redis moves the scheduling (that is, the request queue) and the deduplication that Scrapy originally handles in memory into Redis.
2. When collecting the same site, multiple Scrapy instances use the same Redis keys (which can be thought of as queues) to add Requests, fetch Requests, and filter out duplicate Requests, so the spiders do not collect the same page twice. Efficiency naturally goes up.
3. Redis operations are atomic, and the benefit is clear: a Request is either processed or not processed, with no third possibility.
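The three points above can be sketched without a running Redis server. The example below uses a small in-memory class as a hypothetical stand-in for the few Redis commands involved (SADD, LPUSH, RPOP). The fingerprint function is a simplification of what Scrapy computes, the key names only mirror the style of the scrapy-redis defaults, and a plain list is used as the queue for brevity.

```python
import hashlib
from collections import deque


class InMemoryRedis:
    """Hypothetical stand-in for the handful of Redis commands used below."""

    def __init__(self):
        self.sets = {}
        self.lists = {}

    def sadd(self, key, member):
        # Like Redis SADD: returns 1 if the member was new, 0 if it already existed.
        members = self.sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1

    def lpush(self, key, value):
        self.lists.setdefault(key, deque()).appendleft(value)

    def rpop(self, key):
        queue = self.lists.get(key)
        return queue.pop() if queue else None


def fingerprint(method, url):
    # Simplified fingerprint; Scrapy's own version also normalizes the URL
    # and takes the request body into account.
    return hashlib.sha1(f"{method} {url}".encode("utf-8")).hexdigest()


def schedule(server, spider_name, url):
    """Queue the URL unless its fingerprint has been seen before."""
    if server.sadd(f"{spider_name}:dupefilter", fingerprint("GET", url)) == 0:
        return False  # duplicate, dropped
    server.lpush(f"{spider_name}:requests", url)
    return True


server = InMemoryRedis()  # shared by every worker, like one Redis server
results = [
    schedule(server, "demo", "https://example.com/a"),  # worker 1
    schedule(server, "demo", "https://example.com/b"),  # worker 2
    schedule(server, "demo", "https://example.com/a"),  # worker 2, duplicate
]

# Any worker can pop from the shared queue; each URL comes out only once.
crawled = []
while (url := server.rpop("demo:requests")) is not None:
    crawled.append(url)

print(results)
print(crawled)
```

Because the "already seen" check and the insert happen in a single SADD call, two workers that discover the same URL at the same moment cannot both enqueue it.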
I suggest you take a look at Cui Da's blog, which has a lot of practical information.
The first step is to install Redis.
Installation guides for Redis are easy to find online (for example via Baidu), or see https://blog.csdn.net/zhao_5352269/article/details/86300221
The second step is to configure settings.py
Configuration of the master (if Redis has no password, remove the credentials part, root:123456@, from the URL):
# Configure scrapy-redis for simple distributed crawling
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://root:123456@192.168.114.130:6379'
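The note about removing the password can be made concrete with a small helper that builds the connection URL with or without credentials. The function name build_redis_url and its parameters are hypothetical and only illustrate the two URL shapes; the host, port, user, and password are the ones used in this article.

```python
from urllib.parse import quote


def build_redis_url(host, port=6379, password=None, user="root"):
    """Return a redis:// URL, including credentials only when a password is given."""
    if password:
        # Quote the password so characters such as '@' or ':' do not break the URL.
        return f"redis://{user}:{quote(password, safe='')}@{host}:{port}"
    return f"redis://{host}:{port}"


with_password = build_redis_url("192.168.114.130", password="123456")
without_password = build_redis_url("192.168.114.130")

print(with_password)
print(without_password)
```

Either result can be assigned to REDIS_URL in settings.py.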
Configuration of the slave:
# Configure scrapy-redis for simple distributed crawling
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_HOST = '192.168.114.130'
REDIS_PORT = 6379
REDIS_PARAMS = {'password': '123456'}
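For the incremental behaviour described earlier (continuing from where the last run stopped), the queue and the deduplication records have to survive a restart. As far as I know, scrapy-redis clears them when the spider closes unless SCHEDULER_PERSIST is enabled, so a settings fragment along these lines may be needed on both master and slave. Treat it as an assumption to check against the scrapy-redis version you install.

```python
# Addition to settings.py (assumption: not part of the original configuration).
# Keep the Redis request queue and dupefilter after the spider closes,
# so the next run can resume instead of starting from scratch.
SCHEDULER_PERSIST = True
```

With this left at its default, a restarted crawl may begin again from the start URLs.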
The third step is to install scrapy-redis
pip3 install scrapy-redis
Once it is installed, simple distributed crawling works, and the master and slave can be started in any order.