I have been following this framework for a long time, but it was not until recently that I took a closer look at it. I am using Scrapy 0.24.
Let's start with a finished product to get a feel for how convenient the framework is; later I will sort out my notes and update my blog with everything I have learned about it recently.
I have also been meaning to learn git, so I put the code on git-osc:
https://git.oschina.net/1992mrwang/doubangroupspider
First, a word about what this toy crawler does.
It crawls the groups found on the seed URL pages and extracts each group's related-group links, member count, group name, and other information.
The data comes out like this:
{'RelativeGroups': [u'http://www.douban.com/group/10127/',
                    u'http://www.douban.com/group/seventy/',
                    u'http://www.douban.com/group/lovemuseum/',
                    u'http://www.douban.com/group/486087/',
                    u'http://www.douban.com/group/lovesh/',
                    u'http://www.douban.com/group/NoAstrology/',
                    u'http://www.douban.com/group/shanghaijianzhi/',
                    u'http://www.douban.com/group/12658/',
                    u'http://www.douban.com/group/shanghaizufang/',
                    u'http://www.douban.com/group/gogo/',
                    u'http://www.douban.com/group/117546/',
                    u'http://www.douban.com/group/159755/'],
 'groupName': u'\u4e0a\u6d77\u8c46\u74e3',
 'groupURL': 'http://www.douban.com/group/Shanghai/',
 'totalNumber': u'209957'}
What is it useful for? These data can be used to analyze how groups are related to one another, and you can grab more fields if you want to. That isn't the point of this article, though; the point here is to get a quick feel for the framework.
First, start a new project called douban.
# scrapy startproject douban
# cd douban
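(As an aside, Scrapy can also scaffold a CrawlSpider skeleton for you instead of writing the spider file by hand. The original workflow here does not use it, but it is a handy shortcut; the name and domain below match the spider defined later in this post.)

# scrapy genspider -t crawl Group douban.com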
This is the complete directory layout of the project. (PS: after pushing it to git-osc I renamed the project's top-level directory to something nicer, which has no effect after cloning.)

mrwang@mrwang-ubuntu:~/student/py/douban$ tree .
├── douban
│   ├── __init__.py
│   ├── items.py                 # the entity (item) definitions
│   ├── pipelines.py             # the data pipeline
│   ├── settings.py              # settings
│   └── spiders
│       ├── BasicGroupSpider.py  # the actual crawler
│       └── __init__.py
├── nohup.out                    # log produced by running in the background with nohup
├── scrapy.cfg
├── start.sh                     # a very simple start script, written for convenience
├── stop.sh                      # a very simple stop script, written for convenience
└── test.log                     # the crawl log whose name is given in the start script
Next comes the entity in items.py; it is written mainly so that the scraped data can be persisted easily.
mrwang@mrwang-ubuntu:~/student/py/douban$ cat douban/items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class DoubanItem(Item):
    # define the fields for your item here like:
    # name = Field()
    groupName = Field()
    groupURL = Field()
    totalNumber = Field()
    RelativeGroups = Field()
    ActiveUesrs = Field()
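Scrapy items behave like dictionaries, which is why the pipeline later can simply call dict(item). A quick illustrative sketch (the values are made up):

from douban.items import DoubanItem

item = DoubanItem()
item['groupName'] = u'example group'       # fields are set like dict keys
item['totalNumber'] = u'123'
item['RelativeGroups'] = [u'http://www.douban.com/group/10127/']
print dict(item)   # only the fields that were set appear in the dict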
Then write the spider and customize the rules for extracting the data.
mrwang@mrwang-ubuntu:~/student/py/douban$ cat douban/spiders/BasicGroupSpider.py
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from douban.items import DoubanItem
import re

class GroupSpider(CrawlSpider):
    # crawler name
    name = "Group"
    allowed_domains = ["douban.com"]
    # seed links
    start_urls = [
        "http://www.douban.com/group/explore?tag=%E8%B4%AD%E7%89%A9",
        "http://www.douban.com/group/explore?tag=%E7%94%9F%E6%B4%BB",
        "http://www.douban.com/group/explore?tag=%E7%A4%BE%E4%BC%9A",
        "http://www.douban.com/group/explore?tag=%E8%89%BA%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E5%AD%A6%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E6%83%85%E6%84%9F",
        "http://www.douban.com/group/explore?tag=%E9%97%B2%E8%81%8A",
        "http://www.douban.com/group/explore?tag=%E5%85%B4%E8%B6%A3",
    ]

    # when a rule matches, the function named in callback handles the response
    rules = [
        Rule(SgmlLinkExtractor(allow=('/group/[^/]+/$', )),
             callback='parse_group_home_page', process_request='add_cookie'),
        Rule(SgmlLinkExtractor(allow=('/group/explore\?tag', )),
             follow=True, process_request='add_cookie'),
    ]

    def __get_id_from_group_url(self, url):
        m = re.search("^http://www.douban.com/group/([^/]+)/$", url)
        if m:
            return m.group(1)
        else:
            return 0

    def add_cookie(self, request):
        request.replace(cookies=[])
        return request

    def parse_group_topic_list(self, response):
        self.log("Fetch group topic list page: %s" % response.url)
        pass

    def parse_group_home_page(self, response):
        self.log("Fetch group home page: %s" % response.url)

        # the XPath selector
        hxs = HtmlXPathSelector(response)
        item = DoubanItem()

        # get group name
        item['groupName'] = hxs.select('//h2/text()').re("^\s+(.*)\s+$")[0]

        # get group id
        item['groupURL'] = response.url
        groupid = self.__get_id_from_group_url(response.url)

        # get group members number
        members_url = "http://www.douban.com/group/%s/members" % groupid
        members_text = hxs.select('//a[contains(@href, "%s")]/text()' % members_url).re("\((\d+)")
        item['totalNumber'] = members_text[0]

        # get relative groups
        item['RelativeGroups'] = []
        groups = hxs.select('//div[contains(@class, "group-list-item")]')
        for group in groups:
            url = group.select('div[contains(@class, "title")]/a/@href').extract()[0]
            item['RelativeGroups'].append(url)
        return item
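The two regular expressions in the spider do the heavy lifting for the group id and the member count. A small standalone sketch of what they match (the sample strings are made up to mimic what Douban's pages contain):

import re

# the same pattern used by __get_id_from_group_url
m = re.search("^http://www.douban.com/group/([^/]+)/$",
              "http://www.douban.com/group/Shanghai/")
print m.group(1)                                  # -> Shanghai

# the same idea as the member-count extraction: pull the number out of "(209957)"
print re.findall("\((\d+)", "members (209957)")   # -> ['209957']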
Now write the pipeline for data processing; here I store the data collected by the spider into MongoDB.
mrwang@mrwang-ubuntu:~/student/py/douban$ cat douban/pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from scrapy import log
from scrapy.conf import settings
from scrapy.exceptions import DropItem

class DoubanPipeline(object):
    def __init__(self):
        self.server = settings['MONGODB_SERVER']
        self.port = settings['MONGODB_PORT']
        self.db = settings['MONGODB_DB']
        self.col = settings['MONGODB_COLLECTION']
        connection = pymongo.Connection(self.server, self.port)
        db = connection[self.db]
        self.collection = db[self.col]

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        log.msg('Item written to MongoDB database %s/%s' % (self.db, self.col),
                level=log.DEBUG, spider=spider)
        return item
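Once a few items have landed in MongoDB, you can check them with pymongo and also do the kind of group-relatedness counting mentioned at the beginning. A rough sketch, using the database and collection names configured below; the "relatedness" measure here is simply how often a group URL appears in other groups' RelativeGroups lists:

import pymongo
from collections import Counter

connection = pymongo.Connection('localhost', 27017)
collection = connection['douban']['doubanGroup']

print collection.count()        # how many groups have been stored so far
print collection.find_one()     # peek at a single document

# naive relatedness count across all stored groups
counter = Counter()
for doc in collection.find({}, {'RelativeGroups': 1}):
    counter.update(doc.get('RelativeGroups', []))
print counter.most_common(10)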
In settings.py, configure the item pipeline to use, the MongoDB connection parameters, and a user-agent so the crawler does not get blocked.
mrwang@mrwang-ubuntu:~/student/py/douban$ cat douban/settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# set a download delay to ease the load on the server and keep a low profile
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True

# configure the data pipeline to use
ITEM_PIPELINES = ['douban.pipelines.DoubanPipeline']

MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'douban'
MONGODB_COLLECTION = 'doubanGroup'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'
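One more thing worth knowing: any of these settings can be overridden per run from the command line with Scrapy's -s option, which is handy for pointing the pipeline at a test database. For example (the douban_test name is just an example):

# scrapy crawl Group -s MONGODB_DB=douban_test -s DOWNLOAD_DELAY=5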
OK, that is the whole toy crawler.
Start it with:
nohup scrapy crawl Group --logfile=test.log &
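The start.sh and stop.sh from the directory tree are not shown in this post; they are described as very simple, so something along these lines would fit (the spider.pid file is my own invention for this sketch):

#!/bin/sh
# start.sh: start the crawler in the background and remember its pid
nohup scrapy crawl Group --logfile=test.log &
echo $! > spider.pid

#!/bin/sh
# stop.sh: stop the crawler started by start.sh
kill `cat spider.pid`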
= updated 2014-12-02 =
On GitHub I found that someone had already written a scheduler that uses MongoDB to store the pages to be visited next, exactly the approach I had in mind, so I adapted it.
mrwang@mrwang-ThinkPad-Edge-E431:~/student/py/douban$ cat douban/scheduler.py
from scrapy.utils.reqser import request_to_dict, request_from_dict
import pymongo
import datetime

class Scheduler(object):
    def __init__(self, mongodb_server, mongodb_port, mongodb_db, persist, queue_key, queue_order):
        self.mongodb_server = mongodb_server
        self.mongodb_port = mongodb_port
        self.mongodb_db = mongodb_db
        self.queue_key = queue_key
        self.persist = persist
        self.queue_order = queue_order

    def __len__(self):
        return self.collection.count()

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        mongodb_server = settings.get('MONGODB_QUEUE_SERVER', 'localhost')
        mongodb_port = settings.get('MONGODB_QUEUE_PORT', 27017)
        mongodb_db = settings.get('MONGODB_QUEUE_DB', 'scrapy')
        persist = settings.get('MONGODB_QUEUE_PERSIST', True)
        queue_key = settings.get('MONGODB_QUEUE_NAME', None)
        queue_type = settings.get('MONGODB_QUEUE_TYPE', 'FIFO')

        if queue_type not in ('FIFO', 'LIFO'):
            raise ValueError('MONGODB_QUEUE_TYPE must be FIFO (default) or LIFO')

        if queue_type == 'LIFO':
            queue_order = -1
        else:
            queue_order = 1

        return cls(mongodb_server, mongodb_port, mongodb_db, persist, queue_key, queue_order)

    def open(self, spider):
        self.spider = spider
        if self.queue_key is None:
            self.queue_key = "%s_queue" % spider.name

        connection = pymongo.Connection(self.mongodb_server, self.mongodb_port)
        self.db = connection[self.mongodb_db]
        self.collection = self.db[self.queue_key]

        # notice if there are requests already in the queue
        size = self.collection.count()
        if size > 0:
            spider.log("Resuming crawl (%d requests scheduled)" % size)

    def close(self, reason):
        if not self.persist:
            self.collection.drop()

    def enqueue_request(self, request):
        data = request_to_dict(request, self.spider)
        self.collection.insert({
            'data': data,
            'created': datetime.datetime.utcnow()
        })

    def next_request(self):
        entry = self.collection.find_and_modify(sort={"$natural": self.queue_order}, remove=True)
        if entry:
            request = request_from_dict(entry['data'], self.spider)
            return request
        return None

    def has_pending_requests(self):
        return self.collection.count() > 0
It works with sensible defaults out of the box; if you want to customize it, set the options in douban/settings.py.
The configurable options and their default values are:

MONGODB_QUEUE_SERVER=localhost   queue server host
MONGODB_QUEUE_PORT=27017         port number
MONGODB_QUEUE_DB=scrapy          database name
MONGODB_QUEUE_PERSIST=True       whether to keep the task queue in MongoDB after the crawl finishes (if False, the collection is dropped)
MONGODB_QUEUE_NAME=None          queue collection name; if None, it defaults to "<spider name>_queue"
MONGODB_QUEUE_TYPE=FIFO          FIFO (first in, first out) or LIFO (last in, first out)
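The post does not spell out how the scheduler gets wired in; presumably it is registered through Scrapy's SCHEDULER setting, together with whichever queue options you want to override, something like this sketch:

# douban/settings.py -- additions for the MongoDB-backed scheduler
SCHEDULER = 'douban.scheduler.Scheduler'

MONGODB_QUEUE_SERVER = 'localhost'
MONGODB_QUEUE_PORT = 27017
MONGODB_QUEUE_DB = 'scrapy'
MONGODB_QUEUE_PERSIST = True
MONGODB_QUEUE_NAME = None        # None -> "<spider name>_queue"
MONGODB_QUEUE_TYPE = 'FIFO'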
With the task queue split out, it becomes easy to turn the crawler into a distributed one and break through the single-machine limit. The code on git-osc has been updated accordingly.
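In practice, "distributed" here simply means several machines running the same spider against one shared queue database; each worker only needs MONGODB_QUEUE_SERVER (and the pipeline's MONGODB_SERVER, if the results should also be shared) pointed at that host. For example, assuming the shared MongoDB lives at 192.168.1.100 (an example address):

nohup scrapy crawl Group --logfile=test.log -s MONGODB_QUEUE_SERVER=192.168.1.100 -s MONGODB_SERVER=192.168.1.100 &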
Some people will wonder about the efficiency of such a task queue. I tested it on my personal computer with a queue of close to one million entries, running a complex query against MongoDB with no indexes at all. With 8 GB of RAM and an i5 CPU, memory was not exhausted even though plenty of other programs were open. If you are interested, run a test yourself; it really is not bad.
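If you want to reproduce that kind of test yourself, a rough pymongo sketch is to fill a throwaway collection and time the same find_and_modify pop that next_request() uses (the sizes and collection name here are arbitrary):

import time
import pymongo

connection = pymongo.Connection('localhost', 27017)
col = connection['scrapy']['queue_benchmark']   # throwaway collection
col.drop()

# fill the queue with a million dummy "requests", in batches
for i in range(0, 1000000, 10000):
    col.insert([{'data': {'url': 'http://example.com/%d' % n}}
                for n in range(i, i + 10000)])

# time popping 1000 entries the same way the scheduler does
start = time.time()
for _ in range(1000):
    col.find_and_modify(sort={"$natural": 1}, remove=True)
print "1000 pops took %.2f seconds" % (time.time() - start)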