How to get started with the Python Scrapy crawler framework


How do you get started with the Python Scrapy crawler framework? To answer this question, this article gives a detailed analysis and a worked example, hoping to help readers who want to solve this problem find a simpler and easier way.

Overview of Scrapy

Scrapy is a very popular web crawler framework written in Python that can be used to crawl websites and extract structured data from their pages. It is widely used in data mining, data monitoring, automated testing and other fields. Its basic architecture consists of the components described below, which cooperate through the data processing flow described in the next section.

Components

Scrapy engine (Engine): the Scrapy engine is used to control the data processing flow of the whole system.

Scheduler (Scheduler): the scheduler accepts requests from the Scrapy engine, sorts them into a queue, and returns them when the engine asks for them.

Downloader (Downloader): the main responsibility of the downloader is to grab the web page and return the content to the Spiders.

Spiders (Spiders): spiders are classes written by the Scrapy user to parse web pages and extract content from the responses of specific URLs. Each spider can handle one domain name or a group of domain names; in short, a spider defines the crawling and parsing rules for a particular website.

Item pipeline (Item Pipeline): the item pipeline's primary responsibility is to process the data items that spiders extract from web pages; its main tasks are cleaning, validating and storing data. Once a page has been parsed by a spider, the extracted items are sent to the item pipeline, where they are processed by several components in a specific order. Each item pipeline component is a Python class that receives a data item, runs its processing method on it, and then decides whether the item should continue to the next stage of the pipeline or simply be discarded. Typical tasks performed by the item pipeline are: cleaning up HTML data, validating parsed data (checking that an item contains the required fields), checking for duplicate data (and discarding duplicates), and storing the parsed data in a database (relational or NoSQL). A minimal sketch of such a component follows.
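For illustration only, here is a minimal sketch of a duplicate-dropping pipeline component; the 'name' field used as the deduplication key is an assumption for this example, not part of the project built later.

from scrapy.exceptions import DropItem


class DuplicatesPipeline:
    # Hypothetical example: drop items whose 'name' has already been seen.

    def __init__(self):
        self.seen_names = set()  # names of all items processed so far

    def process_item(self, item, spider):
        name = item.get('name')
        if name in self.seen_names:
            # Raising DropItem discards the item and stops further pipeline processing.
            raise DropItem("Duplicate item found: %s" % name)
        self.seen_names.add(name)
        return item  # hand the item on to the next pipeline component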

Middleware (Middlewares): middlewares are hook frameworks that sit between the Scrapy engine and the other components; their main purpose is to let custom code extend Scrapy's functionality. They come in two flavours, downloader middleware and spider middleware, and a small downloader middleware sketch appears right after this paragraph.
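As a rough, hypothetical illustration (not part of the douban project built later), a downloader middleware is simply a class whose hook methods Scrapy calls around each request; the user-agent string below is an arbitrary placeholder.

class CustomUserAgentMiddleware:
    # Hypothetical downloader middleware that rewrites the User-Agent header.

    def process_request(self, request, spider):
        # Called for every request passing through the downloader middleware chain;
        # returning None tells Scrapy to keep processing the request as usual.
        request.headers['User-Agent'] = 'Mozilla/5.0 (example placeholder)'
        return None

Such a class would be enabled through the DOWNLOADER_MIDDLEWARES setting shown in settings.py later in this article.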

Data processing flow

The entire data processing flow of Scrapy is controlled by the Scrapy engine, and a typical run goes through the following steps:

1. The engine asks the spider which website it needs to process and asks the spider for the first URL(s) to handle.

2. The engine asks the scheduler to queue the URLs that need to be processed.

3. The engine gets the next page to crawl from the scheduler.

4. The scheduler returns the next URL to crawl to the engine, which sends it to the downloader through the downloader middleware.

5. When the page has been downloaded by the downloader, the response content is sent back to the engine through the downloader middleware; if the download fails, the engine tells the scheduler to record the URL so that it can be downloaded again later.

6. The engine receives the response from the downloader and sends it to the spider through the spider middleware for processing.

7. The spider processes the response and returns the crawled data items to the engine, along with any new URLs that need to be followed.

8. The engine feeds the crawled data items into the item pipeline and sends the new URLs to the scheduler to be queued.

Steps 2-8 are repeated until there are no more URLs waiting in the scheduler, at which point the crawler stops working.
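To make steps 7 and 8 concrete, here is a minimal, self-contained spider sketch (assuming a recent Scrapy version): it yields data items and follow-up requests from the same callback, and the engine routes the items to the item pipeline and the requests to the scheduler. The practice site quotes.toscrape.com and its CSS selectors are illustrative assumptions, not part of the project built below.

import scrapy


class QuotesSpider(scrapy.Spider):
    # Hypothetical example against a public practice site, not the douban project below.
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Yielded dicts are data items: the engine hands them to the item pipeline (step 8).
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # Yielded requests are new URLs: the engine sends them to the scheduler (step 8).
            yield response.follow(next_page, callback=self.parse)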

Install and use Scrapy

You can first create a virtual environment and then install Scrapy inside it using pip; an example sequence of commands is sketched below.
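On Linux or macOS, for instance, the commands might look roughly like this (the project name douban matches the example that follows; the exact commands depend on your platform and Python installation, and scrapy startproject is the standard way to generate the project skeleton shown next):

$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ pip install scrapy
(venv) $ scrapy startproject douban
(venv) $ cd douban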

The directory structure of the project is shown below.

(venv) $ tree .
.
|____scrapy.cfg
|____douban
| |____spiders
| | |______init__.py
| | |______pycache__
| |______init__.py
| |______pycache__
| |____middlewares.py
| |____settings.py
| |____items.py
| |____pipelines.py

Note: the tree command is available at the Windows command prompt, but the terminals of Linux and macOS do not provide a tree command by default. You can define one with the command below, which simply customizes the find command and aliases it as tree.

alias tree="find . -print | sed -e 's;[^/]*/;|____;g;s;____|; |;g'"

On Linux you can also install tree through yum or another package management tool.

yum install tree

Following the data processing flow just described, there are basically a few things we need to do:

1. Define fields in the items.py file that are used to save data to facilitate subsequent operations.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class DoubanItem(scrapy.Item):
    name = scrapy.Field()
    year = scrapy.Field()
    score = scrapy.Field()
    director = scrapy.Field()
    classification = scrapy.Field()
    actor = scrapy.Field()

2. Write your own crawler in the spiders folder.

(venv) $ scrapy genspider movie movie.douban.com --template=crawl

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from douban.items import DoubanItem


class MovieSpider(CrawlSpider):
    name = 'movie'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']
    rules = (
        # Follow the Top 250 pagination links.
        Rule(LinkExtractor(allow=(r'https://movie.douban.com/top250\?start=\d+.*'))),
        # Parse each movie detail page with parse_item.
        Rule(LinkExtractor(allow=(r'https://movie.douban.com/subject/\d+')), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        item = DoubanItem()
        item['name'] = sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract()
        item['year'] = sel.xpath('//*[@id="content"]/h1/span[2]/text()').re(r'\((\d+)\)')
        item['score'] = sel.xpath('//*[@id="interest_sectl"]/div/p[1]/strong/text()').extract()
        item['director'] = sel.xpath('//*[@id="info"]/span[1]/a/text()').extract()
        item['classification'] = sel.xpath('//span[@property="v:genre"]/text()').extract()
        item['actor'] = sel.xpath('//*[@id="info"]/span[3]/a[1]/text()').extract()
        return item

Note: above we created the spider from the crawl template provided by Scrapy. The LinkExtractor objects in rules automatically extract new links from each response (through their extract_links method). Scrapy supports parsing data with both XPath syntax and CSS selectors, through the xpath and css methods respectively; above we used XPath syntax to parse the page, and an equivalent CSS-selector version of one extraction is sketched right after this paragraph. If you are not familiar with XPath syntax, it is worth reviewing it separately before writing your own selectors.
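For comparison, and purely as an illustration, the 'classification' extraction above could be written with a CSS selector roughly like this (extract_genres is a hypothetical helper, not part of the spider above):

def extract_genres(response):
    # Hypothetical helper: CSS-selector equivalent of the XPath used for 'classification'.
    # '::text' selects the text nodes of the matched <span property="v:genre"> elements.
    return response.css('span[property="v:genre"]::text').extract()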

At this point, we can get the crawler running with the following command.

(venv) $ scrapy crawl movie

You can see the crawled data in the console. If you want to save the crawled data to a file, you can specify a file name with the -o parameter; Scrapy can export the crawled data to JSON, CSV, XML, pickle, marshal and other formats.

(venv) $ scrapy crawl movie -o result.json

3. Persistence of data is done in pipelines.py.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log


class DoubanPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Remove invalid data
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing %s of blogpost from %s" % (data, item['url']))
        if valid:
            # Insert data into database
            new_movie = [{
                "name": item['name'][0],
                "year": item['year'][0],
                "score": item['score'],
                "director": item['director'],
                "classification": item['classification'],
                "actor": item['actor']
            }]
            self.collection.insert(new_movie)
            log.msg("Item wrote to MongoDB database %s/%s" %
                    (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
                    level=log.DEBUG, spider=spider)
        return item
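Note that scrapy.conf and scrapy.log used above have been removed in newer Scrapy releases. If you are on a recent version, a rough sketch of an equivalent pipeline written against the current Scrapy and pymongo APIs (and assuming the same MONGODB_* settings defined in settings.py below) might look like this:

import pymongo


class DoubanPipeline:

    def __init__(self, mongo_server, mongo_port, mongo_db, mongo_collection):
        self.mongo_server = mongo_server
        self.mongo_port = mongo_port
        self.mongo_db = mongo_db
        self.mongo_collection = mongo_collection

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters from the project settings instead of scrapy.conf.
        return cls(
            mongo_server=crawler.settings.get('MONGODB_SERVER'),
            mongo_port=crawler.settings.get('MONGODB_PORT'),
            mongo_db=crawler.settings.get('MONGODB_DB'),
            mongo_collection=crawler.settings.get('MONGODB_COLLECTION'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_server, self.mongo_port)
        self.collection = self.client[self.mongo_db][self.mongo_collection]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # dict(item) turns the Scrapy item into a plain document that MongoDB can store.
        self.collection.insert_one(dict(item))
        return item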

With Pipeline, we can do the following:

Clean up the HTML data and verify the crawled data.

Discard repetitive and unnecessary content.

Persist the crawled results.

4. Modify the settings.py file to configure the project.

# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = True

MONGODB_SERVER = '120.77.222.217'
MONGODB_PORT = 27017
MONGODB_DB = 'douban'
MONGODB_COLLECTION = 'movie'

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     'douban.middlewares.DoubanSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'douban.middlewares.DoubanDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 400,
}

LOG_LEVEL = 'DEBUG'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

The answer to the question of how to get started with the Python Scrapy crawler framework is shared here. I hope the above content can help you to some extent.
