This article explains how to use the Scrapy crawler framework. The content is detailed, the steps are clear, and it should serve as a useful reference. I hope you get something out of it.
1. Introduction to the Scrapy crawler framework
When writing a crawler with libraries such as requests or aiohttp, we have to implement everything ourselves from beginning to end, including exception handling, crawl scheduling, and so on. Using an existing crawler framework improves development efficiency, and among Python crawler frameworks, Scrapy is arguably the most popular and most powerful.
Scrapy introduction
Scrapy is an asynchronous processing framework based on Twisted and implemented in pure Python. It has a clear architecture, low coupling between modules, and strong scalability, and it can flexibly meet a variety of requirements. We only need to customize a few modules to implement a crawler easily.
It has the following parts:
Scrapy Engine: handles the data flow of the whole system and triggers events; it is the core of the framework.
Item: defines the data structure of the crawl result; crawled data is assigned to Item objects.
Scheduler: accepts Requests sent by the Engine, enqueues them, and supplies them back when the Engine asks for them again.
Item Pipeline: processes the Items extracted from web pages by the Spiders; its main tasks are cleaning, validating, and storing data.
Downloader: downloads web content and returns it to the Spiders.
Spiders: define the crawling logic and the parsing rules for web pages; they are mainly responsible for parsing responses and generating extracted results and new Requests.
Downloader Middlewares: a hook framework between the Engine and the Downloader that processes the requests and responses passing between them.
Spider Middlewares: a hook framework between the Engine and the Spiders that mainly processes the Spiders' input (responses) and output (results and new Requests).
Scrapy data flow mechanism
The data flow in scrapy is controlled by the engine and the process is as follows:
The Engine first opens a website, finds the Spider that handles that site, and asks the Spider for the first URL to crawl.
The Engine gets the first URL to crawl from the Spider and schedules it as a Request through the Scheduler.
The Engine asks the Scheduler for the next URL to crawl.
The Scheduler returns the next URL to be crawled to the Engine, and the Engine forwards it to the Downloader through the Downloader Middlewares for downloading.
Once the page has been downloaded, the Downloader generates a Response for the page and sends it to the Engine through the Downloader Middlewares.
The Engine receives the Response from the Downloader and sends it to the Spider for processing through the Spider Middlewares.
The Spider processes the Response and returns the crawled Items and new Requests to the Engine.
The Engine passes the Items returned by the Spider to the Item Pipeline and the new Requests to the Scheduler.
Steps two through eight are repeated until there are no more Requests in the Scheduler, at which point the Engine closes the site and the crawl ends.
Because each component has its own job and all components support asynchronous processing, Scrapy makes maximum use of the network bandwidth and greatly improves the efficiency of data crawling and processing.
2. Installation and creation of a Scrapy project
pip install Scrapy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
For the installation method, refer to the official document: https://docs.scrapy.org/en/latest/intro/install.html
After the installation is complete, if you can use the scrapy command normally, the installation is successful.
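If you also want a quick sanity check from Python (just one possible way to confirm the install, not a required step), you can print the installed version:

import scrapy

print(scrapy.__version__)  # any version string here means Scrapy is importable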
Scrapy is a framework that comes with many ready-made components and scaffolding for writing crawlers: it generates a project skeleton on which we can quickly build a crawler.
The Scrapy framework creates a project through the command line, and the commands to create the project are as follows:
scrapy startproject practice
After the command is executed, a folder called practice appears in the current working directory; this is the Scrapy project skeleton on which we can write crawlers.
project/
    __pycache__/
    spiders/
        __pycache__/
        __init__.py
        spider1.py
        spider2.py
        ...
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
scrapy.cfg
The functional descriptions of each file are as follows:
scrapy.cfg: the configuration file of the Scrapy project; it defines the path of the project's settings file, deployment information, and so on.
items.py: defines the Item data structures; all Item definitions can be placed here.
pipelines.py: defines the Item Pipeline implementations; all Item Pipeline implementations can be placed here.
settings.py: defines the global configuration of the project.
middlewares.py: defines the Spider Middlewares and Downloader Middlewares implementations.
spiders/: contains the Spider implementations; each Spider lives in its own file.
3. Basic usage of Scrapy
Example 1: crawling quotes
Create a Scrapy project.
Create a Spider to crawl the site and process the data.
Run through the command line to export the crawled content.
Target URL: http://quotes.toscrape.com/
Create a project
Create a scrapy project, and the project file can be generated directly with the scrapy command, as follows:
scrapy startproject practice
Create Spider
A Spider is a class we define ourselves; Scrapy uses it to grab content from web pages and parse the crawled results. This class must inherit from scrapy.Spider, the Spider class provided by Scrapy, and define the Spider's name and start requests, as well as how to handle the crawled results.
Create a Spider using the command line, with the following command:
cd practice
scrapy genspider quotes quotes.toscrape.com
Change into the practice folder you just created and execute the genspider command. The first parameter is the name of the Spider and the second is the site's domain name. After execution, an extra quotes.py appears in the spiders folder; this is the Spider that was just created, as follows:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass
You can see that quotes.py contains three attributes (name, allowed_domains, and start_urls) as well as a parse method.
name: the unique name of the Spider within the project, used to distinguish different Spiders.
allowed_domains: the domains that are allowed to be crawled; if an initial or subsequent request link is not under these domains, it is filtered out.
start_urls: the list of URLs the Spider crawls at startup; the initial requests are defined by it.
parse: a method of the Spider. By default, when the requests built from the links in start_urls finish downloading, the returned response is passed to this method as its only argument. The method is responsible for parsing the response, extracting data, or generating further requests to be processed.
Create Item
Item is a container for crawled data, and it is used in much the same way as a dictionary. Compared with a dictionary, however, Item provides extra protection against misspelled or undefined fields.
To create an Item, inherit from the scrapy.Item class and define fields of type scrapy.Field. Looking at the target website, we can see that each quote contains text, author, and tags.
Define the Item by modifying items.py as follows:
import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
Three fields are defined and the class is renamed QuoteItem; it will be used when crawling.
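As a small illustration (a sketch assuming the QuoteItem defined above), an Item behaves much like a dictionary but rejects fields that were not declared:

from practice.items import QuoteItem

item = QuoteItem()
item['text'] = 'Some quote text'   # declared field: works like a normal dict entry
item['author'] = 'Somebody'
print(item.get('text'))            # dict-style access
item['score'] = 5                  # undeclared field: raises KeyError

This is the protection mechanism mentioned earlier: a typo in a field name fails immediately instead of silently creating a new key.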
Parsing Response
The response parameter of the parse method is the downloaded result of the links in start_urls. In the parse method we can therefore work directly with the content of response: inspect the source code of the request result, analyze it further, or find links in it to build the next requests.
You can see that the page contains both the data you want to extract and the link to the next page, both of which can be processed.
First look at the structure of the web page. Each page has multiple blocks whose class is quote, and each block contains text, author, and tags. So we first find all the quote blocks and then extract the contents of each one.
The data can be extracted with either CSS selectors or XPath selectors.
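For example, the quote text can be pulled out with either selector style. This is a rough sketch of two equivalent queries inside parse(), based on the page structure described above (span elements with class text inside each quote block):

# inside parse(), both lines return the same list of quote texts
texts_css = response.css('.quote .text::text').extract()
texts_xpath = response.xpath('//div[@class="quote"]/span[@class="text"]/text()').extract()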
Use Item
The Item was defined above; now it is time to use it. An Item can be thought of as a dictionary, although it has to be instantiated before use. Each field of the Item is then assigned the result just parsed, and finally the Item is yielded.
import scrapy
from practice.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response, **kwargs):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
Subsequent Request
The code above grabs content from the initial page only. To crawl the following pages, we need to find information on the current page to generate the next request, then find information on that page to build the request after it, and so on, until the whole site has been crawled.
Looking at the source code of the page, you can see that the link to the next page is /page/2/, while the full link is http://quotes.toscrape.com/page/2/; the next request can be constructed from it.
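The relative link is turned into the absolute one with standard URL joining; Scrapy's response.urljoin() performs the same join against the current response URL. A minimal standalone illustration:

from urllib.parse import urljoin

# response.urljoin('/page/2/') inside parse() does the equivalent of this
print(urljoin('http://quotes.toscrape.com/', '/page/2/'))
# -> http://quotes.toscrape.com/page/2/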
scrapy.Request is used to construct the request. Here we pass two parameters, url and callback, described as follows:
url: the request link.
callback: the callback function. When the request with this callback completes and the response is obtained, the engine passes the response to the callback function as an argument. The callback then parses the response or generates the next request, as parse() did above.
Since parse() is the method that parses text, author, and tags, and the next page has the same structure as the page just parsed, we can reuse parse() to parse it.
"@ Author: Ye Tingyun @ Date: 2020-10-2 11:40@CSDN: https://blog.csdn.net/fyfugoyfa"""import scrapyfrom practice.items import QuoteItemclass QuotesSpider (scrapy.Spider): name = 'quotes' allowed_domains = [' quotes.toscrape.com'] start_urls = ['http://quotes.toscrape.com/'] def parse (self, response) * * kwargs): quotes = response.css ('.authors') for quote in quotes: item = QuoteItem () item ['text'] = quote.css (' .text:: text'). Extract_first () item ['author'] = quote.css (' .author:: text'). Extract_first () item ['tags'] = quote.css ( '.tags .tag:: text') .extract () yield item next_page = response.css (' .pager .next a::attr ("href")') .extract_first () next_url = response.urljoin (next_page) yield scrapy.Request (url=next_url) Callback=self.parse)
Next, run the crawler. Change into the project directory and run the following command:
scrapy crawl quotes -o quotes.csv
After the command runs, there is an additional quotes.csv file in the project that contains everything you just crawled.
Many other output formats are supported, such as json, xml, pickle, and marshal, as well as remote destinations such as ftp and S3. Other outputs can be implemented through a custom ItemExporter.
scrapy crawl quotes -o quotes.json
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.marshal
scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/to/quotes.csv
Among them, the ftp output needs to be correctly configured with user name, password, address and output path, otherwise an error will be reported.
With the Feed Exports provided by Scrapy, we can easily write the crawl results to a file, which is sufficient for small projects. For more complex output, such as saving to a database, an Item Pipeline can be used flexibly.
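As a rough sketch of that idea (not part of the original project; the class and file names here are hypothetical), a pipeline could store the QuoteItem fields in a local SQLite database. It would also need to be registered in settings.py, e.g. ITEM_PIPELINES = {'practice.pipelines.SQLitePipeline': 300}:

import json
import sqlite3


class SQLitePipeline:
    # called once when the spider opens: create the database and table
    def open_spider(self, spider):
        self.conn = sqlite3.connect('quotes.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT, tags TEXT)'
        )

    # called for every item yielded by the spider
    def process_item(self, item, spider):
        self.conn.execute(
            'INSERT INTO quotes VALUES (?, ?, ?)',
            (item['text'], item['author'], json.dumps(item['tags'])),
        )
        return item

    # called once when the spider closes: persist and clean up
    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()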
Example 2: crawling pictures
Target URL: http://sc.chinaz.com/tupian/dangaotupian.html
Create a project
scrapy startproject get_img
cd get_img
scrapy genspider img_spider sc.chinaz.com
Construct the requests
Define the start_requests() method in img_spider.py. To crawl the cake pictures on this website, for example, crawl 10 pages and generate 10 requests, as shown below:
def start_requests(self):
    for i in range(1, 11):
        if i == 1:
            url = 'http://sc.chinaz.com/tupian/dangaotupian.html'
        else:
            url = f'http://sc.chinaz.com/tupian/dangaotupian_{i}.html'
        yield scrapy.Request(url, self.parse)
Write items.py
import scrapy


class GetImgItem(scrapy.Item):
    img_url = scrapy.Field()
    img_name = scrapy.Field()
Write img_spider.py
The Spider class defines how to crawl a website (for example, whether to follow links) and how to extract structured data from page content (the crawled Items).
"@ Author: Ye Tingyun @ Date: 2020-10-2 11:40@CSDN: https://blog.csdn.net/fyfugoyfa"""import scrapyfrom get_img.items import GetImgItemclass ImgSpiderSpider (scrapy.Spider): name = 'img_spider' def start_requests (self): for i in range (1) 11): if I = = 1: url = 'http://sc.chinaz.com/tupian/dangaotupian.html' else: url = f' http://sc.chinaz.com/tupian/dangaotupian_{i}.html' yield scrapy.Request (url, self.parse) def parse (self, response * * kwargs): src_list = response.xpath ('/ / div [@ id= "container"] / div/div/a/img/@src2'). Extract () alt_list = response.xpath ('/ / div [@ id= "container"] / div/div/a/img/@alt') .extract () for alt, src in zip (alt_list) Src_list): item = GetImgItem () # generate item object # assign item ['img_url'] = src item [' img_name'] = alt yield item
Write the pipeline file pipelines.py
Scrapy provides Pipelines dedicated to handling downloads, including file downloads and image downloads. The principle of downloading files and pictures is the same as crawling pages, so the download process is asynchronous and very efficient.
from scrapy.pipelines.images import ImagesPipeline  # Scrapy's image downloading pipeline
from scrapy import Request
from scrapy.exceptions import DropItem


class GetImgPipeline(ImagesPipeline):
    # request the image downloads
    def get_media_requests(self, item, info):
        yield Request(item['img_url'], meta={'name': item['img_name']})

    def item_completed(self, results, item, info):
        # analyze the download results and drop items whose image failed to download
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item

    # override the file_path method to save the picture under its original name and format
    def file_path(self, request, response=None, info=None):
        name = request.meta['name']  # receive the image name passed via meta above
        file_name = name + '.jpg'    # add the suffix
        return file_name
GetImgPipeline is implemented here; it inherits from Scrapy's built-in ImagesPipeline and overrides the following methods:
get_media_requests(): its item parameter is the Item object produced by the crawl. We take its img_url field and build a Request from it. The Request joins the scheduling queue, waits to be scheduled, and is then downloaded.
item_completed(): the handler called when a single Item finishes downloading. Because some images may fail to download, the download results must be analyzed and failed images discarded. Its results parameter is the download result corresponding to the Item: a list whose elements are tuples containing the success or failure information of each download. Here we iterate over the results to collect the paths of all successful downloads. If that list is empty, the image for this Item failed to download and a DropItem exception is raised to discard the Item; otherwise the Item is returned, meaning it is valid.
file_path(): its request parameter is the Request object corresponding to the current download. This method returns the file name to save to; it receives the image name passed via meta above and saves the picture under its original name and format.
Configure settings.py
# settings.py
BOT_NAME = 'get_img'

SPIDER_MODULES = ['get_img.spiders']
NEWSPIDER_MODULE = 'get_img.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 3

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.2

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {'get_img.pipelines.GetImgPipeline': 300}

IMAGES_STORE = './images'  # path where pictures are saved; the folder is created automatically
Run the program:
# switch into the directory of img_spider
scrapy crawl img_spider
With the Scrapy framework, crawling and downloading happen at the same time, and the download speed is very fast.
Check the local images folder and you will find that the pictures have been downloaded successfully.
That is all for this article on how to use the Scrapy crawler framework. I hope the content shared here gives you a solid understanding and proves helpful in your own projects.