How to use the web crawler framework Scrapy

2025-04-27 Update From: SLTechnology News&Howtos > Development


Shulou(Shulou.com)06/01 Report--

This article explains in detail how to use the web crawler framework Scrapy. We think it is very practical and share it here as a reference; we hope you take something away from reading it.

Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it has a wide range of uses.

The framework does the heavy lifting: users only need to customize a few modules to build a crawler that grabs web content and all kinds of images, which is very convenient.

Installation

First, make sure that Python 3 and the corresponding pip are installed on your computer. You can check with the following commands:

$ python3 --version
Python 3.6.3
$ pip3 --version
pip 9.0.1 from /usr/local/lib/python3.6/site-packages (python 3.6)

If they are not installed, Homebrew is a recommended tool for installing them.

pip is the package management tool for Python, similar to npm: it installs and uninstalls third-party Python modules and automatically handles their dependencies. Here we use it to install the Scrapy module:

$ pip3 install scrapy

Tutorial: a crawler that grabs the Douban movie Top 250

First, we use the following command to create and initialize the Scrapy project:

$ scrapy startproject doubanmovie

This creates a doubanmovie crawler project in the current directory with the following internal structure:

$ tree .
.
├── doubanmovie
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

Where:

scrapy.cfg is the core configuration file of the Scrapy project

items.py defines the attribute structure of the data entities the crawler extracts

pipelines.py defines the processing flow applied to the extracted entities, such as writing them to the file system or a database

settings.py is the crawler's configuration file, where pipelines and middlewares are registered

the spiders folder holds the crawler files

Next, we need to define the attribute structure of the movie entity in the items.py file:

class DoubanmovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    rank = scrapy.Field()    # Douban ranking
    title = scrapy.Field()   # Movie name
    poster = scrapy.Field()  # Movie poster
    link = scrapy.Field()    # Link address
    rating = scrapy.Field()  # Douban rating
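A scrapy.Item instance behaves like a dict keyed by the declared Fields, and each XPath extract() call returns a list. A minimal sketch of what a populated item looks like downstream (the values here are invented for illustration), which also explains why later code indexes [0]:

```python
# Hypothetical populated movie item, shown as the plain dict a
# scrapy.Item behaves like; every field holds a list because
# Scrapy's extract() always returns a list of matches.
movie = {
    'rank': ['1'],
    'title': ['Example Movie'],
    'poster': ['https://example.com/poster.jpg'],
    'link': ['https://movie.douban.com/subject/0000000/'],
    'rating': ['9.0'],
}

# Downstream code (e.g. a pipeline) takes the first match:
print('Douban rank: ' + movie['rank'][0])
```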

Then, we create a crawler named moviespider using the following command:

$ scrapy genspider moviespider douban.com

After running it, a crawler file named moviespider.py is generated in the spiders directory. It defines basic information such as the crawler's name, allowed domains and starting URL, as well as a parse function whose main job is to extract HTML elements from the page via XPath and output the parsed results:

class MoviespiderSpider(scrapy.Spider):
    name = 'moviespider'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        movie_items = response.xpath('//div[@class="item"]')
        for item in movie_items:
            movie = DoubanmovieItem()
            movie['rank'] = item.xpath('div[@class="pic"]/em/text()').extract()
            movie['title'] = item.xpath('div[@class="info"]/div[@class="hd"]/a/span[@class="title"][1]/text()').extract()
            movie['poster'] = item.xpath('div[@class="pic"]/a/img/@src').extract()
            movie['link'] = item.xpath('div[@class="info"]/div[@class="hd"]/a/@href').extract()
            movie['rating'] = item.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()
            yield movie
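Scrapy's selectors support full XPath; the standard library's xml.etree.ElementTree supports only a subset, but it is enough to illustrate the attribute-predicate pattern the spider relies on. A self-contained sketch on a tiny, invented fragment that mimics the Top 250 markup (the fragment and its values are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Invented fragment mimicking the structure of the Top 250 list page.
html = '''
<ol>
  <div class="item">
    <div class="pic"><em>1</em></div>
  </div>
  <div class="item">
    <div class="pic"><em>2</em></div>
  </div>
</ol>
'''

root = ET.fromstring(html)
# Same predicate pattern as the spider: //div[@class="item"]
items = root.findall('.//div[@class="item"]')
# And the relative lookup inside each item: div[@class="pic"]/em/text()
ranks = [it.find('div[@class="pic"]/em').text for it in items]
print(ranks)  # expect ['1', '2']
```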

After the crawler parses the entity data, the results can be printed, stored in a file, written to a database, and so on, through a Pipeline:

class DoubanmoviePipeline(object):
    def process_item(self, item, spider):
        print('Douban rank: ' + item['rank'][0])
        print('Movie name: ' + item['title'][0])
        print('Link address: ' + item['link'][0])
        print('Douban rating: ' + item['rating'][0] + '\n')
        return item

Because Douban Movie's website uses anti-crawler measures, running the crawler after the steps above returns an HTTP 403 status code. We therefore add a User-Agent header to each request to make it look like a browser:

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class FakeUserAgentMiddleware(UserAgentMiddleware):
    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) '
            'AppleWebKit/537.36 (KHTML, like Gecko) '
            'Chrome/63.0.3239.84 Safari/537.36')
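The same header trick works outside Scrapy. A sketch with the standard library's urllib, building (but not sending) a request that carries the browser User-Agent from the middleware above:

```python
import urllib.request

ua = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) '
      'AppleWebKit/537.36 (KHTML, like Gecko) '
      'Chrome/63.0.3239.84 Safari/537.36')

# Build the request with the header attached; nothing is fetched here.
req = urllib.request.Request('https://movie.douban.com/top250',
                             headers={'User-Agent': ua})

# Note: urllib normalizes stored header keys to capitalized form,
# so the lookup key is 'User-agent'.
print(req.get_header('User-agent') == ua)
```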

Finally, we write the above changes to the configuration file:

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'doubanmovie.fakeuseragent.FakeUserAgentMiddleware': 543,
}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'doubanmovie.pipelines.DoubanmoviePipeline': 300,
}

Run the scrapy crawl moviespider command, and the crawled data is printed to the console.

This concludes this article on "how to use the web crawler framework Scrapy". We hope the content above was helpful and taught you something new; if you think the article is good, please share it so more people can see it.



