Installation and use of Scrapy: installation, architecture, crawler concepts and workflow, project development process, and common settings
Scrapy basic usage. Reference documentation: https://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/shell.html. Environment used in this article: Windows 7 + PyCharm + Python 3.7.
Many pitfalls were encountered while installing Scrapy. Scrapy is built on Twisted, so installing scrapy directly may fail with pip version prompts and various errors. Upgrade pip first with pip install --upgrade pip, then install the dependencies in order: zope.interface --> pyOpenSSL --> Twisted --> scrapy. Because the package source is a foreign site, access may fail and downloads can be slow, so it is recommended to download the .whl packages directly from the official site and install them from the terminal, which is more convenient. Running Scrapy also requires win32api; install it with pip install pypiwin32.
Enter the scrapy command at the terminal to view the available commands:
Usage:
  scrapy <command> [options] [args]
Available commands:
  bench          Run quick benchmark test
  fetch          Fetch a URL using the Scrapy downloader
  genspider      Generate new spider using pre-defined templates
  runspider      Run a self-contained spider (without creating a project)
  settings       Get settings values
  shell          Interactive scraping console
  startproject   Create new project
  version        Print Scrapy version
  view           Open URL in browser, as seen by Scrapy
Scrapy architecture
Architecture diagram
Functions of each component:
Scrapy Engine: responsible for the communication, signals, and data transfer between the Spider, Downloader, Scheduler, ItemPipeline, and the other components.
Scheduler: receives the Request objects sent by the engine, arranges and enqueues them in a certain order, and returns them to the engine when needed.
Downloader: downloads all the Requests sent by the engine and returns the obtained Responses to the engine, which hands them to the Spider for processing.
Spider: processes all the Responses, analyzes and extracts the data needed into the Item fields, and submits the URLs that need to be followed up back to the engine so they enter the Scheduler again.
ItemPipeline: processes the Items obtained from the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).
DownloaderMiddlewares: components that can be customized to extend the download functionality (a minimal sketch follows below).
SpiderMiddlewares: components that can be customized to extend the communication between the engine and the Spider (such as Responses going into the Spider and Requests coming out of the Spider).
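As an illustration of where downloader middlewares hook into this flow, here is a minimal sketch that is not taken from the original article; the class name and header value are invented for the example.

```python
# middlewares.py (illustrative sketch; class name and header value are placeholders)
class RandomUserAgentMiddleware:
    """A minimal downloader middleware that sets a User-Agent header
    on every request before the Downloader fetches it."""

    def process_request(self, request, spider):
        # Set a default User-Agent if the request does not already have one.
        request.headers.setdefault(b'User-Agent', b'Mozilla/5.0 (compatible; DemoBot/1.0)')
        return None  # returning None lets the request continue through the middleware chain
```

It would then be enabled in settings.py with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomUserAgentMiddleware': 543}, where 'myproject' stands for your project's package name.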
Crawler concept flow
Concept: a crawler, also known as a web spider or web robot, is a program that simulates a browser's behavior when requesting a website; it can automatically request web pages, grab the data, and then extract the valuable parts according to certain rules.
Basic process:
Initiate a request --> get the response content --> parse the content --> save / process the data (see the sketch below)
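As a minimal sketch of these four steps using only the Python standard library (the URL and the parsing rule are placeholders, not taken from the article):

```python
# Minimal illustration of the four steps, independent of Scrapy.
import re
import urllib.request

# 1. Initiate a request
response = urllib.request.urlopen("https://example.com")

# 2. Get the response content
html = response.read().decode("utf-8")

# 3. Parse the content (here: grab the page title with a simple regex)
match = re.search(r"<title>(.*?)</title>", html, re.S)
title = match.group(1).strip() if match else ""

# 4. Save / process the data
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(title + "\n")
```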
Scrapy project development process
1. Create a project: scrapy startproject xxx
2. Write items.py: define the information you want to grab.
3. Write the spider (spider.py).
4. Store the content (pipelines.py).
Example: crawl Douban movie information:
1. Create the project with scrapy startproject tutorial; this creates a project directory and generates the related configuration files.
2. Analyze the requirements and inspect the web page, then write the items.py file that will hold the crawled data. An Item behaves much like a dictionary; it is defined by declaring class attributes of type scrapy.Field.
import scrapy

class DoubanItem(scrapy.Item):
    # serial number
    serial_number = scrapy.Field()
    # name
    movie_name = scrapy.Field()
    # introduction
    introduce = scrapy.Field()
    # star rating
    star = scrapy.Field()
    # number of comments
    evalute = scrapy.Field()
    # description
    desc = scrapy.Field()
3. Write the spider file; the crawler logic is mainly implemented here.
Go into the project root directory and execute: scrapy genspider spider_name "domains". Here spider_name is the crawler's name and must be unique, and "domains" specifies the domain scope to crawl. This creates the spider file; of course, the file can also be created manually.
The spider subclasses scrapy.Spider, whose methods can be overridden (a skeleton sketch follows below).
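For reference, the skeleton produced by scrapy genspider from the basic template looks roughly like the following; the names shown are placeholders:

```python
import scrapy

class SpiderNameSpider(scrapy.Spider):
    name = "spider_name"              # unique crawler name
    allowed_domains = ["domains"]     # domain scope to crawl
    start_urls = ["http://domains/"]

    def parse(self, response):
        # override this method to extract data from the response
        pass
```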
import scrapy
from ..items import DoubanItem

class DoubanSpider(scrapy.Spider):
    name = 'douban_mv'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']//li//div[@class='item']")
        for movie in movie_list:
            items = DoubanItem()
            items['serial_number'] = movie.xpath('.//div[@class="pic"]/em/text()').extract_first()
            items['movie_name'] = movie.xpath('.//div[@class="hd"]/a/span/text()').extract_first()
            introduce = movie.xpath('.//div[@class="bd"]/p/text()').extract()
            items['introduce'] = "".join(introduce).replace(' ', '').replace('\n', '').strip()
            items['star'] = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract_first()
            items['evalute'] = movie.xpath('.//div[@class="star"]/span[4]/text()').extract_first()
            items['desc'] = movie.xpath('.//p[@class="quote"]/span[@class="inq"]/text()').extract_first()
            yield items
        # "next-page": implement the page-flipping operation
        link = response.xpath('//span[@class="next"]/link/@href').extract_first()
        if link:
            yield response.follow(link, callback=self.parse)
4. Write the pipelines.py file.
Pipelines clean HTML data, validate the crawled data, remove duplicates (discarding them), and save the results as CSV, JSON, a database, etc. Each item pipeline component only takes effect if it is enabled in the settings (a settings sketch follows after the code below), and the process_item(self, item, spider) method is called for every item.
open_spider(self, spider): this method is called when the spider is opened.
close_spider(self, spider): this method is called when the spider is closed.
# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import csv

class TutorialPipeline(object):
    def open_spider(self, spider):
        pass

    def __init__(self):
        # write each item as one JSON line
        self.file = open('item.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

class DoubanPipeline(object):
    def open_spider(self, spider):
        # write all items into a CSV file with a header row
        self.csv_file = open('movies.csv', 'w', encoding='utf8', newline='')
        self.writer = csv.writer(self.csv_file)
        self.writer.writerow(["serial_number", "movie_name", "introduce", "star", "evalute", "desc"])

    def process_item(self, item, spider):
        self.writer.writerow([v for v in item.values()])
        return item  # return the item so later pipelines can still process it

    def close_spider(self, spider):
        self.csv_file.close()
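The pipelines above only run once they are enabled in settings.py. A minimal sketch, assuming the project package is named tutorial as created above (adjust the dotted paths to your actual project name):

```python
# settings.py (sketch; the module path assumes the project is named "tutorial")
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,  # lower number = higher priority, runs first
    'tutorial.pipelines.DoubanPipeline': 400,
}
```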
5. Start running the crawler
Run scrapy crawl "spider_name" to crawl the data and display it in the terminal. Note that spider_name is the value of the name attribute in the spider file.
You can also export the output to a file directly from the command line, in which case step 4 (the pipelines) is not needed:
scrapy crawl demoz -o items.json
scrapy crawl itcast -o teachers.csv
scrapy crawl itcast -o teachers.xml
Scrapy common settings
Modify the configuration file: settings.py
ITEM_PIPELINES: each pipeline is followed by a number from 0 to 1000 that determines the order in which the pipelines run; the smaller the number, the higher the priority.
DEPTH_LIMIT: the maximum depth allowed when crawling a website; 0 means no limit.
FEED_EXPORT_ENCODING = 'utf-8': set the export encoding.
DOWNLOAD_DELAY = 1: delay between requests, so the crawler is not blocked for requesting too frequently.
USER_AGENT: set the User-Agent header.
LOG_LEVEL = 'INFO': set the log level.
COOKIES_ENABLED = False: disable cookies.
CONCURRENT_REQUESTS = 32: maximum number of concurrent requests.
DEFAULT_REQUEST_HEADERS = {...}: set the default request headers.
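Put together, a settings.py fragment using the options listed above might look like the following; it is a sketch with example values, and the User-Agent string is a placeholder rather than anything from the original article:

```python
# settings.py (sketch combining the options listed above; values are examples)
DEPTH_LIMIT = 0                      # 0 = no depth limit
FEED_EXPORT_ENCODING = 'utf-8'       # export encoding for -o output
DOWNLOAD_DELAY = 1                   # seconds to wait between requests
USER_AGENT = 'Mozilla/5.0 (compatible; DemoBot/1.0)'   # placeholder User-Agent
LOG_LEVEL = 'INFO'                   # log verbosity
COOKIES_ENABLED = False              # disable cookies
CONCURRENT_REQUESTS = 32             # maximum concurrent requests
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
```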
XPath usage
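As a brief illustration in the spirit of the XPath expressions used in the spider above, here is a hedged sketch with a made-up HTML snippet; Selector and extract_first() are standard Scrapy APIs.

```python
# A small sketch of XPath extraction with a Scrapy Selector
# (the HTML snippet below is made up for illustration).
from scrapy.selector import Selector

html = '''
<div class="item">
  <div class="hd"><a><span>The Shawshank Redemption</span></a></div>
  <div class="star"><span class="rating_num">9.7</span></div>
</div>
'''

sel = Selector(text=html)
# text() selects text nodes; @attr would select an attribute value instead
title = sel.xpath('//div[@class="hd"]/a/span/text()').extract_first()
score = sel.xpath('//div[@class="star"]/span[@class="rating_num"]/text()').extract_first()
print(title, score)   # -> The Shawshank Redemption 9.7
```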