
How to use the Python crawler framework Scrapy


This article explains in detail how to use Scrapy, a powerful tool for writing Python crawlers. The content is detailed and the steps are laid out clearly; follow along to work through the material and pick up something new.

Architecture and introduction

Scrapy is an application framework written in pure Python for crawling websites and extracting structured data. It has a wide range of uses.

Scrapy uses the Twisted asynchronous networking framework (its main rival is Tornado) to handle network communication. This speeds up downloads without requiring us to implement an asynchronous framework ourselves, and it exposes a variety of middleware interfaces that can flexibly satisfy all kinds of requirements.

Scrapy Engine: responsible for communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

Scheduler: receives the Requests sent by the engine, arranges them into a queue in a certain order, and returns them to the engine when needed.

Downloader: downloads all Requests sent by the Scrapy Engine and returns the resulting Responses to the engine, which hands them to the Spider for processing.

Spider: processes all Responses, analyzes and extracts data from them to fill the Item fields, and submits any follow-up URLs back to the engine, where they re-enter the Scheduler.

Item Pipeline: processes the Items obtained by the Spider and performs post-processing (detailed analysis, filtering, storage, and so on).

Downloader Middlewares: components you can customize to extend the download functionality (a minimal sketch follows this list).

Spider Middlewares: components you can customize to extend how the engine communicates with the Spider (for example, Responses going into the Spider and Requests coming out of it).
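As an illustration of the middleware hooks, here is a minimal downloader middleware sketch. The class name and header value are hypothetical and not from this article; only the process_request and process_response hook names are Scrapy's own.

class CustomHeaderDownloaderMiddleware:
    # A hypothetical downloader middleware: Scrapy calls these hooks around each download.

    def process_request(self, request, spider):
        # Runs for every request before it reaches the Downloader.
        request.headers.setdefault('User-Agent', 'my-crawler/0.1')
        return None  # None means "continue processing this request normally"

    def process_response(self, request, response, spider):
        # Runs for every response before it is handed back to the Spider.
        spider.logger.debug("Downloaded %s with status %s", response.url, response.status)
        return response

Such a class would take effect only after being registered in the DOWNLOADER_MIDDLEWARES setting in settings.py.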

Development process

Steps to develop a simple crawler:

1. Create a new project:
   scrapy startproject demo

2. Write the spider:
   - seed URLs (the initial requests)
   - parsing methods

3. Write the item:
   - the result data model (a sketch of an item and a pipeline follows this list)

4. Persistence:
   - write the pipelines
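As an illustration of steps 3 and 4, here is a minimal sketch of an item and a pipeline. The RepoItem name, its single field, and the PrintPipeline class are hypothetical, not taken from this article.

# items.py -- a hypothetical result data model
import scrapy


class RepoItem(scrapy.Item):
    name = scrapy.Field()  # declare one Field per piece of data to keep


# pipelines.py -- a hypothetical pipeline that post-processes each item
class PrintPipeline:
    def process_item(self, item, spider):
        # Called once for every item the spider yields; filter, clean, or store it here.
        print(dict(item))
        return item

The pipeline only runs after it is registered in the ITEM_PIPELINES setting in settings.py.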

Project directory introduction:

scrapy.cfg             the project's configuration file
mySpider/              the project's Python module; code is imported from here
mySpider/items.py      the project's item definitions
mySpider/pipelines.py  the project's pipeline file
mySpider/settings.py   the project's settings file
mySpider/spiders/      directory where the crawler code lives

Create a crawler class with the command:

scrapy genspider gitee "gitee.com"
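That command generates a spider skeleton roughly like the one below (the exact template can vary slightly between Scrapy versions):

import scrapy


class GiteeSpider(scrapy.Spider):
    name = 'gitee'                      # the crawler's name, used by "scrapy crawl gitee"
    allowed_domains = ['gitee.com']     # requests outside these domains are filtered out
    start_urls = ['http://gitee.com/']  # seed URLs the crawl starts from

    def parse(self, response):
        pass                            # parsing logic goes here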

Parsing

Parsing usually involves XPath, CSS selectors, and regular expressions; occasionally JSONPath comes up as well (JSON access in Python rarely needs a full JSONPath library, since plain dictionary access is enough, as the short sketch below shows).
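A minimal sketch of that dictionary-access approach; the JSON body and field names are made up for illustration.

import json

# Stand-in for response.text when a site returns a JSON body:
body = '{"name": "haimama", "repos": 3}'
data = json.loads(body)
print(data["name"])   # haimama -- plain dict access, no JSONPath library needed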

Scrapy has built-in support for XPath and CSS selectors.

Selector

The Selector parser can also be used on its own:

xpath()

extract_first()

extract()  # returns a list

Index access: scrapy.selector.unified.SelectorList inherits from list, so its elements can be accessed by index.

from scrapy import Selector

if __name__ == '__main__':
    body = """
    <html>
      <head><title>Title</title></head>
      <body>
        <p class="big">hello big</p>
        <p>hello</p>
      </body>
    </html>
    """
    s = Selector(text=body)
    title = s.xpath("//title/text()").extract_first()  # extract
    print(title)                      # Title
    pe = s.xpath("//p")
    print(s.xpath("//p").extract())   # ['<p class="big">hello big</p>', '<p>hello</p>']
    print(pe)                         # [<Selector ...>, <Selector ...>]
    print(type(pe))                   # <class 'scrapy.selector.unified.SelectorList'>
    print(type(pe[0]))                # <class 'scrapy.selector.unified.Selector'>, access by index
    # print(type(pe.pop()))
    p = s.xpath("//p").extract_first()
    print(p)                          # <p class="big">hello big</p>

css()

CSS selectors use ::text to select content and ::attr() to select attributes.

print(s.css("title").extract_first())               # <title>Title</title>
print(s.css("title::text").extract_first())         # Title
print(s.css("title::text").extract())               # ['Title']
print(s.css("p.big::text").extract_first())         # hello big
print(s.css("p.big::attr(class)").extract_first())  # big

Mixing css() and xpath()

Both scrapy.selector.unified.SelectorList and scrapy.selector.unified.Selector have their own css() and xpath() methods, so the two can be used in combination.

print(s.xpath("//body").css("p.big").extract_first())
print(s.css("body").xpath("//p[@class='big']").extract_first())
# <p class="big">hello big</p>
# <p class="big">hello big</p>

re() and re_first()

Both scrapy.selector.unified.SelectorList and scrapy.selector.unified.Selector have a re() method that supports filtering with regular expressions.

print(s.xpath("//p/text()").re_first("big"))   # big
print(type(s.xpath("//p/text()").re("big")))   # <class 'list'>

Note that re() returns a list and re_first() returns a str, so you cannot chain further selection methods after them.

Using parsers in crawlers

The response object already has this parsing support built in:

import scrapy


class GiteeSpider(scrapy.Spider):
    name = 'gitee'
    allowed_domains = ['gitee.com']
    start_urls = ['https://gitee.com/haimama']

    def parse(self, response):
        print(type(response))
        t = response.xpath("//title/text()").extract_first()
        print(t)

# Result after starting the crawler (other output omitted):
# <class 'scrapy.http.response.html.HtmlResponse'>
# haimama-Gitee

The response object's type is scrapy.http.response.html.HtmlResponse, which inherits from TextResponse and provides xpath() and css() methods defined as follows, so a response can be parsed directly with the Selector methods shown above:

def xpath(self, query, **kwargs):
    return self.selector.xpath(query, **kwargs)

def css(self, query):
    return self.selector.css(query)

Configuration file

settings.py is the crawler's configuration file. For the crawler to run properly against most sites, remember to relax the robots protocol restriction by setting ROBOTSTXT_OBEY = False.
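For reference, the relevant excerpt of settings.py would look like this; only this one line is required by the article, the rest of the file stays as generated.

# settings.py (excerpt)
# Scrapy obeys robots.txt by default; disable that check so the spider
# is not blocked by the target site's robots rules.
ROBOTSTXT_OBEY = False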

We will introduce other related configurations in the next section.

Start the crawler

Create a run.py file in the crawler directory and add the following script so the crawler can be run directly; it is equivalent to executing scrapy crawl gitee on the command line, where gitee is the crawler's name and corresponds to the name field in GiteeSpider.

# coding: utf-8
from scrapy import cmdline

if __name__ == '__main__':
    cmdline.execute("scrapy crawl gitee".split())  # equivalent to running: scrapy crawl gitee

That concludes this introduction to how to use the Python crawler framework Scrapy. To really master the material you still need to practice with it and understand it through use. For more related articles, please follow the industry information channel.
