This article introduces in detail how to use Scrapy, a powerful tool for Python crawlers. The content is organized step by step: architecture, development workflow, parsing with selectors, configuration, and running a crawler. Follow along with the examples and you should be able to resolve most of your doubts about Scrapy.
Architecture and introduction
Scrapy is an application framework written in pure Python for crawling websites and extracting structured data. It has a wide range of uses.
Scrapy uses the Twisted asynchronous networking framework (its main alternative is Tornado) to handle network communication. This speeds up downloading without requiring us to implement an asynchronous framework ourselves, and it exposes a variety of middleware interfaces that let us flexibly meet all kinds of requirements.
Scrapy Engine: responsible for the communication, signals and data transfer between the Spider, Item Pipeline, Downloader and Scheduler.
Scheduler: receives the Request objects sent by the engine, arranges and enqueues them in a certain order, and returns them to the engine when needed.
Downloader: downloads all Requests sent by the Scrapy Engine and returns the resulting Responses to the engine, which hands them over to the Spider for processing.
Spider: processes all Responses, analyses and extracts data from them, fills the fields needed by the Item, and submits any follow-up URLs back to the engine, where they enter the Scheduler again.
Item Pipeline: processes the Items obtained by the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).
Downloader Middlewares: components that can be customized to extend the download functionality (a minimal sketch follows this list).
Spider Middlewares: functional components that can customize and extend the communication between the engine and the Spider (such as Responses entering the Spider and Requests leaving the Spider).
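To make the role of Downloader Middlewares more concrete, here is a minimal sketch (not from the original article) of a custom middleware that sets a User-Agent header on every outgoing request; the class name, module path and priority value are illustrative assumptions.

# mySpider/middlewares.py (illustrative sketch; names are assumptions)
class DefaultUserAgentMiddleware:
    """Attach a User-Agent header to every request before it is downloaded."""

    def process_request(self, request, spider):
        request.headers.setdefault(b"User-Agent", b"my-crawler/1.0")
        return None  # returning None lets the normal download continue

# Enabled in settings.py, for example:
# DOWNLOADER_MIDDLEWARES = {"mySpider.middlewares.DefaultUserAgentMiddleware": 543}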
Development process
Steps to develop a simple crawler:
1. Create a new project: scrapy startproject demo
2. Write the spider: seed URLs (requests) and parsing methods
3. Write the item: the result data model
4. Persistence: write the pipelines
An illustrative sketch of steps 3 and 4 follows this list.
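Below is a minimal sketch of what the item (the result data model) and a persistence pipeline could look like; the field names, class names and the pipeline behaviour are illustrative assumptions, not code from the original article.

# mySpider/items.py (illustrative sketch)
import scrapy

class RepoItem(scrapy.Item):
    # fields of the result data model; the names are assumptions
    name = scrapy.Field()
    url = scrapy.Field()

# mySpider/pipelines.py (illustrative sketch)
class PrintPipeline:
    """A trivial pipeline that just prints every item it receives."""

    def process_item(self, item, spider):
        print(dict(item))
        return item

# Pipelines are enabled in settings.py, for example:
# ITEM_PIPELINES = {"mySpider.pipelines.PrintPipeline": 300}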
Generated directory structure:
scrapy.cfg: the project's configuration file
mySpider/: the project's Python module; your code is imported from here
mySpider/items.py: the project's item definitions
mySpider/pipelines.py: the project's pipeline file
mySpider/settings.py: the project's settings file
mySpider/spiders/: the directory that holds the spider code

Create a spider class with the command:
scrapy genspider gitee "gitee.com"
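For reference, this command generates a spider skeleton roughly like the following (the exact template may vary slightly between Scrapy versions):

import scrapy

class GiteeSpider(scrapy.Spider):
    name = 'gitee'
    allowed_domains = ['gitee.com']
    start_urls = ['http://gitee.com/']

    def parse(self, response):
        pass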
Parsing

Our parsing usually involves XPath, CSS selectors and regular expressions, and sometimes JSONPath (for JSON in Python, complex JSONPath is rarely needed; plain dictionary access is fine).
Scrapy has built-in support for XPath and CSS selectors.
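As a quick aside on the JSON case, a minimal sketch of plain dictionary access after json.loads(); the sample payload is an assumption:

import json

payload = '{"repo": {"name": "demo", "stars": 42}}'
data = json.loads(payload)
print(data["repo"]["name"])   # demo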
Selector
The Selector itself can also be used on its own.
xpath()
extract_first()
extract()  # returns a list
Index access: because scrapy.selector.unified.SelectorList inherits from list, elements can be accessed by index.
from scrapy import Selector

if __name__ == '__main__':
    # sample HTML used throughout the examples below
    body = """
    <html>
      <head><title>Title</title></head>
      <body>
        <p class="big">hello big</p>
        <p>hello</p>
      </body>
    </html>
    """
    s = Selector(text=body)
    title = s.xpath("//title/text()").extract_first()  # extract the first match
    print(title)                      # Title
    pe = s.xpath("//p")
    print(s.xpath("//p").extract())   # ['<p class="big">hello big</p>', '<p>hello</p>']
    print(pe)                         # [<Selector ...>, <Selector ...>]
    print(type(pe))                   # <class 'scrapy.selector.unified.SelectorList'>
    print(type(pe[0]))                # <class 'scrapy.selector.unified.Selector'>  (access by index)
    # print(type(pe.pop()))           # <class 'scrapy.selector.unified.Selector'>
    p = s.xpath("//p").extract_first()
    print(p)                          # <p class="big">hello big</p>
css()

With CSS selectors, use ::text to select the text content and ::attr() to select an attribute.

print(s.css("title").extract_first())               # <title>Title</title>
print(s.css("title::text").extract_first())         # Title
print(s.css("title::text").extract())               # ['Title']
print(s.css("p.big::text").extract_first())         # hello big
print(s.css("p.big::attr(class)").extract_first())  # big
Mixing css() and xpath()

Both scrapy.selector.unified.SelectorList and scrapy.selector.unified.Selector have css() and xpath() methods, so the two can be combined.

print(s.xpath("//body").css("p.big").extract_first())           # <p class="big">hello big</p>
print(s.css("body").xpath("//p[@class='big']").extract_first()) # <p class="big">hello big</p>
re() and re_first()

scrapy.selector.unified.SelectorList and scrapy.selector.unified.Selector also have a re() method that filters the results with a regular expression.

print(s.xpath("//p/text()").re_first("big"))   # big
print(type(s.xpath("//p/text()").re("big")))   # <class 'list'>

Note that re() returns a list and re_first() returns a str, so you cannot chain any further selection methods after them.
Using selectors in the spider

The response object passed to the spider already has a selector built in, so the parsing methods can be called on it directly:

import scrapy

class GiteeSpider(scrapy.Spider):
    name = 'gitee'
    allowed_domains = ['gitee.com']
    start_urls = ['https://gitee.com/haimama']

    def parse(self, response):
        print(type(response))
        t = response.xpath("//title/text()").extract_first()
        print(t)

# result after starting the crawler (other output omitted)
# <class 'scrapy.http.response.html.HtmlResponse'>
# haimama - Gitee
The response object is of type scrapy.http.response.html.HtmlResponse, which inherits from TextResponse and exposes xpath() and css() methods implemented as follows:

def xpath(self, query, **kwargs):
    return self.selector.xpath(query, **kwargs)

def css(self, query):
    return self.selector.css(query)

So a response can be parsed directly with the Selector methods described above.

Configuration file
settings.py is the crawler's configuration file. To start the crawler normally, note that the robots protocol restriction must be relaxed by setting ROBOTSTXT_OBEY = False.
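A minimal settings.py excerpt showing that change; the commented-out options are common additions and are assumptions, not part of the original article:

# settings.py (excerpt)
ROBOTSTXT_OBEY = False    # do not enforce robots.txt for this demo
# DOWNLOAD_DELAY = 1      # optional: slow down requests to be polite to the site
# USER_AGENT = "my-crawler/1.0"   # optional: identify the crawler explicitly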
We will introduce other related configurations in the next section.
Start the crawler
Create a run.py in the crawler directory with the following script so that the crawler can be executed directly; it is equivalent to running scrapy crawl gitee on the command line, where gitee is the crawler's name and corresponds to the name field in GiteeSpider.
# coding: utf-8
from scrapy import cmdline

if __name__ == '__main__':
    cmdline.execute("scrapy crawl gitee".split())   # equivalent to: scrapy crawl gitee

This concludes the article "How to use Scrapy, the powerful Python crawler tool". To truly master the material, you still need to practise with it yourself.