This article introduces how to quickly get up to speed with the scrapy crawler framework. Many people have questions about how to master scrapy quickly, so the material below is organized into simple, practical steps; hopefully it answers those questions. Let's get started.
1. Brief introduction to scrapy
Scrapy is a crawler framework written in pure Python on top of the event-driven Twisted framework. I have used Scrapy to crawl images and text from the web for a long time without ever writing the details down; since I recently had to pick Scrapy up again, I am sharing these notes here and welcome any feedback.
1.1 scrapy framework
The scrapy framework consists of five main components and two middleware hooks.
ENGINE: the control center of the whole framework. It controls the flow of the entire crawl and dispatches events as different conditions arise (this is where Twisted comes in)
SCHEDULER: the event scheduler; it queues incoming requests
DOWNLOADER: receives crawl requests and downloads the data from the Internet
SPIDERS: issues crawl requests and parses the page content returned by DOWNLOADER, handing the results on for persistence; this part is written by the developer
ITEM PIPELINES: receives the structured fields parsed by SPIDERS, persists them, and performs other operations; this part is also written by the developer
MIDDLEWARES: extra operations hooked in between ENGINE and SPIDERS, and between ENGINE and DOWNLOADER; they are exposed to developers as hooks
From the above, we only need to implement SPIDERS (which site to crawl and how to parse it) and ITEM PIPELINES (what to do with the parsed content). Everything else is handled by the framework, as sketched below.
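To make that division of labour concrete, here is a minimal sketch (not the example used later in this article) of the two pieces a developer writes; the quotes.toscrape.com demo site, the XPath, and the class names are stand-ins chosen only for illustration.

import scrapy


class QuotesSpider(scrapy.Spider):                # SPIDERS part: what to crawl, how to parse it
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # extract one field per quote; each yielded dict flows on to the pipelines
        for text in response.xpath('//span[@class="text"]/text()').getall():
            yield {'text': text}


class PrintPipeline:                              # ITEM PIPELINES part: what to do with parsed fields
    def process_item(self, item, spider):
        print(item)                               # persist, clean, or filter here
        return item

(A pipeline still has to be enabled in settings.py; that step is shown in section 5.2.)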
1.2 scrapy data flow
If we take a closer look at the data flow between components, we will have a better understanding of the internal operation of the framework.
1. SPIDERS sends a crawl request to ENGINE: "a task is coming."
2. ENGINE adds the request to the SCHEDULER queue: "this one is yours, schedule it for me."
3. SCHEDULER looks at the many crawl requests in its hands and hands one back to ENGINE: "boss, please forward this one to DOWNLOADER."
4. ENGINE: "OK, DOWNLOADER, your task is here."
5. DOWNLOADER downloads the page and hands the result back to ENGINE.
6. ENGINE passes the result to SPIDERS: "one of your requests has been downloaded, go parse it."
7. SPIDERS parses the response, produces the result fields, and hands them back to ENGINE, which forwards them to ITEM PIPELINES.
8. ITEM PIPELINES receives the fields and saves them.
A request goes through steps 1 to 8 before it is finally complete. Does that feel redundant? ENGINE sits in the middle as a mere mouthpiece; could we just skip it? Consider what would happen if we did.
Let's analyze it.
The role of SCHEDULER: task scheduling; it controls the concurrency of tasks and keeps the machine from being overwhelmed.
The role of ENGINE: built on the Twisted framework, it reacts to events (such as a request to forward) by executing the corresponding callback. ENGINE unifies all operations and organizes the other components around events, so the components stay loosely coupled; for a framework, it is undoubtedly necessary.
2. Basics: XPath
The most important part of writing a crawler is parsing the content of the web page. This section introduces how to parse a page and extract content with XPath.
2.1 HTML nodes and attributes
2.2 parsing syntax
a/b: "/" expresses a hierarchical relationship in XPath; the a on the left is the parent node and the b on the right is a direct child node
a//b: all b under a, whether direct or indirect descendants
[@]: selects nodes that have a given attribute
//div[@class], //a[@x]: selects div nodes that have a class attribute, and a nodes that have an x attribute
//div[@class="container"]: selects div nodes whose class attribute equals "container"
//a[contains(@id, "abc")]: selects a tags whose id attribute contains "abc"
An example
response.xpath('//div[@class="taglist"]/ul//li//a//img/@data-original').getall()
# takes every div whose class attribute is "taglist", the ul directly beneath it,
# then all li, a and img nodes under that, and finally the data-original attribute;
# data-original here holds the url of the image
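If you want to try these expressions without a live site, here is a minimal self-contained sketch that runs Scrapy's Selector against a made-up HTML fragment shaped like the pages crawled later in this article (the fragment and URLs are invented for illustration):

from scrapy.selector import Selector

html = '''
<div class="taglist">
  <ul>
    <li><a id="abc1"><img data-original="http://example.com/1.jpg"></a></li>
    <li><a id="abc2"><img data-original="http://example.com/2.jpg"></a></li>
  </ul>
</div>
'''

sel = Selector(text=html)
print(sel.xpath('//div[@class]').get())                 # first div that has a class attribute
print(sel.xpath('//div[@class="taglist"]').get())       # div whose class is exactly "taglist"
print(sel.xpath('//a[contains(@id, "abc")]').getall())  # every a tag whose id contains "abc"
print(sel.xpath('//div[@class="taglist"]/ul//li//a//img/@data-original').getall())
# -> ['http://example.com/1.jpg', 'http://example.com/2.jpg']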
3. Installation and deployment
Scrapy is written in pure Python and relies on several key Python packages (among others):
lxml, an efficient XML and HTML parser
parsel, an HTML/XML data-extraction library built on top of lxml
w3lib, a multi-purpose helper for dealing with URLs and web page encodings
Twisted, an asynchronous networking framework
cryptography and pyOpenSSL, which cover various network-level security needs
# install
pip install scrapy
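Once the install finishes, a quick sanity check is to ask Scrapy for its version (the number printed will depend on what pip installed):

scrapy version    # prints the installed Scrapy version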
4. Create a crawler project
scrapy startproject sexy

# resulting directory structure:
# sexy
# │  scrapy.cfg
# │
# └─ sexy
#    │  items.py
#    │  middlewares.py
#    │  pipelines.py
#    │  settings.py
#    │  __init__.py
#    │
#    ├─ spiders
#    │  │  __init__.py
#    │  │
#    │  └─ __pycache__
#    └─ __pycache__
#
# to run the crawl, execute `scrapy crawl sexy` from the directory that contains scrapy.cfg
As you can see, the spider classes themselves go under spiders/, and we also fill in items.py and pipelines.py (which correspond to ITEM PIPELINES).
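If you prefer not to create the spider file by hand, Scrapy can also generate a stub for you with its genspider command; the file name and domain below are simply the ones used in the examples that follow:

cd sexy
scrapy genspider sexy_spider uumdfdfnt.94demo.com
# creates sexy/spiders/sexy_spider.py with a skeleton Spider class;
# the examples below rename the spider to 'sexy' and fill in parse() by hand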
5. Start scrapy crawler
5.1 simple but powerful spider
The spider here downloads images from an image site and saves them locally (the url has been anonymized). The points worth noting are marked in the comments:
The class must inherit from scrapy.Spider
Give it a unique name
Add the website url to be crawled to the start_urls list
Override parse to parse the contents of the response with XPath
Notice that this parse implementation does not forward anything to ITEM PIPELINES; it processes the results directly. Something this simple can be handled that way, but if the business logic is more complex, it is recommended to hand the data over to ITEM PIPELINES. An example follows later.
# spiders/sexy_spider.py
import os
import time

import requests
import scrapy


def download_from_url(url):
    response = requests.get(url, stream=True)
    if response.status_code == requests.codes.ok:
        return response.content
    else:
        print('%s download failed: %s' % (url, response.status_code))
        return None


class SexySpider(scrapy.Spider):
    # if there are multiple spiders, each name must be unique
    name = 'sexy'
    allowed_domains = ['uumdfdfnt.94demo.com']
    allowed_urls = ['http://uumdfdfnt.94demo.com/']
    # the website url that needs to be crawled is added to the start_urls list
    start_urls = ['http://uumdfdfnt.94demo.com/tag/dingziku/index.html']
    save_path = '/home/sexy/dingziku'

    def parse(self, response):
        # parse the page and get the list of image urls
        img_list = response.xpath('//div[@class="taglist"]/ul//li//a//img/@data-original').getall()
        time.sleep(1)

        # process the images here; the concrete business logic can also be handed
        # over to items instead, see the items example in 5.2
        for img_url in img_list:
            file_name = img_url.split('/')[-1]
            content = download_from_url(img_url)
            if content is not None:
                with open(os.path.join(self.save_path, file_name), 'wb') as fw:
                    fw.write(content)

        # automatic next page (see section 5.3)
        next_page = response.xpath('//div[@class="page both"]/ul/a[text()="next page"]/@href').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
5.2 items and pipelines example
First, what the two files do:
items: defines a field container; the spider stores its parsed data there
pipelines: takes the data out of items and runs the business logic on it, for example saving the images as in 5.1, or writing them to a database
Let's rewrite the above example.
items.py simply defines the fields with scrapy.Field().
import scrapy


class SexyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_url = scrapy.Field()
spiders/sexy_spider.py

import time

import scrapy

# import the item
from ..items import SexyItem


class SexySpider(scrapy.Spider):
    # if there is more than one spider, each name must be unique
    name = 'sexy'
    allowed_domains = ['uumdfdfnt.94demo.com']
    allowed_urls = ['http://uumdfdfnt.94demo.com/']
    # the website url that needs to be crawled is added to the start_urls list
    start_urls = ['http://uumdfdfnt.94demo.com/tag/dingziku/index.html']
    save_path = '/home/sexy/dingziku'

    def parse(self, response):
        # parse the page and get the list of image urls
        img_list = response.xpath('//div[@class="taglist"]/ul//li//a//img/@data-original').getall()
        time.sleep(1)

        # instead of processing the images here, hand each url to the pipeline via an item
        for img_url in img_list:
            item = SexyItem()
            item['img_url'] = img_url
            yield item
pipelines.py

import os

import requests


def download_from_url(url):
    response = requests.get(url, stream=True)
    if response.status_code == requests.codes.ok:
        return response.content
    else:
        print('%s download failed: %s' % (url, response.status_code))
        return None


class SexyPipeline(object):
    def __init__(self):
        self.save_path = '/tmp'

    def process_item(self, item, spider):
        if spider.name == 'sexy':
            # take the content out of the item
            img_url = item['img_url']

            # business processing
            file_name = img_url.split('/')[-1]
            content = download_from_url(img_url)
            if content is not None:
                with open(os.path.join(self.save_path, file_name), 'wb') as fw:
                    fw.write(content)
        return item
The important configuration step is to enable the pipeline class in settings.py; the number indicates its priority.
ITEM_PIPELINES = {'sexy.pipelines.SexyPipeline': 300,}
5.3 automatic next page
Sometimes we not only have to crawl the content of the requested page but also recursively follow the hyperlinks inside it, especially for a "next page" whose parsed content has the same structure as the current page. One clumsy way is to add every page to start_urls by hand; we are smarter than that, so try this instead:
First, parse the url of the next page out of the current page.
scrapy.Request(next_page, callback=self.parse) then issues a new request and uses parse as its callback; of course, you can point it at a different parsing method.
Perfect. For a complete example, see 5.1.
next_page = response.xpath('//div[@class="page both"]/ul/a[text()="next page"]/@href').get()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
5.4 Middleware
The role of downloader middleware is to provide commonly used hooks for adding extra operations; the middleware code lives in middlewares.py. There are three main hook functions: process_request for requests, process_response for responses, and process_exception for exceptions.
process_request: actions performed before the request is passed to DOWNLOADER
process_response: actions performed before DOWNLOADER's response is returned to ENGINE
process_exception: actions performed when the download raises an exception
Here is a way to add a fake browser User-Agent to each request so the crawler is less likely to be blocked: override process_request.
import random

# in older Scrapy versions this class lived under scrapy.contrib.downloadermiddleware.useragent
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

agents = [
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
]


class RandomUserAgent(UserAgentMiddleware):
    def process_request(self, request, spider):
        # pick a random user agent for every outgoing request
        ua = random.choice(agents)
        request.headers.setdefault('User-Agent', ua)
Likewise, the downloader middleware must be enabled in settings.py; the number indicates its priority.
DOWNLOADER_MIDDLEWARES = {'sexy.middlewares.customUserAgent.RandomUserAgent': 20,}
5.5 useful settings.py configuration
Besides the pipeline and middleware configuration shown above, here are a few other commonly used settings.
ROBOTSTXT_OBEY = False: the crawler robots rule. If the site you want to crawl restricts you via robots.txt, it is best to set this to False
CONCURRENT_REQUESTS: the number of concurrent requests
DOWNLOAD_DELAY: the download delay; configure it sensibly to avoid putting too much pressure on the target site
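Put together, a minimal tweak of these options in settings.py might look like the following; the numbers are only illustrative starting points, not recommendations from this article.

# settings.py
ROBOTSTXT_OBEY = False      # do not obey the target site's robots.txt
CONCURRENT_REQUESTS = 8     # how many requests Scrapy issues in parallel
DOWNLOAD_DELAY = 0.5        # seconds to wait between downloads, to go easy on the site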
That concludes this study of how to quickly master the scrapy crawler framework. Hopefully it has cleared up your doubts; pairing the theory with practice is the best way to learn, so go and try it out.