This article introduces how to use the Request and Response objects in python scrapy. The content is detailed and easy to understand, and the operations are simple and quick, so it should be a useful reference. I believe you will gain something from reading it; let's take a look.
Request object
In scrapy, the Request object represents the request, that is, sending data to the server. The constructor prototype of this object is as follows:
def __init__(self, url, callback=None, method='GET', headers=None, body=None,
             cookies=None, meta=None, encoding='utf-8', priority=0,
             dont_filter=False, errback=None, flags=None, cb_kwargs=None)
Only url is required. The details are as follows:
callback: page parsing function. When the Request gets its Response, this function is called; it defaults to the spider's self.parse method.
method: HTTP method. Defaults to GET, and you can also send POST requests with Request; the FormRequest class is a subclass of Request.
headers: request headers, dict type.
body: the body of the request, which must be of bytes or str type.
cookies: cookies, dict type.
meta: metadata dictionary, dict type, which can pass information to other components.
encoding: the encoding of the url and body parameters; note that this is not the encoding of the response data.
priority: the priority of the request. Defaults to 0; the higher the value, the higher the priority.
dont_filter: defaults to False; set it to True to allow the same address to be requested repeatedly (duplicate filtering is skipped).
errback: the callback function invoked when the request raises an exception.
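A minimal sketch of how these parameters are typically combined when building a Request inside a spider (the URL, spider name, and parsing logic below are illustrative placeholders, not part of the original article):

import scrapy

class BookSpider(scrapy.Spider):
    name = "book_demo"  # hypothetical spider name

    def start_requests(self):
        # url and headers here are illustrative placeholders
        yield scrapy.Request(
            url="https://example.com/books?page=1",
            callback=self.parse_list,           # parsing function called with the Response
            method="GET",
            headers={"User-Agent": "Mozilla/5.0"},
            meta={"page": 1},                   # passed along to the Response
            priority=1,
            dont_filter=False,
        )

    def parse_list(self, response):
        # the meta dict set on the Request is available again here
        self.logger.info("parsed page %s", response.meta["page"])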
Response object
In scrapy, the Response object represents the response to a request, that is, the data returned by the server to the crawler. The constructor prototype is as follows:
def __init__(self, url, status=200, headers=None, body=b"", flags=None,
             request=None, certificate=None, ip_address=None, protocol=None)
As with Request, only url is required here, but you rarely need to create a Response instance manually.
The Response class derives a subclass TextResponse, and then TextResponse derives HtmlResponse and XmlResponse.
Response includes the following properties and methods:
Attribute list:
url: response address
status: response status code
headers: response headers
encoding: encoding of the response body
body: response body, bytes type
text: the response body as text, i.e. body decoded using encoding
request: the Request object that produced this response
meta: metadata dictionary, dict type; the parameters passed by the corresponding request
selector: a Selector object for the response body.
Method list:
xpath(): XPath selector
css(): CSS selector
urljoin(): joins a (possibly relative) URL with response.url, using urljoin() from the urllib.parse module
json(): deserializes the JSON response body into a Python object
The relevant source code for the Request and Response classes can be viewed in the scrapy/http directory.
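As a short sketch of these attributes and methods inside a parse callback (the site and selectors below refer to the public practice site quotes.toscrape.com and are used here purely as an assumed example):

import scrapy

class QuoteSpider(scrapy.Spider):
    name = "quote_demo"  # hypothetical spider name
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # response is an HtmlResponse here, so css()/xpath() and text are available
        self.logger.info("status=%s encoding=%s", response.status, response.encoding)
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.xpath("span/small/text()").get(),
            }
        # urljoin() resolves the relative href against response.url
        next_href = response.css("li.next a::attr(href)").get()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)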
ItemPipeline
The item pipeline in scrapy is mainly used to process data. In actual development, keep in mind that each ItemPipeline should be responsible for only one data-processing task; you can of course create multiple ItemPipelines in a scrapy project.
Usage scenarios of ItemPipeline:
Data cleaning, such as deduplication, removal of abnormal data
Saving data, such as writing it to MongoDB, MySQL, or Redis databases.
When writing an ItemPipeline class, you do not need to inherit from a specific base class; you only need to implement methods with fixed names. As mentioned repeatedly in previous posts, a custom ItemPipeline class implements the process_item(), open_spider(), and close_spider() methods, of which process_item() must be implemented.
process_item() should return an Item or a dictionary; it can also raise a DropItem exception, in which case the item is discarded and is not processed by the subsequent ItemPipelines.
Logical implementation of filtering data
If you want to filter data in an ItemPipeline, you can use a set: if the data is found to already exist in the set, raise DropItem, as in the sketch below.
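A minimal sketch of such a pipeline, assuming each item carries a "url" field to deduplicate on (the class name and field are hypothetical):

from scrapy.exceptions import DropItem

class DedupPipeline:
    """Drops items whose 'url' has already been seen during the crawl."""

    def open_spider(self, spider):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url")
        if url in self.seen_urls:
            # raising DropItem discards the item; later pipelines never see it
            raise DropItem(f"duplicate item found: {url}")
        self.seen_urls.add(url)
        return item

    def close_spider(self, spider):
        spider.logger.info("kept %d unique items", len(self.seen_urls))

Remember that a pipeline only runs after it has been enabled in the project's ITEM_PIPELINES setting.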
Extracting links with LinkExtractor
When a scrapy crawler needs to extract a large number of links, LinkExtractor makes the job easier. Import it with from scrapy.linkextractors import LinkExtractor; the constructor of this class is as follows:
def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(),
             restrict_xpaths=(), tags=('a', 'area'), attrs=('href',),
             canonicalize=False, unique=True, process_value=None,
             deny_extensions=None, restrict_css=(), strip=True, restrict_text=None)
The parameters are described as follows:
allow: a regular expression or a list of regular expressions; only URLs matching them are extracted. Everything is extracted by default.
deny: the opposite of allow.
allow_domains: string or list, restricting the allowed domains.
deny_domains: the opposite of the above.
restrict_xpaths: extract only from regions matching the given XPath expressions.
restrict_css: extract only from regions matching the given CSS selectors.
tags: extract links from the specified tags.
attrs: extract links from the specified attributes.
process_value: a function; when this parameter is passed, LinkExtractor hands every value it matches to the function for processing (see the sketch after this list).
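A small sketch of how these constructor parameters combine (the patterns, domain, and selector below are made-up examples):

from scrapy.linkextractors import LinkExtractor

def strip_query(value):
    # hypothetical process_value hook: drop the query string from each matched link
    return value.split("?")[0]

link_extractor = LinkExtractor(
    allow=r"/articles/\d+",           # keep only URLs matching this regular expression
    deny=r"/articles/\d+/comments",   # ...but skip comment pages
    allow_domains=["example.com"],    # domain restriction
    restrict_css="div.article-list",  # only look inside this region of the page
    tags=("a", "area"),               # default tags, shown for completeness
    attrs=("href",),                  # default attributes, shown for completeness
    process_value=strip_query,
)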
The following code extracts links from a Response object using the extract_links() method.
def parse(self, response):
    link = LinkExtractor()
    all_links = link.extract_links(response)
    print(all_links)
Create a LinkExtractor object.
Describe the extraction rules with the constructor parameters.
Call the extract_links() method of the LinkExtractor object, passing in a Response object; it returns a list of Link objects.
On any element of the list, access .url or .text to get the link URL and the link text.
Crawler coding time
The target site this time is the industry report listing on taosj.com.
The complete code is shown below; LinkExtractor is used to extract the page hyperlinks.
import scrapy
from tao.items import TaoItem
from scrapy.linkextractors import LinkExtractor


class TaoDataSpider(scrapy.Spider):
    name = 'tao_data'
    allowed_domains = ['taosj.com']
    start_urls = [f'https://www.taosj.com/articles?pageNo={page}' for page in range(1, 124)]

    def parse(self, response):
        link_extractor = LinkExtractor(allow=r'www\.taosj\.com/articles/\d+',
                                       restrict_css='a.report-page-list-title')
        links = link_extractor.extract_links(response)
        for link in links:
            item = {"url": link.url, "text": link.text}
            yield item

This concludes the article on how to use the Request and Response objects in python scrapy. Thank you for reading! I believe you now have some understanding of how to use them.