2025-01-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >
Shulou(Shulou.com)06/02 Report--
Many people who are new to Scrapy are unsure how to analyze and use the Request class. This article summarizes its parameters and common usage, and should help you work through the problem.
Introduction
The Request class represents an HTTP request and is one of the most important classes for a crawler. A request is typically created in a Spider and executed by the Downloader. A subclass, FormRequest, inherits from it and is used for POST requests.
Common usage in a Spider (the URL must include a scheme, otherwise Scrapy raises a ValueError):
yield scrapy.Request(url='https://zarten.com')
Class attributes and methods:
url
method
headers
body
meta
copy()
replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback])
Request
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])
Parameter description:
url: the URL to request.
callback: the callback function that receives the response once the request completes. If it is not specified, it defaults to the spider's parse() method.
method: the HTTP method of the request. It defaults to GET and usually does not need to be specified. For a POST request, you can use FormRequest.
headers: the request headers. Headers can also be set in settings or in middlewares.
body: str type. The body of the request; it generally does not need to be set (both GET and POST requests can technically carry parameters in the body, but this is uncommon).
cookies: dict or list type. The cookies to send with the request. Dict form (name/value pairs):
cookies = {'name1': 'value1', 'name2': 'value2'}
List form:
cookies = [{'name': 'Zarten', 'value': 'my name is Zarten', 'domain': 'example.com', 'path': '/currency'}]
encoding: the encoding of the request. The default is 'utf-8'.
priority: int type. The priority of the request. The higher the number, the higher the priority. It can be negative. The default is 0.
dont_filter: defaults to False. If set to True, the request bypasses the duplicate filter, so the same request can be executed multiple times.
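To make the interaction between priority and dont_filter concrete, here is a minimal standard-library sketch of a scheduler. It is a simplified model, not Scrapy's actual implementation: ToyScheduler and its method names are hypothetical, duplicates are detected by URL only, and ties between equal priorities are broken in insertion order as an assumption.

```python
import heapq
import itertools


class ToyScheduler:
    """Simplified model: filter duplicate URLs, pop the highest priority first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def enqueue(self, url, priority=0, dont_filter=False):
        """Return True if the request was queued, False if it was filtered out."""
        if not dont_filter:
            if url in self._seen:
                return False  # duplicate request, dropped
            self._seen.add(url)
        # heapq is a min-heap, so negate priority to pop larger numbers first
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def next_url(self):
        """Pop the next URL to download, or None when the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None


scheduler = ToyScheduler()
queued = [
    scheduler.enqueue("http://example.com/a"),
    scheduler.enqueue("http://example.com/b", priority=10),
    scheduler.enqueue("http://example.com/a"),                    # filtered
    scheduler.enqueue("http://example.com/a", dont_filter=True),  # allowed
]
order = [scheduler.next_url() for _ in range(3)]
```

In this model the request with priority=10 is downloaded before the default-priority ones, and the repeated URL only gets through when dont_filter=True.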
errback: the callback function invoked when an error occurs while processing the request. Errors include 404 responses, timeouts, DNS errors, and so on. Its first parameter is a Twisted Failure instance:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError


class ToScrapeCSSSpider(scrapy.Spider):
    name = "toscrape-css"
    # start_urls = [
    #     'http://quotes.toscrape.com/',
    # ]
    start_urls = [
        "http://www.httpbin.org/",            # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",      # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",     # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.info(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:
        if failure.check(HttpError):
            # these exceptions come from the HttpError spider middleware;
            # you can get the non-200 response
            response = failure.value.response
            self.logger.info('HttpError error on %s', response.url)
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.info('DNSLookupError error on %s', request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.info('TimeoutError error on %s', request.url)
flags: list type. Rarely used; these are flags attached to the request, generally for logging.
meta: a dict for passing custom data from a Request to its Response. It can also be read and modified in middlewares.
yield scrapy.Request(url='https://zarten.com', meta={'name': 'Zarten'})
In the Response:
my_name = response.meta['name']
Scrapy also reserves a number of special meta keys, which are very useful:
proxy: sets the proxy for the request. It is generally set in middlewares.
You can set an http or https proxy:
request.meta['proxy'] = 'https://' + 'ip:port'
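Since the proxy key is usually set in a middleware, the following is a minimal sketch of what such a downloader middleware might look like. The class name FixedProxyMiddleware and the proxy address are hypothetical; only the process_request(request, spider) hook and the meta['proxy'] key come from Scrapy. A SimpleNamespace stands in for a real request so the snippet runs without Scrapy installed.

```python
from types import SimpleNamespace


class FixedProxyMiddleware:
    """Hypothetical downloader middleware: send every request through one proxy."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Keep a proxy that was already chosen for this request, if any
        request.meta.setdefault("proxy", self.proxy_url)
        return None  # returning None lets Scrapy continue handling the request


# Stand-ins for scrapy.Request objects; only the .meta dict matters here
plain_request = SimpleNamespace(meta={})
preset_request = SimpleNamespace(meta={"proxy": "http://10.0.0.1:3128"})

middleware = FixedProxyMiddleware("http://127.0.0.1:8080")
middleware.process_request(plain_request, spider=None)
middleware.process_request(preset_request, spider=None)
```

In a real project the class would also have to be enabled under DOWNLOADER_MIDDLEWARES in settings.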
download_timeout: the time (in seconds) to wait before the request times out. It is usually set globally through DOWNLOAD_TIMEOUT in settings, which defaults to 180 seconds (3 minutes).
max_retry_times: the maximum number of retries (not counting the first download). It is usually set globally through RETRY_TIMES in settings, which defaults to 2.
dont_redirect: when set to True, the request will not be redirected.
dont_retry: when set to True, requests that fail (for example, with an HTTP error or a timeout) will not be retried.
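The phrase "not counting the first download" is easy to misread, so here is a small sketch of how max_retry_times and dont_retry bound the number of download attempts. It is a simplified model rather than Scrapy's retry code; the function name attempts_allowed and its retry_times_setting parameter are hypothetical.

```python
def attempts_allowed(meta, retry_times_setting=2):
    """Total download attempts (first try plus retries) under this simplified model."""
    if meta.get("dont_retry", False):
        return 1  # only the first download, never retried
    # The per-request meta key takes precedence over the project-wide setting
    max_retries = meta.get("max_retry_times", retry_times_setting)
    return 1 + max_retries


default_attempts = attempts_allowed({})
custom_attempts = attempts_allowed({"max_retry_times": 5})
no_retry_attempts = attempts_allowed({"dont_retry": True, "max_retry_times": 5})
```

With the default RETRY_TIMES of 2, a failing request is therefore downloaded at most three times in total.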
handle_httpstatus_list: HTTP status codes in the 200-300 range count as successful responses; anything outside that range counts as a failure. By default Scrapy filters out failed responses, so they never reach the callback. You can, however, specify which error codes should be passed through:
yield scrapy.Request(url='https://httpbin.org/get/zarten', meta={'handle_httpstatus_list': [404]})
The 404 response is then handled in the parse function:
def parse(self, response):
    print('returned information is:', response.text)
handle_httpstatus_all: when set to True, the callback receives responses with any status code.
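The filtering rule described for handle_httpstatus_list and handle_httpstatus_all can be sketched as a small decision function. This is an illustrative model, not Scrapy's middleware code; the function name response_reaches_callback and the spider_allowed parameter (standing in for a spider-level list of allowed codes) are hypothetical.

```python
def response_reaches_callback(status, meta, spider_allowed=()):
    """Decide whether a response with this status would be handed to the callback."""
    if 200 <= status < 300:
        return True  # successful responses always pass
    if meta.get("handle_httpstatus_all", False):
        return True  # the request opted in to every status code
    # A per-request list overrides the spider-level list
    allowed = meta.get("handle_httpstatus_list", spider_allowed)
    return status in allowed


results = {
    "ok": response_reaches_callback(200, {}),
    "404 filtered": response_reaches_callback(404, {}),
    "404 allowed": response_reaches_callback(404, {"handle_httpstatus_list": [404]}),
    "500 not listed": response_reaches_callback(500, {"handle_httpstatus_list": [404]}),
    "500 allow all": response_reaches_callback(500, {"handle_httpstatus_all": True}),
}
```

A response that is filtered out here does not disappear silently in Scrapy: it is reported as an HttpError, which is what an errback can check for.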
dont_merge_cookies: Scrapy automatically stores the cookies returned by a response and sends them with its next requests. If you specify custom cookies and want to use only those, without merging in the stored ones, set this to True.
cookiejar: lets you track multiple cookie sessions in a single spider. It is not sticky, so it must be passed along with each request:
def start_requests(self):
    urls = [
        'http://quotes.toscrape.com/page/1',
        'http://quotes.toscrape.com/page/3',
        'http://quotes.toscrape.com/page/5',
    ]
    for i, url in enumerate(urls):
        yield scrapy.Request(url=url, meta={'cookiejar': i})

def parse(self, response):
    next_page_url = response.css("li.next > a::attr(href)").extract_first()
    if next_page_url is not None:
        # pass the same cookiejar key along, since it is not sticky
        yield scrapy.Request(response.urljoin(next_page_url),
                             meta={'cookiejar': response.meta['cookiejar']},
                             callback=self.parse_next)

def parse_next(self, response):
    print('cookiejar:', response.meta['cookiejar'])
dont_cache: when set to True, the response will not be cached.
redirect_urls: its exact role was not clear to the author at the time of writing. In the Scrapy documentation, this key holds the URLs a request went through while being redirected.
bindaddress: the outgoing IP address to bind to when making the request.
dont_obey_robotstxt: when set to True, the request ignores the robots.txt protocol. This behavior is usually configured in settings.
download_maxsize: the maximum response size (in bytes) that the downloader will download. It is usually set globally through DOWNLOAD_MAXSIZE in settings, which defaults to 1073741824 (1024 MB = 1 GB). To disable the size limit, set it to 0.
download_latency: a read-only value giving the time (in seconds) taken to fetch the response.
def start_requests(self):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    yield scrapy.Request(url='https://www.amazon.com', headers=headers)

def parse(self, response):
    print('response time is:', response.meta['download_latency'])
download_fail_on_dataloss: rarely used; see the Scrapy documentation for details.
referrer_policy: sets the Referrer Policy for the request.
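Several of the meta keys above have project-wide counterparts in settings. As a rough sketch, a settings.py fragment using the values quoted in this article might look as follows; the ROBOTSTXT_OBEY value is an assumption chosen for illustration, not something the article specifies.

```python
# settings.py (sketch): project-wide values for some of the meta keys above

DOWNLOAD_TIMEOUT = 180        # seconds; per-request override: meta['download_timeout']
RETRY_TIMES = 2               # retries after the first download; override: meta['max_retry_times']
DOWNLOAD_MAXSIZE = 1024 ** 3  # bytes (1073741824 = 1024 MB); 0 disables the limit
ROBOTSTXT_OBEY = True         # assumed value; per-request opt-out: meta['dont_obey_robotstxt']
```

The settings apply to every request in the project, while the meta keys override them for a single request.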
FormRequest
The FormRequest class is a subclass of Request and is used for POST requests.
It adds one parameter, formdata. The other parameters are the same as for Request; see the description above.
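To give an idea of what formdata turns into on the wire, the sketch below builds a form-encoded POST body with the standard library only. It illustrates the encoding of an ordinary HTML form submission and is not Scrapy's own code; the variable names are hypothetical.

```python
from urllib.parse import urlencode

# The same key/value pairs one would pass as formdata
formdata = {'name': 'Zarten', 'age': '27'}

# A form POST carries the pairs URL-encoded in the request body
body = urlencode(formdata).encode('utf-8')
headers = {'Content-Type': 'application/x-www-form-urlencoded'}

print(body)  # b'name=Zarten&age=27'
```

Note that the values are strings ('27', not 27), as in the article's own example.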
General usage is:
yield scrapy.FormRequest(url="http://www.example.com/post/action", formdata={'name': 'Zarten', 'age': '27'}, callback=self.after_post)
After reading the above, you should have a working grasp of how to analyze and use Request. Thank you for reading!