
How to use Scrapy Request and Response objects in Python


This article introduces how to use the Scrapy Request and Response objects in Python. The explanations are detailed, easy to follow, and of practical reference value; after reading, you should have a working understanding of both objects. Let's take a look.

Request object

In Scrapy, the Request object represents a request, that is, the data sent to the server. The constructor prototype of this object is as follows:

def __init__(self, url, callback=None, method='GET', headers=None, body=None,
             cookies=None, meta=None, encoding='utf-8', priority=0,
             dont_filter=False, errback=None, flags=None, cb_kwargs=None)

Only url is required. The parameters are described below; a brief usage sketch follows the list.

callback: page-parsing function. When the Request gets its Response, this function is called to parse it; the default is the spider's self.parse method.

method: HTTP method of the request, GET by default. POST requests can also be sent with Request; the FormRequest class is a subclass of Request designed for that.

headers: request headers, dict type.

body: request body, which must be of bytes or str type.

cookies: cookies to send, dict type.

meta: metadata dictionary, dict type, used to pass information to other components.

encoding: encoding of the url and body parameters; note that this is not the encoding of the response data.

priority: priority of the request, 0 by default; the higher the value, the higher the priority.

dont_filter: False by default, meaning duplicate requests to the same address are filtered out; set it to True to allow the same address to be requested repeatedly.

errback: callback function invoked when the request fails with an exception.
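As a quick illustration, here is a minimal sketch of how a Request is typically built inside a spider callback (the URLs, spider name, callback name and meta key are all made up for the example):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']          # hypothetical start page

    def parse(self, response):
        # follow a detail page, passing data along in meta and
        # routing the response to a different callback
        yield scrapy.Request(
            url='https://example.com/detail/1',   # hypothetical URL
            callback=self.parse_detail,
            headers={'Referer': response.url},
            meta={'category': 'demo'},            # read back later via response.meta
            priority=10,
            dont_filter=False,
        )
        # for POST form submissions, the FormRequest subclass is more convenient:
        # yield scrapy.FormRequest('https://example.com/login',
        #                          formdata={'user': 'u', 'pwd': 'p'},
        #                          callback=self.after_login)

    def parse_detail(self, response):
        yield {'category': response.meta['category'], 'url': response.url}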

Response object

In Scrapy, the Response object represents the response to a request, that is, the data the server returns to the crawler. The constructor prototype is as follows:

def __init__(self, url, status=200, headers=None, body=b"", flags=None,
             request=None, certificate=None, ip_address=None, protocol=None)

As with Request, only url is required, but you will rarely create a Response instance by hand.

The Response class derives a subclass TextResponse, and then TextResponse derives HtmlResponse and XmlResponse.

Response exposes the following attributes and methods (a short usage sketch follows the two lists):

Attribute list:

url: response address.

status: response status code.

headers: response headers.

encoding: encoding of the response body.

body: response body, bytes type.

text: the response body decoded to text using the encoding above.

request: the Request object that produced this response.

meta: metadata dictionary, dict type; the parameters passed along by the request (a shortcut for response.request.meta).

selector: a Selector object built from the response.

Method list:

xpath(): XPath selector; runs an XPath query against the response body.

css(): CSS selector; runs a CSS query against the response body.

urljoin(): joins a (possibly relative) URL with the response URL; a wrapper around urljoin() from the urllib.parse module.

json(): deserializes the JSON response body into a Python object.
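A short sketch of how these attributes and methods are commonly used inside a parse callback (the selectors and URL below are illustrative assumptions, not taken from any particular site):

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://example.com']              # hypothetical

    def parse(self, response):
        print(response.status, response.encoding)      # e.g. 200, 'utf-8'

        # css() and xpath() query the response body directly
        title = response.css('title::text').get()
        hrefs = response.xpath('//a/@href').getall()

        # urljoin() resolves relative links against response.url
        absolute = [response.urljoin(href) for href in hrefs]

        # for JSON endpoints, response.json() parses the body into Python objects
        # data = response.json()

        yield {'title': title, 'links': absolute}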

The relevant source code for the Request and Response classes can be found in the scrapy/http directory.

ItemPipeline

In Scrapy, the item pipeline is mainly responsible for processing data. In practice, keep in mind that each ItemPipeline should handle only one data-processing task; of course, you can register multiple ItemPipelines in a project.

Usage scenarios of ItemPipeline:

Data cleaning, such as deduplication, removal of abnormal data

Data persistence, such as writing items to MongoDB, MySQL or Redis databases.

When writing an ItemPipeline class, you do not need to inherit from any particular base class; you only need to implement methods with fixed names. As mentioned repeatedly in previous posts, a custom ItemPipeline class implements the process_item(), open_spider() and close_spider() methods, of which only process_item() is mandatory.

process_item() should return an Item or a dictionary; alternatively it can raise a DropItem exception, in which case the item is discarded and is not processed by the remaining ItemPipelines.

How to implement data filtering

If you want to filter data in an ItemPipeline, you can keep a set of values already seen: when an incoming item's value is already in the set, raise DropItem; otherwise add it to the set and pass the item on. A sketch follows.
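A minimal sketch of such a de-duplicating pipeline (the 'url' field used as the key and the class name are assumptions for illustration):

from scrapy.exceptions import DropItem

class DedupPipeline:
    def open_spider(self, spider):
        # called once when the spider starts; set up resources here
        self.seen = set()

    def process_item(self, item, spider):
        url = item.get('url')            # hypothetical field used as the dedup key
        if url in self.seen:
            raise DropItem(f'duplicate item: {url}')
        self.seen.add(url)
        return item                      # hand the item on to the next pipeline

    def close_spider(self, spider):
        # called once when the spider finishes; release resources here
        pass

Remember that a pipeline only takes effect after it is enabled in the ITEM_PIPELINES setting of settings.py.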

LinkExtractor: extracting links

Crawlers written with Scrapy can extract large numbers of links more easily with LinkExtractor. Import it with from scrapy.linkextractors import LinkExtractor; the constructor of this class is as follows:

def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(),
             restrict_xpaths=(), tags=('a', 'area'), attrs=('href',),
             canonicalize=False, unique=True, process_value=None,
             deny_extensions=None, restrict_css=(), strip=True,
             restrict_text=None)

The parameters are described as follows:

allow: a regular expression or list of regular expressions; only URLs matching them are extracted. By default all links are extracted.

deny: the opposite of allow.

allow_domains: string or list; restricts extraction to the given domains.

deny_domains: the opposite of allow_domains.

restrict_xpaths: extract only within the regions matched by these XPath expressions.

restrict_css: extract only within the regions matched by these CSS selectors.

tags: extract links from the specified tags.

attrs: extract links from the specified attributes.

process_value: a function; every link value that is matched is passed to this function for processing (a brief construction sketch follows this list).
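As a brief construction sketch (the regular expression, domain and CSS selector here are made-up assumptions), a restricted extractor might look like this:

from scrapy.linkextractors import LinkExtractor

def clean_link(value):
    # process_value receives each matched link value and can rewrite it
    return value.split('#')[0]            # e.g. strip URL fragments

link_extractor = LinkExtractor(
    allow=r'/articles/\d+',               # hypothetical pattern for detail pages
    deny_domains=['ads.example.com'],     # hypothetical domain to skip
    restrict_css='div.article-list',      # assumed region to search within
    process_value=clean_link,
)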

The following code extracts the links from a Response object by calling the extract_links() method:

def parse(self, response):
    link = LinkExtractor()
    all_links = link.extract_links(response)
    print(all_links)

Create a LinkExtractor object;

Describe the extraction rules via the constructor parameters;

Call the extract_links() method of the LinkExtractor object, passing in a Response object; it returns a list of Link objects;

Call .url or .text on any element of the list to get the link address and the link text.

Coding the crawler

The target site this time is the industry-report section of taosj.com.

The complete code is shown below; it uses LinkExtractor to extract the hyperlinks on each page.

import scrapy
from tao.items import TaoItem
from scrapy.linkextractors import LinkExtractor


class TaoDataSpider(scrapy.Spider):
    name = 'tao_data'
    allowed_domains = ['taosj.com']
    start_urls = [f'https://www.taosj.com/articles?pageNo={page}' for page in range(1, 124)]

    def parse(self, response):
        link_extractor = LinkExtractor(allow=r'www\.taosj\.com/articles/\d+',
                                       restrict_css='a.report-page-list-title')
        links = link_extractor.extract_links(response)
        for link in links:
            item = {"url": link.url, "text": link.text}
            yield item

This concludes the article on "How to use Scrapy Request and Response objects in Python". Thank you for reading! I hope you now have a solid grasp of the topic; if you want to learn more, you are welcome to follow the industry information channel.
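Assuming the Scrapy project is set up in the usual way, the spider above would then be started from the project directory with:

scrapy crawl tao_data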
