This article explains in detail the usage of scrapy.Spider in the Python Scrapy framework. The content is shared here as a reference, and I hope you will have a good understanding of the relevant knowledge after reading it.
The Spider class defines how to crawl a web site (or a group of sites). This includes the crawling actions (for example, whether to follow links) and how to extract structured data (items) from the page content. In other words, the Spider is where you define the crawling actions and analyze the pages.
For a spider, the crawl loop goes roughly like this:
The spider initializes Requests with the initial URLs and sets a callback function. When a request finishes downloading, a Response is generated and passed to the callback function as an argument.
The initial requests are obtained by calling start_requests(), which by default reads the URLs in start_urls and generates a Request for each one, with parse as the callback function.
Inside the callback function, the returned content is analyzed, and Item objects, Request objects, or an iterable container of both are returned. Any returned Request objects are then processed by Scrapy, which downloads the corresponding content and calls the configured callback function (which may be the same function).
Within the callback function, you can use Selectors (or BeautifulSoup, lxml, or any parser you prefer) to analyze the page content and generate items from the extracted data.
Finally, the items returned by the spider are saved to a database (handled by an Item Pipeline) or written to a file using Feed exports.
Although this loop applies (more or less) to any type of spider, Scrapy provides several default spiders for different needs. Those spiders will be discussed later.
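As a minimal sketch of this loop (the spider name, the example site, and the CSS selectors below are illustrative assumptions, not something prescribed by Scrapy), a spider only needs start_urls and a parse callback that yields items:

import scrapy

class LoopSketchSpider(scrapy.Spider):
    # a minimal spider illustrating the crawl loop described above
    name = 'loop_sketch'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # the callback analyzes the downloaded response and yields items
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get(),
            }

Each yielded dict is an item that an Item Pipeline or a Feed export can then save.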
Spider
scrapy.Spider is the simplest spider. Every other spider must inherit from this class (including the spiders that come with Scrapy and the spiders you write yourself). It simply requests the given start_urls / start_requests and calls the spider's parse method for each of the resulting responses.
name
A string that defines the name of the spider. The name is how Scrapy locates (and instantiates) the spider, so it must be unique. However, nothing prevents you from creating multiple instances of the same spider. name is the most important spider attribute, and it is required.
If the spider crawls a single site (a single domain), a common practice is to name the spider after the site (with or without the suffix). For example, a spider that crawls mywebsite.com is usually named mywebsite.
allowed_domains
Optional. A list containing the domains that the spider is allowed to crawl. When OffsiteMiddleware is enabled, URLs whose domain is not in the list will not be followed.
start_urls
A list of URLs. When no particular URLs are specified, the spider starts crawling from this list, so the first pages fetched will be those in the list. Subsequent URLs are extracted from the fetched data.
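A brief sketch of how these attributes are typically declared together (the class name and URLs below are assumptions based on the mywebsite example above); with allowed_domains set, requests whose domain is not in the list are filtered out by the OffsiteMiddleware:

import scrapy

class MywebsiteSpider(scrapy.Spider):
    name = 'mywebsite'                          # unique, required spider name
    allowed_domains = ['mywebsite.com']         # offsite links will not be followed
    start_urls = ['http://www.mywebsite.com/']  # crawling starts from these URLs

    def parse(self, response):
        self.logger.info('Crawled %s', response.url)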
start_requests()
This method must return an iterable object containing the first Requests that the spider will use to crawl.
It is called when the spider starts crawling and no particular URLs have been specified. When URLs are specified, make_requests_from_url() is called instead to create the Request objects. This method is called only once by Scrapy, so it is safe to implement it as a generator.
The default implementation generates one Request for each URL in start_urls.
If you want to modify the Requests used to start crawling a site, you can override this method. For example, if you need to log in to a website with a POST request at startup, you can write:
def start_requests(self):
    return [scrapy.FormRequest("http://www.example.com/login",
                               formdata={'user': 'john', 'pass': 'secret'},
                               callback=self.logged_in)]

def logged_in(self, response):
    # here you would extract links to follow and return Requests for
    # each of them, with another callback
    pass

parse(response)
This is Scrapy's default method for handling downloaded responses whose Requests do not specify a callback.
parse is responsible for processing the response and returning scraped data and/or follow-up URLs. The same requirements apply to any other Request callback in the spider.
This method, like any other Request callback, must return an iterable object containing Request and/or Item objects.
Parameter: response - the response to parse.
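A hedged sketch of a parse() callback that does both things at once: it yields items from the current page and yields follow-up Requests handled by another callback. The site, the selectors, and the parse_author callback name are assumptions in the style of the quotes.toscrape.com example used later in this article:

import scrapy

class ParseSketchSpider(scrapy.Spider):
    name = 'parse_sketch'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yield items extracted from the current page
        for quote in response.css('.quote'):
            yield {'text': quote.css('.text::text').get()}
        # also yield follow-up Requests, each with another callback
        for href in response.css('.author + a::attr(href)').getall():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_author)

    def parse_author(self, response):
        yield {'author': response.css('h3.author-title::text').get()}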
closed(reason)
This method is called when the spider is closed.
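As a small illustrative sketch (the spider name below is an assumption), closed() can be used, for example, to log why the crawl ended:

import scrapy

class ClosingExampleSpider(scrapy.Spider):
    name = 'closing_example'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        yield {'title': response.css('title::text').get()}

    def closed(self, reason):
        # reason is a string such as 'finished', 'cancelled' or 'shutdown'
        self.logger.info('Spider closed: %s', reason)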
Startup methods
start_urls
start_urls is a list.
start_requests()
To override the start_urls behaviour, use start_requests() and send the requests yourself with scrapy.Request():
def start_requests(self):
    """Override the start_urls rules"""
    yield scrapy.Request(url='http://quotes.toscrape.com/page/1/', callback=self.parse)

scrapy.Request
scrapy.Request is a request object; a callback function is specified when it is created.
Saving data
You can use the -o option to save the data in a common format (determined by the file suffix); see the example after the list below.
The following formats are supported:
Json
Jsonlines
Jl
Csv
Xml
Marshal
Pickle
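For example (assuming a spider named quotes2, like the one in the case below), the crawl can be exported from the command line, with the format chosen by the file suffix:

scrapy crawl quotes2 -o quotes.json
scrapy crawl quotes2 -o quotes.csv

In recent Scrapy versions, -o appends to an existing output file, while -O (capital) overwrites it.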
Case: Spider example
Let's look at an example:
# -*- coding: utf-8 -*-
import scrapy

class Quotes2Spider(scrapy.Spider):
    name = 'quotes2'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/2/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            text = quote.css('.text::text').extract_first()
            auth = quote.css('.author::text').extract_first()
            tages = quote.css('.tags a::text').extract()
            yield dict(text=text, auth=auth, tages=tages)

URL splicing

import urllib.parse
urllib.parse.urljoin('http://quotes.toscrape.com/', '/page/2/')
# Out[6]: 'http://quotes.toscrape.com/page/2/'
urllib.parse.urljoin('http://quotes.toscrape.com/page/2/', '/page/3/')
# Out[7]: 'http://quotes.toscrape.com/page/3/'

This concludes the explanation of the usage of scrapy.Spider in the Python Scrapy framework. I hope the above content is of some help to you and helps you learn more. If you think the article is good, feel free to share it so more people can see it.