2025-03-29 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article covers how to use Spider Middleware in Scrapy, the Python crawling framework. Many people are not familiar with it, so the main points are summarized below.
Preface
Most people live with an almost inexplicable contradiction.
They want to get into the habit of getting up early, but end up scrolling on the phone until 2:00 in the morning.
When they see a practical article, their first reaction is to add it to favorites and read it next time (the collecting never stops and the learning never starts; favoriting == learning).
They want to lose weight and get in shape, but give in late at night: "only when you are full do you have the strength to lose weight."
When they see a good course, they still tell themselves they will study it when they have time.
How to use Spider Middleware
Spider Middleware is a framework of hooks into Scrapy's Spider processing mechanism.
When the Downloader generates a Response, the Response is sent to the Spider. Before it reaches the Spider, it is first processed by the Spider Middleware. When the Spider generates Items and Requests, those are also processed by the Spider Middleware.
Spider Middleware has the following three functions:
Process a Response after the Downloader generates it and before it is sent to the Spider.
Process a Request after the Spider generates it and before it is sent to the Scheduler.
Process an Item after the Spider generates it and before it is sent to the Item Pipeline.
Instructions
Note that Scrapy already ships with a number of Spider Middleware, which are defined by the SPIDER_MIDDLEWARES_BASE setting.
The SPIDER_MIDDLEWARES_BASE variable is as follows:
{
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}
As with Downloader Middleware, custom Spider Middleware is added to the SPIDER_MIDDLEWARES setting, which is merged with the SPIDER_MIDDLEWARES_BASE defined by Scrapy. The merged entries are then sorted by their numeric priority values to produce an ordered list. The first middleware in the list is the one closest to the engine, and the last is the one closest to the Spider.
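The merge-and-sort step can be sketched in plain Python. This is only an illustration of the ordering rule, not Scrapy's actual implementation: the function name ordered_middlewares and the BASE and CUSTOM dicts are assumptions made for this example, and the custom entry reuses the middletest project from this article. It also assumes the documented Scrapy convention that assigning None to a middleware disables it.

```python
# Assumption: BASE stands in for Scrapy's SPIDER_MIDDLEWARES_BASE and CUSTOM
# for a project's SPIDER_MIDDLEWARES setting.
BASE = {
    "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
    "scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
    "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": 800,
    "scrapy.spidermiddlewares.depth.DepthMiddleware": 900,
}
CUSTOM = {
    "middletest.middlewares.MiddletestSpiderMiddleware": 543,
    # Assigning None disables a built-in middleware.
    "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": None,
}


def ordered_middlewares(base, custom):
    """Merge both settings, drop disabled entries, sort by priority value."""
    merged = {**base, **custom}  # project entries override the base ones
    enabled = {path: prio for path, prio in merged.items() if prio is not None}
    # Lowest number first: closest to the engine. Highest: closest to the Spider.
    return sorted(enabled, key=enabled.get)


order = ordered_middlewares(BASE, CUSTOM)
for path in order:
    print(path)
```

With these values, the custom middleware (543) sits between OffsiteMiddleware (500) and RefererMiddleware (700).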
Core methods
Scrapy's built-in Spider Middleware provides the basic functionality for Spiders. To extend that functionality, we only need to implement one or more of a few methods.
Each Spider Middleware is a class that defines one or more of the following methods. The core methods are:
process_spider_input(response, spider)
process_spider_output(response, result, spider)
process_spider_exception(response, exception, spider)
process_start_requests(start_requests, spider)
process_spider_input(response, spider)
When a Response passes through the Spider Middleware, this method is called to process it.
The method has two parameters:
response: the Response object being processed.
spider: the Spider object that the Response is intended for.
process_spider_input() should return None or raise an exception.
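A minimal sketch of a process_spider_input() implementation, assuming we want to reject responses with an empty body. The class name RequireBodyMiddleware and the choice of ValueError are hypothetical; a Spider Middleware can be a plain class, so nothing needs to be imported from Scrapy here.

```python
class RequireBodyMiddleware:
    """Hypothetical middleware that rejects responses with an empty body."""

    def process_spider_input(self, response, spider):
        if not response.body:
            # Raising an exception stops the remaining process_spider_input()
            # calls; Scrapy then calls the errback of the Request, if any.
            raise ValueError(f"empty body for {response.url}")
        # Returning None lets the Response continue toward the Spider.
        return None
```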
If process_spider_input() returns None, Scrapy continues processing the Response, calling the remaining Spider Middleware until the Response reaches the Spider.
If it raises an exception, Scrapy does not call the process_spider_input() method of any other Spider Middleware and calls the errback of the Request instead. The output of the errback is passed back through the middleware chain in the other direction and handled by process_spider_output(); if the errback itself raises an exception, process_spider_exception() is called to handle it.
process_spider_output(response, result, spider)
This method is called with the result the Spider returns after it has processed a Response.
The method has three parameters:
response: the Response object that generated this output.
result: an iterable of Request or Item objects, that is, the result returned by the Spider.
spider: the Spider object whose result is being processed.
process_spider_output() must return an iterable of Request or Item objects.
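As an illustration, here is a hedged sketch of a process_spider_output() implementation that filters what the Spider produced. It assumes, purely for this example, that items are plain dicts with a "title" field; the class name DropUntitledItemsMiddleware is hypothetical.

```python
class DropUntitledItemsMiddleware:
    """Hypothetical middleware that drops dict items without a title."""

    def process_spider_output(self, response, result, spider):
        for obj in result:
            # Assumption: items are plain dicts; anything else (for example
            # a Request) is passed through untouched.
            if isinstance(obj, dict) and not obj.get("title"):
                continue
            yield obj
```

Because the method is a generator, it returns an iterable as required and processes the Spider's output lazily.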
process_spider_exception(response, exception, spider)
This method is called when the Spider, or the process_spider_input() method of a Spider Middleware, raises an exception.
The method has three parameters:
response: the Response object being processed when the exception was raised.
exception: the Exception object that was raised.
spider: the Spider object that raised the exception.
process_spider_exception() must return either None or an iterable of Request or Item objects.
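The two return options can be sketched as follows. The class name SwallowParseErrorsMiddleware and the choice of which exception types to treat as recoverable are assumptions made only for this example.

```python
class SwallowParseErrorsMiddleware:
    """Hypothetical middleware that swallows lookup errors raised while parsing."""

    def process_spider_exception(self, response, exception, spider):
        if isinstance(exception, (KeyError, IndexError)):
            # Returning an iterable (here an empty one) marks the exception as
            # handled and hands processing to the process_spider_output() chain.
            return []
        # Returning None passes the exception on to the next middleware.
        return None
```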
If process_spider_exception() returns None, Scrapy continues handling the exception, calling the process_spider_exception() method of the other Spider Middleware until all of them have been called.
If it returns an iterable, the process_spider_output() methods of the other Spider Middleware are called, and no other process_spider_exception() is called.
process_start_requests(start_requests, spider)
This method is called with the start requests of the Spider as its argument. It works like process_spider_output(), except that there is no associated Response and it must return only Requests.
The method has two parameters:
start_requests: an iterable of Request objects, that is, the start requests.
spider: the Spider object to which the start requests belong.
It must return an iterable of Request objects.
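A minimal sketch of process_start_requests(), assuming a Scrapy version that still calls this hook and that we want to mark every start request. The class name TagStartRequestsMiddleware and the meta key "is_start_request" are hypothetical; the only thing the code relies on is that each Request exposes a meta dict.

```python
class TagStartRequestsMiddleware:
    """Hypothetical middleware that tags every start request via its meta dict."""

    def process_start_requests(self, start_requests, spider):
        for request in start_requests:
            # "is_start_request" is a made-up key, used only for illustration.
            request.meta["is_start_request"] = True
            yield request
```

A callback could later read response.meta.get("is_start_request") to tell start pages apart from followed links.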
Enabling Spider Middleware
Every time we create a new project, Scrapy generates a middlewares.py file containing a Spider Middleware class, here MiddletestSpiderMiddleware. The class name is derived from the project name, and the project I created is called Middletest, so a project with a different name gets a correspondingly different class name.
This class contains the four core methods described above.
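For orientation, here is a simplified pass-through version of such a class. It is a sketch rather than a copy of the generated file: the real template also includes a from_crawler() classmethod and a spider_opened() handler, which are left out so the example runs without Scrapy installed.

```python
class MiddletestSpiderMiddleware:
    """Simplified pass-through skeleton; every hook leaves its input unchanged."""

    def process_spider_input(self, response, spider):
        # None means: keep processing this Response.
        return None

    def process_spider_output(self, response, result, spider):
        # Hand every Request or Item from the Spider through unchanged.
        for obj in result:
            yield obj

    def process_spider_exception(self, response, exception, spider):
        # None means: let the other middleware handle the exception.
        return None

    def process_start_requests(self, start_requests, spider):
        # Hand every start request through unchanged.
        for request in start_requests:
            yield request
```

Customizing the middleware then comes down to replacing the body of whichever hook is needed.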
To enable this Spider Middleware, edit settings.py:
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'middletest.middlewares.MiddletestSpiderMiddleware': 543,
}
The generated settings.py already contains these lines commented out, so you only need to uncomment them.