

How to Use Scrapy Middleware

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

Many inexperienced developers are at a loss when it comes to Scrapy middleware. This article summarizes what middleware is and how to implement your own; I hope it helps you solve the problem.

01 What is middleware?

Middleware is widely used, but trying to understand it straight from the definition can be confusing, so I will use distributed systems as an example. In a previous article I mentioned that back-end service architectures are generally moving toward distribution; in fact, distributed systems themselves can be considered a form of middleware.

What is a distributed system? It is a group of computer nodes that communicate over a network and coordinate their work to accomplish a common task. As the figure below shows, a distributed system is software that sits between the operating system and user applications.

With that picture in mind, we can generalize. Middleware is a broad category of basic, reusable software that, as its name suggests, sits between the operating system and the user's application software.

02 The role of middleware in Scrapy framework

Let's start with a diagram to understand the Scrapy architecture.

We can see that the Scrapy framework has two middleware components.

One is the Downloader middleware, the hub between the Engine and the Downloader. It processes the requests the Engine passes to the Downloader and the responses the Downloader passes back to the Engine.

The other is the Spider middleware, the bridge between the Engine and the Spider. It processes the Spider's input (responses) and its output (items and requests) on their way to the Engine.
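To make the Spider-middleware side concrete, here is a minimal, duck-typed sketch of a spider middleware that filters a spider's output before it reaches the Engine. The class name and the `title` field are illustrative assumptions, not part of any real project.

```python
class DropEmptyItemsMiddleware:
    """Sketch of a spider middleware: filters the items a spider yields
    before they travel on to the Engine. Duck-typed for illustration;
    the 'title' field name is an assumption."""

    def process_spider_output(self, response, result, spider):
        for item_or_request in result:
            # Drop dict items whose 'title' is missing or empty;
            # pass everything else (including Requests) through unchanged.
            if isinstance(item_or_request, dict) and not item_or_request.get("title"):
                continue
            yield item_or_request
```

In a real project the same hook receives scraped items and follow-up Requests mixed together, which is why the sketch only filters dicts and forwards everything else.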

03 Implement your own middleware

In the Scrapy framework, both the Downloader middleware and the Spider middleware support custom extensions. In practice, we most often need to customize the Downloader middleware: for example, a User-Agent middleware that adds a randomly selected User-Agent header to each HTTP request, or a proxy middleware that sets a randomly chosen proxy address on each HTTP request.
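As a concrete illustration of the User-Agent case just mentioned, here is a minimal sketch of such a downloader middleware. The `USER_AGENTS` pool and the duck-typed request object are assumptions for illustration; in a real project the class would live in `middlewares.py` and receive real `scrapy.http.Request` objects.

```python
import random

# Assumption: you maintain your own pool of User-Agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]

class RandomUserAgentMiddleware(object):
    """Downloader middleware sketch: sets a random User-Agent on every request."""

    def process_request(self, request, spider):
        # Overwrite (or set) the User-Agent header before the request is sent.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # let the remaining middleware / download handler continue
```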

Next, let's learn how to implement Scrapy's Downloader middleware.

1) Defining the middleware

In the Scrapy project, find the middlewares.py file and create your own middleware class in it. For example, I create a proxy middleware:

class ProxyMiddleware(object):

Each downloader middleware can define three methods, namely:

process_request(request, spider)

This method is called for every request that passes through the downloader middleware. It must either return None, return a Response object, return a Request object, or raise IgnoreRequest. Each outcome has a different effect:

None: Scrapy continues processing the request, executing the process_request() methods of the other installed middleware in turn, until the appropriate download handler is called and the request is executed (its response downloaded).

Request object: Scrapy stops calling the remaining process_request() methods and reschedules the returned request, which replaces the original one.

Response object: Scrapy skips the download entirely and treats the returned object as the result; it then flows back through the process_response() chain.

raise IgnoreRequest: the process_exception() methods of the installed middleware are called to handle the exception.
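The outcomes above can be sketched with a toy cache middleware. Everything here is duck-typed for illustration (the cache dict and the stored response values are assumptions); a real version would return `scrapy.http.Response` instances.

```python
class CachedResponseMiddleware(object):
    """Sketch of process_request's return values: a cached response
    short-circuits the download; None lets processing continue."""

    def __init__(self, cache=None):
        # Assumption: cache maps URLs to previously downloaded responses.
        self.cache = cache or {}

    def process_request(self, request, spider):
        cached = self.cache.get(request.url)
        if cached is not None:
            return cached  # Response: skip the download entirely
        return None        # None: hand the request to the next middleware
```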

process_response(request, response, spider)

process_response() likewise has three possible outcomes: return a Response object, return a Request object, or raise an IgnoreRequest exception.

If it returns a Response, the response continues to be processed by the process_response() methods of the other middleware in the chain.

If it returns a Request object, the middleware chain stops and the returned request is rescheduled for download.

If it raises an IgnoreRequest exception, the errback function of the request is called; if nothing handles it, the exception is ignored.
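The Request-returning branch is the basis of retry logic. Here is a minimal duck-typed sketch; the 503 status code and the `retry_times` meta key are illustrative choices, not Scrapy requirements.

```python
class RetryOn503Middleware(object):
    """Sketch of process_response: return the Request to reschedule it,
    or the Response to pass it down the middleware chain."""

    MAX_RETRIES = 3  # assumption: give up after three retries

    def process_response(self, request, response, spider):
        if response.status == 503:
            retries = request.meta.get("retry_times", 0)
            if retries < self.MAX_RETRIES:
                request.meta["retry_times"] = retries + 1
                return request   # Request: chain stops, request is rescheduled
        return response          # Response: next middleware's process_response runs
```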

process_exception(request, exception, spider)

Scrapy calls process_exception() when a download handler or a downloader middleware's process_request() raises an exception (including an IgnoreRequest exception).

process_exception() also has three possible return values: None, a Response object, or a Request object.

If it returns None, Scrapy continues handling the exception, calling the process_exception() methods of the other installed middleware until all of them have been called.

If it returns a Response object, the process_response() chain of the installed middleware is started, and Scrapy will not call any other process_exception() methods.

If it returns a Request object, the returned request is rescheduled for download. This stops the remaining process_exception() methods from executing, just as returning a Response does. This is the place to retry a failed HTTP request: for example, if frequent crawling gets our IP blocked by a site, we can set a new proxy here and continue visiting.
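The proxy-on-failure idea above can be sketched as a process_exception() implementation. The proxy list is a hypothetical input; a real version would draw from your own proxy pool.

```python
import random

class ProxyOnErrorMiddleware(object):
    """Sketch of process_exception: attach a fresh proxy and return the
    Request so Scrapy reschedules the download through it."""

    def __init__(self, proxies):
        self.proxies = proxies  # assumption: list of proxy URLs

    def process_exception(self, request, exception, spider):
        # Returning a Request stops further process_exception() calls
        # and resubmits the request, now routed through a new proxy.
        request.meta["proxy"] = random.choice(self.proxies)
        return request
```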

We don't have to implement all three methods; we only implement what the requirement calls for. For example, to add a proxy address to every HTTP request, implementing process_request() is enough:

import logging

class ProxyMiddleware(object):

    # Override process_request
    def process_request(self, request, spider):
        # Read a proxy address randomly from the database
        # (proxy_pool is the author's own proxy-pool helper, not shown here)
        proxy_address = proxy_pool.random_select_proxy()
        logging.debug("===== ProxyMiddleware get a random_proxy:【 {} 】 =====".format(proxy_address))
        request.meta['proxy'] = proxy_address
        return None

2) Enabling the middleware in settings.py

We have implemented the middleware; the final step is to enable it by adding it to the settings.py file. If you are overriding a built-in middleware, you also need to set that middleware's value to None. The proxy middleware defined above operates on HTTP requests, so it overrides the built-in HttpProxyMiddleware.

# Middleware filling rules:
# yourproject.middlewares(filename).MiddlewareClass
DOWNLOADER_MIDDLEWARES = {
    # Set up proxy: disable the built-in HttpProxyMiddleware and enable ours
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': None,
    'scrapydemo.middlewares.ProxyMiddleware': 100,
}

(Note: in recent Scrapy versions the built-in middleware lives at scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware; the scrapy.contrib path above is the legacy name.)

After reading the above, you should know how to use Scrapy middleware. Thank you for reading!





© 2024 shulou.com SLNews company. All rights reserved.
