
How to configure and debug IP proxies in the Python crawler Scrapy framework


This article walks through configuring and debugging proxy IPs in the Python crawler framework Scrapy. It goes into some detail and should be a useful reference; interested readers, do read on!

Where does the proxy ip logic belong?

The structure of a Scrapy project looks like this:

scrapydownloadertest        # project folder
│  items.py                 # defines the data structure for storing crawl results
│  middlewares.py           # middleware (think of a filter/interceptor in Java)
│  pipelines.py             # data pipeline, operates on the scraped data
│  settings.py              # configuration file
│  __init__.py              # initialization logic
│
├─spiders                   # folder for the spider classes that process crawl results
│   │  httpProxyIp.py
│   │  __init__.py          # spider initialization logic
│
scrapy.cfg

From the structure above we can see that the proxy ip must be set before a request is sent, so the only place that fits is middlewares.py; the proxy logic therefore lives there. Add the following code directly to it:

# Scrapy's built-in Downloader Middlewares provide Scrapy's basic functionality.
# A custom Downloader Middleware only needs to define one or more of the core methods.
import random

# the (object) base class can be omitted in Python 3; the effect is the same
class SimpleProxyMiddleware(object):
    # declare a list of proxy ips
    proxyList = ['http://218.75.158.153:3128', 'http://188.226.141.61:8080']

    # core method: called for every request before it is sent
    def process_request(self, request, spider):
        # randomly pick a proxy and strip any whitespace
        proxy = random.choice(self.proxyList).strip()
        # print the result so we can observe it
        print("this is request ip:" + proxy)
        # set the request's proxy meta attribute to the chosen proxy ip
        request.meta['proxy'] = proxy

    # core method: called for every response before it reaches the spider
    def process_response(self, request, response, spider):
        # the request failed if the status is not 200
        if response.status != 200:
            # pick a new proxy ip
            proxy = random.choice(self.proxyList).strip()
            print("this is response ip:" + proxy)
            # set the new proxy ip on the request
            request.meta['proxy'] = proxy
            # returning the request reschedules it with the new proxy
            return request
        return response

Each Downloader Middleware is a class that defines one or more methods; the core methods are:

process_request(request, spider)

process_response(request, response, spider)

process_exception(request, exception, spider)
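The SimpleProxyMiddleware above implements only the first two. As a hedged sketch of the third hook (not in the original article), process_exception can swap in a fresh proxy when the download itself raises an error such as a timeout; returning the request asks Scrapy to reschedule it:

import random

# hypothetical example class, not part of the article's project
class ProxyOnErrorMiddleware(object):
    proxyList = ['http://218.75.158.153:3128', 'http://188.226.141.61:8080']

    # called when downloading the request raises an exception
    def process_exception(self, request, exception, spider):
        # pick a different proxy and retry the request with it
        proxy = random.choice(self.proxyList).strip()
        print("download failed (%s), retrying via: %s"
              % (type(exception).__name__, proxy))
        request.meta['proxy'] = proxy
        # returning the request reschedules it; returning None would
        # let other middlewares handle the exception instead
        return request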

Then find the DOWNLOADER_MIDDLEWARES section in the settings.py file.

Modify it as follows; that is, uncomment it and add the path of the middleware class you just wrote:
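The original shows this change as a screenshot; reconstructed for the scrapydownloadertest project used throughout the article, it looks like this:

DOWNLOADER_MIDDLEWARES = {
    'scrapydownloadertest.middlewares.SimpleProxyMiddleware': 100,
}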

With the simple proxy ip configured, we turn to the file httpProxyIp.py, which I generated with the command scrapy genspider httpProxyIp icanhazip.com. The freshly created file looks like this:

# -*- coding: utf-8 -*-
import scrapy

class HttpproxyipSpider(scrapy.Spider):
    name = 'httpProxyIp'
    allowed_domains = ['icanhazip.com']
    start_urls = ['http://icanhazip.com/']

    def parse(self, response):
        pass

Let's modify it, and the final code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.cmdline import execute

class HttpproxyipSpider(scrapy.Spider):
    # spider task name
    name = 'httpProxyIp'
    # allowed domains
    allowed_domains = ['icanhazip.com']
    # initial urls to crawl
    start_urls = ['http://icanhazip.com/']

    # the spider's parse method: all parsing of the content happens here;
    # self is the instance reference, response is the crawl result
    def parse(self, response):
        print('ip after proxy:', response.text)

# the usual way to write a main function, the entry point of the whole program
if __name__ == '__main__':
    execute(['scrapy', 'crawl', 'httpProxyIp'])

Run the spider now with scrapy crawl httpProxyIp and you can see the output.

Clearly, the result we want is not printed, which shows that the proxies in proxyList = ['http://218.75.158.153:3128', 'http://188.226.141.61:8080'] are no longer alive. Let's look for ones we can actually use; since we are sticking with free proxies here, it takes some time to find a free proxy ip that still works.
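One quick way to weed out dead candidates before putting them into proxyList (a small sketch using the requests library, not part of the original article) is to fire a test request through each proxy at the same icanhazip.com endpoint:

import requests

candidates = ['http://218.75.158.153:3128', 'http://188.226.141.61:8080']

for proxy in candidates:
    try:
        # route a request through the candidate proxy with a short timeout
        r = requests.get('http://icanhazip.com/',
                         proxies={'http': proxy}, timeout=5)
        # icanhazip returns the visible ip as plain text
        print(proxy, '->', r.text.strip())
    except requests.RequestException as e:
        print(proxy, 'failed:', type(e).__name__)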

This completes the proxy setup and verification debugging for scrapy.

How to configure dynamic proxy ip

A paid proxy ip service is used here; you can use offerings from providers such as Kuaidaili or Abuyun. After registering and paying, you are given an access URL along with a username and password. Let's go straight to the code: create a new class in middlewares.py.
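The original article showed this class as a screenshot. Below is a minimal reconstruction, assuming Abuyun's HTTP tunnel style of service, where a fixed proxy endpoint authenticates each request via a Proxy-Authorization header; the endpoint and the proxyUser/proxyPass values are placeholders to be replaced with what the provider issues you:

import base64

class AbuyunProxyMiddleware(object):
    # tunnel endpoint and credentials issued after registration (placeholders)
    proxyServer = 'http://http-dyn.abuyun.com:9020'
    proxyUser = 'YOUR_USERNAME'
    proxyPass = 'YOUR_PASSWORD'
    # pre-compute the basic-auth value for the Proxy-Authorization header
    proxyAuth = 'Basic ' + base64.b64encode(
        (proxyUser + ':' + proxyPass).encode()).decode()

    def process_request(self, request, spider):
        # route every request through the paid tunnel proxy...
        request.meta['proxy'] = self.proxyServer
        # ...and authenticate against it with basic auth
        request.headers['Proxy-Authorization'] = self.proxyAuth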

Then modify the DOWNLOADER_MIDDLEWARES entry in settings.py:

DOWNLOADER_MIDDLEWARES = {
    # comment out the previous example and use AbuyunProxyMiddleware instead
    # 'scrapydownloadertest.middlewares.SimpleProxyMiddleware': 100,
    'scrapydownloadertest.middlewares.AbuyunProxyMiddleware': 100,
}

Nothing else needs to change, and we are ready to start the spider. This time it can be launched a different way: since the project is developed in PyCharm, you can run httpProxyIp.py directly from the IDE through the if __name__ == '__main__' entry added earlier, rather than typing the scrapy crawl command.

http://icanhazip.com/ is a site that simply displays the visitor's current ip, so it is an easy way to verify that Scrapy's proxy ip settings are working.

That is all the content of the article "How to configure and debug IP proxies in the Python crawler Scrapy framework". Thank you for reading! I hope it has been helpful; for more related knowledge, welcome to follow the industry information channel!
