The process of building a Python crawler proxy pool


This article introduces how to build a proxy pool for a Python crawler. Many people run into exactly this situation in real projects, so let the editor walk you through how to deal with it. I hope you read it carefully and get something out of it!

Recently, while crawling data, I kept getting 403 responses, which roughly means the IP has been visiting too frequently and has been restricted. Restricting access by IP is the most common anti-crawling measure, and it is actually easy to get around: crawl the site through a proxy, and when one IP is blocked, switch to another. Well-funded companies basically use paid proxies, which are stable and trouble-free. For someone as broke as me, paid proxies are out of the question, so I generally use free domestic proxies, and there are plenty of them online.

Many people crawl a batch of free proxy IPs from the Internet and store them in some storage medium, such as an Excel file or a database, and maintain the proxies regularly to keep them usable. The drawback of this approach is that some machines do not have Excel, MySQL or Redis installed, which makes such a proxy pool unusable there.

I used to do Java development and often kept frequently used data in an ArrayList, which is convenient and efficient, so I borrowed that idea: the crawled proxy IPs are stored in a Python list, the list serves as the proxy pool, and the proxies in it are maintained regularly.

I usually crawl free-proxy websites such as xicidaili and swei360; these free proxies are enough for most of my crawling jobs. The crawling code uses the requests and pyquery libraries, so install them first if you have not already.
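If either library is missing, it can be installed with pip:

pip install requests pyquery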

First, the process of crawling the xicidaili website. Define a method for crawling xicidaili with two parameters: the URL and the number of proxy listing pages to crawl, i.e. how many pages to fetch. The method is as follows:

import random
import requests
from pyquery import PyQuery as pq
from requests.exceptions import RequestException

# the two pools holding HTTP and HTTPS proxies respectively
http_proxy_pool = []
https_proxy_pool = []

def get_xicidaili_proxy(url, page):
    for i in range(1, page):
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.x MetaSr 1.0"}
        response = requests.get(url + str(i), headers=headers)
        html = response.text
        doc = pq(html)
        ip_list = doc('#ip_list')('tr:gt(0)').items()
        for item in ip_list:
            ip = item.find('td:nth-child(2)').text()
            port = item.find('td:nth-child(3)').text()
            http_type = item.find('td:nth-child(6)').text()
            proxy_ip = http_type + "://" + ip + ":" + port
            if http_type == 'HTTP':
                http_proxy_pool.append(proxy_ip)
            elif http_type == 'HTTPS':
                https_proxy_pool.append(proxy_ip)
            # print(proxy_ip)

Two list variables, http_proxy_pool and https_proxy_pool, are defined to hold proxies of type HTTP and HTTPS respectively. PyQuery extracts the IP, the port and the HTTP type with CSS pseudo-selectors, combines them into a string of the form http://ip:port, and appends that string to http_proxy_pool or https_proxy_pool.

The method for crawling the swei360 website is not shown here; the principle is the same as for xicidaili.

A proxy should be checked before it is used. We use the status code of a requests GET request to decide whether a proxy is usable: if 200 is returned, the proxy is usable; any other code means it is not. The code is as follows:

def detect_proxy(test_url, http_type, proxy):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.x MetaSr 1.0"}
    proxies = {http_type: proxy}
    try:
        response = requests.get(test_url, proxies=proxies, headers=headers)
        if response.status_code == 200:
            print('proxy available', proxy)
            return True
        else:
            print('proxy unavailable', proxy)
            delete_proxy(http_type, proxy)
            return False
    except (requests.exceptions.ProxyError, RequestException):
        print('proxy unavailable', proxy)
        delete_proxy(http_type, proxy)
        return False

The detect_proxy method checks whether a proxy is usable. It takes three parameters: the test URL, the proxy type (http or https) and the proxy IP. If the request returns code 200, the proxy is usable and True is returned; otherwise it is unusable and False is returned. A request exception or any other error also marks the proxy as unusable and returns False. Unusable proxies are removed from the proxy pool.
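For example, a quick check might look like the following; the test URL (httpbin.org) and the proxy address are placeholders for illustration only, not from the original article:

# hypothetical test URL and proxy string, just to show the call signature
if detect_proxy("http://httpbin.org/ip", "http", "http://1.2.3.4:8080"):
    print("this proxy can be used for plain HTTP requests")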

When we take a proxy from the pool, a random one is returned, which avoids hammering the same proxy and getting it blocked. The code is as follows:

def get_https_proxy():
    proxy_ip = random.choice(https_proxy_pool)
    return proxy_ip

def get_http_proxy():
    proxy_ip = random.choice(http_proxy_pool)
    return proxy_ip

To keep the pool usable, a proxy is cleaned up as soon as it is found to be unusable, that is, it is removed from the http_proxy_pool or https_proxy_pool list.
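The article does not list delete_proxy itself; a minimal sketch consistent with how it is called above could look like this:

def delete_proxy(http_type, proxy):
    # remove the proxy string from whichever pool matches its type
    if http_type.lower() == 'http':
        if proxy in http_proxy_pool:
            http_proxy_pool.remove(proxy)
    else:
        if proxy in https_proxy_pool:
            https_proxy_pool.remove(proxy)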

A simple crawler proxy pool is now in place. To summarize the steps for building it (a short end-to-end sketch follows the list):

Crawl proxy information from free-proxy websites and store it in lists.

Provide a method that returns a random proxy from the pool. HTTP sites use HTTP proxies and HTTPS sites use HTTPS proxies, so there is one method for each proxy type.

Provide a method that checks whether a proxy is usable, returning True if it is and False if it is not.

Provide a method that deletes a proxy.
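Put together, a run might look like the sketch below; the xicidaili listing URL and the test URL are assumptions for illustration, not part of the original code:

if __name__ == '__main__':
    # fill the pools from the first few listing pages (the URL is an assumed example)
    get_xicidaili_proxy("http://www.xicidaili.com/nn/", 4)
    if http_proxy_pool:
        proxy = get_http_proxy()  # pick a random HTTP proxy
        if detect_proxy("http://httpbin.org/ip", "http", proxy):
            print("crawling through proxy:", proxy)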

This proxy pool is actually quite simple, and one drawback is that a proxy is treated as unusable whenever the test returns anything other than 200, even though other codes can have many causes, such as the local network being down or the test site being unreachable. A better practice is to give each proxy a score, say 10: subtract 1 each time a check fails, and once the score reaches 0 treat the proxy as unusable and remove it from the pool; whenever a check succeeds, reset the score to 10.

This way a proxy that fails a single check gets a chance to redeem itself rather than being written off outright.
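The scoring scheme is only described, not implemented, in the article; a minimal sketch of the idea (the dict-based pool is my own assumption) might be:

# proxies with their current scores; replaces the plain list pools for this variant
scored_pool = {}

def add_proxy(proxy):
    scored_pool[proxy] = 10          # every new proxy starts with a full score

def record_check(proxy, ok):
    if proxy not in scored_pool:
        return
    if ok:
        scored_pool[proxy] = 10      # a successful check restores the full score
    else:
        scored_pool[proxy] -= 1      # a failed check costs one point
        if scored_pool[proxy] <= 0:  # drop the proxy once the score reaches zero
            del scored_pool[proxy]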

That is all for "the process of building a Python crawler proxy pool". Thank you for reading. If you want to learn more about the industry, you can follow this site, where the editor will keep putting out practical, high-quality articles!
