
How to crawl free proxy IPs with Python and verify that they work

2025-01-18 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

To explain how to crawl free proxy IPs with Python and verify that they work, this article walks through the approach and the corresponding code in detail, hoping to give readers facing the same problem an easier way forward.

This post shares a Python script that scrapes free proxy IPs, uses them to access web pages, and automatically verifies whether each IP actually works, which makes it convenient to collect data through proxies.

When would you use proxy IPs? Suppose you want to scrape a website that holds 1,000,000 records but rate-limits each IP to 1,000 records per hour. With a single IP the job would take about 1,000 hours, roughly 40 days. By rotating through proxy IPs you can sidestep the per-IP limit of 1,000 records per hour and finish much sooner.
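The rotation described above can be sketched with `itertools.cycle`, which hands out the next proxy on every request. This is a minimal illustration; the pool addresses below are placeholders, not real proxies, and `next_proxies`/`fetch` are helper names introduced here, not part of the original script.

```python
from itertools import cycle

import requests

# Placeholder pool of proxies that have already passed verification
proxy_pool = cycle([
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "https://10.0.0.3:8080",
])

def next_proxies():
    """Take the next proxy from the pool and build the `proxies` dict requests expects."""
    proxy = next(proxy_pool)
    scheme = "https" if proxy.startswith("https") else "http"
    return {scheme: proxy}

def fetch(url):
    """Fetch one page, switching to the next proxy on every call."""
    return requests.get(url, proxies=next_proxies(), timeout=5)
```

Each call to `fetch` goes out through a different pool member, so the per-IP request count grows at 1/N of the overall request rate.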

The script begins:

import requests
from lxml import etree


# Fetch the proxy list from the homepage of the free-proxy site
def get_proxy_list():
    url = "https://www.jxmtjt.com/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
        "Accept": "application/json, text/javascript, */*",
    }
    response = requests.get(url, headers=headers)
    html = etree.HTML(response.text)
    type_dct = {
        "HTTP": "http://",
        "HTTPS": "https://",
    }
    res = []
    for row in html.xpath("//tbody/tr"):
        ip = row.xpath("./td[1]/text()")[0]
        port = row.xpath("./td[2]/text()")[0]
        proxy_type = row.xpath("./td[4]/text()")[0]
        res.append(type_dct[proxy_type] + ip + ':' + port)
    return res


# Verify a proxy by requesting a known page through it
def check(proxy):
    href = 'http://www.baidu.com/'
    if 'https' in proxy:
        proxies = {'https': proxy}
    else:
        proxies = {'http': proxy}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4396.0 Safari/537.36'
    }
    try:
        r = requests.get(href, proxies=proxies, timeout=5, headers=headers)
        return r.status_code == 200
    except requests.RequestException:
        return False


if __name__ == '__main__':
    proxy_list = get_proxy_list()
    print(proxy_list)
    for p in proxy_list:
        print(p, check(p))

After copying the code, just change the URL used to fetch the proxy IPs and it is ready to use. I have been using this code myself; you can also search Baidu or cnblogs for sites that publish free proxy IPs to get a source URL. That covers how to crawl free proxies with Python and verify that they work. I hope the above content is of some help; if you still have questions, you can follow the industry information channel for more related knowledge.
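Checking proxies one at a time is slow, since every dead proxy waits out the full five-second timeout. A thread pool can test many proxies at once. The sketch below introduces a `filter_alive` helper (a name not in the original script) that takes any checker callable, such as the `check` function above:

```python
from concurrent.futures import ThreadPoolExecutor

def filter_alive(proxies, checker, workers=20):
    """Return only the proxies that `checker` reports as working,
    testing up to `workers` of them in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(checker, proxies))
    return [p for p, ok in zip(proxies, results) if ok]

# Usage with the script above:
#   alive = filter_alive(get_proxy_list(), check)
```

Because `pool.map` preserves input order, the surviving proxies come back in the same order they were scraped.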
