This article looks at how to crawl free proxies with Python and verify that they actually work, with detailed analysis and a working solution, in the hope of helping readers who want an easier way to solve this problem.

It shares a Python script that uses proxy IPs to access web pages, making it easy to scrape data while automatically verifying which IPs are usable.
When would you use proxy IPs? Suppose you want to crawl a website that holds 1,000,000 records but limits each IP to 1,000 records per hour. With a single IP the job takes 1,000,000 / 1,000 = 1,000 hours, roughly 40 days. By routing requests through proxy IPs and keeping switching among them, you can get past the 1,000-records-per-hour cap and finish far sooner.
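To make the rotation idea concrete, here is a minimal sketch of my own (not the script shared below) that cycles requests through a pool of proxies; the pool entries and the fetch_with_rotation helper are placeholders for illustration:

import itertools
import requests

# Placeholder pool; in practice, fill it with verified proxies such as
# those produced by the script below.
PROXY_POOL = [
    "http://203.0.113.1:8080",
    "http://203.0.113.2:3128",
]

def fetch_with_rotation(urls, proxy_pool):
    # Rotate to the next proxy on every request to spread load across IPs
    rotation = itertools.cycle(proxy_pool)
    results = []
    for url in urls:
        proxy = next(rotation)
        scheme = "https" if proxy.startswith("https") else "http"
        try:
            r = requests.get(url, proxies={scheme: proxy}, timeout=5)
            results.append((url, r.status_code))
        except requests.RequestException:
            results.append((url, None))  # this proxy failed for this URL
    return results

A real pool should of course contain only proxies that have been verified, which is exactly what the script below produces.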
Here is the script:
import requests
from lxml import etree

# Fetch the proxy list from the provider's home page
def get_proxy_list():
    url = "https://www.jxmtjt.com/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
        "Accept": "application/json, text/javascript, */*",
    }
    response = requests.get(url, headers=headers)
    res = []
    html = etree.HTML(response.text)
    type_dct = {
        "HTTP": "http://",
        "HTTPS": "https://",
    }
    data_list = html.xpath("//tbody/tr")
    for data in data_list:
        ip = data.xpath("./td[1]/text()")[0]
        port = data.xpath("./td[2]/text()")[0]
        proxy_type = data.xpath("./td[4]/text()")[0]
        res.append(type_dct[proxy_type] + ip + ':' + port)
    return res

# Test a proxy by fetching a page through it
def check(proxy):
    href = 'http://www.baidu.com/'
    if 'https' in proxy:
        proxies = {'https': proxy}
    else:
        proxies = {'http': proxy}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4396.0 Safari/537.36'
    }
    try:
        r = requests.get(href, proxies=proxies, timeout=5, headers=headers)
        return r.status_code == 200
    except requests.RequestException:
        return False

if __name__ == '__main__':
    proxy_list = get_proxy_list()
    print(proxy_list)
    for p in proxy_list:
        print(p, check(p))

Once you have copied the code, change the URL used to fetch the proxy IPs and it is ready to use. I have been using this code myself; you can also search Baidu (or cnblogs) for sites that publish free proxy IPs to get a source URL.
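One practical note, as a sketch of my own rather than part of the original script: checking proxies one by one is slow, because every dead proxy waits out the full 5-second timeout. The verification step can be parallelized with the standard-library concurrent.futures module, reusing get_proxy_list and check from above:

from concurrent.futures import ThreadPoolExecutor

if __name__ == '__main__':
    proxy_list = get_proxy_list()
    # Verify up to 20 proxies at once; each check() still has its 5s timeout
    with ThreadPoolExecutor(max_workers=20) as pool:
        results = list(pool.map(check, proxy_list))
    working = [p for p, ok in zip(proxy_list, results) if ok]
    print("usable proxies:", working)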
That is how Python can crawl free proxies and verify their availability. I hope the above content is of some help; if you still have unresolved doubts, you can follow the industry information channel for more related knowledge.