Implementing an IP Proxy Pool for a Python Crawler

2025-04-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

When crawling a site with multiple threads, or simply to get around anti-crawler measures, we often need to send requests through proxy IPs. Let's take a look at a basic implementation.

Proxy IPs are usually distributed as .txt lists, and many websites provide this service. Reliability is basically proportional to price. Free proxies from domestic (Chinese) providers are mostly unusable, so if you want reliable proxies you have to pay. Foreign sources are slightly better, and some of their free IPs are reasonably reliable.

After a casual search on the Internet, I found one such site. I had planned to scrape some proxy IPs from its pages, but it turned out that a ready-made txt file could be downloaded directly.
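A downloaded list may contain blank lines, duplicates or malformed entries. Below is a minimal sketch for cleaning such a file, assuming one ip:port entry per line (the format seen in the results later in this article). The names parse_proxy_lines and load_proxies are hypothetical helpers, not part of the original script.

```python
import re

# Assumed entry format: dotted IPv4 address, a colon, then the port.
_PROXY_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})$")


def parse_proxy_lines(lines):
    """Return cleaned 'ip:port' strings, skipping blanks, malformed lines and duplicates."""
    seen = set()
    proxies = []
    for raw in lines:
        entry = raw.strip()  # drops the trailing newline left by file iteration
        match = _PROXY_RE.match(entry)
        if not match:
            continue
        octets = [int(part) for part in match.group(1).split(".")]
        port = int(match.group(2))
        if any(octet > 255 for octet in octets) or not (0 < port < 65536):
            continue
        if entry not in seen:
            seen.add(entry)
            proxies.append(entry)
    return proxies


def load_proxies(path):
    """Read a proxy list file and return its valid entries in file order."""
    with open(path, "r", encoding="utf-8") as fp:
        return parse_proxy_lines(fp)
```

Stripping each line matters here: a trailing newline left in the string would otherwise end up inside the proxy address.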

After downloading the file, try fetching the Baidu home page through each proxy in turn.

    #!/usr/bin/env python
    #!-*-coding:utf-8-*-
    # Author: Yuan Li
    import urllib.request

    # Read the downloaded proxy list, one "ip:port" entry per line
    with open("c:\\temp\\thebigproxylist-17-12-20.txt", 'r') as fp:
        lines = fp.readlines()

    for ip in lines:
        ip = ip.strip()  # remove the trailing newline kept by readlines()
        if not ip:
            continue
        try:
            print("current proxy IP " + ip)
            # Route HTTP requests through the current proxy
            proxy = urllib.request.ProxyHandler({"http": ip})
            opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
            urllib.request.install_opener(opener)
            url = "http://www.baidu.com"
            data = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
            print("pass")
            print("--")
        except Exception as err:
            print(err)
            print("--")

The results are as follows:

    C:\Python36\python.exe C:/Users/yuan.li/Documents/GitHub/Python/Misc/crawler/proxy.py
    current proxy IP 137.74.168.174:80
    pass
    --
    current proxy IP 103.28.161.68:8080
    pass
    --
    current proxy IP 91.151.106.127:53281
    HTTP Error 503: Service Unavailable
    --
    current proxy IP 177.136.252.7:3128
    [error message garbled in source]
    --
    current proxy IP 47.89.22.200:80
    pass
    --
    current proxy IP 118.69.61.57:8888
    HTTP Error 503: Service Unavailable
    --
    current proxy IP 192.241.190.167:8080
    pass
    --
    current proxy IP 185.124.112.130:80
    pass
    --
    current proxy IP 83.65.246.181:3128
    pass
    --
    current proxy IP 79.137.42.124:3128
    pass
    --
    current proxy IP 95.0.217.32:80
    [error message garbled in source]
    --
    current proxy IP 104.131.94.221:8080
    pass
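The results show two kinds of outcome per proxy: a pass, or an error such as HTTP 503. One way to make that check reusable is to wrap it in a function that returns the outcome instead of printing it. This is only a sketch: the name check_proxy, the timeout default and the (ok, detail) return shape are assumptions, not the author's code. It uses a private opener rather than install_opener, so the proxy setting does not leak into other requests.

```python
import urllib.request


def check_proxy(proxy, url="http://www.baidu.com", timeout=5):
    """Try to fetch `url` through `proxy` ('ip:port'); return (ok, detail)."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    # Private opener: nothing is installed globally.
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            response.read()
            return True, "HTTP %d" % response.status
    except Exception as err:
        # Covers HTTPError (e.g. 503), URLError, timeouts and socket errors.
        return False, str(err)
```

A caller could then filter a list with something like `[p for p in proxies if check_proxy(p)[0]]`. The timeout keeps one dead proxy from stalling the whole run.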

However, reading proxies from a downloaded text file only suits relatively stable IP sources. If the IPs are unstable, the file quickly goes stale. It is better to fetch the latest IP addresses dynamically, and many websites provide an API that can be queried in real time.

We use the same website again, this time calling its API. The request has to be disguised as coming from a browser in order to fetch the list.

    #!/usr/bin/env python
    #!-*-coding:utf-8-*-
    # Author: Yuan Li
    import urllib.request

    # Disguise the request as a browser by sending a browser User-Agent
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    # installed as global
    urllib.request.install_opener(opener)

    # Fetch the current proxy list from the API
    data = urllib.request.urlopen("http://www.thebigproxylist.com/members/proxy-api.php?output=all&user=list&pass=8a544b2637e7a45d1536e34680e11adf").read().decode('utf8')
    ippool = data.split('\n')

    for ip in ippool:
        # Keep only the "ip:port" field. The delimiter is garbled in the
        # source listing; a comma is assumed here.
        ip = ip.split(',')[0].strip()
        if not ip:
            continue
        try:
            print("current proxy IP " + ip)
            proxy = urllib.request.ProxyHandler({"http": ip})
            opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
            urllib.request.install_opener(opener)
            url = "http://www.baidu.com"
            data = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
            print("pass")
            print("--")
        except Exception as err:
            print(err)
            print("--")

The results are as follows:

    C:\Python36\python.exe C:/Users/yuan.li/Documents/GitHub/Python/Misc/crawler/proxy.py
    current proxy IP 213.233.57.134:80
    HTTP Error 403: Forbidden
    --
    current proxy IP 144.76.81.79:3128
    pass
    --
    current proxy IP 45.55.132.29:53281
    HTTP Error 503: Service Unavailable
    --
    current proxy IP 180.254.133.124:8080
    pass
    --
    current proxy IP 5.196.215.231:3128
    HTTP Error 503: Service Unavailable
    --
    current proxy IP 177.99.175.195:53281
    HTTP Error 503: Service Unavailable
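Since the point of calling the API is to keep the list fresh, the fetch can be wrapped in a small pool object that re-fetches once its cached list is too old. The sketch below is one possible shape, not the author's implementation. The ProxyPool class, its ttl parameter and the fetch callable are all hypothetical; fetch stands for any function that returns 'ip:port' strings, for example a wrapper around the API call above.

```python
import random
import time


class ProxyPool:
    """Caches a proxy list and re-fetches it once it is older than `ttl` seconds."""

    def __init__(self, fetch, ttl=300, clock=time.monotonic):
        self._fetch = fetch      # callable returning an iterable of 'ip:port' strings
        self._ttl = ttl          # seconds before the cached list counts as stale
        self._clock = clock      # injectable so the behaviour can be tested
        self._proxies = []
        self._fetched_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._proxies = [p.strip() for p in self._fetch() if p.strip()]
            self._fetched_at = now

    def get(self):
        """Return a random proxy from the (possibly refreshed) list."""
        self._refresh_if_stale()
        if not self._proxies:
            raise LookupError("proxy pool is empty")
        return random.choice(self._proxies)

    def discard(self, proxy):
        """Drop a proxy that just failed, until the next refresh."""
        if proxy in self._proxies:
            self._proxies.remove(proxy)
```

A crawler would call get() before each request and discard() after a failure. Choosing the proxy at random spreads requests across the list, which is a design choice of this sketch rather than something the article prescribes.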

Checking the proxies one at a time in a plain for loop is too slow, so I tried checking them with multiple threads, which is much faster.

    #!/usr/bin/env python
    #!-*-coding:utf-8-*-
    # Author: Yuan Li
    import threading
    import queue
    import urllib.request

    #Number of threads
    n_thread = 10
    #Create queue
    queue = queue.Queue()

    class ThreadClass(threading.Thread):
        def __init__(self, queue):
            super(ThreadClass, self).__init__()
            # Assign thread working with queue
            self.queue = queue

        def run(self):
            while True:
                # Get from queue job
                host = self.queue.get()
                print(self.name + ":" + host)
                try:
                    # print("current proxy IP " + host)
                    proxy = urllib.request.ProxyHandler({"http": host})
                    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
                    # Use this thread's own opener; install_opener() sets a
                    # process-wide opener that other threads would overwrite
                    url = "http://www.baidu.com"
                    data = opener.open(url).read().decode('utf-8', 'ignore')
                    print("pass")
                    print("--")
                except Exception as err:
                    print(err)
                    print("--")
                # signals to queue jo
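The listing above is cut off before the threads are started, so here is a self-contained sketch of the same queue-and-worker idea. It is not the author's remaining code. The names fetch_via_proxy and check_all, the None sentinel used to stop workers, and the injectable check parameter are all assumptions made for illustration.

```python
import queue
import threading
import urllib.request


def fetch_via_proxy(proxy, url="http://www.baidu.com", timeout=5):
    """Return True if `url` can be fetched through `proxy` ('ip:port')."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy}))
    try:
        with opener.open(url, timeout=timeout) as response:
            response.read()
        return True
    except Exception:
        return False


def check_all(proxies, check=fetch_via_proxy, n_thread=10):
    """Check proxies concurrently and return a {proxy: bool} mapping."""
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            proxy = jobs.get()
            if proxy is None:          # sentinel: no more work for this thread
                jobs.task_done()
                return
            try:
                ok = bool(check(proxy))
            except Exception:
                ok = False             # a crashing check counts as a failed proxy
            with lock:
                results[proxy] = ok
            jobs.task_done()           # signal that this queue job is done

    threads = [threading.Thread(target=worker) for _ in range(n_thread)]
    for thread in threads:
        thread.start()
    for proxy in proxies:
        jobs.put(proxy)
    for _ in threads:
        jobs.put(None)                 # one sentinel per worker
    jobs.join()                        # wait until every job is marked done
    for thread in threads:
        thread.join()
    return results
```

Calling `check_all(proxies)` on a list of 'ip:port' strings returns which of them passed. Each worker builds its own opener instead of calling install_opener, because the installed opener is shared by the whole process and concurrent threads would overwrite each other's proxy.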
