
How to build a crawler proxy IP pool in java


This article explains how to build a proxy IP pool for a crawler. It should serve as a useful reference; interested readers can follow along, and I hope you learn something from it. Below, the editor walks you through it.

Step description

1. Crawl IPs from websites that provide free proxy IPs.

2. Do an initial filter on the captured IPs, for example discarding proxies whose type is not HTTPS and proxies whose connection takes more than 2 seconds.

3. For the IPs that pass the initial filter, check their quality against the target site to decide whether they are actually usable. This quality check weeds out a large number of IPs.

4. Write the IPs that meet the requirements to the Redis database, storing them as a Redis List (see the sketch after this list).

5. Set a crawl cycle to refresh the proxy pool: after the new IPs are fetched and validated, empty the old list and write the new batch to it.
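Steps 2, 4 and 5 are not covered by the full example below, which writes validated IPs to a text file instead of Redis. The following is a minimal sketch of that missing piece, assuming the requests and redis-py packages and a Redis server on localhost; the key name proxy_pool, the test URL default and the helper functions are illustrative and not part of the original script, and the HTTPS-type filter is left out because it depends on how the proxy site labels its entries.

# A minimal sketch of steps 2, 4 and 5 (latency filter + Redis-backed pool).
# Assumes the requests and redis-py packages and a Redis server on localhost;
# the key name "proxy_pool" and the helper names are illustrative.
import random
import redis
import requests

r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)
POOL_KEY = 'proxy_pool'                         # illustrative key name
TEST_URL = 'http://www.cnblogs.com/TurboWay/'   # same target url as the full example

def fast_enough(ip, limit=2.0):
    # Step 2: discard proxies that take longer than `limit` seconds to answer.
    proxies = {'http': 'http://' + ip, 'https': 'http://' + ip}
    try:
        resp = requests.get(TEST_URL, proxies=proxies, timeout=limit)
        return resp.status_code == 200 and resp.elapsed.total_seconds() <= limit
    except requests.RequestException:
        return False

def refresh_pool(candidate_ips):
    # Steps 4-5: keep only the fast proxies, then empty the old Redis List
    # and write the new batch in a single pipeline.
    valid = [ip for ip in candidate_ips if fast_enough(ip)]
    pipe = r.pipeline()
    pipe.delete(POOL_KEY)
    if valid:
        pipe.rpush(POOL_KEY, *valid)
    pipe.execute()
    return valid

def get_random_proxy():
    # The crawler picks a random proxy from the current pool before each request.
    ips = r.lrange(POOL_KEY, 0, -1)
    return random.choice(ips) if ips else None

Each crawl cycle would call refresh_pool with the freshly scraped candidates, and the crawler itself would call get_random_proxy before each request.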

Example

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests, threading, datetime
from bs4 import BeautifulSoup
import random

"""
1. Grab proxy IPs from the proxy website.
2. Verify each crawled IP against the specified target url.
3. Save the IPs that pass to the specified path.
"""

# ---------------------- file handling ----------------------
# append a line to the file
def write(path, text):
    with open(path, 'a', encoding='utf-8') as f:
        f.writelines(text)
        f.write('\n')

# empty the file
def truncatefile(path):
    with open(path, 'w', encoding='utf-8') as f:
        f.truncate()

# read the file into a list of lines
def read(path):
    with open(path, 'r', encoding='utf-8') as f:
        txt = []
        for s in f.readlines():
            txt.append(s.strip())
    return txt

# ------------------------------------------------------------
# compute a time difference, formatted as hours:minutes:seconds
def gettimediff(start, end):
    seconds = (end - start).seconds
    m, s = divmod(seconds, 60)
    h, m = divmod(m, 60)
    diff = ("%02d:%02d:%02d" % (h, m, s))
    return diff

# ------------------------------------------------------------
# return a random request header
def getheaders():
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
    UserAgent = random.choice(user_agent_list)
    headers = {'User-Agent': UserAgent}
    return headers

# ---------------------- check whether an ip is usable ----------------------
def checkip(targeturl, ip):
    headers = getheaders()                                        # custom request header
    proxies = {"http": "http://" + ip, "https": "http://" + ip}   # proxy ip
    try:
        response = requests.get(url=targeturl, proxies=proxies, headers=headers, timeout=5).status_code
        if response == 200:
            return True
        else:
            return False
    except:
        return False

# ---------------------- fetch proxies ----------------------
# free proxies from XiciDaili
def findip(type, pagenum, targeturl, path):  # ip type, page number, target url, path where ips are stored
    list = {'1': 'http://www.xicidaili.com/nt/',   # xicidaili domestic regular proxies
            '2': 'http://www.xicidaili.com/nn/',   # xicidaili domestic high-anonymity proxies
            '3': 'http://www.xicidaili.com/wn/',   # xicidaili domestic https proxies
            '4': 'http://www.xicidaili.com/wt/'}   # xicidaili foreign http proxies
    url = list[str(type)] + str(pagenum)           # build the url
    headers = getheaders()                         # custom request header
    html = requests.get(url=url, headers=headers, timeout=5).text
    soup = BeautifulSoup(html, 'lxml')
    all = soup.find_all('tr', class_='odd')
    for i in all:
        t = i.find_all('td')
        ip = t[1].text + ':' + t[2].text
        is_avail = checkip(targeturl, ip)
        if is_avail == True:
            write(path=path, text=ip)
            print(ip)

# ---------------------- multithreaded crawl entry point ----------------------
def getip(targeturl, path):
    truncatefile(path)                   # empty the file before crawling
    start = datetime.datetime.now()      # start time
    threads = []
    for type in range(4):                # four ip types, first three pages each: 12 threads in total
        for pagenum in range(3):
            t = threading.Thread(target=findip, args=(type + 1, pagenum + 1, targeturl, path))
            threads.append(t)
    print('start crawling proxy ips')
    for s in threads:                    # start the crawling threads
        s.start()
    for e in threads:                    # wait for all threads to finish
        e.join()
    print('crawl complete')
    end = datetime.datetime.now()        # end time
    diff = gettimediff(start, end)       # elapsed time
    ips = read(path)                     # ips crawled
    print('total proxy ips crawled: %s, total time: %s\n' % (len(ips), diff))

# ---------------------- start ----------------------
if __name__ == '__main__':
    path = 'ip.txt'                                    # file that stores the crawled ips
    targeturl = 'http://www.cnblogs.com/TurboWay/'     # url used to verify ip validity
    getip(targeturl, path)

Thank you for reading this article carefully. I hope "How to build a crawler proxy IP pool in java" has been helpful to you, and that you will continue to support us and follow our industry information channel, where more related knowledge is waiting for you.


