This article introduces how to implement a multi-threaded crawler in Python. The content is quite detailed; interested readers can refer to it, and I hope you find it helpful.
Development environment:
Ubuntu 16.04, Python 3.6, bs4, virtualenv (virtual environment)
Create a virtual environment:
Create a project folder, create a virtual environment for the project, and install related packages using pip
mkdir mutiThreadCrawier
cd mutiThreadCrawier
mkdir content  # save crawled pages here
virtualenv env --python=python3.6  # create a virtual environment
source env/bin/activate  # activate the virtual environment
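The step above mentions installing the related packages with pip but does not show the command; given the environment listed earlier, the install would be something like:
pip install requests beautifulsoup4  # bs4 is published on PyPI as beautifulsoup4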
Import packages:
import time
import re
import threading
import urllib.request
import requests
from bs4 import BeautifulSoup
Define variables:
g_mutex = threading.Condition()  # lock that can be acquired / released
print(g_mutex)
print(type(g_mutex))
g_urls = []  # stores the html source of each crawled url
g_queue_urls = []  # urls to crawl
g_exist_urls = []  # urls that have already been crawled
g_failed_urls = []  # failed links
g_total_count = 0  # counter for pages that have been downloaded
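Since threading.Condition also supports the context-manager protocol, the acquire/release pairs used below can equivalently be written with a with block; a minimal sketch (the url is a placeholder, not from the article):
with g_mutex:  # acquired on entry, released on exit
    g_failed_urls.append('http://example.com/broken')  # placeholder url, for illustration only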
Define the thread class:
Create a thread class that inherits from threading.Thread and give it a constructor. In the run() function it opens a network connection for the given url, saves the page's html document locally, and records the failure if the download raises an exception; the url that has just been downloaded is then added to g_exist_urls.
class CrawlerThread(threading.Thread):
    def __init__(self, url, filename, tid):
        threading.Thread.__init__(self)
        self.filename = filename
        self.url = url
        self.tid = tid

    def run(self):
        try:
            resp = urllib.request.urlopen(self.url)
            html = resp.read()
            with open('content/' + self.filename, 'wb') as f:
                f.write(html)
        except Exception as e:
            g_mutex.acquire()
            g_exist_urls.append(self.url)
            g_failed_urls.append(self.url)
            g_mutex.release()
            print(f'page {self.url} download failed!')
            return  # do not fall through to the success path
        g_mutex.acquire()
        g_urls.append(html)
        g_exist_urls.append(self.url)
        g_mutex.release()
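As a quick usage sketch of the class above (assuming the content/ directory already exists; the url and filename are placeholders, not from the article):
thread = CrawlerThread('http://example.com', 'page_0.html', 0)  # placeholder arguments
thread.start()  # runs run() in a background thread
thread.join()   # wait for the download to finish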
Define the crawler class:
The constructor creates the log file; download() creates a thread, update_queque_url() updates the list of links, get_url() uses bs4 to match and extract links, and download_all() performs batch downloads by calling download(). spider() is the entry function that drives the crawl.
class Crawler:
    def __init__(self, name, domain, thread_number):
        self.name = name
        self.domain = domain
        self.thread_number = thread_number
        self.logfile = open('log.txt', 'w')
        self.thread_pool = []
        self.url = 'http://' + domain

    def spider(self):  # content will be updated as the crawler progresses
        global g_queue_urls
        # initially there is only one url in the queue
        g_queue_urls.append(self.url)
        # crawl depth
        depth = 0
        print(f'crawler {self.name} starts.')
        while g_queue_urls:
            depth += 1
            print(f'current crawl depth is {depth}')
            self.logfile.write(f'URL: {g_queue_urls[0]}')
            self.download_all()  # download everything in the queue
            self.update_queque_url()  # update the url queue
            self.logfile.write(f"> Depth: {depth}")
            count = 0
            while count
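The post is cut off here, in the middle of spider()'s inner loop, and the remaining methods are never shown. Based only on the description above, one possible sketch of download(), download_all(), get_url() and update_queque_url(), which is an assumption rather than the author's original code, could look like this (indented to sit inside the Crawler class):
    def download(self, url, filename, tid):
        # assumption: start one CrawlerThread per url and keep it in the pool
        thread = CrawlerThread(url, filename, tid)
        self.thread_pool.append(thread)
        thread.start()
        return thread

    def download_all(self):
        # assumption: download the queued urls in batches of thread_number threads
        global g_total_count
        i = 0
        while i < len(g_queue_urls):
            self.thread_pool = []
            for tid, url in enumerate(g_queue_urls[i:i + self.thread_number]):
                self.download(url, f'page_{g_total_count}.html', tid)
                with g_mutex:
                    g_total_count += 1
            for thread in self.thread_pool:
                thread.join()  # wait for the current batch to finish
            i += self.thread_number
        g_queue_urls.clear()

    def get_url(self, html):
        # assumption: parse anchors with BeautifulSoup, keep in-domain links not yet crawled
        urls = []
        soup = BeautifulSoup(html, 'html.parser')
        for a in soup.find_all('a', href=True):
            href = a['href']
            if href.startswith('http') and self.domain in href and href not in g_exist_urls:
                urls.append(href)
        return urls

    def update_queque_url(self):
        # assumption: rebuild the queue from the links found in the downloaded pages
        g_queue_urls.clear()
        for html in g_urls:
            for url in self.get_url(html):
                if url not in g_queue_urls:
                    g_queue_urls.append(url)
        g_urls.clear()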