
How to implement a multithreaded crawler in Python


This article introduces how to implement a multithreaded crawler in Python. The walkthrough is detailed; interested readers can follow along and will hopefully find it helpful.

Development environment:

Ubuntu 16.04, Python 3.6, bs4, virtualenv (virtual environment)

Create a virtual environment:

Create a project folder, create a virtual environment for the project, and install related packages using pip

mkdir mutiThreadCrawier
cd mutiThreadCrawier
mkdir content  # save crawled pages here
virtualenv env --python=python3.6  # create a virtual environment
source env/bin/activate  # activate the virtual environment
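The related packages can then be installed with pip; requests and beautifulsoup4 cover the imports used below:

pip install requests beautifulsoup4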

Import packages:

import time
import re
import threading
import urllib.request
import requests

from bs4 import BeautifulSoup

Define global variables:

g_mutex = threading.Condition()  # lock that can be acquired/released
print(g_mutex)
print(type(g_mutex))

g_urls = []  # stores the html source of each downloaded page
g_queue_urls = []  # urls waiting to be crawled
g_exist_urls = []  # urls that have already been crawled
g_failed_urls = []  # urls that failed to download
g_total_count = 0  # counter for downloaded pages

Define the thread class:

Create a thread class that inherits from threading.Thread and give it a constructor. In the run() method, it requests a network connection for the url and saves the page's html document locally; if the download fails, the exception is caught and the url is recorded as failed. Every url that has been handled is added to g_exist_urls.

class CrawlerThread(threading.Thread):
    def __init__(self, url, filename, tid):
        threading.Thread.__init__(self)
        self.filename = filename
        self.url = url
        self.tid = tid

    def run(self):
        try:
            resp = urllib.request.urlopen(self.url)
            html = resp.read()
            with open('content/' + self.filename, 'wb') as f:
                f.write(html)
        except Exception as e:
            g_exist_urls.append(self.url)
            g_failed_urls.append(self.url)
            print(f'page {self.url} download failed!')
            return  # skip the bookkeeping below when the download fails
        g_mutex.acquire()
        g_urls.append(html)
        g_exist_urls.append(self.url)
        g_mutex.release()
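As a quick check, a single thread can be exercised on its own before wiring up the full crawler. A minimal sketch, assuming the content/ directory created earlier exists; the url and filename are placeholders:

t = CrawlerThread('http://www.example.com', 'test.html', 0)  # placeholder url and filename
t.start()
t.join()
print(g_exist_urls, g_failed_urls)  # shows which lists the url ended up in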

Define the crawler class:

The constructor creates the log file; download() creates a thread; update_queque_url() refreshes the list of urls to crawl; get_url() uses bs4 to match and extract links; download_all() performs batch downloads by calling download(); spider() is the entry function of the crawl.

class Crawler:
    def __init__(self, name, domain, thread_number):
        self.name = name
        self.domain = domain
        self.thread_number = thread_number
        self.logfile = open('log.txt', 'w')
        self.thread_pool = []
        self.url = 'http://' + domain

    def spider(self):  # content will be updated as the crawler progresses
        global g_queue_urls  # initially there is only one url in the queue
        g_queue_urls.append(self.url)
        depth = 0  # depth of the crawl
        print(f'crawler {self.name} starts.')
        while g_queue_urls:
            depth += 1
            print(f'current crawl depth is {depth}')
            self.logfile.write(f'URL: {g_queue_urls[0]}')
            self.download_all()  # download everything in the queue
            self.update_queque_url()  # update the url queue
            self.logfile.write(f'> Depth: {depth}')
            count = 0
            while count
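The source breaks off in the middle of the `while count` loop, and the bodies of the remaining methods are not shown. The sketch below is a minimal reconstruction from the description above, assuming the unfinished loop logs the urls just crawled, download() starts one CrawlerThread, download_all() runs threads in batches of thread_number, update_queque_url() keeps only links not yet crawled, and get_url() extracts in-domain links with bs4; it is not the author's original code.

            # assumed completion of the truncated loop: log each url just crawled
            while count < len(g_queue_urls):
                self.logfile.write(f'URL: {g_queue_urls[count]}\n')
                count += 1

    def download(self, url, filename, tid):
        # create one CrawlerThread for the url and start it
        crawler_thread = CrawlerThread(url, filename, tid)
        self.thread_pool.append(crawler_thread)
        crawler_thread.start()
        return crawler_thread

    def download_all(self):
        # batch download: start at most thread_number threads at a time
        global g_total_count
        i = 0
        while i < len(g_queue_urls):
            batch = []
            while len(batch) < self.thread_number and i < len(g_queue_urls):
                g_total_count += 1
                batch.append(self.download(g_queue_urls[i], f'{g_total_count}.html', len(batch)))
                i += 1
            for t in batch:
                t.join(30)  # wait for the current batch to finish

    def update_queque_url(self):
        # refresh the queue with newly found links that have not been crawled yet
        global g_queue_urls
        g_queue_urls = [u for u in self.get_url() if u not in g_exist_urls]

    def get_url(self):
        # use bs4 to extract in-domain links from the downloaded pages
        urls = []
        for html in g_urls:
            soup = BeautifulSoup(html, 'html.parser')
            for a in soup.find_all('a', href=True):
                href = a['href']
                if href.startswith('http') and self.domain in href:
                    urls.append(href)
        return urls

With the class assembled this way, the crawler could be started with a placeholder name and domain:

if __name__ == '__main__':
    crawler = Crawler('mycrawler', 'www.example.com', 10)  # placeholder name/domain
    crawler.spider()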
