
How to Use a Thread Pool in a Python Crawler


This article introduces how to use a thread pool in a Python crawler. Many people have questions about this topic in their day-to-day work, so I have gathered and organized the material into a simple, easy-to-follow walkthrough. I hope it helps clear up your doubts. Follow along and let's study!

I. Preface

By now we have covered the basics of crawling: absent strange anti-crawler mechanisms, most data can be scraped given enough time to analyze the page. What we need to consider next is efficiency, that is, making the crawler asynchronous. There are two common approaches: one is a thread pool (multithreading within a single process), and the other is coroutines (if I remember correctly, the coroutine module entered the standard library in Python 3.4; I was using Python 3.9 when I wrote this post).
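For context only, here is a minimal sketch of the coroutine approach mentioned above. It is not part of this article's walkthrough, and it assumes Python 3.7 or later, where asyncio.run is available:

import asyncio

async def func(url):
    print("downloading:", url)
    await asyncio.sleep(2)  # non-blocking sleep stands in for network I/O
    print("download completed:", url)

async def main():
    # schedule all three "downloads" concurrently; total time is ~2s, not 6s
    await asyncio.gather(*(func(u) for u in ["a", "b", "c"]))

asyncio.run(main())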

Today we will first talk about thread pools.

II. Synchronous code demonstration

Let's first write the code in ordinary synchronous form.

import time

def func(url):
    print("downloading:", url)
    time.sleep(2)
    print("download completed:", url)

if __name__ == '__main__':
    start = time.time()  # start time
    url_list = ["a", "b", "c"]
    for url in url_list:
        func(url)
    end = time.time()  # end time
    print(end - start)
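Running this, the three downloads happen one after another, so the output should look roughly like the following (the exact elapsed time will differ slightly on your machine):

downloading: a
download completed: a
downloading: b
download completed: b
downloading: c
download completed: c
6.0023...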

It is as expected: three tasks at two seconds each run back to back, so the running time is indeed about six seconds.

III. Asynchronous thread pool code

So what if we use a thread pool to run the above code?

import time
from multiprocessing.dummy import Pool  # multiprocessing.dummy exposes a thread-based Pool with the same API as the process Pool

def func(url):
    print("downloading:", url)
    time.sleep(2)
    print("download completed:", url)

if __name__ == '__main__':
    start = time.time()  # start time
    url_list = ["a", "b", "c"]
    # instantiate a thread pool and cap it at the list length; setting an upper limit is optional
    pool = Pool(len(url_list))
    pool.map(func, url_list)
    end = time.time()  # end time
    print(end - start)

We find that the running time this time is only 2-3 seconds. A thread pool can be loosely understood as doing multiple tasks at the same time. (An equivalent version using the standard library's concurrent.futures is sketched after the notes below.)

Note:

1. I'm using PyCharm; if you run this in VS or the IDLE that ships with Python, you may only see the output at the end of the run.

2. The output may not come out in the order a, b, c, because the tasks run concurrently.
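As promised above, here is an equivalent sketch using concurrent.futures.ThreadPoolExecutor from the standard library. This is an alternative the article itself does not use, shown only for comparison:

import time
from concurrent.futures import ThreadPoolExecutor

def func(url):
    print("downloading:", url)
    time.sleep(2)
    print("download completed:", url)

if __name__ == '__main__':
    start = time.time()
    url_list = ["a", "b", "c"]
    # map() distributes the tasks across the pool's worker threads
    with ThreadPoolExecutor(max_workers=len(url_list)) as executor:
        executor.map(func, url_list)
    end = time.time()
    print(end - start)

Leaving the with block waits for all tasks to finish, so the elapsed time printed afterwards covers the whole batch, much like pool.map above.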

IV. Crawling images with a synchronous crawler

Because our focus is on how much the thread pool improves crawling efficiency, we will simply crawl a single page of images.

import os
import time

import requests
from lxml import etree

# folder where the images are saved (the author's original path, translated)
SAVE_DIR = r"C:\Users\ASUS\Desktop\CSDN\high performance asynchronous crawler\thread pool\synchronous crawler 4K beauty image"

def save_photo(url, title):
    # UA spoofing
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}
    # send the request for the image bytes
    photo = requests.get(url=url, headers=header).content
    # build the target path; skip files that were already downloaded
    path = os.path.join(SAVE_DIR, title + ".jpg")
    if not os.path.exists(path):
        with open(path, "wb") as fp:
            print(title, "start downloading!")
            fp.write(photo)
            print(title, "download complete!")

if __name__ == '__main__':
    start = time.time()
    # create the folder
    if not os.path.exists(SAVE_DIR):
        os.mkdir(SAVE_DIR)
    # UA spoofing
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}
    # target url
    url = "https://pic.netbian.com/4kmeinv/"
    # send the request and get the page source
    page = requests.get(url=url, headers=header).text
    # xpath parsing: get the list of detail-page links for the pictures
    tree = etree.HTML(page)
    url_list = tree.xpath('//*[@id="main"]/div[3]/ul/li/a/@href')
    # follow each link to get the full-size image address and picture name
    for href in url_list:
        new_url = "https://pic.netbian.com" + href
        # send the request again
        page = requests.get(url=new_url, headers=header).text
        # xpath parsing again
        new_tree = etree.HTML(page)
        src = "https://pic.netbian.com" + new_tree.xpath('//*[@id="img"]/img/@src')[0]
        title = new_tree.xpath('//*[@id="img"]/img/@title')[0].split(" ")[0]
        # fix the text encoding of the title
        title = title.encode("iso-8859-1").decode("gbk")
        # download and save
        save_photo(src, title)
    end = time.time()
    print(end - start)
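A note on the title.encode("iso-8859-1").decode("gbk") line: when a server's Content-Type header declares no charset, requests typically falls back to ISO-8859-1 for .text, while this site actually serves GBK, so the title bytes must be re-encoded and decoded correctly. A simpler alternative (a sketch under that same assumption about the site's charset) is to set the encoding on the response before reading .text:

import requests

page = requests.get("https://pic.netbian.com/4kmeinv/")
page.encoding = "gbk"  # declare the site's real charset before touching .text
text = page.text       # titles now decode correctly, with no per-string fix-up needed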

Let's see how long the synchronous crawler takes.

Then let's see how long an asynchronous crawler using a thread pool takes to crawl the same images.

V. Crawling the 4K images with an asynchronous thread-pool crawler

import os
import time

import requests
from lxml import etree
from multiprocessing.dummy import Pool  # thread-based Pool

# folder where the images are saved (the author's original path, translated)
SAVE_DIR = r"C:\Users\ASUS\Desktop\CSDN\high performance asynchronous crawler\thread pool\asynchronous crawler 4K beauty image"

def save_photo(src_title):
    # UA spoofing
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}
    # unpack the (url, title) pair; pool.map passes one argument per task
    url = src_title[0]
    title = src_title[1]
    # send the request for the image bytes
    photo = requests.get(url=url, headers=header).content
    # build the target path; skip files that were already downloaded
    path = os.path.join(SAVE_DIR, title + ".jpg")
    if not os.path.exists(path):
        with open(path, "wb") as fp:
            print(title, "start downloading!")
            fp.write(photo)
            print(title, "download complete!")

if __name__ == '__main__':
    start = time.time()
    # create the folder
    if not os.path.exists(SAVE_DIR):
        os.mkdir(SAVE_DIR)
    # UA spoofing
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}
    # target url
    url = "https://pic.netbian.com/4kmeinv/"
    # send the request and get the page source
    page = requests.get(url=url, headers=header).text
    # xpath parsing: get the list of detail-page links for the pictures
    tree = etree.HTML(page)
    url_list = tree.xpath('//*[@id="main"]/div[3]/ul/li/a/@href')
    # lists holding the final image URLs and titles
    src_list = []
    title_list = []
    # follow each link to get the full-size image address and picture name
    for href in url_list:
        new_url = "https://pic.netbian.com" + href
        # send the request again
        page = requests.get(url=new_url, headers=header).text
        # xpath parsing again
        new_tree = etree.HTML(page)
        src = "https://pic.netbian.com" + new_tree.xpath('//*[@id="img"]/img/@src')[0]
        src_list.append(src)
        title = new_tree.xpath('//*[@id="img"]/img/@title')[0].split(" ")[0]
        # fix the text encoding of the title
        title = title.encode("iso-8859-1").decode("gbk")
        title_list.append(title)
    # download and save, this time using the thread pool
    pool = Pool()
    src_title = zip(src_list, title_list)
    pool.map(save_photo, list(src_title))
    end = time.time()
    print(end - start)

This ends our study of how to use a thread pool in a Python crawler; I hope it has resolved your doubts. Pairing theory with practice is the best way to learn, so go and try it yourself! If you want to keep learning more on the subject, please continue to follow the site; the editor will keep working to bring you more practical articles!
