Shulou (Shulou.com) report, 06/01. Updated 2025-10-25. From: SLTechnology News&Howtos.
This article introduces how to use a thread pool in a Python crawler. Many people have questions about this in day-to-day work, so the editor consulted various sources and put together a simple, easy-to-use approach. I hope it helps answer those questions. Let's get started.
I. Preface
By now we have covered the basics of crawlers. Absent unusual anti-crawler mechanisms, most data can be crawled given enough time for analysis. The next concern is crawling efficiency, which means making the crawler asynchronous. There are two approaches: a thread pool (multithreading within a single process) and coroutines (if I remember correctly, the coroutine module I use was introduced in Python 3.4; I was using Python 3.9 when I wrote this post).
Today we will first talk about thread pools.
II. Synchronous code demonstration
Let's first write the code in the ordinary synchronous form.
    import time


    def func(url):
        print("downloading:", url)
        time.sleep(2)  # simulate a download that takes two seconds
        print("download completed:", url)


    if __name__ == '__main__':
        start = time.time()  # start time
        url_list = ["a", "b", "c"]
        for url in url_list:
            func(url)
        end = time.time()  # end time
        print(end - start)
As expected, the run takes about six seconds: three tasks of two seconds each, executed one after another.
III. Asynchronous version: thread pool code
So what if we use a thread pool to run the above code?
    import time
    # Note: multiprocessing.Pool runs its workers as separate processes.
    # multiprocessing.dummy.Pool has the same interface but uses threads.
    from multiprocessing import Pool


    def func(url):
        print("downloading:", url)
        time.sleep(2)  # simulate a download that takes two seconds
        print("download completed:", url)


    if __name__ == '__main__':
        start = time.time()  # start time
        url_list = ["a", "b", "c"]
        # Instantiate a pool object and set its upper limit to the list length.
        # Setting an upper limit is optional.
        pool = Pool(len(url_list))
        pool.map(func, url_list)
        end = time.time()  # end time
        print(end - start)
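This time the run takes only 2 to 3 seconds. A thread pool can be understood simply as doing several tasks at the same time. Strictly speaking, multiprocessing.Pool runs its workers as separate processes; multiprocessing.dummy.Pool offers the same interface backed by threads.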
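The same experiment can be run on real threads. The following is a minimal sketch, assuming the standard-library multiprocessing.dummy.Pool (a thread-backed class with the same map interface). The helper name timed_run and the shortened 0.2-second delay are illustrative choices, not part of the original code.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed pool, same API as multiprocessing.Pool


def func(url, delay=0.2):
    """Pretend to download `url` by sleeping for `delay` seconds."""
    print("downloading:", url)
    time.sleep(delay)
    print("download completed:", url)
    return url


def timed_run(url_list):
    """Run func over url_list in a thread pool and return the elapsed seconds."""
    start = time.time()
    with Pool(len(url_list)) as pool:  # one worker thread per task
        pool.map(func, url_list)
    return time.time() - start


if __name__ == "__main__":
    elapsed = timed_run(["a", "b", "c"])
    print(f"elapsed: {elapsed:.2f}s")
```

Because the three sleeps overlap, the elapsed time should be close to one delay and not the sum of all three.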
Note:
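1. I am using PyCharm. If you run the code in VS or in the IDLE that comes with Python, you may see only the final elapsed-time output at run time (probably because the other lines are printed by the worker processes, whose output those environments may not display).
2. The output may not appear in a, b, c order.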
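The second note can be illustrated with a small sketch. It assumes the thread-backed multiprocessing.dummy.Pool and gives each task a different, made-up delay so that the finish order differs from the input order. The names finished, finished_lock and run are hypothetical helpers for this illustration only.

```python
import threading
import time
from multiprocessing.dummy import Pool  # thread-backed pool

finished = []                      # order in which tasks actually finish
finished_lock = threading.Lock()   # protects `finished` across threads


def func(task):
    """task is a (name, delay) pair: sleep, record completion, return the upper-cased name."""
    name, delay = task
    time.sleep(delay)
    with finished_lock:
        finished.append(name)
    return name.upper()


def run(tasks):
    """Map func over tasks with one worker thread per task."""
    with Pool(len(tasks)) as pool:
        return pool.map(func, tasks)


if __name__ == "__main__":
    results = run([("a", 0.3), ("b", 0.2), ("c", 0.1)])
    print("finish order:", finished)  # depends on thread scheduling
    print("map results: ", results)   # follows the input order
```

The printed progress lines can interleave in any order, but the list returned by map keeps the order of the input list, so results can still be matched to their inputs.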
IV. Crawling pictures with a synchronous crawler
Because our focus is on how much a thread pool improves crawling efficiency, we simply crawl the pictures on a single page.
    import os
    import time

    import requests
    from lxml import etree

    # UA camouflage
    HEADER = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
    }
    # raw string, so the backslashes are not read as escape sequences
    SAVE_DIR = r"C:\Users\ASUS\Desktop\CSDN\high performance asynchronous crawler\thread pool\synchronous crawler 4K beauty image"


    def save_photo(url, title):
        # send request
        photo = requests.get(url=url, headers=HEADER).content
        # build the path; skip pictures that were already downloaded
        path = os.path.join(SAVE_DIR, title + ".jpg")
        if not os.path.exists(path):
            with open(path, "wb") as fp:
                print(title, "start downloading!")
                fp.write(photo)
                print(title, "download complete!")


    if __name__ == '__main__':
        start = time.time()
        # create folder
        os.makedirs(SAVE_DIR, exist_ok=True)
        # specify url
        url = "https://pic.netbian.com/4kmeinv/"
        # send the request and get the page source
        page = requests.get(url=url, headers=HEADER).text
        # xpath parsing: get the list of download-page addresses
        tree = etree.HTML(page)
        url_list = tree.xpath('//*[@id="main"]/div[3]/ul/li/a/@href')
        # get the address and name of each high-definition picture
        for href in url_list:
            new_url = "https://pic.netbian.com" + href
            # send a request again
            page = requests.get(url=new_url, headers=HEADER).text
            # xpath parsing again
            new_tree = etree.HTML(page)
            src = "https://pic.netbian.com" + new_tree.xpath('//*[@id="img"]/img/@src')[0]
            title = new_tree.xpath('//*[@id="img"]/img/@title')[0].split(" ")[0]
            # repair the garbled title text
            title = title.encode("iso-8859-1").decode("gbk")
            # download and save
            save_photo(src, title)
        end = time.time()
        print(end - start)
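The encode("iso-8859-1").decode("gbk") step in the code above repairs a title that was decoded with the wrong character set. A minimal offline sketch of why this round trip works, assuming the page bytes are GBK and were first decoded as ISO-8859-1. The sample title and the helper name fix_title are hypothetical.

```python
def fix_title(garbled):
    """Undo a wrong ISO-8859-1 decode of GBK-encoded bytes."""
    return garbled.encode("iso-8859-1").decode("gbk")


if __name__ == "__main__":
    original = "高清壁纸"                      # hypothetical picture title
    raw_bytes = original.encode("gbk")         # bytes as the server would send them
    garbled = raw_bytes.decode("iso-8859-1")   # result of decoding with the wrong charset
    print(repr(garbled))
    print(fix_title(garbled))
```

ISO-8859-1 maps every byte to exactly one character, so re-encoding the garbled string recovers the original bytes without loss. An alternative is to set the response's encoding attribute to "gbk" before reading its text.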
Let's see how long it takes for synchronous crawlers.
Then let's see how long it takes for an asynchronous crawler using a thread pool to crawl these images.
V. Crawling the 4K beauty images with a thread-pool asynchronous crawler

    import os
    import time
    # Note: multiprocessing.Pool runs its workers as separate processes.
    # multiprocessing.dummy.Pool has the same interface but uses threads.
    from multiprocessing import Pool

    import requests
    from lxml import etree

    # UA camouflage
    HEADER = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
    }
    # raw string, so the backslashes are not read as escape sequences
    SAVE_DIR = r"C:\Users\ASUS\Desktop\CSDN\high performance asynchronous crawler\thread pool\asynchronous crawler 4K beauty image"


    def save_photo(src_title):
        # unpack the (url, title) pair and send the request
        url = src_title[0]
        title = src_title[1]
        photo = requests.get(url=url, headers=HEADER).content
        # build the path; skip pictures that were already downloaded
        path = os.path.join(SAVE_DIR, title + ".jpg")
        if not os.path.exists(path):
            with open(path, "wb") as fp:
                print(title, "start downloading!")
                fp.write(photo)
                print(title, "download complete!")


    if __name__ == '__main__':
        start = time.time()
        # create folder
        os.makedirs(SAVE_DIR, exist_ok=True)
        # specify url
        url = "https://pic.netbian.com/4kmeinv/"
        # send the request and get the page source
        page = requests.get(url=url, headers=HEADER).text
        # xpath parsing: get the list of download-page addresses
        tree = etree.HTML(page)
        url_list = tree.xpath('//*[@id="main"]/div[3]/ul/li/a/@href')
        # lists that store the final picture URLs and titles
        src_list = []
        title_list = []
        # get the address and name of each high-definition picture
        for href in url_list:
            new_url = "https://pic.netbian.com" + href
            # send a request again
            page = requests.get(url=new_url, headers=HEADER).text
            # xpath parsing again
            new_tree = etree.HTML(page)
            src = "https://pic.netbian.com" + new_tree.xpath('//*[@id="img"]/img/@src')[0]
            src_list.append(src)
            title = new_tree.xpath('//*[@id="img"]/img/@title')[0].split(" ")[0]
            # repair the garbled title text
            title = title.encode("iso-8859-1").decode("gbk")
            title_list.append(title)
        # download and save, using the pool
        pool = Pool()
        src_title = zip(src_list, title_list)
        pool.map(save_photo, list(src_title))
        end = time.time()
        print(end - start)

That concludes this look at how to use a thread pool in a Python crawler; I hope it resolves your doubts. Pairing theory with practice helps you learn, so go and try it!
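As a supplement to the download step above, here is a minimal sketch of the same idea on real threads, assuming the standard-library concurrent.futures.ThreadPoolExecutor in place of the Pool class the article uses. To keep it runnable without network access, the fetch function is passed in as a parameter. The names download_all and fake_fetch, the example.com URLs, and the max_workers value are all hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def save_photo(src_title, fetch, save_dir):
    """Save one (url, title) pair unless the file already exists; return its path."""
    url, title = src_title
    path = os.path.join(save_dir, title + ".jpg")
    if not os.path.exists(path):  # skip pictures that were already downloaded
        with open(path, "wb") as fp:
            fp.write(fetch(url))
    return path


def download_all(pairs, fetch, save_dir, max_workers=8):
    """Download every (url, title) pair concurrently on worker threads."""
    os.makedirs(save_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the order of the inputs
        return list(executor.map(lambda pair: save_photo(pair, fetch, save_dir), pairs))


def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url, headers=...).content."""
    return ("bytes of " + url).encode("utf-8")


if __name__ == "__main__":
    pairs = [("https://example.com/1.jpg", "one"), ("https://example.com/2.jpg", "two")]
    with tempfile.TemporaryDirectory() as tmp:
        for path in download_all(pairs, fake_fetch, tmp):
            print(os.path.basename(path), os.path.getsize(path), "bytes")
```

In a real crawler, fetch would wrap the requests.get call. Unlike the article's save_photo, this sketch checks for an existing file before fetching, so a picture that is already on disk costs no request.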