This article looks at the performance differences between single-threaded, multithreaded, and coroutine-based Python crawlers. The method introduced here is simple, fast, and practical, so if you are interested, read on and try it yourself.
I. Preface
Today I'd like to share how to crawl product quotation data from the Zhongnongwang site (zhongnongwang.com), using an ordinary single-threaded crawler, a multithreaded crawler, and a coroutine-based crawler, so that we can compare how the three approaches perform on a web-crawling task.
We will crawl the product name, latest quotation, unit, number of quotations, and quotation time, and save the data to a local Excel file.
II. Crawl test
Turning the pages reveals the pattern of how the URL changes:
https://www.zhongnongwang.com/quote/product-htm-page-1.html
https://www.zhongnongwang.com/quote/product-htm-page-2.html
https://www.zhongnongwang.com/quote/product-htm-page-3.html
https://www.zhongnongwang.com/quote/product-htm-page-4.html
https://www.zhongnongwang.com/quote/product-htm-page-5.html
https://www.zhongnongwang.com/quote/product-htm-page-6.html
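Only the page number changes, so, as a minimal sketch, the URLs for all 50 pages can be generated with an f-string:

urls = [
    f'https://www.zhongnongwang.com/quote/product-htm-page-{page}.html'
    for page in range(1, 51)
]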
Inspecting the web page shows that its structure is simple, so the data is easy to parse and extract.
Idea: each product's quotation information sits in a tr tag inside the tbody of the table whose class is tb. Get all the tr tags, then traverse them and extract each product's name, latest quotation, unit, number of quotations, and quotation time.
# -*- coding: UTF-8 -*-
"""
@File   : demo.py
@Author : Ye Tingyun
@CSDN   : https://yetingyun.blog.csdn.net/
"""
import requests
import logging
from fake_useragent import UserAgent
from lxml import etree

# Basic configuration of log output
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
# Randomly generate request headers
ua = UserAgent(verify_ssl=False, path='fake_useragent.json')

url = 'https://www.zhongnongwang.com/quote/product-htm-page-1.html'
# Disguise the request headers
headers = {
    "Accept-Encoding": "gzip",   # use gzip compression to transfer data for faster access
    "User-Agent": ua.random
}
# Send a request and get the response
rep = requests.get(url, headers=headers)
print(rep.status_code)    # 200

# Locate and extract data with XPath
html = etree.HTML(rep.text)
items = html.xpath('/html/body/div[10]/table/tr[@align="center"]')
logging.info(f'Number of items on this page: {len(items)}')   # 20 items per page

# Traverse the extracted data
for item in items:
    name = ''.join(item.xpath('.//td[1]/a/text()'))    # product name
    price = ''.join(item.xpath('.//td[3]/text()'))     # latest quotation
    unit = ''.join(item.xpath('.//td[4]/text()'))      # unit
    nums = ''.join(item.xpath('.//td[5]/text()'))      # number of quotations
    time_ = ''.join(item.xpath('.//td[6]/text()'))     # quotation time
    logging.info([name, price, unit, nums, time_])
The running results are as follows:
The data can be crawled successfully. Next, we use an ordinary single-threaded crawler, a multithreaded crawler, and a coroutine-based crawler to crawl 50 pages of data and save them to Excel.
III. Single-threaded crawler
# -*- coding: UTF-8 -*-
"""
@File   : single_threaded.py
@Author : Ye Tingyun
@CSDN   : https://yetingyun.blog.csdn.net/
"""
import requests
import logging
from fake_useragent import UserAgent
from lxml import etree
import openpyxl
from datetime import datetime

# Basic configuration of log output
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
# Randomly generate request headers
ua = UserAgent(verify_ssl=False, path='fake_useragent.json')

wb = openpyxl.Workbook()
sheet = wb.active
sheet.append(['product name', 'latest quotation', 'unit', 'number of quotations', 'quotation time'])
start = datetime.now()

for page in range(1, 51):
    # Construct the URL
    url = f'https://www.zhongnongwang.com/quote/product-htm-page-{page}.html'
    # Disguise the request headers
    headers = {
        "Accept-Encoding": "gzip",   # use gzip compression to transfer data for faster access
        "User-Agent": ua.random
    }
    # Send a request and get the response
    rep = requests.get(url, headers=headers)
    # print(rep.status_code)

    # Locate and extract data with XPath
    html = etree.HTML(rep.text)
    items = html.xpath('/html/body/div[10]/table/tr[@align="center"]')
    logging.info(f'Number of items on this page: {len(items)}')   # 20 items per page

    # Traverse the extracted data
    for item in items:
        name = ''.join(item.xpath('.//td[1]/a/text()'))    # product name
        price = ''.join(item.xpath('.//td[3]/text()'))     # latest quotation
        unit = ''.join(item.xpath('.//td[4]/text()'))      # unit
        nums = ''.join(item.xpath('.//td[5]/text()'))      # number of quotations
        time_ = ''.join(item.xpath('.//td[6]/text()'))     # quotation time
        sheet.append([name, price, unit, nums, time_])
        logging.info([name, price, unit, nums, time_])

wb.save(filename='data1.xlsx')
delta = (datetime.now() - start).total_seconds()
logging.info(f'Time used: {delta}s')
The running results are as follows:
A single-threaded crawler must finish crawling one page before it can move on to the next, and it is also affected by the network conditions at the time. It takes 48.528703s to crawl the data, which is quite slow.
IV. Multithreaded crawler
# -*- coding: UTF-8 -*-
"""
@File   : multithreaded.py
@Author : Ye Tingyun
@CSDN   : https://yetingyun.blog.csdn.net/
"""
import requests
import logging
from fake_useragent import UserAgent
from lxml import etree
import openpyxl
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED
from datetime import datetime

# Basic configuration of log output
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
# Randomly generate request headers
ua = UserAgent(verify_ssl=False, path='fake_useragent.json')

wb = openpyxl.Workbook()
sheet = wb.active
sheet.append(['product name', 'latest quotation', 'unit', 'number of quotations', 'quotation time'])
start = datetime.now()


def get_data(page):
    # Construct the URL
    url = f'https://www.zhongnongwang.com/quote/product-htm-page-{page}.html'
    # Disguise the request headers
    headers = {
        "Accept-Encoding": "gzip",   # use gzip compression to transfer data for faster access
        "User-Agent": ua.random
    }
    # Send a request and get the response
    rep = requests.get(url, headers=headers)
    # print(rep.status_code)

    # Locate and extract data with XPath
    html = etree.HTML(rep.text)
    items = html.xpath('/html/body/div[10]/table/tr[@align="center"]')
    logging.info(f'Number of items on this page: {len(items)}')   # 20 items per page

    # Traverse the extracted data
    for item in items:
        name = ''.join(item.xpath('.//td[1]/a/text()'))    # product name
        price = ''.join(item.xpath('.//td[3]/text()'))     # latest quotation
        unit = ''.join(item.xpath('.//td[4]/text()'))      # unit
        nums = ''.join(item.xpath('.//td[5]/text()'))      # number of quotations
        time_ = ''.join(item.xpath('.//td[6]/text()'))     # quotation time
        sheet.append([name, price, unit, nums, time_])
        logging.info([name, price, unit, nums, time_])


def run():
    # Crawl pages 1-50
    with ThreadPoolExecutor(max_workers=6) as executor:
        future_tasks = [executor.submit(get_data, i) for i in range(1, 51)]
        wait(future_tasks, return_when=ALL_COMPLETED)
    wb.save(filename='data2.xlsx')
    delta = (datetime.now() - start).total_seconds()
    print(f'Time used: {delta}s')


run()
The running results are as follows:
The crawling efficiency of the multithreaded crawler is greatly improved: it takes only 2.648128s, which is very fast.
V. Asynchronous coroutine crawler
# -*- coding: UTF-8 -*-
"""
@File   : demo1.py
@Author : Ye Tingyun
@CSDN   : https://yetingyun.blog.csdn.net/
"""
import aiohttp
import asyncio
import logging
from fake_useragent import UserAgent
from lxml import etree
import openpyxl
from datetime import datetime

# Basic configuration of log output
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
# Randomly generate request headers
ua = UserAgent(verify_ssl=False, path='fake_useragent.json')

wb = openpyxl.Workbook()
sheet = wb.active
sheet.append(['product name', 'latest quotation', 'unit', 'number of quotations', 'quotation time'])
start = datetime.now()


class Spider(object):
    def __init__(self):
        # self.semaphore = asyncio.Semaphore(6)  # semaphore; sometimes needed to limit the number of coroutines and avoid being blocked
        self.header = {
            "Accept-Encoding": "gzip",   # use gzip compression to transfer data for faster access
            "User-Agent": ua.random
        }

    async def scrape(self, url):
        # async with self.semaphore:  # set the maximum semaphore; sometimes needed to limit the number of coroutines
        session = aiohttp.ClientSession(headers=self.header, connector=aiohttp.TCPConnector(ssl=False))
        response = await session.get(url)
        result = await response.text()
        await session.close()
        return result

    async def scrape_index(self, page):
        url = f'https://www.zhongnongwang.com/quote/product-htm-page-{page}.html'
        text = await self.scrape(url)
        await self.parse(text)

    async def parse(self, text):
        # Locate and extract data with XPath
        html = etree.HTML(text)
        items = html.xpath('/html/body/div[10]/table/tr[@align="center"]')
        logging.info(f'Number of items on this page: {len(items)}')   # 20 items per page

        # Traverse the extracted data
        for item in items:
            name = ''.join(item.xpath('.//td[1]/a/text()'))    # product name
            price = ''.join(item.xpath('.//td[3]/text()'))     # latest quotation
            unit = ''.join(item.xpath('.//td[4]/text()'))      # unit
            nums = ''.join(item.xpath('.//td[5]/text()'))      # number of quotations
            time_ = ''.join(item.xpath('.//td[6]/text()'))     # quotation time
            sheet.append([name, price, unit, nums, time_])
            logging.info([name, price, unit, nums, time_])

    def main(self):
        # Crawl 50 pages of data
        scrape_index_tasks = [asyncio.ensure_future(self.scrape_index(page)) for page in range(1, 51)]
        loop = asyncio.get_event_loop()
        tasks = asyncio.gather(*scrape_index_tasks)
        loop.run_until_complete(tasks)


if __name__ == '__main__':
    spider = Spider()
    spider.main()
    wb.save('data3.xlsx')
    delta = (datetime.now() - start).total_seconds()
    print("Time used: {:.3f}s".format(delta))
The running results are as follows:
The asynchronous coroutine crawler is even faster: crawling 50 pages of data takes only 0.930s. aiohttp + asyncio is remarkably fast. Provided the server can withstand high concurrency, the asynchronous crawler can raise the number of concurrent requests and greatly improve crawling efficiency, outperforming even the multithreaded crawler.
All three crawlers fetch the same 50 pages of data and save them locally; the results are as follows:
VI. Summary and review
Today I demonstrated a simple single-threaded crawler, a multithreaded crawler, and an asynchronous coroutine crawler. In general, the asynchronous crawler is the fastest, the multithreaded crawler is slightly slower, and the single-threaded crawler is the slowest, since it must finish one page before it can continue to the next.
However, the asynchronous coroutine crawler is not as easy to write: you cannot fetch pages with the requests library and have to use aiohttp instead. And when a large amount of data is crawled, the asynchronous crawler needs a semaphore to cap the number of coroutines; otherwise it fetches so fast that it triggers anti-crawling measures. So in practice, Python crawlers are usually sped up with multithreading, but note that websites impose IP access-frequency limits and crawling too fast may get your IP blocked, which is why proxy IPs are commonly combined with multithreaded crawling.
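As a minimal sketch of both points (separate from the article's code, with a placeholder proxy address that you would need to replace): an asyncio.Semaphore caps the number of in-flight requests, and aiohttp's proxy parameter routes each request through a proxy.

import asyncio
import aiohttp

PROXY = 'http://127.0.0.1:7890'   # hypothetical proxy address, for illustration only


async def fetch(session, semaphore, url):
    async with semaphore:                               # extra coroutines wait here until a slot frees up
        async with session.get(url, proxy=PROXY) as response:
            return await response.text()


async def main():
    semaphore = asyncio.Semaphore(6)                    # at most 6 requests in flight at a time
    urls = [f'https://www.zhongnongwang.com/quote/product-htm-page-{p}.html'
            for p in range(1, 51)]
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, semaphore, u) for u in urls))
    print(f'fetched {len(pages)} pages')


asyncio.run(main())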
Multithreading: the technique of executing multiple threads concurrently, supported in software or hardware. A computer with multithreading capability can run more than one thread at the same time thanks to hardware support, improving overall processing performance. Systems with this capability include symmetric multiprocessors, multi-core processors, and chip-level or simultaneous multithreading processors. In a program, these independently running fragments are called threads, and the programming model built on them is called multithreading.
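To make the idea concrete, here is a minimal, self-contained illustration (not part of the article's crawler): five I/O-bound tasks, simulated with time.sleep, run in overlapping threads, so the total wall time is close to one second rather than five.

import time
from concurrent.futures import ThreadPoolExecutor


def io_task(n):
    time.sleep(1)          # stands in for waiting on a network response
    return n


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(io_task, range(5)))    # the five 1-second waits overlap
print(results, f'{time.perf_counter() - start:.2f}s')   # roughly 1s instead of 5s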
Asynchronous: different program units can work toward a task without communicating or coordinating with each other; unrelated program units can be asynchronous. For example, when a crawler downloads a web page, the scheduler can move on to other tasks after invoking the downloader, without staying in touch with the download task to coordinate its behavior. Downloading and saving different web pages are unrelated operations that need no mutual notification or coordination, and the completion times of these asynchronous operations are not deterministic. In short, asynchrony means no fixed order.
A coroutine, also known as a micro-thread or fiber, is a lightweight user-mode thread. A coroutine has its own register context and stack. When it is switched out, the register context and stack are saved elsewhere; when it is switched back, the previously saved state is restored. A coroutine therefore retains the state of its last call (a particular combination of all its local state), and each re-entry resumes exactly where the previous call left off. Coroutines run essentially in a single process; compared with multiple processes, they avoid the overhead of thread context switching, atomic locking, and synchronization, and the programming model is very simple. Coroutines let us implement asynchronous operations: in a web crawler, for instance, after a request is sent we must wait some time for the response, but during that wait the program can do many other things and switch back only when the response arrives. This makes full use of the CPU and other resources, which is the advantage of coroutines.
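A minimal sketch of the idea (again, separate from the article's crawler): two coroutines suspend at their await points, so their waits overlap inside a single thread and the total time is about one second rather than two.

import asyncio
import time


async def worker(name, delay):
    print(f'{name} started')
    await asyncio.sleep(delay)   # suspends this coroutine and hands control back to the event loop
    print(f'{name} finished')


async def main():
    start = time.perf_counter()
    await asyncio.gather(worker('A', 1), worker('B', 1))
    print(f'total: {time.perf_counter() - start:.2f}s')   # about 1s, not 2s


asyncio.run(main())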
At this point, I believe you have a deeper understanding of the performance differences between single-threaded, multithreaded, and coroutine-based Python crawlers. Why not try it out in practice yourself?