How to Use PyCharm Profile to Analyze the Efficiency of an Asynchronous Crawler
This article introduces how to use PyCharm's profiler to analyze the efficiency of an asynchronous crawler, walking through a real example. The steps are simple, quick, and practical.
The first version of the code is below: an ordinary synchronous for-loop crawler.
import requests
import bs4
from colorama import Fore


def main():
    get_title_range()
    print("Done.")


def get_html(episode_number: int) -> str:
    print(Fore.YELLOW + f"Getting HTML for episode {episode_number}", flush=True)
    url = f'https://talkpython.fm/{episode_number}'
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.text


def get_title(html: str, episode_number: int) -> str:
    print(Fore.CYAN + f"Getting TITLE for episode {episode_number}", flush=True)
    soup = bs4.BeautifulSoup(html, 'html.parser')
    header = soup.select_one('h2')
    if not header:
        return "MISSING"
    return header.text.strip()


def get_title_range():
    # Please keep this range pretty small to not DDoS my site. ;)
    for n in range(185, 200):
        html = get_html(n)
        title = get_title(html, n)
        print(Fore.WHITE + f"Title found: {title}", flush=True)


if __name__ == '__main__':
    main()
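Before profiling, it helps to confirm the headline number yourself. The snippet below is a minimal sketch of how the elapsed time can be measured; the perf_counter wrapper is my addition, not part of the original script:

import time

t0 = time.perf_counter()
main()                      # run the synchronous crawler defined above
print(f"Elapsed: {time.perf_counter() - t0:.1f} s")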
This code took 37 seconds to run. Next, we use PyCharm's profiler tool to see exactly where the time goes.
In PyCharm, click Profile '(file name)'.
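PyCharm's profiler is a Professional-edition feature. If you don't have it, a similar profile can be captured with the standard library's cProfile; the following is a minimal sketch of that alternative, not the article's method, and it assumes the script above is importable as a hypothetical module named crawler:

import cProfile
import pstats

import crawler  # hypothetical module name for the synchronous script above

cProfile.run('crawler.main()', 'crawler.pstats')   # run the crawler and record raw stats to a file

stats = pstats.Stats('crawler.pstats')
stats.sort_stats('cumulative').print_stats(10)     # ten most expensive calls by cumulative time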
After that, we get a detailed diagram of the function-call relationships and where the time is spent:
You can see that the get_html method takes up 96.7% of the total time. In other words, I/O accounts for roughly 97% of this program's runtime, during which it does nothing but wait for the HTML to arrive. If we can make it do something useful instead of idly waiting for the I/O to finish, we can save a lot of time.
With a little arithmetic, we can estimate how much time asynchronous fetching with asyncio could save.
The get_html method took 36.8 s across 15 calls, so fetching the HTML of one link takes about 36.8 s / 15 ≈ 2.4 s. If all the requests run concurrently, fetching all 15 links should still take only about 2.4 s. Add roughly 0.6 s for the get_title calls, and we estimate the improved program can finish in about 3 seconds, a roughly twelvefold speedup (37 s / 3 s).
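The same estimate, written out as a quick calculation (the numbers are taken from the profiler output above):

total_io = 36.8                     # seconds spent in get_html, summed over all calls
calls = 15                          # number of get_html invocations
per_request = total_io / calls      # ~2.4 s to fetch a single page
parse_time = 0.6                    # measured time spent in get_title
estimate = per_request + parse_time
print(f"Estimated async runtime: {estimate:.1f} s")   # ~3.0 s
print(f"Expected speedup: {37 / estimate:.0f}x")      # ~12x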
Now take a look at the improved code.
import asyncio
from asyncio import AbstractEventLoop

import aiohttp
import requests  # no longer used directly; kept to show the replaced synchronous calls below
import bs4
from colorama import Fore


def main():
    # Create loop
    loop = asyncio.get_event_loop()
    loop.run_until_complete(get_title_range(loop))
    print("Done.")


async def get_html(episode_number: int) -> str:
    print(Fore.YELLOW + f"Getting HTML for episode {episode_number}", flush=True)

    # Make this async with aiohttp's ClientSession
    url = f'https://talkpython.fm/{episode_number}'
    # resp = await requests.get(url)
    # resp.raise_for_status()
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            resp.raise_for_status()
            html = await resp.text()
            return html


def get_title(html: str, episode_number: int) -> str:
    print(Fore.CYAN + f"Getting TITLE for episode {episode_number}", flush=True)
    soup = bs4.BeautifulSoup(html, 'html.parser')
    header = soup.select_one('h2')
    if not header:
        return "MISSING"
    return header.text.strip()


async def get_title_range(loop: AbstractEventLoop):
    # Please keep this range pretty small to not DDoS my site. ;)
    tasks = []
    for n in range(190, 200):
        tasks.append((loop.create_task(get_html(n)), n))

    for task, n in tasks:
        html = await task
        title = get_title(html, n)
        print(Fore.WHITE + f"Title found: {title}", flush=True)


if __name__ == '__main__':
    main()
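As an aside, asyncio.get_event_loop() is deprecated in recent Python releases. A minimal modernized sketch of the same idea, using asyncio.run plus asyncio.gather and assuming Python 3.8+ (my variant, not the article's code), might look like this:

import asyncio

import aiohttp
import bs4
from colorama import Fore


async def fetch_html(session: aiohttp.ClientSession, episode_number: int) -> str:
    # One GET per episode; the shared session lets aiohttp reuse connections.
    url = f'https://talkpython.fm/{episode_number}'
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.text()


async def main() -> None:
    episodes = range(190, 200)
    async with aiohttp.ClientSession() as session:
        # Schedule all downloads at once; gather preserves input order.
        pages = await asyncio.gather(*(fetch_html(session, n) for n in episodes))
    for n, html in zip(episodes, pages):
        header = bs4.BeautifulSoup(html, 'html.parser').select_one('h2')
        title = header.text.strip() if header else "MISSING"
        print(Fore.WHITE + f"Title found: {title}", flush=True)


asyncio.run(main())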
The same steps are taken to generate the profile diagram:
You can see it now takes about 3.8 s, roughly in line with our 3-second estimate; the small gap is plausibly connection-setup and scheduling overhead that the back-of-the-envelope calculation ignored.
That's all for "how to use PyCharm Profile to analyze the efficiency of an asynchronous crawler". Thank you for reading.