Case code analysis of Python asynchronous crawler

2025-01-18 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/01

This article walks through "Python asynchronous crawler case code analysis" using worked examples. The approach is simple, fast, and practical, and should help you solve this kind of problem.

1. Background

By default, a request made with get blocks until the response arrives, so the program spends most of its time waiting. Crawling many URLs is slow because each request must finish before the next one can start. A natural idea is to use an asynchronous mechanism to speed up the crawler. One way is to build a thread pool or process pool, so that several threads or processes handle several requests and the others can keep working while one is blocked. The serial version below simulates the blocking request with time.sleep:

    import time

    def getPage(url):
        print("start crawling website", url)
        time.sleep(2)  # blocking call that simulates a slow request
        print("crawl complete!!", url)

    urls = ['url1', 'url2', 'url3', 'url4', 'url5']

    beginTime = time.time()  # start timing
    for url in urls:
        getPage(url)  # each call must finish before the next one starts
    endTime = time.time()  # end timing
    print("finish %d" % (endTime - beginTime))

The sections below simulate crawling websites to illustrate multithreading, multiprocessing, and coroutines.

2. Multithreaded implementation

    import time
    # use a thread pool object
    from multiprocessing.dummy import Pool

    def getPage(url):
        print("start crawling website", url)
        time.sleep(2)  # blocking
        print("crawl complete!!", url)

    urls = ['url1', 'url2', 'url3', 'url4', 'url5']

    beginTime = time.time()  # start timing
    # instantiate a pool with 5 worker threads
    pool = Pool(5)
    # urls is an iterable; each element is passed to getPage
    pool.map(getPage, urls)
    endTime = time.time()  # end timing
    print("completion %d" % (endTime - beginTime))

The five simulated requests now complete in about 2 seconds instead of 10.

Rule of thumb for thread pools: use them for time-consuming, blocking operations.
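The same thread-pool idea can also be sketched with the standard library's concurrent.futures module. This is only an illustrative variant, not the article's code: the names get_page and crawl_all, the delay parameter, and the returned strings are assumptions, and the sleep is shortened so the example runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def get_page(url, delay=0.2):
    """Simulate a blocking download; `delay` stands in for network latency."""
    time.sleep(delay)  # blocking call, like time.sleep(2) in the article
    return f"content of {url}"


def crawl_all(urls, workers=5):
    """Run get_page over urls on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(get_page, urls))


if __name__ == "__main__":
    urls = ["url1", "url2", "url3", "url4", "url5"]
    begin = time.perf_counter()
    pages = crawl_all(urls)
    elapsed = time.perf_counter() - begin
    print(pages)
    print("elapsed %.2f s" % elapsed)
```

Because every URL gets its own worker thread here, the total time stays close to one delay rather than the sum of all delays.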

3. Coroutine implementation

    import time
    import asyncio

    # define a coroutine function (functions in Python are objects too)
    async def getPage(url):
        print("start crawling website", url)
        time.sleep(2)  # blocking
        print("crawling complete!!", url)

    # calling a function declared with async returns a coroutine object
    c = getPage(11)

    # create the event loop object
    # (recent Python versions deprecate this call when no loop is running
    # and favor asyncio.run as the entry point)
    loop_event = asyncio.get_event_loop()
    # register the coroutine and start the loop
    loop_event.run_until_complete(c)

    # Using a task object, which wraps the coroutine object c
    '''
    loop_event = asyncio.get_event_loop()
    task = loop_event.create_task(c)
    loop_event.run_until_complete(task)
    '''

    # Using a future object, which also wraps c; usage is similar to task
    '''
    loop_event = asyncio.get_event_loop()
    task = asyncio.ensure_future(c)
    loop_event.run_until_complete(task)
    '''

    # Binding a callback
    async def getPage2(url):
        print("start crawling website", url)
        time.sleep(2)  # blocking
        print("crawl complete!!", url)
        return url

    # calling a function declared with async returns a coroutine object
    c2 = getPage2(2)

    def callback_func(task):
        # task.result() is the return value of the coroutine wrapped in the task
        print(task.result())

    loop_event = asyncio.get_event_loop()
    task = asyncio.ensure_future(c2)
    task.add_done_callback(callback_func)  # this line does the binding
    loop_event.run_until_complete(task)

4. Multitasking implementation

    import time
    import asyncio

    urls = ['url1', 'url2', 'url3', 'url4', 'url5']

    # define a coroutine function
    async def getPage(url):
        print("start crawling the website", url)
        # Synchronous code inside a coroutine defeats the concurrency:
        # time.sleep(2) would block the whole event loop
        await asyncio.sleep(2)  # a waiting step must be awaited so the coroutine is suspended
        print("crawl complete!!", url)
        return url

    beginTime = time.time()
    # task list holding multiple tasks
    tasks = []
    for url in urls:
        c = getPage(url)
        task = asyncio.ensure_future(c)  # create a task object
        tasks.append(task)
    loop = asyncio.get_event_loop()
    # the task list cannot be passed in directly; wrap it in asyncio.wait()
    loop.run_until_complete(asyncio.wait(tasks))
    endTime = time.time()
    print("completion time %d" % (endTime - beginTime))
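On newer Python versions, the same multitasking pattern is commonly written with asyncio.run and asyncio.gather instead of managing the event loop by hand. The following is a minimal sketch under that assumption; get_page, crawl_all, and the shortened delay are illustrative names and values, not part of the article's code.

```python
import asyncio
import time


async def get_page(url, delay=0.2):
    """Simulated non-blocking download (stand-in for a real request)."""
    await asyncio.sleep(delay)  # suspends this coroutine; others keep running
    return url


async def crawl_all(urls):
    # gather runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(get_page(u) for u in urls))


if __name__ == "__main__":
    urls = ["url1", "url2", "url3", "url4", "url5"]
    begin = time.perf_counter()
    results = asyncio.run(crawl_all(urls))  # creates and closes the loop itself
    print(results)
    print("elapsed %.2f s" % (time.perf_counter() - begin))
```

Unlike asyncio.wait, gather hands back the return values directly, which is convenient when each task returns page content.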

Calling time.sleep(2) inside getPage instead of awaiting asyncio.sleep(2) blocks the event loop, so the five tasks still take about 10 seconds in total.

The same applies to a real crawler: a call such as requests.get(url) inside getPage() is synchronous, so it blocks in the same way. Use the asynchronous network request module aiohttp instead.
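The difference can be measured directly. The sketch below is an illustration with assumed names (blocking_page, cooperative_page, timed) and a shortened 0.1-second delay; it runs the same five tasks once with time.sleep and once with asyncio.sleep and reports the elapsed time for each.

```python
import asyncio
import time

DELAY = 0.1  # short stand-in for the article's 2-second wait


async def blocking_page(url):
    time.sleep(DELAY)  # blocks the whole event loop, so tasks run one by one
    return url


async def cooperative_page(url):
    await asyncio.sleep(DELAY)  # suspends only this coroutine
    return url


async def timed(worker, urls):
    """Return the seconds needed to run `worker` over all urls as concurrent tasks."""
    begin = time.perf_counter()
    await asyncio.gather(*(worker(u) for u in urls))
    return time.perf_counter() - begin


if __name__ == "__main__":
    urls = ["url1", "url2", "url3", "url4", "url5"]
    print("time.sleep:    %.2f s" % asyncio.run(timed(blocking_page, urls)))
    print("asyncio.sleep: %.2f s" % asyncio.run(timed(cooperative_page, urls)))
```

The first figure should be close to five delays added together and the second close to a single delay.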

Refer to the following code:

    import aiohttp

    # define a coroutine function
    async def getPage(url):
        print("start crawling website", url)
        # Synchronous code inside a coroutine defeats the concurrency:
        # requests.get(url) would block
        async with aiohttp.ClientSession() as session:
            # the coroutine is suspended while waiting for the response
            async with session.get(url) as response:
                # text() returns a string and read() returns binary data;
                # note that the body is not read through .content as in requests
                page_text = await response.text()
                print("crawl complete!!", url)
                return page_text
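When a synchronous client has to stay in the code, a different technique from the aiohttp approach above is to hand the blocking call to a worker thread with asyncio.to_thread (available from Python 3.9). The sketch below only illustrates that alternative: blocking_fetch is a hypothetical function in which time.sleep stands in for a call such as requests.get(url), and the other names are assumptions as well.

```python
import asyncio
import time


def blocking_fetch(url, delay=0.2):
    """Hypothetical synchronous fetch; time.sleep stands in for requests.get(url)."""
    time.sleep(delay)
    return f"page for {url}"


async def get_page(url):
    # Run the blocking call in a worker thread so the event loop stays free
    return await asyncio.to_thread(blocking_fetch, url)


async def crawl_all(urls):
    # Each get_page waits on its own thread; gather keeps the input order
    return await asyncio.gather(*(get_page(u) for u in urls))


if __name__ == "__main__":
    urls = ["url1", "url2", "url3", "url4", "url5"]
    begin = time.perf_counter()
    pages = asyncio.run(crawl_all(urls))
    print(pages)
    print("elapsed %.2f s" % (time.perf_counter() - begin))
```

This keeps the coroutine structure but still pays for one thread per in-flight request, so a native asynchronous client such as aiohttp remains the lighter option for large crawls.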
