This article explains how to make a gevent crawler 100% faster with one line of code. Many people who use gevent are unsure how to get this speedup, so I have gathered the relevant material and organized it into a simple, practical walkthrough. I hope it answers your questions; let's get started.
Anyone who does network programming in Python has probably heard of gevent. Gevent is a third-party Python library built on top of the microthread library greenlet. It uses an epoll-based event-listening mechanism, which gives it good performance and makes it easier to use than raw greenlet. Gevent has the following features:
A fast event loop based on libev or libuv.
Lightweight execution units based on greenlet.
An API that reuses concepts from the Python standard library (for example, events and queues).
Cooperative sockets with SSL support.
Cooperative DNS queries performed through a thread pool, dnspython, or c-ares.
A monkey-patching utility that lets third-party modules become cooperative.
TCP/UDP/HTTP servers.
Subprocess support (via gevent.subprocess).
Thread pools.
To sum up, gevent's general principle is this: when a greenlet hits an operation that has to wait (mostly IO, such as network IO or a sleep), it automatically switches to another greenlet, then switches back to continue execution once that operation completes. Throughout this process only one thread is actually executing, but because we switch to other work while waiting on IO, we avoid useless waiting, which saves a lot of time and improves efficiency.
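As a minimal sketch of this switching behavior (two made-up tasks, not the crawler we build below), consider two greenlets that yield whenever they hit gevent.sleep():

import gevent

def task(name):
    for i in range(3):
        print('%s step %d' % (name, i))
        gevent.sleep(0.1)  # simulated IO wait; control switches to the other greenlet here

# the two greenlets make progress during each other's waits,
# so the total time is about 0.3s instead of 0.6s
gevent.joinall([gevent.spawn(task, 'A'), gevent.spawn(task, 'B')])

The output interleaves A and B steps, which is exactly the switching described above.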
After seeing so many advantages, I felt gevent was worth trying, but at first the results were disappointing: the speed barely improved. Only after studying gevent's usage carefully did I find that its efficiency comes with conditions, and one important condition is the use of the monkey patch.
A monkey patch changes or extends a program at runtime without modifying its source code, a technique mainly suited to dynamic languages. Gevent's monkey patch replaces most of the blocking system calls in the standard library, including those in socket, ssl, threading, and select, with cooperative versions.

Below I demonstrate the usage of the monkey patch, and the conditions under which it helps, with code. The program is a small crawler, short and easy to read and run, which also lets us measure the improvement the monkey patch brings. The idea is to scrape the films released in the North American market in the second quarter of this year from the Box Office Mojo website, extract each film's genre from its information page, save each film's name and genre in a dictionary, and time the whole process. We test the completion time in three cases: an ordinary crawler that does not use gevent, a crawler that uses gevent but not the monkey patch, and a crawler that uses both gevent and the monkey patch.
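One quick way to see what the patch actually does (a standalone check, not part of the crawler below) is to compare a standard-library object before and after patching:

import socket
print(socket.socket)   # the standard, blocking socket class

from gevent import monkey
monkey.patch_all()     # in real code this belongs at the very top, before other imports

print(socket.socket)   # now replaced by gevent's cooperative socket class

The second print shows a gevent class rather than the standard library's, which is why code that uses sockets (such as requests) suddenly becomes cooperative.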
First, let's look at the ordinary crawler that doesn't use gevent.
Import the required libraries first.
import time
import requests
from lxml import etree
Then read the page that lists the movies released in the second quarter.
url = r'https://www.boxofficemojo.com/quarter/q2/2020/?grossesOption=totalGrosses'  # URL of the page listing second-quarter releases
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}  # headers simulating a browser
rsp = requests.get(url, headers=headers)  # read the web page
text = rsp.text  # get the source code of the web page
html = etree.HTML(text)
movie_relative_urls = html.xpath(r'//td[@class="a-text-left mojo-field-type-release mojo-cell-wide"]/a/@href')  # get the relative address of each film's information page
movie_urls = [r'https://www.boxofficemojo.com' + u for u in movie_relative_urls]  # convert each relative address to an absolute address
genres_dict = {}  # dictionary for storing the results
The variable url in the code above is the address of the page listing the second-quarter releases; a screenshot of that page is shown in Figure 1. headers is the header information the crawler uses to simulate a browser. Each movie's information page is linked from the movie's name in the Release column of the table in Figure 1; clicking a name opens its page. Because these links are relative addresses, they must be converted to absolute addresses.
Figure 1. The page listing movies released in the second quarter
Next we read each movie's information page.
def spider(url):  # reads the genre information from one movie page
    rsp = requests.get(url, headers=headers)  # request the movie's page
    text = rsp.text  # get the page source
    html = etree.HTML(text)
    genre = html.xpath(r'//div/span[text()="Genres"]/following-sibling::span[1]/text()')[0]  # read the movie's genre
    title = html.xpath(r'//div/h2/text()')[0]  # read the movie's name
    genres_dict[title] = genre  # store the movie's name and genre in the dictionary
This function reads each movie's information page. It works much like the code above that reads the listing page, so there is not much to add. On each movie's page, the genre we want is on the Genres line; for example, for the movie The Wretched in Figure 2, the Genres entry is Horror.
Figure 2. Sample movie information page
Next is the time calculation.
normal_start = time.time()  # program start time
for u in movie_urls:
    spider(u)
normal_end = time.time()  # program end time
normal_elapse = normal_end - normal_start  # program run time
print('The normal procedure costs %s seconds' % normal_elapse)
We measure time with the time.time() method: the end time minus the start time is the program's run time. Here we are mainly timing the repeated calls to the spider function. The result: the process takes 59.6188 seconds.
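If you find yourself repeating this start/stop pattern, a small context manager (my own convenience helper, not part of the article's original code) tidies up the bookkeeping:

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    start = time.time()
    yield
    print('%s costs %s seconds' % (label, time.time() - start))

# usage:
# with timed('The normal procedure'):
#     for u in movie_urls:
#         spider(u)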
The second crawler is one that uses gevent but does not use monkey patch. The complete code is as follows.
import time
from lxml import etree
import gevent
import requests

url = r'https://www.boxofficemojo.com/quarter/q2/2020/?grossesOption=totalGrosses'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
rsp = requests.get(url, headers=headers)
text = rsp.text
html = etree.HTML(text)
movie_relative_urls = html.xpath(r'//td[@class="a-text-left mojo-field-type-release mojo-cell-wide"]/a/@href')
movie_urls = [r'https://www.boxofficemojo.com' + u for u in movie_relative_urls]
genres_dict = {}
task_list = []  # list used to store the greenlets

def spider(url):
    rsp = requests.get(url, headers=headers)
    text = rsp.text
    html = etree.HTML(text)
    genre = html.xpath(r'//div/span[text()="Genres"]/following-sibling::span[1]/text()')[0]
    title = html.xpath(r'//div/h2/text()')[0]
    genres_dict[title] = genre

gevent_start = time.time()
for u in movie_urls:
    task = gevent.spawn(spider, u)  # create a greenlet
    task_list.append(task)  # put it in the list
gevent.joinall(task_list)  # run all the greenlets
gevent_end = time.time()
gevent_elapse = gevent_end - gevent_start
print('The gevent spider costs %s seconds' % gevent_elapse)
Most of this code is the same as the previous crawler's; the main addition is the task_list variable, the list that holds the greenlets. Starting from the line gevent_start = time.time() (everything before it matches the previous crawler): task = gevent.spawn(spider, u) is how gevent creates a greenlet, task_list.append(task) puts each one into the list, and gevent.joinall(task_list) runs them all. The pattern is very similar to how we run multiple threads. The result: 59.1744 seconds.
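A detail worth knowing, though the code above does not use it: each spawned greenlet keeps its return value in its .value attribute, so the shared dictionary could be avoided. This is a hypothetical variant that assumes spider() is rewritten to return (title, genre) instead of writing into genres_dict:

# hypothetical variant, reusing movie_urls from above
tasks = [gevent.spawn(spider, u) for u in movie_urls]
gevent.joinall(tasks)
genres = dict(task.value for task in tasks)  # a greenlet's return value is kept in .value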
The last crawler uses both gevent and the monkey patch. I won't paste the full code here, because it is almost exactly the same as the second crawler's, except for one extra line: from gevent import monkey; monkey.patch_all(). Note that although this is a single line, it contains two statements joined by a semicolon. Most importantly, this line must come before all the other code; remember that!
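Based on that description, the opening of the third script would look something like this (a sketch, not the author's verbatim file):

from gevent import monkey; monkey.patch_all()  # the one extra line; it must come first

import time
from lxml import etree
import gevent
import requests

# ... everything below is identical to the second crawler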
The running result of this crawler is 26.9184 seconds.
I put the three crawlers in three files named normal_spider.py, gevent_spider_no.py, and gevent_spider.py, for the ordinary crawler without gevent, the crawler with gevent but no monkey patch, and the crawler with both gevent and the monkey patch, respectively. One thing to note: the monkey patch does not currently work under Jupyter Notebook, so these three programs should be run from the command line (e.g., python normal_spider.py), not in a notebook.
Finally, the results of the three crawlers are summarized below.
Figure 3. Comparison of the results of the three crawlers
As you can see, the crawler that uses gevent without the monkey patch takes almost exactly as long as the ordinary crawler, but with the monkey patch the run time drops to less than half, a speed increase of about 120%. A single line of code brings that large a speedup, which shows how much work the monkey patch does. The reason the first two crawlers perform the same is that without the patch, requests still goes through the standard library's blocking socket calls, which never yield control to other greenlets, so the greenlets effectively run one after another; and with only 18 pages to read, gevent alone has no chance to show an effect.
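The "about 120%" figure follows directly from the timings reported above:

normal = 59.6188    # ordinary crawler, seconds
patched = 26.9184   # gevent + monkey patch, seconds

print(normal / patched)                    # ~2.21: more than twice as fast
print((normal - patched) / patched * 100)  # ~121.5: the "about 120%" speed increase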
This concludes our look at how to make a gevent crawler 100% faster with one line of code. I hope it has cleared up your doubts; pairing theory with practice is the best way to learn, so go and try it yourself!