In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/01 Report--
This article introduces the knowledge of "how to improve the efficiency of Python crawler". In the operation of actual cases, many people will encounter such a dilemma, so let the editor lead you to learn how to deal with these situations. I hope you can read it carefully and be able to achieve something!
When crawling a large amount of network data, on the one hand, we need to ensure that the crawler is not blocked by the website server, on the other hand, we should also improve the collection efficiency of the crawler.
To prevent the crawler from being blocked, we generally use a large number of proxy IP to form a proxy pool and visit the collected website through the proxy. For how to improve the collection efficiency of the crawler, there are many methods, such as: using multi-process, multi-thread, distributed, cooperative process and so on.
Individuals in the actual use of crawlers, out of personal preferences and hardware conditions, generally use multi-process and multi-thread.
Next, let's use a simple example to compare the efficiency of a program under normal circumstances, using multiple processes and using multiple threads:
Traversing the URL URL is a common scenario among crawlers. We use a list to simulate the traversal of URL:
From multiprocessing import Poolfrom multiprocessing.dummy import Pool as TheaderPool
The first is to use the for loop:
Def test1 (): for n in range (10000): for i in range (100000): n + = i%time test1 ()
Return the result:
Wall time: 1min 15s
It took 15 seconds to iterate through each of the two for loops for 10000 times.
Next, let's look at a situation where one uses for loops and the other uses multithreading. Because of familiarity, the Pool method in the multiprocessing.dummy module is used for multithreading instead of the Threading module:
Def test4 (): for n in range (100000): def test5 (I): n + = I tpool = TheaderPool (processes=1) tpool.map_async (test5,range (100000)) tpool.close () tpool.join ()% time test4 ()
The returned result is:
Wall time: 118 ms
It only took 118 milliseconds.
Let's take a look at the use of multiple processes:
Def test2 (): for n in range (100000): def test3 (I): n + = I pool = Pool (processes=1) pool.map_async (test3,range (100000)) pool.close () pool.join ()% time test2 ()
The time spent is: 199 milliseconds
Wall time: 199 ms
In this simple comparison example, it can be found that whether multithreading or multiprocess is used, multithreading is directly increased by more than 100 times. although multithreading is a little slower than multithreading, it is also nearly 100 times higher than that of multithread. can greatly improve the efficiency of loop traversal, of course, in the actual data acquisition process, but also need to consider the network speed and response, as well as the hardware of their own machines To set up multiprocess or multithreading.
This is the end of "how to improve the efficiency of Python crawler". Thank you for your reading. If you want to know more about the industry, you can follow the website, the editor will output more high-quality practical articles for you!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.